Expand and Contract Methodology
The Expand and Contract Methodology is the operational standard for executing zero-downtime database migrations. By decoupling additive and destructive schema changes into discrete, reversible phases, teams avoid long-held table locks and application downtime. This workflow integrates directly with established Zero-Downtime Schema Evolution Patterns to maintain backward compatibility across deployment cycles.
Phase 1: Expand (Additive Schema Changes)
Introduce new columns, tables, or indexes while preserving legacy structures. The database must simultaneously support old and new application versions.
Execution Protocol
Wrap DDL in explicit transactions where supported. Enforce DEFAULT NULL or explicit defaults to prevent legacy write failures. Defer constraint validation on large tables to avoid full-table scans.
```sql
BEGIN;
ALTER TABLE orders ADD COLUMN fulfillment_status VARCHAR(50) DEFAULT NULL;
-- PostgreSQL: defer validation to prevent exclusive locks on large tables
ALTER TABLE orders ADD CONSTRAINT chk_fulfillment
    CHECK (fulfillment_status IN ('pending', 'shipped', 'delivered')) NOT VALID;
COMMIT;
```
Dry-Run & Validation
- Schema diff: `pg_dump --schema-only --no-owner -f expand_phase.sql`
- Migration tool dry-run: `flyway migrate -dryRunOutput=expand_phase.sql`
- Query plan baseline: `EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE fulfillment_status = 'pending';`
Environment Context
| Environment | Execution Policy | Validation Scope |
|---|---|---|
| Dev | Immediate DDL application | ORM mapping sync, unit test coverage |
| Staging | Load-test with production data volume | Connection pool saturation, index bloat simulation |
| Prod | Deploy during low-write windows | Replication lag < 500ms, zero write errors |
Compatibility Window: V1 (legacy) and V2 (new) application binaries must coexist. Active from Phase 1 deployment until Phase 2 traffic shift completes.
Forward Path: Deploy V2 application code that reads/writes the new column.
Rollback Path: Revert to V1 application binary. If DDL must be reverted, execute ALTER TABLE orders DROP COLUMN fulfillment_status; only if zero writes occurred post-deploy. For high-throughput systems, consult Implementing Expand-Contract for High-Traffic Tables to configure connection pooling and query routing before applying DDL.
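The "only if zero writes occurred post-deploy" precondition on the rollback path can be enforced with a guard query rather than left to operator judgment. A minimal sketch (the guard query and helper name are illustrative, not part of any standard tooling):

```python
from typing import Optional

# Hypothetical guard for the Phase 1 rollback path: count rows the V2
# application has already written into the new column.
GUARD_QUERY = """
SELECT COUNT(*) FROM orders WHERE fulfillment_status IS NOT NULL;
"""

def rollback_ddl(write_count: int) -> Optional[str]:
    """Return the compensating DDL only when zero writes occurred;
    otherwise signal that an application-level revert is required."""
    if write_count == 0:
        return "ALTER TABLE orders DROP COLUMN fulfillment_status;"
    # Data is present: revert only the V1 binary and keep the column,
    # otherwise committed writes would be destroyed.
    return None
```

Feeding the result of `GUARD_QUERY` into `rollback_ddl` turns the rollback decision into a deterministic gate instead of a manual check.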
Phase 2: Data Synchronization & Traffic Routing
Backfill historical records, establish parallel write paths, and incrementally shift read traffic.
Execution Protocol
Implement dual-write logic at the application layer. Backfill using chunked, idempotent queries to prevent long-running transactions and lock escalation.
```sql
-- Chunked backfill (PostgreSQL; run outside an explicit transaction so
-- the per-batch COMMIT is permitted)
DO $$
DECLARE
    batch_size INT := 1000;
    last_id    BIGINT := 0;
    max_id     BIGINT;
BEGIN
    SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;
    WHILE last_id < max_id LOOP
        -- Note: a NOT VALID CHECK constraint still applies to new writes,
        -- so the backfill value must be permitted by chk_fulfillment (or
        -- the constraint must be extended to include it).
        UPDATE orders
        SET fulfillment_status = 'legacy_mapped'
        WHERE id > last_id AND id <= last_id + batch_size
          AND fulfillment_status IS NULL;
        last_id := last_id + batch_size;
        COMMIT; -- Commit per batch to release locks (PostgreSQL 11+)
    END LOOP;
END $$;
```
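The application-layer dual-write logic mentioned above can be sketched as follows. This is an illustrative Python outline, not a prescribed implementation: the `db` handle, `%s` parameter style, and flag name are assumptions, and the write to the new column is deliberately best-effort so it can never break legacy traffic during the compatibility window.

```python
from typing import Any, Dict

def write_order_status(db: Any, order_id: int, status: str,
                       flags: Dict[str, bool]) -> None:
    """Dual-write: always update the legacy column; mirror the value into
    the new column only while the 'dual-write-orders' flag is enabled."""
    db.execute(
        "UPDATE orders SET legacy_status = %s WHERE id = %s",
        (status, order_id),
    )
    if flags.get("dual-write-orders", False):
        try:
            # The new-column write is additive; a failure here must not
            # fail the legacy write path.
            db.execute(
                "UPDATE orders SET fulfillment_status = %s WHERE id = %s",
                (status, order_id),
            )
        except Exception:
            pass  # log and alert in a real system; never break V1 traffic
```

Because the second write is gated by the same flag used for traffic routing, toggling the flag off instantly returns the system to single-write behavior.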
Dry-Run & Validation
- Checksum validation: `SELECT COUNT(*), MD5(STRING_AGG(fulfillment_status::text, '' ORDER BY id)) FROM orders WHERE id % 20 = 0;`
- Feature flag toggle test: `curl -X POST /api/config/flags/dual-write-orders -d '{"enabled": true}'`
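The checksum can also be recomputed application-side to cross-check the SQL result: incrementally hashing the sampled values in `id` order produces the same digest as `MD5(STRING_AGG(... ORDER BY id))` over the same sample. A minimal sketch:

```python
import hashlib
from typing import Iterable

def column_digest(values: Iterable[str]) -> str:
    """Order-dependent digest over sampled column values, mirroring the
    MD5(STRING_AGG(fulfillment_status, '' ORDER BY id)) query above.
    Values must be supplied in the same (id) order the query uses."""
    h = hashlib.md5()
    for v in values:
        h.update(v.encode("utf-8"))
    return h.hexdigest()
```

Comparing `column_digest` of the rows read through the new path against the database-computed digest catches silent divergence during dual-write.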
Environment Context
| Environment | Execution Policy | Validation Scope |
|---|---|---|
| Dev | Mock traffic routing | Feature flag toggle latency, dual-write mock |
| Staging | Chaos testing, network partition simulation | Replication lag, conflict resolution logic |
| Prod | Gradual rollout (5% → 25% → 50% → 100%) | Error rate < 0.1%, p99 latency stable |
Compatibility Window: Both legacy and new read/write endpoints active. Traffic split governed by feature flags. Duration: until 100% read routing is verified stable.
Forward Path: Shift 100% of read traffic to the new schema. Disable legacy read paths.
Rollback Path: Immediately toggle feature flags back to legacy endpoints, halt backfill jobs, and revert the application to V1. Implement Dual-Write Synchronization to guarantee consistency during the transition window. Use Feature Flag Rollouts to gradually shift read traffic to the new schema while monitoring replication lag and error rates.
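The gradual rollout (5% → 25% → 50% → 100%) needs deterministic bucketing so a given user does not flap between schemas mid-session. A sketch of percentage-based routing, assuming a string user identifier (the hash choice and function name are illustrative):

```python
import zlib

ROLLOUT_STAGES = [5, 25, 50, 100]  # percent of reads served by the new schema

def routes_to_new_schema(user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage routing: hash the user into one of 100
    buckets; the same user always lands in the same bucket, so raising
    rollout_percent only ever adds users to the new path."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < rollout_percent
```

Because buckets are stable, moving from 25% to 50% keeps every user already on the new schema there, which keeps error-rate comparisons between stages meaningful.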
Phase 3: Contract (Destructive Cleanup)
Decommission legacy columns, tables, and synchronization logic only after verifying zero anomalies.
Execution Protocol
Drop columns/tables during maintenance windows. Remove dual-write application logic. Validate deferred constraints before finalizing.
```sql
BEGIN;
-- Drop legacy column after confirming zero references
ALTER TABLE orders DROP COLUMN legacy_status;
-- Finalize constraint validation
ALTER TABLE orders VALIDATE CONSTRAINT chk_fulfillment;
COMMIT;
```
Dry-Run & Validation
- Lock contention simulation: `pt-online-schema-change --dry-run --alter "DROP COLUMN legacy_status" D=prod_db,t=orders`
- Query plan regression check: `EXPLAIN (ANALYZE) SELECT * FROM orders WHERE legacy_status IS NULL;` (any query still referencing the legacy column should already be removed from the codebase)
Environment Context
| Environment | Execution Policy | Validation Scope |
|---|---|---|
| Dev | Immediate cleanup | ORM schema sync, migration manifest update |
| Staging | Full schema diff validation | Index fragmentation, storage reclamation |
| Prod | Scheduled maintenance window | Lock queue depth < 10, zero deadlocks |
Compatibility Window: V1 application must be fully decommissioned. Zero legacy code references in production. Duration: 48-hour observation post-contract.
Forward Path: Finalize schema removal. Archive migration manifest.
Rollback Path: Halt DROP execution. Restore from point-in-time recovery (PITR) snapshot. Re-add columns via Phase 1 scripts. Re-enable legacy app version. For multi-terabyte tables, apply Handling Large Table Alterations Without Locks using batched cleanup routines.
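For the batched cleanup routines referenced above, the driver loop can live at the application layer so that no single statement holds locks for long. A sketch under stated assumptions: `run_batch` is a hypothetical callable standing in for one bounded DELETE/UPDATE round-trip that returns the number of rows it affected.

```python
from typing import Callable

def batched_cleanup(run_batch: Callable[[int], int], batch_size: int = 1000,
                    max_batches: int = 1_000_000) -> int:
    """Repeatedly apply one bounded batch until a batch affects zero rows.
    Small batches keep lock hold times short; max_batches bounds runaway
    loops. Returns the total number of rows processed."""
    total = 0
    for _ in range(max_batches):
        affected = run_batch(batch_size)
        total += affected
        if affected == 0:
            break
    return total
```

Pausing between batches (or adapting `batch_size` to observed replication lag) fits naturally into this loop without changing its shape.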
Safety Checks & Environment Parity
Maintain strict parity between staging and production. Every migration script must be reversible.
Pre-Flight Validation Matrix
| Check | Command/Tool | Pass Criteria |
|---|---|---|
| Foreign Key Integrity | `SELECT * FROM information_schema.table_constraints WHERE constraint_type = 'FOREIGN KEY';` | Zero orphaned references |
| Index Bloat | `pg_stat_user_indexes` / `SHOW INDEX` | Bloat ratio < 1.5x |
| Query Plan Regression | `EXPLAIN (ANALYZE)` vs baseline | No sequential scans on indexed columns |
| Schema Diff | `schema-diff staging prod` | 100% parity before Phase 3 |
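The pre-flight matrix can be encoded as a promotion gate so threshold failures block deployment mechanically rather than by convention. A minimal sketch (metric names are illustrative; values would come from the queries in the table above):

```python
from typing import Callable, Dict, List

def preflight_failures(metrics: Dict[str, float]) -> List[str]:
    """Return the names of failed checks; an empty list means the
    pipeline may promote. Thresholds mirror the pass criteria above."""
    criteria: Dict[str, Callable[[float], bool]] = {
        "orphaned_fk_references": lambda v: v == 0,
        "index_bloat_ratio": lambda v: v < 1.5,
        "schema_diff_parity_pct": lambda v: v == 100,
    }
    return [name for name, ok in criteria.items()
            if name in metrics and not ok(metrics[name])]
```

Wiring `preflight_failures` into the deployment pipeline makes "pre-flight checks block promotion if thresholds fail" an enforced property instead of an operator checklist.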
Environment Parity Enforcement
- Dev: Ephemeral databases, rapid iteration. No strict promotion gates.
- Staging: Production topology replica. Automated load testing. Mandatory approval gates between phases.
- Prod: Immutable deployment pipeline. Pre-flight checks block promotion if thresholds fail. Replicate production topology in staging to validate migration scripts under identical load profiles.
Rollback Paths & Incident Response
Define explicit rollback triggers for each phase. Never skip validation gates.
Incident Decision Matrix
| Phase | Trigger | Immediate Action | Recovery Path |
|---|---|---|---|
| Phase 1 | Legacy write failures (>0.5% error rate) | Halt deployment, revert app | Drop new column, restore V1 binary |
| Phase 2 | Replication lag > 2s or checksum mismatch | Disable dual-write, route to legacy | Re-run backfill with reduced batch size |
| Phase 3 | Lock contention > 30s or deadlock spike | ABORT transaction, pause cleanup | PITR restore, re-add legacy column, extend observation window |
Maintain automated PITR snapshots and versioned migration manifests to guarantee reversible deployments. Enforce mandatory 48-hour monitoring windows post-contract to catch delayed batch jobs or cached queries.
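The incident decision matrix lends itself to direct encoding, so on-call responders get a deterministic first action rather than interpreting a table under pressure. A sketch mirroring the thresholds above (function shape and parameter names are illustrative):

```python
from typing import Optional

def rollback_action(phase: int, error_rate: float = 0.0,
                    replication_lag_s: float = 0.0,
                    checksum_mismatch: bool = False,
                    lock_wait_s: float = 0.0) -> Optional[str]:
    """Map observed metrics to the immediate action from the decision
    matrix; None means all metrics are within thresholds."""
    if phase == 1 and error_rate > 0.005:          # >0.5% legacy write failures
        return "halt deployment, revert app"
    if phase == 2 and (replication_lag_s > 2 or checksum_mismatch):
        return "disable dual-write, route to legacy"
    if phase == 3 and lock_wait_s > 30:
        return "abort transaction, pause cleanup"
    return None  # within thresholds: continue monitoring
```

Alert rules can call this directly, so the "immediate action" column fires automatically while the slower recovery path stays with the operator.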