Implementing Expand-Contract for High-Traffic Tables
Execution Context
Modifying schema definitions on tables processing thousands of TPS requires decoupling structural changes from application deployments. This guide operationalizes the Zero-Downtime Schema Evolution Patterns framework for relational databases (PostgreSQL 12+, MySQL 8.0+, SQL Server 2019+). All commands assume execution via a privileged DBA account or CI/CD migration runner with connection pooling (PgBouncer/ProxySQL) active and feature flag infrastructure deployed.
Symptom
- PostgreSQL:
ERROR: canceling statement due to lock timeout (SQLSTATE 55P03) - MySQL:
Deadlock found when trying to get lock; try restarting transaction - ORM Layer:
ColumnNotFoundExceptionor mapping mismatch during dual-write transition - Application: P99 latency spikes >500ms during
ALTER TABLEexecution - Replication: Replica lag exceeding 10s due to long-running DDL or backfill transactions
Root Cause
Exclusive AccessExclusive or metadata locks acquired during traditional DDL block concurrent DML, causing connection queue buildup and timeout cascades. Backfill scripts lacking explicit transaction boundaries or chunking trigger row-level lock escalation and replication lag. ORM mapping failures occur when schema changes deploy before dual-write routing is active, or when feature flags desynchronize across service instances.
Immediate Mitigation
- Terminate blocking sessions:
- PG:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query LIKE '%ALTER TABLE%'; - MySQL:
KILL QUERY <process_id>;
- Activate circuit breakers: Route write traffic to a read-only replica or degrade gracefully to cached responses.
- Revert routing: Flip feature flags to legacy schema routing immediately.
- Halt backfill:
SIGTERMrunning migration workers. Release held locks viaCOMMITorROLLBACK. - Flush connection pools:
RECONNECTPgBouncer/ProxySQL to clear stale metadata caches and prevent queue buildup during transient unavailability.
Permanent Resolution
Execute the Expand and Contract Methodology using strict phased gates.
Phase 1: Expand (Additive Schema)
Execution: Add new columns as NULLABLE. Deploy dual-write application code. Configure ORM to write to both legacy and new fields while reading from legacy.
Command:
ALTER TABLE high_traffic_table ADD COLUMN new_field VARCHAR(255) NULL;
Rollback:
ALTER TABLE high_traffic_table DROP COLUMN IF EXISTS new_field;
-- Revert application deployment to previous release
-- Reset feature flags to legacy-only routing
Phase 2: Backfill & Synchronization
Execution: Run chunked, idempotent backfill. Throttle to <20% IOPS/CPU. Implement exponential backoff on transient lock conflicts. Diagnostic Query (Monitor Progress & Lock Impact):
-- PostgreSQL: Check active backfill queries and lock waits
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active' AND query LIKE '%UPDATE%';
-- MySQL: Check replica lag during backfill
SHOW SLAVE STATUS\G -- Monitor Seconds_Behind_Master
Rollback:
-- Pause backfill worker
-- Clear partial writes if idempotency fails
UPDATE high_traffic_table SET new_field = NULL WHERE backfill_flag = 1;
Phase 3: Contract (Remove Legacy Structure)
Execution: Verify data parity. Switch reads to new_field. Drop legacy columns and indexes incrementally. Remove dual-write logic from ORM.
Diagnostic Query (Parity Check):
SELECT COUNT(*) AS mismatch_count
FROM high_traffic_table
WHERE (legacy_field IS NULL AND new_field IS NOT NULL)
OR (legacy_field IS NOT NULL AND new_field IS NULL);
Command:
ALTER TABLE high_traffic_table DROP COLUMN legacy_field;
ALTER TABLE high_traffic_table ALTER COLUMN new_field SET NOT NULL;
Rollback:
-- Add legacy column back as NULLABLE
ALTER TABLE high_traffic_table ADD COLUMN legacy_field VARCHAR(255) NULL;
-- Revert read routing to legacy_field
-- Re-deploy application with dual-write enabled
Validation
Execute the following checks before closing the migration ticket:
NULLvalues innew_fieldpost-backfill (ifNOT NULLconstraint applied).
Final Integrity Check:
SELECT table_name, column_name, is_nullable, data_type
FROM information_schema.columns
WHERE table_name = 'high_traffic_table' AND column_name IN ('legacy_field', 'new_field');
Ensure legacy_field is absent and new_field matches target constraints. Archive migration scripts and lock schema version in registry.