Implementing Expand-Contract for High-Traffic Tables

Execution Context

Modifying schema definitions on tables processing thousands of TPS requires decoupling structural changes from application deployments. This guide operationalizes the Zero-Downtime Schema Evolution Patterns framework for relational databases (PostgreSQL 12+, MySQL 8.0+, SQL Server 2019+). All commands assume execution via a privileged DBA account or CI/CD migration runner with connection pooling (PgBouncer/ProxySQL) active and feature flag infrastructure deployed.

Symptom

  • PostgreSQL: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03)
  • MySQL: Deadlock found when trying to get lock; try restarting transaction
  • ORM Layer: ColumnNotFoundException or mapping mismatch during dual-write transition
  • Application: P99 latency spikes >500ms during ALTER TABLE execution
  • Replication: Replica lag exceeding 10s due to long-running DDL or backfill transactions

Root Cause

Exclusive AccessExclusive or metadata locks acquired during traditional DDL block concurrent DML, causing connection queue buildup and timeout cascades. Backfill scripts lacking explicit transaction boundaries or chunking trigger row-level lock escalation and replication lag. ORM mapping failures occur when schema changes deploy before dual-write routing is active, or when feature flags desynchronize across service instances.

Immediate Mitigation

  1. Terminate blocking sessions:
  • PG: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query LIKE '%ALTER TABLE%';
  • MySQL: KILL QUERY <process_id>;
  1. Activate circuit breakers: Route write traffic to a read-only replica or degrade gracefully to cached responses.
  2. Revert routing: Flip feature flags to legacy schema routing immediately.
  3. Halt backfill: SIGTERM running migration workers. Release held locks via COMMIT or ROLLBACK.
  4. Flush connection pools: RECONNECT PgBouncer/ProxySQL to clear stale metadata caches and prevent queue buildup during transient unavailability.

Permanent Resolution

Execute the Expand and Contract Methodology using strict phased gates.

Phase 1: Expand (Additive Schema)

Execution: Add new columns as NULLABLE. Deploy dual-write application code. Configure ORM to write to both legacy and new fields while reading from legacy. Command:

ALTER TABLE high_traffic_table ADD COLUMN new_field VARCHAR(255) NULL;

Rollback:

ALTER TABLE high_traffic_table DROP COLUMN IF EXISTS new_field;
-- Revert application deployment to previous release
-- Reset feature flags to legacy-only routing

Phase 2: Backfill & Synchronization

Execution: Run chunked, idempotent backfill. Throttle to <20% IOPS/CPU. Implement exponential backoff on transient lock conflicts. Diagnostic Query (Monitor Progress & Lock Impact):

-- PostgreSQL: Check active backfill queries and lock waits
SELECT pid, wait_event_type, wait_event, query 
FROM pg_stat_activity 
WHERE state = 'active' AND query LIKE '%UPDATE%';
-- MySQL: Check replica lag during backfill
SHOW SLAVE STATUS\G -- Monitor Seconds_Behind_Master

Rollback:

-- Pause backfill worker
-- Clear partial writes if idempotency fails
UPDATE high_traffic_table SET new_field = NULL WHERE backfill_flag = 1;

Phase 3: Contract (Remove Legacy Structure)

Execution: Verify data parity. Switch reads to new_field. Drop legacy columns and indexes incrementally. Remove dual-write logic from ORM. Diagnostic Query (Parity Check):

SELECT COUNT(*) AS mismatch_count
FROM high_traffic_table
WHERE (legacy_field IS NULL AND new_field IS NOT NULL)
 OR (legacy_field IS NOT NULL AND new_field IS NULL);

Command:

ALTER TABLE high_traffic_table DROP COLUMN legacy_field;
ALTER TABLE high_traffic_table ALTER COLUMN new_field SET NOT NULL;

Rollback:

-- Add legacy column back as NULLABLE
ALTER TABLE high_traffic_table ADD COLUMN legacy_field VARCHAR(255) NULL;
-- Revert read routing to legacy_field
-- Re-deploy application with dual-write enabled

Validation

Execute the following checks before closing the migration ticket:

  • NULL values in new_field post-backfill (if NOT NULL constraint applied).

Final Integrity Check:

SELECT table_name, column_name, is_nullable, data_type
FROM information_schema.columns
WHERE table_name = 'high_traffic_table' AND column_name IN ('legacy_field', 'new_field');

Ensure legacy_field is absent and new_field matches target constraints. Archive migration scripts and lock schema version in registry.