Using Feature Flags to Toggle Schema Changes Safely
A schema-backed feature went to 100% of traffic, error rates climbed, and your only rollback option was to reverse the DDL on a live table โ the slowest, riskiest recovery path there is. The whole reason to put a feature flag in front of a schema change is so that the rollback is a config write, not an ALTER TABLE. When a flag controls which schema path the application reads, you decouple three things that usually fail together: the database change, the application deploy, and the moment users actually see the new behavior. This page is the runbook for wiring that flag correctly so a bad rollout flips off in seconds without touching the database, and for diagnosing the failures that happen when DDL and code are not decoupled at all.
Flags are the control surface for the Expand and Contract Methodology: they gate the read switch between the legacy and new columns. The cleanup question โ how to retire the flag and the schema together without leaving dead code โ is covered in coupling schema changes to feature flags and removing both. Examples target PostgreSQL and MySQL 8.0.
Symptom / Error Signatures
The failures that drive an engineer to flag-gating are the ones caused by not having a flag โ DDL and code shipping coupled:
PostgreSQL ERROR: canceling statement due to lock timeoutduring a coupledALTER TABLE.MySQL ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction.- ORM migration timeouts:
ActiveRecord::StatementInvalid, DjangoOperationalError. - HTTP 500/503 spikes correlating precisely with DDL execution timestamps.
- Referential-integrity violations on a newly added column when legacy instances write rows the new constraint rejects.
When a rollout is already misbehaving, find the blocking session before you decide whether to flip the flag or kill a query:
-- PostgreSQL ยท read-only ยท identify the blocker behind a waiting statement
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
blocker.pid AS blocker_pid, blocker.query AS blocker_query,
now() - blocker.query_start AS lock_duration
FROM pg_stat_activity blocker
JOIN pg_stat_activity blocked
ON blocked.wait_event_type = 'Lock'
AND blocker.pid = ANY(pg_blocking_pids(blocked.pid));
-- MySQL ยท read-only ยท lock waits and the transactions causing them
SELECT r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query,
b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
FROM performance_schema.data_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_engine_transaction_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_engine_transaction_id;
Note: information_schema.innodb_lock_waits was removed in MySQL 8.0; use performance_schema.data_lock_waits.
Root Cause Analysis
Two distinct problems collapse into one symptom. The first is coupled DDL โ running an ALTER TABLE in the same deploy as the code that needs it. The DDL takes an exclusive metadata lock, concurrent connection pools pile up behind it, and threads exhaust into connection starvation while the new code on some instances hits a schema the rest of the fleet has not caught up to.
The second is flag misuse โ treating the flag as a deploy gate rather than a runtime kill switch. If the flag is read once at startup, or is cached without invalidation, flipping it off does not actually stop the new path, so you are back to reversing DDL under pressure. A correctly built flag is evaluated per request from centralized configuration, so disabling it takes effect immediately and globally without any database operation. This is the same instant-reversal contract that the Feature Flag Rollouts overview builds, and it depends on the schema already being additive so that turning the flag off leaves the database in a valid state for the old code.
Immediate Mitigation
When a flag-gated rollout is failing, the flag is your fastest lever โ but only after you confirm the schema is additive enough that disabling it is safe.
-
Flip the flag off. Set the rollout to 0% from centralized config; if the flag is evaluated per request, the new path stops fleet-wide within seconds and no DDL is touched.
-
Freeze the pipeline. Abort pending migration batches and pause deployment so no further coupled DDL ships while you triage.
-
Isolate write amplification. Route affected endpoints to read-only replicas or trip circuit breakers so failing writes stop compounding.
-
Clear stale locks if a coupled DDL is the cause. Flush pooled connections and, if necessary, terminate the blocker โ never an OS-level
SIGKILL.-- PostgreSQL ยท requires pg_signal_backend or superuser SELECT pg_terminate_backend(<blocker_pid>); -- MySQL ยท requires CONNECTION_ADMIN/SUPER KILL <blocking_thread_id>; -
Reconcile any orphaned writes the partial rollout produced before re-attempting, so the legacy path sees a consistent table.
Permanent Fix / Long-Term Pattern
Decouple the database change from the deploy from the user-visible switch, and make the flag a true runtime kill switch. This is the application of the Zero-Downtime Schema Evolution Patterns discipline to the read path.
-
Expand additively first. Add columns nullable or with a safe constant default using
ADD COLUMN IF NOT EXISTSplusCREATE INDEX CONCURRENTLY(PostgreSQL) orALGORITHM=INPLACE, LOCK=NONE(MySQL), so the schema never blocks and the old code stays valid.-- PostgreSQL ยท metadata-only add; index build must run OUTSIDE a transaction ALTER TABLE users ADD COLUMN IF NOT EXISTS new_column VARCHAR(255) NULL; CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_new_column ON users (new_column); -
Deploy dual-write code, then backfill. Ship code that writes both fields and reads legacy, then run rate-limited backfill workers that respect replication lag so the WAL/binlog never floods.
-
Gate reads behind the flag. Read the flag per request from centralized config and ramp 1% โ 10% โ 50% โ 100%, watching error and latency metrics at each step before advancing.
# Application read path ยท flag evaluated per request, not cached at startup # Returns instantly to legacy when the rollout percentage is lowered if flags.enabled("read_new_schema", user_id): value = row.new_column # new path else: value = row.legacy_col # safe default while ramping -
Wire automatic rollback triggers. Alert on error rate >1% or p95 latency SLA breach and auto-set the flag to 0% โ no DDL reversal required, because the schema is additive.
-
Contract only after a full clean cycle. Drop the legacy column once the flag has held at 100% through a complete business cycle with zero fallback invocations; retiring the flag and the legacy schema in lockstep is detailed in coupling schema changes to feature flags and removing both.
Verification Checklist
legacy_colandnew_columnis confirmed before ramping past 1%.
Frequently Asked Questions
Can a feature flag replace expand-contract? No โ it complements it. The flag controls when users see the change; expand-contract makes the underlying schema safe to toggle by keeping every state additive. A flag in front of a destructive, coupled DDL cannot save you, because flipping it off does not reverse a dropped column.
Why must the flag be evaluated per request? So that disabling it is an instant, fleet-wide kill switch. A flag read once at startup or cached without invalidation does not stop the new code path when you flip it, which forces you back to reversing DDL under incident pressure โ exactly the slow rollback the flag was meant to avoid.
When is it safe to drop the legacy column? Only after the flag has held at 100% through a full business cycle with audit logs showing zero fallback to the legacy path. Until then the legacy column is your rollback target; dropping it early removes the ability to flip back without a restore.