Feature Flag Rollouts

The slowest part of any schema rollback is the deploy. When a new query path starts timing out under production load, reverting the application image takes minutes — minutes during which the incident widens. Feature flag rollouts collapse that to a single toggle. By gating every schema-backed read and write behind a runtime flag, you separate two events that engineers usually conflate: the schema exists the moment the migration applies, but the application only uses it when the flag turns on. That separation buys you percentage rollout — 1% of traffic, then 25%, then 100% — and an instant kill-switch that needs no DDL and no redeploy.

This page sits alongside the rest of the Zero-Downtime Schema Evolution Patterns section and pairs naturally with the Expand and Contract Methodology: expand and contract makes the schema safe to carry both shapes, while flags decide how fast traffic moves between them. It serves backend engineers writing the gated code, DBAs verifying the database tolerates both paths, and platform teams running the rollout dashboard. The examples target PostgreSQL 11+ and MySQL 8.0.

The flag is evaluated per request, so the rollback path is a config change that takes effect on the next request, not a redeploy.

Concept & Mechanism

A feature flag is a runtime predicate, not a deploy artifact. The application asks an evaluation service “is schema_v2 on for this request?” on every code path that touches the new schema, and the answer can change between one request and the next without restarting a process. That is the whole source of the pattern’s power: the migration and the code that uses it are already shipped and dormant, so turning them on or off is a data change in the flag store, which propagates in seconds.

The database is deliberately uninvolved in the toggle. The migration runs ahead of time and additively (a new nullable column, a new index built with CREATE INDEX CONCURRENTLY on PostgreSQL or ALGORITHM=INPLACE, LOCK=NONE on MySQL), so the schema tolerates both the flag-on and flag-off paths simultaneously. Nothing about flipping the flag issues DDL. This is what makes rollback safe at any percentage: the kill-switch never has to reverse a schema change, only redirect application traffic. It is the same forward-only reversal contract enforced by Rollback Automation.

Percentage rollout works by hashing a stable key — user ID, tenant ID — into a bucket and comparing it to the rollout fraction. Because the key is stable, a given user stays on one path across requests, which keeps read-after-write consistent for that user even mid-rollout. The two correctness hazards are evaluation context and flag scope: a background job, a cron task, or a queue consumer evaluates the flag with no user context and can silently fall on the wrong path. Pinning those contexts explicitly is the subject of Using Feature Flags to Toggle Schema Changes Safely.
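
The bucketing described above can be sketched in a few lines. This is a minimal illustration, not any particular flag SDK's algorithm: the function names, the choice of SHA-256, and the 10,000-bucket resolution are all assumptions made for the example.

```python
import hashlib

BUCKETS = 10_000  # hypothetical resolution: each bucket is 0.01% of keys


def bucket_for(flag_name: str, stable_key: str) -> int:
    """Map a stable key (user ID, tenant ID) to a bucket in [0, BUCKETS)."""
    # Salting with the flag name keeps different flags from sharing cohorts.
    digest = hashlib.sha256(f"{flag_name}:{stable_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(flag_name: str, stable_key: str, percentage: float) -> bool:
    """True when the key's bucket falls below the rollout fraction."""
    return bucket_for(flag_name, stable_key) < percentage * BUCKETS / 100
```

Because the bucket depends only on the flag name and the key, the same user resolves to the same path on every request, and raising the percentage only adds users to the new path without moving anyone back.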

Prerequisites & Decision Criteria

Flags shine when you want graduated exposure and instant reversal. They add little when a change is purely additive and harmless, or when the read shift must be atomic. Confirm the following before gating a schema change.

The schema migration is additive and already applied, so flag-off and flag-on both run against the live database.
The flag store evaluates in-process or with a low-latency cache — a DB round-trip per evaluation will dominate p99.
You bucket on a stable key so a user does not flip paths between requests.
Every background job, cron, and queue consumer has an explicit evaluation context (not an empty default).
You have a default-off fallback if the flag service is unreachable, so an outage there fails safe to the legacy path.

Situation	Gate behind a flag?	Why
Routing reads to a new column	Yes	Percentage rollout plus instant revert
Enabling dual-write to a new column	Yes	Toggle write amplification off without deploy
A new nullable column nothing reads yet	No	Nothing to gate; the expand step alone suffices
Dropping the old column (contract)	No	DDL is not flaggable; gate the reads that precede it
A change that must cut over atomically	No	Percentage rollout implies a mixed-state window

Step-by-Step Procedure

The example gates reads and writes of orders.order_status (added in an earlier expand migration) behind a schema_v2 flag.

Step 1 — Apply the schema additively, flag default-off. Ship the migration well ahead of the rollout so the column and index exist while every instance still ignores them.

-- PostgreSQL · migration role · CREATE INDEX CONCURRENTLY must run OUTSIDE a transaction
-- Safe at any time: the index build is online and no code reads the column yet.
ALTER TABLE orders ADD COLUMN IF NOT EXISTS order_status VARCHAR(50);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_order_status ON orders (order_status);

Step 2 — Gate the write path. Behind the flag, dual-write both columns; with the flag off, write only the legacy column. This decouples write amplification from the read rollout and ties into Dual-Write Synchronization.

# Application code (Python) · evaluated per request · flag store cached in-process
# Context must carry a stable key; a job with no user falls to default-off.
def write_order(conn, order, flags, ctx):
    if flags.enabled("schema_v2", ctx):
        # PostgreSQL — dual-write keeps both columns consistent during rollout
        conn.execute("UPDATE orders SET status=%s, order_status=%s WHERE id=%s",
                     (order.status, order.status, order.id))
    else:
        conn.execute("UPDATE orders SET status=%s WHERE id=%s",
                     (order.status, order.id))

Verify before proceeding: with the flag forced on for a test tenant, confirm both columns receive the write.

Step 3 — Gate the read path and start the rollout. Behind the flag, read order_status; otherwise read status. Begin at a small fraction.

# Flag config (rollout service) · applied via API, no deploy · default-off on eval failure
schema_v2:
  default: false
  rollout:
    by: user_id          # stable bucket key — a user stays on one path
    percentage: 1        # start at 1%, raise on the dashboard
  contexts:
    batch_processor: false   # cron/queue consumers stay on legacy explicitly
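
The read-path gate that Step 3 describes could look like the sketch below. It assumes the same `conn`, `flags`, and `ctx` objects as the Step 2 write path, and that `conn.execute` returns a cursor with `fetchone()` (as a psycopg 3 connection does); the function name `read_order_status` is hypothetical.

```python
def read_order_status(conn, order_id, flags, ctx):
    """Read the order status from the column selected by the schema_v2 flag."""
    # The column name comes from a fixed pair, never from user input,
    # so interpolating it into the SQL text is not an injection risk.
    column = "order_status" if flags.enabled("schema_v2", ctx) else "status"
    row = conn.execute(
        f"SELECT {column} FROM orders WHERE id = %s", (order_id,)
    ).fetchone()
    return row[0] if row else None
```

Rows last written on the flag-off path do not have order_status populated by the Step 2 code. How those rows are backfilled is outside this page, so the sketch does not attempt it.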

Step 4 — Ramp by percentage against SLOs. Raise the fraction (1% → 5% → 25% → 50% → 100%) only while p95/p99 latency and error rate stay inside budget. Each bump is a config change, not a deploy.

Verify before proceeding: after each bump, hold for an observation interval and confirm the new path’s latency tracks the legacy baseline.
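
The ramp decision in Step 4 can be expressed as a small gate function. This is a sketch under stated assumptions: the stage list comes from Step 4, but the `PathStats` type, the function name, and the budget thresholds (`max_latency_ratio`, `max_error_delta`) are hypothetical placeholders for whatever your SLOs define.

```python
from dataclasses import dataclass

STAGES = [1, 5, 25, 50, 100]  # ramp from Step 4


@dataclass
class PathStats:
    p99_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def next_percentage(current: int, new_path: PathStats, legacy: PathStats,
                    max_latency_ratio: float = 1.2,
                    max_error_delta: float = 0.001) -> int:
    """Return the next stage, or the current one if the new path is outside budget."""
    within_latency = new_path.p99_ms <= legacy.p99_ms * max_latency_ratio
    within_errors = new_path.error_rate <= legacy.error_rate + max_error_delta
    if not (within_latency and within_errors):
        return current  # hold; lowering or zeroing the flag stays an operator decision
    later = [stage for stage in STAGES if stage > current]
    return later[0] if later else current
```

The function only ever holds or advances. Reverting is deliberately left to the operator action described under Rollback Path.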

Step 5 — Hold at 100%, then remove the flag with the old path. Once stable, retire the flag and the legacy code together — the teardown sequence detailed in Coupling Schema Changes to Feature Flags and Removing Both.

Verification & Observability

The rollout is only safe if you can see which path traffic actually takes and that the database tolerates both.

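-- PostgreSQL · monitoring role · read-only · are queries still hitting the legacy column?
-- pg_stat_activity shows only each session's current or most recent statement, so poll it repeatedly.
-- A non-zero count after a 100% rollout suggests a context bypassed the flag.
SELECT count(*) FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND query ILIKE '%orders%' AND query ~* '\mstatus\M' AND query NOT ILIKE '%order_status%';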

-- PostgreSQL · monitoring role · read-only · confirm the new index is actually used
-- idx_scan should climb as the rollout percentage rises.
SELECT indexrelname, idx_scan FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_orders_order_status';

-- MySQL 8.0 · user with PROCESS privilege · read-only · watch for lock contention under both paths
SELECT * FROM performance_schema.data_locks WHERE OBJECT_NAME = 'orders';

Emit a metric tagged with the resolved flag value on every gated request. The single most useful dashboard is the ratio of new-path to legacy-path requests next to the per-path error rate: if the new-path error rate diverges from the legacy baseline, flip the flag off before the ramp continues.
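
As a rough illustration of the per-path metric, the sketch below counts requests and errors by resolved flag value in memory. The `PathMetrics` class and its method names are hypothetical; a real service would emit these as tagged counters to its metrics backend rather than hold them in process.

```python
from collections import Counter


class PathMetrics:
    """Counts requests and errors per path, keyed by the resolved flag value."""

    def __init__(self):
        self.requests = Counter()
        self.errors = Counter()

    def record(self, flag_on: bool, ok: bool) -> None:
        path = "new" if flag_on else "legacy"
        self.requests[path] += 1
        if not ok:
            self.errors[path] += 1

    def error_rate(self, path: str) -> float:
        total = self.requests[path]
        return self.errors[path] / total if total else 0.0

    def new_path_share(self) -> float:
        total = sum(self.requests.values())
        return self.requests["new"] / total if total else 0.0
```

Recording the resolved value, rather than the configured percentage, is what reveals a context that bypassed the flag: the observed share and the configured fraction stop agreeing.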

Rollback Path

Rollback is the pattern’s reason for existing, and it has no DDL.

Instant revert (any percentage): set the flag to false. Once the change reaches each instance's in-process flag cache, typically within seconds, the next request there takes the legacy path. This is safe as long as the legacy column is still written, which it is: Step 2 writes the legacy column on both paths (dual-write when the flag is on, single-write when it is off).

# Operator action · flag API · takes effect on the next request, no deploy
# Safe because the migration is additive — turning the flag off issues zero DDL.
curl -X PATCH https://flags.internal/api/flags/schema_v2 -H 'Content-Type: application/json' -d '{"default": false, "percentage": 0}'

Partial revert: lower the percentage instead of zeroing it, to bleed traffic off the new path while you investigate.
What rollback does NOT do: it never drops the new column or index. The schema stays expanded; only application routing reverts. Reversing the schema is a separate, deliberate contract step gated behind a recovery snapshot, never an automatic consequence of a flag flip.

Revert becomes unsafe in one case: dual-write was disabled and the new path accepted writes the legacy column never saw. Keep dual-write on for the entire rollout and the legacy column is always current, so revert is lossless.

Common Errors & Fixes

Background job reads the wrong schema — a cron or queue consumer hits the new path (or the wrong legacy path) because it evaluated the flag with no user context. Root cause: an empty evaluation context resolving to the default bucket instead of a pinned value. Fix: assign every non-request context an explicit flag value, as in the batch_processor: false rule above, and assert it in tests.
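
One way to make the pinned-context rule enforceable is to resolve named contexts before any bucketing and to fail loudly when a context has no pin. The sketch below assumes a dict that mirrors the Step 3 config; the `resolve` function, its parameters, and the decision to raise on an unpinned context are illustrative choices, not a specific SDK's behavior.

```python
import hashlib
from typing import Optional

# Mirrors the Step 3 config; the dict layout itself is an assumption.
SCHEMA_V2 = {
    "default": False,
    "percentage": 1,
    "contexts": {"batch_processor": False},
}


def resolve(flag: dict, flag_name: str,
            context_name: Optional[str] = None,
            user_id: Optional[str] = None) -> bool:
    """Resolve a flag: pinned context first, then user bucket, then default."""
    if context_name is not None:
        if context_name not in flag["contexts"]:
            # Fail loudly so an unpinned job is caught in tests, not in production.
            raise LookupError(
                f"no pinned value for context {context_name!r} on {flag_name}")
        return flag["contexts"][context_name]
    if user_id is None:
        return flag["default"]
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < flag["percentage"]
```

A test suite can then call `resolve` once per registered job name and assert that none raises, which turns a forgotten pin into a failing build.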

Read-after-write inconsistency mid-rollout — a user writes via one path and reads via the other within the same session. Root cause: bucketing on an unstable key (request ID, random) instead of a stable one. Fix: bucket on user_id or tenant_id so a given actor stays on one path for the whole rollout.

Flag service latency dominates p99 — gated endpoints slow down as the flag store is queried per request. Root cause: a network round-trip per evaluation instead of an in-process cache. Fix: use an SDK that streams flag state and evaluates locally; never put a DB or HTTP call on the hot evaluation path.
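
The local-evaluation fix can be pictured as a snapshot held in process memory that a background stream or poller refreshes, so request handlers never make a network call. The `LocalFlagStore` class below is a hypothetical minimal sketch of that shape, not the interface of any real SDK.

```python
import threading


class LocalFlagStore:
    """Holds the latest flag snapshot in memory; evaluation never leaves the process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flags = {}

    def apply_snapshot(self, flags: dict) -> None:
        """Called by the background stream or poller, not by request handlers."""
        with self._lock:
            self._flags = dict(flags)

    def enabled(self, name: str) -> bool:
        """Dictionary lookup only; an unknown flag resolves to off."""
        with self._lock:
            return bool(self._flags.get(name, False))
```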

Flag service outage takes down the feature — when the flag store is unreachable, requests error or stall. Root cause: no fallback default. Fix: configure default-off (legacy path) on evaluation failure so a flag outage degrades to the known-good schema path rather than an error.
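
A default-off fallback can be as small as a wrapper around whatever evaluation call the application uses. In this sketch, `evaluate` stands for that call and `enabled_or_legacy` is a hypothetical helper name; both are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def enabled_or_legacy(evaluate, flag_name: str, ctx) -> bool:
    """Evaluate a flag, falling back to off (the legacy path) on any failure."""
    try:
        return bool(evaluate(flag_name, ctx))
    except Exception:
        # A flag outage must degrade to the known-good schema path, not to an error.
        logger.warning("flag evaluation failed for %s; using legacy path", flag_name)
        return False
```

Logging the fallback matters as much as taking it: a silent default-off during a flag outage looks identical to an intentional rollback on the dashboard.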

Stale flag left on forever — months later the flag still exists, branching dead code and confusing on-call. Root cause: no teardown step in the rollout plan. Fix: treat flag removal as a required final step and remove the flag with the legacy code path, covered in the child page below.

Child Page Index

This section splits into the two operations that bookend a flagged rollout. Using Feature Flags to Toggle Schema Changes Safely covers getting the toggle correct under production traffic — stable bucketing, explicit evaluation contexts for cron and queue workers, default-off fallback, and proving the database tolerates both paths. Coupling Schema Changes to Feature Flags and Removing Both covers the teardown: holding at 100%, then retiring the flag and the legacy code path together so no dead branch or orphaned column lingers. Both extend the parent Zero-Downtime Schema Evolution Patterns overview.

Frequently Asked Questions

Does flipping a feature flag run any DDL? No. The migration is applied additively ahead of time, and the schema tolerates both the flag-on and flag-off paths. Flipping the flag is a data change in the flag store that redirects application traffic; it never issues an ALTER, which is exactly why rollback is instant.

How do I keep reads consistent while only some traffic uses the new path? Bucket on a stable key such as user_id or tenant_id. Because the same user always hashes to the same bucket, they stay on a single path across requests, so their reads and writes remain consistent even while the global rollout sits at, say, 30%.

What happens to cron jobs and queue consumers during a percentage rollout? They evaluate the flag with no user context, so unless you pin them they fall to the default bucket unpredictably. Give every background context an explicit value — usually the legacy path until the rollout completes — so a job never reads a schema the rest of the fleet has not adopted.

When is it safe to remove the flag? After the rollout has held at 100% long enough to trust, and only as a paired change that deletes the flag and the now-dead legacy code path together. Leaving the flag in place indefinitely accumulates dead branches and on-call confusion.
