Optimizing Backfill Scripts for Zero-Downtime Deploys
You added a nullable column on a live table, the schema change applied in milliseconds, and then the backfill job that populates it took your primary down. Replica lag climbed past five minutes, application writes started timing out with ERROR 1205, and your on-call pager fired during peak traffic. The schema migration was the easy part; the multi-million-row UPDATE that fills the new column is where most zero-downtime deploys actually fail. This page is the runbook for diagnosing a backfill that is hurting production and rewriting it into a throttled, idempotent, lag-aware job that finishes without anyone noticing. It assumes a standard primary-replica topology on PostgreSQL 12+ or MySQL 8.0 (InnoDB) with live application traffic and no maintenance window.
A correct backfill is the bridge between the expand step of the Expand and Contract Methodology and the contract step that finally enforces NOT NULL. Get it wrong and every downstream phase stalls.
Symptom / Error Signatures
A backfill that is degrading production shows up as one or more of the following. Run these read-only diagnostics against the primary to confirm which failure you are hitting.
| Symptom | Diagnostic query (read-only, safe on primary) |
|---|---|
ERROR 1205 (HY000): Lock wait timeout exceeded (MySQL) |
SELECT trx_id, trx_state, trx_query, trx_wait_started FROM information_schema.innodb_trx WHERE trx_state = 'LOCK WAIT'; |
canceling statement due to lock timeout (PostgreSQL) |
SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event_type = 'Lock'; |
| Replication lag climbing past the budget | SHOW REPLICA STATUS\G → Seconds_Behind_Source; PostgreSQL: SELECT now() - pg_last_xact_replay_timestamp() AS lag; |
| InnoDB row-lock contention spiking | SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%'; |
| Connection pool saturation | SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; or MySQL SHOW STATUS LIKE 'Threads_running'; |
On MySQL 8.0.22+, SHOW SLAVE STATUS is deprecated in favor of SHOW REPLICA STATUS, and Seconds_Behind_Source replaces Seconds_Behind_Master. The single most reliable early-warning signal is replica lag: it rises before the primary itself shows distress, which is why the fix in this page centers on lag as the throttle input. Choosing the right chunk size against that signal is detailed in tuning backfill batch size against replication lag.
Root Cause Analysis
A monolithic backfill runs one unbounded UPDATE … WHERE new_col IS NULL across the whole table. The storage engine holds every row lock it acquires until the single enclosing transaction commits, so a job touching millions of rows blocks concurrent application writes for the entire run, inflates the undo/redo log, and floods the binary log (MySQL) or WAL stream (PostgreSQL) with one enormous transaction the replica must replay serially. That serial replay is what drives Seconds_Behind_Source up.
Two secondary causes compound it. First, a missing index on the filter column (new_col IS NULL) or the chunking key forces a sequential scan on every iteration, multiplying I/O and CPU. Second, a fixed batch size tuned against the primary’s capacity ignores the slowest replica — the primary keeps up fine while a follower falls minutes behind, and any read traffic routed there serves stale data. Picking the pagination strategy that keeps each chunk on an index range is covered in cursor-based vs keyset pagination for large backfills. The deeper rationale for treating data population as a throttled background job, separate from the DDL, lives in the Backfill Optimization overview.
Immediate Mitigation
If a backfill is actively hurting the primary right now, stabilize before you optimize. Execute in order; primary stability outranks backfill completion.
-
Identify and stop the runaway statement. Find the offending session, then cancel it.
-- MySQL · run as a user with PROCESS/SUPER · KILL the thread id from innodb_trx KILL <thread_id>; -- PostgreSQL · cancel gracefully first; only terminate if cancel does not return SELECT pg_cancel_backend(<pid>); -- SELECT pg_terminate_backend(<pid>); -
Confirm the in-flight transaction rolled back. A killed long transaction must release its locks; verify no
LOCK WAITrows remain ininformation_schema.innodb_trxbefore continuing. -
Drain pending application writes by widening the lock-wait window briefly.
-- MySQL · session-level only, never SET GLOBAL on a hot primary in normal operation -- Temporary: lets queued app transactions finish instead of erroring out SET SESSION innodb_lock_wait_timeout = 60; -
Shed read load. Route health checks and read-heavy endpoints to a replica that is caught up, so the primary recovers headroom while you fix the script.
-
Restart the worker single-threaded. Drop to one worker and a small batch until
Innodb_row_lock_waitsfalls back to baseline, then resume tuning.
Permanent Fix / Long-Term Pattern
Replace the monolithic transaction with deterministic, bounded, idempotent batches that throttle on replica lag.
-
Chunk by primary-key range, not
OFFSET. Each batch touches a contiguous index range and theIS NULLguard makes re-runs safe.-- PostgreSQL/MySQL · run inside an application loop, one transaction per batch -- Idempotent: re-running a batch updates nothing because new_col is already set UPDATE target_table SET new_col = computed_value WHERE id BETWEEN :min_id AND :max_id AND new_col IS NULL; -
Commit every batch. Keep each transaction at 1,000–5,000 rows so lock hold time stays in milliseconds. ORM helpers (
ActiveRecord.find_each) orpt-archiver --commit-each 2000do this for you; otherwise commit explicitly between ranges. -
Index the filter and chunking columns so every batch is an index-range scan, never a sequential scan. A composite index on
(id)plus a partial indexWHERE new_col IS NULL(PostgreSQL) keeps the scan tight as the unfilled set shrinks. -
Throttle on lag with backoff. Before each batch, read replica lag; if it exceeds the budget, sleep or shrink the batch. Add exponential backoff on
LockWaitTimeout.# Shell · run as a separate post-deploy job, NOT inline with the migration step # Halts/sleeps whenever any replica exceeds --max-lag-seconds; safe to re-run from last cursor ./bin/backfill --table target_table --column new_col \ --batch 2000 --max-lag-seconds 2 --backoff-on-lock-wait -
Gate downstream phases on backfill health. Do not enable a feature flag, switch reads, or enforce
NOT NULLuntil lag is at zero andInnodb_row_lock_time_avgis under budget. Sizing batches precisely against a moving lag target is the subject of tuning backfill batch size against replication lag.
Verification Checklist
SELECT count(*) FROM target_table WHERE new_col IS NULL;returns0.CHECKSUM TABLE target_table(MySQL) orSELECT md5(string_agg(id::text, ',' ORDER BY id)) FROM target_table(PostgreSQL) matches the expected baseline.Seconds_Behind_Source = 0(MySQL) ornow() - pg_last_xact_replay_timestamp()under one second (PostgreSQL).information_schema.innodb_trxshows no long-running orLOCK WAITtransactions from the backfill user.NULL-fallback errors and5xxrate stays under 0.1%.
Frequently Asked Questions
What batch size should I start with?
Start small — 1,000 to 2,000 rows — and let the lag throttle, not a guessed number, set the ceiling. A batch is too large the moment it pushes any replica past your --max-lag-seconds budget. Measure on the slowest follower, never the primary, because the primary almost always keeps up while a replica falls behind.
Why prefer key-range chunking over LIMIT … OFFSET?
OFFSET forces the engine to scan and discard every prior row on each iteration, so the job gets quadratically slower as it progresses. Key-range chunking (WHERE id BETWEEN …) rides a B-tree index range and stays constant-time per batch. The trade-offs between the cursor approaches are laid out in cursor-based vs keyset pagination for large backfills.
Should the backfill run inside the deploy step? No. Run it as a separate post-deploy job. Inline backfills couple migration success to a long-running data operation, block the rollout, and have every replica race to replay one giant transaction. Apply the additive schema, deploy the code, then start the throttled backfill independently.