Optimizing Backfill Scripts for Zero-Downtime Deploys

You added a nullable column on a live table, the schema change applied in milliseconds, and then the backfill job that populates it took your primary down. Replica lag climbed past five minutes, application writes started timing out with ERROR 1205, and your on-call pager fired during peak traffic. The schema migration was the easy part; the multi-million-row UPDATE that fills the new column is where most zero-downtime deploys actually fail. This page is the runbook for diagnosing a backfill that is hurting production and rewriting it into a throttled, idempotent, lag-aware job that finishes without anyone noticing. It assumes a standard primary-replica topology on PostgreSQL 12+ or MySQL 8.0 (InnoDB) with live application traffic and no maintenance window.

A correct backfill is the bridge between the expand step of the Expand and Contract Methodology and the contract step that finally enforces NOT NULL. Get it wrong and every downstream phase stalls.

Symptom / Error Signatures

A backfill that is degrading production shows up as one or more of the following. Run these read-only diagnostics against the primary to confirm which failure you are hitting.

Symptom Diagnostic query (read-only, safe on primary)
ERROR 1205 (HY000): Lock wait timeout exceeded (MySQL) SELECT trx_id, trx_state, trx_query, trx_wait_started FROM information_schema.innodb_trx WHERE trx_state = 'LOCK WAIT';
canceling statement due to lock timeout (PostgreSQL) SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event_type = 'Lock';
Replication lag climbing past the budget SHOW REPLICA STATUS\GSeconds_Behind_Source; PostgreSQL: SELECT now() - pg_last_xact_replay_timestamp() AS lag;
InnoDB row-lock contention spiking SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';
Connection pool saturation SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; or MySQL SHOW STATUS LIKE 'Threads_running';

On MySQL 8.0.22+, SHOW SLAVE STATUS is deprecated in favor of SHOW REPLICA STATUS, and Seconds_Behind_Source replaces Seconds_Behind_Master. The single most reliable early-warning signal is replica lag: it rises before the primary itself shows distress, which is why the fix in this page centers on lag as the throttle input. Choosing the right chunk size against that signal is detailed in tuning backfill batch size against replication lag.

Root Cause Analysis

A monolithic backfill runs one unbounded UPDATE … WHERE new_col IS NULL across the whole table. The storage engine holds every row lock it acquires until the single enclosing transaction commits, so a job touching millions of rows blocks concurrent application writes for the entire run, inflates the undo/redo log, and floods the binary log (MySQL) or WAL stream (PostgreSQL) with one enormous transaction the replica must replay serially. That serial replay is what drives Seconds_Behind_Source up.

Two secondary causes compound it. First, a missing index on the filter column (new_col IS NULL) or the chunking key forces a sequential scan on every iteration, multiplying I/O and CPU. Second, a fixed batch size tuned against the primary’s capacity ignores the slowest replica — the primary keeps up fine while a follower falls minutes behind, and any read traffic routed there serves stale data. Picking the pagination strategy that keeps each chunk on an index range is covered in cursor-based vs keyset pagination for large backfills. The deeper rationale for treating data population as a throttled background job, separate from the DDL, lives in the Backfill Optimization overview.

Lag-aware backfill control loop A backfill worker writes a batch to the primary, the replica reports lag, and a throttle compares lag to the budget to either continue, shrink the batch, or pause before the next batch. Lag-Aware Backfill Loop Worker writes one PK-range batch Primary commits, replica replays Measure lag vs budget (max-lag-seconds) lag under budget → continue · over budget → shrink batch or sleep The replica's lag, not the primary's load, sets the pace.
The backfill never advances faster than the slowest replica can replay — lag is the control input for batch size and sleep.

Immediate Mitigation

If a backfill is actively hurting the primary right now, stabilize before you optimize. Execute in order; primary stability outranks backfill completion.

  1. Identify and stop the runaway statement. Find the offending session, then cancel it.

    -- MySQL · run as a user with PROCESS/SUPER · KILL the thread id from innodb_trx
    KILL <thread_id>;
    
    -- PostgreSQL · cancel gracefully first; only terminate if cancel does not return
    SELECT pg_cancel_backend(<pid>);
    -- SELECT pg_terminate_backend(<pid>);
  2. Confirm the in-flight transaction rolled back. A killed long transaction must release its locks; verify no LOCK WAIT rows remain in information_schema.innodb_trx before continuing.

  3. Drain pending application writes by widening the lock-wait window briefly.

    -- MySQL · session-level only, never SET GLOBAL on a hot primary in normal operation
    -- Temporary: lets queued app transactions finish instead of erroring out
    SET SESSION innodb_lock_wait_timeout = 60;
  4. Shed read load. Route health checks and read-heavy endpoints to a replica that is caught up, so the primary recovers headroom while you fix the script.

  5. Restart the worker single-threaded. Drop to one worker and a small batch until Innodb_row_lock_waits falls back to baseline, then resume tuning.

Permanent Fix / Long-Term Pattern

Replace the monolithic transaction with deterministic, bounded, idempotent batches that throttle on replica lag.

  1. Chunk by primary-key range, not OFFSET. Each batch touches a contiguous index range and the IS NULL guard makes re-runs safe.

    -- PostgreSQL/MySQL · run inside an application loop, one transaction per batch
    -- Idempotent: re-running a batch updates nothing because new_col is already set
    UPDATE target_table
    SET    new_col = computed_value
    WHERE  id BETWEEN :min_id AND :max_id
      AND  new_col IS NULL;
  2. Commit every batch. Keep each transaction at 1,000–5,000 rows so lock hold time stays in milliseconds. ORM helpers (ActiveRecord.find_each) or pt-archiver --commit-each 2000 do this for you; otherwise commit explicitly between ranges.

  3. Index the filter and chunking columns so every batch is an index-range scan, never a sequential scan. A composite index on (id) plus a partial index WHERE new_col IS NULL (PostgreSQL) keeps the scan tight as the unfilled set shrinks.

  4. Throttle on lag with backoff. Before each batch, read replica lag; if it exceeds the budget, sleep or shrink the batch. Add exponential backoff on LockWaitTimeout.

    # Shell · run as a separate post-deploy job, NOT inline with the migration step
    # Halts/sleeps whenever any replica exceeds --max-lag-seconds; safe to re-run from last cursor
    ./bin/backfill --table target_table --column new_col \
      --batch 2000 --max-lag-seconds 2 --backoff-on-lock-wait
  5. Gate downstream phases on backfill health. Do not enable a feature flag, switch reads, or enforce NOT NULL until lag is at zero and Innodb_row_lock_time_avg is under budget. Sizing batches precisely against a moving lag target is the subject of tuning backfill batch size against replication lag.

Verification Checklist

  • SELECT count(*) FROM target_table WHERE new_col IS NULL; returns 0.
  • CHECKSUM TABLE target_table (MySQL) or SELECT md5(string_agg(id::text, ',' ORDER BY id)) FROM target_table (PostgreSQL) matches the expected baseline.
  • Seconds_Behind_Source = 0 (MySQL) or now() - pg_last_xact_replay_timestamp() under one second (PostgreSQL).
  • information_schema.innodb_trx shows no long-running or LOCK WAIT transactions from the backfill user.
  • NULL-fallback errors and 5xx rate stays under 0.1%.

Frequently Asked Questions

What batch size should I start with? Start small — 1,000 to 2,000 rows — and let the lag throttle, not a guessed number, set the ceiling. A batch is too large the moment it pushes any replica past your --max-lag-seconds budget. Measure on the slowest follower, never the primary, because the primary almost always keeps up while a replica falls behind.

Why prefer key-range chunking over LIMIT … OFFSET? OFFSET forces the engine to scan and discard every prior row on each iteration, so the job gets quadratically slower as it progresses. Key-range chunking (WHERE id BETWEEN …) rides a B-tree index range and stays constant-time per batch. The trade-offs between the cursor approaches are laid out in cursor-based vs keyset pagination for large backfills.

Should the backfill run inside the deploy step? No. Run it as a separate post-deploy job. Inline backfills couple migration success to a long-running data operation, block the rollout, and have every replica race to replay one giant transaction. Apply the additive schema, deploy the code, then start the throttled backfill independently.