Tuning Backfill Batch Size Against Replication Lag

The pager fires at 02:14: replica lag 47s on the analytics follower, and the on-call dashboard shows read latency climbing on every endpoint that routes to it. Nothing has deployed. The only thing running is the overnight backfill that populates a new orders.region_code column, churning through 50,000 rows per batch as fast as the primary will accept them. The primary is healthy — but each batch generates a wall of WAL (PostgreSQL) or binlog (MySQL) events that a single-threaded replica apply loop cannot keep up with, and the followers fall further behind with every commit. The fix is not “make the backfill slower” in the abstract; it is choosing a batch size that trades throughput against the lag your slowest replica can absorb, and backing off automatically the moment that lag rises.

This is the most common way a correct, idempotent backfill still causes an incident. The data is written perfectly; the availability of every read replica is destroyed as collateral. Below is how to recognize it, stop it right now, and replace fixed batch sizes with an adaptive loop that self-throttles.

Symptom / Error Signatures

The backfill itself never errors. The damage shows up on the replicas and on any traffic routed to them.

  • PostgreSQL: a growing gap in pg_stat_replication, specifically replay_lag measured in seconds rather than milliseconds, and a rising byte delta between pg_current_wal_lsn() and the replica’s replay_lsn.
  • MySQL: SHOW REPLICA STATUS reports Seconds_Behind_Master (or Seconds_Behind_Source on 8.0.22+) climbing into the tens of seconds and not recovering between batches.
  • Application logs show stale reads: a row written through the primary is not yet visible on a replica, so a read-after-write returns the old value or a 404.
  • Replica CPU is pinned on a single core — the SQL apply thread — while the primary sits at moderate load. The asymmetry is the tell.
  • Connection-pool saturation alarms on the replica as queries queue behind the lagging apply, the failure mode covered in Configuring Read Replicas for Seamless Schema Updates.

If the backfill stops and Seconds_Behind_Master drains back to zero within a minute, you have confirmed the backfill — not a query or a network fault — is the source.

Root Cause Analysis

Replication is asymmetric. The primary applies a large UPDATE using all the parallelism its query planner can muster, but the replica replays the resulting change stream more serially. On PostgreSQL the WAL receiver writes and a single startup/recovery process replays records in commit order. On MySQL, even with replica_parallel_workers set, parallelism is bounded by how the transactions group, and a tight backfill loop of large single-transaction batches gives the workers little to parallelize. The result: the primary can produce change faster than any follower can consume it.

Batch size is the lever that controls production rate. A 50,000-row batch commits one large transaction, emitting one large burst of WAL or binlog that the replica must replay before it can advance its position. A 1,000-row batch emits a smaller burst, gives the apply loop a commit boundary to breathe at, and lets lag drain between iterations. The total work is the same; the peak instantaneous lag is what changes.

Factor Large batch (e.g. 50,000) Small batch (e.g. 1,000)
Throughput on primary Higher (fewer round trips) Lower (more round trips)
Peak replica lag High — one big WAL/binlog burst Low — replica catches up between commits
Lock/visibility window per batch Long Short
Time to react to a lag spike Poor — committed before you can stop Good — pause between any two batches

The naive fix is to pick a small fixed batch and hope. That fails because the right size is not constant: it depends on the row width, current write traffic, the replica’s hardware, and whether a second job is competing for apply bandwidth. A batch size that is safe at 03:00 starves the replica at 09:00. The durable answer is a loop that reads the live lag and adapts, the same throttling principle that underpins Optimizing Backfill Scripts for Zero-Downtime Deploys.

Adaptive batch-size feedback loop A backfill worker writes a batch to the primary, reads replication lag from the replica, and either grows the batch when lag is low or shrinks and pauses when lag exceeds the ceiling. Lag-Aware Backfill Loop Write batch to primary Read lag from replica Decide grow / shrink / pause next batch size set by measured lag If lag > ceiling: shrink batch and sleep until lag < resume threshold.
Each iteration measures real lag before choosing the next batch size, so the job paces itself against the slowest follower instead of a guessed constant.

Immediate Mitigation

When lag is already climbing, act in this order.

  1. Pause the backfill, do not kill it. A correctly designed job is resumable. Signal it to stop after the current batch commits so you do not leave a half-written batch — though with idempotent upserts even an interrupted batch is safe to re-run.

  2. Confirm lag drains. Watch the live gap recover before doing anything else.

-- PostgreSQL · run on the PRIMARY as a monitoring role · read-only, safe anytime
SELECT application_name,
       (pg_current_wal_lsn() - replay_lsn) AS replay_bytes,
       replay_lag
FROM   pg_stat_replication
ORDER  BY replay_lag DESC NULLS LAST;
-- MySQL · run on each REPLICA as a monitoring role · read-only, safe anytime
-- Look at Seconds_Behind_Source (8.0.22+) or Seconds_Behind_Master.
SHOW REPLICA STATUS\G
  1. Cut the batch size before resuming. If you were running 50,000, drop to 2,000–5,000. Smaller batches give the replica commit boundaries to recover at.

  2. Add an inter-batch sleep tied to live lag. Do not resume at full speed. Resume only while measured lag is below your ceiling, sleeping otherwise.

# Resume the backfill with a hard lag ceiling and a smaller batch.
# Context: run as a single post-deploy worker, not in parallel; safe to re-run.
./bin/backfill --table orders --column region_code \
  --batch 2000 --max-lag-seconds 2 --pause-seconds 1
  1. If lag will not drain at all, the replica apply is fully saturated by ordinary traffic plus the backfill. Stop the backfill entirely until off-peak and shift it to a low-write window.

Permanent Fix / Long-Term Pattern

A fixed --batch flag is the root problem — it cannot know the live state of the replica. Replace it with an adaptive loop that measures lag every iteration and treats batch size as a controlled variable, growing it when there is headroom and shrinking it when lag rises. This is additive-increase, multiplicative-decrease applied to backfills: ramp up gently, back off hard.

# Adaptive backfill loop. Context: single worker, post-deploy job, idempotent writes.
# Engine-agnostic shape; lag() queries the appropriate engine view below.
LAG_CEIL   = 2.0   # seconds: above this, back off
LAG_RESUME = 0.5   # seconds: below this, grow again
batch      = 2000
last_id    = 0

while True:
    rows, last_id = upsert_next_batch(last_id, batch)   # keyset cursor, not OFFSET
    if rows == 0:
        break
    lag = max_replica_lag_seconds()
    if lag > LAG_CEIL:
        batch = max(500, batch // 2)        # multiplicative decrease
        sleep_until(lambda: max_replica_lag_seconds() < LAG_RESUME)
    elif lag < LAG_RESUME:
        batch = min(20000, batch + 1000)    # additive increase

The max_replica_lag_seconds() helper must read the worst follower, not an average — one slow replica is enough to serve stale reads.

-- PostgreSQL · run on PRIMARY · returns worst replay lag in seconds across all replicas
SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0) AS max_lag_s
FROM   pg_stat_replication;
-- MySQL · run via a probe against each replica; take the MAX across them.
-- Performance Schema exposes lag without parsing SHOW REPLICA STATUS:
SELECT MAX(LAG) FROM (
  SELECT TIMESTAMPDIFF(SECOND,
           LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
           NOW()) AS LAG
  FROM performance_schema.replication_applier_status_by_worker
) t;

Two structural choices keep this loop O(1) per page. First, iterate with a keyset cursor (WHERE id > :last_id ORDER BY id LIMIT :batch), never LIMIT/OFFSET, for the reasons detailed in Cursor-Based vs Keyset Pagination for Large Backfills. Second, write each batch as an idempotent upsert so a paused or retried job never double-applies. For the surrounding throttling and resumability concerns, the parent Backfill Optimization section is the reference.

Verification Checklist

  • pg_stat_replication.replay_lag / Seconds_Behind_Source) at least once per batch, not on a fixed timer.

Frequently Asked Questions

What batch size should I start with? Start small — 1,000 to 2,000 rows for typical row widths — and let the adaptive loop grow it. A fixed large batch optimizes throughput on the primary while ignoring the replica that actually constrains you. The correct size is whatever keeps your worst follower under the lag ceiling, and that value changes with traffic, so it should be discovered at runtime rather than hard-coded.

Why measure replica lag instead of just adding a fixed sleep between batches? A fixed sleep is tuned for one traffic level and is wrong at every other. During quiet hours it wastes time; during peak it is still too aggressive because ordinary writes already consume the replica’s apply bandwidth. Reading live lag lets the job run fast when there is headroom and stop when there is not, which a constant sleep cannot do.

Does increasing replica parallel workers remove the need to throttle? It raises the ceiling but does not remove it. On MySQL, replica_parallel_workers only helps when transactions can be applied independently; a tight backfill of large single-transaction batches gives the workers little to parallelize. On PostgreSQL the replay path is effectively serial. Throttling against measured lag is still required.

Should the backfill ever read from a replica to find the next batch? No. Drive the cursor from the primary so a lagging replica never feeds you a stale last_id and causes you to skip or re-scan rows. Use replicas only as the lag signal, never as the source of the work queue.