Making Data Backfills Idempotent With Upserts

A backfill job dies at row 2.4 million of 8 million — the worker pod was evicted, the connection dropped, or a replica-lag guard halted it mid-batch. You restart it, and now the dread sets in: did the rows it already wrote get written twice? Did a summary counter get double-incremented? If your backfill uses a plain INSERT or a blind UPDATE ... SET col = col + n, a retry is not a retry — it is a second, overlapping mutation of the same data. The fix is to make every write converge to the same state no matter how many times it runs, and the idiomatic tool for that is the upsert: INSERT ... ON CONFLICT on PostgreSQL and INSERT ... ON DUPLICATE KEY UPDATE on MySQL 8.0. This page shows how to structure a backfill so a partial failure and a full re-run land on byte-identical results.

Symptom / Error Signatures

A non-idempotent backfill announces itself when a retried or overlapping run hits rows the previous run already touched:

ERROR: duplicate key value violates unique constraint "pk_user_stats" (PostgreSQL) — the retry re-inserts a row the first pass already created.
ERROR 1062 (23000): Duplicate entry '40219' for key 'user_stats.PRIMARY' (MySQL) — same collision under InnoDB.
Silent data drift: a SUM(amount) or counter column reads roughly double the correct value after a restart, with no error logged at all — the most dangerous signature because nothing fails loudly.
Reconciliation queries report more populated rows than source rows: SELECT count(*) FROM target exceeds SELECT count(DISTINCT id) FROM source.
The job’s own progress cursor and the actual row count disagree after a crash, because the cursor advanced but the commit did not — or vice versa.

The throttling and batching mechanics that surround these writes are covered in Optimizing Backfill Scripts for Zero-Downtime Deploys; here the focus is making each individual write replay-safe.

Root Cause Analysis

A backfill is a long, batched mutation that runs outside a single transaction — it has to be, because one transaction over millions of rows would hold locks and bloat the write-ahead log past any reasonable budget. That means the job has many commit points, and a crash can land between any two of them. Recovery restarts from the last known cursor, but “known” is approximate: the cursor may have been persisted before the batch committed, or the batch may have committed before the cursor was saved. Either way, the recovering job will re-process some rows.

Re-processing is only dangerous when the write is not a pure function of its input. A plain INSERT is not: run it twice and you get a unique-constraint violation or a duplicate row. A relative UPDATE (SET total = total + :n) is not: run it twice and the increment applies twice. The upsert converts both into convergent operations. INSERT ... ON CONFLICT (key) DO UPDATE SET col = EXCLUDED.col either creates the row or overwrites it with the same computed value — the second run is a no-op in effect. The critical discipline is that the DO UPDATE clause must assign an absolute value derived from the source, never a relative delta.

The two engines diverge in conflict-target syntax and in how they expose the incoming row:

Behavior	PostgreSQL	MySQL 8.0 (InnoDB)
Upsert syntax	`INSERT ... ON CONFLICT (col) DO UPDATE`	`INSERT ... ON DUPLICATE KEY UPDATE`
Conflict target	Explicit: name the unique index/columns	Implicit: any `PRIMARY`/`UNIQUE` key
Reference incoming row	`EXCLUDED.col`	`VALUES(col)` (deprecated 8.0.20+) or row alias `new.col`
Skip instead of update	`ON CONFLICT ... DO NOTHING`	`INSERT IGNORE`
Reports rows changed	`RETURNING` available	`ROW_COUNT()`: 1 insert, 2 update

Because MySQL’s ON DUPLICATE KEY UPDATE triggers on any unique key, a table with two unique constraints can update an unexpected row — always know which keys are unique before relying on it. PostgreSQL forces you to name the conflict target, which removes that ambiguity.

The retried batch 3' overlaps the crashed batch 3, but each upsert overwrites with an absolute value, so the overlap converges instead of double-writing.

Immediate Mitigation

If a backfill has already crashed and you are not yet sure whether it double-wrote, do not blindly restart a relative-update job.

Freeze the job and quantify the damage. Compare target against source before touching anything.

-- PostgreSQL · read-only · safe to run anytime, but may be slow on large tables
-- Counts target rows that exceed the source's distinct keys (a sign of duplicates).
SELECT (SELECT count(*) FROM user_stats) AS target_rows,
       (SELECT count(DISTINCT user_id) FROM events) AS source_keys;

Convert the write to an upsert before resuming. On PostgreSQL, name the conflict target explicitly and assign absolute values from the source.

-- PostgreSQL · run as backfill role · execute in batches OUTSIDE one long transaction
-- Idempotent: re-running the same batch overwrites with the identical computed value.
INSERT INTO user_stats (user_id, event_count, last_seen)
SELECT user_id, count(*), max(created_at)
FROM events
WHERE user_id BETWEEN :lo AND :hi
GROUP BY user_id
ON CONFLICT (user_id)
DO UPDATE SET event_count = EXCLUDED.event_count,
              last_seen   = EXCLUDED.last_seen;

On MySQL, use the row alias form (new) rather than the deprecated VALUES() function.

-- MySQL 8.0.19+ · run as backfill role · per-batch commit, not one giant transaction
-- ON DUPLICATE KEY UPDATE fires on the PRIMARY key; absolute assignment keeps it idempotent.
-- For INSERT ... SELECT the row alias must sit on the source as a derived-table alias.
INSERT INTO user_stats (user_id, event_count, last_seen)
SELECT new.user_id, new.event_count, new.last_seen
FROM (
    SELECT user_id,
           COUNT(*)        AS event_count,
           MAX(created_at) AS last_seen
    FROM events
    WHERE user_id BETWEEN :lo AND :hi
    GROUP BY user_id
) AS new
ON DUPLICATE KEY UPDATE event_count = new.event_count,
                        last_seen   = new.last_seen;

Repair any already-doubled counters by recomputing from source rather than trying to subtract. A recompute upsert is itself idempotent, so it can run as the new permanent backfill.

Permanent Fix / Long-Term Pattern

Make idempotency a property of the write itself, not of the orchestration around it. The rule is simple: a backfill statement must produce the same final row state whether it runs once or a hundred times. Three habits enforce that, and they extend the broader guarantees described in Idempotent Script Design and How to Write Idempotent SQL Scripts for Safe Deploys.

First, never write a relative mutation in a backfill. Replace every SET total = total + :delta with an absolute SET total = EXCLUDED.total where the value is recomputed from the source. If the target genuinely accumulates from multiple sources, aggregate the source completely in the SELECT and assign the full aggregate.

Second, require a unique key on the conflict target. An upsert is only idempotent if the conflict is detected — without a PRIMARY KEY or UNIQUE index on the natural key, ON CONFLICT has nothing to fire on and ON DUPLICATE KEY UPDATE silently inserts duplicates. Add the constraint before the backfill, following the online index techniques in the Transactional vs Non-Transactional Databases guidance for the engine you run.

Third, make the cursor commit atomic with the data. Persist the progress cursor in the same transaction as the batch it describes, so a crash can never leave the cursor ahead of the data. With an idempotent upsert this is belt-and-suspenders — even a re-processed batch is harmless — but it keeps progress accounting honest.

-- PostgreSQL · run as backfill role · one batch per transaction (BEGIN/COMMIT per loop)
-- Cursor and data commit together, so recovery is exact and the upsert makes overlap safe.
BEGIN;
INSERT INTO user_stats (user_id, event_count, last_seen)
SELECT user_id, count(*), max(created_at)
FROM events WHERE user_id > :last_id ORDER BY user_id LIMIT 1000
GROUP BY user_id
ON CONFLICT (user_id) DO UPDATE
  SET event_count = EXCLUDED.event_count, last_seen = EXCLUDED.last_seen;
UPDATE backfill_progress SET last_id = (SELECT max(user_id) FROM events
  WHERE user_id > :last_id ORDER BY user_id LIMIT 1000) WHERE job = 'user_stats';
COMMIT;

Verification Checklist

Every backfill write is an INSERT ... ON CONFLICT DO UPDATE (PostgreSQL) or INSERT ... ON DUPLICATE KEY UPDATE (MySQL), never a bare INSERT or relative UPDATE.
The DO UPDATE / ON DUPLICATE KEY UPDATE clause assigns absolute values from the source, with zero col = col + delta expressions.
A PRIMARY KEY or UNIQUE index exists on the natural key the upsert conflicts against, and it is the only unique key (MySQL) or named explicitly (PostgreSQL).
Running the same batch twice in a test produces an identical target row (diff the row before and after the second run).
A reconciliation query confirms count(target) equals count(distinct source key) after a simulated mid-run crash and restart.
The progress cursor commits in the same transaction as its batch, so recovery never skips or re-skips rows ambiguously.

Frequently Asked Questions

Does an upsert hurt backfill throughput compared to a plain INSERT? Marginally. The upsert performs an index probe on the conflict target per row, so it is slightly slower than a blind insert into a fresh table. On any table you might re-run — which is every production backfill — the cost is trivial against the alternative of cleaning up double-writes by hand. Batch sizing and replica-lag throttling dominate throughput far more than the upsert overhead.

Can I use ON CONFLICT DO NOTHING instead of DO UPDATE? Only if the first write is always correct and you never need to refresh the value. DO NOTHING makes the insert idempotent but skips already-present rows entirely, so if a later run computes a corrected value it will be ignored. For a backfill that recomputes from a changing source, DO UPDATE with absolute assignment is the safe default; reserve DO NOTHING for write-once seed data.

Why is my MySQL ON DUPLICATE KEY UPDATE updating the wrong row? Because it fires on any unique key, not just the primary key. If the table has a second UNIQUE constraint, an insert that collides on that secondary key updates that row instead of inserting a new one. Audit the table for all unique indexes before relying on the clause, and on PostgreSQL name the conflict target explicitly to avoid the ambiguity altogether.