Preventing Data Loss During Dual-Write Migrations

Your checksum job just reported that 0.3% of rows in the new table disagree with the legacy table, and you have no idea when the drift started. Dual-write migrations — where the application writes the same record to both an old and a new schema until cutover — are the highest-risk window in any zero-downtime change, because they fail silently. There is no outage, no stack trace at the moment of corruption; the second write simply lands out of order, gets dropped on a network blip, or overwrites a newer value, and you discover it days later when reconciliation runs. This page is the runbook for catching that divergence, containing it before cutover, and rewriting the write path so it cannot recur.

The dual-write bridge is the connective tissue of the Expand and Contract Methodology: you cannot safely switch reads or run the contract step until both write paths are provably consistent. Examples below are PostgreSQL; the MySQL 8.0 INSERT … ON DUPLICATE KEY UPDATE form is equivalent.

Symptom / Error Signatures

Data loss during a dual-write migration almost never surfaces as a hard failure. Watch for these instead:

  • Periodic MD5() or CHECKSUM TABLE validation jobs report a nonzero delta between legacy and new tables.
  • Application-level DataIntegrityViolationException, StaleObjectStateException, or DuplicateKeyException from ORM optimistic-locking or cache-invalidation gaps.
  • pg_stat_activity shows sessions stuck idle in transaction, holding one write open while the paired write blocks — a classic partial-commit window.
  • Foreign-key violations during reconciliation because a parent row reached one schema but not the other.
  • Read-after-write inconsistency when a read routes to a replica before the dual-write commit has replicated.

Confirm divergence with a windowed comparison rather than a full-table scan:

-- PostgreSQL · read-only · safe on primary · uses indexed id + updated_at, batches via LIMIT
SELECT l.id, l.updated_at AS legacy_ts, n.updated_at AS new_ts
FROM   legacy_table l
LEFT JOIN new_table n ON l.id = n.id
WHERE  n.id IS NULL OR l.updated_at > n.updated_at
ORDER BY l.updated_at DESC
LIMIT  1000;

Root Cause Analysis

Every silent dual-write loss traces back to one fact: writing to two tables is not atomic unless they share a transaction. When the application issues two separate statements (or worse, two separate database calls), any failure between them leaves one schema updated and the other stale. Four mechanisms turn that gap into lost data:

  • Non-atomic routing. The legacy write commits, then the new-table write fails or is skipped, and nothing rolls the first one back.
  • Missing idempotency keys in asynchronous event processors, so a retried message double-applies or a reordered message overwrites a newer value.
  • No version guard, letting a slower concurrent write land last and clobber the correct value — a last-writer-wins race.
  • Replica routing during the commit window, where a read hits a follower that has the legacy write but not yet the new one.

The structural fix is to make both writes part of one durable unit, which is the entire premise of Dual-Write Synchronization. Once divergence has occurred, you also need a deterministic way to decide which row wins, covered in reconciling divergence between dual-written tables.

Where dual-writes lose data Two write paths from the application: the non-atomic path commits the legacy write then loses the new write on failure, while the atomic path wraps both in one transaction so either both commit or both roll back. Two Write Paths, Two Outcomes Non-atomic (loses data) write legacy write new FAILS, lost Atomic (safe) BEGIN · write legacy · write new · COMMIT both commit or both roll back Version-guarded UPSERT rejects stale overwrites
Wrap both writes in one transaction so failure rolls both back, and guard the new-table write with a version predicate so a slow write cannot clobber a newer value.

Immediate Mitigation

If validation is reporting active divergence, contain it before it compounds. Execute in this exact order.

  1. Halt secondary writes. Route all traffic to the legacy write path via your feature-flag service or load balancer, and queue new-table writes for later reconciliation rather than dropping them.

  2. Trip the circuit breaker on background jobs. Disable cron jobs, async workers, and batch processors that target the new schema, so reconciliation works against a frozen target.

  3. Quantify the divergence. Run the windowed comparison query from the symptoms section in batches to size the affected set before patching anything.

  4. Pin reads to the primary. Disable replica routing for the affected tables until lag drops below 50 ms, eliminating read-after-write inconsistency while you work.

    -- PostgreSQL · read-only · confirm replication is current before re-enabling replica reads
    SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
  5. Apply a deterministic patch. Reconcile the new table from the authoritative legacy rows using a version-guarded upsert (below), then re-verify counts before lifting the freeze.

Permanent Fix / Long-Term Pattern

Make both writes durable as a unit and make the new-table write impossible to corrupt by ordering. This is the idempotent core of Dual-Write Synchronization.

  • Make the two writes atomic. When both tables live in one database, write them in the same transaction so a failure rolls both back. For cross-database writes, use the outbox pattern (write the legacy row plus an outbox event in one transaction, then a relay applies the new-table write with at-least-once delivery) rather than relying on best-effort dual calls.

  • Guard every new-table write with a version predicate. An idempotent, monotonic upsert rejects any write that would overwrite a newer value, which makes retries and reordering harmless.

    -- PostgreSQL · application write path · idempotent and reorder-safe
    -- Rejects stale overwrites: the WHERE on the conflict clause keeps the higher version
    INSERT INTO new_table (id, col1, col2, version)
    VALUES (:id, :col1, :col2, :version)
    ON CONFLICT (id) DO UPDATE
      SET col1 = EXCLUDED.col1, col2 = EXCLUDED.col2, version = EXCLUDED.version
      WHERE EXCLUDED.version > new_table.version;
  • Run continuous checksum validation. Schedule lightweight, partitioned checksum comparisons during off-peak hours and alert on any delta above 0.01% so drift is caught in hours, not at decommission time.

  • Gate cutover on a consistency SLO. Advance traffic to the new read path only after validation reports 99.99% consistency over a rolling 24-hour window. The reconciliation tooling that closes the last gap is detailed in reconciling divergence between dual-written tables.

When you must abort, disable the new path first, then reconcile — never the reverse:

-- PostgreSQL · incident response · disable routing BEFORE any data operation
UPDATE feature_flags SET is_active = false WHERE name = 'dual_write_migration';
-- Then reload app config to force the legacy-only write path, then run the
-- version-guarded INSERT … ON CONFLICT above as a one-shot reconciliation.

Verification Checklist

  • DataIntegrityViolationException or DuplicateKeyException in application logs over a 72-hour observation window.

Frequently Asked Questions

How do I detect divergence before it becomes data loss? Run a continuous, partitioned checksum job comparing legacy and new tables on a short interval and alert on any delta above a tight threshold such as 0.01%. The earlier you catch a few divergent rows, the cheaper the reconciliation; waiting until decommission turns a five-minute patch into a forensic exercise.

Can I just use two-phase commit across both tables? If both tables are in the same database, a single local transaction is simpler and safer than 2PC — use it. Reserve distributed two-phase commit or the outbox pattern for genuinely cross-database writes, and prefer the outbox because it degrades gracefully (retry) where 2PC blocks on a coordinator failure.

What ordering rule prevents the last-writer-wins race? A monotonic version or updated_at column plus an upsert that only overwrites when the incoming version is strictly greater. That single predicate makes writes idempotent and reorder-safe, so retries and out-of-order delivery cannot corrupt the row.