Dual-Write Synchronization
When a migration splits a column into a new table, changes a type, or moves data to a re-shaped structure, there is a window where both the old and the new shape must stay correct at the same time — because both the previous application version and the new one are serving traffic. Dual-write synchronization is the technique that keeps them in step: every mutation is applied to both stores, so whichever shape a given request reads, it sees current data. The hard part is not writing twice; it is guaranteeing that a write never lands in one store but not the other, because a single silent half-write is the seed of divergence that surfaces hours later as a reconciliation alert. This is the synchronization phase of the Zero-Downtime Schema Evolution Patterns lifecycle, and it serves the engineer changing the data-access layer and the DBA who must trust that the two tables agree.
Dual-write is what makes a Backfill Optimization pass meaningful: with both paths live, the backfill only has to handle rows that existed before synchronization began, while every new row arrives in both shapes already. It is the data-movement half of the Expand and Contract Methodology, and it is almost always gated behind Feature Flag Rollouts so the second write can be enabled per tenant and disabled instantly if the new store misbehaves.
Concept & Mechanism
The naive implementation — write to store A, then write to store B in the same request — has a fatal gap: if the process crashes or store B times out between the two writes, A has the row and B does not, and nothing remembers to fix it. Across a network this is unavoidable; you cannot make two independent writes atomic by sequencing them. The two correct mechanisms both eliminate the gap by making the intent to write B durable inside the same transaction that writes A.
The strongest version, when both shapes live in the same database, is a single local transaction that touches both tables — the legacy row and the new row commit together or not at all, with the engine’s atomicity guaranteeing parity. When the new shape is a separate store (a different database, a search index, a service), you cannot share a transaction, so you use the transactional outbox: the request writes the legacy row and an outbox row in one local transaction, and an async worker reads the outbox and delivers to the new store with retries. Delivery becomes at-least-once, which is why every write into the new store must be idempotent — keyed by an idempotency key so a re-delivery upserts rather than duplicates. The data-loss reasoning behind these choices is detailed in Preventing Data Loss During Dual-Write Migrations.
Even with at-least-once delivery, the two stores will drift: a worker lag spike, a conflict between the backfill and a live write, a clock skew on a “last write wins” rule. Dual-write therefore always ships with continuous reconciliation — a job that compares checksums and repairs divergence — covered in depth by Reconciling Divergence Between Dual-Written Tables.
Prerequisites & Decision Criteria
Reach for dual-write only when a single online ALTER cannot express the change — a table split, a type change requiring transformation, or a move to a separate store. For an in-place additive column, a plain expand-and-backfill is simpler and safer. Confirm before enabling the second write:
> 0.01%sustained) are deployed before the flag is turned on, not after
The authoritative-source decision matters: keep the legacy store authoritative until reconciliation proves the new store trustworthy. Cutting reads over before that is how a synchronization bug becomes a user-visible data bug.
Step-by-Step Procedure
1. Confirm schema compatibility on the target. Apply the legacy schema shape to the target and assert it accepts identical payloads before any traffic.
# PostgreSQL · run against STAGING · read-only against legacy, schema-only apply to target.
# Context: ON_ERROR_STOP makes any incompatibility fail loudly instead of partially applying.
pg_dump --schema-only --no-owner legacy_db \
| psql --set ON_ERROR_STOP=on "$TARGET_STAGING_URL"
Verify before proceeding: the apply exits zero. Any constraint or type error here is a divergence you would otherwise discover in production.
2. Implement the write through a transactional outbox. Write the legacy row and the outbox entry in one transaction; deliver to the target asynchronously and idempotently.
// PostgreSQL legacy store · runs in the application request path.
// Context: legacy write + outbox commit atomically; target delivery is async, at-least-once, idempotent.
func DualWriteInsert(ctx context.Context, db *sql.DB, r Record) error {
tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelReadCommitted})
if err != nil {
return err
}
defer tx.Rollback() // no-op after a successful Commit
if _, err = tx.ExecContext(ctx,
"INSERT INTO legacy_table (id, data) VALUES ($1, $2)", r.Key, r.Data); err != nil {
return err
}
// Same transaction: the intent to write the target is now durable.
if _, err = tx.ExecContext(ctx,
"INSERT INTO outbox (idempotency_key, payload) VALUES ($1, $2)", r.Key, r.Payload()); err != nil {
return err
}
return tx.Commit() // both rows persist together or neither does
}
Verify before proceeding: kill the process between the commit and any out-of-band target write in staging — the outbox worker must still deliver the row, proving there is no silent gap.
3. Drain the outbox idempotently into the target. The worker upserts by idempotency key so a retried delivery is harmless.
-- PostgreSQL target store · executed by the outbox worker · idempotent.
-- Context: ON CONFLICT makes re-delivery a no-op; safe to retry on any transport failure.
INSERT INTO target_table (id, new_column, source_key)
VALUES ($1, $2, $3)
ON CONFLICT (source_key) DO UPDATE
SET new_column = EXCLUDED.new_column; -- last delivery wins; repeats change nothing
4. Enable per tenant, then widen. Flip the flag for one tenant, confirm the outbox drain rate matches write throughput, then expand the rollout.
Verification & Observability
The two stores agree only if you measure it. Run a rolling checksum comparison over recently changed rows and alert when divergence persists.
-- PostgreSQL target · run hourly · read-only.
-- Context: compare this checksum set against the same window in the legacy store; drift > 0.01% alerts.
SELECT id,
md5(concat_ws('|', col_a::text, col_b::text, updated_at::text)) AS row_checksum
FROM target_table
WHERE updated_at >= now() - INTERVAL '1 hour'
ORDER BY id;
Watch the outbox backlog — it is the leading indicator of divergence. A growing unprocessed count means the worker cannot keep up and the target is falling behind live writes.
-- PostgreSQL legacy store · run continuously · read-only.
-- Context: a rising backlog means target delivery is lagging — page before it becomes drift.
SELECT count(*) AS undelivered,
now() - min(created_at) AS oldest_pending
FROM outbox
WHERE delivered_at IS NULL;
On the target, confirm no write is blocking — dual-write doubles the write rate, so check pg_stat_activity (PostgreSQL) or SHOW ENGINE INNODB STATUS (MySQL) for lock waits that did not exist before synchronization. The detailed repair workflow for any drift you find is in Reconciling Divergence Between Dual-Written Tables.
Rollback Path
Because the legacy store stayed authoritative, rolling back dual-write is a configuration change, not a data recovery. Disable the second write at the flag; the legacy path continues unaffected and the target simply stops receiving new rows.
# Context: instant rollback — legacy remains the source of truth, no data is destroyed.
./bin/flags set dual_write.target_store=off --tenant all
./bin/outbox drain --until-empty # let the worker finish in-flight deliveries, then idle
If the target accumulated bad rows from a buggy transform, repair the target from the authoritative legacy store rather than deleting blindly — re-run the reconciliation in repair mode so the target is overwritten with legacy truth for the affected keys. Only DELETE target rows when you can prove, by correlation ID, exactly which rows the bad write produced and that nothing reads them. Never roll back by dropping the legacy store while it is still authoritative.
Common Errors & Fixes
Target rows missing for writes that succeeded in legacy. Root cause: a non-transactional dual-write lost the second write to a crash or timeout. Fix: move to the transactional outbox so the intent to write the target is durable; the pattern and its proof are in Preventing Data Loss During Dual-Write Migrations.
Duplicate rows in the target after retries. Root cause: delivery is at-least-once but the write is not idempotent. Fix: add a unique constraint on the source key and ON CONFLICT DO UPDATE so a re-delivery upserts.
ERROR: duplicate key value violates unique constraint on the outbox. Root cause: the same idempotency key was enqueued twice for genuinely different payloads. Fix: scope the key to the mutation, not the entity, or include a version so each distinct change is its own outbox row.
Drift alert fires intermittently then clears. Root cause: reconciliation read the target mid-delivery, comparing a row before the worker applied it. Fix: reconcile over a window that lags the outbox drain latency, and only alert on drift sustained across consecutive cycles.
Target connection pool exhausted under dual-write. Root cause: the target pool was sized for read traffic, not the added write load. Fix: raise the target pool to roughly 1.5x the legacy pool and cap target write timeouts so a slow target fails fast to the retry queue instead of holding connections.
Child Page Index
This section covers the two failure surfaces that decide whether dual-write is safe. Preventing Data Loss During Dual-Write Migrations covers closing the silent-half-write gap with atomic writes and the transactional outbox, so a row never lands in one store without the other. Reconciling Divergence Between Dual-Written Tables covers detecting and repairing the drift that accumulates anyway — checksum strategies, conflict-resolution rules, and repair-mode reconciliation that overwrites the target from the authoritative store.
Frequently Asked Questions
Can I just write to both stores in the same request and skip the outbox? Only if both stores share one local database transaction, so they commit atomically. If the new store is separate — a different database, a search index, a service — two sequential writes have a gap where a crash leaves one store ahead of the other. The transactional outbox closes that gap by making the intent to write the second store durable inside the first store’s transaction.
Which store should be authoritative during dual-write? The legacy store, until reconciliation proves the new store matches it for long enough to trust. Keeping legacy authoritative means a synchronization bug is recoverable — you repair the target from legacy truth — and reads only move over after parity is demonstrated, not assumed.
How do I prevent duplicate rows when delivery retries?
Make every target write idempotent. Give each entity a stable idempotency key, put a unique constraint on it in the target, and write with ON CONFLICT DO UPDATE (PostgreSQL) or ON DUPLICATE KEY UPDATE (MySQL). Then at-least-once delivery is safe: a re-delivery upserts the same row instead of inserting a second copy.
When is it safe to stop dual-writing and remove the legacy path? After reads have cut over to the new store, reconciliation has shown drift below threshold for a sustained window, and no in-flight application version still writes or reads the legacy shape. Then disable the second write, drain the outbox to empty, and only afterward remove the legacy write code.