Testing Migrations Against Production-Like Snapshots

The migration passed in CI and failed in production with ERROR: could not create unique index "idx_users_email" — Key (email)=(a@b.com) is duplicated. It passed in CI because the CI database was empty: a CREATE UNIQUE INDEX on a table with no rows can never find a duplicate, and a SET NOT NULL on a column with no rows can never find a null. Data-dependent failures are invisible against an empty schema and certain against real data. The same blind spot hides slow rewrites: an ALTER TABLE that completes instantly on ten rows can hold a lock for fifteen minutes on forty million. The fix is to stop testing migrations against CREATE DATABASE; migrate up and start testing them against a sanitized restore of production — a snapshot with realistic row counts, real value distributions, and the actual duplicates and nulls that will block your constraints. This page covers restoring that snapshot into an ephemeral CI database, applying the migration against it, and asserting the outcomes an empty schema can never reveal.

Realistic data is the whole point: duplicates, nulls, and row counts that an empty CI schema can never reproduce are exactly what break constraints and lengthen rewrites.

Symptom / Error Signatures

The hallmark is a migration that is green in CI and red in production. The production logs name the data that broke it. A unique index over existing duplicates: PostgreSQL ERROR: could not create unique index "idx_users_email" DETAIL: Key (email)=(...) is duplicated; MySQL ERROR 1062 (23000): Duplicate entry 'a@b.com' for key 'idx_users_email'. A NOT NULL over existing nulls: PostgreSQL ERROR: column "country" contains null values; MySQL ERROR 1138 (22004): Invalid use of NULL value. A foreign key over orphan rows: ERROR: insert or update on table "orders" violates foreign key constraint.

The performance signature is quieter and more dangerous: the migration succeeds but takes far longer than the CI run implied. A CREATE INDEX that returned in 40 ms against an empty CI table runs for twelve minutes against the real row count, and a deploy timeout or lock_timeout fires partway through. The CI duration told you nothing because there was nothing to scan.

If your CI consistently passes migrations that production rejects on data, the test environment is the suspect — it is almost certainly running against a freshly created, empty schema.

Root Cause Analysis

An empty schema can only test syntactic and structural validity: does the SQL parse, do referenced objects exist, is the catalog change legal. It is structurally incapable of testing anything that depends on the contents of the table, because there are no contents. Three whole classes of migration failure are invisible without real data:

Constraint violations — uniqueness, NOT NULL, check constraints, and foreign keys all validate against existing rows at creation time. Zero rows means zero violations, every time.
Rewrite duration and lock hold time — the cost of an ALTER TABLE that rewrites the table, or a CREATE INDEX that scans it, is proportional to row count. Empty tables make every rewrite look instantaneous, hiding the lock-hold problem that catching table-lock regressions in migration tests targets directly.
Data-shape surprises — a backfill UPDATE that assumes every JSON column is well-formed, or a type narrowing that assumes every value fits the smaller type, only fails on the row that breaks the assumption — a row that only exists in real data.

The environment must therefore mirror production’s data, not just its schema. That is precisely the environment parity discipline applied to CI: the test database needs production’s row counts and value distributions. It must not carry production’s secrets, so the snapshot is sanitized first — anonymizing personal data while preserving the structural properties (cardinality, null density, duplicate patterns) that make the test meaningful. The full anonymization workflow is covered in seeding anonymized production data into staging; for migration testing the rule is narrow: scramble the values, keep the shape.

Immediate Mitigation

When a constraint-violation failure is blocking a deploy right now, the immediate move is to reproduce it locally against real data and clean the data before re-attempting:

Restore the latest sanitized snapshot into a throwaway database and apply the failing migration to confirm the exact violating rows.

# CI / local shell · ephemeral DB · destroyed after the run · no prod writes
# Context: restore a sanitized dump into a fresh database for testing only.
createdb ci_migration_test
pg_restore --no-owner --dbname=ci_migration_test snapshot_sanitized.dump
psql -d ci_migration_test -v ON_ERROR_STOP=1 -f migrations/0042_add_unique_email.sql

Find the offending rows so you know the size of the data cleanup before touching production.

-- PostgreSQL · read-only · run against the snapshot, not production
-- Context: enumerate the duplicates a unique index would reject.
SELECT email, count(*) FROM users GROUP BY email HAVING count(*) > 1;

Stage a data-cleanup migration first — deduplicate or backfill the nulls in an idempotent step — and only then create the constraint. Splitting cleanup from constraint creation keeps each step re-runnable.
Add the snapshot test to the gate so the same failure cannot reach production again (next section).

Permanent Fix / Long-Term Pattern

The durable fix is a required CI stage that restores a sanitized snapshot, applies the migration, and asserts both correctness and timing.

# .gitlab-ci.yml — migration test against a production-like snapshot
# Context: ephemeral service DB; sanitized restore; never connects to production.
migration_snapshot_test:
  stage: test
  services: [ "postgres:16" ]
  rules:
    - changes: [ "migrations/**/*" ]
  script:
    - set -euo pipefail
    - createdb -h postgres ci_db
    # Restore a nightly-built, anonymized snapshot (no PII, real distributions)
    - pg_restore -h postgres --no-owner -d ci_db "$SANITIZED_SNAPSHOT"
    # Apply the migration and fail on any constraint/data error
    - psql -h postgres -d ci_db -v ON_ERROR_STOP=1 -f "$(ls migrations/*.sql | tail -1)"
    # Assert the apply finished within the deploy budget
    - ./scripts/assert-migration-duration.sh --max-seconds 120
  allow_failure: false

Keep the snapshot fresh and the test fast. A nightly job builds the sanitized dump so the test always sees recent data shapes; for very large tables, a representative sample that preserves cardinality and null density keeps CI quick without losing the failure modes that matter. Run the apply with ON_ERROR_STOP=1 (PostgreSQL) or --abort-on-error (MySQL client) so the first data-dependent error fails the job rather than scrolling past. Measure the apply duration and fail when it exceeds the deploy budget — a long rewrite that succeeds in CI is still a production incident waiting to happen, which is why this test pairs naturally with catching table-lock regressions in migration tests. For where this stage sits among the other safety checks, see the automated migration testing overview, and combine it with the plan-time gate in blocking deploys on failed migration dry-runs so both the plan and its effect on real data are checked.

Verification Checklist

CI applies the migration against a restored snapshot, not a freshly created empty schema.
The snapshot is sanitized of personal data but preserves row counts, null density, and duplicate patterns.
The apply runs with ON_ERROR_STOP=1 / --abort-on-error so the first data error fails the job.
Apply duration is measured and the job fails when it exceeds the deploy budget.
The snapshot is rebuilt on a schedule so the test sees current data shapes.
The ephemeral test database is destroyed after the run and never reused across builds.

Frequently Asked Questions

Why not just test against staging instead of a snapshot in CI? Staging is shared and stateful, so concurrent tests interfere and a failed migration leaves it dirty for the next run. An ephemeral CI database restored from a snapshot is isolated, reproducible, and disposable — each build gets a clean copy of realistic data and destroys it afterward, which is exactly what a deterministic gate needs.

How do I keep the snapshot small enough for fast CI? Sample large tables while preserving the statistical properties that cause failures: keep the duplicates, the nulls, and the value cardinality, but reduce the raw row count. A few hundred thousand representative rows reproduces most constraint and rewrite issues far faster than a full multi-terabyte restore.

Doesn’t restoring a production snapshot risk leaking PII into CI? Only if you restore it unsanitized. The snapshot must be anonymized before it ever reaches CI — scramble personal values while preserving shape — so the test sees realistic distributions without real secrets. The sanitization step is mandatory, not optional.