Seeding Anonymized Production Data Into Staging
A migration sailed through staging in four seconds and then ran for forty minutes in production, holding a lock long enough to trigger an incident. The cause was not the SQL — it was the data. Staging held ten thousand synthetic rows with a uniform distribution; production held ninety million rows with a heavy tail of NULLs, a skewed foreign-key fan-out, and a handful of pathological values your generator never produced. The decision that lands engineers here is “how do we make staging look enough like production to catch this, without copying customers’ personal data into a less-secured environment?” The answer is a sanitized snapshot: a production dump with personally identifiable information masked in place, referential integrity preserved, and a refresh cadence that keeps it representative. This page walks the build, the masking rules, and the integrity checks.
Symptom / Error Signatures
The gap between staging data and production data shows up as migrations that behave completely differently across the two environments:
- A migration that is instant in staging holds a long lock in production: the staging table was small enough that
ALTER TABLEfinished before anyone noticed. - A backfill that completes in seconds in staging breaches replica lag in production, because the row count differs by four orders of magnitude.
- A
NOT NULLorUNIQUEconstraint that adds cleanly in staging fails in production withERROR: column "email" contains null valuesorERROR 1062 ... Duplicate entry— the violating rows existed only in real data. - Query plans diverge: the optimizer picks an index scan in staging and a sequential scan in production because table statistics and value distributions differ.
- Compliance flags a copy of raw customer records sitting in a staging environment that lacks production’s access controls.
Making staging volumetrically and distributionally faithful is the core of Environment Parity Strategies; the broader configuration parity (engine version, extensions, settings) is covered in Ensuring Environment Parity Between Dev and Prod Databases.
Root Cause Analysis
Migrations interact with three properties of data that synthetic seeds almost never reproduce: volume, distribution, and dirtiness. Volume determines lock duration and backfill runtime — an ALTER TABLE that rewrites the table scales with row count, so a thousand-row staging table tells you nothing about a hundred-million-row production table. Distribution determines query plans and constraint behavior — real foreign keys fan out unevenly, real text columns have NULLs and empty strings, real timestamps cluster around business hours. Dirtiness is the accumulated residue of years of bugs: orphaned rows, duplicate “unique” values, encoding quirks. A constraint-adding migration only fails on dirty data, and synthetic generators produce clean data by construction.
So the only reliable test fixture is production data itself — which immediately collides with the requirement that personal data must not leak into a lower environment. The resolution is to transform the data on the way out: dump production, mask every column that carries personal information, and load the masked result into staging. The hard part is masking without breaking referential integrity. If users.id = 42 is masked but orders.user_id = 42 is not, the join breaks; if you randomize a primary key, every foreign key pointing at it dangles. The discipline is to mask values (names, emails, phone numbers) while preserving keys and shapes (the same id space, the same null pattern, the same cardinality).
| Property to preserve | Why a migration needs it | Masking must keep |
|---|---|---|
| Row volume | Lock duration, backfill runtime | Same row count per table (or a deterministic sample) |
| Key relationships | Joins, foreign-key validation | Identical key values across parent/child |
| Value distribution | Query plans, NULL/UNIQUE behavior | Same NULL ratio and cardinality after masking |
| Data dirtiness | Constraint-add failures | Orphans and edge cases left intact (do not “clean”) |
| PII confidentiality | Compliance | Real names, emails, etc. replaced with fake but valid-shaped values |
A common mistake is to “fix” dirty data during sanitization — deduplicating, backfilling NULLs — which destroys exactly the rows a constraint migration must be tested against.
Immediate Mitigation
If a migration just behaved differently in production than in staging and you need a faithful fixture now, build a one-off sanitized snapshot before the next attempt.
- Dump production with a read-only role, excluding nothing structural so the shape is intact.
# Shell · run with a READ-ONLY replica role · read load only, point at a replica not the primary
# Dump schema + data from a follower to avoid adding load to the primary.
pg_dump --no-owner --format=custom --file=prod.dump "$PROD_REPLICA_READONLY_URL"
- Restore into an isolated scratch database, then mask in place with deterministic, shape-preserving updates. Mask values, never keys.
-- PostgreSQL · run in the SCRATCH db only · never against production
-- Replaces PII values; keeps id/user_id keys and the NULL pattern untouched.
UPDATE users
SET full_name = 'User ' || id,
email = 'user' || id || '@example.invalid',
phone = CASE WHEN phone IS NULL THEN NULL ELSE '+10000000' || lpad(id::text, 4, '0') END;
- On MySQL, apply the equivalent in the scratch instance. Keep the same NULL handling so distribution survives.
-- MySQL 8.0 · run in the SCRATCH schema only · never against production
-- Deterministic masking keyed on id preserves joins and the NULL ratio.
UPDATE users
SET full_name = CONCAT('User ', id),
email = CONCAT('user', id, '@example.invalid'),
phone = IF(phone IS NULL, NULL, CONCAT('+10000000', LPAD(id, 4, '0')));
- Verify referential integrity survived masking, then load the masked dump into staging. The throttling you will reuse for the load mirrors Optimizing Backfill Scripts for Zero-Downtime Deploys.
Permanent Fix / Long-Term Pattern
Make the sanitized snapshot a scheduled artifact, not a heroic one-off. The goal of Environment Parity Strategies is that any engineer can test a migration against production-shaped data on demand, and that requires three standing pieces.
A declarative masking policy. Maintain a per-column rule set — which columns are PII, how each is masked — as code, reviewed like any schema change. New columns default to “masked unless explicitly marked safe,” so a forgotten field never leaks. Keep the masking deterministic (keyed on the primary key) so the same row masks to the same value across refreshes, which keeps the fixture stable for repeatable tests.
A refresh cadence tied to drift, not the calendar. Refresh often enough that staging’s volume and distribution track production. A weekly rebuild is typical; refresh sooner after a large data event (a bulk import, a new high-cardinality feature) that would otherwise leave staging unrepresentative.
An integrity gate after every build. A snapshot is only useful if its keys still line up. Run a foreign-key audit on the masked result before publishing it to staging.
-- PostgreSQL · read-only, run against the masked snapshot before publishing
-- Finds orphaned child rows that masking or sampling may have introduced.
SELECT count(*) AS orphaned_orders
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;
If you sample rather than copy whole tables to control size, sample consistently across related tables (the same id set in parent and child) so the audit above returns zero. The configuration side of parity — matching engine version, extensions, and settings between staging and production — is detailed in Ensuring Environment Parity Between Dev and Prod Databases.
Verification Checklist
Frequently Asked Questions
Why not just generate synthetic data instead of copying production?
Synthetic data is clean by construction, and clean data hides the exact failures migrations cause in production: NULLs that block a NOT NULL add, duplicates that block a UNIQUE constraint, and volume that turns an instant ALTER into a long lock. Generators rarely reproduce production’s distribution and never its accumulated dirtiness, so they give false confidence.
How do I mask PII without breaking foreign keys?
Mask values, never keys. Replace names, emails, and phone numbers with fake but valid-shaped values derived deterministically from the primary key, and leave every id and *_id column untouched. Because the keys are unchanged, all joins and foreign-key constraints still resolve, and the deterministic derivation keeps the fixture stable across refreshes.
Should I clean up dirty data while sanitizing? No. Orphaned rows, duplicates, and NULLs are the exact conditions that make a constraint-adding migration fail, and they are why you want real data in the first place. Cleaning them during sanitization removes the test value of the snapshot. Mask the personal data, preserve the dirt, and let the migration test catch it.