Ensuring Environment Parity Between Dev and Prod Databases
A migration that ran clean on a laptop fails the moment it touches production, and the build log reads Migration checksum mismatch for migration 003. The same script, the same tool, a different outcome — because the development database and the production database have silently diverged. Drift between these environments is the single most common reason a green pipeline still produces a 3 a.m. rollback: the migration tool computes the wrong delta because the state table it trusts no longer describes the physical schema in front of it. This page is the runbook for detecting that divergence, mitigating it without losing data, and closing the gap permanently. It assumes you already understand the broader environment parity strategies this section covers and the migration fundamentals that define what a correct migration is.
Symptom / Error Signatures
The failure surfaces during validation, at application startup, or as lock contention on the history table:
FlywayValidateException: Validate failed: Migration checksum mismatch for migration 003Liquibase Validation Failed: 1 change sets have validation failuresActiveRecord::PendingMigrationError: Migrations are pending; run 'rails db:migrate'- PostgreSQL
ERROR: relation "users_v2" does not existor MySQLERROR 1062 (23000): Duplicate entryduring application startup - Prolonged
ACCESS EXCLUSIVElock onschema_migrations,flyway_schema_history, ordatabasechangelogduring concurrent deploys
Confirm the divergence with read-only queries before touching anything:
-- PostgreSQL · read-only · safe to run against production with a SELECT-only role
-- Identify sessions blocked on migration-state locks
SELECT pid, usename, state, wait_event_type, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
AND query ~* '(ALTER|CREATE|DROP|INSERT INTO.*(schema_migrations|flyway_schema_history|databasechangelog))';
-- Inspect the recorded migration history (read-only)
SELECT * FROM flyway_schema_history ORDER BY installed_rank DESC LIMIT 10;
-- Compare physical schema against the expected baseline (read-only)
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'users_v2';
Root Cause Analysis
Drift originates from one of two places: a manual hotfix applied directly to production that bypasses the version control pipeline, or a non-transactional DDL statement that partially applied when a CI/CD run was interrupted. In both cases the recorded migration history and the physical schema describe two different databases. The migration tool computes its next action by diffing the recorded state against the migration files — so when that record is stale, it either re-applies a change that already exists (relation already exists) or skips one the physical schema is missing (relation does not exist). This misalignment breaks the additive, backward-compatible contract that zero-downtime deploys rely on, because the application image expects column and constraint states the database does not actually hold. The deeper mechanism — why a half-applied MySQL ALTER cannot self-heal — is covered in handling non-transactional DDL in MySQL migrations.
Immediate Mitigation
Apply every corrective action on a staging replica or read-only clone first. Never edit a production state table without a verified snapshot.
- Halt the pipeline. Stop CI/CD runners so no concurrent migration races the one you are reconciling.
- Generate the delta as SQL, do not apply it. Render what the tool would run so you can read it before it executes.
- Flyway:
flyway migrate -dryRunOutput=/tmp/dryrun.sql - Liquibase:
liquibase updateSQL
- Flyway:
- Reconcile state against physical schema. If the history table is stale but the physical schema is correct, insert the missing record with the exact expected checksum. If the physical schema is missing objects but the record claims success, generate a targeted patch from
pg_dump --schema-only(ormysqldump --no-data), diff it against the baseline, and apply only the missing DDL.
When a partial application must be reversed, do it explicitly inside a transaction (PostgreSQL DDL is transactional; this is exactly what protects you here):
-- PostgreSQL · run as migration role on the affected database · wrap in a txn so a typo aborts cleanly
BEGIN;
-- Revert the partially applied DDL
DROP TABLE IF EXISTS users_v2;
ALTER TABLE users DROP COLUMN IF EXISTS temp_migration_flag;
-- Remove the bogus state record ONLY if the physical change was also reverted above
DELETE FROM flyway_schema_history WHERE version = '003' AND success = TRUE;
COMMIT;
Verify application connectivity, connection-pool saturation, and error rates before resuming the pipeline.
Permanent Fix / Long-Term Pattern
The durable fix removes the two paths that let drift in. Close the manual-change path by disabling direct console access for schema modifications through IAM/RBAC, so every DDL change must pass through the migration tool. Close the partial-application path by adopting idempotent script design — CREATE TABLE IF NOT EXISTS, guarded ADD COLUMN, deterministic constraint names — so a re-run after an interruption converges instead of failing. Then make divergence visible: run a schema-validation check on every pull request and a nightly comparison against a production read replica, and wire a checksum gate that blocks any merge where local migration state diverges from the tracked baseline. The same discipline that aligns schema should align configuration and seed data, which is the broader subject of the environment parity strategies section and the seeding anonymized production data into staging workflow.
Verification Checklist
flyway validate/liquibase validatereports zero checksum mismatches after reconciliationflyway migrate -dryRunOutput/liquibase updateSQL) produces zero unexpected DDL against a staging replicapg_stat_activity(orSHOW PROCESSLISTon MySQL) shows no lingeringALTER/CREATElocks
Frequently Asked Questions
Should I edit flyway_schema_history directly to fix a checksum mismatch?
Only as a last resort, on a database you have snapshotted first, and only when you have confirmed the physical schema is correct and it is the record that is wrong. Prefer flyway repair, which recomputes checksums for failed entries without hand-editing rows. Manual UPDATE/DELETE against the history table is the fastest way to turn a recoverable drift into an unrecoverable one.
How do I stop a hand-applied hotfix from reintroducing drift next week?
Make it impossible to apply DDL outside the pipeline. Revoke schema-modification privileges from interactive roles via IAM/RBAC, and add a required schema-drift gate (migrate diff against the live database with a non-zero exit on difference) so any out-of-band change fails the next build until it is committed to version control.
Does environment parity mean dev and prod must hold the same data? No — it means the same schema, configuration, and migration state. Data volume and content differ, but the structural contract the migration tool reasons about must match. For realistic testing without copying real customer data, seed staging from an anonymized production extract rather than synthetic fixtures.