Enforcing Backward-Compatibility Checks in Pull Requests

The deploy that breaks production is rarely the one that fails — it is the one that succeeds. A migration that drops users.legacy_email applies in milliseconds and turns the pipeline green, but the previous application version still running on half the fleet issues SELECT legacy_email FROM users and starts throwing ERROR: column "legacy_email" does not exist on every request until the rollout finishes. Backward incompatibility is invisible to a dry-run because the SQL is valid; it is only unsafe in the context of code that is still in flight. The defense is a required pull-request check that diffs the proposed schema against the live database, classifies every change as additive-safe or potentially-breaking, and fails the merge on anything that removes or renames a column the running version may still read. This page builds that check, the allow-list that keeps it from blocking safe work, and the branch protection that makes it un-bypassable.

Every proposed change is classified against the live schema: additive DDL passes; a removal or rename blocks the merge until it is staged behind expand-and-contract.

Symptom / Error Signatures

The runtime symptom appears after a clean deploy, during the rollout window when old and new code run side by side. PostgreSQL logs fill with ERROR: column "legacy_email" does not exist or ERROR: relation "old_orders" does not exist; MySQL emits ERROR 1054 (42S22): Unknown column 'legacy_email' in 'field list'. Error rate spikes to roughly the fraction of the fleet still on the old image, then falls as the rollout completes — a tell-tale ramp-and-recover curve that points straight at a backward-incompatible schema change.

A subtler signature is the disguised rename. Across two migrations, one adds email_address and a later one drops email; each migration alone looks additive, but together they remove a column the old code reads. The PR diff for either migration in isolation looks safe, which is exactly why a per-migration eyeball review misses it and a schema-vs-live diff catches it.
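The disguised rename can be made concrete with a minimal sketch of the set difference that a schema-vs-live diff performs. It assumes each schema has already been flattened to one table.column name per line (for example from information_schema.columns); the function name removed_columns and the sample column lists are hypothetical, not part of any tool.

```shell
#!/usr/bin/env bash
set -eu

# removed_columns LIVE_FILE PROPOSED_FILE
# Prints each table.column that exists in LIVE_FILE but not in PROPOSED_FILE.
# Both files are assumed to hold one table.column name per line.
removed_columns() {
  live_sorted=$(mktemp); proposed_sorted=$(mktemp)
  LC_ALL=C sort -u "$1" > "$live_sorted"
  LC_ALL=C sort -u "$2" > "$proposed_sorted"
  # comm -23 keeps only the lines unique to the first (live) file.
  LC_ALL=C comm -23 "$live_sorted" "$proposed_sorted"
  rm -f "$live_sorted" "$proposed_sorted"
}

live=$(mktemp); proposed=$(mktemp)
# Live schema: what the previously deployed code still reads.
printf '%s\n' users.id users.email > "$live"
# Proposed schema after BOTH migrations: one added email_address, a later one dropped email.
printf '%s\n' users.id users.email_address > "$proposed"

removed_columns "$live" "$proposed"   # prints: users.email
rm -f "$live" "$proposed"
```

Neither migration appears anywhere in the comparison. Only the end states are compared, so the removal surfaces no matter how many files it was spread across.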

The CI symptom you want is the gate firing: a failed required check named something like backward-compat with output Breaking change: DROP COLUMN users.legacy_email — still read by deployed version. That red check on the PR is the system working.

Root Cause Analysis

Backward compatibility is a property of the transition, not of either schema alone. A migration is backward compatible when the previously deployed application version keeps working against the new schema. That makes “is this safe?” a question only answerable by diffing the proposed schema against what is live right now — not against main, not against the developer’s branch point.

The breaking operations are a small, well-defined set: dropping a column or table, renaming a column or table, narrowing a type or length, adding a NOT NULL column without a default, and tightening a constraint that existing rows might violate. Everything else is generally additive and safe to ship ahead of code. The asymmetry is the whole reason expand-and-contract exists: you add the new shape, migrate readers and writers, and only remove the old shape in a later deploy once no running version references it.

The two engines differ in how a few of these present, which the classifier must respect:

| Change | PostgreSQL | MySQL 8.0 | Backward-compatible? |
| --- | --- | --- | --- |
| `ADD COLUMN` nullable / with default | Metadata-only (PG 11+) | `ALGORITHM=INSTANT` (8.0.12+) | Yes: old code ignores it |
| `DROP COLUMN` | Marks column dropped | Rewrite or instant drop | No: old code may still read it |
| `RENAME COLUMN` | Catalog rename | Catalog rename | No: old name vanishes |
| `ALTER COLUMN ... SET NOT NULL` (no default) | Full table scan | Validation | No: old writers may omit it |
| `ADD CONSTRAINT ... NOT VALID` then `VALIDATE` | Two-step, non-blocking | n/a | Yes: when staged |

The classifier’s job is to map each diff entry onto this table and fail the build on any row marked “No” that is not explicitly waived. A gate that checks DROP COLUMN against the current branch instead of the live database will pass a change that is safe relative to main but breaking relative to what is actually deployed — which is the only thing that matters.
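As a rough illustration of that mapping, the sketch below labels a single DDL statement as additive, breaking, or unclassified. It is a text heuristic under stated assumptions, not a SQL parser: the function name classify_ddl and its patterns are hypothetical, and anything it cannot place (such as a type change, which may or may not narrow) comes back as unclassified so that a gate can treat it as a failure.

```shell
#!/usr/bin/env bash
set -eu

# classify_ddl "STATEMENT" -> prints additive | breaking | unclassified
classify_ddl() {
  # Uppercase and squeeze whitespace so the patterns below stay simple.
  stmt=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -s '[:space:]' ' ')
  case "$stmt" in
    # Removals and renames: the old name vanishes for code still in flight.
    *"DROP COLUMN"*|*"DROP TABLE"*|*"RENAME "*) echo breaking ;;
    # Tightening an existing column: old writers may still send NULL.
    *"SET NOT NULL"*)                           echo breaking ;;
    # A new column with a default is safe even when NOT NULL.
    *"ADD COLUMN"*" DEFAULT "*)                 echo additive ;;
    # A new NOT NULL column with no default breaks old INSERTs.
    *"ADD COLUMN"*"NOT NULL"*)                  echo breaking ;;
    *"ADD COLUMN"*)                             echo additive ;;
    *"ADD CONSTRAINT"*"NOT VALID"*)             echo additive ;;
    *"CREATE TABLE "*|*"CREATE INDEX "*)        echo additive ;;
    # Fail closed: unknown DDL is never assumed safe.
    *)                                          echo unclassified ;;
  esac
}

classify_ddl 'ALTER TABLE users ADD COLUMN nickname TEXT;'            # additive
classify_ddl 'ALTER TABLE users DROP COLUMN legacy_email;'            # breaking
classify_ddl 'ALTER TABLE users ADD COLUMN tier TEXT NOT NULL;'       # breaking
classify_ddl 'ALTER TABLE users ALTER COLUMN name TYPE VARCHAR(10);'  # unclassified
```

The breaking patterns come first so that a statement combining an addition with a removal is still reported as breaking.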

Immediate Mitigation

If a backward-incompatible change already merged and is causing the error ramp during a rollout, the fastest mitigation is to stop removing the old shape and let the expanded schema coexist:

Halt the rollout so you do not widen the fraction of the fleet that lost the column.
Re-add the dropped column as nullable if it was already dropped. Re-adding does not bring back the dropped values (the backfill step below handles that), but it destroys nothing further and immediately silences the "column does not exist" errors.

-- PostgreSQL · run as migration role · metadata-only, safe under load
-- Context: emergency re-expand; restores backward compatibility for the old image.
ALTER TABLE users ADD COLUMN IF NOT EXISTS legacy_email VARCHAR(320);

-- MySQL 8.0 · run as migration role · ALGORITHM=INSTANT avoids a rewrite
-- Context: emergency re-expand to restore the column the old code reads.
ALTER TABLE users ADD COLUMN legacy_email VARCHAR(320) NULL, ALGORITHM=INSTANT;

Backfill the restored column from its replacement so reads return real data, following idempotent batch limits.
Stand the gate up before the next attempt so the broken change cannot re-merge. Add the schema-vs-live diff as a required check (next section).

This is path disablement and re-expansion, not destructive reversal — the same principle behind writing safe down migrations for automated rollback.
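For the backfill step, one possible shape of an idempotent batch is sketched below as a small generator that prints the statement rather than running it. Everything specific here is an assumption for illustration: the helper name backfill_batch_sql, a primary key called id, email_address as the replacement column, and PostgreSQL syntax (MySQL does not accept LIMIT inside an IN subquery, so it would need a different form).

```shell
#!/usr/bin/env bash
set -eu

# backfill_batch_sql TABLE TARGET_COL SOURCE_COL BATCH_SIZE
# Prints one PostgreSQL batch. Only rows whose TARGET_COL is still NULL are
# touched, so re-running after an interruption repeats no work.
# Arguments are interpolated as-is: pass trusted identifiers only.
backfill_batch_sql() {
  cat <<SQL
UPDATE $1 SET $2 = $3
WHERE id IN (
  SELECT id FROM $1
  WHERE $2 IS NULL AND $3 IS NOT NULL
  LIMIT $4
);
SQL
}

# Hypothetical names: users.legacy_email restored from users.email_address.
backfill_batch_sql users legacy_email email_address 1000
```

In practice the printed statement would be run repeatedly, with a pause between batches, until the database reports zero rows updated.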

Permanent Fix / Long-Term Pattern

The permanent control is a required PR check that diffs the proposed schema against the live database and fails on any non-additive change not on the allow-list.

# .github/workflows/backward-compat.yml — required PR check
# Context: read-only role to production; never applies anything; blocks the merge.
on: { pull_request: { paths: [ "migrations/**", "prisma/schema.prisma" ] } }
jobs:
  backward-compat:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Diff proposed schema vs live database
        run: |
          set -euo pipefail
          # emit the SQL needed to go FROM live TO proposed
          npx prisma migrate diff \
            --from-url "$" \
            --to-schema-datamodel prisma/schema.prisma \
            --script > proposed.sql
          ./scripts/classify-backward-compat.sh proposed.sql

#!/usr/bin/env bash
# scripts/classify-backward-compat.sh: allow-list, not deny-list
# Context: pure text classification of generated DDL; no DB connection.
set -euo pipefail
plan="$1"
# Strip every line that is an explicitly-allowed additive statement…
residual=$(grep -E -iv \
  -e '^\s*(ALTER TABLE .* ADD COLUMN|CREATE TABLE|CREATE INDEX( CONCURRENTLY)?)' \
  -e '^\s*ALTER TABLE .* ADD CONSTRAINT .* NOT VALID' \
  -e '^\s*(--|$)' "$plan" || true)
# …then flag any ADD COLUMN that is NOT NULL with no DEFAULT: old writers that omit it fail.
unsafe_add=$(grep -E -i 'ADD COLUMN .* NOT NULL' "$plan" | grep -iv ' DEFAULT ' || true)
# …anything left is an unclassified or breaking change → fail.
if [ -n "$residual$unsafe_add" ]; then
  echo "Backward-incompatible or unclassified DDL:"
  printf '%s\n' "$residual" "$unsafe_add" | sed '/^$/d'
  exit 1
fi

The allow-list is the design decision that matters. A deny-list of breaking verbs goes stale the moment someone finds a new way to remove data; an allow-list passes only the additive statements your policy has explicitly sanctioned and fails everything else, including DDL you have not yet reviewed. When a genuinely safe removal must ship — the old column is provably unread because its prior deploy already retired it — stage it through expand-and-contract and record an explicit, reviewed waiver for that one migration rather than weakening the allow-list. Wire this as a required status check in branch protection so the merge is impossible while it is red, and pair it with the plan-time gate from blocking deploys on failed migration dry-runs. The full menu of gates this belongs to lives in the migration pipeline gating overview.
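One way to record a per-migration waiver without touching the allow-list is to keep a reviewed file of exact statements and subtract it from the classifier's residual. The sketch below assumes that convention; the helper name unwaived, the one-statement-per-line waiver file, and the orders.coupon column are all hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# unwaived RESIDUAL_FILE WAIVER_FILE
# Prints residual statements that have no exact-match line in the waiver file.
unwaived() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from a file.
  # grep exits 1 when it prints nothing; treat that as success, keep real errors.
  grep -vxFf "$2" "$1" || [ "$?" -eq 1 ]
}

residual=$(mktemp); waivers=$(mktemp)
printf '%s\n' \
  'ALTER TABLE users DROP COLUMN legacy_email;' \
  'ALTER TABLE orders DROP COLUMN coupon;' > "$residual"
# Only the reviewed removal is waived, statement for statement.
printf '%s\n' 'ALTER TABLE users DROP COLUMN legacy_email;' > "$waivers"

unwaived "$residual" "$waivers"   # prints: ALTER TABLE orders DROP COLUMN coupon;
rm -f "$residual" "$waivers"
```

Because the match is on the whole line, a waiver covers exactly one statement. A similar-looking drop on another table still fails the check, and the waiver file itself shows up in code review.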

Verification Checklist

The check diffs against the live production database, not against main or the branch point.
Classification uses an additive allow-list; unclassified DDL fails closed.
DROP, RENAME, narrowing ALTER COLUMN, and NOT NULL-without-default are all flagged.
Disguised renames across two migrations are caught because the diff is schema-vs-live, not per-migration.
The check runs only when migration or schema files change, and is marked required in branch protection.
A sanctioned removal requires an explicit per-migration waiver rather than a broad allow-list edit.

Frequently Asked Questions

Why diff against the live database instead of the main branch? Because backward compatibility is about the code currently deployed, and the deployed schema is what the live database holds — which can differ from main after a hotfix or an in-flight rollout. Diffing against the live schema catches drops that are safe relative to the repository but breaking relative to what is actually running.

How does the gate catch a rename split across two migrations? It does not inspect migrations one at a time. It compares the full proposed schema to the full live schema, so a column that exists live but is absent in the proposal registers as a removal regardless of how many migration files conspired to produce it.

Is adding a NOT NULL column always blocked? A NOT NULL column with no default is blocked because old writers that omit it will fail. A NOT NULL column with a safe default (or added nullable, backfilled, then constrained in a later deploy) is allowed, because every old write still succeeds. The classifier distinguishes the two by the presence of a default.