Writing Safe Down Migrations for Automated Rollback

The default down migration your tool generates is a loaded gun pointed at your data. When the up was ADD COLUMN region_code, the scaffolded down is DROP COLUMN region_code — and if a pipeline runs that down automatically on a failed deploy, it deletes every value backfilled into that column, irreversibly, at the worst possible moment. The question is not “how do I write a down that exactly reverses the up,” it is “which down migrations are safe to let a machine run unattended?” The answer is narrow: a down is automatable only when it is non-destructive, idempotent, and reverses an additive change that nothing yet depends on. For everything else — anything that would drop data, or that ran in an expand-and-contract sequence — the down must be a deliberate no-op, and rollback is handled by restoring the application image instead. This page draws that line and shows how to write the down migrations that fall on the safe side of it.

The test is destruction, not symmetry: a non-destructive additive reversal is automatable; anything that loses data becomes a no-op with rollback handled by the application image.

Symptom / Error Signatures

The disaster signature is data loss after a rollback that “worked.” The pipeline reports the down migration succeeded — exit code 0, no error — yet a backfilled column is gone and there is no error message, because dropping a populated column is perfectly legal SQL. The first sign is downstream: reports return nulls, a reconciliation job flags missing values, support tickets about vanished data. The down did exactly what it said, which was the problem.

The second signature is a down that is not idempotent failing on a retry. The pipeline runs the down, it partially applies, the pipeline retries, and the second run errors with PostgreSQL ERROR: column "region_code" does not exist or MySQL ERROR 1091 (42000): Can't DROP 'region_code'; check that column/key exists — because the first attempt already removed it and the statement has no IF EXISTS guard. The rollback itself becomes stuck.

The third is the expand-and-contract trap: an automated down tries to reverse the contract step (which dropped the old column) by re-adding an empty column, silently losing the data that lived there. The reversal restored the structure and discarded the contents.

Root Cause Analysis

The generated down assumes rollback means “make the schema identical to before.” That assumption is wrong the moment any data has flowed through the new schema. The correct model is asymmetric: applying a change forward and reversing it are not mirror operations, because forward motion can create data that reverse motion would have to destroy.

This splits down migrations into two categories with opposite handling:

`up` operation	Safe automated `down`?	Correct `down`
`CREATE INDEX` (just built)	Yes	`DROP INDEX IF EXISTS` — index is derived, no data lost
`ADD COLUMN`, never written to	Yes	`DROP COLUMN IF EXISTS` — only if guaranteed unpopulated
`ADD COLUMN` + backfill	No	No-op; restore previous image
`DROP COLUMN` (contract step)	No	No-op; the data is already gone, re-adding is a lie
Data `UPDATE` / transform	No	No-op; reverse transform may be lossy or impossible

The safe rows share a property: the object being removed is derived or empty, so removing it loses nothing. An index can always be rebuilt from the table. A column that was added and never written to holds no data. Everything else carries information, and a down that removes it is destructive.

That is why, in an expand-and-contract sequence, the down for the contract migration must be a no-op. The contract already dropped the old column; the data is gone; there is nothing for an automated down to honestly restore. Rollback at that point is a code concern — restore the previous application image, which the auto-reverting migrations on health-check failure gate already does — not a schema concern. And every safe down must be idempotent, following the same IF EXISTS discipline as how to write idempotent SQL scripts for safe deploys, because a retrying pipeline will run it more than once.

Immediate Mitigation

If an automated down has just dropped a populated column, treat it as a data-loss incident:

Stop the pipeline so no further down steps run and compound the loss.
Recover the data from a backup or replica that predates the down, restoring just the affected column rather than the whole database where possible.

-- PostgreSQL · run as migration role · re-add the column, then restore values
-- Context: emergency recovery; re-expand non-destructively before backfilling.
ALTER TABLE orders ADD COLUMN IF NOT EXISTS region_code VARCHAR(8);
-- Backfill from a point-in-time copy / replica restored alongside.
UPDATE orders o SET region_code = b.region_code
  FROM orders_backup b WHERE b.id = o.id AND o.region_code IS NULL;

Neutralize the dangerous down so it cannot run again. Replace its body with an explicit no-op and a comment explaining why.

-- PostgreSQL/MySQL · the down for a backfilled column must NOT drop it
-- Context: rollback is handled by restoring the previous application image.
-- DOWN: intentional no-op — region_code holds backfilled data; do not DROP.
SELECT 1;

Verify the restored data matches the pre-incident counts before resuming any deploys.

Permanent Fix / Long-Term Pattern

The durable pattern is a down policy enforced both by convention and by a gate: only non-destructive, idempotent down migrations may exist, and any down reversing a data-bearing change is an explicit no-op.

-- PostgreSQL · safe automatable down · idempotent · reverses only a derived object
-- Context: dropping an index loses no data and is safe to re-run.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_region;

-- MySQL 8.0 · safe automatable down · guard with IF EXISTS for re-runnability
-- Context: index is derived; drop is non-destructive and idempotent.
ALTER TABLE orders DROP INDEX idx_orders_region;   -- preceded by an existence check

Pair the convention with the [assert-safe-down] gate already wired into the pipeline: a check that scans every down for DROP COLUMN, DROP TABLE, TRUNCATE, and DELETE, and fails the build unless the statement is explicitly waived as operating on a provably-empty or derived object.

# scripts/assert-safe-down.sh — block destructive down migrations
# Context: pure text scan of down files; no DB connection; runs as a required check.
set -euo pipefail
for f in migrations/*.down.sql; do
  if grep -E -i '\b(DROP\s+COLUMN|DROP\s+TABLE|TRUNCATE|DELETE)\b' "$f" \
     | grep -vq '-- safe-down-waiver'; then
    echo "Destructive down without waiver: $f" >&2; exit 1
  fi
done

The guiding rule: a down migration earns automation only by being non-destructive and idempotent. When the forward change created or transformed data, the honest down is SELECT 1; and rollback belongs to the application image. This keeps the rollback automation contract intact — automated reversal disables a path or restores code, but never destroys data — and it composes with online, non-locking forward changes like those in implementing expand-contract for high-traffic tables.

Verification Checklist

Every automatable down is non-destructive — it removes only derived or provably-empty objects.
Every down is idempotent with IF EXISTS guards so a pipeline retry cannot break it.
The down for any backfilled column or contract step is an explicit SELECT 1; no-op.
A required gate scans down files for DROP COLUMN, DROP TABLE, TRUNCATE, DELETE.
Any destructive down carries an explicit, reviewed waiver justifying it as non-lossy.
Rollback for data-bearing changes is documented as image restore, not schema reversal.

Frequently Asked Questions

Should a down migration always exactly reverse the up? No. Exact reversal is only safe when the up created nothing that holds data — an index, or a column never written to. Once data has flowed through the new schema, an exact reversal would destroy it, so the correct down becomes a no-op and rollback is handled by restoring the application image.

When should a down migration be a no-op? Whenever reversing the up would lose data: after an ADD COLUMN that was backfilled, after any DROP performed in a contract step, and after any data transform. In all of these the honest down is SELECT 1;, because there is nothing the schema reversal can truthfully restore.

How do I stop the tool’s generated DROP COLUMN down from being automated? Replace the generated body with an explicit no-op and a comment, and add a required CI check that fails the build on DROP COLUMN, DROP TABLE, TRUNCATE, or DELETE in any down file unless it carries a reviewed waiver. That makes the dangerous default impossible to ship by accident.