Rollback Automation

Rollback automation is the verify-phase safety net that reverses a deploy the moment it proves unhealthy — without losing data and without a human in the loop at 3 a.m. Its central insight is that rolling back code and rolling back schema are different operations with different risk profiles. Restoring the previous application image is fast, safe, and reversible; running a destructive down migration that drops a column is slow, dangerous, and often irreversible. Automated rollback leans entirely on the first and forbids the second. This page serves the platform engineers who wire the health-check-to-rollback loop, the DBAs who must guarantee no automated path can destroy data, and the on-call engineers who need a reversal they can trust to fire without supervision.

The contract that makes this safe is forward-only, additive deploys: a migration adds structure the old code ignores and the new code uses, so reverting to the old image leaves the schema harmlessly expanded. This page sits under CI/CD & Migration Automation, and it is the contingency for everything that Migration Pipeline Gating and Automated Migration Testing tried to prevent — because even a gated, tested migration can meet production traffic it never saw in CI. Rollback automation is the deploy-time application of the Expand and Contract Methodology: expand always, contract never automatically.

[Figure: Reverse the Code, Keep the Schema. An additive, forward-only deploy enters a health-check window. Pass: keep the new image. Fail: restore the previous image and flip flags. The expanded schema stays in place and nothing is dropped.]
The automated path reverses only the application image and feature flags; the additive schema is left in place so the reversal can never lose data.

Concept & Mechanism

Automated rollback works by separating two things that an unsafe pipeline conflates: the application version and the schema version. The application version is trivially reversible — container orchestrators keep the previous image and can re-point traffic to it in seconds. The schema version is not trivially reversible, because reversing it usually means destroying the structure (and data) the migration created. Rollback automation exploits this asymmetry: it reverses only the application version and leaves the schema alone.

This is only possible because deploys are forward-only and additive. An additive migration adds a column, table, or index that the previous application code does not reference. When the new code misbehaves and the pipeline restores the previous image, that old code runs perfectly against the expanded schema — it simply ignores the new column. No down migration runs; nothing is dropped; no data is lost. The migration stays applied and is cleaned up later, deliberately, by a separate contract-phase change once the new code is proven stable. This is the deploy-side encoding of the Expand and Contract Methodology.

The trigger is a health check. After the new image rolls out, the pipeline holds it in an observation window and samples error rate, latency percentiles, and any business-critical signal. If the window stays green, the deploy is promoted; if a threshold breaks, the rollback job fires automatically. The health check is the same gate the CI/CD & Migration Automation pipeline ends on — the difference is that here a red result acts rather than merely reports.
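The observation window can be sketched as a small pure function. This is a minimal illustration, not the pipeline's actual implementation: the `Sample`, `Thresholds`, and `evaluate_window` names are hypothetical, the default limits simply mirror the 1% error rate and 250 ms p95 values used in the procedure on this page, and treating an empty window as a failure is an assumption made here so that a broken metrics feed cannot promote a deploy by default.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """One health reading taken during the observation window (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, e.g. 0.004 for 0.4%
    p95_latency_ms: float  # 95th-percentile latency in milliseconds


@dataclass(frozen=True)
class Thresholds:
    """Limits for the window; defaults mirror the values used on this page."""
    error_rate_max: float = 0.01       # 1%
    p95_latency_max_ms: float = 250.0  # 250 ms


def evaluate_window(samples: Iterable[Sample], limits: Thresholds) -> str:
    """Return "rollback" on the first breached sample, otherwise "promote".

    Assumption: an empty window also returns "rollback", because missing
    data is treated as unhealthy rather than as a pass.
    """
    seen_any = False
    for sample in samples:
        seen_any = True
        if sample.error_rate > limits.error_rate_max:
            return "rollback"
        if sample.p95_latency_ms > limits.p95_latency_max_ms:
            return "rollback"
    return "promote" if seen_any else "rollback"
```

Keeping the decision free of I/O makes it easy to replay recorded samples from a past incident and check that the thresholds would have fired.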

Flag flips are the finer-grained sibling of image rollback. When a schema change is coupled to a feature flag, the fastest reversal is flipping the flag off, which disables the new code path without redeploying anything. This is instantaneous and is the reversal of choice when the new behavior is gated behind a toggle, as the Feature Flag Rollouts cluster describes. Image restore and flag flip are complementary: flip the flag to stop the bleeding in milliseconds, then restore the image if the flag flip alone does not recover health.

The one thing automated rollback never does is run a destructive down migration. A down that executes DROP COLUMN or TRUNCATE can lose data the forward migration’s backfill produced, and once an automated job has dropped a column there is nothing to restore to. Destructive reversals are manual, reviewed operations. When a down migration is genuinely needed in automation, it must itself be additive-only or idempotent and non-destructive, the discipline the Idempotent Script Design cluster enforces.
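One way to keep destructive statements out of the automated path is a guard that scans a migration before it is wired in. The sketch below is an assumption-heavy illustration: the function names are hypothetical, the statement splitter is deliberately naive (it does not understand string literals or dollar-quoted bodies), and the pattern errs on the side of flagging too much, so something like `ALTER COLUMN ... DROP NOT NULL` is also reported and needs a human to clear it.

```python
import re

# Statement shapes treated as destructive in this sketch (an assumption;
# adjust the list to match your own policy).
_DESTRUCTIVE = re.compile(
    r"^\s*(DROP\s+(TABLE|COLUMN|INDEX|SCHEMA|DATABASE)|TRUNCATE|DELETE\s+FROM"
    r"|ALTER\s+TABLE\s+.*\bDROP\b)",
    re.IGNORECASE | re.DOTALL,
)


def split_statements(sql: str) -> list[str]:
    """Naively split on semicolons after stripping `--` line comments."""
    without_comments = re.sub(r"--[^\n]*", "", sql)
    return [part.strip() for part in without_comments.split(";") if part.strip()]


def destructive_statements(sql: str) -> list[str]:
    """Return every statement that matches the destructive pattern."""
    return [stmt for stmt in split_statements(sql) if _DESTRUCTIVE.match(stmt)]
```

A CI step could fail the build whenever `destructive_statements` returns a non-empty list for a migration that is marked eligible for automated rollback.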

Prerequisites & Decision Criteria

Automated rollback is safe only when the deploy was designed for it. The checklist below decides whether a given deploy is eligible for an automatic reversal or must fall back to a manual, reviewed one.

  • No automated step runs DROP, TRUNCATE, or DELETE against production; destructive reversals require human approval.
  • The schema migration ran as a separate, gated step before the new image rolled out, so reverting the image never strands the schema.

Reversal mechanism         | Speed        | Data risk                    | When to use
---------------------------+--------------+------------------------------+------------------------------------------------
Feature flag flip          | Milliseconds | None                         | New behavior is gated behind a toggle
Restore previous image     | Seconds      | None (schema stays expanded) | Code regression, additive schema
Forward fix (roll forward) | Minutes      | None                         | Bug is small and a fix is faster than reverting
Destructive down migration | Slow         | High (possible data loss)    | Never automatic; manual + reviewed only

The decision rule is a hierarchy: try the flag flip first, the image restore second, and a forward fix third; reserve the destructive down for a human. If a deploy is not additive — if it requires a destructive reversal to undo — it is not eligible for automated rollback and must not be wired into the auto-revert path at all.
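The hierarchy can be written down as an ordered plan. The following is a sketch under stated assumptions: `DeployFacts` and `reversal_plan` are hypothetical names, the four booleans are a simplification of the eligibility checklist, and the `"manual-review"` label stands in for whatever human approval process a team actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployFacts:
    """Simplified eligibility inputs for one deploy (hypothetical shape)."""
    additive: bool                 # migration only adds structure the old code ignores
    flag_gated: bool               # new behavior sits behind a feature flag
    previous_image_retained: bool  # orchestrator can re-promote the prior image
    forward_fix_ready: bool        # a small fix is faster than reverting


def reversal_plan(facts: DeployFacts) -> list[str]:
    """Return automated reversal steps in the order they should be tried.

    A non-additive deploy is never eligible: the only answer is manual review.
    """
    if not facts.additive:
        return ["manual-review"]
    plan = []
    if facts.flag_gated:
        plan.append("flag-flip")
    if facts.previous_image_retained:
        plan.append("image-restore")
    if facts.forward_fix_ready:
        plan.append("forward-fix")
    # With no automated option available, fall back to a human.
    return plan or ["manual-review"]
```

Note that a destructive down migration does not appear as a possible step at all, which keeps it unreachable from the automated path by construction.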

Step-by-Step Procedure

This procedure wires a health-check-triggered automated rollback for an additive deploy. The schema migration has already run as a separate, gated step before the image rollout.

1. Confirm the deploy is additive and the previous image is retained. Eligibility is the precondition for everything else.

# Context: pre-deploy guard in CI; non-zero exit aborts the deploy before any traffic shifts.
./bin/assert-additive --migration migrations/0042_add_region_code.sql   # fails on DROP/TRUNCATE
./bin/assert-image-retained --service api --previous-tag "$PREV_IMAGE"

Verify before proceeding: confirm the orchestrator lists the previous image tag as available for re-promotion.

2. Roll out the new image behind a flag. Deploy the code with the new path gated so a flip can disable it instantly.

# Context: orchestrator rollout; the new schema column exists already; new behavior is flag-gated off.
./bin/deploy --service api --image "$NEW_IMAGE" --flag region_code_enabled=false

3. Enable the new path gradually and open the health-check window. Turn the flag on for a slice of traffic and start sampling.

# Context: progressive delivery; the window samples health while traffic ramps.
./bin/flag set region_code_enabled --percent 10
./bin/healthcheck --service api --error-rate-max 1% --p95-latency-max 250ms --window 5m

Verify before proceeding: confirm the health check is reading live production metrics, not a cached or synthetic value.

4. On a health-check breach, flip the flag first. The fastest, zero-risk reversal stops the bleeding immediately.

# Context: triggered by a non-zero healthcheck exit; instantaneous; no redeploy required.
./bin/flag set region_code_enabled --percent 0   # disable the new path in milliseconds

5. If health does not recover, restore the previous image. Re-promote the retained image without touching the schema.

# Context: orchestrator rollback; restores the previous application version; schema stays expanded.
./bin/deploy --service api --image "$PREV_IMAGE"   # old code runs fine against the added column

Verify before proceeding: confirm error rate and latency return to baseline after the image restore.
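Steps 4 and 5 form an escalation ladder, which might be orchestrated roughly as follows. This is a hedged sketch: `auto_revert` is a hypothetical function, the three callables are placeholders for whatever wraps the flag service, the orchestrator, and the health check in a real pipeline, and the final `"escalate-to-human"` marker is an assumption about what to do when neither automated mechanism restores health.

```python
from typing import Callable


def auto_revert(
    flip_flag_off: Callable[[], None],
    restore_previous_image: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> list[str]:
    """Run the cheapest reversal first and escalate only if health stays red.

    Returns the mechanisms that fired, in order, for the incident record.
    No schema operation is reachable from this function.
    """
    fired = []

    # Step 4: disable the new code path without redeploying.
    flip_flag_off()
    fired.append("flag-flip")
    if is_healthy():
        return fired

    # Step 5: re-promote the retained image; the schema stays expanded.
    restore_previous_image()
    fired.append("image-restore")
    if not is_healthy():
        # Assumption: both automated options are exhausted, so hand off.
        fired.append("escalate-to-human")
    return fired
```

In practice `is_healthy` would wait for a short recovery window before answering, so that the image restore is not triggered while the flag flip is still taking effect.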

6. Leave the schema in place and record the incident. The migration stays applied; do not run a down.

# Context: post-rollback bookkeeping; explicitly does NOT drop the added column.
# The contract-phase removal, if ever needed, is a separate reviewed migration — never automatic.
./bin/record-rollback --service api --reason "p95 latency breach" --schema-action none

Verification & Observability

After wiring the auto-revert, prove it fires correctly and that a rollback truly left the schema intact. Confirm the additive column survives a reversal.

On PostgreSQL, verify the added column is still present after an image rollback:

-- PostgreSQL · run against production after an automated rollback
-- Context: read-only diagnostics; safe anywhere; confirms the additive change was not reverted.
SELECT column_name, is_nullable, data_type
FROM information_schema.columns
WHERE table_schema = current_schema()  -- assumes orders lives in the session's current schema
  AND table_name = 'orders' AND column_name = 'region_code';

On MySQL 8.0, confirm no destructive DDL ran during the rollback window by checking the schema is unchanged:

-- MySQL 8.0 · run against production after rollback
-- Context: read-only; confirms the column the forward migration added is still present.
SELECT COLUMN_NAME, IS_NULLABLE, DATA_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'orders' AND COLUMN_NAME = 'region_code';

Confirm the rollback machinery behaves as designed:

  • No DROP, TRUNCATE, or DELETE appears in the rollback job’s executed statements.
  • The incident record shows schema-action: none, so the schema state is auditable.

Treat the rollback as observable: the incident record should capture which mechanism fired (flag flip, image restore, or both) and confirm the schema was untouched, so a postmortem can verify no data path was at risk.
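An incident record with these properties could be checked mechanically. The sketch below is illustrative only: `RollbackRecord` and `audit` are hypothetical names, the field layout is an assumption loosely modeled on the `--reason` and `--schema-action` options shown in the procedure, and the keyword check is a coarse token match rather than a SQL parser.

```python
import re
from dataclasses import dataclass, field

# Keywords that must never appear in an automated rollback's statements.
FORBIDDEN_KEYWORDS = {"DROP", "TRUNCATE", "DELETE"}


@dataclass
class RollbackRecord:
    """What one automated rollback did (hypothetical record shape)."""
    service: str
    reason: str
    mechanisms: list[str]  # e.g. ["flag-flip", "image-restore"]
    schema_action: str = "none"
    executed_statements: list[str] = field(default_factory=list)


def audit(record: RollbackRecord) -> list[str]:
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    if not record.mechanisms:
        problems.append("no reversal mechanism recorded")
    if record.schema_action != "none":
        problems.append(f"schema_action is {record.schema_action!r}, expected 'none'")
    for statement in record.executed_statements:
        # Whole-word match, so an identifier like drop_reason is not flagged.
        words = set(re.findall(r"[A-Za-z_]+", statement.upper()))
        hits = sorted(words & FORBIDDEN_KEYWORDS)
        if hits:
            problems.append(f"forbidden {'/'.join(hits)} in: {statement}")
    return problems
```

Running `audit` as part of the postmortem template gives a quick, repeatable answer to whether any data path was at risk.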

Rollback Path

The reversal of rollback automation is re-deploying forward once the underlying problem is fixed — there is no destructive cleanup to undo because the automated path never destroyed anything. After an auto-revert, the schema is still expanded and the previous image is live, which is a stable, safe state you can stay in indefinitely.

To roll forward after fixing the regression, redeploy the new image and re-enable the flag gradually, reusing the same health-check window:

# Context: forward redeploy after a fix; the additive schema is already in place, so no migration runs.
./bin/deploy --service api --image "$FIXED_IMAGE"
./bin/flag set region_code_enabled --percent 10
./bin/healthcheck --service api --error-rate-max 1% --p95-latency-max 250ms --window 5m

The safe condition for rolling forward is that the regression’s root cause is understood and fixed, and the health-check window stays green on the new attempt. Because the schema never changed during the rollback, there is no schema reconciliation to perform — the forward deploy is a pure code change against an already-expanded schema.

The only time a true schema reversal is warranted is the deliberate contract phase, long after the new code is proven stable, when an unused column is genuinely removed. That removal is never part of the automated rollback path — it is a separate, reviewed, gated migration, exactly the destructive change that Migration Pipeline Gating forces an explicit override for.

Common Errors & Fixes

Automated rollback ran a down migration and dropped a column with backfilled data. The down migration was destructive and wired into the auto-revert path. Root cause: treating rollback as schema reversal instead of code reversal. Fix by removing all destructive down migrations from the automated path and reverting only the image and flags; schema removal becomes a separate manual operation.

Restoring the previous image fails because the old code can’t read the new schema. The migration was not additive — it renamed or narrowed a column the old code still expects. Root cause: a non-additive deploy was wired for automated rollback. Fix by making the forward migration strictly additive (add the new column, dual-write, never rename in place), the contract the Expand and Contract Methodology enforces.

The health check passes but users still see errors. The thresholds are too loose or measure the wrong signal — infrastructure latency is fine while a business-critical path is broken. Root cause: the health check does not cover the failure mode. Fix by adding a synthetic transaction that exercises the new code path and gating rollback on it.

Flag flip has no effect because the new behavior is not actually gated. Code shipped that reads the new column unconditionally, so flipping the flag off changes nothing. Root cause: the flag guards only part of the path. Fix by ensuring every branch that depends on the schema change is behind the same flag, the coupling the Feature Flag Rollouts cluster details.
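A fully gated read path might look like the sketch below. It is an illustration under assumptions: `build_order_payload` is a hypothetical function, the row and flag dictionaries stand in for whatever ORM and flag client the application uses, and only the `region_code` and `region_code_enabled` names come from the examples on this page.

```python
def build_order_payload(order_row: dict, flags: dict) -> dict:
    """Build an API payload, touching the new column only when the flag is on.

    With the flag off, the function behaves exactly like the previous code:
    it never reads region_code, so flipping the flag fully disables the path.
    """
    payload = {"id": order_row["id"], "total": order_row["total"]}
    if flags.get("region_code_enabled", False):
        # .get tolerates rows written before the column was backfilled.
        payload["region_code"] = order_row.get("region_code")
    return payload
```

The same guard has to wrap every writer and reader of the new column; a single unguarded branch is enough to make the flag flip ineffective.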

Image restore is slow because the previous image was garbage-collected. The orchestrator pruned the old image, forcing a rebuild during the incident. Root cause: no retention policy for the previous image. Fix by pinning the previous image tag for the duration of the observation window so re-promotion is instant.

Child Page Index

This section drills into the two mechanisms that make automated reversal trustworthy. Auto-Reverting Migrations on Health-Check Failure covers the trigger end to end — defining health thresholds that catch real failures, the observation window, and wiring a breach to fire the flag flip and image restore without a human. Writing Safe Down Migrations for Automated Rollback covers the rare case where a down migration must run in automation — making it strictly additive or idempotent and non-destructive so it can never lose data.

For the gates and tests that try to prevent the failures this section recovers from, return to Migration Pipeline Gating and Automated Migration Testing, and for the full pipeline contract read CI/CD & Migration Automation.

Frequently Asked Questions

Why not just write a down migration that undoes the schema change automatically? Because a down migration that drops a column or truncates a table can lose data the forward migration’s backfill produced, and once an automated job has destroyed that structure there is nothing to restore to. Automated rollback reverses only the application image and feature flags, leaving the additive schema in place. The old code runs perfectly against the expanded schema because it simply ignores the new column.

What’s the difference between flipping a flag and restoring the previous image? A flag flip is instantaneous and disables the new code path without any redeploy, so it is the first move when the new behavior is gated behind a toggle. Restoring the previous image takes seconds and re-promotes the prior application version. Use the flag flip to stop the bleeding immediately, then restore the image if the flip alone does not bring health back to baseline.

When is automated rollback the wrong choice? When the deploy is not additive. If reverting requires a destructive schema change to undo a rename or a narrowed type, the deploy is ineligible for automated rollback and must not be wired into the auto-revert path. Redesign the migration to be additive, or handle the reversal as a manual, reviewed operation with a human confirming no data is at risk.