Catching Table-Lock Regressions in Migration Tests

A migration that takes an exclusive lock on a hot table does not error — it waits, and while it waits, every query behind it waits too. The pager fires for “database unresponsive,” not “migration failed,” because from the application’s view the table simply stopped answering for ninety seconds. The migration that caused it looked harmless in review: an ALTER TABLE ... ALTER COLUMN ... TYPE, a SET NOT NULL without the two-step dance, an index build that fell back to LOCK=SHARED on MySQL. The difference between a safe online change and a table-locking outage is invisible in the SQL text and absent from an empty-schema test. You catch it the only way it manifests — by measuring the lock a migration takes. This page builds an automated test that detects a long or exclusive lock by combining a tight lock_timeout against a snapshot with a concurrent-load probe that fails the build when the migration blocks a normal query for longer than the budget.

Lock-regression test harness A migration is applied while a concurrent probe issues queries against the same table; if the probe is blocked beyond the budget, the test fails. Measure the Lock, Fail on the Block Migration session ALTER TABLE … (takes lock) Probe session SELECT/INSERT in a loop Probe blocked > budget → FAIL the build holds lock waits on lock A safe online change never blocks the probe; a rewrite does.
The probe is the assertion: a migration that takes an online, non-blocking lock never delays the probe, while a rewrite or exclusive lock stalls it past the budget and fails the test.

Symptom / Error Signatures

In production, a lock regression looks like a stall, not an error. Application latency jumps across every query that touches the table; connection pools saturate as requests queue behind the blocked statement; health checks time out. On PostgreSQL, pg_stat_activity shows the migration’s backend in wait_event_type = Lock with everything else waiting on it, and pg_locks shows an AccessExclusiveLock on the table. On MySQL, SHOW PROCESSLIST shows queries stuck in Waiting for table metadata lock, and SHOW ENGINE INNODB STATUS reports a long-held lock.

In the test harness you are building, the signature is deliberate and benign: the probe session’s query exceeds its lock_timeout and the database raises it on purpose. PostgreSQL: ERROR: canceling statement due to lock timeout. MySQL: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction. That error, raised by the probe, is the test detecting the regression — a controlled failure standing in for the uncontrolled outage it prevents.

The regression you most want to catch is the silent fallback: a DDL you expect to run online quietly downgrades to a locking algorithm. MySQL logs nothing unless you force it; PostgreSQL takes the stronger lock without comment. Only a measurement reveals it.

Root Cause Analysis

Whether a migration locks depends on the operation, the engine, and the data — none of which is visible in the SQL string alone. The lock test exists because the lock level is an emergent property, not a syntactic one.

On PostgreSQL, the danger operations take an ACCESS EXCLUSIVE lock that blocks reads and writes: a table-rewriting ALTER COLUMN ... TYPE, adding a column with a volatile default on older versions, or VALIDATE-less constraint additions. The safe forms — ADD COLUMN nullable, CREATE INDEX CONCURRENTLY, ADD CONSTRAINT ... NOT VALID then a separate VALIDATE CONSTRAINT — take weaker locks that let traffic through. A short lock_timeout turns “block forever” into “fail fast,” but it does not by itself tell you whether the lock would have blocked traffic; for that you need a concurrent reader.

On MySQL 8.0, the pivot is the algorithm. ALGORITHM=INPLACE, LOCK=NONE keeps the table writable; a COPY algorithm rewrites it under a near-exclusive lock; and even INPLACE operations grab a brief metadata lock that a long-running transaction can stall behind. Worse, MySQL silently chooses COPY for operations that cannot run INPLACE. The defense is to force the online algorithm so the engine refuses rather than degrades:

Engine Safe (online) form Locking form to catch How the test forces detection
PostgreSQL CREATE INDEX CONCURRENTLY; ADD COLUMN nullable ALTER COLUMN ... TYPE rewrite; SET NOT NULL scan Short lock_timeout + concurrent probe
MySQL 8.0 ALGORITHM=INPLACE, LOCK=NONE implicit ALGORITHM=COPY; LOCK=SHARED Add explicit ALGORITHM=INPLACE, LOCK=NONE so non-online DDL errors

Forcing ALGORITHM=INPLACE, LOCK=NONE makes MySQL raise ERROR 1846: ALGORITHM=INPLACE is not supported at migration time when an operation cannot run online — converting a silent runtime outage into a loud test failure. This is the same non-transactional, non-online DDL hazard covered in depth under handling non-transactional DDL in MySQL migrations. And because lock duration scales with row count, the test is only meaningful against realistic data — run it against the restored dataset from testing migrations against production-like snapshots, never an empty schema.

Immediate Mitigation

If a migration is locking a hot table in production right now, the priority is to release the lock and re-stage the change as online:

  1. Cancel the blocking statement so traffic recovers immediately. Identify and terminate the migration backend.
-- PostgreSQL · requires pg_signal_backend / superuser · run during the incident
-- Context: find and cancel the migration that holds AccessExclusiveLock.
SELECT pid, query FROM pg_stat_activity
 WHERE wait_event_type IS NULL AND state = 'active' AND query ILIKE 'ALTER TABLE%';
SELECT pg_cancel_backend(<pid>);   -- pg_terminate_backend(<pid>) if cancel is ignored
  1. Cap the blast radius for the retry by setting a short lock_timeout so the next attempt fails fast instead of stalling traffic.
-- PostgreSQL · run as migration role · prevents a long exclusive wait
-- Context: a fast failure is recoverable; a long lock is an outage.
SET lock_timeout = '2s';
  1. Re-author the operation in its online formCREATE INDEX CONCURRENTLY, a two-step NOT VALID/VALIDATE, or an ADD COLUMN-plus-backfill instead of an in-place rewrite — staged through expand-and-contract for high-traffic tables.

  2. Add the lock test to CI so the locking form cannot return (next section).

Permanent Fix / Long-Term Pattern

The durable control is a test that runs the migration against a populated snapshot under concurrent load and fails when the probe is blocked beyond the budget.

# scripts/lock-regression-test.sh — concurrent probe against a populated CI db
# Context: ephemeral DB restored from a sanitized snapshot; destroyed after.
set -euo pipefail
BUDGET_MS=200
# Probe: loop a normal query with a tight lock_timeout; non-zero exit if it's blocked.
( while true; do
    psql -d ci_db -v ON_ERROR_STOP=1 \
      -c "SET lock_timeout='${BUDGET_MS}ms'; SELECT 1 FROM orders LIMIT 1;" \
      || { echo "PROBE BLOCKED > ${BUDGET_MS}ms — lock regression"; exit 1; }
    sleep 0.05
  done ) & PROBE=$!
# Apply the migration while the probe runs.
psql -d ci_db -v ON_ERROR_STOP=1 -f "$MIGRATION_FILE"
kill "$PROBE" 2>/dev/null || true
wait "$PROBE" 2>/dev/null || exit 1   # propagate a probe failure to the build
-- MySQL 8.0 · run as migration role · forces the engine to refuse non-online DDL
-- Context: errors with 1846 at test time if the operation cannot run INPLACE.
ALTER TABLE orders ADD INDEX idx_orders_region (region_code),
  ALGORITHM=INPLACE, LOCK=NONE;

Two complementary assertions make the test robust. The MySQL side forces ALGORITHM=INPLACE, LOCK=NONE so any operation that cannot run online fails the migration loudly rather than degrading to a copy. The PostgreSQL side runs a concurrent probe with a tight lock_timeout; if the probe is ever canceled, the migration took a lock stronger than the budget allows and the build fails. Set the budget to the longest stall your application tolerates without health checks tripping — typically a few hundred milliseconds. Keep this test alongside the snapshot apply so both data-correctness and lock behavior are checked in one stage; both belong to the broader automated migration testing suite.

Verification Checklist

  • lock_timeout and a probe cancellation fails the build.
  • ALGORITHM=INPLACE, LOCK=NONE so non-online DDL errors at test time.

Frequently Asked Questions

Isn’t a short lock_timeout alone enough to catch this? A short lock_timeout makes the migration fail fast if something else holds the table, but it does not reveal whether the migration’s own lock would block traffic — a rewrite can complete within the timeout while still blocking every concurrent query for its duration. The concurrent probe is what measures impact on real traffic; the timeout only bounds waiting.

How do I stop MySQL from silently choosing a copy algorithm? Always specify ALGORITHM=INPLACE, LOCK=NONE explicitly on the ALTER TABLE. When the operation cannot run online, MySQL raises ERROR 1846 instead of quietly degrading to a table-rewriting copy, turning a hidden production outage into a visible test failure.

What lock budget should the probe enforce? Set it to the longest pause your application can absorb before health checks or client timeouts fire — usually 100 to 500 ms. Any migration that blocks a normal query longer than that would degrade live traffic, so the test should fail at that threshold rather than at some arbitrary larger number.