Handling Non-Transactional DDL in MySQL Migrations

Schema deployments in production frequently fail due to MySQL’s implicit commit behavior. Unlike application-layer transactions, Data Definition Language (DDL) statements trigger automatic commits before and after execution. This architectural reality forces engineering teams to rethink deployment pipelines when navigating Database Migration Fundamentals & Tool Selection and designing resilient, zero-downtime release workflows.

Symptom

When non-transactional DDL collides with active workloads, deployment runners stall or fail with the following signatures:

  • ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
  • ERROR 1062 (23000): Duplicate entry '...' for key '...' during index creation
  • ERROR 1091 (42000): Can't DROP 'column_name'; check that column/key exists
  • Migration runners (Flyway, Liquibase, ActiveRecord, Django ORM) hanging indefinitely at Waiting for table metadata lock
  • Partial schema state post-interruption: new columns exist but constraints, indexes, or foreign keys are missing, leaving the database in an inconsistent version.

Diagnostic Query: Identify stalled DDL and blocking sessions immediately.

-- Context: Execute on primary instance as DBA/privileged user
SELECT 
 p.ID, p.USER, p.HOST, p.DB, p.COMMAND, p.TIME, p.STATE, LEFT(p.INFO, 80) AS QUERY
FROM information_schema.processlist p
WHERE p.COMMAND != 'Sleep' 
 AND p.TIME > 30 
 AND (p.INFO LIKE '%ALTER TABLE%' OR p.INFO LIKE '%CREATE INDEX%' OR p.INFO LIKE '%DROP%')
ORDER BY p.TIME DESC;

Root Cause

InnoDB implicitly commits the current transaction before executing DDL and commits again immediately after. Because DDL cannot be rolled back, any failure mid-execution leaves the schema partially altered. Concurrent SELECT, INSERT, or UPDATE operations acquire shared metadata locks (MDL) on the target table. When a migration attempts an ALTER TABLE, it requests an exclusive MDL. If long-running queries or uncommitted transactions hold the shared lock, the DDL statement blocks, eventually timing out or triggering cascading connection pool exhaustion. This behavior fundamentally differentiates MySQL from fully transactional RDBMS platforms, a distinction thoroughly documented in Transactional vs Non-Transactional DBs. Migration tools that assume atomic rollback semantics will fail to reconcile the database state, requiring manual intervention.

Immediate Mitigation

When a migration stalls or fails in production, execute the following sequence to restore stability and reconcile state.

1. Reduce Lock Contention & Retry Safely

-- Context: Run in the same session before re-executing the failing DDL
SET SESSION lock_wait_timeout = 5;
-- Re-run the exact ALTER/CREATE statement

2. Terminate Blocking Sessions (Coordinate with App Owners)

-- Context: Identify the specific process ID from the diagnostic query above
KILL <process_id>;
-- Warning: Only kill long-running SELECTs or idle transactions. Killing active DML may cause data inconsistency.

3. Audit Partial Schema State

-- Context: Run to map exactly which DDL steps succeeded before failure
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE, COLUMN_DEFAULT 
FROM information_schema.columns 
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'your_table';

SELECT INDEX_NAME, NON_UNIQUE, SEQ_IN_INDEX, COLUMN_NAME 
FROM information_schema.statistics 
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'your_table';

4. Explicit Rollback/Reconciliation Commands If the migration partially succeeded, manually revert or complete the remaining steps. Do not rely on the migration runner’s automatic rollback.

-- Context: Execute only if the column/index was created but constraints failed
ALTER TABLE your_table DROP COLUMN IF EXISTS new_column;
DROP INDEX IF EXISTS idx_partial ON your_table;
ALTER TABLE your_table DROP FOREIGN KEY fk_partial;

-- Context: If only the column was added but needs to be kept, mark migration as applied
-- Update migration runner state table manually to prevent duplicate execution
-- Example (Flyway): UPDATE flyway_schema_history SET success = 1 WHERE version = 'X.X';

Permanent Resolution

Eliminate DDL-induced outages by adopting an expand/contract (parallel change) strategy and enforcing idempotent execution.

  1. Additive-First Migrations: Only add columns, tables, or indexes in a single deployment. Make new columns NULLABLE with sensible defaults to avoid full table rebuilds.
  2. Online Schema Change Tools: Replace native ALTER TABLE with pt-online-schema-change (Percona) or gh-ost (GitHub). These tools create shadow tables, sync data via triggers or binary log parsing, and swap tables atomically, bypassing MDL contention.
  3. Idempotent Script Design: Wrap DDL in conditional checks so re-runs do not fail on partial states.
-- Safe pattern for MySQL 8.0+
ALTER TABLE your_table ADD COLUMN IF NOT EXISTS new_col VARCHAR(255) DEFAULT 'default';
CREATE INDEX IF NOT EXISTS idx_new_col ON your_table(new_col);
  1. Decouple Schema & App Deployments: Push schema changes first. Deploy application code that supports both old and new structures. Clean up deprecated columns in a subsequent release.
  2. Enforce Pre-Flight Validation: Run schema diffs against staging replicas, execute EXPLAIN on index creation queries, and simulate lock wait scenarios in CI/CD pipelines before merging migration PRs.

Validation Protocol

Confirm migration integrity and system stability before closing the deployment ticket.

1. Verify Runner State

-- Context: Confirm the migration version matches expected state
SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 1;
-- Adjust table name per your runner (e.g., flyway_schema_history, ar_internal_metadata, django_migrations)

2. Audit for Orphaned Objects

-- Context: Ensure online schema change tools cleaned up after themselves
SHOW TRIGGERS FROM your_database LIKE '%_ghc%';
SHOW TABLES LIKE '%_old';
-- If found, manually DROP TRIGGER <name> and DROP TABLE <name>

3. Validate Index & Constraint Integrity

-- Context: Run on primary instance
CHECK TABLE your_table EXTENDED;
SELECT TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE 
FROM information_schema.table_constraints 
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'your_table';

4. Monitor Post-Deployment MDL Contention

-- Context: Run during peak traffic (first 15 mins post-deploy)
SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID 
FROM performance_schema.metadata_locks 
WHERE LOCK_TYPE = 'EXCLUSIVE' AND LOCK_STATUS = 'PENDING';

5. Application Smoke Tests

  • Execute read/write integration tests against endpoints touching the altered table.
  • Verify query execution plans (EXPLAIN) have not regressed due to new indexes.
  • Monitor connection pool exhaustion metrics and InnoDB_row_lock_waits for 30 minutes.