Routing DDL vs DML Traffic During Migrations

The migration runbook says “run on the primary,” but your application does not talk to “the primary” — it talks to a router that splits reads to replicas and writes to the writer endpoint. During a schema change that split becomes a hazard: if a ALTER TABLE is accidentally issued through a read connection it either errors against a read-only replica or, worse, succeeds against one replica and never propagates, leaving the database fleet with inconsistent schemas. Meanwhile the replicas that should receive the change get it through replication — but only after the primary finishes, and a heavy DDL can stall replica apply and spike lag on every read you route there. Getting a migration through a read/write-split topology cleanly means routing all DDL to the primary, keeping ordinary DML reads on followers, and planning for the lag and rebuilds the DDL will cause downstream.

This is an operational routing problem, not a SQL problem. The ALTER itself may be trivial; deciding which connection runs it, and what the replicas do while it replicates, is where deploys go wrong. Below: how to recognize misrouted DDL, route it correctly right now, and make the topology enforce the rule permanently.

Symptom / Error Signatures

Misrouted or mistimed schema changes announce themselves in a few distinct ways.

  • PostgreSQL: a DDL statement returns ERROR: cannot execute ALTER TABLE in a read-only transaction because it was issued on a hot-standby connection.
  • MySQL: ERROR 1290 (HY000): The MySQL server is running with the --read-only option (or --super-read-only) when DDL lands on a replica connection.
  • Schema drift between nodes: information_schema.columns differs between the primary and a replica, because a change was applied directly to one replica out of band instead of flowing through replication.
  • Replica lag spikes coinciding exactly with the DDL on the primary — Seconds_Behind_Source or pg_stat_replication.replay_lag jumps while the schema change replays serially on each follower.
  • Read endpoints return errors mid-migration because the application’s new code reads a column that exists on the primary but has not yet replicated to the follower serving the read.

The last symptom is the dangerous one: the schema is correct everywhere eventually, but for the lag window a read replica is serving the old schema to new code.

Root Cause Analysis

A read/write router classifies each statement and sends it to an endpoint. The classification is usually based on the leading keyword: SELECT goes to a replica, INSERT/UPDATE/DELETE go to the writer. DDL (ALTER, CREATE, DROP) is a write and must go to the writer, but many routers either misclassify it, or the application opens an explicit “read” connection for a maintenance task and runs DDL on it by mistake. Replicas are deliberately read_only / super_read_only (PostgreSQL hot standbys are read-only by construction), so DDL on a replica connection either errors or, if someone disabled the flag, applies locally and creates drift.

The second mechanism is timing. DDL only ever runs on the primary, then flows to replicas through the same replication stream as data. But schema replay on a replica is serial and can be slow — a long ALTER that rewrites a table on the primary replays as an equally long operation on each follower, blocking the apply of everything queued behind it. During that window the replica’s data lags and may be serving a schema that does not yet match what the primary and the new application code expect.

Statement class Correct endpoint If misrouted to a replica Replication effect
DDL (ALTER, CREATE INDEX, DROP) Primary / writer only Errors (read-only) or creates drift Replays serially on each replica; can stall apply
Write DML (INSERT/UPDATE/DELETE) Primary / writer Errors on read-only replica Normal row-based replication
Read DML (SELECT) Replica (offload) Works but loses offload benefit None
Backfill writes Primary, throttled Errors on replica Throttle against lag

The cleanest way to think about it: DDL is a write that the whole fleet must converge on, so it has exactly one valid entry point — the primary — and the migration must account for the lag its propagation creates. The online-DDL choices that keep that propagation cheap are the subject of the Expand and Contract Methodology, and the replica-side handling is detailed in Configuring Read Replicas for Seamless Schema Updates.

DDL and DML routing through a read/write split The application router sends DDL and writes to the primary; reads go to replicas. Schema changes flow from the primary to replicas over the replication stream. Routing During a Migration App router classifies SQL Primary DDL + writes Replicas read DML only DDL SELECT schema replicates down DDL has one entry point; replicas receive it only via the replication stream.
DDL enters only at the primary and reaches replicas through replication; the router must never send a schema change down a read connection.

Immediate Mitigation

If DDL has been misrouted, or a migration is stalling replicas right now, work through these steps.

  1. Pin the migration to the writer endpoint. Do not rely on statement classification for schema changes. Open an explicit primary connection for the migration tool using the writer DSN.
# Run the migration against the WRITER endpoint explicitly.
# Context: maintenance task; uses the primary DSN, never the read/split endpoint.
DATABASE_URL="$PRIMARY_WRITER_URL" ./bin/migrate up
  1. Confirm you are actually on the primary before running DDL. A one-line guard prevents the read-only error and prevents drift.
-- PostgreSQL · run as the migration role · aborts if the session is a hot standby
DO $$ BEGIN
  IF pg_is_in_recovery() THEN
    RAISE EXCEPTION 'Refusing DDL: this connection is a read replica';
  END IF;
END $$;
-- MySQL · run as the migration role · confirm this node is the writable primary
SELECT @@read_only, @@super_read_only;   -- both must be 0 before running DDL
  1. If a replica has drifted because DDL was applied to it directly, do not try to hand-patch it into agreement. Treat the replica as untrustworthy: take it out of the read pool and rebuild it from the primary (re-clone / re-pg_basebackup / re-seed from a snapshot). Hand-reconciled schemas are a recurring source of subtle failures.

  2. Throttle anything competing with the DDL replay. A long schema replay plus a running backfill will saturate replica apply. Pause backfills until the DDL has propagated, applying the limits from Tuning Backfill Batch Size Against Replication Lag.

  3. Hold new-schema reads off the replicas until lag clears. If the new application code reads a just-added column, keep those reads on the primary (or behind a flag) until every replica confirms it has the column, then re-enable replica offload.

Permanent Fix / Long-Term Pattern

The durable fix is to make the topology enforce the routing rule so a human can never put DDL on a replica. Three layers do this.

Configure replicas as hard read-only. Set super_read_only = ON on MySQL replicas and rely on PostgreSQL hot standbys being read-only by construction. A misrouted DDL then errors loudly instead of silently creating drift — a fast failure you can catch in a deploy gate.

Give the migration tool its own primary-only DSN. Application traffic uses the split router; the migration runner uses a dedicated connection string that resolves only to the writer endpoint. The two never share a pool, so application read offloading can never capture a schema change.

# Two separate connection roles. Context: app uses the split endpoint; migrations the writer.
database:
  app_readwrite:  "postgres://app@router.split:5432/prod"     # router splits reads/writes
  migrations:     "postgres://migrator@primary.writer:5432/prod"  # writer only, DDL here

Sequence the migration so replica lag never serves stale schema to new code. This is the routing-aware form of the additive deploy order: add the column on the primary, wait until every replica reports the change present and lag is near zero, then roll out the application code that reads it on replicas. The expand-then-deploy ordering is exactly what the Expand and Contract Methodology prescribes; here the added constraint is that “the schema is ready” means ready on every follower, not just the primary.

-- PostgreSQL · run on PRIMARY · confirm a replica has caught up before routing new reads to it
-- read-only; compare each replica's replay position to the primary's current LSN.
SELECT application_name, replay_lag,
       (pg_current_wal_lsn() = replay_lsn) AS fully_caught_up
FROM   pg_stat_replication;

For the broader replica-management mechanics — connection draining, replica warm-up, and read-after-write consistency during a schema update — the Configuring Read Replicas for Seamless Schema Updates guide and the parent Read/Write Splitting Tactics section carry the full procedure.

Verification Checklist

  • pg_is_in_recovery() / @@super_read_only) that aborts if the session is not the primary.
  • super_read_only (MySQL) or hot-standby read-only (PostgreSQL), so misrouted DDL errors instead of drifting.

Frequently Asked Questions

Why can’t I just run the ALTER on each replica directly to speed up propagation? Because that creates exactly the drift you are trying to avoid, and replicas reject writes when super_read_only is set. DDL has one legitimate entry point — the primary — and reaches replicas through the replication stream so that ordering and consistency are guaranteed. Applying it node-by-node breaks that guarantee and produces a fleet whose schemas disagree.

How do I keep replica reads working while a slow DDL replays? Route reads that depend on the new schema to the primary (or gate them behind a flag) until every replica confirms it has applied the change and lag is near zero, then switch those reads back to the replicas. Reads that do not touch the changed columns can stay on replicas throughout, since they tolerate the temporary lag.

Should the migration connection use the same pool as the application? No. Give migrations a dedicated writer-only connection string separate from the application’s read/write-split endpoint. Sharing a pool risks a schema change being classified as a read and sent to a replica, and it lets a heavy migration starve the application’s own connection budget.

A replica drifted after someone ran DDL on it. Can I patch it back into sync? Don’t. Hand-reconciling a drifted replica leaves subtle differences — column order, index presence, constraint state — that surface as intermittent failures later. Take the node out of the read pool and rebuild it from the primary so its schema and data are byte-for-byte consistent with the source of truth.