Configuring Read Replicas for Seamless Schema Updates

You ran an ALTER TABLE on the primary, it applied cleanly, and seconds later your read endpoints started throwing column "new_column" does not exist — from replicas that had not yet replayed the DDL. The primary is always ahead of its followers, so any non-backward-compatible change opens a window where read queries routed to a lagging replica hit a schema that no longer matches what the application expects. Standard connection poolers make this worse: they treat every replica as interchangeable and have no idea which ones have caught up to the latest DDL. This page is the runbook for diagnosing that mismatch and configuring replica routing so reads never land on a node running stale schema.

This is the read-side counterpart to the Expand and Contract Methodology, which keeps each schema state additive so a lagging replica still answers correctly, and it depends on the lag discipline from Backfill Optimization to keep followers close enough to route to. Examples cover PostgreSQL and MySQL 8.0.

Symptom / Error Signatures

A DDL change racing ahead of replication surfaces as schema-mismatch errors that appear only on reads:

  • ERROR: column "new_column_name" does not exist (PostgreSQL) or ERROR 1054: Unknown column (MySQL), but only against read endpoints.
  • ORM stack traces such as ActiveRecord::StatementInvalid: PG::UndefinedColumn on read-only operations while writes succeed.
  • Pooler metrics: replica_lag_seconds spiking above 5 s immediately after an ALTER TABLE.
  • Proxy logs: schema mismatch on replica node, triggering failover to the primary and then connection exhaustion as all reads pile onto one node.
  • Client-side 504 Gateway Timeout on read endpoints while write endpoints stay responsive.

Confirm it is a replication gap, not a code bug, by checking lag directly:

-- PostgreSQL · read-only · per-replica WAL apply position and byte lag
SELECT client_addr, state, sent_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM   pg_stat_replication;

-- MySQL/MariaDB · read-only · run on each replica
SHOW REPLICA STATUS\G
-- Critical: Seconds_Behind_Source, Replica_IO_Running, Replica_SQL_Running
-- Note: MySQL 8.0.22+ uses SHOW REPLICA STATUS; older versions use SHOW SLAVE STATUS

Root Cause Analysis

Replication is asynchronous: a DDL statement applies to the primary instantly but reaches each replica only after it replays the WAL/binlog record, which can lag by seconds under load or during a heavy backfill. During that gap the primary and its replicas hold two different schemas at once. A pooler doing read/write splitting routes a read to whichever replica is available, and if that replica has not replayed the ALTER, the query references a column the node does not yet have — a fatal error, not a slow query.

The reason this is so common is that connection poolers (PgBouncer, ProxySQL, HAProxy, Envoy) have no native concept of schema version. They health-check for liveness and sometimes for lag, but they treat all replicas as functionally identical regardless of DDL sync state. That is a deterministic race, not bad luck: every non-additive change reproduces it. The structural fix has two halves — keep changes backward compatible so a stale replica still answers correctly (the job of the Read/Write Splitting Tactics overview), and make routing aware of lag and schema parity. Separating which statements may even reach a replica is the focus of routing DDL vs DML traffic during migrations.

The replica schema-mismatch window An ALTER applies to the primary immediately; one replica has replayed it and one has not. Lag-aware routing keeps the unsynced replica out of the read pool until its schema matches the primary. Lag-Aware Routing During DDL Primary ALTER applied Replica A — synced in read pool Replica B — lagging excluded until parity Read query → Replica A only A replica is eligible for reads only when its schema matches the primary.
Routing excludes any replica whose schema has not caught up to the primary, closing the mismatch window without failing reads over to the primary en masse.

Immediate Mitigation

When reads are erroring on schema mismatch right now, get traffic onto correct schema first.

  1. Force primary routing for the affected tables. Temporarily send 100% of reads for those tables to the primary until replication catches up — the primary always has the latest schema.

  2. Catch the mismatch at the application layer. Wrap the read path to catch UndefinedColumn/UnknownColumn and retry against the primary endpoint, so a single lagging replica does not surface as a user error.

  3. Pause the pipeline. Halt pending DDL batches; do not proceed until lag is below 1 s on every replica.

    -- PostgreSQL · read-only · confirm every replica is current before resuming
    SELECT client_addr, (sent_lsn - replay_lsn) AS lag_bytes FROM pg_stat_replication;
  4. Revert routing before reverting DDL. If mismatch is cascading, pull replicas from the read pool first, then decide on DDL rollback.

    # PgBouncer · admin console · stop sending reads to the replica pool, then reload
    psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "PAUSE replica_pool;"
    # remove the replica pool from pgbouncer.ini, then:
    psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"
  5. Roll back the DDL only if required. Reversal differs by engine; do this after routing is safe.

    -- PostgreSQL · supports IF EXISTS
    ALTER TABLE your_table DROP COLUMN IF EXISTS new_column_name;
    -- MySQL · no DROP COLUMN IF EXISTS; verify the column exists first
    ALTER TABLE your_table DROP COLUMN new_column_name;

Permanent Fix / Long-Term Pattern

Make every change survivable on a stale replica and make routing refuse a replica that is not ready. These controls fold into the broader Read/Write Splitting Tactics.

  1. Keep DDL backward compatible. Phase changes additively — ADD COLUMN nullable/with default → deploy code → backfill → enforce constraints → DROP legacy — so every intermediate schema is readable by both synced and lagging replicas. A replica behind by one column still answers old queries correctly.

  2. Make routing lag-aware. Configure ProxySQL or your pooler to monitor pg_stat_replication (PostgreSQL) or SHOW REPLICA STATUS (MySQL) and automatically exclude any replica with lag > 2s from the read pool during a migration. ProxySQL supports custom monitor_query checks for exactly this.

  3. Add a schema-parity sidecar. Run a lightweight service that polls information_schema.columns on each replica and marks it OFFLINE_SOFT until its column set and constraints match the primary, closing the gap that pure lag checks miss when a replica is current on WAL but mid-replay of DDL.

  4. Gate read routing behind a flag. Hold read traffic on the new schema path until every replica reports parity, so no replica receives a query for a column it does not yet have. Which statement classes are even allowed to a replica is governed by routing DDL vs DML traffic during migrations.

  5. Use online schema-change tooling during low-traffic windows — pt-online-schema-change (MySQL) or pg_repack for table rewrites — to minimize the lock and replication impact that widens the mismatch window in the first place.

Verification Checklist

  • EXPLAIN (ANALYZE, BUFFERS) on critical read queries shows no plan regression post-migration on every node.
  • schema_mismatch errors across the rollout window, and column parity matches across all replicas.

Frequently Asked Questions

Why do reads fail when the primary ALTER succeeded? Because replication is asynchronous. The DDL applies to the primary immediately but each replica only sees it after replaying the WAL/binlog. During that gap a read routed to a lagging replica references a column that node does not yet have, raising UndefinedColumn/Unknown column. Keeping changes additive and excluding lagging replicas from the read pool closes the window.

Can my connection pooler route around stale-schema replicas automatically? Not by default — poolers track liveness and sometimes lag, but not schema version. Add a lag threshold to the pool’s health check and, for DDL windows, a schema-parity sidecar that marks a replica offline until its information_schema.columns matches the primary.

Is it enough to just monitor replication lag? Usually, but not always. A replica can report near-zero lag while still mid-replay of a specific DDL statement, so a pure lag check can let a schema-mismatched node back into the pool early. Pair the lag check with a column-parity check before re-admitting a replica during a migration.