Database architecture skills, docs, and rules for high-demand multi-tenant commerce platforms (PostgreSQL source of truth, Neo4j as derived GraphRAG projection, transactional outbox, RLS-based tenant isolation). Includes live schema introspection workflow via Postgres MCP.
The canonical zero-downtime migration pattern for production. Skills cite this file when proposing schema changes against a live database.
The four phases — Expand, Backfill, Cutover, Contract — must be deployed separately, not as a single migration. Each phase is its own PR, its own deploy, its own observability window.
### Worked example: renaming `catalog_products.price` to `catalog_products.list_price`

A naive `ALTER TABLE catalog_products RENAME COLUMN price TO list_price` takes only a brief lock, but it requires the application to flip from `price` to `list_price` atomically, which is impossible to coordinate across multiple service instances. Instead, four phases:
### Phase 1: Expand

```sql
-- Migration: V0123__add_list_price_to_products.sql
ALTER TABLE catalog_products
  ADD COLUMN list_price NUMERIC(12, 2);
```

- **Table:** `catalog_products`
- **Lock:** < 50ms (a column add with no default and no constraint is fast in PG ≥ 11)
- **App behavior:** still reads and writes `price`, ignores `list_price`. The new column is NULL.
- **Rollback:** `ALTER TABLE catalog_products DROP COLUMN list_price;`
- **Rollout:** Deploy. Observe. Wait for any in-flight long transactions to drain.
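Even a brief AccessExclusiveLock can queue behind a long-running transaction and stall all traffic to the table. A sketch of the safeguard this file recommends under "anti-patterns" (fail fast via `lock_timeout`), applied to the expand migration:

```sql
BEGIN;
-- Fail fast instead of queueing behind a long transaction;
-- SET LOCAL scopes the timeout to this transaction only.
SET LOCAL lock_timeout = '5s';
ALTER TABLE catalog_products
  ADD COLUMN list_price NUMERIC(12, 2);
COMMIT;
```

If the statement times out, retry later; nothing has changed and the migration remains safe to re-run.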
### Phase 2: Backfill

```sql
-- Migration: V0124__backfill_list_price_from_price.sql
-- Run in batches of 5,000 rows, throttled
WITH batch AS (
  SELECT id
  FROM catalog_products
  WHERE list_price IS NULL
    AND price IS NOT NULL
  ORDER BY id
  LIMIT 5000
  FOR UPDATE SKIP LOCKED
)
UPDATE catalog_products p
SET list_price = p.price
FROM batch b
WHERE p.id = b.id;
```

Run a script that loops:

```bash
while true; do
  ROWS=$(psql -tA -c "<the UPDATE above returning row count>")
  if [ "$ROWS" -eq 0 ]; then break; fi
  echo "backfilled $ROWS rows"
  sleep 0.5  # throttle to limit replication lag
done
```

- **App behavior:** still writes `price` only; `list_price` catches up via re-running the batch script (idempotent thanks to `WHERE list_price IS NULL`).
- **Rollback:** `UPDATE ... SET list_price = NULL`, but typically you'd just continue forward to Phase 1's rollback.
- **Observe:** replication lag. Pause between batches if lag exceeds threshold.
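The pause-on-lag guidance above can be sketched as a small driver loop. `run_batch` and `replication_lag_seconds` are hypothetical callables supplied by the operator; in practice they would execute the batched UPDATE and query `pg_stat_replication` respectively:

```python
import time


def backfill(run_batch, replication_lag_seconds, max_lag=5.0, throttle=0.5):
    """Repeatedly run one backfill batch until no rows remain.

    run_batch() -> int: executes one batched UPDATE, returns rows updated.
    replication_lag_seconds() -> float: current replica lag in seconds.
    Returns the total number of rows backfilled.
    """
    total = 0
    while True:
        # Pause while replicas are behind instead of piling on more writes.
        while replication_lag_seconds() > max_lag:
            time.sleep(1.0)
        rows = run_batch()
        if rows == 0:  # idempotent: the UPDATE's WHERE list_price IS NULL
            return total
        total += rows
        time.sleep(throttle)  # throttle between batches
```

Because the batch query filters on `list_price IS NULL`, the loop can be killed and restarted at any point without double-writing.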
### Phase 3: Cutover

Application changes (deployed independently per service). For writes:

```diff
- product.price = body.price
+ product.list_price = body.price
+ product.price = body.price  # dual-write during cutover
```

For reads:

```diff
- return { price: product.price }
+ return { price: product.list_price ?? product.price }
```

Old clients that still write `price` continue to work. Wait until all service instances have deployed the new code. Use a feature flag if a cleaner rollout is needed:

```python
if feature.enabled('use_list_price'):
    write_both_during_cutover(value)  # dual-write: price and list_price
else:
    write_price(value)                # legacy single-column path
```

### Phase 4: Contract

```sql
-- Migration: V0125__make_list_price_required.sql
ALTER TABLE catalog_products
  ALTER COLUMN list_price SET NOT NULL;

-- Migration: V0126__drop_old_price_column.sql
ALTER TABLE catalog_products
  DROP COLUMN price;
```

- **Lock:** `SET NOT NULL` requires a full table scan in PG < 12; in PG ≥ 12 it can use an existing `CHECK (list_price IS NOT NULL)` constraint added `NOT VALID` and validated separately to avoid the lock.
- **App behavior:** writes to `price` will fail. Only run after the cutover deploy is fully rolled out and burned in.
- **Rollback:** `ALTER TABLE ... ADD COLUMN price ...` and re-backfill from `list_price`. Costly. The contract phase is a one-way door; only ship when confident.

### Pattern: adding a NOT NULL column

| Phase | Step |
|---|---|
| 1. Expand | `ADD COLUMN x INT;` (nullable, no default) |
| 2. Backfill | `UPDATE ... SET x = ... WHERE x IS NULL;` (batched) |
| 3. Cutover | App writes `x` on every insert/update |
| 4. Contract | `ALTER ... SET NOT NULL;` + add default for safety |
### Pattern: changing a column type

| Phase | Step |
|---|---|
| 1. Expand | `ADD COLUMN x_new NEW_TYPE;` |
| 2. Backfill | `UPDATE ... SET x_new = x::NEW_TYPE;` |
| 3. Cutover | Dual-write `x` and `x_new` |
| 4. Contract | Drop `x`, rename `x_new` to `x` |
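As a concrete instance of the table above, consider widening a hypothetical `orders.quantity` column from INT to BIGINT (the table and column names are illustrative, not from this repo):

```sql
-- 1. Expand
ALTER TABLE orders ADD COLUMN quantity_new BIGINT;

-- 2. Backfill (batched with LIMIT + FOR UPDATE SKIP LOCKED in practice,
--    exactly as in Phase 2 of the worked example)
UPDATE orders SET quantity_new = quantity::BIGINT WHERE quantity_new IS NULL;

-- 3. Cutover: application dual-writes quantity and quantity_new

-- 4. Contract
ALTER TABLE orders DROP COLUMN quantity;
ALTER TABLE orders RENAME COLUMN quantity_new TO quantity;
```

The contract step's `DROP` + `RENAME` both take only brief locks; the expensive work all happened in the backfill.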
### Pattern: creating an index

```sql
CREATE INDEX CONCURRENTLY idx_name ON table_name (cols);
```

`CONCURRENTLY` avoids blocking writes. If the build fails (e.g. it leaves an invalid index), drop it and retry. Never use plain `CREATE INDEX` on a production table over 1M rows.
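A failed `CONCURRENTLY` build leaves the index behind marked invalid. One way to find and clear such leftovers before retrying (a sketch; `idx_name` stands in for whatever the query reports):

```sql
-- List invalid indexes left behind by failed concurrent builds
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop without blocking writes, then re-run CREATE INDEX CONCURRENTLY
DROP INDEX CONCURRENTLY idx_name;
```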
### Pattern: renaming a table

| Phase | Step |
|---|---|
| 1. Expand | `CREATE VIEW old_name AS SELECT * FROM new_name;` (or vice versa with renamed table) |
| 2. Cutover | App switches to new name |
| 3. Contract | Drop the view |
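Sketched in SQL, doing the rename and the compatibility view in one transaction so there is no window where neither name resolves:

```sql
BEGIN;
ALTER TABLE old_name RENAME TO new_name;
CREATE VIEW old_name AS SELECT * FROM new_name;  -- old name keeps working
COMMIT;

-- Phase 3, after all services reference new_name:
DROP VIEW old_name;
```

Simple single-table views are automatically updatable in PG ≥ 9.3, so writes through the old name continue to work during the cutover.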
### Pattern: adding a foreign key

```sql
-- Phase 1: add as NOT VALID (no table scan; brief lock only)
ALTER TABLE child_table
  ADD CONSTRAINT child_parent_fk
  FOREIGN KEY (parent_id) REFERENCES parent_table (id)
  NOT VALID;

-- Phase 2: validate (scans the table, but without an exclusive lock)
ALTER TABLE child_table VALIDATE CONSTRAINT child_parent_fk;
```

### Lock reference

| Operation | Lock | Notes |
|---|---|---|
| `ADD COLUMN` (nullable, no default) | AccessExclusiveLock, brief | Safe |
| `ADD COLUMN` (with non-volatile default, PG ≥ 11) | AccessExclusiveLock, brief | Safe |
| `ADD COLUMN` (with volatile default) | AccessExclusiveLock, full rewrite | Avoid |
| `DROP COLUMN` | AccessExclusiveLock, brief | Logical only; reclaim space later |
| `ALTER COLUMN ... TYPE` (compatible) | AccessExclusiveLock, brief | Check PG docs per type |
| `ALTER COLUMN ... TYPE` (incompatible) | AccessExclusiveLock, full rewrite | Avoid; use the four-phase pattern |
| `SET NOT NULL` (PG < 12) | AccessExclusiveLock, full scan | Use NOT VALID + VALIDATE in PG 12+ |
| `CREATE INDEX` | ShareLock, blocks writes | Use CONCURRENTLY |
| `CREATE INDEX CONCURRENTLY` | ShareUpdateExclusiveLock | Safe; can fail on duplicates |
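The `SET NOT NULL` row above, spelled out for PG ≥ 12 using the worked example's column (a sketch; the constraint name is arbitrary):

```sql
-- 1. Add the check without scanning (brief lock only)
ALTER TABLE catalog_products
  ADD CONSTRAINT list_price_not_null
  CHECK (list_price IS NOT NULL) NOT VALID;

-- 2. Validate with a table scan that does not block writes
ALTER TABLE catalog_products VALIDATE CONSTRAINT list_price_not_null;

-- 3. SET NOT NULL skips its own scan: the validated check already proves it
ALTER TABLE catalog_products
  ALTER COLUMN list_price SET NOT NULL;

-- 4. The check constraint is now redundant
ALTER TABLE catalog_products DROP CONSTRAINT list_price_not_null;
```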
### Anti-patterns

- **`CREATE INDEX` without `CONCURRENTLY` on production tables.** Blocks writes for the duration of the build.
- **`ADD COLUMN` with a volatile default.** Forces a full table rewrite under AccessExclusiveLock.
- **`ALTER COLUMN ... TYPE` between incompatible types.** Forces a full table rewrite. Use the four-phase pattern.
- **Skipping `lock_timeout`.** Always run `SET lock_timeout = '5s';` before a DDL statement so it fails fast instead of queueing behind a long transaction.