Database Constraints as Tripwires: Three Bugs Caught Red-Handed
A Yuwei postmortem: 23514, 23P01, and a batch of ghost seats--constraints aren't just defenses, they're the most honest alarms.
I’ve previously written about how Yuwei prevents overselling with three layers of protection (row locks, optimistic locks, CHECK constraints). This post is about the other side of that design: constraints aren’t just defenses, they’re alarms. Since launch there have been three bugs, and none of them surfaced as a user reporting “the data looks wrong”–in every case a constraint blew up first, with the PostgreSQL error code nailed right into the logs. When a constraint fires, it means some code path broke the rules–and it died before the write landed.
Case 1: Ghost seats–one business action, two implementation paths
Symptom: After the system auto-cancelled a timed-out order, another order tried to pick its old seat and got blocked at creation with “seat already taken by another order.” Checking the database: no active order held that seat. What held it was a leftover seat_assignments row whose owner had long since moved to timeout_cancelled.
Root cause: Cancelling an order in Yuwei has more than one path. Manual cancellation (cancelOrder) correctly deleted the seat binding; but the countdown service’s timeout rollback (rollbackOne) is a separate entry point, and it internally splits into two branches:
- Branch A: the tenant enabled “auto-cancel on expiry,” and the order moves to
timeout_cancelled–seats should be released; - Branch B: the order moves to
pending_overdue(overdue, awaiting action)–by design, seats stay held until an operator decides manually.
When writing branch A, all the attention went to the order state machine: order.status got updated, the seat binding got forgotten. That’s not a carelessness problem, it’s a structural one: the full semantics of the “cancel” business action (update status + release seats + sync accommodation) are scattered across every call site, and every new entry point relies on a human remembering all of it.
The fix: a tx.seatAssignment.deleteMany({ where: { tenantId, orderId } }) right after the status update–five minutes of work. What’s actually valuable is the 13 static invariant assertions added alongside–a set of tests that scan the source code directly, hard-enforcing that “calls that delete seat bindings may only appear in cancelOrder and rollbackOne”:
// order-seat-release.service.spec.ts (sketch)
const ALLOWED = ['cancelOrder', 'rollbackOne'];
const FORBIDDEN = ['confirmOrder', 'reducePax', 'renewExpire', 'updatePassengers'];
it.each(FORBIDDEN)('%s must never contain seatAssignment.deleteMany', (fn) => {
expect(sourceOf(fn)).not.toContain('seatAssignment.deleteMany');
});
From now on, if someone deletes seats in the wrong place, or adds a third cancellation path and forgets the deletion again, CI goes red. Code review relies on human eyes to prevent regressions, and it can’t stop your future self three months out; invariant tests turn the “rules” into code that bites.
Case 2: 23514–one CHECK makes a whole class of orders uncancellable
After the multi-day tour version (V1.5) shipped, any order with accommodation errored out on cancellation. The logs showed PostgreSQL’s 23514 check_violation, constraint name order_acc_status_check. Its definition, roughly:
CHECK (
status NOT IN ('manual_cancelled', 'timeout_cancelled', ...) -- non-terminal states: anything goes
OR accommodation_status IN ('released', 'not_required') -- terminal states: must be settled
)
In other words, when an order reaches a terminal state, its accommodation resources must already be settled. But cancelOrder, reducePax down to 0, and the countdown rollback–all three paths only wrote status, and nobody touched accommodation_status. So the database rejected every cancellation, expiring pending orders failed to roll back too, and inventory was never released.
At first glance this looks like the constraint firing on friendly targets: it made orders completely uncancellable–worse than having no constraint at all? Run the numbers and it’s exactly the opposite. Without this CHECK, cancellations would “succeed,” but every cancelled multi-day tour order would quietly leak one room’s worth of inventory. That kind of loss throws no errors; it only shows up weeks later as a number that won’t reconcile during stocktaking, followed by a round of cross-table archaeology. The constraint turned it into an explicit bug that blew up on launch day, was located the same day, and was fixed the same day–that’s the value of an alarm: converting a chronic internal injury into an acute external one.
The fix is again about convergence. Instead of patching a status assignment into each of the three call sites, write one pure function that pins the rule down in a single place:
function accommodationStatusForCancelled(current: AccStatus | null): AccStatus | null {
if (current === null) return null; // no accommodation involved
if (current === 'not_required' || current === 'released') return current;
return 'released'; // everything else gets released
}
The in-transaction order-locking utility lockOrderInTx outputs this field directly, and all cancellation paths write it via a uniform spread. When a fourth cancellation path gets added someday, getting it wrong will be hard–the incorrect version is actually more work to write.
Case 3: 23P01–EXCLUDE constraints validate intermediate states
The pricing table uses an EXCLUDE constraint to guarantee that “active ranges for the same channel and route never overlap”:
ALTER TABLE pricing ADD CONSTRAINT pricing_no_overlap
EXCLUDE USING gist (
channel_id WITH =,
route_id WITH =,
daterange(valid_from, valid_to, '[]') WITH &&
);
One operation is a nested split: inserting [5/15, 5/20] into the middle of an existing [5/1, 5/31], splitting the old range into [5/1, 5/14] and [5/21, 5/31]. The first implementation did create for the tail segment first, then update to trim the old range. Test case TC-PRICE-003 ran once: 23P01 exclusion_violation.
The reason is worth committing to muscle memory: non-DEFERRABLE constraints validate the immediate state after each statement, not the final state at transaction commit. At the moment the tail segment [5/21, 5/31] is created, the old record is still the full [5/1, 5/31], and the two collide head-on in the GiST index. Transaction atomicity only guarantees “all or nothing”–it never guarantees “the intermediate steps are exempt from checks.”
Two ways out. Declare the constraint DEFERRABLE INITIALLY DEFERRED and push validation to commit time–but that makes every transaction touching this table lose per-statement immediate feedback, loosening a global belt for one call site’s convenience. Not worth it. The other is to straighten out the statement order: update the old range first, trimming it down to the front segment (clearing the space), then create the tail segment. After the change, it turned out the correct ordering is also the version that reads more naturally to a human–shrink first, then fill in. The code a constraint forces you to write is often the code you should have written anyway.
Wrapping up
Three bugs, three constraint firings (CHECK x 2, EXCLUDE x 1), with these in common:
- When a constraint fires, the dirty data hasn’t landed yet–all three died before the write, so cleanup cost was zero;
- Every fix went beyond fixing the bug and converged the rule to a single point: invariant tests, a pure function, an explicit statement order. Constraints cover the last meter; convergence makes sure the next person never gets to the last meter;
- The counterfactual is clear: had I skipped these constraints to save the hassle, all three bugs would have survived for weeks as silent data corruption–ghost seats shrinking sellable inventory, accommodation leaks inflating costs, overlapping ranges producing two prices for the same day.
Writing a constraint takes ten minutes. Doing archaeology on a pile of dirty data takes weeks. That trade is worth making every single time.