TO_VESPER - FLAG-037 Scope Extension Pre-Code Findings
To: Vesper (COO)
From: Orion (Director of Engineering)
CC: Katja (Founder & CEO), Atlas (CSO)
Date: 2026-04-21
Re: Atlas Ruling — FLAG-037 Reconciler Correction (CANCELLED_BY_ENGINE layer)
Tasking: 03 Branches/fix-reconciler-disappeared-order-conservative/[C] Orion Tasking — FLAG-037 Scope Extension (CANCELLED_BY_ENGINE).md
Summary
Pre-code investigation for the Atlas-ordered CANCELLED_BY_ENGINE layer on top of the existing FLAG-037 age-threshold work (C1–C4, already approved). No branch created, no code written. Investigation confirms all five questions have clean answers; implementation is well-scoped.
Biggest items to flag up-front:
- Enum spelling deviation from existing code. Atlas's spec uses British spelling (CANCELLED_BY_ENGINE, double-L). The existing OrderStatus enum uses US spelling (CANCELED, single-L). I will use Atlas's exact string as written in the ruling (CANCELLED_BY_ENGINE); this is both the on-disk value and the enum-name convention for the new value. The existing CANCELED value is untouched. Flagging so Vesper knows this is a deliberate, spec-compliant exception, not a typo.
- No race condition to solve. The main loop is synchronous. _cancel_all_live_orders runs inside _enter_degraded_mode in the same tick; the next reconciler pass is on the next tick. Writing the DB in _cancel_all_live_orders before the function returns naturally precedes the reconciler seeing the disappeared order. Atlas's "ordering matters - no race conditions" is already structurally satisfied; we just need to actually write.
- Schema migration trivial. The orders table status column is TEXT with no CHECK constraint, and the codebase already has an _ensure_column pattern for backward-compat column adds. cancelled_at and cancel_reason both slot in as nullable TEXT columns. CANCELLED_BY_ENGINE as a string value is accepted immediately; no DDL change for the status domain.
- Local main drift (already flagged in FLAG-044 delivery). My local main does not include the FLAG-041, FLAG-042, or FLAG-044 commits; Katja's real main does. C6's edits to main_loop.py (_cancel_all_live_orders, _enter_degraded_mode) will sit adjacent to FLAG-042 changes in the same file. Katja applies patches on her machine, where main is current; the git am path will either apply cleanly or surface a fuzz/conflict for her to show me. Flagging again so Vesper is not surprised.
Q1 - cancel_all flow
Method: NEOEngine._cancel_all_live_orders(self, context: str) -> int
Location: neo_engine/main_loop.py:1201
Call sites:
- _cancel_live_orders_on_shutdown (main_loop.py:1264) — shutdown path.
- _enter_degraded_mode (main_loop.py:1379) — inside if not already_degraded: block. This is the DEGRADED-entry call the CANCELLED_BY_ENGINE layer must protect.
Order objects available at the call site:
active_orders = self._state.get_active_orders() # List[Order] dataclasses
cancellable = [o for o in active_orders if o.offer_sequence is not None]
The Order dataclass (neo_engine/models.py:97) exposes every field we need:
- id: str — primary key for the UPDATE.
- status: OrderStatus — currently ACTIVE or PARTIALLY_FILLED (filter of get_active_orders).
- offer_sequence: Optional[int] — the on-ledger sequence.
- side: OrderSide, created_at, updated_at, filled_quantity, etc. — all available if we need them.
What the method currently does: fetches active orders, iterates, submits OfferCancel to the gateway, logs per-order success/failure, returns sent count. No DB write for cancellation intent. The order stays in the DB as ACTIVE/PARTIALLY_FILLED until either the reconciler sees it disappear (current pathology) or a fill arrives.
DB write methods available:
- self._state.update_order_status(order_id, new_status, *, ...) — accepts new_status: OrderStatus plus optional tx hashes / filled_qty, but does not currently accept a cancelled_at kwarg.
- self._state exposes _transaction, set_engine_state, etc. — full DB surface is available.
Proposed call pattern (implementation spec in C6 below): add a new StateManager.mark_cancelled_by_engine(order_id, cancelled_at, reason=None) method so intent is legible at the call site, and invoke it before the gateway submit_offer_cancel. Writing first means a failed gateway submit still leaves the DB in a safe state (marked CANCELLED_BY_ENGINE, reconciler will skip); a successful gateway submit followed by DB failure is the worst case and matches today's behavior (phantom fill on disappeared order) — no regression.
Alternative considered and rejected: extending update_order_status with a cancelled_at kwarg. Works, but obscures the intent token at the call site. Dedicated method is clearer and matches Atlas's "engine intent must override inference" principle.
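A minimal sketch of what the dedicated method could look like, written as a free function over a plain sqlite3 connection rather than as a StateManager method. The reduced table, the use of `with conn:` in place of the codebase's _transaction helper, and the boolean return value are all assumptions made for illustration; only the method name, its arguments, and the four updated columns come from the plan above.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def mark_cancelled_by_engine(
    conn: sqlite3.Connection,
    order_id: str,
    cancelled_at: str,
    reason: Optional[str] = None,
) -> bool:
    """Record engine cancel intent; return True if exactly one row changed."""
    with conn:  # commits on success, rolls back if the UPDATE raises
        cur = conn.execute(
            "UPDATE orders SET status = ?, cancelled_at = ?, cancel_reason = ?, "
            "updated_at = ? WHERE id = ?",
            ("CANCELLED_BY_ENGINE", cancelled_at, reason, cancelled_at, order_id),
        )
    return cur.rowcount == 1


# Demo against a reduced, in-memory stand-in for the orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL, "
    "updated_at TEXT NOT NULL, cancelled_at TEXT, cancel_reason TEXT)"
)
conn.execute(
    "INSERT INTO orders VALUES ('o1', 'active', '2026-04-21T00:00:00+00:00', NULL, NULL)"
)

now_iso = datetime.now(timezone.utc).isoformat()
updated = mark_cancelled_by_engine(conn, "o1", now_iso, reason="Degraded entry cancel")
row = conn.execute("SELECT status, cancel_reason FROM orders WHERE id = 'o1'").fetchone()
missing = mark_cancelled_by_engine(conn, "no-such-order", now_iso)
```

Returning whether a row was actually updated is one way to let the caller notice an unknown order id; the real method may choose to raise or log instead.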
Q2 - Orders table schema
Base DDL (state_manager.py:433, inside initialize_database):
CREATE TABLE IF NOT EXISTS orders (
    id TEXT PRIMARY KEY,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    status TEXT NOT NULL,
    side TEXT NOT NULL,
    price REAL NOT NULL,
    quantity REAL NOT NULL,
    filled_quantity REAL NOT NULL DEFAULT 0.0,
    submit_tx_hash TEXT,
    offer_sequence INTEGER,
    cancel_tx_hash TEXT,
    failure_reason TEXT,
    client_order_id TEXT
);
Additional columns added post-hoc via _ensure_column (state_manager.py:739–744 and surrounding block):
- strategy_version TEXT, parameter_set_id INTEGER, quote_survival_time_s REAL, ladder_level INTEGER, quote_microprice REAL, session_id INTEGER
status field: exists, TEXT, NOT NULL. No CHECK constraint on the domain — any string is accepted. Current valid values (from OrderStatus enum at models.py:19):
- pending_submission, dry_run_skipped, paper_active, submitted, active, partially_filled, filled, cancel_pending, canceled (US spelling, single-L), failed.
cancelled_at: does NOT exist. Requires a new column.
cancel_reason: does NOT exist. Requires a new column (Atlas lists this as optional; I will include it for auditability — cheap, one line).
Migration pattern (_ensure_column helper at state_manager.py:90):
def _ensure_column(conn, table_name, column_name, definition):
    if not _column_exists(conn, table_name, column_name):
        conn.execute(f"ALTER TABLE {table_name} ADD COLUMN {column_name} {definition}")
This is the established pattern — idempotent (no-op on re-run), backward-compat (existing rows get NULL), no crash on existing DBs. Matches D2, Corridor Guard, Distance-to-Touch, FLAG-037 C1–C4, and every other column add in the codebase.
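The idempotence claim can be checked with a small, self-contained sketch. The body of _column_exists below is an assumption (the memo only says the helper wraps PRAGMA table_info), and the two-column orders table is a reduced stand-in for the real schema.

```python
import sqlite3


def _column_exists(conn, table_name, column_name):
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return any(row[1] == column_name for row in rows)


def _ensure_column(conn, table_name, column_name, definition):
    if not _column_exists(conn, table_name, column_name):
        conn.execute(f"ALTER TABLE {table_name} ADD COLUMN {column_name} {definition}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('legacy', 'active')")

for _ in range(2):  # the second pass must be a no-op, not an error
    _ensure_column(conn, "orders", "cancelled_at", "TEXT")
    _ensure_column(conn, "orders", "cancel_reason", "TEXT")

columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
legacy = conn.execute("SELECT cancelled_at, cancel_reason FROM orders").fetchone()

# No CHECK constraint on status, so the new string value is accepted as-is.
conn.execute("UPDATE orders SET status = 'CANCELLED_BY_ENGINE' WHERE id = 'legacy'")
new_status = conn.execute("SELECT status FROM orders WHERE id = 'legacy'").fetchone()[0]
```

The legacy row reads back NULL for both new columns, which is the backward-compat behavior the pattern relies on.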
Verdict: Atlas's preferred implementation (status = CANCELLED_BY_ENGINE, cancelled_at timestamp) requires:
- No DDL change for the status domain — add a new value to OrderStatus enum, write the string as "CANCELLED_BY_ENGINE" (Atlas's spelling), and existing DBs accept it without any migration.
- Two column adds for cancelled_at TEXT and cancel_reason TEXT via _ensure_column. Nullable, no default, legacy rows safely NULL.
No schema complication, no downtime, no backfill required.
Q3 - Reconciler read path
Function: _handle_disappeared_active_order(order, engine, result, reconciler) at neo_engine/ledger_reconciler.py:673.
order object: an Order dataclass instance (same type returned by StateManager.get_active_orders → _row_to_order). Passed by value from the iteration loop at ledger_reconciler.py:568:
order.status is read directly in the existing function body (line 704):
So the CANCELLED_BY_ENGINE guard check is a simple attribute compare: order.status == OrderStatus.CANCELLED_BY_ENGINE.
No DB lookup required for the guard. The order was fetched with its status already populated. The guard reads a dataclass field in-process — zero cost.
One caveat for the dataclass: _row_to_order (state_manager.py:2680) reads fields from a sqlite3.Row. If we add cancelled_at and cancel_reason to the Order dataclass, _row_to_order must also populate them with the defensive pattern already used for other post-hoc columns:
cancelled_at=row["cancelled_at"] if "cancelled_at" in row.keys() else None,
cancel_reason=row["cancel_reason"] if "cancel_reason" in row.keys() else None,
This keeps the reader compatible with DBs that haven't run the migration yet (defensive; _ensure_column in initialize_database will always run on engine start, so in practice this branch is never hit after startup — but it matches the established pattern for every other optional field).
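A short sketch of the defensive read, run once against a pre-migration table and once after the column is added. The helper name read_optional is hypothetical; the real _row_to_order inlines the same expression per field.

```python
import sqlite3


def read_optional(row: sqlite3.Row, column: str):
    # Same guard as in _row_to_order: tolerate DBs that lack the column.
    return row[column] if column in row.keys() else None


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('o1', 'active')")

# Pre-migration: the column is absent, so the guard falls back to None.
old_row = conn.execute("SELECT * FROM orders").fetchone()
before = read_optional(old_row, "cancelled_at")

# Post-migration: the column exists and its value is returned.
conn.execute("ALTER TABLE orders ADD COLUMN cancelled_at TEXT")
conn.execute("UPDATE orders SET cancelled_at = '2026-04-21T12:00:00+00:00'")
new_row = conn.execute("SELECT * FROM orders WHERE id = 'o1'").fetchone()
after = read_optional(new_row, "cancelled_at")
```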
Canonical evaluation order (per Atlas + tasking), to be coded exactly as:
# (1) CANCELLED_BY_ENGINE check: first. Never phantom fill.
if order.status == OrderStatus.CANCELLED_BY_ENGINE:
    log.info("RECONCILER_SKIP_ENGINE_CANCEL", extra={...})
    # finalize: no inventory change, no phantom fill, no delta
    return
# (2) cancel_tx_hash check: existing cancel-race short-circuit.
if order.cancel_tx_hash is not None:
    ...
    result.cancel_races += 1
    return
# (3) Age-threshold gate: existing FLAG-037 C2 logic (unchanged).
if cons_cfg.enabled:
    ...
The existing function already has (2) and (3) in that order. Adding (1) as the first block is a prepend — no reshuffle of the downstream branches.
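The precedence of the three checks can be illustrated with a runnable toy model. Everything here is a simplified stand-in: the two-member enum, the reduced Order dataclass, the age_s field, and the returned outcome strings are assumptions for illustration, and the age gate is collapsed to a single threshold compare (younger than threshold takes the existing phantom-fill path, otherwise held_pending_review).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrderStatus(str, Enum):
    ACTIVE = "active"
    CANCELLED_BY_ENGINE = "CANCELLED_BY_ENGINE"


@dataclass
class Order:
    id: str
    status: OrderStatus
    cancel_tx_hash: Optional[str] = None
    age_s: float = 0.0  # hypothetical: seconds since the order went live


def classify_disappeared(order: Order, age_threshold_s: float) -> str:
    # (1) Engine intent wins over any inference from ledger state.
    if order.status == OrderStatus.CANCELLED_BY_ENGINE:
        return "skip_engine_cancel"
    # (2) A recorded cancel tx means this is a cancel race, not a fill.
    if order.cancel_tx_hash is not None:
        return "cancel_race"
    # (3) Age gate for genuinely ambiguous disappearances.
    if order.age_s >= age_threshold_s:
        return "held_pending_review"
    return "phantom_fill"


outcomes = [
    classify_disappeared(Order("a", OrderStatus.CANCELLED_BY_ENGINE, "ABC", 999.0), 60.0),
    classify_disappeared(Order("b", OrderStatus.ACTIVE, "ABC", 999.0), 60.0),
    classify_disappeared(Order("c", OrderStatus.ACTIVE, None, 120.0), 60.0),
    classify_disappeared(Order("d", OrderStatus.ACTIVE, None, 5.0), 60.0),
]
```

Order "a" carries both a CANCELLED_BY_ENGINE status and a cancel_tx_hash; because the engine-intent check is first, it is skipped rather than counted as a cancel race.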
Q4 - Write-before-reconcile ordering
Structural guarantee: the reconciler is invoked synchronously at Step 5 of the main tick (main_loop.py:2455):
DEGRADED entry (_enter_degraded_mode) is invoked later in the same tick — it can be triggered from multiple sources (truth-check in Step 8.5, reconciler held_pending_review, guard evaluators in Step 8.5a/b/c/d, recovery evaluators in Step 8.4). None of those paths are async; all are plain sync method calls inside the same tick's handler.
Inside _enter_degraded_mode, on first entry (not already_degraded), _cancel_all_live_orders("Degraded entry cancel") is called. The method is sync: it calls get_active_orders (DB read), iterates, submits OfferCancel to the gateway (sync, xrpl-py submit), and returns.
Gap between DB writes in _cancel_all_live_orders and the next reconciler pass: exactly one tick boundary. On tick N, DEGRADED entry writes the CANCELLED_BY_ENGINE status to the DB. On tick N+1, Step 5 (Reconcile) calls _reconciler.reconcile, which calls get_active_orders. Note that get_active_orders filters on status IN (ACTIVE, PARTIALLY_FILLED), so once the status is CANCELLED_BY_ENGINE the order is no longer returned, whereas the canonical path per Atlas assumes the reconciler does see the disappeared order and evaluates the guard. Two behaviors are possible depending on ordering:
- Path A: status update first, then gateway submit. Order is UPDATE'd to CANCELLED_BY_ENGINE → no longer in get_active_orders result → reconciler never sees it as "disappeared" → phantom-fill code path never runs. Guard is moot here because the order is simply excluded.
- Path B: gateway submit first, then status update. OfferCancel lands, order disappears from ledger, but DB still says ACTIVE; reconciler on next tick sees disappeared order, falls into _handle_disappeared_active_order, and would phantom-fill (current pathology).
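The difference between the two paths can be sketched with a toy model in which the DB is a dict of order id to status and the ledger is a set of live order ids. The function name disappeared_orders and the data shapes are hypothetical; the only behavior taken from the memo is the ACTIVE/PARTIALLY_FILLED filter.

```python
ACTIVE_STATUSES = {"active", "partially_filled"}


def disappeared_orders(db, ledger_offers):
    """Orders the DB still considers live but the ledger no longer shows."""
    return sorted(
        order_id
        for order_id, status in db.items()
        if status in ACTIVE_STATUSES and order_id not in ledger_offers
    )


# Path A: the status write happens on tick N, then the cancel lands.
db_a, ledger_a = {"o1": "active"}, {"o1"}
db_a["o1"] = "CANCELLED_BY_ENGINE"
ledger_a.discard("o1")
path_a = disappeared_orders(db_a, ledger_a)  # what tick N+1 would examine

# Path B: the cancel lands but the DB still says active at the next tick.
db_b, ledger_b = {"o1": "active"}, {"o1"}
ledger_b.discard("o1")
path_b = disappeared_orders(db_b, ledger_b)
```

Under Path A the reconciler has nothing to examine; under Path B the order shows up as disappeared and would reach the phantom-fill logic.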
Atlas's guard is written for the defensive case: even if the order reaches _handle_disappeared_active_order, the CANCELLED_BY_ENGINE check skips it. This matters in two situations:
1. Failure recovery: if the DB write succeeded but the gateway submit failed, the OfferCancel never lands, but the order is marked CANCELLED_BY_ENGINE. The ledger reconciler compares get_active_orders against account_offers, and get_active_orders no longer returns this order (its status is CANCELLED_BY_ENGINE, outside the ACTIVE/PARTIALLY_FILLED filter), so the still-live offer is never examined. The order effectively becomes orphaned: live on ledger, while the engine thinks it's cancelled. This is a real concern. Recommendation in C6: DB write first, then gateway submit; on gateway failure, log WARNING and continue, since leaving a stale live offer is less harmful than a phantom fill on reconciliation. Atlas's "intent must override inference" says the engine's intent (cancelled) wins over ledger state.
2. Restart continuity: cancel_all fires on tick N, DB flushed, engine crashes before tick N+1. On restart, the reconciler reads the DB, sees CANCELLED_BY_ENGINE orders, correctly skips them. This is test #3.
Plan: C6 writes the DB first, then submits the gateway OfferCancel. On gateway exception, we log and continue. The reconciler guard (also in C6) is the backstop for any edge case where the order still reaches _handle_disappeared_active_order (e.g. if Path B ordering is ever accidentally reintroduced, or if a future code path creates a CANCELLED_BY_ENGINE-marked order that is still active on ledger, or the restart-continuity case where the DB is loaded fresh and the reconciler runs on the first tick).
No async gap. The engine is single-threaded at the tick level. No other thread touches _state concurrently.
Q5 - Schema migration safety
StateManager.initialize_database (state_manager.py:420):
- Uses CREATE TABLE IF NOT EXISTS for all tables → idempotent.
- _ensure_column helper (state_manager.py:90) wraps PRAGMA table_info + ALTER TABLE ADD COLUMN → idempotent, no-op if column already exists.
- Executed inside a _transaction context manager → atomic.
- Called unconditionally on every engine startup → migration runs automatically.
Existing column adds validate the pattern. FLAG-036 (wallet truth reconciliation), FLAG-037 C1–C4 (age-threshold anomaly log), Corridor Guard, Anchor Error Telemetry — all added columns via _ensure_column without issue. Katja's existing neo_live_stage1.db has survived dozens of schema additions this way.
Concrete C5 migration lines (to be added in the block at state_manager.py:~760):
# FLAG-037 ext — CANCELLED_BY_ENGINE persistence (Atlas ruling 2026-04-21)
_ensure_column(conn, "orders", "cancelled_at", "TEXT")
_ensure_column(conn, "orders", "cancel_reason", "TEXT")
Backward compat:
- Existing rows get NULL for both columns. No default needed — these columns are only meaningful for CANCELLED_BY_ENGINE-transitioned orders.
- _row_to_order reads each new column with the if "col" in row.keys() else None guard, matching the pattern for every other post-hoc column.
- status = "CANCELLED_BY_ENGINE" is a new string value in a TEXT column with no CHECK constraint — no domain migration needed. The OrderStatus enum just gets a new member; Python's str-Enum conversion accepts the value on read.
- No backfill required. Legacy orders that were cancelled pre-migration are already in terminal states (CANCELED, FILLED, FAILED) and will never hit the disappeared-order path.
Rollback safety: If this layer is reverted, the new columns remain in the DB (harmless: NULL on all subsequent rows). After a clean revert the enum no longer defines CANCELLED_BY_ENGINE, so if any CANCELLED_BY_ENGINE rows exist, the reverted code would fail the OrderStatus(row["status"]) conversion. Mitigation: a revert would need to update those rows to CANCELED first. Not a concern for forward delivery; flagging for completeness.
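The rollback hazard and its mitigation can be reproduced in a few lines. RevertedOrderStatus is a hypothetical two-member stand-in for the pre-change enum, and the one-statement UPDATE is only a sketch of the kind of fix-up a revert would need, not a vetted migration script.

```python
import sqlite3
from enum import Enum


class RevertedOrderStatus(str, Enum):
    # Stand-in for the enum as it would look after a clean revert.
    ACTIVE = "active"
    CANCELED = "canceled"


def load_status(raw: str) -> RevertedOrderStatus:
    return RevertedOrderStatus(raw)  # raises ValueError on unknown values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('o1', 'CANCELLED_BY_ENGINE')")

try:
    load_status("CANCELLED_BY_ENGINE")
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Mitigation: rewrite the rows to the existing terminal value before reverting.
conn.execute("UPDATE orders SET status = 'canceled' WHERE status = 'CANCELLED_BY_ENGINE'")
(raw,) = conn.execute("SELECT status FROM orders WHERE id = 'o1'").fetchone()
recovered = load_status(raw)
```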
Crash on existing DB: impossible. _ensure_column short-circuits if the column exists; CREATE TABLE IF NOT EXISTS is a no-op. Run on a fresh DB, the new columns appear alongside the base schema. Run on a populated DB, the columns are appended, existing rows untouched.
Spelling clarification (flagged up front)
Atlas's ruling uses British CANCELLED_BY_ENGINE (double-L).
Existing enum uses US CANCELED (single-L, per existing convention in the codebase).
I will use Atlas's exact string: CANCELLED_BY_ENGINE. This will be:
- The enum member name: OrderStatus.CANCELLED_BY_ENGINE
- The enum string value (on disk): "CANCELLED_BY_ENGINE"
- The log token fragment: RECONCILER_SKIP_ENGINE_CANCEL (Atlas-specified, unrelated to the spelling issue)
The existing OrderStatus.CANCELED = "canceled" is untouched. The two values coexist; each has distinct semantics:
- CANCELED — operator/external cancellation reached terminal state via confirm_cancel (existing).
- CANCELLED_BY_ENGINE — the engine itself cancelled this order as part of DEGRADED entry; must not be phantom-filled by the reconciler (new).
No rename of the existing value. No aliasing. Clean separation of intent.
Implementation plan (final, ready to commit once reviewed)
C5 — schema + dataclass + enum (models.py, state_manager.py)
- Add OrderStatus.CANCELLED_BY_ENGINE = "CANCELLED_BY_ENGINE".
- Add to Order dataclass: cancelled_at: Optional[str] = None, cancel_reason: Optional[str] = None.
- Add _ensure_column lines for cancelled_at, cancel_reason in initialize_database.
- Extend _row_to_order with defensive reads for both columns.
- Add StateManager.mark_cancelled_by_engine(order_id, cancelled_at, reason=None) method — UPDATE status + cancelled_at + cancel_reason + updated_at inside _transaction.
C6 — cancel_all integration + reconciler guard (main_loop.py, ledger_reconciler.py)
- In _cancel_all_live_orders: before self._gateway.submit_offer_cancel(...), call self._state.mark_cancelled_by_engine(order.id, cancelled_at=now_iso, reason=context). Wrap in try/except — on DB failure, log ERROR and skip the gateway submit for this order (don't risk orphaning a live offer). On gateway failure after DB success, log ERROR and continue (order is marked; reconciler guard is the backstop).
- In _handle_disappeared_active_order: add CANCELLED_BY_ENGINE check as the first block, before the cancel_tx_hash check. Log token: RECONCILER_SKIP_ENGINE_CANCEL. Return without any inventory change, no phantom fill, no delta, no anomaly row.
- No change to the age-threshold block (C2) — it runs after the new guard and the existing cancel-race check, unchanged.
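The C6 write-first loop and its two failure branches can be sketched as follows. FakeState and FakeGateway are hypothetical test doubles, the submit_offer_cancel argument is an assumption (the real gateway signature is not shown in this memo), and the free function cancel_all stands in for _cancel_all_live_orders.

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("cancel_all_sketch")


class FakeState:
    """Hypothetical stand-in for StateManager."""

    def __init__(self, failing_ids=()):
        self.failing_ids = set(failing_ids)
        self.marked = []

    def mark_cancelled_by_engine(self, order_id, cancelled_at, reason=None):
        if order_id in self.failing_ids:
            raise RuntimeError("simulated DB failure")
        self.marked.append(order_id)


class FakeGateway:
    """Hypothetical stand-in for the gateway; the real signature may differ."""

    def __init__(self, failing_sequences=()):
        self.failing_sequences = set(failing_sequences)
        self.attempted = []

    def submit_offer_cancel(self, offer_sequence):
        self.attempted.append(offer_sequence)
        if offer_sequence in self.failing_sequences:
            raise RuntimeError("simulated gateway failure")


def cancel_all(state, gateway, orders, context):
    """orders: iterable of (order_id, offer_sequence). Returns cancels sent."""
    sent = 0
    for order_id, offer_sequence in orders:
        now_iso = datetime.now(timezone.utc).isoformat()
        try:
            # Intent is persisted before anything is sent to the ledger.
            state.mark_cancelled_by_engine(order_id, cancelled_at=now_iso, reason=context)
        except Exception:
            log.error("mark failed for %s; skipping gateway submit", order_id)
            continue
        try:
            gateway.submit_offer_cancel(offer_sequence)
            sent += 1
        except Exception:
            log.error("cancel submit failed for %s; order stays marked", order_id)
    return sent


state = FakeState(failing_ids={"o2"})
gateway = FakeGateway(failing_sequences={103})
sent = cancel_all(
    state, gateway, [("o1", 101), ("o2", 102), ("o3", 103)], "Degraded entry cancel"
)
```

In this run o2 is never submitted because its DB write failed, while o3 stays marked even though its gateway submit failed.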
C7 — tests (tests/test_reconciler_cancelled_by_engine.py)
Five mandatory tests per Atlas:
1. DEGRADED cancel test — engine enters DEGRADED, cancel_all fires, reconciler runs on the same order IDs on the next tick; asserts: no phantom fills, inventory unchanged, RECONCILER_SKIP_ENGINE_CANCEL log emitted.
2. Ambiguous disappearance test — order NOT CANCELLED_BY_ENGINE, no cancel_tx_hash; age < threshold → phantom fill path (existing FLAG-037 behavior). Age ≥ threshold → held_pending_review. Confirms age-gate (C2) still works for genuinely ambiguous cases.
3. Restart continuity test — populate DB with CANCELLED_BY_ENGINE rows; instantiate fresh StateManager + LedgerReconciler; run reconcile with empty account_offers; assert: skip path fires, no phantom fills, inventory unchanged.
4. Mixed case test — DB with a mix of CANCELLED_BY_ENGINE, ambiguous disappeared, and cancel_tx_hash orders; reconcile; assert: only ambiguous ones traverse the age-gate, CANCELLED_BY_ENGINE ones are skipped cleanly, cancel_race path fires for cancel_tx_hash rows.
5. Truth gate preservation test — DEGRADED cancel_all + reconciler pass with CANCELLED_BY_ENGINE orders; assert wallet truth delta is zero (no artificial divergence from the cancellation path).
Tests use the Windows-safe teardown pattern already established (close StateManager before TemporaryDirectory cleanup, per FLAG-037 C4).
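A reduced sketch of how the restart continuity test and the teardown ordering fit together, using raw sqlite3 connections in place of StateManager and LedgerReconciler. The should_skip helper and the two-column table are hypothetical; the point is that every connection is closed before TemporaryDirectory cleanup runs, since an open handle on the DB file blocks deletion on Windows.

```python
import os
import sqlite3
import tempfile


def should_skip(status: str) -> bool:
    # Stand-in for the reconciler guard.
    return status == "CANCELLED_BY_ENGINE"


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "neo_test.db")

    # First "engine run": persist the cancel intent, then go away.
    first = sqlite3.connect(db_path)
    first.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    first.execute("INSERT INTO orders VALUES ('o1', 'CANCELLED_BY_ENGINE')")
    first.commit()
    first.close()

    # Fresh "engine run": the status is read back from disk and the guard fires.
    second = sqlite3.connect(db_path)
    try:
        (status,) = second.execute("SELECT status FROM orders WHERE id = 'o1'").fetchone()
        skipped = should_skip(status)
    finally:
        second.close()  # close before the directory is cleaned up

tmp_removed = not os.path.exists(tmp)
```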
Total final test count: 15 (10 pre-existing age-threshold + 5 new). Full regression target: all targeted suites green, pre-existing 10 errors from test_reconciler_anomaly_log.py unchanged.
Risks / caveats
- Spelling deviation (flagged above). Using Atlas's British spelling, existing US spelling preserved.
- Local main drift from Katja's main. My sandbox's main lacks FLAG-041/042/044. C6's main_loop.py edits may have textual adjacency to FLAG-042 changes in _enter_degraded_mode. git am on Katja's machine will either apply cleanly or surface a fuzz for her to screenshot. Pre-flagging so no one is surprised.
- Dedicated mark_cancelled_by_engine method vs. extending update_order_status. I'm going with the dedicated method for intent legibility. Will flip to the kwarg extension if Vesper prefers.
- cancel_reason inclusion. Atlas listed it as optional. I'm including it (cheap, auditable, and the context arg to _cancel_all_live_orders already gives us the right value: "Degraded entry cancel" or "Shutdown cancel"). Can drop to bare cancelled_at if preferred.
- No branch pre-created. Per standing rules, investigation stays on main. Branch will be created at commit time and patches delivered to 08 Patches/fix-reconciler-disappeared-order-conservative-ext/ as spec'd in the tasking memo.
Ready to proceed
If Vesper approves the spelling + method-shape decisions above, I'll proceed to C5 → C6 → C7 → patch bundle → delivery memo. Expected delivery same session. No further blockers identified.
— Orion Director of Engineering, BlueFly AI Enterprises