TO_VESPER — FLAG-037 Scope Extension Pre-Code Findings

To: Vesper (COO)
From: Orion (Director of Engineering)
CC: Katja (Founder & CEO), Atlas (CSO)
Date: 2026-04-21
Re: Atlas Ruling: FLAG-037 Reconciler Correction (CANCELLED_BY_ENGINE layer)
Tasking: 03 Branches/fix-reconciler-disappeared-order-conservative/[C] Orion Tasking — FLAG-037 Scope Extension (CANCELLED_BY_ENGINE).md


Summary

Pre-code investigation for the Atlas-ordered CANCELLED_BY_ENGINE layer on top of the existing FLAG-037 age-threshold work (C1–C4, already approved). No branch created, no code written. Investigation confirms all five questions have clean answers; implementation is well-scoped.

Biggest items to flag up-front:

  1. Enum spelling deviation from existing code. Atlas's spec uses British spelling (CANCELLED_BY_ENGINE, double-L). The existing OrderStatus enum uses US spelling (CANCELED, single-L). I will use Atlas's exact string as written in the ruling (CANCELLED_BY_ENGINE) — this is the on-disk value and the enum-name convention for the new value. The existing CANCELED value is untouched. Flagging so Vesper knows this is a deliberate, spec-compliant exception, not a typo.

  2. No race condition to solve. The main loop is synchronous. _cancel_all_live_orders runs inside _enter_degraded_mode in the same tick; the next reconciler pass is on the next tick. Writing the DB in _cancel_all_live_orders before the function returns naturally precedes the reconciler seeing the disappeared order. Atlas's "ordering matters — no race conditions" is already structurally satisfied; we just need to actually write.

  3. Schema migration trivial. Orders table status column is TEXT with no CHECK constraint, and the codebase already has an _ensure_column pattern for backward-compat column adds. cancelled_at and cancel_reason both slot in as nullable TEXT columns. CANCELLED_BY_ENGINE as a string value is accepted immediately; no DDL change for the status domain.

  4. Local main drift (already flagged in FLAG-044 delivery). My local main does not include FLAG-041, FLAG-042, or FLAG-044 commits — Katja's real main does. C6's edits to main_loop.py (_cancel_all_live_orders, _enter_degraded_mode) will sit adjacent to FLAG-042 changes in the same file. Katja applies patches on her machine, where main is current; the git am path will either apply cleanly or surface a fuzz/conflict for her to show me. Flagging again so Vesper is not surprised.


Q1 — cancel_all flow

Method: NEOEngine._cancel_all_live_orders(self, context: str) -> int
Location: neo_engine/main_loop.py:1201
Call sites:
  • _cancel_live_orders_on_shutdown (main_loop.py:1264): shutdown path.
  • _enter_degraded_mode (main_loop.py:1379): inside the if not already_degraded: block. This is the DEGRADED-entry call the CANCELLED_BY_ENGINE layer must protect.

Order objects available at the call site:

active_orders = self._state.get_active_orders()            # List[Order] dataclasses
cancellable   = [o for o in active_orders if o.offer_sequence is not None]

The Order dataclass (neo_engine/models.py:97) exposes every field we need:
  • id: str, the primary key for the UPDATE.
  • status: OrderStatus, currently ACTIVE or PARTIALLY_FILLED (the filter applied by get_active_orders).
  • offer_sequence: Optional[int], the on-ledger sequence.
  • side: OrderSide, created_at, updated_at, filled_quantity, etc., all available if we need them.

What the method currently does: fetches active orders, iterates, submits OfferCancel to the gateway, logs per-order success/failure, returns sent count. No DB write for cancellation intent. The order stays in the DB as ACTIVE/PARTIALLY_FILLED until either the reconciler sees it disappear (current pathology) or a fill arrives.

DB write methods available:
  • self._state.update_order_status(order_id, new_status, *, ...): accepts new_status: OrderStatus plus optional tx hashes / filled_qty, but does not currently accept a cancelled_at kwarg.
  • self._state exposes _transaction, set_engine_state, etc., so the full DB surface is available.

Proposed call pattern (implementation spec in C6 below): add a new StateManager.mark_cancelled_by_engine(order_id, cancelled_at, reason=None) method so intent is legible at the call site, and invoke it before the gateway submit_offer_cancel. Writing first means a failed gateway submit still leaves the DB in a safe state (marked CANCELLED_BY_ENGINE, reconciler will skip); a successful gateway submit followed by DB failure is the worst case and matches today's behavior (phantom fill on disappeared order) — no regression.

Alternative considered and rejected: extending update_order_status with a cancelled_at kwarg. Works, but obscures the intent token at the call site. Dedicated method is clearer and matches Atlas's "engine intent must override inference" principle.
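For illustration only, a minimal sketch of what the dedicated method could look like. It assumes a plain sqlite3 connection stands in for StateManager and that `with conn` plays the role of the real _transaction context manager; the free-function form, the reduced table, and reusing cancelled_at as the updated_at value are assumptions of this sketch, not the existing code.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

CANCELLED_BY_ENGINE = "CANCELLED_BY_ENGINE"  # on-disk status string from the ruling


def mark_cancelled_by_engine(
    conn: sqlite3.Connection,
    order_id: str,
    cancelled_at: str,
    reason: Optional[str] = None,
) -> bool:
    """Record engine cancel intent for one order; return True if a row changed."""
    # `with conn` commits on success and rolls back on exception. It stands in
    # for the real _transaction context manager (assumption of this sketch).
    with conn:
        cur = conn.execute(
            "UPDATE orders SET status = ?, cancelled_at = ?, cancel_reason = ?, "
            "updated_at = ? WHERE id = ?",
            (CANCELLED_BY_ENGINE, cancelled_at, reason, cancelled_at, order_id),
        )
    return cur.rowcount == 1


# Usage against a reduced, in-memory orders table (hypothetical subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL, "
    "updated_at TEXT NOT NULL, cancelled_at TEXT, cancel_reason TEXT)"
)
conn.execute(
    "INSERT INTO orders VALUES ('o1', 'active', '2026-04-21T00:00:00+00:00', NULL, NULL)"
)
now_iso = datetime.now(timezone.utc).isoformat()
changed = mark_cancelled_by_engine(conn, "o1", now_iso, reason="Degraded entry cancel")
missing = mark_cancelled_by_engine(conn, "nope", now_iso)
row = conn.execute("SELECT status, cancel_reason FROM orders WHERE id = 'o1'").fetchone()
```

Returning a boolean lets the caller distinguish "row marked" from "no such order"; whether the real method should return anything is an open design choice.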


Q2 — Orders table schema

Base DDL (state_manager.py:433, inside initialize_database):

CREATE TABLE IF NOT EXISTS orders (
    id               TEXT    PRIMARY KEY,
    created_at       TEXT    NOT NULL,
    updated_at       TEXT    NOT NULL,
    status           TEXT    NOT NULL,
    side             TEXT    NOT NULL,
    price            REAL    NOT NULL,
    quantity         REAL    NOT NULL,
    filled_quantity  REAL    NOT NULL DEFAULT 0.0,
    submit_tx_hash   TEXT,
    offer_sequence   INTEGER,
    cancel_tx_hash   TEXT,
    failure_reason   TEXT,
    client_order_id  TEXT
);

Additional columns added post-hoc via _ensure_column (state_manager.py:739–744 and surrounding block):
  • strategy_version TEXT
  • parameter_set_id INTEGER
  • quote_survival_time_s REAL
  • ladder_level INTEGER
  • quote_microprice REAL
  • session_id INTEGER

status field: exists, TEXT, NOT NULL. No CHECK constraint on the domain, so any string is accepted. Current valid values (from the OrderStatus enum at models.py:19):
  • pending_submission, dry_run_skipped, paper_active, submitted, active, partially_filled, filled, cancel_pending, canceled (US spelling, single-L), failed.

cancelled_at: does NOT exist. Requires a new column.

cancel_reason: does NOT exist. Requires a new column (Atlas lists this as optional; I will include it for auditability — cheap, one line).

Migration pattern (_ensure_column helper at state_manager.py:90):

def _ensure_column(conn, table_name, column_name, definition):
    if not _column_exists(conn, table_name, column_name):
        conn.execute(f"ALTER TABLE {table_name} ADD COLUMN {column_name} {definition}")

This is the established pattern — idempotent (no-op on re-run), backward-compat (existing rows get NULL), no crash on existing DBs. Matches D2, Corridor Guard, Distance-to-Touch, FLAG-037 C1–C4, and every other column add in the codebase.

Verdict: Atlas's preferred implementation (status = CANCELLED_BY_ENGINE, cancelled_at timestamp) requires:
  • No DDL change for the status domain: add a new value to the OrderStatus enum, write the string as "CANCELLED_BY_ENGINE" (Atlas's spelling), and existing DBs accept it without any migration.
  • Two column adds, cancelled_at TEXT and cancel_reason TEXT, via _ensure_column. Nullable, no default, legacy rows safely NULL.

No schema complication, no downtime, no backfill required.
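The migration claim can be checked in isolation. The sketch below re-creates the helper against an in-memory database; the body of _column_exists is an assumption (the memo only says it wraps PRAGMA table_info), and the two-column orders table is a reduced stand-in for the real schema.

```python
import sqlite3


def _column_exists(conn, table_name, column_name):
    # Assumed implementation: PRAGMA table_info returns one row per column,
    # with the column name at index 1.
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return any(row[1] == column_name for row in rows)


def _ensure_column(conn, table_name, column_name, definition):
    # Same shape as the helper quoted above: add the column only if missing.
    if not _column_exists(conn, table_name, column_name):
        conn.execute(f"ALTER TABLE {table_name} ADD COLUMN {column_name} {definition}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('legacy-1', 'active')")

# Run the migration twice to show that the second pass is a no-op.
for _ in range(2):
    _ensure_column(conn, "orders", "cancelled_at", "TEXT")
    _ensure_column(conn, "orders", "cancel_reason", "TEXT")

columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
legacy = conn.execute("SELECT cancelled_at, cancel_reason FROM orders").fetchone()

# No CHECK constraint on status, so the new string is accepted as-is.
conn.execute("UPDATE orders SET status = 'CANCELLED_BY_ENGINE' WHERE id = 'legacy-1'")
status = conn.execute("SELECT status FROM orders").fetchone()[0]
```

The legacy row reads back NULL for both new columns, and the status update succeeds without any change to the table definition.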


Q3 — Reconciler read path

Function: _handle_disappeared_active_order(order, engine, result, reconciler) at neo_engine/ledger_reconciler.py:673.

order object: an Order dataclass instance (the same type StateManager.get_active_orders returns via _row_to_order). It is passed directly, by reference as with any Python object, from the iteration loop at ledger_reconciler.py:568:

_handle_disappeared_active_order(order, engine, result, reconciler)

order.status is read directly in the existing function body (line 704):

"status": order.status.value,

So the CANCELLED_BY_ENGINE guard check is a simple attribute compare:

if order.status == OrderStatus.CANCELLED_BY_ENGINE:
    ...

No DB lookup required for the guard. The order was fetched with its status already populated. The guard reads a dataclass field in-process — zero cost.

One caveat for the dataclass: _row_to_order (state_manager.py:2680) reads fields from a sqlite3.Row. If we add cancelled_at and cancel_reason to the Order dataclass, _row_to_order must also populate them with the defensive pattern already used for other post-hoc columns:

cancelled_at=row["cancelled_at"] if "cancelled_at" in row.keys() else None,
cancel_reason=row["cancel_reason"] if "cancel_reason" in row.keys() else None,

This keeps the reader compatible with DBs that haven't run the migration yet (defensive; _ensure_column in initialize_database will always run on engine start, so in practice this branch is never hit after startup — but it matches the established pattern for every other optional field).
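A small, self-contained sketch of that defensive read, assuming the connection uses sqlite3.Row as its row factory. The read_optional helper is hypothetical and exists only to make the before/after comparison compact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO orders VALUES ('o1', 'active')")


def read_optional(row, column):
    # Hypothetical helper: same guard as the inline expression above.
    return row[column] if column in row.keys() else None


# Before the migration the column is absent, so the guard yields None
# instead of raising IndexError.
pre = conn.execute("SELECT * FROM orders").fetchone()
before = read_optional(pre, "cancelled_at")

# After the column exists and is populated, the same read returns the value.
conn.execute("ALTER TABLE orders ADD COLUMN cancelled_at TEXT")
conn.execute("UPDATE orders SET cancelled_at = '2026-04-21T12:00:00+00:00'")
post = conn.execute("SELECT * FROM orders").fetchone()
after = read_optional(post, "cancelled_at")
```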

Canonical evaluation order (per Atlas + tasking), to be coded exactly as:

# (1) CANCELLED_BY_ENGINE check — first. Never phantom fill.
if order.status == OrderStatus.CANCELLED_BY_ENGINE:
    log.info("RECONCILER_SKIP_ENGINE_CANCEL", extra={...})
    # finalize — no inventory change, no phantom fill, no delta
    return

# (2) cancel_tx_hash check — existing cancel-race short-circuit.
if order.cancel_tx_hash is not None:
    ...
    result.cancel_races += 1
    return

# (3) Age-threshold gate — existing FLAG-037 C2 logic (unchanged).
if cons_cfg.enabled:
    ...

The existing function already has (2) and (3) in that order. Adding (1) as the first block is a prepend — no reshuffle of the downstream branches.
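To make the precedence concrete, here is a runnable sketch with stand-in types. The reduced OrderStatus and Order definitions, the classify_disappeared function, and its return labels are all hypothetical; the real function logs and mutates result rather than returning a label, and a str-based enum is assumed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrderStatus(str, Enum):
    # Reduced stand-in for the real enum (assumed to be str-based).
    ACTIVE = "active"
    CANCELLED_BY_ENGINE = "CANCELLED_BY_ENGINE"


@dataclass
class Order:
    id: str
    status: OrderStatus
    cancel_tx_hash: Optional[str] = None


def classify_disappeared(order: Order) -> str:
    """Return which branch a disappeared order would take (labels are illustrative)."""
    # (1) Engine intent wins over everything else.
    if order.status == OrderStatus.CANCELLED_BY_ENGINE:
        return "skip_engine_cancel"
    # (2) Existing cancel-race short-circuit.
    if order.cancel_tx_hash is not None:
        return "cancel_race"
    # (3) Everything else falls through to the age-threshold gate.
    return "age_gate"


orders = [
    Order("a", OrderStatus.CANCELLED_BY_ENGINE),
    Order("b", OrderStatus.CANCELLED_BY_ENGINE, cancel_tx_hash="ABC"),
    Order("c", OrderStatus.ACTIVE, cancel_tx_hash="DEF"),
    Order("d", OrderStatus.ACTIVE),
]
branches = {o.id: classify_disappeared(o) for o in orders}
```

Order "b" shows why the position of the first check matters: it carries a cancel_tx_hash, yet it is still classified by engine intent rather than counted as a cancel race.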


Q4 — Write-before-reconcile ordering

Structural guarantee: the reconciler is invoked synchronously at Step 5 of the main tick (main_loop.py:2455):

recon_result = self._reconciler.reconcile(account_offers)

DEGRADED entry (_enter_degraded_mode) is invoked later in the same tick — it can be triggered from multiple sources (truth-check in Step 8.5, reconciler held_pending_review, guard evaluators in Step 8.5a/b/c/d, recovery evaluators in Step 8.4). None of those paths are async; all are plain sync method calls inside the same tick's handler.

Inside _enter_degraded_mode, on first entry (not already_degraded), _cancel_all_live_orders("Degraded entry cancel") is called. The method is sync: it calls get_active_orders (DB read), iterates, submits OfferCancel to the gateway (sync, xrpl-py submit), and returns.

Gap between the DB writes in _cancel_all_live_orders and the next reconciler pass: exactly one tick boundary. On tick N, DEGRADED entry writes the CANCELLED_BY_ENGINE status to the DB. On tick N+1, Step 5 (Reconcile) calls _reconciler.reconcile, which calls get_active_orders. Note that get_active_orders filters on status IN (ACTIVE, PARTIALLY_FILLED), so once the status is updated to CANCELLED_BY_ENGINE the order is no longer returned, whereas the canonical path per Atlas assumes the reconciler does see disappeared orders and evaluates the guard. Two behaviors are possible depending on ordering:

  • Path A: status update first, then gateway submit. Order is UPDATE'd to CANCELLED_BY_ENGINE → no longer in get_active_orders result → reconciler never sees it as "disappeared" → phantom-fill code path never runs. Guard is moot here because the order is simply excluded.
  • Path B: gateway submit first, then status update. OfferCancel lands, order disappears from ledger, but DB still says ACTIVE — reconciler on next tick sees disappeared order, falls into _handle_disappeared_active_order, and would phantom-fill (current pathology).
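Path A's exclusion effect can be sketched with a reduced table. The query text is an assumption about how get_active_orders filters; only the status values ("active", "partially_filled") come from the enum listing in Q2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("o1", "active"), ("o2", "partially_filled"), ("o3", "active")],
)

# Assumed shape of the get_active_orders filter.
ACTIVE_QUERY = (
    "SELECT id FROM orders "
    "WHERE status IN ('active', 'partially_filled') ORDER BY id"
)

before = [r[0] for r in conn.execute(ACTIVE_QUERY)]

# Path A: the status update lands before anything else happens.
conn.execute("UPDATE orders SET status = 'CANCELLED_BY_ENGINE' WHERE id = 'o1'")
after = [r[0] for r in conn.execute(ACTIVE_QUERY)]
```

Once marked, o1 drops out of the active set, so a reconciler that iterates only over that set never reaches the disappeared-order handler for it.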

Atlas's guard is written for the defensive case: even if the order reaches _handle_disappeared_active_order, the CANCELLED_BY_ENGINE check skips it. This matters in two situations:

  1. Failure recovery: if the DB write succeeded but the gateway submit failed, the OfferCancel never lands, yet the order is marked CANCELLED_BY_ENGINE. The reconciler compares account_offers against get_active_orders, and get_active_orders no longer includes the order (its status is outside the ACTIVE/PARTIALLY_FILLED filter), so the offer is effectively orphaned: live on ledger while the engine thinks it is cancelled. This is a real concern. Recommendation in C6: DB write first, then gateway submit; on gateway failure, log WARNING and continue, since leaving a stale live offer is less harmful than a phantom fill on reconciliation. Atlas's "intent must override inference" says the engine's intent (cancelled) wins over ledger state.
  2. Restart continuity: cancel_all fires on tick N, the DB is flushed, and the engine crashes before tick N+1. On restart, the reconciler reads the DB, sees CANCELLED_BY_ENGINE orders, and correctly skips them. This is test #3.

Plan: C6 writes the DB first, then submits the gateway OfferCancel. On gateway exception, we log and continue. The reconciler guard (also in C6) is the backstop for any edge case where the order still reaches _handle_disappeared_active_order (e.g. if Path B ordering is ever accidentally reintroduced, or if a future code path creates a CANCELLED_BY_ENGINE-marked order that is still active on ledger, or the restart-continuity case where the DB is loaded fresh and the reconciler runs on the first tick).
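The write-first ordering and its two failure branches can be sketched as follows. The cancel_all function, its callable parameters, and the outcome labels are hypothetical stand-ins for _cancel_all_live_orders, the state manager, and the gateway; the ERROR log level follows the C6 implementation plan.

```python
import logging
from typing import Callable, Dict, Iterable

log = logging.getLogger("cancel_all_sketch")


def cancel_all(
    order_ids: Iterable[str],
    mark_in_db: Callable[[str], None],
    submit_cancel: Callable[[str], None],
) -> Dict[str, str]:
    """Mark each order in the DB first, then submit the cancel to the gateway."""
    outcomes: Dict[str, str] = {}
    for order_id in order_ids:
        try:
            mark_in_db(order_id)
        except Exception:
            # DB failure: do not submit, so DB and ledger stay consistent.
            log.error("DB mark failed for %s; skipping gateway submit", order_id)
            outcomes[order_id] = "db_failed_not_submitted"
            continue
        try:
            submit_cancel(order_id)
        except Exception:
            # Gateway failure after a successful mark: log and move on.
            log.error("Gateway cancel failed for %s; order stays marked", order_id)
            outcomes[order_id] = "marked_submit_failed"
            continue
        outcomes[order_id] = "marked_and_submitted"
    return outcomes


# Usage with fakes that fail on specific IDs.
marked, submitted = [], []


def fake_mark(order_id):
    if order_id == "bad-db":
        raise RuntimeError("simulated DB failure")
    marked.append(order_id)


def fake_submit(order_id):
    if order_id == "bad-gw":
        raise RuntimeError("simulated gateway failure")
    submitted.append(order_id)


outcomes = cancel_all(["ok", "bad-db", "bad-gw"], fake_mark, fake_submit)
```

Because one order's failure only affects that order's outcome, a single bad ID does not stop the remaining cancels from being attempted.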

No async gap. The engine is single-threaded at the tick level. No other thread touches _state concurrently.


Q5 — Schema migration safety

StateManager.initialize_database (state_manager.py:420):
  • Uses CREATE TABLE IF NOT EXISTS for all tables → idempotent.
  • _ensure_column helper (state_manager.py:90) wraps PRAGMA table_info + ALTER TABLE ADD COLUMN → idempotent, no-op if the column already exists.
  • Executed inside a _transaction context manager → atomic.
  • Called unconditionally on every engine startup → migration runs automatically.

Existing column adds validate the pattern. FLAG-036 (wallet truth reconciliation), FLAG-037 C1–C4 (age-threshold anomaly log), Corridor Guard, Anchor Error Telemetry — all added columns via _ensure_column without issue. Katja's existing neo_live_stage1.db has survived dozens of schema additions this way.

Concrete C5 migration lines (to be added in the block at state_manager.py:~760):

# FLAG-037 ext — CANCELLED_BY_ENGINE persistence (Atlas ruling 2026-04-21)
_ensure_column(conn, "orders", "cancelled_at",  "TEXT")
_ensure_column(conn, "orders", "cancel_reason", "TEXT")

Backward compat:
  • Existing rows get NULL for both columns. No default needed: these columns are only meaningful for CANCELLED_BY_ENGINE-transitioned orders.
  • _row_to_order reads each new column with the if "col" in row.keys() else None guard, matching the pattern for every other post-hoc column.
  • status = "CANCELLED_BY_ENGINE" is a new string value in a TEXT column with no CHECK constraint, so no domain migration is needed. The OrderStatus enum just gets a new member; Python's str-Enum conversion accepts the value on read.
  • No backfill required. Legacy orders that were cancelled pre-migration are already in terminal states (CANCELED, FILLED, FAILED) and will never hit the disappeared-order path.

Rollback safety: if this layer is reverted, the new columns remain in the DB (harmless: NULL on all subsequent rows). After a clean revert the new enum value no longer exists in code, so if any CANCELLED_BY_ENGINE rows remain in the DB, the reverted code would fail the OrderStatus(row["status"]) conversion. Mitigation: a revert would first need to update those rows to CANCELED. Not a concern for forward delivery; flagging for completeness.
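The rollback hazard is easy to reproduce with two stand-in enums. Both classes below are reduced, hypothetical versions of OrderStatus (assumed to be str-based), and try_convert is an illustrative helper.

```python
from enum import Enum


class RevertedOrderStatus(str, Enum):
    # Stand-in for the enum after a revert: no CANCELLED_BY_ENGINE member.
    ACTIVE = "active"
    CANCELED = "canceled"


class ExtendedOrderStatus(str, Enum):
    # Stand-in for the enum with the new member added.
    ACTIVE = "active"
    CANCELED = "canceled"
    CANCELLED_BY_ENGINE = "CANCELLED_BY_ENGINE"


def try_convert(enum_cls, raw):
    # Enum lookup by value raises ValueError for unknown strings.
    try:
        return enum_cls(raw)
    except ValueError:
        return None


extended = try_convert(ExtendedOrderStatus, "CANCELLED_BY_ENGINE")
reverted = try_convert(RevertedOrderStatus, "CANCELLED_BY_ENGINE")
```

With the extended enum the stored string converts cleanly; with the reverted enum the same row would raise on read, which is why any revert needs the row update first.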

Crash on existing DB: impossible. _ensure_column short-circuits if the column exists; CREATE TABLE IF NOT EXISTS is a no-op. Run on a fresh DB, the new columns appear alongside the base schema. Run on a populated DB, the columns are appended, existing rows untouched.


Spelling clarification (flagged up front)

Atlas's ruling uses British CANCELLED_BY_ENGINE (double-L). Existing enum uses US CANCELED (single-L, per existing convention in the codebase).

I will use Atlas's exact string: CANCELLED_BY_ENGINE. This will be:
  • The enum member name: OrderStatus.CANCELLED_BY_ENGINE
  • The enum string value (on disk): "CANCELLED_BY_ENGINE"
  • The log token fragment: RECONCILER_SKIP_ENGINE_CANCEL (Atlas-specified, unrelated to the spelling issue)

The existing OrderStatus.CANCELED = "canceled" is untouched. The two values coexist; each has distinct semantics:
  • CANCELED: operator/external cancellation reached terminal state via confirm_cancel (existing).
  • CANCELLED_BY_ENGINE: the engine itself cancelled this order as part of DEGRADED entry; must not be phantom-filled by the reconciler (new).

No rename of the existing value. No aliasing. Clean separation of intent.


Implementation plan (final, ready to commit once reviewed)

C5: schema + dataclass + enum (models.py, state_manager.py)
  • Add OrderStatus.CANCELLED_BY_ENGINE = "CANCELLED_BY_ENGINE".
  • Add to the Order dataclass: cancelled_at: Optional[str] = None, cancel_reason: Optional[str] = None.
  • Add _ensure_column lines for cancelled_at and cancel_reason in initialize_database.
  • Extend _row_to_order with defensive reads for both columns.
  • Add a StateManager.mark_cancelled_by_engine(order_id, cancelled_at, reason=None) method: UPDATE status + cancelled_at + cancel_reason + updated_at inside _transaction.

C6: cancel_all integration + reconciler guard (main_loop.py, ledger_reconciler.py)
  • In _cancel_all_live_orders: before self._gateway.submit_offer_cancel(...), call self._state.mark_cancelled_by_engine(order.id, cancelled_at=now_iso, reason=context). Wrap in try/except: on DB failure, log ERROR and skip the gateway submit for this order (don't risk orphaning a live offer). On gateway failure after DB success, log ERROR and continue (order is marked; reconciler guard is the backstop).
  • In _handle_disappeared_active_order: add the CANCELLED_BY_ENGINE check as the first block, before the cancel_tx_hash check. Log token: RECONCILER_SKIP_ENGINE_CANCEL. Return without any inventory change, phantom fill, delta, or anomaly row.
  • No change to the age-threshold block (C2): it runs after the new guard and the existing cancel-race check, unchanged.

C7: tests (tests/test_reconciler_cancelled_by_engine.py)
Five mandatory tests per Atlas:
  1. DEGRADED cancel test: engine enters DEGRADED, cancel_all fires, reconciler runs on the same order IDs on the next tick; asserts: no phantom fills, inventory unchanged, RECONCILER_SKIP_ENGINE_CANCEL log emitted.
  2. Ambiguous disappearance test: order NOT CANCELLED_BY_ENGINE, no cancel_tx_hash; age < threshold → phantom fill path (existing FLAG-037 behavior). Age ≥ threshold → held_pending_review. Confirms the age-gate (C2) still works for genuinely ambiguous cases.
  3. Restart continuity test: populate DB with CANCELLED_BY_ENGINE rows; instantiate fresh StateManager + LedgerReconciler; run reconcile with empty account_offers; assert: skip path fires, no phantom fills, inventory unchanged.
  4. Mixed case test: DB with a mix of CANCELLED_BY_ENGINE, ambiguous disappeared, and cancel_tx_hash orders; reconcile; assert: only ambiguous ones traverse the age-gate, CANCELLED_BY_ENGINE ones are skipped cleanly, cancel_race path fires for cancel_tx_hash rows.
  5. Truth gate preservation test: DEGRADED cancel_all + reconciler pass with CANCELLED_BY_ENGINE orders; assert wallet truth delta is zero (no artificial divergence from the cancellation path).

Tests use the Windows-safe teardown pattern already established (close StateManager before TemporaryDirectory cleanup, per FLAG-037 C4).
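As a reminder of what that teardown pattern guards against, a minimal sketch using a raw sqlite3 connection in place of StateManager. The run_with_temp_db function and the file name are hypothetical; the point is only the ordering of close() relative to directory cleanup.

```python
import os
import sqlite3
import tempfile


def run_with_temp_db():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "test.db")
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
            conn.execute("INSERT INTO orders VALUES ('o1', 'CANCELLED_BY_ENGINE')")
            conn.commit()
            count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        finally:
            # Close before the with-block exits: on Windows an open handle
            # keeps the DB file locked and directory cleanup can fail.
            conn.close()
    # The directory (and the DB file in it) is gone once the with-block exits.
    return count, os.path.exists(db_path)


count, still_there = run_with_temp_db()
```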

Total final test count: 15 (10 pre-existing age-threshold + 5 new). Full regression target: all targeted suites green, pre-existing 10 errors from test_reconciler_anomaly_log.py unchanged.


Risks / caveats

  1. Spelling deviation (flagged above). Using Atlas's British spelling, existing US spelling preserved.
  2. Local main drift from Katja's main. My sandbox's main lacks FLAG-041/042/044. C6's main_loop.py edits may have textual adjacency to FLAG-042 changes in _enter_degraded_mode. git am on Katja's machine will either apply cleanly or surface a fuzz for her to screenshot. Pre-flagging so no one is surprised.
  3. Dedicated mark_cancelled_by_engine method vs. extending update_order_status. I'm going with the dedicated method for intent legibility. Will flip to the kwarg extension if Vesper prefers.
  4. cancel_reason inclusion. Atlas listed it as optional. I'm including it (cheap, auditable, and the context arg to _cancel_all_live_orders already gives us the right value — "Degraded entry cancel" or "Shutdown cancel"). Can drop to bare cancelled_at if preferred.
  5. No branch pre-created. Per standing rules, investigation stays on main. Branch will be created at commit time and patches delivered to 08 Patches/fix-reconciler-disappeared-order-conservative-ext/ as spec'd in the tasking memo.

Ready to proceed

If Vesper approves the spelling + method-shape decisions above, I'll proceed to C5 → C6 → C7 → patch bundle → delivery memo. Expected delivery same session. No further blockers identified.

— Orion Director of Engineering, BlueFly AI Enterprises