Summary

Pre-code investigation for fix/cancel-fill-race. Bottom line: the fix is mechanically shallow but has one upstream dependency that affects scope — the gateway does NOT currently expose an on-chain tx-history lookup. Adding account_tx to the gateway is a prerequisite; it's a small addition but it belongs in the branch, not retrofitted later.

Before anything else — flagging a taxonomy correction. The tasking doc specifies tecNO_ENTRY as the cancel-race result code. The actual XRPL code returned by submit_and_wait for an already-consumed offer is tecNO_TARGET. The gateway already distinguishes this case (xrpl_gateway.py:1220, 1234; models.py:392), and the CancelResponse carries xrpl_result_code as a machine-readable field specifically for this detection. Request: use tecNO_TARGET throughout the branch and update the tasking doc + log token names accordingly. The rest of this memo uses tecNO_TARGET.

Second: the race surface is broader than DEGRADED entry. The fix must land uniformly in _cancel_all_live_orders because that one function serves two call sites today (DEGRADED entry, shutdown cancel) and a third tomorrow (ANCHOR_IDLE entry, FLAG-046). Single-point fix is correct; no scoping needed. See Q5.

Proposed commit sequence: 5 commits (not 4), because C1 must add account_tx to the gateway before C3 can call it. Detail below.


Q1 — Cancel result inspection

Primary call site: neo_engine/main_loop.py:1281, inside _cancel_all_live_orders:

try:
    self._gateway.submit_offer_cancel(order.offer_sequence)     # ← CancelResponse discarded
    sent += 1
    log.info(f"{context}: sent", extra={...})
except Exception as exc:
    log.error(f"{context}: failed", extra={...})

The returned CancelResponse is not assigned — the call site ignores the result entirely. Exceptions are caught, but a returned success=False with xrpl_result_code="tecNO_TARGET" silently reports "sent" (the sent += 1 line still runs). That's the exact hole FLAG-047 needs to close.

Secondary call sites that also use submit_offer_cancel (for completeness — none is the FLAG-047 target, but they're relevant to test scoping):

| file:line | context | reads result? | FLAG-047 scope? |
|---|---|---|---|
| main_loop.py:1281 | _cancel_all_live_orders | NO | YES (primary fix site) |
| main_loop.py:841 | FLAG-012 recovery cancel round 1 | yes (result.success or result.xrpl_result_code == "tecNO_TARGET") | NO (different lifecycle path, no CANCELLED_BY_ENGINE write) |
| main_loop.py:901 | FLAG-012 retry cancel | yes | NO (same as above) |
| main_loop.py:4275 | _attempt_cancel (order-lifecycle CANCEL_PENDING path) | yes, returns cancel_resp | NO (orders here are CANCEL_PENDING, not CANCELLED_BY_ENGINE) |
| execution_engine.py (via order_lifecycle) | three-phase cancel path | yes | NO (normal order lifecycle) |

Only the _cancel_all_live_orders path is FLAG-047-relevant because only that path writes CANCELLED_BY_ENGINE via mark_cancelled_by_engine BEFORE gateway submit (the write-before-submit pattern introduced in FLAG-037 C6). The other cancel paths follow request_cancel → CANCEL_PENDING → confirm_cancel and do not use the CANCELLED_BY_ENGINE guard.

Fix shape (C2): change line 1281 from discard-and-try to capture-and-branch. Pseudocode:

try:
    resp = self._gateway.submit_offer_cancel(order.offer_sequence)
except Exception as exc:
    log.error(f"{context}: failed", extra={..., "error": str(exc)})
    continue

if resp.xrpl_result_code == "tecNO_TARGET":
    # Race detected — offer was consumed before cancel arrived.
    # Transition CANCELLED_BY_ENGINE → CANCEL_RACE_UNKNOWN so
    # the reconciler's skip guard does NOT fire on this order.
    try:
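        self._state.mark_cancel_race_unknown(
            order.id,
            race_detected_at=datetime.utcnow().isoformat() + "Z",
            # The Q4 signature also requires pivot_ledger. The field name on
            # CancelResponse is an assumption to confirm in implementation.
            pivot_ledger=resp.ledger_index,
        )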
    except Exception as exc:
        log.error(f"{context}: race status write failed", extra={...})
    log.warning(
        "CANCEL_RACE_DETECTED — tecNO_TARGET on cancel; "
        "order needs on-chain lookup",
        extra={
            "order_id": order.id,
            "offer_sequence": order.offer_sequence,
        },
    )
    sent += 1  # cancel was sent; semantics of "sent" preserved
    continue

if resp.success:
    sent += 1
    log.info(f"{context}: sent", extra={...})
else:
    # Other failure — non-race. Status stays CANCELLED_BY_ENGINE.
    log.error(
        f"{context}: cancel failed",
        extra={..., "xrpl_result_code": resp.xrpl_result_code,
               "failure_reason": resp.failure_reason},
    )

Notes:
* The DB write for mark_cancelled_by_engine runs BEFORE this block (line 1265) and stays. The CANCEL_RACE_UNKNOWN transition is a status overwrite, not a conditional initial write; this preserves the write-before-submit invariant against a crash between the DB mark and the gateway call.
* mark_cancel_race_unknown is a new state_manager method, structurally mirroring mark_cancelled_by_engine. See Q4.
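
The capture-and-branch ordering can be exercised in isolation. The following is a minimal sketch rather than the engine code: CancelResponse is a stand-in dataclass carrying only the three fields this memo names, and classify_cancel and tally are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CancelResponse:
    """Stand-in for the gateway's CancelResponse (assumed shape)."""
    success: bool
    xrpl_result_code: Optional[str] = None
    failure_reason: Optional[str] = None


def classify_cancel(resp: CancelResponse) -> str:
    """Hypothetical helper: map a cancel response to 'race', 'sent' or 'failed'."""
    # Test the race code before success: a tecNO_TARGET response has
    # success=False, so checking success first would file the race
    # under the generic failure branch.
    if resp.xrpl_result_code == "tecNO_TARGET":
        return "race"
    if resp.success:
        return "sent"
    return "failed"


def tally(responses: Iterable[CancelResponse]) -> dict:
    """Count outcomes for a batch, as a cancel-all loop would."""
    counts = {"race": 0, "sent": 0, "failed": 0}
    for resp in responses:
        counts[classify_cancel(resp)] += 1
    return counts


batch = [
    CancelResponse(success=True, xrpl_result_code="tesSUCCESS"),
    CancelResponse(success=False, xrpl_result_code="tecNO_TARGET"),
    CancelResponse(success=False, failure_reason="timeout"),
]
batch_counts = tally(batch)
```

In the proposed fix both "race" and "sent" increment the sent counter; only the "race" outcome triggers the status overwrite.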


Q2 — On-chain tx history

Current state of gateway: NO tx-history lookup exists. xrpl_gateway.py:_build_xrpl_request dispatches only: ServerInfo, AccountInfo, AccountOffers, AccountLines, BookOffers, AMMInfo. There is no account_tx, no tx, no ledger_entry.

XRPL API choice: account_tx is the idiomatic endpoint. Given the offer_sequence we need to resolve, the query shape is:

from xrpl.models.requests import AccountTx
req = AccountTx(
    account=self._wallet_address,
    ledger_index_min=ledger_index_at_cancel - K,  # K = small window,
    ledger_index_max=ledger_index_at_cancel + K,  # e.g. 10-20 ledgers
    limit=200,
    forward=False,
)
resp = self._client.request(req).result

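Then walk resp["transactions"] and inspect each transaction's metadata AffectedNodes. The match key is FinalFields.Sequence (together with FinalFields.Account, since sequence numbers are per-account): a DeletedNode reports unchanged fields such as Sequence under FinalFields, and PreviousFields lists only the fields the transaction changed.
* If a DeletedNode of type Offer with FinalFields.Sequence == offer_sequence appears under an OfferCreate or Payment tx from a DIFFERENT account → fill (our offer was consumed by a counterparty).
* If a DeletedNode of type Offer with that sequence appears under an OfferCancel tx from OUR account → cancel (our own cancel, or a prior cancel, did fire despite the late tecNO_TARGET we observed).
* If neither branch matches → inconclusive.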

Ledger-window sizing: cancel_tx_ledger_index gives us the pivot. Offers typically settle within 1-2 ledgers (each ~4s). A ±10-ledger window (~40 seconds on each side of the pivot, ~80 seconds total) should capture the event with high confidence. Larger windows risk matching unrelated transactions and add cost; smaller windows risk missing a race that straddled validation. I'd default to ±15 ledgers (~60 seconds on each side); easy to tune once the first live race is observed.
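
The window arithmetic can be made explicit with a small helper. This is an illustrative sketch: ledger_window is a hypothetical name, and the 4-second close time is the rough figure quoted above, not a guaranteed interval.

```python
APPROX_LEDGER_CLOSE_SECONDS = 4  # rough figure from the memo, not a guarantee


def ledger_window(pivot_ledger_index: int, k: int) -> dict:
    """Return account_tx bounds and the approximate wall-clock span for ±k ledgers."""
    if k < 0:
        raise ValueError("window half-width must be non-negative")
    return {
        "ledger_index_min": pivot_ledger_index - k,
        "ledger_index_max": pivot_ledger_index + k,
        "approx_seconds_each_side": k * APPROX_LEDGER_CLOSE_SECONDS,
        "approx_seconds_total": 2 * k * APPROX_LEDGER_CLOSE_SECONDS,
    }


narrow = ledger_window(1_000_000, 10)   # the ±10 case discussed above
default = ledger_window(1_000_000, 15)  # the proposed ±15 default
```

The pivot value 1_000_000 is a placeholder; the ±15 default spans roughly two minutes in total.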

Alternative considered + rejected: tx by cancel tx hash. The OfferCancel tx hash is known (we just submitted it), but its metadata only describes what the OfferCancel itself did — which was nothing, by definition (tecNO_TARGET = offer was already gone). It doesn't tell us what consumed the offer. account_tx is required.

Performance impact: one additional XRPL RPC call per race event (rare — only fires on tecNO_TARGET). Bounded, synchronous in the reconciler path. Acceptable.

Gateway addition (new method — C1 scope):

def get_account_tx_for_offer(
    self,
    offer_sequence: int,
    ledger_window: int = 15,
    pivot_ledger_index: Optional[int] = None,
) -> OfferResolution:
    """
    Resolve what happened to a specific offer_sequence by querying
    account_tx around a pivot ledger. Returns OfferResolution enum:
    FILLED, CANCELLED, INCONCLUSIVE.
    """

Returns a new lightweight dataclass/enum (OfferResolution) so the reconciler doesn't inspect raw XRPL JSON. Lives in models.py.

Fail-closed: exceptions during the RPC call, timeouts, or any parse ambiguity → return OfferResolution.INCONCLUSIVE. The reconciler then treats it as truth-check trigger per the Atlas invariant ("if the engine cannot prove alignment with reality, it does not act").
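
One way to enforce the fail-closed rule is to funnel every lookup through a single wrapper, so that no exception or malformed result can escape as anything other than INCONCLUSIVE. A minimal sketch follows; resolve_fail_closed is a hypothetical helper and OfferResolution is redefined locally so the block stands alone.

```python
from enum import Enum
from typing import Callable


class OfferResolution(Enum):
    FILLED = "filled"
    CANCELLED = "cancelled"
    INCONCLUSIVE = "inconclusive"


def resolve_fail_closed(lookup: Callable[[], object]) -> OfferResolution:
    """Run a lookup; any error or unexpected return value becomes INCONCLUSIVE."""
    try:
        outcome = lookup()
    except Exception:
        # RPC failure, timeout, parse error: we cannot prove anything.
        return OfferResolution.INCONCLUSIVE
    if isinstance(outcome, OfferResolution):
        return outcome
    # Anything that is not a definite resolution is treated as ambiguity.
    return OfferResolution.INCONCLUSIVE


def _timeout():
    raise TimeoutError("simulated RPC timeout")


on_timeout = resolve_fail_closed(_timeout)
on_garbage = resolve_fail_closed(lambda: {"unexpected": "payload"})
on_fill = resolve_fail_closed(lambda: OfferResolution.FILLED)
```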


Q3 — Reconciler entry point

Exact location: neo_engine/ledger_reconciler.py:760-771, inside _handle_disappeared_active_order:

# -----------------------------------------------------------------------
# FLAG-037 extension — CANCELLED_BY_ENGINE guard (Atlas ruling 2026-04-21)
# -----------------------------------------------------------------------
# FIRST check. Atlas's canonical evaluation order is:
#   1. CANCELLED_BY_ENGINE → never phantom fill.
#   2. cancel_tx_hash      → existing cancel-race short-circuit.
#   3. Age-threshold gate  → existing FLAG-037 C2 logic.
...
if order.status == OrderStatus.CANCELLED_BY_ENGINE:
    log.info(
        "RECONCILER_SKIP_ENGINE_CANCEL — disappeared order was "
        "cancelled by engine; skipping phantom-fill path",
        extra={...},
    )
    return

Also relevant:
* ledger_reconciler.py:355: _get_orders_for_reconciliation includes CANCELLED_BY_ENGINE in the status filter. CANCEL_RACE_UNKNOWN must also be added here so the reconciler fetches these orders.
* ledger_reconciler.py:519-521: the _reconcile_order dispatch branch routes CANCELLED_BY_ENGINE directly into _handle_disappeared_active_order. CANCEL_RACE_UNKNOWN follows the same dispatch, since it is a disappeared-order case requiring resolution.

Fix shape (C3): Insert a new branch BEFORE the CANCELLED_BY_ENGINE guard. Revised canonical evaluation order:

1. CANCEL_RACE_UNKNOWN     → on-chain tx lookup → FILL / CANCEL / INCONCLUSIVE
2. CANCELLED_BY_ENGINE     → skip (unchanged)
3. cancel_tx_hash          → cancel race (unchanged)
4. Age-threshold gate      → FLAG-037 C2 (unchanged)

Pseudocode:

if order.status == OrderStatus.CANCEL_RACE_UNKNOWN:
    resolution = self._gateway.get_account_tx_for_offer(
        order.offer_sequence,
        pivot_ledger_index=order.cancel_race_pivot_ledger,  # populated in C2
    )
    if resolution == OfferResolution.FILLED:
        # Record as fill (normal fill path — not phantom). This runs
        # the real fill accounting: engine.record_full_fill with the
        # on-chain-derived size/price.
        log.warning(
            "CANCEL_RACE_FILL_CONFIRMED — on-chain tx lookup shows "
            "offer was filled before cancel; recording fill",
            extra={"order_id": order.id, "offer_sequence": ...},
        )
        engine.record_full_fill(order, size=..., price=...)
        self._state.mark_filled_after_race(order.id)
        result.full_fills += 1
        return

    if resolution == OfferResolution.CANCELLED:
        log.info(
            "CANCEL_RACE_CANCEL_CONFIRMED — on-chain tx lookup shows "
            "offer was cancelled despite tecNO_TARGET",
            extra={"order_id": order.id, "offer_sequence": ...},
        )
        self._state.mark_cancelled_by_engine(
            order.id,
            reason="cancel_race_resolved_to_cancel",
        )
        return

    # INCONCLUSIVE — fail-closed. Anomaly row + result.held_pending_review
    # increments, which triggers DEGRADED via reconciler at main loop.
    log.error(
        "CANCEL_RACE_INCONCLUSIVE — on-chain tx lookup could not "
        "determine outcome; escalating to truth check",
        extra={...},
    )
    self._write_anomaly_row(order, action_taken="cancel_race_inconclusive")
    result.held_pending_review += 1
    return

Fill-size sourcing on CANCEL_RACE_FILL_CONFIRMED: this is the subtle bit. The on-chain lookup tells us the offer_sequence was consumed, and the DeletedNode metadata for the Offer carries PreviousFields.TakerPays and PreviousFields.TakerGets (what was left before the consuming tx ate it) and FinalFields (what was left after — typically 0 for full fills). Delta = PreviousFields - FinalFields = the fill amount. I'll need to walk the AffectedNodes carefully and convert XRPL amount encoding (drops for XRP, issued amount for RLUSD). This is one of the riskier bits of the branch. Alternative: invoke existing fill-recording telemetry the same way the reconciler does today for phantom fills, using the order's stated intended size. That's less precise but simpler. Lean toward the on-chain-derived approach — we'd want the actual realized fill, not the intended one. Will confirm in implementation.
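
A sketch of the delta computation for one side of the offer, under the encoding described above: XRP amounts arrive as strings of drops, issued amounts as objects with currency, issuer and value. The helper names are hypothetical, the fixture numbers are synthetic rather than taken from S48, and "USD" / "rISSUER" are placeholders for the real RLUSD currency code and issuer.

```python
from decimal import Decimal
from typing import Optional, Tuple

DROPS_PER_XRP = Decimal(1_000_000)


def parse_amount(amount) -> Tuple[str, Decimal]:
    """Convert an XRPL amount into (currency, value)."""
    if isinstance(amount, str):
        # XRP is encoded as an integer string of drops.
        return "XRP", Decimal(amount) / DROPS_PER_XRP
    return amount["currency"], Decimal(amount["value"])


def fill_delta(deleted_node: dict, field: str) -> Optional[Tuple[str, Decimal]]:
    """PreviousFields[field] - FinalFields[field], or None if not derivable."""
    prev = deleted_node.get("PreviousFields", {}).get(field)
    final = deleted_node.get("FinalFields", {}).get(field)
    if prev is None or final is None:
        return None  # field did not change in this tx; caller should fail closed
    prev_cur, prev_val = parse_amount(prev)
    final_cur, final_val = parse_amount(final)
    if prev_cur != final_cur:
        return None  # mismatched currencies: treat as parse ambiguity
    return prev_cur, prev_val - final_val


# Synthetic fully-consumed offer: 5 XRP offered for 7.25 units of the issued asset.
node = {
    "LedgerEntryType": "Offer",
    "PreviousFields": {
        "TakerGets": "5000000",
        "TakerPays": {"currency": "USD", "issuer": "rISSUER", "value": "7.25"},
    },
    "FinalFields": {
        "TakerGets": "0",
        "TakerPays": {"currency": "USD", "issuer": "rISSUER", "value": "0"},
    },
}
gets = fill_delta(node, "TakerGets")
pays = fill_delta(node, "TakerPays")
```

Decimal is used throughout so that drops-to-XRP conversion and issued-amount subtraction do not pick up binary floating-point error.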


Q4 — DB schema

Additive only. No standalone migration step is required: the new columns are added idempotently at startup via _ensure_column.

Current orders schema (per FLAG-037 C5):
* status TEXT: unconstrained (no CHECK). Accepts any string. CANCELLED_BY_ENGINE was just a new value added Apr 21.
* cancelled_at TEXT: ISO 8601, populated only by mark_cancelled_by_engine. Legacy rows NULL.
* cancel_reason TEXT: free-form. Legacy rows NULL.

Schema additions for FLAG-047 (C1 scope):
* New status value: CANCEL_RACE_UNKNOWN, added to the OrderStatus enum in models.py. Column needs no change.
* New optional column: cancel_race_detected_at TEXT, the timestamp when tecNO_TARGET was observed on the submit_offer_cancel return. Added via _ensure_column("orders", "cancel_race_detected_at", "TEXT") in initialize_database, mirroring the FLAG-037 C5 pattern exactly.
* New optional column: cancel_race_pivot_ledger INTEGER, the ledger_index at which the cancel attempt validated. Used by the reconciler to scope the account_tx window. Same idempotent migration pattern.
* _row_to_order extended with the "col" in row.keys() else None defensive pattern for both new columns, matching the FLAG-037 C5 treatment.

New state_manager methods:
* mark_cancel_race_unknown(order_id, *, race_detected_at, pivot_ledger): single-transaction write of status + two new fields + refresh of updated_at. Mirrors mark_cancelled_by_engine exactly.
* mark_filled_after_race(order_id, *, filled_at=None): transitions CANCEL_RACE_UNKNOWN → FILLED with audit trail. Optional; could also be handled by the existing record_full_fill pathway. Worth a discussion; see "open question for Vesper" at bottom.

Legacy compatibility:
* Orders written before this branch have no cancel_race_* columns; idempotent _ensure_column adds them as NULL. Reads work because _row_to_order applies the same defensive if "col" in row.keys() else None pattern to the new columns that it already uses for the FLAG-037 C5 columns.
* mark_cancelled_by_engine is unchanged and continues to produce orders with status=CANCELLED_BY_ENGINE, cancelled_at populated, and cancel_race_* fields NULL. The reconciler branch on CANCELLED_BY_ENGINE is unchanged: the skip path still fires for normal (non-race) engine cancels.
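
The idempotent-column and defensive-read patterns can be demonstrated against an in-memory SQLite database. This is a sketch under assumptions: ensure_column and read_optional are hypothetical re-implementations for illustration, not the project's _ensure_column or _row_to_order, and the two-column orders table is a stand-in for the real schema.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str,
                  column: str, col_type: str) -> None:
    """Add a column if missing; a no-op when it already exists.

    Identifiers are interpolated into SQL, so they must be trusted constants.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")


def read_optional(row: sqlite3.Row, column: str):
    """Defensive read: rows fetched from an older schema yield None."""
    return row[column] if column in row.keys() else None


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('legacy-1', 'CANCELLED_BY_ENGINE')")

# Row shaped like the pre-branch schema (no cancel_race_* columns).
legacy_row = conn.execute("SELECT * FROM orders").fetchall()[0]

# Running the additions twice shows they are idempotent.
for _ in range(2):
    ensure_column(conn, "orders", "cancel_race_detected_at", "TEXT")
    ensure_column(conn, "orders", "cancel_race_pivot_ledger", "INTEGER")

migrated_row = conn.execute("SELECT * FROM orders").fetchall()[0]
columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
```

After the additions the legacy order is still readable, with both new fields reading as None.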


Q5 — Interaction with FLAG-046 (ANCHOR_IDLE)

Confirmed: the fix applies uniformly. No scoping needed.

Per my FLAG-046 pre-code findings (Vesper-approved earlier today), the ANCHOR_IDLE entry helper _enter_anchor_idle_mode will call _cancel_all_live_orders("Anchor idle entry cancel"). This is the same single function that FLAG-047 is patching. One fix covers all three call sites:

| call site | from | same race surface? |
|---|---|---|
| DEGRADED entry | _enter_degraded_mode → _cancel_all_live_orders("Degraded entry cancel") | YES (primary FLAG-047 scope) |
| ANCHOR_IDLE entry (future) | _enter_anchor_idle_mode → _cancel_all_live_orders("Anchor idle entry cancel") | YES (inherited via shared function) |
| Shutdown cancel | _cancel_live_orders_on_shutdown → _cancel_all_live_orders("Shutdown cancel") | YES (also inherits, and arguably more important: shutdown without race handling strands fills) |

Implication for branch sequencing:
* FLAG-047 lands first (SESSION-BLOCKING per tasking).
* FLAG-046 (ANCHOR_IDLE) cuts after and inherits the fix automatically. No additional work is needed in the FLAG-046 branch to handle the race, because the race handling lives at the shared function layer.
* Tests for FLAG-047 should use DEGRADED entry as the trigger (the observed S48 failure mode). An ANCHOR_IDLE-as-trigger test can be added when FLAG-046 lands, or as a regression test in the FLAG-046 branch.

Ordering with shutdown cancel: shutdown-cancel race resolution is MORE important than DEGRADED cancel race resolution. A stranded fill at shutdown sits uncredited until the next session's reconciler runs — which might be too late. The fix covers this case because the same function is patched, but it's worth calling out that we're improving shutdown correctness as a side effect.


Risk register

  1. Gateway account_tx is a new XRPL request type for us. The dispatch pattern (_build_xrpl_request) is well-established from existing request types, but parsing AccountTx response metadata for our specific "which tx consumed this offer" question is new ground. Mitigated by tight unit tests with recorded XRPL response fixtures in C4.

  2. Fill-size derivation from AffectedNodes metadata. Getting the delta computation right (TakerPays/TakerGets deltas, drops-to-XRP conversion, issued-amount parsing) needs care. Consider a helper in xrpl_gateway.py with its own tests.

  3. Test surface. FLAG-037 C7 tests used the 5 Atlas-locked cases on CANCELLED_BY_ENGINE behavior. FLAG-047 doesn't change CANCELLED_BY_ENGINE behavior (non-race path unchanged) so those 5 tests should still pass. But I'll need a new file tests/test_cancel_fill_race.py with at least these cases:

    1. tecNO_TARGET observed → order transitions to CANCEL_RACE_UNKNOWN, CANCEL_RACE_DETECTED logged
    2. CANCEL_RACE_UNKNOWN + on-chain lookup returns FILL → record_full_fill called with on-chain-derived size, CANCEL_RACE_FILL_CONFIRMED logged, result.full_fills += 1
    3. CANCEL_RACE_UNKNOWN + on-chain lookup returns CANCEL → status transitions to CANCELLED_BY_ENGINE with reason "cancel_race_resolved_to_cancel", CANCEL_RACE_CANCEL_CONFIRMED logged
    4. CANCEL_RACE_UNKNOWN + on-chain lookup INCONCLUSIVE → anomaly row written, result.held_pending_review += 1, CANCEL_RACE_INCONCLUSIVE logged
    5. Normal CANCELLED_BY_ENGINE path (tesSUCCESS) still fires RECONCILER_SKIP_ENGINE_CANCEL — no regression
    6. Gateway exception during on-chain lookup → INCONCLUSIVE path (fail-closed)
    7. Mixed batch: one tecNO_TARGET + one tesSUCCESS in the same _cancel_all_live_orders loop → first transitions to CANCEL_RACE_UNKNOWN, second transitions to normal CANCELLED_BY_ENGINE → both handled correctly by reconciler.
  4. S48 reproduction. The S48 07:06:24 event (delta_xrp=−7.317607, delta_rlusd=+10.5, 0 fills recorded) would be replayed as: tecNO_TARGET on the cancel → CANCEL_RACE_UNKNOWN → on-chain lookup → FILL → record_full_fill(BUY, ~7.32 XRP, ~1.43 RLUSD/XRP). That's the smoke test. If fixture-replay tests can include a recorded account_tx response from S48, that's the strongest regression test. Will attempt — feasibility depends on whether Katja has the transactions archived or I need to reconstruct them.
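
The approximate size and price quoted for the S48 replay follow directly from the two balance deltas. A short worked check, using only the figures stated above (implied_price is a hypothetical helper; it works on magnitudes and makes no claim about the sign convention of the deltas):

```python
from decimal import Decimal

# Balance deltas reported for the S48 07:06:24 event.
delta_xrp = Decimal("-7.317607")
delta_rlusd = Decimal("10.5")


def implied_price(d_xrp: Decimal, d_rlusd: Decimal) -> Decimal:
    """RLUSD per XRP implied by a pair of balance deltas (magnitudes only)."""
    if d_xrp == 0:
        raise ValueError("no XRP moved; price is undefined")
    return abs(d_rlusd) / abs(d_xrp)


size_xrp = abs(delta_xrp).quantize(Decimal("0.01"))  # size to two decimals
price = round(implied_price(delta_xrp, delta_rlusd), 2)  # price to two decimals
```

A fixture-replay test could assert that the on-chain-derived fill matches these values within a small tolerance.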


Proposed commit sequence (revised — 5 commits)

| # | Commit | Scope |
|---|---|---|
| C1 | Schema + gateway additions | CANCEL_RACE_UNKNOWN enum, cancel_race_detected_at + cancel_race_pivot_ledger columns, mark_cancel_race_unknown method, get_account_tx_for_offer gateway method, OfferResolution enum. Pure additions, no runtime behavior change. |
| C2 | Cancel result branch | _cancel_all_live_orders captures CancelResponse, branches on xrpl_result_code == "tecNO_TARGET", writes CANCEL_RACE_UNKNOWN via new state_manager method, emits CANCEL_RACE_DETECTED. |
| C3 | Reconciler branch | _handle_disappeared_active_order gets CANCEL_RACE_UNKNOWN branch BEFORE the CANCELLED_BY_ENGINE check. Calls get_account_tx_for_offer, dispatches to FILL / CANCEL / INCONCLUSIVE paths. _get_orders_for_reconciliation status filter extended. _reconcile_order dispatch extended. |
| C4 | Fill-size extraction helper | Helper in xrpl_gateway.py to parse AffectedNodes metadata into (fill_size_xrp, fill_size_rlusd, fill_side). Isolated so it can be unit-tested against recorded fixtures. Used by C3's FILL path. Split from C3 for clean diffs. |
| C5 | Tests | tests/test_cancel_fill_race.py with 7 cases above. Windows-safe teardown pattern (StateManager.close before TemporaryDirectory.cleanup, per FLAG-037 C7). |

Why 5 instead of 4: C1 must include the gateway's new get_account_tx_for_offer method, and C4 breaking out the fill-size parser into its own commit makes it independently testable and reviewable. I can collapse C1 and C4 together if you prefer 4, at the cost of larger diff sizes per commit.


Open questions for Vesper

  1. Taxonomy correction on tasking doc. Use tecNO_TARGET not tecNO_ENTRY? (I've assumed yes throughout this memo.) Update to tasking doc log-token names: CANCEL_RACE_DETECTED is fine as-is since it doesn't reference the result code by name.

  2. mark_filled_after_race method vs direct record_full_fill? Atlas's canonical evaluation order in FLAG-037 says "no phantom fill for cancelled orders." A CANCEL_RACE_UNKNOWN → FILL transition is a REAL fill, not phantom, so record_full_fill on the correct accounting path is fine. But the order's status lifecycle needs to reflect "was race, resolved to FILL" for audit. A dedicated mark_filled_after_race(order_id, *, filled_at) method gives clean separation; alternative is to add a status transition side-effect to the existing fill-recording path. Either works — flagging for your pick.

  3. Confirm 5-commit sequence vs merge C1+C4 for a 4-commit sequence. Marginal preference: 5 — C4's extraction is nicer to review in isolation.

  4. Fill-size sourcing — on-chain-derived vs intended-size? Leaning on-chain-derived for accuracy, but that adds parsing risk. Intended-size is trivially correct at code level but may diverge from the actual executed fill. Your call.

  5. Fixture-replay test from S48? If Katja has the account_tx response from the S48 race moment archived, I'd like it as a fixture. Otherwise I'll synthesize one that matches the expected shape.

Standing by for your review. Implementation is straightforward once these are ruled; estimated ~4-5 hours for C1-C5 + regression once green-lit.

Standing-rule compliance

  1. No pre-created branch during investigation — confirmed, no branch yet.
  2. Delivery apply block will use Get-ChildItem ... | Sort-Object Name | ForEach-Object { git am $_.FullName }.
  3. Defensive git branch -D fix/cancel-fill-race before git checkout -b will be included.

Local main drift

Sandbox tree is PRE-FLAG-042/044/046 per my FLAG-046 memo. FLAG-037-ext (C5-C7) IS present locally (verified — the CANCELLED_BY_ENGINE guard, mark_cancelled_by_engine method, and write-before-submit pattern at line 1265 are all in the tree). My findings above reflect the post-FLAG-037-ext surface, which is the relevant one for FLAG-047. Will recut against canonical main if git am fuzz appears.

— orion