Summary
Pre-code investigation for fix/cancel-fill-race. Bottom line: the fix
is mechanically shallow but has one upstream dependency that affects
scope — the gateway does NOT currently expose an on-chain tx-history
lookup. Adding account_tx to the gateway is a prerequisite; it's a
small addition but it belongs in the branch, not retrofitted later.
Before anything else — flagging a taxonomy correction. The tasking
doc specifies tecNO_ENTRY as the cancel-race result code. The actual
XRPL code returned by submit_and_wait for an already-consumed offer
is tecNO_TARGET. The gateway already distinguishes this case
(xrpl_gateway.py:1220, 1234; models.py:392), and the CancelResponse
carries xrpl_result_code as a machine-readable field specifically
for this detection. Request: use tecNO_TARGET throughout the branch
and update the tasking doc + log token names accordingly. The rest of
this memo uses tecNO_TARGET.
Second: the race surface is broader than DEGRADED entry. The fix
must land uniformly in _cancel_all_live_orders because that one
function serves two call sites today (DEGRADED entry, shutdown
cancel) and a third tomorrow (ANCHOR_IDLE entry, FLAG-046).
Single-point fix is correct; no scoping needed. See Q5.
Proposed commit sequence: 5 commits (not 4), because C1 must add
account_tx to the gateway before C3 can call it. Detail below.
Q1 — Cancel result inspection
Primary call site: neo_engine/main_loop.py:1281, inside
_cancel_all_live_orders:

    try:
        self._gateway.submit_offer_cancel(order.offer_sequence)  # ← CancelResponse discarded
        sent += 1
        log.info(f"{context}: sent", extra={...})
    except Exception as exc:
        log.error(f"{context}: failed", extra={...})

The returned CancelResponse is not assigned — the call site
ignores the result entirely. Exceptions are caught, but a returned
success=False with xrpl_result_code="tecNO_TARGET" silently
reports "sent" (the sent += 1 line still runs). That's the exact
hole FLAG-047 needs to close.
Secondary call sites that also use submit_offer_cancel (for
completeness — none is the FLAG-047 target, but they're relevant to
test scoping):
| file:line | context | reads result? | FLAG-047 scope? |
|---|---|---|---|
| main_loop.py:1281 | _cancel_all_live_orders | NO | YES: primary fix site |
| main_loop.py:841 | FLAG-012 recovery cancel round 1 | yes (result.success or result.xrpl_result_code == "tecNO_TARGET") | NO: different lifecycle path, no CANCELLED_BY_ENGINE write |
| main_loop.py:901 | FLAG-012 retry cancel | yes | NO: same as above |
| main_loop.py:4275 | _attempt_cancel (order-lifecycle CANCEL_PENDING path) | yes, returns cancel_resp | NO: orders here are CANCEL_PENDING, not CANCELLED_BY_ENGINE |
| execution_engine.py (via order_lifecycle) | three-phase cancel path | yes | NO: normal order lifecycle |
Only the _cancel_all_live_orders path is FLAG-047-relevant because
only that path writes CANCELLED_BY_ENGINE via mark_cancelled_by_engine
BEFORE gateway submit (the write-before-submit pattern introduced in
FLAG-037 C6). The other cancel paths follow request_cancel →
CANCEL_PENDING → confirm_cancel and do not use the
CANCELLED_BY_ENGINE guard.
Fix shape (C2): change line 1281 from discard-and-try to capture-and-branch. Pseudocode:

    try:
        resp = self._gateway.submit_offer_cancel(order.offer_sequence)
    except Exception as exc:
        log.error(f"{context}: failed", extra={..., "error": str(exc)})
        continue

    if resp.xrpl_result_code == "tecNO_TARGET":
        # Race detected: offer was consumed before cancel arrived.
        # Transition CANCELLED_BY_ENGINE → CANCEL_RACE_UNKNOWN so
        # the reconciler's skip guard does NOT fire on this order.
        try:
            self._state.mark_cancel_race_unknown(
                order.id,
                race_detected_at=datetime.utcnow().isoformat() + "Z",
                pivot_ledger=...,  # ledger index where the cancel validated (see Q4)
            )
        except Exception as exc:
            log.error(
                f"{context}: race status write failed",
                extra={..., "error": str(exc)},
            )
        log.warning(
            "CANCEL_RACE_DETECTED — tecNO_TARGET on cancel; "
            "order needs on-chain lookup",
            extra={
                "order_id": order.id,
                "offer_sequence": order.offer_sequence,
            },
        )
        sent += 1  # cancel was sent; semantics of "sent" preserved
        continue

    if resp.success:
        sent += 1
        log.info(f"{context}: sent", extra={...})
    else:
        # Other failure, non-race. Status stays CANCELLED_BY_ENGINE.
        log.error(
            f"{context}: cancel failed",
            extra={..., "xrpl_result_code": resp.xrpl_result_code,
                   "failure_reason": resp.failure_reason},
        )

Notes:
* The DB write for mark_cancelled_by_engine runs BEFORE this block
(line 1265) and stays. The CANCEL_RACE_UNKNOWN transition is a
status overwrite, not a conditional initial write — this preserves
the write-before-submit invariant against a crash between the DB
mark and the gateway call.
* mark_cancel_race_unknown is a new state_manager method,
structurally mirroring mark_cancelled_by_engine. See Q4.
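
The branch order in the fix shape matters: a race response carries success=False, so the result code has to be checked before success. A minimal sketch of that decision as a pure function, assuming a stand-in CancelResponse with only the three fields this memo names (the CancelOutcome enum and classify_cancel helper are hypothetical, not existing code):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CancelOutcome(Enum):
    """Hypothetical three-way outcome for one cancel submission."""
    SENT = "sent"
    RACE = "race"
    FAILED = "failed"


@dataclass(frozen=True)
class CancelResponse:
    """Stand-in for the gateway's response; only fields named in the memo."""
    success: bool
    xrpl_result_code: Optional[str] = None
    failure_reason: Optional[str] = None


# The result code this memo treats as the cancel-race signal.
RACE_RESULT_CODE = "tecNO_TARGET"


def classify_cancel(resp: CancelResponse) -> CancelOutcome:
    # Race check comes first: testing success first would file a race
    # response (success=False) under the generic failure branch.
    if resp.xrpl_result_code == RACE_RESULT_CODE:
        return CancelOutcome.RACE
    if resp.success:
        return CancelOutcome.SENT
    return CancelOutcome.FAILED
```

Keeping the decision separate from the DB write and logging would let the mixed-batch case be unit-tested without a gateway mock.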
Q2 — On-chain tx history
Current state of gateway: NO tx-history lookup exists.
xrpl_gateway.py:_build_xrpl_request dispatches only:
ServerInfo, AccountInfo, AccountOffers, AccountLines,
BookOffers, AMMInfo. There is no account_tx, no tx, no
ledger_entry.
XRPL API choice: account_tx is the idiomatic endpoint. Given
the offer_sequence we need to resolve, the query shape is:

    from xrpl.models.requests import AccountTx

    req = AccountTx(
        account=self._wallet_address,
        ledger_index_min=ledger_index_at_cancel - K,  # K = small window,
        ledger_index_max=ledger_index_at_cancel + K,  # e.g. 10-20 ledgers
        limit=200,
        forward=False,
    )
    resp = self._client.request(req).result

Then walk resp["transactions"] looking at metadata AffectedNodes.
Identify our offer by the DeletedNode's FinalFields (Account == our
wallet, Sequence == offer_sequence); Sequence never changes, so it
is not listed under PreviousFields.
* If a DeletedNode of type Offer matching our offer appears under an
OfferCreate or Payment tx from a DIFFERENT account → fill (our
offer was consumed by a counterparty).
* If a DeletedNode of type Offer matching our offer appears
under an OfferCancel tx from OUR account → cancel (our
own cancel, or a prior cancel, did fire despite the late
tecNO_TARGET we observed).
* If neither branch matches → inconclusive.
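
A minimal sketch of that walk, under stated assumptions: each account_tx entry is a dict with a meta object and the transaction under either tx or tx_json (the key differs by API version), and the deleted Offer is identified by the Account and Sequence in its FinalFields. The resolve_offer helper and this OfferResolution enum are illustrative, not the gateway's actual code:

```python
from enum import Enum
from typing import Any, Dict, Iterable


class OfferResolution(Enum):
    FILLED = "FILLED"
    CANCELLED = "CANCELLED"
    INCONCLUSIVE = "INCONCLUSIVE"


def _deletes_offer(meta: Dict[str, Any], account: str, offer_sequence: int) -> bool:
    """True if this tx's metadata deletes the Offer (account, offer_sequence)."""
    for node in meta.get("AffectedNodes", []):
        deleted = node.get("DeletedNode")
        if not deleted or deleted.get("LedgerEntryType") != "Offer":
            continue
        final = deleted.get("FinalFields", {})
        if final.get("Account") == account and final.get("Sequence") == offer_sequence:
            return True
    return False


def resolve_offer(
    transactions: Iterable[Dict[str, Any]], account: str, offer_sequence: int
) -> OfferResolution:
    for entry in transactions:
        meta = entry.get("meta")
        if not isinstance(meta, dict) or not _deletes_offer(meta, account, offer_sequence):
            continue
        # Assumption: the tx body sits under "tx_json" or "tx" depending on API version.
        tx = entry.get("tx_json") or entry.get("tx") or {}
        tx_type, sender = tx.get("TransactionType"), tx.get("Account")
        if tx_type == "OfferCancel" and sender == account:
            return OfferResolution.CANCELLED
        if tx_type in ("OfferCreate", "Payment") and sender and sender != account:
            return OfferResolution.FILLED
        # The offer was deleted by something neither rule covers.
        return OfferResolution.INCONCLUSIVE
    return OfferResolution.INCONCLUSIVE
```

An offer deleted by any transaction outside the two rules falls through to INCONCLUSIVE rather than being guessed at, which keeps the sketch consistent with the fail-closed stance.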
Ledger-window sizing: cancel_tx_ledger_index gives us the pivot. Offers typically settle within 1-2 ledgers (each ~4s). A ±10-ledger window (~40 seconds either side of the pivot, ~80 seconds in total) should capture the event with high confidence. Larger windows risk matching unrelated tx and add cost; smaller windows risk missing a race that straddled validation. I'd default to ±15 ledgers (~60 seconds either side) for extra margin; easy to tune once the first live race is observed.
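
The window arithmetic can be sketched as below, taking the ~4 s ledger interval quoted above as a rough constant (real close times vary). The helper names are hypothetical:

```python
from typing import Tuple

# Rough estimate used in this memo; actual ledger close times vary.
APPROX_LEDGER_CLOSE_SECONDS = 4


def ledger_window(pivot: int, k: int = 15) -> Tuple[int, int]:
    """Return (ledger_index_min, ledger_index_max) for a ±k window around pivot."""
    if k < 0:
        raise ValueError("window half-width must be non-negative")
    return pivot - k, pivot + k


def approx_window_seconds(k: int) -> int:
    """Approximate wall-clock span covered by a ±k ledger window."""
    return 2 * k * APPROX_LEDGER_CLOSE_SECONDS
```

For example, ledger_window(1000, 10) gives (990, 1010), and the default of 15 covers about 120 seconds in total.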
Alternative considered + rejected: tx by cancel tx hash. The
OfferCancel tx hash is known (we just submitted it), but its metadata
only describes what the OfferCancel itself did — which was nothing, by
definition (tecNO_TARGET = offer was already gone). It doesn't tell us
what consumed the offer. account_tx is required.
Performance impact: one additional XRPL RPC call per race event (rare — only fires on tecNO_TARGET). Bounded, synchronous in the reconciler path. Acceptable.
Gateway addition (new method, C1 scope):

    def get_account_tx_for_offer(
        self,
        offer_sequence: int,
        ledger_window: int = 15,
        pivot_ledger_index: Optional[int] = None,
    ) -> OfferResolution:
        """
        Resolve what happened to a specific offer_sequence by querying
        account_tx around a pivot ledger. Returns OfferResolution enum:
        FILLED, CANCELLED, INCONCLUSIVE.
        """

Returns a new lightweight dataclass/enum (OfferResolution) so the
reconciler doesn't inspect raw XRPL JSON. Lives in models.py.
Fail-closed: exceptions during the RPC call, timeouts, or any
parse ambiguity → return OfferResolution.INCONCLUSIVE. The
reconciler then treats it as truth-check trigger per the Atlas
invariant ("if the engine cannot prove alignment with reality, it
does not act").
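
One way to enforce the fail-closed rule is a thin wrapper around whatever performs the lookup. This is only a sketch: resolve_fail_closed is a hypothetical helper, and the lookup callable stands in for the RPC call plus parsing.

```python
import logging
from enum import Enum
from typing import Callable

log = logging.getLogger(__name__)


class OfferResolution(Enum):
    FILLED = "FILLED"
    CANCELLED = "CANCELLED"
    INCONCLUSIVE = "INCONCLUSIVE"


def resolve_fail_closed(lookup: Callable[[], object]) -> OfferResolution:
    """Run lookup(); anything other than a clean OfferResolution is INCONCLUSIVE."""
    try:
        result = lookup()
    except Exception:
        # RPC errors, timeouts and parse failures all collapse to INCONCLUSIVE.
        log.exception("offer lookup failed; treating as INCONCLUSIVE")
        return OfferResolution.INCONCLUSIVE
    if not isinstance(result, OfferResolution):
        # A malformed return value is treated as ambiguity, not as an answer.
        return OfferResolution.INCONCLUSIVE
    return result
```

With this shape the reconciler only ever sees one of the three enum values, so the INCONCLUSIVE branch is the single place where escalation is decided.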
Q3 — Reconciler entry point
Exact location: neo_engine/ledger_reconciler.py:760-771, inside
_handle_disappeared_active_order:

    # -----------------------------------------------------------------------
    # FLAG-037 extension — CANCELLED_BY_ENGINE guard (Atlas ruling 2026-04-21)
    # -----------------------------------------------------------------------
    # FIRST check. Atlas's canonical evaluation order is:
    #   1. CANCELLED_BY_ENGINE → never phantom fill.
    #   2. cancel_tx_hash → existing cancel-race short-circuit.
    #   3. Age-threshold gate → existing FLAG-037 C2 logic.
    ...
    if order.status == OrderStatus.CANCELLED_BY_ENGINE:
        log.info(
            "RECONCILER_SKIP_ENGINE_CANCEL — disappeared order was "
            "cancelled by engine; skipping phantom-fill path",
            extra={...},
        )
        return

Also relevant:
* ledger_reconciler.py:355 — _get_orders_for_reconciliation
includes CANCELLED_BY_ENGINE in the status filter. CANCEL_RACE_UNKNOWN
must also be added here so the reconciler fetches these orders.
* ledger_reconciler.py:519-521 — _reconcile_order dispatch branch
routes CANCELLED_BY_ENGINE directly into
_handle_disappeared_active_order. CANCEL_RACE_UNKNOWN follows
the same dispatch — it's a disappeared-order case requiring
resolution.
Fix shape (C3): Insert a new branch BEFORE the CANCELLED_BY_ENGINE guard. Revised canonical evaluation order:
1. CANCEL_RACE_UNKNOWN → on-chain tx lookup → FILL / CANCEL / INCONCLUSIVE
2. CANCELLED_BY_ENGINE → skip (unchanged)
3. cancel_tx_hash → cancel race (unchanged)
4. Age-threshold gate → FLAG-037 C2 (unchanged)
Pseudocode:

    if order.status == OrderStatus.CANCEL_RACE_UNKNOWN:
        resolution = self._gateway.get_account_tx_for_offer(
            order.offer_sequence,
            pivot_ledger_index=order.cancel_race_pivot_ledger,  # populated in C2
        )
        if resolution == OfferResolution.FILLED:
            # Record as fill (normal fill path, not phantom). This runs
            # the real fill accounting: engine.record_full_fill with the
            # on-chain-derived size/price.
            log.warning(
                "CANCEL_RACE_FILL_CONFIRMED — on-chain tx lookup shows "
                "offer was filled before cancel; recording fill",
                extra={"order_id": order.id, "offer_sequence": ...},
            )
            engine.record_full_fill(order, size=..., price=...)
            self._state.mark_filled_after_race(order.id)
            result.full_fills += 1
            return
        if resolution == OfferResolution.CANCELLED:
            log.info(
                "CANCEL_RACE_CANCEL_CONFIRMED — on-chain tx lookup shows "
                "offer was cancelled despite tecNO_TARGET",
                extra={"order_id": order.id, "offer_sequence": ...},
            )
            self._state.mark_cancelled_by_engine(
                order.id,
                reason="cancel_race_resolved_to_cancel",
            )
            return
        # INCONCLUSIVE: fail-closed. Anomaly row + result.held_pending_review
        # increments, which triggers DEGRADED via reconciler at main loop.
        log.error(
            "CANCEL_RACE_INCONCLUSIVE — on-chain tx lookup could not "
            "determine outcome; escalating to truth check",
            extra={...},
        )
        self._write_anomaly_row(order, action_taken="cancel_race_inconclusive")
        result.held_pending_review += 1
        return

Fill-size sourcing on CANCEL_RACE_FILL_CONFIRMED: this is the
subtle bit. The on-chain lookup tells us the offer_sequence was
consumed, and the DeletedNode metadata for the Offer carries
PreviousFields.TakerPays and PreviousFields.TakerGets (what was
left before the consuming tx ate it) and FinalFields (what was left
after — typically 0 for full fills). Delta = PreviousFields -
FinalFields = the fill amount. I'll need to walk the AffectedNodes
carefully and convert XRPL amount encoding (drops for XRP, issued
amount for RLUSD). This is one of the riskier bits of the branch.
Alternative: invoke existing fill-recording telemetry the same way the
reconciler does today for phantom fills, using the order's stated
intended size. That's less precise but simpler. Lean toward the
on-chain-derived approach — we'd want the actual realized
fill, not the intended one. Will confirm in implementation.
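
The delta computation might look like the sketch below. It assumes the standard XRPL amount encodings (XRP as a string of drops, issued currencies as an object with a value field) and uses Decimal to avoid float rounding. The helper names and the sample node are illustrative; the numbers are synthetic, not taken from S48.

```python
from decimal import Decimal
from typing import Any, Dict, Union

DROPS_PER_XRP = Decimal(1_000_000)

# XRP amounts are strings of drops; issued amounts are dicts with a "value".
Amount = Union[str, Dict[str, str]]


def amount_value(amount: Amount) -> Decimal:
    """Convert an XRPL amount to a Decimal in whole units (XRP or token)."""
    if isinstance(amount, str):
        return Decimal(amount) / DROPS_PER_XRP
    return Decimal(amount["value"])


def fill_delta(deleted_node: Dict[str, Any], field: str) -> Decimal:
    """PreviousFields[field] - FinalFields[field] for one DeletedNode.

    Returns 0 when the field is absent from PreviousFields, i.e. the
    transaction did not change it. A missing FinalFields entry raises
    KeyError so the caller can fail closed.
    """
    previous = deleted_node.get("PreviousFields", {})
    if field not in previous:
        return Decimal(0)
    final = deleted_node["FinalFields"]
    return amount_value(previous[field]) - amount_value(final[field])


# Synthetic node: offer of 5 XRP for 7.5 units of a token, fully consumed.
sample_node = {
    "LedgerEntryType": "Offer",
    "PreviousFields": {
        "TakerGets": "5000000",
        "TakerPays": {"currency": "USD", "issuer": "rISSUER", "value": "7.5"},
    },
    "FinalFields": {
        "TakerGets": "0",
        "TakerPays": {"currency": "USD", "issuer": "rISSUER", "value": "0"},
    },
}
xrp_filled = fill_delta(sample_node, "TakerGets")    # Decimal 5
token_filled = fill_delta(sample_node, "TakerPays")  # Decimal 7.5
```

This covers only the consuming transaction; an offer that was partially filled earlier would need those earlier deltas accounted for separately.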
Q4 — DB schema
Additive only. No migration required.
Current orders schema (per FLAG-037 C5):
* status TEXT — unconstrained (no CHECK). Accepts any string.
CANCELLED_BY_ENGINE was just a new value added Apr 21.
* cancelled_at TEXT — ISO 8601, populated only by
mark_cancelled_by_engine. Legacy rows NULL.
* cancel_reason TEXT — free-form. Legacy rows NULL.
Schema additions for FLAG-047 (C1 scope):
* New status value: CANCEL_RACE_UNKNOWN — added to OrderStatus
enum in models.py. Column needs no change.
* New optional column: cancel_race_detected_at TEXT — timestamp
when tecNO_TARGET was observed on the submit_offer_cancel return.
Added via _ensure_column("orders", "cancel_race_detected_at",
"TEXT") in initialize_database, mirroring the FLAG-037 C5
pattern exactly.
* New optional column: cancel_race_pivot_ledger INTEGER — the
ledger_index at which the cancel attempt validated. Used by the
reconciler to scope the account_tx window. Same idempotent
migration pattern.
* _row_to_order extended with the
row["col"] if "col" in row.keys() else None defensive pattern for
both new columns, matching the FLAG-037 C5 treatment.
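
For reference, an idempotent column add plus the defensive read could be sketched with plain sqlite3 as follows. The real _ensure_column and _row_to_order are not shown in this memo, so ensure_column and read_optional here are assumptions about their shape rather than copies.

```python
import sqlite3
from typing import Any


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> None:
    """Add the column only if it is missing, so repeated startups are no-ops."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        # Identifiers cannot be bound as parameters; callers pass trusted literals.
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


def read_optional(row: sqlite3.Row, column: str) -> Any:
    """Defensive read: None when the column is absent from the result row."""
    return row[column] if column in row.keys() else None
```

Existing rows read back NULL for a newly added column, which is why no backfill migration is needed.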
New state_manager methods:
* mark_cancel_race_unknown(order_id, *, race_detected_at, pivot_ledger)
— single-transaction write of status + two new fields + refresh of
updated_at. Mirrors mark_cancelled_by_engine exactly.
* mark_filled_after_race(order_id, *, filled_at=None) — transitions
CANCEL_RACE_UNKNOWN → FILLED with audit trail. Optional; could
also be handled by the existing record_full_fill pathway. Worth
a discussion — see "open question for Vesper" at bottom.
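
A sketch of the single-transaction write for mark_cancel_race_unknown, written as a free function over a sqlite3 connection. The table layout (id, status, updated_at plus the two new columns) follows the names used in this memo; the WHERE guard on the prior status is an added assumption, not something the memo specifies.

```python
import sqlite3
from datetime import datetime, timezone


def mark_cancel_race_unknown(
    conn: sqlite3.Connection,
    order_id: str,
    *,
    race_detected_at: str,
    pivot_ledger: int,
) -> bool:
    """Overwrite CANCELLED_BY_ENGINE with CANCEL_RACE_UNKNOWN in one transaction.

    Returns True if exactly one row was updated.
    """
    now = datetime.now(timezone.utc).isoformat()
    # "with conn" commits on success and rolls back if execute raises.
    with conn:
        cur = conn.execute(
            """
            UPDATE orders
               SET status = ?,
                   cancel_race_detected_at = ?,
                   cancel_race_pivot_ledger = ?,
                   updated_at = ?
             WHERE id = ?
               AND status = ?  -- assumed guard: only overwrite the engine-cancel mark
            """,
            (
                "CANCEL_RACE_UNKNOWN",
                race_detected_at,
                pivot_ledger,
                now,
                order_id,
                "CANCELLED_BY_ENGINE",
            ),
        )
    return cur.rowcount == 1
```

Returning a bool lets the caller log a failed transition instead of assuming the write landed.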
Legacy compatibility:
* Orders written before this branch have no cancel_race_* columns
— idempotent _ensure_column adds them as NULL. Reads work
because _row_to_order already uses the defensive
if "col" in row.keys() else None pattern.
* mark_cancelled_by_engine is unchanged and continues to produce
orders with status=CANCELLED_BY_ENGINE, cancelled_at populated,
and cancel_race_* fields NULL. Reconciler branch on
CANCELLED_BY_ENGINE is unchanged — the skip path still fires for
normal (non-race) engine cancels.
Q5 — Interaction with FLAG-046 (ANCHOR_IDLE)
Confirmed: the fix applies uniformly. No scoping needed.
Per my FLAG-046 pre-code findings (Vesper-approved earlier today), the
ANCHOR_IDLE entry helper _enter_anchor_idle_mode will call
_cancel_all_live_orders("Anchor idle entry cancel"). This is the
same single function that FLAG-047 is patching. One fix covers all
three call sites:
| call site | from | same race surface? |
|---|---|---|
| DEGRADED entry | _enter_degraded_mode → _cancel_all_live_orders("Degraded entry cancel") | YES: primary FLAG-047 scope |
| ANCHOR_IDLE entry (future) | _enter_anchor_idle_mode → _cancel_all_live_orders("Anchor idle entry cancel") | YES: inherited via shared function |
| Shutdown cancel | _cancel_live_orders_on_shutdown → _cancel_all_live_orders("Shutdown cancel") | YES: also inherits, and arguably more important (shutdown without race handling strands fills) |
Implication for branch sequencing:
* FLAG-047 lands first (SESSION-BLOCKING per tasking).
* FLAG-046 (ANCHOR_IDLE) cuts after and inherits the fix automatically. No additional work is needed in the FLAG-046 branch to handle the race, because the race handling lives at the shared function layer.
* Tests for FLAG-047 should use DEGRADED entry as the trigger (the observed S48 failure mode). An ANCHOR_IDLE-as-trigger test can be added when FLAG-046 lands, or as a regression test in the FLAG-046 branch.
Ordering with shutdown cancel: shutdown-cancel race resolution is MORE important than DEGRADED cancel race resolution. A stranded fill at shutdown sits uncredited until the next session's reconciler runs — which might be too late. The fix covers this case because the same function is patched, but it's worth calling out that we're improving shutdown correctness as a side effect.
Risk register
- Gateway account_tx is a new XRPL request type for us. The dispatch pattern (_build_xrpl_request) is well-established from existing request types, but parsing the AccountTx response metadata for our specific "which tx consumed this offer" question is new ground. Mitigated by tight unit tests with recorded XRPL response fixtures in C4.
- Fill-size derivation from AffectedNodes metadata. Getting the delta computation right (TakerPays/TakerGets deltas, drops-to-XRP conversion, issued-amount parsing) needs care. Consider a helper in xrpl_gateway.py with its own tests.
- Test surface. FLAG-037 C7 tests used the 5 Atlas-locked cases on CANCELLED_BY_ENGINE behavior. FLAG-047 doesn't change CANCELLED_BY_ENGINE behavior (non-race path unchanged) so those 5 tests should still pass. But I'll need a new file tests/test_cancel_fill_race.py with at least these cases:
    - tecNO_TARGET observed → order transitions to CANCEL_RACE_UNKNOWN, CANCEL_RACE_DETECTED logged
    - CANCEL_RACE_UNKNOWN + on-chain lookup returns FILL → record_full_fill called with on-chain-derived size, CANCEL_RACE_FILL_CONFIRMED logged, result.full_fills += 1
    - CANCEL_RACE_UNKNOWN + on-chain lookup returns CANCEL → status transitions to CANCELLED_BY_ENGINE with reason "cancel_race_resolved_to_cancel", CANCEL_RACE_CANCEL_CONFIRMED logged
    - CANCEL_RACE_UNKNOWN + on-chain lookup INCONCLUSIVE → anomaly row written, result.held_pending_review += 1, CANCEL_RACE_INCONCLUSIVE logged
    - Normal CANCELLED_BY_ENGINE path (tesSUCCESS) still fires RECONCILER_SKIP_ENGINE_CANCEL (no regression)
    - Gateway exception during on-chain lookup → INCONCLUSIVE path (fail-closed)
    - Mixed batch: one tecNO_TARGET + one tesSUCCESS in the same _cancel_all_live_orders loop → first transitions to CANCEL_RACE_UNKNOWN, second transitions to normal CANCELLED_BY_ENGINE → both handled correctly by reconciler.
- S48 reproduction. The S48 07:06:24 event (delta_xrp=−7.317607, delta_rlusd=+10.5, 0 fills recorded) would be replayed as: tecNO_TARGET on the cancel → CANCEL_RACE_UNKNOWN → on-chain lookup → FILL → record_full_fill(BUY, ~7.32 XRP, ~1.43 RLUSD/XRP). That's the smoke test. If fixture-replay tests can include a recorded account_tx response from S48, that's the strongest regression test. Will attempt; feasibility depends on whether Katja has the transactions archived or I need to reconstruct them.
Proposed commit sequence (revised — 5 commits)
| # | Commit | Scope |
|---|---|---|
| C1 | Schema + gateway additions | CANCEL_RACE_UNKNOWN enum, cancel_race_detected_at + cancel_race_pivot_ledger columns, mark_cancel_race_unknown method, get_account_tx_for_offer gateway method, OfferResolution enum. Pure additions — no runtime behavior change. |
| C2 | Cancel result branch | _cancel_all_live_orders captures CancelResponse, branches on xrpl_result_code == "tecNO_TARGET", writes CANCEL_RACE_UNKNOWN via new state_manager method, emits CANCEL_RACE_DETECTED. |
| C3 | Reconciler branch | _handle_disappeared_active_order gets CANCEL_RACE_UNKNOWN branch BEFORE the CANCELLED_BY_ENGINE check. Calls get_account_tx_for_offer, dispatches to FILL / CANCEL / INCONCLUSIVE paths. _get_orders_for_reconciliation status filter extended. _reconcile_order dispatch extended. |
| C4 | Fill-size extraction helper | Helper in xrpl_gateway.py to parse AffectedNodes metadata into (fill_size_xrp, fill_size_rlusd, fill_side). Isolated so it can be unit-tested against recorded fixtures. Used by C3's FILL path. Split from C3 for clean diffs. |
| C5 | Tests | tests/test_cancel_fill_race.py with 7 cases above. Windows-safe teardown pattern (StateManager.close before TemporaryDirectory.cleanup, per FLAG-037 C7). |
Why 5 instead of 4: C1 must include the gateway's new
get_account_tx_for_offer method, and C4 breaking out the fill-size
parser into its own commit makes it independently testable and
reviewable. I can collapse C1 and C4 together if you prefer 4, at the
cost of larger diff sizes per commit.
Open questions for Vesper
- Taxonomy correction on tasking doc. Use tecNO_TARGET not tecNO_ENTRY? (I've assumed yes throughout this memo.) Update to tasking doc log-token names: CANCEL_RACE_DETECTED is fine as-is since it doesn't reference the result code by name.
- mark_filled_after_race method vs direct record_full_fill? Atlas's canonical evaluation order in FLAG-037 says "no phantom fill for cancelled orders." A CANCEL_RACE_UNKNOWN → FILL transition is a REAL fill, not phantom, so record_full_fill on the correct accounting path is fine. But the order's status lifecycle needs to reflect "was race, resolved to FILL" for audit. A dedicated mark_filled_after_race(order_id, *, filled_at) method gives clean separation; the alternative is to add a status transition side-effect to the existing fill-recording path. Either works; flagging for your pick.
- Confirm 5-commit sequence vs merge C1+C4 for a 4-commit sequence. Marginal preference: 5, since C4's extraction is nicer to review in isolation.
- Fill-size sourcing: on-chain-derived vs intended-size? Leaning on-chain-derived for accuracy, but that adds parsing risk. Intended-size is trivially correct at code level but may diverge from the actual executed fill. Your call.
- Fixture-replay test from S48? If Katja has the account_tx response from the S48 race moment archived, I'd like it as a fixture. Otherwise I'll synthesize one that matches the expected shape.
Standing by for your review. Implementation is straightforward once these are ruled; estimated ~4-5 hours for C1-C5 + regression once green-lit.
Standing-rule compliance
- No pre-created branch during investigation: confirmed, no branch yet.
- Delivery apply block will use Get-ChildItem ... | Sort-Object Name | ForEach-Object { git am $_.FullName }.
- Defensive git branch -D fix/cancel-fill-race before git checkout -b will be included.
Local main drift
Sandbox tree is PRE-FLAG-042/044/046 per my FLAG-046 memo. FLAG-037-ext
(C5-C7) IS present locally (verified — the CANCELLED_BY_ENGINE guard,
mark_cancelled_by_engine method, and write-before-submit pattern at
line 1265 are all in the tree). My findings above reflect the
post-FLAG-037-ext surface, which is the relevant one for FLAG-047.
Will recut against canonical main if git am fuzz appears.
— orion