Orion Investigation — Branch #4: fix/flag-029-async-pin-and-orphan¶
To: Vesper (review), Katja (approve), Atlas (FYI) From: Orion Date: 2026-04-18
What this file is¶
Pre-code investigation memo. Same discipline as Branch #2: map the surface, show the proposed change, get sign-off, then commit. No code has been written on chore/archive-cleanup tip yet — the tree is clean at df168f0 (Branch #3 merge mirrored locally).
Read this, push back on anything that looks wrong, and I'll cut fix/flag-029-async-pin-and-orphan after your green light.
FLAG-029a — async safety¶
The three production submit_and_wait call sites¶
Every call comes from neo_engine/xrpl_gateway.py. None exist outside this file. (Tests mock it — tests/test_task4.py; not in scope.)
| # | Gateway method | Line | Purpose |
|---|---|---|---|
| 1 | submit_offer_create |
748 | New OfferCreate (quote placement) |
| 2 | submit_offer_replace_atomic |
925 | OfferCreate with OfferSequence (cancel-and-replace in one tx) |
| 3 | submit_offer_cancel |
1041 | OfferCancel |
All three do from xrpl.transaction import submit_and_wait at the top of their try block, then response = submit_and_wait(tx, client, wallet). Identical shape, three copies.
The dependency pin¶
No requirements.txt. Dependencies live in pyproject.toml:
Proposed pin: "xrpl-py>=2.4,<3.0".
Rationale: xrpl-py 2.x is the sync series. The audit-identified risk is a silent migration to async-first in 3.x, where submit_and_wait would return a coroutine that our try/except (ConnectionError, OSError, TimeoutError) block treats as a success object — quote lands on the ledger but the engine's order lifecycle writes "failure" because response.result is missing. The upper bound is a contract: "if someone bumps xrpl-py to 3.x, do it behind a deliberate porting PR, not by accident."
Optional: also add <4.0 instead of <3.0 if we want a wider net. My default is <3.0 because that's the version where the behavior is currently known to shift. I'll defer to Vesper.
The async-safety wrapper¶
Proposed new helper at module scope in neo_engine/xrpl_gateway.py:
import inspect
def _submit_and_wait_safe(tx, client, wallet):
"""
Belt to the pyproject.toml upper-bound suspenders.
Wraps xrpl.transaction.submit_and_wait. If xrpl-py ever returns a
coroutine (async migration) instead of a SubmissionResponse, we
raise immediately instead of silently misrouting the result and
losing track of an on-chain order.
Raises:
RuntimeError: if submit_and_wait returns a coroutine or is a
coroutine function.
"""
from xrpl.transaction import submit_and_wait as _saw # type: ignore[import]
if inspect.iscoroutinefunction(_saw):
raise RuntimeError(
"xrpl.transaction.submit_and_wait is a coroutine function — "
"NEO v1 is sync-only. Check xrpl-py version pin."
)
response = _saw(tx, client, wallet)
if inspect.iscoroutine(response):
# Defensive: close the coroutine to avoid RuntimeWarning, then raise.
response.close()
raise RuntimeError(
"xrpl.transaction.submit_and_wait returned a coroutine — "
"NEO v1 expects a synchronous SubmissionResponse. "
"Check xrpl-py version pin."
)
return response
Three call sites change from:
from xrpl.transaction import submit_and_wait
...
response = submit_and_wait(tx_to_submit, client, wallet)
to:
No behavior change on the current xrpl-py 2.4.x line. The check is cheap (one iscoroutine per call) and only fires on a genuinely misbehaving dependency.
The startup smoke check¶
Proposed addition to XRPLGateway.__init__ (line 337 of xrpl_gateway.py), directly after the existing log.info("XRPLGateway initialized", ...):
# FLAG-029a: fail fast if xrpl-py ever migrates submit_and_wait to async.
try:
from xrpl.transaction import submit_and_wait as _saw_probe # type: ignore[import]
if inspect.iscoroutinefunction(_saw_probe):
log.error(
"XRPL gateway refusing to start: xrpl.transaction.submit_and_wait "
"is async under the current xrpl-py version; NEO v1 is sync-only. "
"Pin xrpl-py to <3.0 or port the gateway."
)
raise GatewayError("submit_and_wait is async — sync-only engine cannot run")
except ImportError:
# No xrpl-py installed (paper/dry-run). Let the per-call ImportError
# handling in the submit_* methods downgrade to a SubmissionResponse.
pass
This turns a silent quote-then-lose-it outcome into a loud "won't start." Paper/dry-run paths keep working — the ImportError branch is how the gateway already tolerates missing xrpl-py today (see line 676).
FLAG-029b — orphan c7e14e73¶
Why it survives¶
c7e14e73 is status = 'SUBMITTED' with no offer_sequence. That lands in a gap between two safety nets:
-
The reconciler (
ledger_reconciler.py:320–325) explicitly excludes SUBMITTED.By design: SUBMITTED orders have nofor status in ( OrderStatus.ACTIVE, OrderStatus.PARTIALLY_FILLED, OrderStatus.CANCEL_PENDING, ): reconcilable.extend(state.get_orders_by_status(status))offer_sequenceyet, so there's nothing to match against ledger offers. The SUBMITTED→ACTIVE transition is owned byconfirm_active()on the happy path. -
Startup force-cancel (
main_loop.py:410 _startup_force_cancel_stuck_orders) only covers two categories: - PENDING_SUBMISSION (never reached gateway)
- CANCEL_PENDING with
cancel_tx_hashset (cancel confirmed but write not closed)
A SUBMITTED order that never got confirm_active'd — and never got its offer_sequence written — sits forever. That's the c7e14e73 failure mode.
"Reconciled every launch" = complained-about every launch, not actually processed. Anywhere the side-guard or an order-status scan iterates, this row shows up as "live" and pollutes logs.
Proposed fix — extend the existing function, don't add a one-shot¶
Two options considered:
- (A) One-shot script keyed to
id LIKE 'c7e14e73%'. Matches the audit memo's literal wording. Fixes exactly this row. - (B) Generalize
_startup_force_cancel_stuck_orderswith a Category 3: SUBMITTED +offer_sequence IS NULL+updated_at < cutoff. Same result for c7e14e73; also self-heals any future row that slips through the same gap. Mirrors the pattern of the existing two categories.
I recommend (B). The one-shot leaves the gap open. Category 3 closes it. The audit plan called for "orphan cleanup 2026-04-18" as the failure_reason — I'll keep that string for parity with the memo, but the mechanism generalizes.
Sketch:
# --- Category 3: SUBMITTED with no offer_sequence ---
# These orders never completed the SUBMITTED → ACTIVE transition.
# confirm_active() never fired (engine crash between submit and confirm,
# or submit raised after DB write but before offer_sequence capture).
# Without an offer_sequence, the reconciler cannot match them against
# the ledger and they sit forever. (FLAG-029b — orphan c7e14e73 class.)
for order in self._state.get_orders_by_status(OrderStatus.SUBMITTED):
if order.offer_sequence is not None:
continue
try:
updated = datetime.fromisoformat(order.updated_at)
if updated.tzinfo is None:
updated = updated.replace(tzinfo=timezone.utc)
except (ValueError, TypeError):
continue
if updated > cutoff:
continue
self._state.update_order_status(
order.id,
OrderStatus.CANCELED,
failure_reason="startup: force-canceled — SUBMITTED without offer_sequence (submit orphan)",
)
log.warning(
"Startup: force-canceled stuck SUBMITTED order (no offer_sequence)",
extra={
"order_id": order.id,
"side": order.side.value,
"updated_at": order.updated_at,
},
)
count += 1
Cutoff: the existing function uses self._STARTUP_STUCK_ORDER_CUTOFF_SECONDS (30 seconds). That's too short for the c7e14e73 case — the audit doc asks for ">7 days old". I'll either:
- Introduce a second cutoff constant _STARTUP_SUBMITTED_ORPHAN_CUTOFF_SECONDS = 7 * 24 * 3600 specific to Category 3, or
- Reuse the 30-second cutoff (anything older than 30s is plausibly orphaned since SUBMITTED→ACTIVE is sub-second on a healthy gateway).
I lean toward the second cutoff (7 days) to stay conservative — a slow ledger or a partial-confirmation race could leave a SUBMITTED row briefly. 7 days is well past any plausible in-flight window. Flagging for Vesper's call.
What I will NOT do¶
- Touch the reconciler's intentional exclusion of SUBMITTED. That exclusion is correct — no offer_sequence means no match key. Cleanup belongs at startup, not in the reconciler.
- Query the live DB or mutate any row on disk. The fix is a code change; Katja's existing c7e14e73 row will be swept by the next live/paper launch (subject to the ≥7d age guard) once the branch lands.
Test plan — two tests¶
tests/test_flag_029_async_safety.py(new)- Patches
xrpl.transaction.submit_and_waitto be (a) a regular function returning a plain object, (b) anasync def, (c) a sync function returning a coroutine. - Asserts
_submit_and_wait_safepasses through in case (a), raisesRuntimeErrorin cases (b) and (c). -
Asserts
XRPLGateway.__init__raisesGatewayErrorwhensubmit_and_waitisasync def. -
tests/test_flag_029b_orphan_cleanup.py(new) - Seeds a DB with: one SUBMITTED order older than the cutoff with
offer_sequence=None(should be canceled); one SUBMITTED order younger than the cutoff (should be left alone); one SUBMITTED order older than the cutoff withoffer_sequence=12345set (should be left alone — reconciler's job); one ACTIVE order (should be left alone). - Runs
_startup_force_cancel_stuck_orders, asserts only the first row transitions to CANCELED with the expectedfailure_reasonand the other three are unchanged. - Asserts idempotence: second run does nothing.
Both tests live in tests/ (not Archive/). No new fixtures outside the DB seeding pattern already used by tests/test_flag_034_display_overlay.py.
Commit structure — two commits¶
feat(xrpl-gateway): pin xrpl-py <3.0 and guard submit_and_wait against async migrationpyproject.toml:"xrpl-py>=2.4"→"xrpl-py>=2.4,<3.0"neo_engine/xrpl_gateway.py: module-level_submit_and_wait_safe, three call-site swaps,XRPLGateway.__init__smoke check-
tests/test_flag_029_async_safety.py: new -
fix(main-loop): force-cancel SUBMITTED orders with no offer_sequence at startup (FLAG-029b) neo_engine/main_loop.py: Category 3 branch + new_STARTUP_SUBMITTED_ORPHAN_CUTOFF_SECONDS = 7 * 24 * 3600constanttests/test_flag_029b_orphan_cleanup.py: new
Two commits, not one, for the same reason as Branch #2: if either needs to be reverted, the other stays standing.
Open questions for Vesper¶
- xrpl-py pin ceiling:
<3.0(my default) vs<4.0(looser)? - Category 3 cutoff: 7 days (my default, conservative) vs reuse the existing 30-second cutoff (aggressive self-heal)?
- Commit ordering as above, or swap (orphan first, async-pin second)? I don't see a strong argument either way — the two commits don't interact.
Open question for Atlas¶
None. This is hardening, no strategy surface touched.
Open question for Katja¶
Before Commit 2 lands on local-katja-main you should have Branch #4 compile in your box but not immediately run a live session. The first live launch after this merge will sweep c7e14e73 (good), but I'd like you to git diff --stat + read the cancel's failure_reason post-launch so there's a human-readable audit trail before we go further. Not blocking — just a "watch this happen" moment.
If any of the above looks off, push back before I cut code. Otherwise greenlight and I'll proceed.
— Orion