Skip to content

Orion Investigation — Branch #4: fix/flag-029-async-pin-and-orphan

To: Vesper (review), Katja (approve), Atlas (FYI) From: Orion Date: 2026-04-18


What this file is

Pre-code investigation memo. Same discipline as Branch #2: map the surface, show the proposed change, get sign-off, then commit. No code has been written on chore/archive-cleanup tip yet — the tree is clean at df168f0 (Branch #3 merge mirrored locally).

Read this, push back on anything that looks wrong, and I'll cut fix/flag-029-async-pin-and-orphan after your green light.


FLAG-029a — async safety

The three production submit_and_wait call sites

Every call comes from neo_engine/xrpl_gateway.py. None exist outside this file. (Tests mock it — tests/test_task4.py; not in scope.)

# Gateway method Line Purpose
1 submit_offer_create 748 New OfferCreate (quote placement)
2 submit_offer_replace_atomic 925 OfferCreate with OfferSequence (cancel-and-replace in one tx)
3 submit_offer_cancel 1041 OfferCancel

All three do from xrpl.transaction import submit_and_wait at the top of their try block, then response = submit_and_wait(tx, client, wallet). Identical shape, three copies.

The dependency pin

No requirements.txt. Dependencies live in pyproject.toml:

dependencies = [
    "pyyaml>=6.0",
    "python-dotenv>=1.0",
    "xrpl-py>=2.4",          # ← open upper bound
]

Proposed pin: "xrpl-py>=2.4,<3.0".

Rationale: xrpl-py 2.x is the sync series. The audit-identified risk is a silent migration to async-first in 3.x, where submit_and_wait would return a coroutine that our try/except (ConnectionError, OSError, TimeoutError) block treats as a success object — quote lands on the ledger but the engine's order lifecycle writes "failure" because response.result is missing. The upper bound is a contract: "if someone bumps xrpl-py to 3.x, do it behind a deliberate porting PR, not by accident."

Optional: also add <4.0 instead of <3.0 if we want a wider net. My default is <3.0 because that's the version where the behavior is currently known to shift. I'll defer to Vesper.

The async-safety wrapper

Proposed new helper at module scope in neo_engine/xrpl_gateway.py:

import inspect

def _submit_and_wait_safe(tx, client, wallet):
    """
    Belt to the pyproject.toml upper-bound suspenders.

    Wraps xrpl.transaction.submit_and_wait. If xrpl-py ever returns a
    coroutine (async migration) instead of a SubmissionResponse, we
    raise immediately instead of silently misrouting the result and
    losing track of an on-chain order.

    Raises:
        RuntimeError: if submit_and_wait returns a coroutine or is a
                      coroutine function.
    """
    from xrpl.transaction import submit_and_wait as _saw  # type: ignore[import]

    if inspect.iscoroutinefunction(_saw):
        raise RuntimeError(
            "xrpl.transaction.submit_and_wait is a coroutine function — "
            "NEO v1 is sync-only. Check xrpl-py version pin."
        )

    response = _saw(tx, client, wallet)

    if inspect.iscoroutine(response):
        # Defensive: close the coroutine to avoid RuntimeWarning, then raise.
        response.close()
        raise RuntimeError(
            "xrpl.transaction.submit_and_wait returned a coroutine — "
            "NEO v1 expects a synchronous SubmissionResponse. "
            "Check xrpl-py version pin."
        )

    return response

Three call sites change from:

from xrpl.transaction import submit_and_wait
...
response = submit_and_wait(tx_to_submit, client, wallet)

to:

response = _submit_and_wait_safe(tx_to_submit, client, wallet)

No behavior change on the current xrpl-py 2.4.x line. The check is cheap (one iscoroutine per call) and only fires on a genuinely misbehaving dependency.

The startup smoke check

Proposed addition to XRPLGateway.__init__ (line 337 of xrpl_gateway.py), directly after the existing log.info("XRPLGateway initialized", ...):

# FLAG-029a: fail fast if xrpl-py ever migrates submit_and_wait to async.
try:
    from xrpl.transaction import submit_and_wait as _saw_probe  # type: ignore[import]
    if inspect.iscoroutinefunction(_saw_probe):
        log.error(
            "XRPL gateway refusing to start: xrpl.transaction.submit_and_wait "
            "is async under the current xrpl-py version; NEO v1 is sync-only. "
            "Pin xrpl-py to <3.0 or port the gateway."
        )
        raise GatewayError("submit_and_wait is async — sync-only engine cannot run")
except ImportError:
    # No xrpl-py installed (paper/dry-run). Let the per-call ImportError
    # handling in the submit_* methods downgrade to a SubmissionResponse.
    pass

This turns a silent quote-then-lose-it outcome into a loud "won't start." Paper/dry-run paths keep working — the ImportError branch is how the gateway already tolerates missing xrpl-py today (see line 676).


FLAG-029b — orphan c7e14e73

Why it survives

c7e14e73 is status = 'SUBMITTED' with no offer_sequence. That lands in a gap between two safety nets:

  1. The reconciler (ledger_reconciler.py:320–325) explicitly excludes SUBMITTED.

    for status in (
        OrderStatus.ACTIVE,
        OrderStatus.PARTIALLY_FILLED,
        OrderStatus.CANCEL_PENDING,
    ):
        reconcilable.extend(state.get_orders_by_status(status))
    
    By design: SUBMITTED orders have no offer_sequence yet, so there's nothing to match against ledger offers. The SUBMITTED→ACTIVE transition is owned by confirm_active() on the happy path.

  2. Startup force-cancel (main_loop.py:410 _startup_force_cancel_stuck_orders) only covers two categories:

  3. PENDING_SUBMISSION (never reached gateway)
  4. CANCEL_PENDING with cancel_tx_hash set (cancel confirmed but write not closed)

A SUBMITTED order that never got confirm_active'd — and never got its offer_sequence written — sits forever. That's the c7e14e73 failure mode.

"Reconciled every launch" = complained-about every launch, not actually processed. Anywhere the side-guard or an order-status scan iterates, this row shows up as "live" and pollutes logs.

Proposed fix — extend the existing function, don't add a one-shot

Two options considered:

  • (A) One-shot script keyed to id LIKE 'c7e14e73%'. Matches the audit memo's literal wording. Fixes exactly this row.
  • (B) Generalize _startup_force_cancel_stuck_orders with a Category 3: SUBMITTED + offer_sequence IS NULL + updated_at < cutoff. Same result for c7e14e73; also self-heals any future row that slips through the same gap. Mirrors the pattern of the existing two categories.

I recommend (B). The one-shot leaves the gap open. Category 3 closes it. The audit plan called for "orphan cleanup 2026-04-18" as the failure_reason — I'll keep that string for parity with the memo, but the mechanism generalizes.

Sketch:

# --- Category 3: SUBMITTED with no offer_sequence ---
# These orders never completed the SUBMITTED → ACTIVE transition.
# confirm_active() never fired (engine crash between submit and confirm,
# or submit raised after DB write but before offer_sequence capture).
# Without an offer_sequence, the reconciler cannot match them against
# the ledger and they sit forever. (FLAG-029b — orphan c7e14e73 class.)
for order in self._state.get_orders_by_status(OrderStatus.SUBMITTED):
    if order.offer_sequence is not None:
        continue
    try:
        updated = datetime.fromisoformat(order.updated_at)
        if updated.tzinfo is None:
            updated = updated.replace(tzinfo=timezone.utc)
    except (ValueError, TypeError):
        continue
    if updated > cutoff:
        continue
    self._state.update_order_status(
        order.id,
        OrderStatus.CANCELED,
        failure_reason="startup: force-canceled — SUBMITTED without offer_sequence (submit orphan)",
    )
    log.warning(
        "Startup: force-canceled stuck SUBMITTED order (no offer_sequence)",
        extra={
            "order_id": order.id,
            "side": order.side.value,
            "updated_at": order.updated_at,
        },
    )
    count += 1

Cutoff: the existing function uses self._STARTUP_STUCK_ORDER_CUTOFF_SECONDS (30 seconds). That's too short for the c7e14e73 case — the audit doc asks for ">7 days old". I'll either: - Introduce a second cutoff constant _STARTUP_SUBMITTED_ORPHAN_CUTOFF_SECONDS = 7 * 24 * 3600 specific to Category 3, or - Reuse the 30-second cutoff (anything older than 30s is plausibly orphaned since SUBMITTED→ACTIVE is sub-second on a healthy gateway).

I lean toward the second cutoff (7 days) to stay conservative — a slow ledger or a partial-confirmation race could leave a SUBMITTED row briefly. 7 days is well past any plausible in-flight window. Flagging for Vesper's call.

What I will NOT do

  • Touch the reconciler's intentional exclusion of SUBMITTED. That exclusion is correct — no offer_sequence means no match key. Cleanup belongs at startup, not in the reconciler.
  • Query the live DB or mutate any row on disk. The fix is a code change; Katja's existing c7e14e73 row will be swept by the next live/paper launch (subject to the ≥7d age guard) once the branch lands.

Test plan — two tests

  1. tests/test_flag_029_async_safety.py (new)
  2. Patches xrpl.transaction.submit_and_wait to be (a) a regular function returning a plain object, (b) an async def, (c) a sync function returning a coroutine.
  3. Asserts _submit_and_wait_safe passes through in case (a), raises RuntimeError in cases (b) and (c).
  4. Asserts XRPLGateway.__init__ raises GatewayError when submit_and_wait is async def.

  5. tests/test_flag_029b_orphan_cleanup.py (new)

  6. Seeds a DB with: one SUBMITTED order older than the cutoff with offer_sequence=None (should be canceled); one SUBMITTED order younger than the cutoff (should be left alone); one SUBMITTED order older than the cutoff with offer_sequence=12345 set (should be left alone — reconciler's job); one ACTIVE order (should be left alone).
  7. Runs _startup_force_cancel_stuck_orders, asserts only the first row transitions to CANCELED with the expected failure_reason and the other three are unchanged.
  8. Asserts idempotence: second run does nothing.

Both tests live in tests/ (not Archive/). No new fixtures outside the DB seeding pattern already used by tests/test_flag_034_display_overlay.py.


Commit structure — two commits

  1. feat(xrpl-gateway): pin xrpl-py <3.0 and guard submit_and_wait against async migration
  2. pyproject.toml: "xrpl-py>=2.4""xrpl-py>=2.4,<3.0"
  3. neo_engine/xrpl_gateway.py: module-level _submit_and_wait_safe, three call-site swaps, XRPLGateway.__init__ smoke check
  4. tests/test_flag_029_async_safety.py: new

  5. fix(main-loop): force-cancel SUBMITTED orders with no offer_sequence at startup (FLAG-029b)

  6. neo_engine/main_loop.py: Category 3 branch + new _STARTUP_SUBMITTED_ORPHAN_CUTOFF_SECONDS = 7 * 24 * 3600 constant
  7. tests/test_flag_029b_orphan_cleanup.py: new

Two commits, not one, for the same reason as Branch #2: if either needs to be reverted, the other stays standing.


Open questions for Vesper

  1. xrpl-py pin ceiling: <3.0 (my default) vs <4.0 (looser)?
  2. Category 3 cutoff: 7 days (my default, conservative) vs reuse the existing 30-second cutoff (aggressive self-heal)?
  3. Commit ordering as above, or swap (orphan first, async-pin second)? I don't see a strong argument either way — the two commits don't interact.

Open question for Atlas

None. This is hardening, no strategy surface touched.

Open question for Katja

Before Commit 2 lands on local-katja-main you should have Branch #4 compile in your box but not immediately run a live session. The first live launch after this merge will sweep c7e14e73 (good), but I'd like you to git diff --stat + read the cancel's failure_reason post-launch so there's a human-readable audit trail before we go further. Not blocking — just a "watch this happen" moment.

If any of the above looks off, push back before I cut code. Otherwise greenlight and I'll proceed.

— Orion