
[C] Orion Audit — Pre Phase 7.3 Full Engine Audit

To: Orion (he/him)
CC: Katja (Captain), Atlas
From: Vesper
Date: 2026-04-18


Context

S39 (30-min system integrity check) completed successfully. Primary objective confirmed: ended_at is now written on clean shutdown. However, S39 surfaced two anomalies that require investigation before Phase 7.3 begins. Katja has also requested a full engine audit — wiring, config loading, dead code, and open flags — before the next live run.

Gate: No Phase 7.3 run until this audit is complete and findings are resolved or explicitly deferred by Atlas.


Audit Item 1 — Config Loading: max_xrp_exposure Mismatch (PRIORITY)

Observation:
- config/config_live_stage1.yaml has max_xrp_exposure: 150.0 (on disk, confirmed via git diff — no uncommitted changes)
- S39 DB shows halt_reason = 'xrp exposure limit (115.20 > 100.0)' — the engine used 100.0, not 150.0
- Dashboard showed "Engine halted — session ended" and the terminal showed ~30 min elapsed — consistent with a duration halt, not an exposure halt
- git log -- config/config_live_stage1.yaml (without --all) returns no output — no commits on main directly modify this file. With --all, commits exist on other branches, including 9ac3561 fix: raise max_xrp_exposure 100->150

What we need to know:
1. What value does self._config.risk.max_xrp_exposure actually hold at engine startup? Add a startup log line that prints the loaded risk limits.
2. Is the config loading path reading from the correct YAML section? Confirm risk_raw.get("max_xrp_exposure") resolves correctly.
3. Is there any path where an old value (100.0) could be persisted — the engine_state DB table, a cached .pyc, or a hardcoded fallback?
4. Is 9ac3561 (the commit raising max_xrp_exposure to 150.0) actually in the ancestry of the current HEAD? Run git merge-base --is-ancestor 9ac3561 HEAD to confirm (exit status 0 means yes; the command prints nothing).
5. The halt diagnosis at main_loop.py line 800 runs the exposure check only when risk_status == RiskStatus.HALT, but duration elapsed doesn't go through RiskStatus.HALT. Could there have been a tick where exposure was briefly over 100.0 and triggered a real halt (not just the diagnosis), with the duration also elapsing near the same time?

Deliverable: Log the loaded config risk limits at startup. Confirm or deny that the engine loaded 150.0 for S39. If it loaded 100.0, identify why.
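A minimal sketch of the requested startup logging, assuming a RiskConfig dataclass mirroring the risk section and the risk_raw.get(...) access pattern named above; the field names and the 100.0 fallback are illustrative, not the engine's actual code. Note how a silent hardcoded fallback like this would reproduce the S39 symptom if the loader read the wrong YAML section:

```python
import logging
from dataclasses import dataclass

# Hypothetical mirror of the engine's risk config section; field names
# are taken from the audit text, not from config.py itself.
@dataclass
class RiskConfig:
    max_xrp_exposure: float
    max_rlusd_exposure: float

def load_risk_config(raw: dict) -> RiskConfig:
    risk_raw = raw.get("risk", {})
    # Suspect pattern: a hardcoded default (100.0) silently masks a
    # mis-keyed or missing YAML section.
    cfg = RiskConfig(
        max_xrp_exposure=float(risk_raw.get("max_xrp_exposure", 100.0)),
        max_rlusd_exposure=float(risk_raw.get("max_rlusd_exposure", 100.0)),
    )
    # The startup log line requested by Audit Item 1: make the loaded
    # limits visible so a 100.0-vs-150.0 mismatch is caught immediately.
    logging.getLogger("engine").info(
        "risk limits loaded: max_xrp_exposure=%.1f max_rlusd_exposure=%.1f",
        cfg.max_xrp_exposure, cfg.max_rlusd_exposure,
    )
    return cfg
```

If the S39 startup log had carried this line, the 100.0-vs-150.0 question would already be answered.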


Audit Item 2 — Inventory Snapshot at Shutdown (PRIORITY)

Observation:
- Terminal ending inventory: 112.56 XRP + 176.03 RLUSD = 337 RLUSD total
- Expected from known wallet state: ~197 RLUSD total (post-injection wallet: ~67 XRP + ~97 RLUSD)
- Dashboard at 12 min into S39 showed: 67 XRP + 100 RLUSD = ~197 RLUSD total
- With 8 fills (4/4, balanced net), inventory should be close to the starting state

Hypothesis: get_snapshot() at shutdown includes XRP and RLUSD locked in open XRPL offers (both the working buy order and any sell order reserves not yet settled on-chain). If the engine queried the on-chain balance AFTER cancels but before the ledger confirmed those cancels, it might have captured inflated balances.

What we need to know:
1. In _shutdown(), when is get_snapshot() called — before or after _cancel_live_orders_on_shutdown()?
2. Does InventorySnapshot.xrp_balance include XRP reserved in open XRPL sell offers? (On XRPL, XRP in offers IS part of the account XRP balance, so yes unless explicitly excluded.)
3. Does InventorySnapshot.rlusd_balance include RLUSD locked in open buy offers? (RLUSD trust line balance should NOT include offer reserves — RLUSD in offers is subtracted from the trust line.)
4. After the shutdown cancels, are the cancelled order reserves reflected immediately in the snapshot, or does it require a ledger close?
5. Is the inventory used for PnL accounting the correct closing balance, or is it inflated by in-flight order state?

Deliverable: Confirm whether the ending inventory snapshot in the DB matches the actual settled on-chain balances after all orders are cancelled and confirmed. If not, fix the snapshot timing so the DB records the true post-session wallet state.
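If the hypothesis holds, one fix shape is to defer the final snapshot until the ledger shows no open offers. A sketch, with fetch_snapshot and count_open_offers as hypothetical callables standing in for the engine's gateway queries (the real shutdown path is async and will differ):

```python
import time

def wait_for_settled_snapshot(fetch_snapshot, count_open_offers,
                              timeout_s=10.0, poll_s=0.5):
    """Take the shutdown inventory snapshot only after all cancels have
    been confirmed on-ledger (no open offers remain), so the recorded
    balances exclude in-flight order reserves."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if count_open_offers() == 0:
            return fetch_snapshot()  # balances are now settled
        time.sleep(poll_s)
    # Timed out waiting for cancels to confirm: still record a snapshot,
    # but flag it so PnL accounting knows it may be inflated.
    snap = fetch_snapshot()
    snap["settled"] = False
    return snap
```

The timeout branch is a design choice: recording a flagged snapshot beats hanging _shutdown() indefinitely on a slow ledger close.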


Audit Item 3 — FLAG-035: WAL Checkpoint Hardening (IMPLEMENT)

Status: Atlas-approved (Apr 18). Deferred to separate branch after S39 — S39 is now complete.

Spec:
- Add a periodic PRAGMA wal_checkpoint(TRUNCATE) on a 60-second timer in the main loop
- Must not interfere with main loop timing (run on a background thread or as a non-blocking call)
- Add logging around checkpoint execution (success/failure/pages checkpointed)
- Clean shutdown coordination: ensure a checkpoint is not mid-execution when _shutdown() fires
- Do not change any strategy logic, fills, or DB schema

Constraints (Atlas-set):
- Separate branch: fix/wal-checkpoint-hardening
- 60s interval (not shorter — don't thrash the checkpoint)
- Log checkpoint stats (pages moved, pages remaining)
- No other changes in the same commit
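A sketch of the spec above; the class and logger names are placeholders, and wiring into the engine's lifecycle (start after DB init, stop before _shutdown() closes the DB) is left to the implementation. The Event-based stop satisfies the shutdown-coordination constraint: stop() returns only after any in-flight checkpoint completes.

```python
import logging
import sqlite3
import threading

class WalCheckpointer:
    """Background 60s PRAGMA wal_checkpoint(TRUNCATE), per FLAG-035.
    Opens its own connection (a sqlite3 connection must not be shared
    across threads by default)."""

    def __init__(self, db_path, interval_s=60.0):
        self._db_path = db_path
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._log = logging.getLogger("wal")

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        # join() returns only after any in-flight checkpoint finishes,
        # so _shutdown() never races a mid-execution checkpoint.
        self._thread.join()

    def _run(self):
        conn = sqlite3.connect(self._db_path)
        try:
            # Event.wait doubles as the 60s timer and the stop signal.
            while not self._stop.wait(self._interval_s):
                self.checkpoint(conn)
        finally:
            conn.close()

    def checkpoint(self, conn):
        try:
            # PRAGMA wal_checkpoint returns (busy, wal_pages, checkpointed).
            busy, wal_pages, moved = conn.execute(
                "PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
            self._log.info("wal_checkpoint: busy=%d wal_pages=%d moved=%d",
                           busy, wal_pages, moved)
            return busy, wal_pages, moved
        except sqlite3.Error as exc:
            self._log.warning("wal_checkpoint failed: %s", exc)
            return None
```

TRUNCATE mode also resets the WAL file to zero bytes on success, which keeps the on-disk WAL from growing between sessions.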


Audit Item 4 — FLAG-029: Async Warning + Orphan Reconciliation (ASSESS)

FLAG-029a: RuntimeWarning: coroutine 'submit_and_wait' was never awaited at xrpl_gateway.py:1044 during the SIGINT cancel path. xrpl-py changed submit_and_wait to return a coroutine; the cancel path doesn't await it.

FLAG-029b: The engine reconciles stale order c7e14e73 on every launch.

What we need:
1. Confirm whether 029a still fires in the S39 logs (Katja may have terminal scrollback)
2. Fix the unawaited coroutine — this is a real risk: if xrpl-py changes API behavior again, cancel errors will be silently swallowed
3. For 029b: query the orders table for c7e14e73, determine why it is never cleaned from the snapshot, and add a cleanup pass that removes stale reconciled-but-never-filled records
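The 029a fix pattern, sketched with a stand-in coroutine (the real xrpl-py submit_and_wait takes a transaction, wallet, and client, so only the bug shape carries over, not the signature):

```python
import asyncio
import logging

log = logging.getLogger("gateway")

async def submit_and_wait(tx):
    """Stand-in for xrpl-py's async submit_and_wait; models only the
    'returns a coroutine' behavior that triggered FLAG-029a."""
    await asyncio.sleep(0)
    if tx.get("fail"):
        raise RuntimeError("submit failed")  # hypothetical engine error
    return {"validated": True}

async def cancel_order(tx):
    # BUG (FLAG-029a): calling submit_and_wait(tx) bare creates a
    # coroutine that never runs — the cancel is never submitted and any
    # error is silently swallowed, surfacing only as a RuntimeWarning.
    # FIX: await the call and surface failures explicitly.
    try:
        result = await submit_and_wait(tx)
        log.info("cancel validated: %s", result)
        return result
    except RuntimeError as exc:
        log.error("cancel failed: %s", exc)
        raise
```

With the await in place, a cancel failure raises instead of vanishing, so the SIGINT path can decide whether to retry or report.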


Audit Item 5 — Full Config Wiring Pass

Review every parameter in config/config_live_stage1.yaml against the engine code. For each parameter:
- Confirm it is read from the correct YAML section in config.py
- Confirm the engine actually uses it (not shadowed or ignored)
- Flag any parameters that exist in the YAML but aren't loaded, or are loaded but never used

Pay particular attention to:
- The risk.* section (max_xrp_exposure, max_rlusd_exposure) — the subject of Audit Item 1
- strategy.bid_offset_bps / ask_offset_bps — used in Phase 7.3
- accounting.valuation_snapshot_interval_seconds — currently 4s; confirm it's not creating excessive write pressure
- Any field with a hardcoded default in config.py that overrides the YAML
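Part of this pass can be mechanized. A sketch that compares a YAML section's keys against the fields of a config dataclass; RiskConfig here is a hypothetical mirror of the real config.py section, and "used by the engine" still needs a manual or grep-based check:

```python
from dataclasses import dataclass, fields

# Hypothetical stand-in for one config.py section; the audit would run
# this against the engine's actual dataclasses.
@dataclass
class RiskConfig:
    max_xrp_exposure: float = 100.0
    max_rlusd_exposure: float = 100.0

def wiring_report(section_cls, yaml_section: dict) -> dict:
    """Flag YAML keys that are never loaded into the dataclass, and
    dataclass fields absent from the YAML (i.e. running on hardcoded
    defaults, the failure mode suspected in Audit Item 1)."""
    field_names = {f.name for f in fields(section_cls)}
    yaml_keys = set(yaml_section)
    return {
        "unloaded_yaml_keys": sorted(yaml_keys - field_names),
        "defaulted_fields": sorted(field_names - yaml_keys),
    }
```

Running this over every section at startup and logging a non-empty report would catch both failure modes flagged above.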


Audit Item 6 — Dead Code + Stale Files

From the cleanup branch (Apr 18), max_inventory_usd was retired. Confirm:
- No other deprecated fields remain in the config.py dataclasses
- No stale .py files (e.g., main_loop_Old.py, strategy_engine_old.py) exist in neo_engine/
- No tests referencing retired parameters still pass by coincidence
- Run grep -r "max_inventory_usd" . — confirm zero results
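The grep step can also live in a test so regressions are caught automatically. A sketch (the needle and suffix list are just the ones named above):

```python
import pathlib

def find_references(root, needle="max_inventory_usd",
                    suffixes=(".py", ".yaml")):
    """Python equivalent of the requested grep: list files under root
    that still reference a retired parameter. Should return []."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue  # unreadable file: skip rather than fail the scan
            if needle in text:
                hits.append(str(path))
    return sorted(hits)
```

Asserting find_references(".") == [] in the test suite turns the one-off cleanup check into a permanent guard.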


Audit Item 7 — Halt Reason Classification

The current if-elif chain at main_loop.py:797–811 classifies the halt reason when risk_status == RiskStatus.HALT. But:
- Duration elapsed does NOT go through RiskStatus.HALT — it has its own stop path
- If the risk engine returns HALT on the same tick that duration would have fired, halt_reason captures the risk condition, not the duration

Action: Add "duration elapsed" as an explicit halt_reason when the session stops due to elapsed >= duration. Currently it may show as another condition or store the last risk check. This makes the DB session records accurate and avoids misleading halt reasons like S39's.
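A sketch of the classification fix; RiskStatus and the argument names are assumed from the text, and checking duration first is one possible design choice (it means a risk HALT coinciding with expiry on the same tick is recorded as duration elapsed, the opposite of S39's behavior):

```python
from enum import Enum

class RiskStatus(Enum):  # names taken from the audit text
    OK = "ok"
    HALT = "halt"

def classify_halt_reason(risk_status, risk_reason,
                         elapsed_s, duration_s) -> str:
    """Record 'duration elapsed' explicitly instead of falling through
    to the last risk check. risk_reason is a hypothetical string from
    the risk engine, e.g. 'xrp exposure limit (115.20 > 100.0)'."""
    # Duration has its own stop path, so classify it before the risk
    # chain; otherwise a coincident HALT wins and the DB stores a
    # misleading reason, as in S39.
    if elapsed_s >= duration_s:
        return "duration elapsed"
    if risk_status == RiskStatus.HALT:
        return risk_reason
    return "unknown"
```

If Atlas prefers the risk condition to win on a genuine same-tick collision, the two checks swap; either way the DB stops recording a risk reason for a plain timeout.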


Scope Constraints

Do not change in this audit:
- Strategy logic, offsets, spreads, sizing
- Fill calculation paths
- Phase 7.2 CLOB switch logic
- DB schema (additive changes are OK if justified)

All changes go through the standard branch → PR → merge process. Each item above should be a separate branch unless tightly coupled.


Reporting

Return a brief for each item: finding, fix applied, test added (if applicable), files changed. Update AGENT_CHANGE_CONTROL.md per the standard.

— Vesper