Pre-Phase-7.3 Full Engine Audit — Findings & Branch Plan¶
To: Vesper, Atlas, Katja
From: Orion
Subject: Per-item findings, proposed branch plan, Atlas #12 go/no-go mapping
This memo consolidates investigation of Vesper's 7 items + Atlas's 12-point addendum. Items 1 and 2 were already delivered in the prior closure memo; both are restated here for completeness. Nothing below is committed beyond the session-closure fix already on fix/session-closure-ended-at. Each finding proposes a separate branch so Vesper/Katja can review and gate each one.
Sandbox caveat. My local repo sits on fix/session-closure-ended-at (branched pre-merge). On Katja's disk, main already contains fix/cleanup-and-anchor-instrumentation (commit ac33de6 and friends), feat/phase-7.2-clob-switch, and fix/session-closure-ended-at. Two findings below are stale in my sandbox but already resolved on disk; I flag them explicitly so no one re-does work.
Item 1 — max_xrp_exposure config 100/150 mismatch (PRIORITY) — RESOLVED (ghost, not bug)¶
Finding. Config loading is clean. config.py:382 reads max_xrp_exposure from YAML with no hardcoded fallback below 10000. The S39 halt message "xrp exposure limit (115.20 > 100.0)" is a ghost from the engine_state.halt.reason key, not the live config.
Root cause. main_loop._shutdown() lines 648–652:
```python
existing_reason = self._state.get_engine_state("halt.reason") or ""
_hr = halt_reason or existing_reason or "unexpected halt"
if _hr != existing_reason:
    self._state.set_engine_state("halt.reason", _hr)
```
The halt.reason key persists across sessions and is only rewritten when the new reason differs from the existing. Duration-elapsed shutdowns in run_paper_session.py:350 call engine._shutdown("Paper session duration elapsed") with no halt_reason kwarg — so _hr becomes existing_reason, the if is false, and the prior session's risk-halt string survives intact.
The 100.0 figure comes from a session predating the bump to 150.0. At S39 runtime, the strategy engine and risk gate used 150.0 correctly.
Fix (tied to Item 7 below):
1. run_paper_session.py:350 → pass an explicit halt_reason="duration_elapsed" on clean completion.
2. _shutdown() → always write the chosen reason (no != gate); the old logic was a defensive no-op that caused the ghost.
3. Add a startup step in main_loop._startup() to clear halt.reason after a fresh create_session so a new session never inherits the prior session's halt text.
Branch: fix/halt-reason-lifecycle — 1 commit, 3 tests (duration shutdown writes duration_elapsed, startup clears stale halt.reason, risk halt still persists mid-session).
Atlas #12 map: halt classification corrected ✅ (once this lands).
Item 2 — Inventory snapshot at shutdown (PRIORITY) — RESOLVED (report-path bug, not engine bug)¶
Finding. The S39 terminal display of 112.56 XRP / 176.03 RLUSD vs wallet 66.820 / 97.610 is not an engine inventory error. It is a summarize_paper_run.py reporting bug introduced by FLAG-030.
Root cause. inventory_manager.py:295-296 strips the capital overlay before persisting:
```python
fills_only_new_xrp = new_xrp - self._xrp_capital_overlay
fills_only_new_rlusd = new_rlusd - self._rlusd_capital_overlay
```
So inventory_ledger.new_balance holds fills-only balances. The engine's in-memory _xrp_balance is fills + capital_overlay (correct for trading decisions).
summarize_paper_run._get_inventory_balance reads inventory_ledger.new_balance directly and does not add back the capital_events overlay. For a session after a capital injection, the summary under-reports by the injection amount and its math can drift in the opposite direction depending on the session's net fill flow.
Note on the specific S39 numbers. 112.56 XRP > 67 XRP wallet means the summary is showing an inventory state that never matched on-chain reality, which is the symptom of using the wrong ledger math — new_balance plus an un-reconciled overlay view. get_snapshot() itself is correct; the engine never queries XRPL for balances mid-run (inventory is fills+overlay by design, decoupled from on-chain reserves). The shutdown-time open-order reserve theory is ruled out — engine inventory doesn't include reserves.
Fix:
1. summarize_paper_run._get_inventory_balance → after reading new_balance, sum capital_events.amount (deposit=+, withdrawal=−, filtered to the session's time window) and add to XRP / RLUSD as appropriate.
2. Add a terminal-display invariant: |summary_total_value − engine.get_snapshot().total_value_in_rlusd| < 0.01 × total_value at shutdown. Log ERROR + write to engine_state.inventory_drift_at_shutdown if violated.
3. Add a session summary line: Inventory reconciliation: fills-only=X, capital_overlay=Y, engine_total=Z — makes the math visible.
Branch: fix/summarize-paper-run-capital-overlay — 1 commit, 3 tests (session with deposit shows correct end-state, session with withdrawal, session with no capital events regresses to current behavior).
Atlas #12 map: inventory snapshot validated ✅ (once this lands + S40 post-fix run confirms reconciliation invariant).
Item 3 — FLAG-035 WAL checkpoint hardening (IMPLEMENT)¶
Current state. state_manager._connect() opens WAL with PRAGMA journal_mode=WAL; PRAGMA foreign_keys=ON;. No periodic checkpoint. FLAG-027 (pre-run backup) + FLAG-033 (startup quick_check) are the two mitigations for mid-WAL crashes; neither truncates the WAL during a run.
Proposed design (Atlas-approved).

- Timer thread in `StateManager`:

  ```python
  self._checkpoint_stop = threading.Event()
  self._checkpoint_thread = threading.Thread(
      target=self._checkpoint_loop, daemon=True, name="wal-checkpoint",
  )
  ```

  Started by a new `start_checkpoint_timer(interval_s=60)` call from `main_loop._startup()` after `initialize_database()`. Stopped by `self._checkpoint_stop.set()` + `join(timeout=5)` in `StateManager.close()`.
- Loop body:

  ```python
  def _checkpoint_loop(self) -> None:
      while not self._checkpoint_stop.wait(self._checkpoint_interval_s):
          try:
              t0 = time.monotonic()
              # PASSIVE first so we never block a writer mid-transaction.
              row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
              dt_ms = (time.monotonic() - t0) * 1000
              log.info("wal_checkpoint", extra={
                  "mode": "PASSIVE",
                  "busy": row[0],
                  "log_frames": row[1],
                  "checkpointed": row[2],
                  "elapsed_ms": round(dt_ms, 1),
              })
          except Exception as exc:
              log.error("wal_checkpoint failed", extra={"error": str(exc)}, exc_info=True)
  ```

  `PASSIVE` = non-blocking. If there's an active transaction, it returns `busy=1` and moves on; we retry in 60s. `TRUNCATE` runs at shutdown only (see the clean-shutdown bullet below); running `TRUNCATE` periodically can stall for pending readers.
- Thread-safety. The connection is opened with `check_same_thread=False`. SQLite's serialized mode makes `PRAGMA` calls from a background thread safe alongside writes from the main thread, but I want to pin this with a test rather than trust the docs. Test: concurrent `fills` inserts from the main thread while the checkpoint thread runs at a 100ms interval — 1000 writes, expect zero lost writes and `PRAGMA integrity_check` = ok.
- Clean-shutdown checkpoint. In `StateManager.close()`, before `self._conn.close()`, run `PRAGMA wal_checkpoint(TRUNCATE)`. Truncating at shutdown means the WAL is empty when the process exits cleanly — the next FLAG-027 backup is a single-file snapshot, no WAL replay needed.
- Config surface. `config.engine.wal_checkpoint_interval_seconds: int = 60`, with 0 = disabled for tests. Default 60s is conservative (matches the Vesper/Atlas ruling).
What this does NOT fix. CTRL_CLOSE_EVENT (Windows terminal X-button) is still uncatchable — FLAG-027 backup remains the backstop. But the corruption window shrinks: on average the WAL holds ≤60s of writes at the moment of an uncatchable kill, vs indefinite today.
Branch: fix/wal-checkpoint-hardening — 1 commit (state_manager.py, main_loop.py wiring, config.py surface), 3 tests (periodic checkpoint runs + logs, concurrent write + checkpoint integrity, shutdown TRUNCATE leaves empty WAL).
Atlas #12 map: no silent failures remain (partial — the checkpoint-failure path now logs ERROR).
Item 4 — FLAG-029 async warning + orphan c7e14e73 (INVESTIGATE / FIX)¶
FLAG-029a — RuntimeWarning: coroutine 'submit_and_wait' was never awaited¶
Finding. In the current sandbox, all three submit_and_wait call sites (xrpl_gateway.py:748, :925, :1041) are synchronous — they import from xrpl.transaction, get back a Response, and branch on tx_result. None of them produce an unawaited coroutine today.
Risk. xrpl-py ≥ 3.x has a deprecation path where submit_and_wait may migrate to async-only. If the version bumps silently, the sync call returns a coroutine object instead of a Response, response.result raises AttributeError, and the cancel path returns failure_reason="submit_and_wait raised: ..." — a soft failure that doesn't halt the engine. This is exactly the silent-failure pattern FLAG-029a warned about.
Fix:
1. Pin xrpl-py version in requirements.txt (currently unpinned or loose; confirm).
2. Add a _submit_and_wait_safe(tx, client, wallet) helper that detects coroutine returns and raises immediately with a clear error (rather than letting AttributeError leak).
3. Add a one-line smoke check at gateway init: call inspect.iscoroutinefunction(submit_and_wait) and log ERROR + refuse to start if True.
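Steps 2 and 3 can be sketched as one guard (the `submit_and_wait_safe` name is from the proposal; the wrapper assumes nothing about xrpl-py beyond "some callable", so the sketch is library-agnostic):

```python
import inspect

# Sketch of the FLAG-029a guard: fail loudly if submit_and_wait silently
# became async, instead of leaking an unawaited coroutine into the cancel path.

def submit_and_wait_safe(submit_fn, *args, **kwargs):
    """Call submit_fn; raise immediately if it is (or returns) a coroutine."""
    if inspect.iscoroutinefunction(submit_fn):
        # Gateway-init smoke check: refuse to run against an async-only API.
        raise RuntimeError(
            "submit_and_wait is a coroutine function; xrpl-py API changed "
            "-- pin the version or move the gateway to the async client"
        )
    result = submit_fn(*args, **kwargs)
    if inspect.iscoroutine(result):
        # Close it so no 'never awaited' RuntimeWarning fires later.
        result.close()
        raise RuntimeError("submit_and_wait returned a coroutine, not a Response")
    return result
```

This converts the soft `failure_reason="submit_and_wait raised: ..."` path into a hard, immediate failure at the first call site, which is the behavior FLAG-029a asks for.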
FLAG-029b — Orphan order c7e14e73¶
Finding. Reconciler re-reads the same stale order every launch. Need to inspect the orders table on Katja's disk:
```sql
SELECT id, status, created_at, submit_tx_hash, offer_sequence, failure_reason
FROM orders WHERE id LIKE 'c7e14e73%';
```
If status = 'SUBMITTED' with no offer_sequence and created >7 days ago, it's a pre-FLAG-027 submit-that-never-validated. Safe to force to CANCELED with failure_reason = 'orphan cleanup 2026-04-18'.
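The backfill can be sketched as a one-off helper (column names are taken from the query above; that `created_at` stores an epoch timestamp, and the `cancel_orphan` name itself, are assumptions for illustration):

```python
import sqlite3
import time

# Sketch of the orphan backfill: only rows matching the orphan signature
# (SUBMITTED, no offer_sequence, older than the cutoff) are touched.

def cancel_orphan(conn, order_prefix, max_age_s=7 * 86400):
    """Force pre-FLAG-027 orphans to CANCELED; returns rows updated."""
    cutoff = time.time() - max_age_s
    cur = conn.execute(
        "UPDATE orders SET status = 'CANCELED', "
        "failure_reason = 'orphan cleanup 2026-04-18' "
        "WHERE id LIKE ? AND status = 'SUBMITTED' "
        "AND offer_sequence IS NULL AND created_at < ?",
        (order_prefix + "%", cutoff),
    )
    conn.commit()
    return cur.rowcount
```

Returning the rowcount lets the migration assert it touched exactly one row before committing to Katja's disk.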
Branch: fix/flag-029-async-pin-and-orphan — 2 commits (async safety, orphan backfill migration), 2 tests (iscoroutinefunction smoke, orphan query happy/empty path).
Atlas #12 map: async issues resolved ✅ (once pinned + smoke lands).
Item 5 — Full config wiring pass¶
Method. For each key in config_live_stage1.yaml: verify (a) parsed in config.py, (b) used by ≥1 engine module, (c) no hardcoded fallback that can mask a missing YAML value. Result summary (full table in a follow-up commit to this memo; I've spot-checked the high-risk keys):
| Key | YAML | Parsed | Consumed by | Notes |
|---|---|---|---|---|
| `risk.max_xrp_exposure` | ✅ | config.py:382 | main_loop:801 | Clean. |
| `risk.max_rlusd_exposure` | ✅ | config.py (search) | main_loop:803 | Clean. |
| `strategy.max_inventory_usd` | ⚠️ on disk post-merge = REMOVED | — | — | Sandbox stale; commit ac33de6 retires it. Verify `grep -rn 'max_inventory_usd' .` returns 0 on current main. |
| `strategy.anchor_max_divergence_bps` | ✅ (10.0) | config.py:413 | strategy_engine | Phase 7.2 CLOB switch reads this as the 3 bps→CLOB threshold? Verify — Vesper's item specifically calls out that Phase 7.2 uses a 3 bps constant, not anchor_max_divergence_bps. Need to confirm the CLOB switch threshold is sourced from config, not hardcoded. |
| `order_size.base_size_rlusd` | ✅ (15.0 on disk) | config.py | strategy_engine | Clean. |
| `engine.tick_interval_seconds` | ✅ (4) | config.py | main_loop | Clean. |
| `strategy.requote_interval_seconds` | ✅ (4) | config.py | main_loop sleep | Must match tick_interval_seconds — no config-level assertion today; add one. |
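The tick/requote pairing flagged in the last row can be enforced at load time; a minimal sketch against a plain-dict config (the real check would live in config.py right after YAML parsing, and the `validate_intervals` name is illustrative):

```python
# Sketch: fail fast at config load if the two intervals diverge, instead of
# letting a mismatched requote cadence surface mid-session.

def validate_intervals(cfg):
    tick = cfg["engine"]["tick_interval_seconds"]
    requote = cfg["strategy"]["requote_interval_seconds"]
    if tick != requote:
        raise ValueError(
            f"strategy.requote_interval_seconds ({requote}) must match "
            f"engine.tick_interval_seconds ({tick})"
        )
```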
Action: follow-up commit with full table and any findings. One concrete candidate: the 3 bps CLOB-switch threshold. If it's hardcoded in strategy_engine.py, promote it to strategy.clob_switch_threshold_bps so Phase 7.3 can tune without code changes.
Branch: audit/config-wiring-pass — read-only audit commit (plus the 3 bps surface if found hardcoded).
Atlas #12 map: config mismatch resolved ✅ (pending the one verification above).
Item 6 — Dead code + stale files scan¶
Finding. Sandbox shows these stale files in neo_engine/:
- main_loop_Old.py (2571-line old main loop)
- strategy_engine_old.py
Both were scrubbed of max_inventory_usd in ac33de6 (per commit message: "archive scrub for grep hygiene") but the files themselves were not deleted. They exist only to survive grep — but they also present a foot-gun: any future audit that greps for patterns will get hits in archive files and report false positives.
Fix: move both to neo_engine/Archive/ (already a top-level Archive/ exists) so they're on-disk but outside the importable module path. Also audit the repo root for:
- NEO Back up/ (folder with trailing-space name; contains a stale main_loop.py) — move to Archive/.
- neo_simulator/simulation_runner.bak.py — delete; superseded by simulation_runner.py.
- .fuse_hidden* files — delete (stale sandbox artifacts).
- <MagicMock ...> files in repo root — delete (from a misconfigured test run that leaked stdout into filenames).
Branch: chore/archive-cleanup — 1 commit, no tests, just file moves + .gitignore pattern for .fuse_hidden* and <MagicMock *>.
Atlas #12 map: no silent failures remain (partial — dead-path reduction).
Item 7 — halt_reason classification (duration-elapsed path)¶
Finding. Covered under Item 1. Currently run_paper_session.py:350 passes no halt_reason; the _shutdown() != gate preserves whatever was last written. Result: the sessions.halt_reason column is polluted with stale risk-halt text even on clean completions, and the dashboard / Experiment Log cannot distinguish "session ended naturally" from "session halted on risk."
Fix: (same branch as Item 1, fix/halt-reason-lifecycle)
- Pass `halt_reason="duration_elapsed"` at run_paper_session.py:350.
- Pass `halt_reason="interrupted_sigint"` / `"interrupted_sigterm"` / `"interrupted_sigbreak"` at the signal-handler call site.
- Pass `halt_reason="engine_requested_halt"` at run_paper_session.py:339.
- `_shutdown()` → always call `set_engine_state("halt.reason", _hr)` — remove the `!=` gate.
- `_startup()` → if this is a fresh session (no recovery parent), clear `halt.reason` so the new session starts clean.
Halt reason taxonomy (proposed). Single source of truth for dashboard/Experiment Log parsing:
| Reason | Emitted by | Semantics |
|---|---|---|
| `duration_elapsed` | run_paper_session clean exit | Normal completion |
| `engine_requested_halt` | engine returned `_tick() == False` | Strategy/reconciler decided to halt |
| `risk_xrp_exposure` | main_loop risk gate | XRP value > cap |
| `risk_rlusd_exposure` | main_loop risk gate | RLUSD > cap |
| `risk_rpc_failure` | main_loop risk gate | last RPC failed |
| `risk_stale_ledger` | main_loop risk gate | ledger age > threshold |
| `risk_gateway_unhealthy` | main_loop risk gate | gateway health check failed |
| `kill_switch` | kill_switch.py | explicit HALT |
| `reconciler_halt` | main_loop reconciler | fill_errors or ambiguous orders |
| `interrupted_<sig>` | signal handler | Ctrl+C / Break / TERM |
| `startup_failure` | _startup exception path | cannot fetch balances / offers |
| `unexpected_halt` | fallback | should never appear post-fix |
The existing free-form strings (e.g. "xrp exposure limit (115.20 > 100.0)") move into a separate halt.detail engine_state key so the numeric context survives but the classification is machine-parseable.
Atlas #12 map: halt classification corrected ✅ (full resolution requires this + #12 Item 1).
Atlas #9 — per-tick distance-to-touch diagnostic (NEW)¶
Scope. On every tick, compute and log:
- distance_to_clob_bid_bps — (our_bid − clob_best_bid) × 10000 / mid
- distance_to_clob_ask_bps — (clob_best_ask − our_ask) × 10000 / mid
- within_2bps_bid, within_2bps_ask — booleans
Session summary additions (at _log_session_summary):
- % ticks within 2 bps of bid touch / ... ask touch
- mean / median distance-to-touch both sides
- histogram bucketed: [0–2, 2–5, 5–10, 10+] bps
Why. Phase 7.3 is offset calibration. Without this metric we're guessing at our spread competitiveness; with it, a 2-hour session yields a distribution we can use to set bid/ask offsets target-first instead of parameter-first.
Implementation. Add computation in main_loop._tick() after CLOB anchor resolution, write to a new tick_distance_to_touch engine_state telemetry key (rolling window) and to each market_snapshots row via two new columns distance_to_bid_touch_bps, distance_to_ask_touch_bps. Aggregation at shutdown.
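The per-tick computation above can be sketched as (formulas from the scope list; treating "within 2 bps" as absolute distance is an assumption pending Atlas's confirmation, as is the `distance_to_touch` helper name):

```python
# Sketch: per-tick distance-to-touch metrics. Positive bid distance means our
# bid improves the CLOB touch; negative means we sit behind it.

def distance_to_touch(our_bid, our_ask, clob_best_bid, clob_best_ask):
    mid = (clob_best_bid + clob_best_ask) / 2
    bid_bps = (our_bid - clob_best_bid) * 10_000 / mid
    ask_bps = (clob_best_ask - our_ask) * 10_000 / mid
    return {
        "distance_to_clob_bid_bps": bid_bps,
        "distance_to_clob_ask_bps": ask_bps,
        # Assumption: "within 2 bps of touch" = absolute distance <= 2.
        "within_2bps_bid": abs(bid_bps) <= 2.0,
        "within_2bps_ask": abs(ask_bps) <= 2.0,
    }
```

The shutdown aggregation then reduces the per-tick rows to the percentage, mean/median, and [0–2, 2–5, 5–10, 10+] bps histogram described above.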
Branch: feat/distance-to-touch-diagnostic — 1 commit, 2 tests (computation correctness vs known mid/bid/ask, session summary math).
Atlas #12 map: Phase 7.3 calibration input ✅ — prerequisite for informed offset tuning.
Proposed Branch Plan (merge order)¶
All branches cut from current main (post-session-closure merge). Merge order below maximizes independence and minimizes conflicts:
| # | Branch | Risk | Depends on | Estimated scope |
|---|---|---|---|---|
| 1 | fix/halt-reason-lifecycle | low | — | 1 commit + 3 tests |
| 2 | fix/summarize-paper-run-capital-overlay | low | — | 1 commit + 3 tests |
| 3 | chore/archive-cleanup | low | — | 1 commit (moves only) |
| 4 | fix/flag-029-async-pin-and-orphan | low | — | 2 commits + 2 tests |
| 5 | audit/config-wiring-pass | low | 1 (merges cleanly after) | 1 commit + verification table |
| 6 | feat/distance-to-touch-diagnostic | medium | — | 1 commit + 2 tests |
| 7 | fix/wal-checkpoint-hardening | medium-high | — | 1 commit + 3 tests (concurrency) |
Recommendation: land 1–5 before S40 (any duration ≥ 30m with the Phase 7.2 config). Land 6 before Phase 7.3 data collection starts. Land 7 on its own branch with an isolated paper-mode shakedown before any live run — concurrent-checkpoint-during-write is the one item here I treat as load-bearing infrastructure.
Atlas #12 Go/No-Go Mapping¶
| Atlas requirement | Addressed by | Status after merge |
|---|---|---|
| Config mismatch resolved | Item 1 + Item 5 | ✅ |
| Inventory snapshot validated vs XRPL | Item 2 + shutdown invariant | ✅ pending S40 confirmation |
| Shutdown sequence verified | Session-closure fix (already merged) + Item 3 TRUNCATE | ✅ |
| Async issues resolved | Item 4 | ✅ |
| Halt classification corrected | Item 1 + Item 7 | ✅ |
| No silent failures remain | log.error promotions (Items 1, 3, 4) | ✅ |
Phase 7.3 gate recommendation: proceed once branches 1, 2, 4, 5 are merged and S40 (30m, Phase 7.2 config) shows clean session closure + reconciled inventory summary. Branches 6 and 7 can be sequenced in parallel with the first Phase 7.3 session if Vesper/Atlas prefer.
Questions back to Vesper / Atlas¶
- Branch cadence. OK to land branches 1–5 as a single sequence of PRs, or should each be reviewed individually before the next is cut? My default is individual — easier to revert if any one surfaces an issue in S40.
- Phase 7.2 CLOB switch threshold. Is the 3 bps CLOB-switch threshold sourced from `anchor_max_divergence_bps` or a constant? If a constant, should it become `strategy.clob_switch_threshold_bps` in Item 5, or stay hardcoded until Phase 7.3 calibration is done?
- Distance-to-touch storage. Two new columns on `market_snapshots` vs a separate `quote_distance_log` table — mild preference for the columns (simpler, fewer joins for the dashboard); flag if you'd rather isolate.
- Atlas halt-reason taxonomy. Please confirm the proposed strings in Item 7. They'll appear on every dashboard and every Experiment Log row going forward, so the naming matters.
No engine code has changed since the session-closure merge. Awaiting review before any branch in the plan above is cut.
— Orion