
Pre-Phase-7.3 Full Engine Audit — Findings & Branch Plan

To: Vesper, Atlas, Katja From: Orion Subject: Per-item findings, proposed branch plan, Atlas #12 go/no-go mapping

This memo consolidates investigation of Vesper's 7 items + Atlas's 12-point addendum. Items 1 and 2 were already delivered in the prior closure memo; both are restated here for completeness. Nothing below is committed beyond the session-closure fix already on fix/session-closure-ended-at. Each finding proposes a separate branch so Vesper/Katja can review and gate each one.

Sandbox caveat. My local repo sits on fix/session-closure-ended-at (branched pre-merge). On Katja's disk, main already contains fix/cleanup-and-anchor-instrumentation (commit ac33de6 and friends), feat/phase-7.2-clob-switch, and fix/session-closure-ended-at. Two findings below are stale in my sandbox but already resolved on disk; I flag them explicitly so no one re-does work.


Item 1 — max_xrp_exposure config 100/150 mismatch (PRIORITY) — RESOLVED (ghost, not bug)

Finding. Config loading is clean. config.py:382 reads max_xrp_exposure from YAML with no hardcoded fallback below 10000. The S39 halt message "xrp exposure limit (115.20 > 100.0)" is a ghost from the engine_state.halt.reason key, not the live config.

Root cause. main_loop._shutdown() lines 648–652:

existing_reason = self._state.get_engine_state("halt.reason") or ""
_hr = halt_reason or existing_reason or "unexpected halt"
if _hr != existing_reason:
    self._state.set_engine_state("halt.reason", _hr)

The halt.reason key persists across sessions and is only rewritten when the new reason differs from the existing. Duration-elapsed shutdowns in run_paper_session.py:350 call engine._shutdown("Paper session duration elapsed") with no halt_reason kwarg — so _hr becomes existing_reason, the if is false, and the prior session's risk-halt string survives intact.

The 100.0 figure comes from a session predating the bump to 150.0. At S39 runtime, the strategy engine and risk gate used 150.0 correctly.

Fix (tied to Item 7 below):

  1. run_paper_session.py:350 → pass an explicit halt_reason="duration_elapsed" on clean completion.
  2. _shutdown() → always write the chosen reason (no != gate); the old logic was a defensive no-op that caused the ghost.
  3. Add a startup step in main_loop._startup() to clear halt.reason after a fresh create_session so a new session never inherits the prior session's halt text.
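The three steps can be sketched end to end. StateStub here stands in for StateManager; get_engine_state/set_engine_state mirror the calls quoted in this memo, and everything else is a hypothetical illustration, not the real main_loop code.

```python
# Runnable sketch of the halt-reason lifecycle fix (all names assumed).
class StateStub:
    def __init__(self):
        self._kv = {}

    def get_engine_state(self, key):
        return self._kv.get(key)

    def set_engine_state(self, key, value):
        self._kv[key] = value

def shutdown(state, halt_reason=None):
    existing = state.get_engine_state("halt.reason") or ""
    reason = halt_reason or existing or "unexpected_halt"
    # Always write: the old `if reason != existing` gate is exactly what
    # let a prior session's risk-halt string survive a clean exit.
    state.set_engine_state("halt.reason", reason)

def startup(state, fresh_session=True):
    if fresh_session:
        # A fresh session must never inherit the prior session's halt text.
        state.set_engine_state("halt.reason", "")

state = StateStub()
state.set_engine_state("halt.reason", "xrp exposure limit (115.20 > 100.0)")
startup(state)                       # clears the stale ghost
shutdown(state, "duration_elapsed")  # clean completion records its own reason
print(state.get_engine_state("halt.reason"))  # -> duration_elapsed
```

A mid-session risk halt still persists: shutdown() with no explicit reason keeps whatever the risk gate last wrote, which is the behavior the third test in the branch pins down.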

Branch: fix/halt-reason-lifecycle — 1 commit, 3 tests (duration shutdown writes duration_elapsed, startup clears stale halt.reason, risk halt still persists mid-session).

Atlas #12 map: halt classification corrected ✅ (once this lands).


Item 2 — Inventory snapshot at shutdown (PRIORITY) — RESOLVED (report-path bug, not engine bug)

Finding. The S39 terminal display of 112.56 XRP / 176.03 RLUSD vs wallet 66.820 / 97.610 is not an engine inventory error. It is a summarize_paper_run.py reporting bug introduced by FLAG-030.

Root cause. inventory_manager.py:295-296 strips the capital overlay before persisting:

fills_only_new_xrp = new_xrp - self._xrp_capital_overlay
fills_only_new_rlusd = new_rlusd - self._rlusd_capital_overlay

So inventory_ledger.new_balance holds fills-only balances. The engine's in-memory _xrp_balance is fills + capital_overlay (correct for trading decisions).

summarize_paper_run._get_inventory_balance reads inventory_ledger.new_balance directly and does not add back the capital_events overlay. For a session after a capital injection, the summary under-reports by the injection amount and its math can drift in the opposite direction depending on the session's net fill flow.

Note on the specific S39 numbers. 112.56 XRP > 67 XRP wallet means the summary is showing an inventory state that never matched on-chain reality, which is the symptom of using the wrong ledger math — new_balance plus an un-reconciled overlay view. get_snapshot() itself is correct; the engine never queries XRPL for balances mid-run (inventory is fills+overlay by design, decoupled from on-chain reserves). The shutdown-time open-order reserve theory is ruled out — engine inventory doesn't include reserves.

Fix:

  1. summarize_paper_run._get_inventory_balance → after reading new_balance, sum capital_events.amount (deposit = +, withdrawal = −, filtered to the session's time window) and add to XRP / RLUSD as appropriate.
  2. Add a terminal-display invariant: |summary_total_value − engine.get_snapshot().total_value_in_rlusd| < 0.01 × total_value at shutdown. Log ERROR + write to engine_state.inventory_drift_at_shutdown if violated.
  3. Add a session summary line: Inventory reconciliation: fills-only=X, capital_overlay=Y, engine_total=Z — makes the math visible.
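Step 1 can be sketched as a single query pair. The schema here is assumed from this memo: inventory_ledger.new_balance holds the fills-only balance, and capital_events has kind ('deposit'/'withdrawal'), amount, and a ts column; the real tables may differ.

```python
# Sketch of the proposed capital-overlay add-back (schema assumed).
import sqlite3

def reconciled_balance(conn, asset, session_start, session_end):
    # Latest fills-only balance persisted by inventory_manager
    # (the capital overlay was stripped before the write).
    fills_only = conn.execute(
        "SELECT new_balance FROM inventory_ledger "
        "WHERE asset = ? ORDER BY rowid DESC LIMIT 1",
        (asset,),
    ).fetchone()[0]
    # Add the overlay back: deposits count +, withdrawals count -,
    # restricted to the session's time window.
    overlay = conn.execute(
        "SELECT COALESCE(SUM(CASE kind WHEN 'deposit' THEN amount "
        "ELSE -amount END), 0) FROM capital_events "
        "WHERE asset = ? AND ts BETWEEN ? AND ?",
        (asset, session_start, session_end),
    ).fetchone()[0]
    return fills_only + overlay
```

With no capital events in the window, COALESCE keeps the overlay at zero, so the no-injection case regresses to today's behavior, matching the third test on the branch.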

Branch: fix/summarize-paper-run-capital-overlay — 1 commit, 3 tests (session with deposit shows correct end-state, session with withdrawal, session with no capital events regresses to current behavior).

Atlas #12 map: inventory snapshot validated ✅ (once this lands + S40 post-fix run confirms reconciliation invariant).


Item 3 — FLAG-035 WAL checkpoint hardening (IMPLEMENT)

Current state. state_manager._connect() opens WAL with PRAGMA journal_mode=WAL; PRAGMA foreign_keys=ON;. No periodic checkpoint. FLAG-027 (pre-run backup) + FLAG-033 (startup quick_check) are the two mitigations for mid-WAL crashes; neither truncates the WAL during a run.

Proposed design (Atlas-approved).

  1. Timer thread in StateManager.
  2. self._checkpoint_stop = threading.Event()
  3. self._checkpoint_thread = threading.Thread(target=self._checkpoint_loop, daemon=True, name="wal-checkpoint")
  4. Started by a new start_checkpoint_timer(interval_s=60) call from main_loop._startup() after initialize_database().
  5. Stopped by self._checkpoint_stop.set() + join(timeout=5) in StateManager.close().

  6. Loop body.

    def _checkpoint_loop(self) -> None:
        while not self._checkpoint_stop.wait(self._checkpoint_interval_s):
            try:
                t0 = time.monotonic()
                # PASSIVE first so we never block a writer mid-transaction.
                row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
                dt_ms = (time.monotonic() - t0) * 1000
                log.info("wal_checkpoint", extra={
                    "mode": "PASSIVE",
                    "busy": row[0], "log_frames": row[1], "checkpointed": row[2],
                    "elapsed_ms": round(dt_ms, 1),
                })
            except Exception as exc:
                log.error("wal_checkpoint failed", extra={"error": str(exc)}, exc_info=True)
    

  7. PASSIVE = non-blocking. If there's an active transaction, it returns busy=1 and moves on; we retry in 60s.
  8. TRUNCATE at shutdown only (see point 10). Running TRUNCATE periodically can stall for pending readers.

  9. Thread-safety. The connection is opened with check_same_thread=False. SQLite's serialized mode makes PRAGMA calls from a background thread safe alongside writes from the main thread, but I want to pin this with a test rather than trust the docs. Test: concurrent fills inserts from main thread while checkpoint thread runs at 100ms interval — 1000 writes, expect zero lost writes and PRAGMA integrity_check=ok.

  10. Clean-shutdown checkpoint. In StateManager.close(), before self._conn.close():

    try:
        self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    except Exception as exc:
        log.error("shutdown wal_checkpoint(TRUNCATE) failed", ...)
    
    Truncating at shutdown means the WAL is empty when the process exits cleanly — the next FLAG-027 backup is a single-file snapshot, no WAL replay needed.

  11. Config surface. config.engine.wal_checkpoint_interval_seconds: int = 60 with 0 = disabled for tests. Default 60s is conservative (matches Vesper/Atlas ruling).

What this does NOT fix. CTRL_CLOSE_EVENT (Windows terminal X-button) is still uncatchable — FLAG-027 backup remains the backstop. But the corruption window shrinks: on average the WAL holds ≤60s of writes at the moment of an uncatchable kill, vs indefinite today.

Branch: fix/wal-checkpoint-hardening — 1 commit (state_manager.py, main_loop.py wiring, config.py surface), 3 tests (periodic checkpoint runs + logs, concurrent write + checkpoint integrity, shutdown TRUNCATE leaves empty WAL).

Atlas #12 map: no silent failures remain (partial — the checkpoint-failure path now logs ERROR).


Item 4 — FLAG-029 async warning + orphan c7e14e73 (INVESTIGATE / FIX)

FLAG-029a — RuntimeWarning: coroutine 'submit_and_wait' was never awaited

Finding. In the current sandbox, all three submit_and_wait call sites (xrpl_gateway.py:748, :925, :1041) are synchronous — they import from xrpl.transaction, get back a Response, and branch on tx_result. None of them produce an unawaited coroutine today.

Risk. xrpl-py ≥ 3.x has a deprecation path where submit_and_wait may migrate to async-only. If the version bumps silently, the sync call returns a coroutine object instead of a Response, response.result raises AttributeError, and the cancel path returns failure_reason="submit_and_wait raised: ..." — a soft failure that doesn't halt the engine. This is exactly the silent-failure pattern FLAG-029a warned about.

Fix:

  1. Pin the xrpl-py version in requirements.txt (currently unpinned or loose; confirm).
  2. Add a _submit_and_wait_safe(tx, client, wallet) helper that detects coroutine returns and raises immediately with a clear error (rather than letting AttributeError leak).
  3. Add a one-line smoke check at gateway init: call inspect.iscoroutinefunction(submit_and_wait) and log ERROR + refuse to start if True.
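Steps 2 and 3 can be sketched as two small guards. Only the existence of a callable submit_and_wait (imported from xrpl.transaction at the call sites) is assumed; nothing else about xrpl-py's API is relied on, and the helper signature here is illustrative, not the proposed gateway one.

```python
# Sketch of the init smoke check and the safe-call wrapper (names assumed).
import inspect

def assert_sync_submit(submit_and_wait):
    # Gateway-init smoke check: refuse to start against an async-only build.
    if inspect.iscoroutinefunction(submit_and_wait):
        raise RuntimeError(
            "submit_and_wait is a coroutine function; the gateway's "
            "synchronous call sites would fail silently"
        )

def submit_and_wait_safe(submit_and_wait, *args, **kwargs):
    result = submit_and_wait(*args, **kwargs)
    if inspect.iscoroutine(result):
        # A version bump made the sync call return a coroutine: fail
        # loudly here instead of leaking AttributeError downstream.
        result.close()  # suppress the "never awaited" RuntimeWarning
        raise RuntimeError("submit_and_wait returned a coroutine; pin xrpl-py")
    return result
```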

FLAG-029b — Orphan order c7e14e73

Finding. Reconciler re-reads the same stale order every launch. Need to inspect the orders table on Katja's disk:

SELECT id, status, created_at, submit_tx_hash, offer_sequence, failure_reason
FROM orders WHERE id LIKE 'c7e14e73%';

If status = 'SUBMITTED' with no offer_sequence and created >7 days ago, it's a pre-FLAG-027 submit-that-never-validated. Safe to force to CANCELED with failure_reason = 'orphan cleanup 2026-04-18'.
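The backfill migration can be sketched as a single guarded UPDATE, using the columns from the query above. The WHERE clause encodes the same safety conditions as the text (SUBMITTED, no offer_sequence, older than 7 days), so re-running it is a no-op.

```python
# Sketch of the orphan backfill (columns from the SELECT above).
import sqlite3

ORPHAN_CLEANUP_SQL = """
UPDATE orders
SET status = 'CANCELED',
    failure_reason = 'orphan cleanup 2026-04-18'
WHERE id LIKE 'c7e14e73%'
  AND status = 'SUBMITTED'
  AND offer_sequence IS NULL
  AND created_at < datetime('now', '-7 days')
"""

def cleanup_orphan(conn):
    # Returns the number of rows touched; anything other than 0 or 1
    # should abort the migration for manual review.
    cur = conn.execute(ORPHAN_CLEANUP_SQL)
    conn.commit()
    return cur.rowcount
```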

Branch: fix/flag-029-async-pin-and-orphan — 2 commits (async safety, orphan backfill migration), 2 tests (iscoroutinefunction smoke, orphan query happy/empty path).

Atlas #12 map: async issues resolved ✅ (once pinned + smoke lands).


Item 5 — Full config wiring pass

Method. For each key in config_live_stage1.yaml: verify (a) parsed in config.py, (b) used by ≥1 engine module, (c) no hardcoded fallback that can mask a missing YAML value. Result summary (full table in a follow-up commit to this memo; I've spot-checked the high-risk keys):

| Key | YAML | Parsed | Consumed by | Notes |
|---|---|---|---|---|
| risk.max_xrp_exposure | ✅ | config.py:382 | main_loop:801 | Clean. |
| risk.max_rlusd_exposure | ✅ | config.py (search) | main_loop:803 | Clean. |
| strategy.max_inventory_usd | ⚠️ | on disk post-merge = REMOVED | | Sandbox stale; commit ac33de6 retires it. Verify grep -rn 'max_inventory_usd' . returns 0 on current main. |
| strategy.anchor_max_divergence_bps | ✅ (10.0) | config.py:413 | strategy_engine | Verify: Vesper's item calls out that the Phase 7.2 CLOB switch uses a 3 bps constant, not this key. Confirm the switch threshold is sourced from config, not hardcoded. |
| order_size.base_size_rlusd | ✅ (15.0 on disk) | config.py | strategy_engine | Clean. |
| engine.tick_interval_seconds | ✅ (4) | config.py | main_loop | Clean. |
| strategy.requote_interval_seconds | ✅ (4) | config.py | main_loop sleep | Must match tick_interval_seconds — no config-level assertion today; add one. |
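The missing assertion on the last row could look like the sketch below. The EngineTiming shape is a stand-in for the real config object; only the two YAML key names come from the table.

```python
# Sketch of the proposed config-level assertion (config shape assumed).
from dataclasses import dataclass

@dataclass
class EngineTiming:
    tick_interval_seconds: int
    requote_interval_seconds: int

def validate_timing(cfg):
    # Fail at load time, not mid-session, if the two intervals diverge.
    if cfg.requote_interval_seconds != cfg.tick_interval_seconds:
        raise ValueError(
            f"requote_interval_seconds ({cfg.requote_interval_seconds}) "
            f"must equal tick_interval_seconds ({cfg.tick_interval_seconds})"
        )
```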

Action: follow-up commit with full table and any findings. One concrete candidate: the 3 bps CLOB-switch threshold. If it's hardcoded in strategy_engine.py, promote it to strategy.clob_switch_threshold_bps so Phase 7.3 can tune without code changes.

Branch: audit/config-wiring-pass — read-only audit commit (plus the 3 bps surface if found hardcoded).

Atlas #12 map: config mismatch resolved ✅ (pending the one verification above).


Item 6 — Dead code + stale files scan

Finding. Sandbox shows these stale files in neo_engine/:

  - main_loop_Old.py (2571-line old main loop)
  - strategy_engine_old.py

Both were scrubbed of max_inventory_usd in ac33de6 (per commit message: "archive scrub for grep hygiene") but the files themselves were not deleted. The scrub only exists so repo-wide greps stay clean — yet the files remain a foot-gun: any future audit that greps for other patterns will get hits in these archive files and report false positives.

Fix: move both to neo_engine/Archive/ (a top-level Archive/ already exists) so they're on-disk but outside the importable module path. Also audit the repo root for:

  - NEO Back up/ (folder with trailing-space name; contains a stale main_loop.py) — move to Archive/.
  - neo_simulator/simulation_runner.bak.py — delete; superseded by simulation_runner.py.
  - .fuse_hidden* files — delete (stale sandbox artifacts).
  - <MagicMock ...> files in repo root — delete (from a misconfigured test run that leaked stdout into filenames).

Branch: chore/archive-cleanup — 1 commit, no tests, just file moves + .gitignore pattern for .fuse_hidden* and <MagicMock *>.

Atlas #12 map: no silent failures remain (partial — dead-path reduction).


Item 7 — halt_reason classification (duration-elapsed path)

Finding. Covered under Item 1. Currently run_paper_session.py:350 passes no halt_reason; the _shutdown() != gate preserves whatever was last written. Result: the sessions.halt_reason column is polluted with stale risk-halt text even on clean completions, and the dashboard / Experiment Log cannot distinguish "session ended naturally" from "session halted on risk."

Fix: (same branch as Item 1, fix/halt-reason-lifecycle)

  1. Pass halt_reason="duration_elapsed" at run_paper_session.py:350.
  2. Pass halt_reason="interrupted_sigint" / "interrupted_sigterm" / "interrupted_sigbreak" at the signal-handler call site.
  3. Pass halt_reason="engine_requested_halt" at run_paper_session.py:339.
  4. _shutdown() → always call set_engine_state("halt.reason", _hr) — remove the != gate.
  5. _startup() → if this is a fresh session (no recovery parent), clear halt.reason so the new session starts clean.

Halt reason taxonomy (proposed). Single source of truth for dashboard/Experiment Log parsing:

| Reason | Emitted by | Semantics |
|---|---|---|
| duration_elapsed | run_paper_session clean exit | Normal completion |
| engine_requested_halt | engine returned _tick() == False | Strategy/reconciler decided to halt |
| risk_xrp_exposure | main_loop risk gate | XRP value > cap |
| risk_rlusd_exposure | main_loop risk gate | RLUSD > cap |
| risk_rpc_failure | main_loop risk gate | last RPC failed |
| risk_stale_ledger | main_loop risk gate | ledger age > threshold |
| risk_gateway_unhealthy | main_loop risk gate | gateway health check failed |
| kill_switch | kill_switch.py | explicit HALT |
| reconciler_halt | main_loop reconciler | fill_errors or ambiguous orders |
| interrupted_<sig> | signal handler | Ctrl+C / Break / TERM |
| startup_failure | _startup exception path | cannot fetch balances / offers |
| unexpected_halt | fallback | should never appear post-fix |

The existing free-form strings (e.g. "xrp exposure limit (115.20 > 100.0)") move into a separate halt.detail engine_state key so the numeric context survives but the classification is machine-parseable.
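A single-source-of-truth module for the taxonomy plus the reason/detail split could be sketched as below. record_halt's state argument only needs the set_engine_state call quoted earlier in this memo; the function name, the halt.detail key handling, and the fallback behavior are all proposals, not existing code.

```python
# Sketch of a taxonomy constants module + classified halt recording.
HALT_REASONS = frozenset({
    "duration_elapsed", "engine_requested_halt",
    "risk_xrp_exposure", "risk_rlusd_exposure", "risk_rpc_failure",
    "risk_stale_ledger", "risk_gateway_unhealthy",
    "kill_switch", "reconciler_halt",
    "interrupted_sigint", "interrupted_sigterm", "interrupted_sigbreak",
    "startup_failure", "unexpected_halt",
})

def record_halt(state, reason, detail=""):
    if reason not in HALT_REASONS:
        # Legacy free-form strings are preserved as detail and
        # classified under the fallback, never silently dropped.
        detail = detail or reason
        reason = "unexpected_halt"
    state.set_engine_state("halt.reason", reason)  # machine-parseable
    state.set_engine_state("halt.detail", detail)  # free-form numeric context
```

With this split, the risk gate would call record_halt(state, "risk_xrp_exposure", "xrp exposure limit (115.20 > 100.0)") so dashboards parse the reason while the numbers survive in halt.detail.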

Atlas #12 map: halt classification corrected ✅ (full resolution requires this + #12 Item 1).


Atlas #9 — per-tick distance-to-touch diagnostic (NEW)

Scope. On every tick, compute and log:

  - distance_to_clob_bid_bps = (our_bid − clob_best_bid) × 10000 / mid
  - distance_to_clob_ask_bps = (clob_best_ask − our_ask) × 10000 / mid
  - within_2bps_bid, within_2bps_ask — booleans

Session summary additions (at _log_session_summary):

  - % ticks within 2 bps of bid touch / ... ask touch
  - mean / median distance-to-touch, both sides
  - histogram bucketed: [0–2, 2–5, 5–10, 10+] bps

Why. Phase 7.3 is offset calibration. Without this metric we're guessing at our spread competitiveness; with it, a 2-hour session yields a distribution we can use to set bid/ask offsets target-first instead of parameter-first.

Implementation. Add computation in main_loop._tick() after CLOB anchor resolution, write to a new tick_distance_to_touch engine_state telemetry key (rolling window) and to each market_snapshots row via two new columns distance_to_bid_touch_bps, distance_to_ask_touch_bps. Aggregation at shutdown.
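The per-tick computation and the summary bucketing can be sketched as below. The sign convention follows the scope formulas (positive = our quote at or inside the CLOB touch); treating "within 2 bps" as abs(distance) ≤ 2 is an assumption, and all names are placeholders, not the real module's API.

```python
# Sketch of the distance-to-touch computation (names/conventions assumed).
def distance_to_touch_bps(our_bid, our_ask, clob_best_bid, clob_best_ask, mid):
    bid_bps = (our_bid - clob_best_bid) * 10000 / mid
    ask_bps = (clob_best_ask - our_ask) * 10000 / mid
    return {
        "distance_to_bid_touch_bps": bid_bps,
        "distance_to_ask_touch_bps": ask_bps,
        "within_2bps_bid": abs(bid_bps) <= 2.0,
        "within_2bps_ask": abs(ask_bps) <= 2.0,
    }

def bucket_bps(value):
    # Histogram buckets from the summary spec: [0-2, 2-5, 5-10, 10+] bps.
    v = abs(value)
    if v < 2:
        return "0-2"
    if v < 5:
        return "2-5"
    if v < 10:
        return "5-10"
    return "10+"
```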

Branch: feat/distance-to-touch-diagnostic — 1 commit, 2 tests (computation correctness vs known mid/bid/ask, session summary math).

Atlas #12 map: Phase 7.3 calibration input ✅ — prerequisite for informed offset tuning.


Proposed Branch Plan (merge order)

All branches cut from current main (post-session-closure merge). Merge order below maximizes independence and minimizes conflicts:

| # | Branch | Risk | Depends on | Estimated scope |
|---|---|---|---|---|
| 1 | fix/halt-reason-lifecycle | low | | 1 commit + 3 tests |
| 2 | fix/summarize-paper-run-capital-overlay | low | | 1 commit + 3 tests |
| 3 | chore/archive-cleanup | low | | 1 commit (moves only) |
| 4 | fix/flag-029-async-pin-and-orphan | low | | 2 commits + 2 tests |
| 5 | audit/config-wiring-pass | low | 1 (merges cleanly after) | 1 commit + verification table |
| 6 | feat/distance-to-touch-diagnostic | medium | | 1 commit + 2 tests |
| 7 | fix/wal-checkpoint-hardening | medium-high | | 1 commit + 3 tests (concurrency) |

Recommendation: land 1–5 before S40 (any duration ≥ 30m with the Phase 7.2 config). Land 6 before Phase 7.3 data collection starts. Land 7 last, with an isolated paper-mode shakedown before any live run — concurrent checkpointing during writes is the one item here I treat as load-bearing infrastructure.


Atlas #12 Go/No-Go Mapping

| Atlas requirement | Addressed by | Status after merge |
|---|---|---|
| Config mismatch resolved | Item 1 + Item 5 | ✅ |
| Inventory snapshot validated vs XRPL | Item 2 + shutdown invariant | ✅ pending S40 confirmation |
| Shutdown sequence verified | Session-closure fix (already merged) + Item 3 TRUNCATE | ✅ |
| Async issues resolved | Item 4 | ✅ |
| Halt classification corrected | Item 1 + Item 7 | ✅ |
| No silent failures remain | log.error promotions (Items 1, 3, 4) | partial |

Phase 7.3 gate recommendation: proceed once branches 1, 2, 4, 5 are merged and S40 (30m, Phase 7.2 config) shows clean session closure + reconciled inventory summary. Branches 6 and 7 can be sequenced in parallel with the first Phase 7.3 session if Vesper/Atlas prefer.


Questions back to Vesper / Atlas

  1. Branch cadence. OK to land branches 1–5 as a single sequence of PRs, or should each be reviewed individually before the next is cut? My default is individual — easier to revert if any one surfaces an issue in S40.
  2. Phase 7.2 CLOB switch threshold. Is the 3 bps CLOB-switch threshold sourced from anchor_max_divergence_bps or a constant? If constant, should it become strategy.clob_switch_threshold_bps in Item 5 or stay hardcoded until Phase 7.3 calibration is done?
  3. Distance-to-touch storage. Two new columns on market_snapshots vs a separate quote_distance_log table — mild preference for the columns (simpler, fewer joins for dashboard); flag if you'd rather isolate.
  4. Atlas halt-reason taxonomy. Please confirm the proposed strings in Item 7. They'll appear on every dashboard and every Experiment Log row going forward, so the naming matters.

No engine code has changed since the session-closure merge. Awaiting review before any branch in the plan above is cut.

— Orion