Pre-Phase-7.3 Full Engine Audit — Findings & Branch Plan

To: Vesper, Atlas, Katja
From: Orion
Subject: Per-item findings, proposed branch plan, Atlas #12 go/no-go mapping

This memo consolidates the investigation of Vesper's 7 items and Atlas's 12-point addendum. Items 1 and 2 were already delivered in the prior closure memo; both are restated here for completeness. Nothing below is committed beyond the session-closure fix already on fix/session-closure-ended-at. Each finding proposes a separate branch so Vesper/Katja can review and gate each one.

Sandbox caveat. My local repo sits on fix/session-closure-ended-at (branched pre-merge). On Katja's disk, main already contains fix/cleanup-and-anchor-instrumentation (commit ac33de6 and friends), feat/phase-7.2-clob-switch, and fix/session-closure-ended-at. Two findings below are stale in my sandbox but already resolved on disk; I flag them explicitly so no one re-does work.

Item 1 — `max_xrp_exposure` mismatch — RESOLVED (ghost, not bug)

Finding. Config loading is clean. config.py:382 reads max_xrp_exposure from YAML with no hardcoded fallback below 10000. The S39 halt message "xrp exposure limit (115.20 > 100.0)" is a ghost from the engine_state.halt.reason key, not the live config.

Root cause. main_loop._shutdown() lines 648–652:

existing_reason = self._state.get_engine_state("halt.reason") or ""
_hr = halt_reason or existing_reason or "unexpected halt"
if _hr != existing_reason:
    self._state.set_engine_state("halt.reason", _hr)

The halt.reason key persists across sessions and is only rewritten when the new reason differs from the existing one. Duration-elapsed shutdowns pass no halt_reason kwarg, so _hr falls back to existing_reason, the if is false, and the prior session's risk-halt string survives intact. The 100.0 figure comes from a session predating the bump to 150.0. At S39 runtime, the strategy engine and risk gate used 150.0 correctly.

Fix (tied to Item 7):
1. run_paper_session.py:350 → pass explicit halt_reason="duration_elapsed" on clean completion.
2. _shutdown() → always write the chosen reason (remove != gate).
3. _startup() → clear halt.reason after fresh create_session so a new session never inherits the prior session's halt text.

Branch: fix/halt-reason-lifecycle — 1 commit, 3 tests.
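To make the ghost concrete, here is a minimal sketch of the gated write and of the three-step fix. `FakeState`, `shutdown_current`, `startup_fixed`, and `shutdown_fixed` are hypothetical stand-ins written for this memo, not the engine's real classes; only the `halt.reason` key and the gate logic are taken from the excerpt above.

```python
from typing import Dict, Optional


class FakeState:
    """Hypothetical in-memory stand-in for the engine_state key/value store."""

    def __init__(self) -> None:
        self._kv: Dict[str, str] = {}

    def get_engine_state(self, key: str) -> Optional[str]:
        return self._kv.get(key)

    def set_engine_state(self, key: str, value: str) -> None:
        self._kv[key] = value


def shutdown_current(state: FakeState, halt_reason: Optional[str] = None) -> str:
    """Mirrors the gated write quoted above: a missing reason inherits the old one."""
    existing_reason = state.get_engine_state("halt.reason") or ""
    hr = halt_reason or existing_reason or "unexpected halt"
    if hr != existing_reason:
        state.set_engine_state("halt.reason", hr)
    return hr


def startup_fixed(state: FakeState) -> None:
    """Fix step 3: a fresh session starts with no inherited halt text."""
    state.set_engine_state("halt.reason", "")


def shutdown_fixed(state: FakeState, halt_reason: Optional[str] = None) -> str:
    """Fix step 2: same reason selection, but the write is unconditional."""
    hr = halt_reason or state.get_engine_state("halt.reason") or "unexpected_halt"
    state.set_engine_state("halt.reason", hr)
    return hr


# Session N halts on the risk gate; session N+1 ends on duration with no kwarg.
state = FakeState()
shutdown_current(state, "xrp exposure limit (115.20 > 100.0)")
ghost = shutdown_current(state)  # the stale risk string is returned and kept

# Same two sessions with the fix applied.
startup_fixed(state)
clean = shutdown_fixed(state, "duration_elapsed")  # fix step 1: explicit reason
```

In this sketch `ghost` carries the previous session's risk string, while `clean` and the stored key both read `duration_elapsed`.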

Item 2 — Inventory snapshot at shutdown — RESOLVED (report-path bug, not engine bug)

Finding. The S39 terminal display of 112.56 XRP / 176.03 RLUSD vs wallet 66.820 / 97.610 is a summarize_paper_run.py reporting bug introduced by FLAG-030, not an engine inventory error.

Root cause. inventory_manager.py:295-296 strips the capital overlay before persisting:

fills_only_new_xrp = new_xrp - self._xrp_capital_overlay
fills_only_new_rlusd = new_rlusd - self._rlusd_capital_overlay

summarize_paper_run._get_inventory_balance reads inventory_ledger.new_balance directly and does not add back the capital_events overlay. For a session after a capital injection, the summary under-reports by the injection amount and its math drifts from reality. get_snapshot() itself is correct — the engine never queries XRPL for balances mid-run. The shutdown-time open-order reserve theory is ruled out — engine inventory doesn't include reserves.

Fix:
1. summarize_paper_run._get_inventory_balance → sum capital_events.amount filtered to session time window and add to XRP / RLUSD as appropriate.
2. Add terminal-display invariant: |summary_total_value − engine.get_snapshot().total_value_in_rlusd| < 0.01 × total_value at shutdown. Log ERROR + write to engine_state.inventory_drift_at_shutdown if violated.
3. Add session summary line: Inventory reconciliation: fills-only=X, capital_overlay=Y, engine_total=Z.

Branch: fix/summarize-paper-run-capital-overlay — 1 commit, 3 tests.
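The shutdown invariant from fix step 2 could be sketched as below. The function name, the `Reconciliation` container, and the choice to scale the 1% tolerance by the engine total are assumptions of this sketch; the numbers in the demo are invented and are not S39 data.

```python
from dataclasses import dataclass


@dataclass
class Reconciliation:
    summary_total: float  # fills-only ledger plus capital overlay, valued in RLUSD
    engine_total: float   # what the engine snapshot reports, in RLUSD
    drift: float          # absolute difference between the two
    ok: bool              # True when drift is inside the tolerance


def reconcile(fills_only_xrp: float, fills_only_rlusd: float,
              overlay_xrp: float, overlay_rlusd: float,
              xrp_price_rlusd: float, engine_total_rlusd: float,
              tolerance: float = 0.01) -> Reconciliation:
    """Add the capital overlay back and compare against the engine total."""
    xrp = fills_only_xrp + overlay_xrp
    rlusd = fills_only_rlusd + overlay_rlusd
    summary_total = xrp * xrp_price_rlusd + rlusd
    drift = abs(summary_total - engine_total_rlusd)
    # Assumption: the 1% bound is taken relative to the engine's total value.
    ok = drift < tolerance * abs(engine_total_rlusd)
    return Reconciliation(summary_total, engine_total_rlusd, drift, ok)


# Illustrative numbers only: with the overlay added back the totals agree,
# without it the summary is short by the injected capital.
with_overlay = reconcile(50.0, 100.0, 25.0, 50.0, 2.0, 300.0)
without_overlay = reconcile(50.0, 100.0, 0.0, 0.0, 2.0, 300.0)
```

A caller would log ERROR and persist `drift` under engine_state.inventory_drift_at_shutdown whenever `ok` is False.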

Item 3 — FLAG-035 WAL checkpoint hardening (IMPLEMENT)

Proposed design (Atlas-approved):

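Timer thread in StateManager. PASSIVE checkpoint every 60s. PASSIVE is non-blocking: it checkpoints as many frames as it can without waiting on readers or writers, so a transaction in flight can leave the checkpoint partial (checkpointed < log_frames, or busy=1 if the checkpoint lock is held elsewhere), and the next interval picks up the remainder. TRUNCATE at clean shutdown only. Config surface: config.engine.wal_checkpoint_interval_seconds: int = 60 (0 = disabled for tests).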

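# Requires module-level `import time` and a configured `log` logger.
def _checkpoint_loop(self) -> None:
    # Runs on the StateManager timer thread. Assumes self._conn may be used from
    # this thread; Python's sqlite3 rejects cross-thread use unless the
    # connection was opened with check_same_thread=False.
    # Event.wait() returns False on timeout, so the body runs once per interval
    # until _checkpoint_stop is set.
    while not self._checkpoint_stop.wait(self._checkpoint_interval_s):
        try:
            t0 = time.monotonic()
            # The pragma returns one row: (busy, log_frames, checkpointed).
            row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
            dt_ms = (time.monotonic() - t0) * 1000
            log.info("wal_checkpoint", extra={
                "mode": "PASSIVE", "busy": row[0],
                "log_frames": row[1], "checkpointed": row[2],
                "elapsed_ms": round(dt_ms, 1),
            })
        except Exception as exc:
            # A failed checkpoint must not kill the thread; log and retry next interval.
            log.error("wal_checkpoint failed", extra={"error": str(exc)}, exc_info=True)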

Clean-shutdown: PRAGMA wal_checkpoint(TRUNCATE) in StateManager.close() before connection close — leaves WAL empty so FLAG-027 backup is a clean single-file snapshot.
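The "shutdown TRUNCATE leaves empty WAL" behaviour can be checked in isolation with the standard library alone. This is a sketch against a throwaway database, not StateManager itself; `wal_size`, `truncate_checkpoint`, and the one-table schema are hypothetical helpers for the demo.

```python
import os
import sqlite3
import tempfile


def wal_size(db_path: str) -> int:
    """Size in bytes of the -wal sidecar file, or 0 if it does not exist."""
    wal_path = db_path + "-wal"
    return os.path.getsize(wal_path) if os.path.exists(wal_path) else 0


def truncate_checkpoint(conn: sqlite3.Connection) -> tuple:
    """Run a TRUNCATE checkpoint and return (busy, log_frames, checkpointed)."""
    conn.commit()  # make sure this connection holds no open write transaction
    return tuple(conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone())


# Throwaway database in a temp directory, switched to WAL mode.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE engine_state (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO engine_state VALUES ('halt.reason', 'duration_elapsed')")
conn.commit()

wal_before = wal_size(db_path)       # committed frames are sitting in the WAL
busy, _log_frames, _checkpointed = truncate_checkpoint(conn)
wal_after = wal_size(db_path)        # measured before close, while the file still exists
conn.close()
```

Measuring before `conn.close()` matters for a test like this, because closing the last connection normally checkpoints and removes the WAL file anyway; the point of the explicit TRUNCATE is that the backup step sees an empty WAL even if other handles are still open.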

What this does NOT fix: CTRL_CLOSE_EVENT is still uncatchable. The corruption window shrinks to at most about one interval (≤60s) of un-checkpointed writes, assuming each PASSIVE checkpoint completes.

Branch: fix/wal-checkpoint-hardening — 1 commit, 3 tests (periodic checkpoint runs + logs, concurrent write + checkpoint integrity, shutdown TRUNCATE leaves empty WAL).

Item 4 — FLAG-029 async warning + orphan `c7e14e73`

FLAG-029a. All three submit_and_wait call sites are currently synchronous. Risk: the xrpl-py ≥ 3.x deprecation path may migrate this call to async-only. A synchronous call would then return an un-awaited coroutine object, and the cancel path would fail silently.

Fix:
1. Pin xrpl-py version in requirements.txt.
2. Add _submit_and_wait_safe() helper that detects coroutine returns and raises immediately.
3. Gateway init smoke check: inspect.iscoroutinefunction(submit_and_wait) → log ERROR + refuse to start if True.
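Fix steps 2 and 3 might look roughly like the sketch below. It deliberately does not import xrpl-py: the wrapped callable is passed in, and `fake_sync_submit` / `fake_async_submit` are invented stand-ins, so nothing here asserts how the real library behaves. `AsyncMisuseError` and the helper names are likewise hypothetical.

```python
import inspect
from typing import Any, Callable


class AsyncMisuseError(RuntimeError):
    """Raised when a call expected to be synchronous turns out to be async."""


def assert_sync_callable(fn: Callable[..., Any]) -> None:
    """Startup smoke check: refuse to run if fn is declared with `async def`."""
    if inspect.iscoroutinefunction(fn):
        name = getattr(fn, "__name__", repr(fn))
        raise AsyncMisuseError(f"{name} is async; expected a sync function")


def submit_and_wait_safe(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn and fail loudly if the result is an un-awaited coroutine."""
    result = fn(*args, **kwargs)
    if inspect.iscoroutine(result):
        result.close()  # suppress the "coroutine was never awaited" warning
        name = getattr(fn, "__name__", repr(fn))
        raise AsyncMisuseError(f"{name} returned a coroutine instead of a result")
    return result


# Stand-ins for the library call; neither is the real xrpl-py function.
def fake_sync_submit(tx: str) -> dict:
    return {"tx": tx, "validated": True}


async def fake_async_submit(tx: str) -> dict:
    return {"tx": tx, "validated": True}
```

The two checks are complementary: `assert_sync_callable` catches a function declared `async def` at gateway init, while the call-time check also catches a plain function that returns a coroutine from an inner call.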

FLAG-029b. Stale order c7e14e73 is reconciled on every launch. Query to run on Katja's disk:

SELECT id, status, created_at, submit_tx_hash, offer_sequence, failure_reason
FROM orders WHERE id LIKE 'c7e14e73%';

If the row has status = 'SUBMITTED', no offer_sequence, and was created more than 7 days ago, force it to CANCELED with failure_reason = 'orphan cleanup 2026-04-18'.
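A guarded version of that cleanup, written as a sketch against an in-memory table, is shown below. It assumes created_at is stored as ISO-8601 UTC text (so string comparison orders correctly) and that "no offer_sequence" means NULL; neither is confirmed against the real schema. The two demo rows and their id suffixes are invented.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cancel_orphans(conn: sqlite3.Connection, id_prefix: str, now: datetime,
                   reason: str, max_age_days: int = 7) -> int:
    """Force stale SUBMITTED orders with no offer_sequence to CANCELED.

    Returns the number of rows changed.
    """
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        """
        UPDATE orders
           SET status = 'CANCELED', failure_reason = ?
         WHERE id LIKE ? || '%'
           AND status = 'SUBMITTED'
           AND offer_sequence IS NULL
           AND created_at < ?
        """,
        (reason, id_prefix, cutoff),
    )
    conn.commit()
    return cur.rowcount


# Demo table with the columns named in the query above; rows are illustrative.
now = datetime(2026, 4, 18, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, created_at TEXT, "
    "submit_tx_hash TEXT, offer_sequence INTEGER, failure_reason TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("c7e14e73-demo-old", "SUBMITTED",
         (now - timedelta(days=30)).isoformat(), None, None, None),
        ("c7e14e73-demo-new", "SUBMITTED",
         (now - timedelta(days=1)).isoformat(), None, None, None),
    ],
)
changed = cancel_orphans(conn, "c7e14e73", now, "orphan cleanup 2026-04-18")
```

Putting all three conditions in the WHERE clause keeps the update a no-op if the row turns out to be live, which is safer than updating by id alone.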

Branch: fix/flag-029-async-pin-and-orphan — 2 commits, 2 tests.

Item 5 — Full config wiring pass

| Key | YAML | Parsed | Consumed by | Notes |
| --- | --- | --- | --- | --- |
| `risk.max_xrp_exposure` | ✅ | config.py:382 | main_loop:801 | Clean. |
| `risk.max_rlusd_exposure` | ✅ | config.py | main_loop:803 | Clean. |
| `strategy.max_inventory_usd` | ⚠️ REMOVED | - | - | Verify `grep -rn 'max_inventory_usd' .` = 0 on current main. |
| `strategy.anchor_max_divergence_bps` | ✅ (10.0) | config.py:413 | strategy_engine | Phase 7.2 CLOB switch uses 3 bps; confirm it is sourced from config, not hardcoded. |
| `order_size.base_size_rlusd` | ✅ (15.0) | config.py | strategy_engine | Clean. |
| `engine.tick_interval_seconds` | ✅ (4) | config.py | main_loop | Clean. |
| `strategy.requote_interval_seconds` | ✅ (4) | config.py | main_loop sleep | Must match `tick_interval_seconds`; no assertion today, add one. |

One concrete candidate: the 3 bps CLOB-switch threshold. If hardcoded in strategy_engine.py, promote to strategy.clob_switch_threshold_bps (Atlas ruling: required for Phase 7.3 tuning).

Branch: audit/config-wiring-pass — read-only audit + 3 bps surface if found hardcoded.
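The missing interval assertion and the promoted threshold could be wired as in the sketch below. It works on an already-parsed dict to avoid assuming a YAML loader; `StrategySettings` and `load_strategy_settings` are hypothetical names, and the 3.0 default simply restates the 3 bps value mentioned above rather than a confirmed config default.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class StrategySettings:
    tick_interval_seconds: int
    requote_interval_seconds: int
    clob_switch_threshold_bps: float


def load_strategy_settings(raw: Dict[str, Any]) -> StrategySettings:
    """Build settings from parsed config and enforce the interval invariant."""
    engine = raw.get("engine", {})
    strategy = raw.get("strategy", {})
    settings = StrategySettings(
        tick_interval_seconds=int(engine["tick_interval_seconds"]),
        requote_interval_seconds=int(strategy["requote_interval_seconds"]),
        # Assumption: fall back to 3 bps when the key is absent.
        clob_switch_threshold_bps=float(strategy.get("clob_switch_threshold_bps", 3.0)),
    )
    if settings.requote_interval_seconds != settings.tick_interval_seconds:
        raise ValueError(
            "strategy.requote_interval_seconds "
            f"({settings.requote_interval_seconds}) must match "
            f"engine.tick_interval_seconds ({settings.tick_interval_seconds})"
        )
    return settings


good = load_strategy_settings({
    "engine": {"tick_interval_seconds": 4},
    "strategy": {"requote_interval_seconds": 4},
})
```

A failure here is a natural source for the `config_mismatch` halt reason in the Item 7 taxonomy.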

Item 6 — Dead code + stale files

Stale files in neo_engine/: main_loop_Old.py, strategy_engine_old.py. Also: NEO Back up/ (trailing space folder), neo_simulator/simulation_runner.bak.py, .fuse_hidden* files, <MagicMock ...> files in repo root.

Fix: move to Archive/, add .gitignore pattern for .fuse_hidden* and <MagicMock *>. Archive/ excluded from grep-based audits (document this explicitly).

Branch: chore/archive-cleanup — 1 commit, no tests.

Item 7 — Halt reason classification

Same branch as Item 1 (fix/halt-reason-lifecycle). Halt reason taxonomy:

| Reason | Emitted by | Semantics |
| --- | --- | --- |
| `duration_elapsed` | run_paper_session clean exit | Normal completion |
| `engine_requested_halt` | engine returned `_tick() == False` | Strategy/reconciler halt |
| `risk_xrp_exposure` | main_loop risk gate | XRP value > cap |
| `risk_rlusd_exposure` | main_loop risk gate | RLUSD > cap |
| `risk_rpc_failure` | main_loop risk gate | last RPC failed |
| `risk_stale_ledger` | main_loop risk gate | ledger age > threshold |
| `risk_gateway_unhealthy` | main_loop risk gate | gateway health check failed |
| `kill_switch` | kill_switch.py | explicit HALT |
| `reconciler_halt` | main_loop reconciler | fill errors or ambiguous orders |
| `interrupted_<sig>` | signal handler | Ctrl+C / Break / TERM |
| `startup_failure` | _startup exception path | cannot fetch balances / offers |
| `config_mismatch` | startup invariant check | runtime config ≠ expected (Atlas addition) |
| `unexpected_halt` | fallback | should never appear post-fix |

Free-form strings (e.g. "xrp exposure limit (115.20 > 100.0)") move to a separate halt.detail engine_state key so numeric context survives but classification is machine-parseable.
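One way to keep the classification machine-parseable is an enum for the fixed reasons plus a small normaliser that routes anything else into the detail field. This is a sketch: `HaltReason` and `classify_halt` are hypothetical names, and the exact suffix format of `interrupted_<sig>` is an assumption (the sketch accepts any non-empty suffix).

```python
from enum import Enum
from typing import Optional, Tuple


class HaltReason(str, Enum):
    DURATION_ELAPSED = "duration_elapsed"
    ENGINE_REQUESTED_HALT = "engine_requested_halt"
    RISK_XRP_EXPOSURE = "risk_xrp_exposure"
    RISK_RLUSD_EXPOSURE = "risk_rlusd_exposure"
    RISK_RPC_FAILURE = "risk_rpc_failure"
    RISK_STALE_LEDGER = "risk_stale_ledger"
    RISK_GATEWAY_UNHEALTHY = "risk_gateway_unhealthy"
    KILL_SWITCH = "kill_switch"
    RECONCILER_HALT = "reconciler_halt"
    STARTUP_FAILURE = "startup_failure"
    CONFIG_MISMATCH = "config_mismatch"
    UNEXPECTED_HALT = "unexpected_halt"


INTERRUPTED_PREFIX = "interrupted_"


def classify_halt(reason: Optional[str], detail: str = "") -> Tuple[str, str]:
    """Return the (halt.reason, halt.detail) pair to persist."""
    # interrupted_<sig> is open-ended, so it is matched by prefix, not by enum.
    if reason and reason.startswith(INTERRUPTED_PREFIX) and len(reason) > len(INTERRUPTED_PREFIX):
        return reason, detail
    try:
        return HaltReason(reason).value, detail
    except ValueError:
        # Unknown or free-form text: classify as the fallback and keep the
        # original string as detail so the numeric context is not lost.
        return HaltReason.UNEXPECTED_HALT.value, detail or (reason or "")
```

For example, `classify_halt("risk_xrp_exposure", "xrp exposure limit (115.20 > 100.0)")` keeps the code and the numeric context apart, while a legacy free-form string passed as the reason ends up under `unexpected_halt` with the text preserved as detail.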

Atlas #9 — Distance-to-touch diagnostic (NEW, PRIMARY METRIC for Phase 7.3)

Per-tick computation and logging:
- distance_to_clob_bid_bps = (our_bid − clob_best_bid) × 10000 / mid
- distance_to_clob_ask_bps = (clob_best_ask − our_ask) × 10000 / mid
- within_2bps_bid, within_2bps_ask: booleans

Session summary: % ticks within 2 bps of touch, mean / median both sides, histogram bucketed [0–2, 2–5, 5–10, 10+] bps.

Storage: two new columns on market_snapshots (distance_to_bid_touch_bps, distance_to_ask_touch_bps).

Branch: feat/distance-to-touch-diagnostic — 1 commit, 2 tests.
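A minimal sketch of the per-tick metric and the session roll-up follows. The two distance formulas are the ones stated above; everything else is an assumption of the sketch: mid is taken as the CLOB midpoint, "within 2 bps" and the histogram use the absolute distance, and bucket upper bounds are inclusive. Function names are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List


def distance_to_touch(our_bid: float, our_ask: float,
                      clob_best_bid: float, clob_best_ask: float) -> Dict[str, object]:
    """Per-tick distances in bps; negative means our quote sits behind the touch."""
    mid = (clob_best_bid + clob_best_ask) / 2.0  # assumption: CLOB midpoint
    bid_bps = (our_bid - clob_best_bid) * 10000.0 / mid
    ask_bps = (clob_best_ask - our_ask) * 10000.0 / mid
    return {
        "distance_to_clob_bid_bps": bid_bps,
        "distance_to_clob_ask_bps": ask_bps,
        "within_2bps_bid": abs(bid_bps) <= 2.0,
        "within_2bps_ask": abs(ask_bps) <= 2.0,
    }


def bucket(distance_bps: float) -> str:
    """Histogram bucket label for one observation, using absolute distance."""
    d = abs(distance_bps)
    if d <= 2.0:
        return "0-2"
    if d <= 5.0:
        return "2-5"
    if d <= 10.0:
        return "5-10"
    return "10+"


def summarize(distances_bps: List[float]) -> Dict[str, object]:
    """Session roll-up for one side of the book (expects a non-empty list)."""
    abs_d = [abs(d) for d in distances_bps]
    hist = {"0-2": 0, "2-5": 0, "5-10": 0, "10+": 0}
    for d in distances_bps:
        hist[bucket(d)] += 1
    return {
        "pct_within_2bps": 100.0 * hist["0-2"] / len(abs_d),
        "mean_bps": mean(abs_d),
        "median_bps": median(abs_d),
        "histogram": hist,
    }


# Illustrative tick: book at 1.999 / 2.001, our quotes at 1.9988 / 2.0022.
tick = distance_to_touch(1.9988, 2.0022, 1.999, 2.001)
session = summarize([-1.0, 3.0, -6.0, 12.0])
```

In the illustrative tick the bid sits about 1 bp behind the touch and the ask about 6 bps behind, so only the bid side counts as within 2 bps.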

Proposed Branch Plan (merge order)

| # | Branch | Risk | Estimated scope |
| --- | --- | --- | --- |
| 1 | `fix/halt-reason-lifecycle` | low | 1 commit + 3 tests |
| 2 | `fix/summarize-paper-run-capital-overlay` | low | 1 commit + 3 tests |
| 3 | `chore/archive-cleanup` | low | 1 commit (moves only) |
| 4 | `fix/flag-029-async-pin-and-orphan` | low | 2 commits + 2 tests |
| 5 | `audit/config-wiring-pass` | low | 1 commit + verification table |
| 6 | `feat/distance-to-touch-diagnostic` | medium | 1 commit + 2 tests |
| 7 | `fix/wal-checkpoint-hardening` | medium-high | 1 commit + 3 tests (concurrency) |

Land 1–5 before S40. Land 6 before Phase 7.3 data collection. Land 7 with isolated paper-mode shakedown before any live run.

Atlas #12 Go/No-Go Mapping

| Atlas requirement | Addressed by | Status after merge |
| --- | --- | --- |
| Config mismatch resolved | Items 1 + 5 | ✅ |
| Inventory snapshot validated vs XRPL | Item 2 + shutdown invariant | ✅ pending S40 |
| Shutdown sequence verified | Session-closure fix (merged) + Item 3 TRUNCATE | ✅ |
| Async issues resolved | Item 4 | ✅ |
| Halt classification corrected | Items 1 + 7 | ✅ |
| No silent failures remain | log.error promotions (Items 1, 3, 4) | ✅ |

Phase 7.3 gate recommendation: proceed once branches 1, 2, 4, and 5 are merged and S40 (≥30 min) shows clean session closure and a reconciled inventory summary. Branches 6 and 7 can run in parallel with the first Phase 7.3 session if Vesper/Atlas prefer.

Questions back to Vesper / Atlas

1. Branch cadence: individual PRs before next is cut, or batch 1–5 as a sequence?
2. CLOB switch threshold: sourced from anchor_max_divergence_bps or hardcoded constant?
3. Distance-to-touch storage: columns on market_snapshots vs separate table?
4. Halt taxonomy strings: confirm for dashboard + Experiment Log use.

No engine code changed since session-closure merge. Awaiting review before any branch is cut.

— Orion