
Pre-Phase-7.3 Full Engine Audit — Findings & Branch Plan

To: Vesper, Atlas, Katja
From: Orion
Subject: Per-item findings, proposed branch plan, Atlas #12 go/no-go mapping

This memo consolidates investigation of Vesper's 7 items + Atlas's 12-point addendum. Items 1 and 2 were already delivered in the prior closure memo; both are restated here for completeness. Nothing below is committed beyond the session-closure fix already on fix/session-closure-ended-at. Each finding proposes a separate branch so Vesper/Katja can review and gate each one.

Sandbox caveat. My local repo sits on fix/session-closure-ended-at (branched pre-merge). On Katja's disk, main already contains fix/cleanup-and-anchor-instrumentation (commit ac33de6 and friends), feat/phase-7.2-clob-switch, and fix/session-closure-ended-at. Two findings below are stale in my sandbox but already resolved on disk; I flag them explicitly so no one re-does work.


Item 1 — max_xrp_exposure mismatch — RESOLVED (ghost, not bug)

Finding. Config loading is clean. config.py:382 reads max_xrp_exposure from YAML with no hardcoded fallback below 10000. The S39 halt message "xrp exposure limit (115.20 > 100.0)" is a ghost from the engine_state.halt.reason key, not the live config.

Root cause. main_loop._shutdown() lines 648–652:

existing_reason = self._state.get_engine_state("halt.reason") or ""
_hr = halt_reason or existing_reason or "unexpected halt"
if _hr != existing_reason:
    self._state.set_engine_state("halt.reason", _hr)

The halt.reason key persists across sessions and is only rewritten when the new reason differs from the existing. Duration-elapsed shutdowns pass no halt_reason kwarg — so _hr becomes existing_reason, the if is false, and the prior session's risk-halt string survives intact. The 100.0 figure comes from a session predating the bump to 150.0. At S39 runtime, the strategy engine and risk gate used 150.0 correctly.

Fix (tied to Item 7):

  1. run_paper_session.py:350 → pass explicit halt_reason="duration_elapsed" on clean completion.
  2. _shutdown() → always write the chosen reason (remove the != gate).
  3. _startup() → clear halt.reason after fresh create_session so a new session never inherits the prior session's halt text.
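The ghost and the proposed lifecycle can be reproduced in a few lines. This is a minimal sketch, not the engine code: FakeState, shutdown_old, shutdown_fixed and startup_fixed are hypothetical stand-ins, and only the get_engine_state / set_engine_state calls and the halt.reason key come from the finding above.

```python
class FakeState:
    """In-memory stand-in for the engine_state key/value store (hypothetical)."""

    def __init__(self):
        self._kv = {}

    def get_engine_state(self, key):
        return self._kv.get(key)

    def set_engine_state(self, key, value):
        self._kv[key] = value


def shutdown_old(state, halt_reason=None):
    """Current behaviour: with no kwarg, the stored reason is reused unchanged."""
    existing_reason = state.get_engine_state("halt.reason") or ""
    hr = halt_reason or existing_reason or "unexpected halt"
    if hr != existing_reason:
        state.set_engine_state("halt.reason", hr)
    return hr


def startup_fixed(state):
    """Proposed step 3: a fresh session starts with an empty halt.reason."""
    state.set_engine_state("halt.reason", "")


def shutdown_fixed(state, halt_reason=None):
    """Proposed step 2: same choice of reason, but always written."""
    existing_reason = state.get_engine_state("halt.reason") or ""
    hr = halt_reason or existing_reason or "unexpected halt"
    state.set_engine_state("halt.reason", hr)
    return hr


state = FakeState()
state.set_engine_state("halt.reason", "xrp exposure limit (115.20 > 100.0)")

# Duration-elapsed exit with no kwarg: the old risk-halt string survives.
ghost = shutdown_old(state)

# Next session with all three steps applied.
startup_fixed(state)
clean = shutdown_fixed(state, halt_reason="duration_elapsed")  # step 1
```

In this sketch ghost still carries the stale exposure message, while clean and the stored key both read duration_elapsed.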

Branch: fix/halt-reason-lifecycle — 1 commit, 3 tests.


Item 2 — Inventory snapshot at shutdown — RESOLVED (report-path bug, not engine bug)

Finding. The S39 terminal display of 112.56 XRP / 176.03 RLUSD vs wallet 66.820 / 97.610 is a summarize_paper_run.py reporting bug introduced by FLAG-030, not an engine inventory error.

Root cause. inventory_manager.py:295-296 strips the capital overlay before persisting:

fills_only_new_xrp = new_xrp - self._xrp_capital_overlay
fills_only_new_rlusd = new_rlusd - self._rlusd_capital_overlay

summarize_paper_run._get_inventory_balance reads inventory_ledger.new_balance directly and does not add back the capital_events overlay. For a session after a capital injection, the summary under-reports by the injection amount and its math drifts from reality. get_snapshot() itself is correct — the engine never queries XRPL for balances mid-run. The shutdown-time open-order reserve theory is ruled out — engine inventory doesn't include reserves.

Fix:

  1. summarize_paper_run._get_inventory_balance → sum capital_events.amount filtered to session time window and add to XRP / RLUSD as appropriate.
  2. Add terminal-display invariant: |summary_total_value − engine.get_snapshot().total_value_in_rlusd| < 0.01 × total_value at shutdown. Log ERROR + write to engine_state.inventory_drift_at_shutdown if violated.
  3. Add session summary line: Inventory reconciliation: fills-only=X, capital_overlay=Y, engine_total=Z.
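A rough sketch of fix steps 1 and 2, under stated assumptions: CapitalEvent, reconciled_balance and drift_violated are hypothetical names, capital event amounts are assumed to be signed, and the 1% tolerance is taken relative to the engine total (the memo does not say which total_value the bound refers to).

```python
from dataclasses import dataclass


@dataclass
class CapitalEvent:
    asset: str     # "XRP" or "RLUSD"
    amount: float  # assumption: deposits positive, withdrawals negative
    ts: float      # event time, epoch seconds


def reconciled_balance(fills_only, events, asset, start_ts, end_ts):
    """Add back the capital overlay for one asset within the session window."""
    overlay = sum(
        e.amount for e in events
        if e.asset == asset and start_ts <= e.ts <= end_ts
    )
    return fills_only + overlay


def drift_violated(summary_total, engine_total, tolerance=0.01):
    """True when the summary and engine totals differ by 1% or more."""
    return abs(summary_total - engine_total) >= tolerance * engine_total


events = [
    CapitalEvent("XRP", 50.0, ts=100.0),
    CapitalEvent("RLUSD", 20.0, ts=150.0),
    CapitalEvent("XRP", 5.0, ts=999.0),  # outside the session window
]
xrp_balance = reconciled_balance(10.0, events, "XRP", start_ts=0.0, end_ts=200.0)
small_drift = drift_violated(100.0, 100.5)
large_drift = drift_violated(100.0, 120.0)
```

A real implementation would log at ERROR and write engine_state.inventory_drift_at_shutdown when drift_violated returns True; that wiring is left out here.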

Branch: fix/summarize-paper-run-capital-overlay — 1 commit, 3 tests.


Item 3 — FLAG-035 WAL checkpoint hardening (IMPLEMENT)

Proposed design (Atlas-approved):

Timer thread in StateManager. PASSIVE checkpoint every 60s (non-blocking — returns busy=1 if transaction in flight, retries next interval). TRUNCATE at clean shutdown only. Config surface: config.engine.wal_checkpoint_interval_seconds: int = 60 (0 = disabled for tests).

def _checkpoint_loop(self) -> None:
    while not self._checkpoint_stop.wait(self._checkpoint_interval_s):
        try:
            t0 = time.monotonic()
            row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
            dt_ms = (time.monotonic() - t0) * 1000
            log.info("wal_checkpoint", extra={
                "mode": "PASSIVE", "busy": row[0],
                "log_frames": row[1], "checkpointed": row[2],
                "elapsed_ms": round(dt_ms, 1),
            })
        except Exception as exc:
            log.error("wal_checkpoint failed", extra={"error": str(exc)}, exc_info=True)

Clean-shutdown: PRAGMA wal_checkpoint(TRUNCATE) in StateManager.close() before connection close — leaves WAL empty so FLAG-027 backup is a clean single-file snapshot.

What this does NOT fix: CTRL_CLOSE_EVENT is still uncatchable. The corruption window shrinks to at most one checkpoint interval (≤60s of writes), provided each PASSIVE checkpoint completes.
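The clean-shutdown step can be checked in isolation with the standard sqlite3 module. The sketch below is illustrative only: wal_checkpoint is a hypothetical helper, and the state.db file name and the engine_state table layout are assumptions made for the demo, not the engine's schema.

```python
import os
import sqlite3
import tempfile


def wal_checkpoint(conn, mode):
    """Run a WAL checkpoint; returns SQLite's (busy, log_frames, checkpointed) row."""
    if mode not in {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}:
        raise ValueError(f"unsupported checkpoint mode: {mode}")
    return conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()


tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "state.db")  # hypothetical file name
wal_path = db_path + "-wal"

conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE engine_state (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO engine_state VALUES ('halt.reason', 'duration_elapsed')")
conn.commit()

# Committed writes sit in the WAL until a checkpoint moves them into the main file.
wal_size_before = os.path.getsize(wal_path)

# TRUNCATE checkpoints everything and resets the WAL file to zero bytes.
busy, _log_frames, _checkpointed = wal_checkpoint(conn, "TRUNCATE")
wal_size_after = os.path.getsize(wal_path)

conn.close()
```

With a single connection and no open transaction, the checkpoint is not blocked and the WAL file is empty before the connection closes, which is the property the FLAG-027 backup relies on.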

Branch: fix/wal-checkpoint-hardening — 1 commit, 3 tests (periodic checkpoint runs + logs, concurrent write + checkpoint integrity, shutdown TRUNCATE leaves empty WAL).


Item 4 — FLAG-029 async warning + orphan c7e14e73

FLAG-029a. All three submit_and_wait call sites are currently synchronous. Risk: the xrpl-py ≥ 3.x deprecation path may migrate to async-only. A sync call would then return a coroutine object instead of a result, and the cancel path would fail silently.

Fix:

  1. Pin xrpl-py version in requirements.txt.
  2. Add _submit_and_wait_safe() helper that detects coroutine returns and raises immediately.
  3. Gateway init smoke check: inspect.iscoroutinefunction(submit_and_wait) → log ERROR + refuse to start if True.
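Steps 2 and 3 might look roughly like the following. This is a sketch under assumptions: submit_and_wait_safe and assert_sync are hypothetical helpers, and fake_sync_submit / fake_async_submit stand in for the real library call so the example runs without xrpl-py installed.

```python
import inspect


def submit_and_wait_safe(submit_fn, *args, **kwargs):
    """Call a submit function and fail loudly if it returns a coroutine."""
    result = submit_fn(*args, **kwargs)
    if inspect.iscoroutine(result):
        result.close()  # avoid a "coroutine was never awaited" warning
        name = getattr(submit_fn, "__name__", repr(submit_fn))
        raise RuntimeError(f"{name} returned a coroutine; sync call path is broken")
    return result


def assert_sync(submit_fn):
    """Startup smoke check: refuse to run against an async function."""
    if inspect.iscoroutinefunction(submit_fn):
        raise RuntimeError(f"{submit_fn.__name__} is async; refusing to start")


# Stand-ins for the real library call (hypothetical, for illustration only).
def fake_sync_submit(tx):
    return {"tx": tx, "validated": True}


async def fake_async_submit(tx):
    return {"tx": tx, "validated": True}


ok = submit_and_wait_safe(fake_sync_submit, "offer_cancel")

try:
    submit_and_wait_safe(fake_async_submit, "offer_cancel")
    rejected = False
except RuntimeError:
    rejected = True
```

The two checks are complementary: assert_sync catches a function declared with async def at startup, while submit_and_wait_safe also catches a plain function that happens to return a coroutine at call time.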

FLAG-029b. Stale order c7e14e73 reconciled every launch. Query on Katja's disk:

SELECT id, status, created_at, submit_tx_hash, offer_sequence, failure_reason
FROM orders WHERE id LIKE 'c7e14e73%';

If status = 'SUBMITTED', no offer_sequence, created >7 days ago → force to CANCELED with failure_reason = 'orphan cleanup 2026-04-18'.
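The conditional cleanup can be expressed as a single guarded UPDATE so it is a no-op when the order does not match the orphan criteria. The sketch below runs against an in-memory SQLite table; cancel_orphan is a hypothetical helper, the second demo row is invented to exercise the guard, and it assumes created_at is stored as ISO-8601 UTC text and that "no offer_sequence" means NULL.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cancel_orphan(conn, id_prefix, now, reason, max_age_days=7):
    """Force stale SUBMITTED orders with no offer_sequence to CANCELED."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        """
        UPDATE orders
           SET status = 'CANCELED', failure_reason = ?
         WHERE id LIKE ?
           AND status = 'SUBMITTED'
           AND offer_sequence IS NULL
           AND created_at < ?
        """,
        (reason, id_prefix + "%", cutoff),
    )
    conn.commit()
    return cur.rowcount  # number of rows actually changed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, created_at TEXT, "
    "submit_tx_hash TEXT, offer_sequence INTEGER, failure_reason TEXT)"
)
rows = [
    # Matches every orphan criterion.
    ("c7e14e73-a", "SUBMITTED", "2026-04-01T00:00:00+00:00", None, None, None),
    # Invented row: has an offer_sequence, so it must be left alone.
    ("c7e14e73-b", "SUBMITTED", "2026-04-01T00:00:00+00:00", "ABC", 42, None),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)", rows)

now = datetime(2026, 4, 18, tzinfo=timezone.utc)
changed = cancel_orphan(conn, "c7e14e73", now, "orphan cleanup 2026-04-18")
statuses = dict(conn.execute("SELECT id, status FROM orders"))
```

Returning the row count lets the caller log whether the cleanup did anything, which matters if the query on Katja's disk shows a different status than expected.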

Branch: fix/flag-029-async-pin-and-orphan — 2 commits, 2 tests.


Item 5 — Full config wiring pass

| Key | YAML | Parsed | Consumed by | Notes |
|---|---|---|---|---|
| risk.max_xrp_exposure | | config.py:382 | main_loop:801 | Clean. |
| risk.max_rlusd_exposure | | config.py | main_loop:803 | Clean. |
| strategy.max_inventory_usd | ⚠️ REMOVED | | | Verify grep -rn 'max_inventory_usd' . = 0 on current main. |
| strategy.anchor_max_divergence_bps | ✅ (10.0) | config.py:413 | strategy_engine | Phase 7.2 CLOB switch uses 3 bps; confirm sourced from config, not hardcoded. |
| order_size.base_size_rlusd | ✅ (15.0) | config.py | strategy_engine | Clean. |
| engine.tick_interval_seconds | ✅ (4) | config.py | main_loop | Clean. |
| strategy.requote_interval_seconds | ✅ (4) | config.py | main_loop sleep | Must match tick_interval_seconds; no assertion today, add one. |

One concrete candidate: the 3 bps CLOB-switch threshold. If hardcoded in strategy_engine.py, promote to strategy.clob_switch_threshold_bps (Atlas ruling: required for Phase 7.3 tuning).
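Both wiring gaps (the missing tick/requote assertion and a config-sourced CLOB threshold) could be closed at load time. The following is a minimal sketch, assuming the parsed YAML arrives as nested dicts; WiringConfig and load_wiring are hypothetical names, and the values are the ones quoted in this memo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WiringConfig:
    tick_interval_seconds: int
    requote_interval_seconds: int
    clob_switch_threshold_bps: float


def load_wiring(raw):
    """Build the config from parsed YAML; a missing key raises KeyError."""
    cfg = WiringConfig(
        tick_interval_seconds=raw["engine"]["tick_interval_seconds"],
        requote_interval_seconds=raw["strategy"]["requote_interval_seconds"],
        # No hardcoded fallback: the threshold must come from config.
        clob_switch_threshold_bps=raw["strategy"]["clob_switch_threshold_bps"],
    )
    if cfg.requote_interval_seconds != cfg.tick_interval_seconds:
        raise ValueError(
            "strategy.requote_interval_seconds "
            f"({cfg.requote_interval_seconds}) must match "
            f"engine.tick_interval_seconds ({cfg.tick_interval_seconds})"
        )
    return cfg


raw_ok = {
    "engine": {"tick_interval_seconds": 4},
    "strategy": {"requote_interval_seconds": 4, "clob_switch_threshold_bps": 3.0},
}
cfg = load_wiring(raw_ok)

raw_bad = {
    "engine": {"tick_interval_seconds": 4},
    "strategy": {"requote_interval_seconds": 5, "clob_switch_threshold_bps": 3.0},
}
try:
    load_wiring(raw_bad)
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

Failing at load time keeps a mismatched pair from ever reaching the main loop, and it pairs naturally with the config_mismatch halt reason proposed under Item 7.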

Branch: audit/config-wiring-pass — read-only audit + 3 bps surface if found hardcoded.


Item 6 — Dead code + stale files

Stale files in neo_engine/: main_loop_Old.py, strategy_engine_old.py. Also: NEO Back up/ (trailing space folder), neo_simulator/simulation_runner.bak.py, .fuse_hidden* files, <MagicMock ...> files in repo root.

Fix: move to Archive/, add .gitignore pattern for .fuse_hidden* and <MagicMock *>. Archive/ excluded from grep-based audits (document this explicitly).

Branch: chore/archive-cleanup — 1 commit, no tests.


Item 7 — Halt reason classification

Same branch as Item 1 (fix/halt-reason-lifecycle). Halt reason taxonomy:

| Reason | Emitted by | Semantics |
|---|---|---|
| duration_elapsed | run_paper_session clean exit | Normal completion |
| engine_requested_halt | engine returned _tick() == False | Strategy/reconciler halt |
| risk_xrp_exposure | main_loop risk gate | XRP value > cap |
| risk_rlusd_exposure | main_loop risk gate | RLUSD > cap |
| risk_rpc_failure | main_loop risk gate | last RPC failed |
| risk_stale_ledger | main_loop risk gate | ledger age > threshold |
| risk_gateway_unhealthy | main_loop risk gate | gateway health check failed |
| kill_switch | kill_switch.py | explicit HALT |
| reconciler_halt | main_loop reconciler | fill errors or ambiguous orders |
| interrupted_&lt;sig&gt; | signal handler | Ctrl+C / Break / TERM |
| startup_failure | _startup exception path | cannot fetch balances / offers |
| config_mismatch | startup invariant check | runtime config ≠ expected (Atlas addition) |
| unexpected_halt | fallback | should never appear post-fix |

Free-form strings (e.g. "xrp exposure limit (115.20 > 100.0)") move to a separate halt.detail engine_state key so numeric context survives but classification is machine-parseable.
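One way to enforce the reason/detail split is an enum plus a small validator. The sketch below is an assumption about shape only: HaltReason, is_valid_reason and record_halt are hypothetical names, a plain dict stands in for engine_state, and the signal suffix format (here SIGINT) is not specified in the taxonomy.

```python
from enum import Enum


class HaltReason(str, Enum):
    DURATION_ELAPSED = "duration_elapsed"
    ENGINE_REQUESTED_HALT = "engine_requested_halt"
    RISK_XRP_EXPOSURE = "risk_xrp_exposure"
    RISK_RLUSD_EXPOSURE = "risk_rlusd_exposure"
    RISK_RPC_FAILURE = "risk_rpc_failure"
    RISK_STALE_LEDGER = "risk_stale_ledger"
    RISK_GATEWAY_UNHEALTHY = "risk_gateway_unhealthy"
    KILL_SWITCH = "kill_switch"
    RECONCILER_HALT = "reconciler_halt"
    STARTUP_FAILURE = "startup_failure"
    CONFIG_MISMATCH = "config_mismatch"
    UNEXPECTED_HALT = "unexpected_halt"


INTERRUPTED_PREFIX = "interrupted_"
_FIXED_REASONS = {r.value for r in HaltReason}


def is_valid_reason(reason):
    """Accept the fixed taxonomy plus the parameterised interrupted_<sig> form."""
    if reason.startswith(INTERRUPTED_PREFIX):
        return len(reason) > len(INTERRUPTED_PREFIX)
    return reason in _FIXED_REASONS


def record_halt(state, reason, detail=""):
    """Store the machine-parseable reason and the free-form detail separately."""
    if not is_valid_reason(reason):
        raise ValueError(f"unknown halt reason: {reason!r}")
    state["halt.reason"] = reason
    state["halt.detail"] = detail


state = {}
record_halt(
    state,
    HaltReason.RISK_XRP_EXPOSURE.value,
    detail="xrp exposure limit (115.20 > 100.0)",
)
```

Rejecting unknown strings at write time means a free-form message can never end up in halt.reason again, so dashboards can group on the key without string matching.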


Atlas #9 — Distance-to-touch diagnostic (NEW, PRIMARY METRIC for Phase 7.3)

Per-tick computation and logging:

  - distance_to_clob_bid_bps = (our_bid − clob_best_bid) × 10000 / mid
  - distance_to_clob_ask_bps = (clob_best_ask − our_ask) × 10000 / mid
  - within_2bps_bid, within_2bps_ask: booleans

Session summary: % ticks within 2 bps of touch, mean / median both sides, histogram bucketed [0–2, 2–5, 5–10, 10+] bps.
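The per-tick formulas and the session summary can be sketched as below. Several choices here are assumptions the memo leaves open: mid is taken as the CLOB mid, "within 2 bps" and the histogram use the absolute distance, and bucket upper bounds are inclusive. distance_bps and summarize are hypothetical names.

```python
from statistics import mean, median

# (inclusive upper bound in bps, label); anything larger falls into "10+".
BUCKETS = [(2.0, "0-2"), (5.0, "2-5"), (10.0, "5-10")]


def distance_bps(our_bid, our_ask, clob_best_bid, clob_best_ask):
    """Signed distance of our quotes from the CLOB touch, in basis points."""
    mid = (clob_best_bid + clob_best_ask) / 2  # assumption: CLOB mid
    bid_bps = (our_bid - clob_best_bid) * 10000 / mid
    ask_bps = (clob_best_ask - our_ask) * 10000 / mid
    return bid_bps, ask_bps


def summarize(distances, threshold_bps=2.0):
    """Session summary for one side: % within threshold, mean, median, histogram."""
    magnitudes = [abs(d) for d in distances]
    hist = {label: 0 for _, label in BUCKETS}
    hist["10+"] = 0
    for m in magnitudes:
        for upper, label in BUCKETS:
            if m <= upper:
                hist[label] += 1
                break
        else:
            hist["10+"] += 1
    within = sum(1 for m in magnitudes if m <= threshold_bps)
    return {
        "pct_within": 100.0 * within / len(magnitudes),
        "mean": mean(magnitudes),
        "median": median(magnitudes),
        "histogram": hist,
    }


bid_bps, ask_bps = distance_bps(
    our_bid=0.9994, our_ask=1.0008, clob_best_bid=0.9995, clob_best_ask=1.0005
)
summary = summarize([1.0, 3.0, 7.0, 12.0])
```

With the formulas as written, a quote resting behind the touch yields a negative value on that side, which is why the summary works on magnitudes; if the intended sign convention differs, only distance_bps needs to change.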

Storage: two new columns on market_snapshots (distance_to_bid_touch_bps, distance_to_ask_touch_bps).

Branch: feat/distance-to-touch-diagnostic — 1 commit, 2 tests.


Proposed Branch Plan (merge order)

| # | Branch | Risk | Estimated scope |
|---|---|---|---|
| 1 | fix/halt-reason-lifecycle | low | 1 commit + 3 tests |
| 2 | fix/summarize-paper-run-capital-overlay | low | 1 commit + 3 tests |
| 3 | chore/archive-cleanup | low | 1 commit (moves only) |
| 4 | fix/flag-029-async-pin-and-orphan | low | 2 commits + 2 tests |
| 5 | audit/config-wiring-pass | low | 1 commit + verification table |
| 6 | feat/distance-to-touch-diagnostic | medium | 1 commit + 2 tests |
| 7 | fix/wal-checkpoint-hardening | medium-high | 1 commit + 3 tests (concurrency) |

Land 1–5 before S40. Land 6 before Phase 7.3 data collection. Land 7 with isolated paper-mode shakedown before any live run.


Atlas #12 Go/No-Go Mapping

| Atlas requirement | Addressed by | Status after merge |
|---|---|---|
| Config mismatch resolved | Items 1 + 5 | |
| Inventory snapshot validated vs XRPL | Item 2 + shutdown invariant | ✅ pending S40 |
| Shutdown sequence verified | Session-closure fix (merged) + Item 3 TRUNCATE | |
| Async issues resolved | Item 4 | |
| Halt classification corrected | Items 1 + 7 | |
| No silent failures remain | log.error promotions (Items 1, 3, 4) | |

Phase 7.3 gate recommendation: proceed once branches 1, 2, 4, and 5 are merged and S40 (≥30 min) shows clean session closure and a reconciled inventory summary. Branches 6 and 7 can run in parallel with the first Phase 7.3 session if Vesper/Atlas prefer.


Questions back to Vesper / Atlas

  1. Branch cadence — individual PRs before next is cut, or batch 1–5 as a sequence?
  2. CLOB switch threshold — sourced from anchor_max_divergence_bps or hardcoded constant?
  3. Distance-to-touch storage — columns on market_snapshots vs separate table?
  4. Halt taxonomy strings — confirm for dashboard + Experiment Log use.

No engine code changed since session-closure merge. Awaiting review before any branch is cut.

— Orion