Pre-Phase-7.3 Full Engine Audit — Findings & Branch Plan
To: Vesper, Atlas, Katja
From: Orion
Subject: Per-item findings, proposed branch plan, Atlas #12 go/no-go mapping
This memo consolidates investigation of Vesper's 7 items + Atlas's 12-point addendum. Items 1 and 2 were already delivered in the prior closure memo; both are restated here for completeness. Nothing below is committed beyond the session-closure fix already on fix/session-closure-ended-at. Each finding proposes a separate branch so Vesper/Katja can review and gate each one.
Sandbox caveat. My local repo sits on fix/session-closure-ended-at (branched pre-merge). On Katja's disk, main already contains fix/cleanup-and-anchor-instrumentation (commit ac33de6 and friends), feat/phase-7.2-clob-switch, and fix/session-closure-ended-at. Two findings below are stale in my sandbox but already resolved on disk; I flag them explicitly so no one re-does work.
Item 1 — max_xrp_exposure mismatch — RESOLVED (ghost, not bug)
Finding. Config loading is clean. config.py:382 reads max_xrp_exposure from YAML with no hardcoded fallback below 10000. The S39 halt message "xrp exposure limit (115.20 > 100.0)" is a ghost from the engine_state.halt.reason key, not the live config.
Root cause. main_loop._shutdown() lines 648–652:
    existing_reason = self._state.get_engine_state("halt.reason") or ""
    _hr = halt_reason or existing_reason or "unexpected halt"
    if _hr != existing_reason:
        self._state.set_engine_state("halt.reason", _hr)
The halt.reason key persists across sessions and is only rewritten when the new reason differs from the existing. Duration-elapsed shutdowns pass no halt_reason kwarg — so _hr becomes existing_reason, the if is false, and the prior session's risk-halt string survives intact. The 100.0 figure comes from a session predating the bump to 150.0. At S39 runtime, the strategy engine and risk gate used 150.0 correctly.
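The ghost can be reproduced outside the engine. The sketch below is a minimal illustration, not the engine code: `FakeState` is a hypothetical in-memory stand-in for the engine_state store, and `shutdown_current` copies only the four lines quoted above.

```python
class FakeState:
    """Hypothetical in-memory stand-in for the engine_state key/value store."""

    def __init__(self):
        self._kv = {}

    def get_engine_state(self, key):
        return self._kv.get(key)

    def set_engine_state(self, key, value):
        self._kv[key] = value


def shutdown_current(state, halt_reason=None):
    """Mirrors the quoted _shutdown() logic, including the != gate."""
    existing_reason = state.get_engine_state("halt.reason") or ""
    _hr = halt_reason or existing_reason or "unexpected halt"
    if _hr != existing_reason:
        state.set_engine_state("halt.reason", _hr)
    return _hr


state = FakeState()
# Earlier session: the risk gate halts with a free-form string.
shutdown_current(state, "xrp exposure limit (115.20 > 100.0)")
# Later session: duration elapsed, so no halt_reason kwarg is passed.
reported = shutdown_current(state)
print(reported)  # the earlier session's risk-halt text
```

The second call never writes anything, so the stored key and the reported reason both still carry the old risk-halt string.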
Fix (tied to Item 7):
1. run_paper_session.py:350 → pass explicit halt_reason="duration_elapsed" on clean completion.
2. _shutdown() → always write the chosen reason (remove != gate).
3. _startup() → clear halt.reason after fresh create_session so a new session never inherits the prior session's halt text.
Branch: fix/halt-reason-lifecycle — 1 commit, 3 tests.
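A minimal sketch of the three-part fix, under the same assumptions as the root-cause description: `FakeState`, `startup_fixed` and `shutdown_fixed` are hypothetical names for illustration, and the fallback string uses the `unexpected_halt` spelling from the Item 7 taxonomy rather than the current "unexpected halt".

```python
class FakeState:
    """Hypothetical in-memory stand-in for the engine_state key/value store."""

    def __init__(self):
        self._kv = {}

    def get_engine_state(self, key):
        return self._kv.get(key)

    def set_engine_state(self, key, value):
        self._kv[key] = value


def startup_fixed(state):
    # Fix 3: a fresh session starts with no inherited halt text.
    state.set_engine_state("halt.reason", "")


def shutdown_fixed(state, halt_reason=None):
    existing = state.get_engine_state("halt.reason") or ""
    chosen = halt_reason or existing or "unexpected_halt"
    # Fix 2: always write the chosen reason (no != gate).
    state.set_engine_state("halt.reason", chosen)
    return chosen


state = FakeState()
startup_fixed(state)
shutdown_fixed(state, "risk_xrp_exposure")        # session N halts on risk
startup_fixed(state)                              # session N+1 begins clean
# Fix 1: the clean-completion path passes an explicit reason.
final_reason = shutdown_fixed(state, "duration_elapsed")

# Even if a caller forgets the kwarg, the old text cannot leak through.
leaky = FakeState()
leaky.set_engine_state("halt.reason", "risk_xrp_exposure")
startup_fixed(leaky)
fallback_reason = shutdown_fixed(leaky)
```

Keeping the `existing` fallback lets a reason written earlier in the same session survive, while the startup clear keeps prior-session text out of it.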
Item 2 — Inventory snapshot at shutdown — RESOLVED (report-path bug, not engine bug)
Finding. The S39 terminal display of 112.56 XRP / 176.03 RLUSD vs wallet 66.820 / 97.610 is a summarize_paper_run.py reporting bug introduced by FLAG-030, not an engine inventory error.
Root cause. inventory_manager.py:295-296 strips the capital overlay before persisting:
    fills_only_new_xrp = new_xrp - self._xrp_capital_overlay
    fills_only_new_rlusd = new_rlusd - self._rlusd_capital_overlay
summarize_paper_run._get_inventory_balance reads inventory_ledger.new_balance directly and does not add back the capital_events overlay. For a session after a capital injection, the summary under-reports by the injection amount, so every total derived from it disagrees with the engine snapshot. get_snapshot() itself is correct, and the engine never queries XRPL for balances mid-run. The shutdown-time open-order reserve theory is ruled out because engine inventory doesn't include reserves.
Fix:
1. summarize_paper_run._get_inventory_balance → sum capital_events.amount filtered to session time window and add to XRP / RLUSD as appropriate.
2. Add terminal-display invariant: |summary_total_value − engine.get_snapshot().total_value_in_rlusd| < 0.01 × total_value at shutdown. Log ERROR + write to engine_state.inventory_drift_at_shutdown if violated.
3. Add session summary line: Inventory reconciliation: fills-only=X, capital_overlay=Y, engine_total=Z.
Branch: fix/summarize-paper-run-capital-overlay — 1 commit, 3 tests.
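The reconciliation and the shutdown invariant could look roughly like the sketch below. It is an illustration under stated assumptions: `CapitalEvent`, `reconciled_balance` and `drift_exceeds_tolerance` are hypothetical names, capital-event amounts are assumed to be signed, the engine total is used as the `total_value` base for the 1% tolerance, and the numbers are made up rather than taken from S39.

```python
from dataclasses import dataclass


@dataclass
class CapitalEvent:
    asset: str     # "XRP" or "RLUSD"
    amount: float  # assumed signed: injections positive, withdrawals negative
    ts: float      # event time, epoch seconds


def reconciled_balance(fills_only, events, asset, start_ts, end_ts):
    """Add the capital overlay for the session window back onto the fills-only balance."""
    overlay = sum(
        e.amount for e in events
        if e.asset == asset and start_ts <= e.ts <= end_ts
    )
    return fills_only + overlay


def drift_exceeds_tolerance(summary_total, engine_total, rel_tol=0.01):
    """True when |summary - engine| is not below rel_tol * engine total."""
    return abs(summary_total - engine_total) >= rel_tol * abs(engine_total)


events = [
    CapitalEvent("XRP", 25.0, ts=50.0),     # inside the session window
    CapitalEvent("RLUSD", 10.0, ts=500.0),  # outside the session window
]
xrp = reconciled_balance(40.0, events, "XRP", start_ts=0.0, end_ts=100.0)
rlusd = reconciled_balance(70.0, events, "RLUSD", start_ts=0.0, end_ts=100.0)
print(f"Inventory reconciliation: fills-only=40.0, capital_overlay={xrp - 40.0}, engine_total={xrp}")
```

If the overlay accumulates across sessions rather than per session, the window filter would need to start at the first capital event instead; that choice is left open here.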
Item 3 — FLAG-035 WAL checkpoint hardening (IMPLEMENT)
Proposed design (Atlas-approved):
Timer thread in StateManager. PASSIVE checkpoint every 60s (non-blocking — returns busy=1 if transaction in flight, retries next interval). TRUNCATE at clean shutdown only. Config surface: config.engine.wal_checkpoint_interval_seconds: int = 60 (0 = disabled for tests).
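    # Needs `import time` and a module-level `log = logging.getLogger(__name__)`.
    def _checkpoint_loop(self) -> None:
        # Event.wait() returns False on timeout, so the body runs once per
        # interval and the loop exits as soon as _checkpoint_stop is set.
        while not self._checkpoint_stop.wait(self._checkpoint_interval_s):
            try:
                t0 = time.monotonic()
                row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
                dt_ms = (time.monotonic() - t0) * 1000
                # Result row is (busy, WAL frames, frames checkpointed).
                log.info("wal_checkpoint", extra={
                    "mode": "PASSIVE", "busy": row[0],
                    "log_frames": row[1], "checkpointed": row[2],
                    "elapsed_ms": round(dt_ms, 1),
                })
            except Exception as exc:
                log.error("wal_checkpoint failed", extra={"error": str(exc)}, exc_info=True)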
Clean-shutdown: PRAGMA wal_checkpoint(TRUNCATE) in StateManager.close() before connection close — leaves WAL empty so FLAG-027 backup is a clean single-file snapshot.
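What this does NOT fix: CTRL_CLOSE_EVENT is still uncatchable. The corruption window shrinks to at most about one checkpoint interval of writes (60s at the default setting), and only when each PASSIVE checkpoint manages to copy all pending frames.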
Branch: fix/wal-checkpoint-hardening — 1 commit, 3 tests (periodic checkpoint runs + logs, concurrent write + checkpoint integrity, shutdown TRUNCATE leaves empty WAL).
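The "shutdown TRUNCATE leaves empty WAL" test can be prototyped with the standard library alone. This is a sketch against a throwaway database, not StateManager itself: `truncate_checkpoint` is a hypothetical helper, and the one-table schema exists only to generate WAL frames.

```python
import os
import shutil
import sqlite3
import tempfile


def truncate_checkpoint(conn):
    """Run the shutdown-time checkpoint; returns (busy, log_frames, checkpointed)."""
    return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()


tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "state.db")
wal_path = db_path + "-wal"

conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE engine_state (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO engine_state VALUES ('halt.reason', 'duration_elapsed')")
conn.commit()

wal_before = os.path.getsize(wal_path)   # frames are pending in the WAL
busy, log_frames, checkpointed = truncate_checkpoint(conn)
wal_after = os.path.getsize(wal_path)    # measured before close, on purpose
conn.close()
shutil.rmtree(tmpdir, ignore_errors=True)

print(wal_before, wal_after, busy)
```

The WAL size is read before `close()` because closing the last connection checkpoints and removes the WAL on its own, which would hide whether TRUNCATE did the work.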
Item 4 — FLAG-029 async warning + orphan c7e14e73
FLAG-029a. All three submit_and_wait call sites are currently synchronous. Risk: xrpl-py ≥ 3.x deprecation path may migrate to async-only — sync call would return a coroutine object, cancel path silently fails.
Fix:
1. Pin xrpl-py version in requirements.txt.
2. Add _submit_and_wait_safe() helper that detects coroutine returns and raises immediately.
3. Gateway init smoke check: inspect.iscoroutinefunction(submit_and_wait) → log ERROR + refuse to start if True.
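A rough shape for the guard in fixes 2 and 3, using only `inspect` from the standard library. The helper name `call_sync_or_raise` and the three stand-in submit functions are hypothetical; the real helper would wrap the xrpl-py call sites instead.

```python
import inspect


def call_sync_or_raise(fn, *args, **kwargs):
    """Call fn synchronously; fail loudly if it is, or returns, a coroutine."""
    if inspect.iscoroutinefunction(fn):
        raise RuntimeError(f"{fn.__name__} is async; refusing to call it synchronously")
    result = fn(*args, **kwargs)
    if inspect.iscoroutine(result):
        result.close()  # suppress the "coroutine was never awaited" warning
        raise RuntimeError(f"{fn.__name__} returned a coroutine instead of a result")
    return result


# Stand-ins for the three shapes the call site could take.
def sync_submit(tx):
    return {"ok": True, "tx": tx}


async def async_submit(tx):
    return {"ok": True, "tx": tx}


def wrapped_async_submit(tx):
    # Looks synchronous to iscoroutinefunction() but hands back a coroutine.
    return async_submit(tx)


ok_result = call_sync_or_raise(sync_submit, "cancel-1")

errors = []
for fn in (async_submit, wrapped_async_submit):
    try:
        call_sync_or_raise(fn, "cancel-1")
    except RuntimeError as exc:
        errors.append(str(exc))
```

Checking the return value as well as the function matters because a thin synchronous wrapper around an async implementation passes the `iscoroutinefunction` smoke check.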
FLAG-029b. Stale order c7e14e73 reconciled every launch. Query on Katja's disk:
SELECT id, status, created_at, submit_tx_hash, offer_sequence, failure_reason
FROM orders WHERE id LIKE 'c7e14e73%';
If status = 'SUBMITTED', no offer_sequence, created >7 days ago → force to CANCELED with failure_reason = 'orphan cleanup 2026-04-18'.
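The cleanup rule can be expressed as one guarded UPDATE. The sketch below runs against an in-memory table whose columns are taken from the SELECT above; the column types, the ISO-8601 UTC `created_at` format, the demo ids, and the `cancel_orphan` helper are all assumptions to be checked against the real schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cancel_orphan(conn, id_prefix, now, max_age_days=7,
                  note="orphan cleanup 2026-04-18"):
    """Force stale SUBMITTED orders with no offer_sequence to CANCELED."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute(
        "UPDATE orders SET status = 'CANCELED', failure_reason = ? "
        "WHERE id LIKE ? AND status = 'SUBMITTED' "
        "AND offer_sequence IS NULL AND created_at < ?",
        (note, id_prefix + "%", cutoff),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT, created_at TEXT, "
    "submit_tx_hash TEXT, offer_sequence INTEGER, failure_reason TEXT)"
)
now = datetime(2026, 4, 18, tzinfo=timezone.utc)
old = (now - timedelta(days=30)).isoformat()
recent = (now - timedelta(days=1)).isoformat()
conn.executemany(
    "INSERT INTO orders VALUES (?, 'SUBMITTED', ?, NULL, NULL, NULL)",
    [
        ("c7e14e73-demo-old", old),     # matches prefix and age: cancelled
        ("c7e14e73-demo-new", recent),  # too recent: untouched
        ("ffff0000-demo-old", old),     # other prefix: untouched
    ],
)
updated = cancel_orphan(conn, "c7e14e73", now)
statuses = dict(conn.execute("SELECT id, status FROM orders"))
```

Keeping all three conditions in the WHERE clause means the statement is a no-op if the SELECT on Katja's disk shows the order in any other state.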
Branch: fix/flag-029-async-pin-and-orphan — 2 commits, 2 tests.
Item 5 — Full config wiring pass
| Key | YAML | Parsed | Consumed by | Notes |
|---|---|---|---|---|
| risk.max_xrp_exposure | ✅ | config.py:382 | main_loop:801 | Clean. |
| risk.max_rlusd_exposure | ✅ | config.py | main_loop:803 | Clean. |
| strategy.max_inventory_usd | ⚠️ REMOVED | — | — | Verify grep -rn 'max_inventory_usd' . = 0 on current main. |
| strategy.anchor_max_divergence_bps | ✅ (10.0) | config.py:413 | strategy_engine | Phase 7.2 CLOB switch uses 3 bps — confirm sourced from config, not hardcoded. |
| order_size.base_size_rlusd | ✅ (15.0) | config.py | strategy_engine | Clean. |
| engine.tick_interval_seconds | ✅ (4) | config.py | main_loop | Clean. |
| strategy.requote_interval_seconds | ✅ (4) | config.py | main_loop sleep | Must match tick_interval_seconds — no assertion today; add one. |
One concrete candidate: the 3 bps CLOB-switch threshold. If hardcoded in strategy_engine.py, promote to strategy.clob_switch_threshold_bps (Atlas ruling: required for Phase 7.3 tuning).
Branch: audit/config-wiring-pass — read-only audit + 3 bps surface if found hardcoded.
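The missing interval assertion could be as small as the sketch below. It assumes the parsed config is reachable as nested dicts keyed like the YAML paths in the table; `check_interval_invariant` is a hypothetical name, and the real check would live wherever config.py finishes parsing.

```python
def check_interval_invariant(cfg: dict) -> None:
    """Raise if the requote interval and the tick interval have drifted apart."""
    tick = cfg["engine"]["tick_interval_seconds"]
    requote = cfg["strategy"]["requote_interval_seconds"]
    if tick != requote:
        raise ValueError(
            "strategy.requote_interval_seconds "
            f"({requote}) must equal engine.tick_interval_seconds ({tick})"
        )


good_cfg = {
    "engine": {"tick_interval_seconds": 4},
    "strategy": {"requote_interval_seconds": 4},
}
check_interval_invariant(good_cfg)  # passes silently

bad_cfg = {
    "engine": {"tick_interval_seconds": 4},
    "strategy": {"requote_interval_seconds": 5},
}
try:
    check_interval_invariant(bad_cfg)
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)
```

Raising at load time, rather than logging, would surface the mismatch as a `config_mismatch` halt before the first tick.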
Item 6 — Dead code + stale files
Stale files in neo_engine/: main_loop_Old.py, strategy_engine_old.py. Also: NEO Back up/ (trailing space folder), neo_simulator/simulation_runner.bak.py, .fuse_hidden* files, <MagicMock ...> files in repo root.
Fix: move to Archive/, add .gitignore pattern for .fuse_hidden* and <MagicMock *>. Archive/ excluded from grep-based audits (document this explicitly).
Branch: chore/archive-cleanup — 1 commit, no tests.
Item 7 — Halt reason classification
Same branch as Item 1 (fix/halt-reason-lifecycle). Halt reason taxonomy:
| Reason | Emitted by | Semantics |
|---|---|---|
| duration_elapsed | run_paper_session clean exit | Normal completion |
| engine_requested_halt | engine returned _tick() == False | Strategy/reconciler halt |
| risk_xrp_exposure | main_loop risk gate | XRP value > cap |
| risk_rlusd_exposure | main_loop risk gate | RLUSD > cap |
| risk_rpc_failure | main_loop risk gate | last RPC failed |
| risk_stale_ledger | main_loop risk gate | ledger age > threshold |
| risk_gateway_unhealthy | main_loop risk gate | gateway health check failed |
| kill_switch | kill_switch.py | explicit HALT |
| reconciler_halt | main_loop reconciler | fill errors or ambiguous orders |
| interrupted_<sig> | signal handler | Ctrl+C / Break / TERM |
| startup_failure | _startup exception path | cannot fetch balances / offers |
| config_mismatch | startup invariant check | runtime config ≠ expected (Atlas addition) |
| unexpected_halt | fallback | should never appear post-fix |
Free-form strings (e.g. "xrp exposure limit (115.20 > 100.0)") move to a separate halt.detail engine_state key so numeric context survives but classification is machine-parseable.
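One way to keep the classification machine-parseable is an enum for the fixed codes plus a separate detail field. The sketch below is illustrative only: `HaltReason`, `halt_record` and `is_known_reason` are hypothetical names, and the parametric `interrupted_<sig>` family is handled by prefix rather than as enum members.

```python
from enum import Enum


class HaltReason(str, Enum):
    DURATION_ELAPSED = "duration_elapsed"
    ENGINE_REQUESTED_HALT = "engine_requested_halt"
    RISK_XRP_EXPOSURE = "risk_xrp_exposure"
    RISK_RLUSD_EXPOSURE = "risk_rlusd_exposure"
    RISK_RPC_FAILURE = "risk_rpc_failure"
    RISK_STALE_LEDGER = "risk_stale_ledger"
    RISK_GATEWAY_UNHEALTHY = "risk_gateway_unhealthy"
    KILL_SWITCH = "kill_switch"
    RECONCILER_HALT = "reconciler_halt"
    STARTUP_FAILURE = "startup_failure"
    CONFIG_MISMATCH = "config_mismatch"
    UNEXPECTED_HALT = "unexpected_halt"


INTERRUPTED_PREFIX = "interrupted_"


def halt_record(reason: HaltReason, detail: str = "") -> dict:
    """Split a halt into the classification key and the free-form detail key."""
    return {"halt.reason": reason.value, "halt.detail": detail}


def is_known_reason(code: str) -> bool:
    """True for a taxonomy code or a non-empty interrupted_<sig> code."""
    if code in {r.value for r in HaltReason}:
        return True
    return code.startswith(INTERRUPTED_PREFIX) and len(code) > len(INTERRUPTED_PREFIX)


record = halt_record(
    HaltReason.RISK_XRP_EXPOSURE,
    detail="xrp exposure limit (115.20 > 100.0)",
)
```

A dashboard or the Experiment Log can then group on `halt.reason` alone and treat anything failing `is_known_reason` as a regression.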
Atlas #9 — Distance-to-touch diagnostic (NEW, PRIMARY METRIC for Phase 7.3)
Per-tick computation and logging:
- distance_to_clob_bid_bps — (our_bid − clob_best_bid) × 10000 / mid
- distance_to_clob_ask_bps — (clob_best_ask − our_ask) × 10000 / mid
- within_2bps_bid, within_2bps_ask — booleans
Session summary: % ticks within 2 bps of touch, mean / median both sides, histogram bucketed [0–2, 2–5, 5–10, 10+] bps.
Storage: two new columns on market_snapshots (distance_to_bid_touch_bps, distance_to_ask_touch_bps).
Branch: feat/distance-to-touch-diagnostic — 1 commit, 2 tests.
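The per-tick computation follows directly from the two formulas above. In this sketch the function names are hypothetical, `mid` is assumed to be the CLOB mid (best bid plus best ask, halved), and both the 2 bps flag and the histogram bucket are assumed to use the absolute distance, since a quote behind the touch comes out negative under the stated sign convention.

```python
def distance_to_touch_bps(our_bid, our_ask, clob_best_bid, clob_best_ask):
    """Return (bid_bps, ask_bps); negative means our quote sits behind the touch."""
    mid = (clob_best_bid + clob_best_ask) / 2  # assumption: CLOB mid
    bid_bps = (our_bid - clob_best_bid) * 10000 / mid
    ask_bps = (clob_best_ask - our_ask) * 10000 / mid
    return bid_bps, ask_bps


def within_2bps(distance_bps):
    return abs(distance_bps) <= 2.0


def bucket(distance_bps):
    """Histogram bucket for the session summary, on absolute distance."""
    a = abs(distance_bps)
    if a < 2:
        return "0-2"
    if a < 5:
        return "2-5"
    if a < 10:
        return "5-10"
    return "10+"


# Illustrative quotes only: bid 1 bp behind the touch, ask 7 bps behind.
bid_bps, ask_bps = distance_to_touch_bps(
    our_bid=0.9998, our_ask=1.0008,
    clob_best_bid=0.9999, clob_best_ask=1.0001,
)
print(round(bid_bps, 2), round(ask_bps, 2), bucket(bid_bps), bucket(ask_bps))
```

If the histogram is instead meant to separate "behind the touch" from "improving the touch", the signed value would need its own buckets; the source buckets do not say which.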
Proposed Branch Plan (merge order)
| # | Branch | Risk | Estimated scope |
|---|---|---|---|
| 1 | fix/halt-reason-lifecycle | low | 1 commit + 3 tests |
| 2 | fix/summarize-paper-run-capital-overlay | low | 1 commit + 3 tests |
| 3 | chore/archive-cleanup | low | 1 commit (moves only) |
| 4 | fix/flag-029-async-pin-and-orphan | low | 2 commits + 2 tests |
| 5 | audit/config-wiring-pass | low | 1 commit + verification table |
| 6 | feat/distance-to-touch-diagnostic | medium | 1 commit + 2 tests |
| 7 | fix/wal-checkpoint-hardening | medium-high | 1 commit + 3 tests (concurrency) |
Land 1–5 before S40. Land 6 before Phase 7.3 data collection. Land 7 with isolated paper-mode shakedown before any live run.
Atlas #12 Go/No-Go Mapping
| Atlas requirement | Addressed by | Status after merge |
|---|---|---|
| Config mismatch resolved | Items 1 + 5 | ✅ |
| Inventory snapshot validated vs XRPL | Item 2 + shutdown invariant | ✅ pending S40 |
| Shutdown sequence verified | Session-closure fix (merged) + Item 3 TRUNCATE | ✅ |
| Async issues resolved | Item 4 | ✅ |
| Halt classification corrected | Items 1 + 7 | ✅ |
| No silent failures remain | log.error promotions (Items 1, 3, 4) | ✅ |
Phase 7.3 gate recommendation: proceed once branches 1, 2, 4, 5 merged and S40 (≥30 min) shows clean session closure + reconciled inventory summary. Branches 6 and 7 can run in parallel with first Phase 7.3 session if Vesper/Atlas prefer.
Questions back to Vesper / Atlas
- Branch cadence — individual PRs before next is cut, or batch 1–5 as a sequence?
- CLOB switch threshold — sourced from anchor_max_divergence_bps or hardcoded constant?
- Distance-to-touch storage — columns on market_snapshots vs separate table?
- Halt taxonomy strings — confirm for dashboard + Experiment Log use.
No engine code changed since session-closure merge. Awaiting review before any branch is cut.
— Orion