
author: orion
type: atlas-alignment
status: acknowledged — updated branch plan locked
date: 2026-04-18
audience: Atlas, Vesper, Katja
parent: [C] Orion Pre-Phase-7.3 Audit — Findings & Branch Plan.md


Atlas Alignment — Pre-7.3 Audit Approved + Additions

To: Atlas
CC: Vesper, Katja
From: Orion
Re: Your review of 2026-04-18 — approved with additions

Acknowledged. All additions accepted as specified. Concrete changes to the plan below, organized by your section.


2A — Hard inventory invariant at shutdown

Accepted. Scope expands fix/summarize-paper-run-capital-overlay from a two-source to a three-source invariant:

engine_snapshot.total_value_in_rlusd
  == summary_total_value_rlusd
  == xrpl_settled_value_rlusd
  (± tolerance)

Tolerance proposal. Absolute 1e-4 RLUSD OR relative 5 bps of total_value, whichever is larger. Reason: float rounding across 1000+ fills can accumulate O(1e-6), and mid-price at shutdown is noisy. Flag if you want tighter.
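
A minimal sketch of the tolerance rule proposed above, assuming the three totals are already available as floats. The function name check_inventory_invariant and the returned dict are illustrative, not existing engine code, and measuring the relative tolerance against the engine total is an assumption.

```python
from itertools import combinations

ABS_TOL_RLUSD = 1e-4   # absolute floor from the proposal
REL_TOL_BPS = 5.0      # relative tolerance, in basis points of total value


def check_inventory_invariant(engine_total, summary_total, xrpl_total):
    """Compare the three totals pairwise against max(absolute, relative) tolerance.

    Hypothetical helper: the relative part is taken against the engine total.
    """
    tolerance = max(ABS_TOL_RLUSD, abs(engine_total) * REL_TOL_BPS / 10_000)
    totals = (engine_total, summary_total, xrpl_total)
    # Largest disagreement between any two of the three sources.
    max_delta = max(abs(a - b) for a, b in combinations(totals, 2))
    return {
        "status": "ok" if max_delta <= tolerance else "drift",
        "engine_total": engine_total,
        "summary_total": summary_total,
        "xrpl_total": xrpl_total,
        "max_delta_rlusd": max_delta,
        "tolerance_rlusd": tolerance,
    }
```

At a total of 1000 RLUSD the relative term dominates (0.5 RLUSD); below roughly 0.2 RLUSD of total value the absolute floor of 1e-4 takes over.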

Implementation notes.

  • xrpl_settled_value_rlusd: new _fetch_settled_balances() call in _shutdown() before close_session(). Uses the existing gateway account lines path (same code path as the _startup balance fetch). If the call fails (RPC down), log ERROR and persist inventory_invariant.status = "unverified" instead of blocking; the existing session close should not be held hostage by an RPC outage.
  • engine_state keys written: inventory_invariant.status ∈ {ok, drift, unverified}, inventory_invariant.engine_total, inventory_invariant.summary_total, inventory_invariant.xrpl_total, inventory_invariant.max_delta_rlusd.
  • Next-run gate: run_paper_session.py preflight checks inventory_invariant.status. If drift, block with a clear message and require manual reset-invariant acknowledgment (writes inventory_invariant.override = "<timestamp>" into engine_state). If unverified, warn but allow.
  • Halt reason config_mismatch (per 2E) fires on drift.
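
The next-run gate described above could look roughly like the sketch below. It models engine_state as a flat dict of dotted keys, which is an assumption about the store; PreflightBlocked and check_inventory_invariant_gate are hypothetical names. Treating any non-empty override as a valid acknowledgment, and letting a missing status through, are also assumptions.

```python
import logging

logger = logging.getLogger("preflight")


class PreflightBlocked(RuntimeError):
    """Raised when the previous session's invariant state forbids a new run."""


def check_inventory_invariant_gate(engine_state: dict) -> str:
    """Return "proceed", or raise PreflightBlocked on unacknowledged drift."""
    status = engine_state.get("inventory_invariant.status")
    override = engine_state.get("inventory_invariant.override")

    if status == "drift":
        if override:
            # Assumption: any recorded override timestamp counts as acknowledgment.
            logger.warning("inventory drift acknowledged at %s; proceeding", override)
            return "proceed"
        raise PreflightBlocked(
            "inventory_invariant.status is 'drift'; run the manual "
            "reset-invariant acknowledgment before starting a new session"
        )
    if status == "unverified":
        # Warn but allow: the last shutdown could not reach XRPL.
        logger.warning("inventory invariant was not verified at last shutdown")
    # "ok", "unverified", or no recorded status (assumed: first run) all proceed.
    return "proceed"
```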

Branch: still fix/summarize-paper-run-capital-overlay. Commit count rises to 2 (overlay fix, then invariant + preflight gate). Tests: 4 (existing 3 + one three-source drift scenario with mocked XRPL response).


2B — Config traceability end-to-end

Accepted. Upgrading audit/config-wiring-pass from a loader table to a traceability matrix. For each runtime-critical key, verify:

yaml_value → parsed_value → engine_instance_attr → observable_metric

Spot checks per your guidance.

| Config key | Observable metric | Verification approach |
|---|---|---|
| strategy.bid_offset_bps | distance_to_clob_bid_bps shifts with the setting | unit test: set to 10 bps, verify our_bid = mid × (1 − 10/10000) |
| strategy.ask_offset_bps | distance_to_clob_ask_bps shifts with the setting | unit test: set to 14 bps, verify our_ask = mid × (1 + 14/10000) |
| strategy.clob_switch_threshold_bps | reference_source telemetry flips at error | |
| risk.max_xrp_exposure | halt with risk_xrp_exposure at correct boundary | unit test exists in test_main_loop; re-run |
| order_size.base_size_rlusd | order quantity in orders table | unit test exists |
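
As a worked instance of the two offset checks, here is a small sketch of the quote formulas with the 10 bps and 14 bps settings. quote_prices is a hypothetical stand-in for wherever the engine computes its quotes; only the formulas come from the matrix.

```python
import math


def quote_prices(mid: float, bid_offset_bps: float, ask_offset_bps: float):
    """Apply the offset formulas from the matrix: bid below mid, ask above mid."""
    our_bid = mid * (1 - bid_offset_bps / 10_000)
    our_ask = mid * (1 + ask_offset_bps / 10_000)
    return our_bid, our_ask


def test_offsets_reach_the_quote():
    mid = 2.0
    bid, ask = quote_prices(mid, bid_offset_bps=10, ask_offset_bps=14)
    assert math.isclose(bid, 1.998)    # 2.0 * (1 - 10/10000)
    assert math.isclose(ask, 2.0028)   # 2.0 * (1 + 14/10000)
    # Widening the bid offset must move the bid further from mid.
    wider_bid, _ = quote_prices(mid, bid_offset_bps=20, ask_offset_bps=14)
    assert wider_bid < bid


test_offsets_reach_the_quote()
```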

Wiring failure definition. If any yaml key parses successfully but its value cannot be traced to an observable metric, treat as a wiring failure: log ERROR, write to engine_state.config_wiring.<key> = "orphaned", and escalate for removal (same bar as max_inventory_usd — if it doesn't affect behavior, it doesn't belong in config).

Branch: audit/config-wiring-pass now produces (a) the traceability matrix as a markdown artifact committed to docs/, (b) any promoted constants (see 2C), (c) any orphan removals discovered. Commit count estimate: 2–3 depending on findings.


2C — CLOB switch threshold configurable

Accepted. strategy.clob_switch_threshold_bps: 3.0 goes into config_live_stage1.yaml and every sibling YAML. Loader in config.py with default 3.0 to preserve current behavior. Rolled into audit/config-wiring-pass as the primary promoted constant.

Pre-verification step. Before I promote, I'll grep strategy_engine.py and main_loop.py for the 3 bps constant to confirm its exact location and count. Every call site gets the parameter threaded through rather than re-reading config in hot paths: the value is set once at engine init from self._config.strategy.clob_switch_threshold_bps.
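
A minimal sketch of the promotion pattern, assuming the YAML has already been parsed into a dict. StrategyConfig, load_strategy_config, StrategyEngine and exceeds_switch_threshold are illustrative names; what the engine actually compares against the threshold, and whether the comparison is inclusive, are assumptions.

```python
from dataclasses import dataclass

DEFAULT_CLOB_SWITCH_THRESHOLD_BPS = 3.0  # preserves current behavior when the key is absent


@dataclass(frozen=True)
class StrategyConfig:
    clob_switch_threshold_bps: float = DEFAULT_CLOB_SWITCH_THRESHOLD_BPS


def load_strategy_config(raw: dict) -> StrategyConfig:
    """Read the strategy section of a parsed YAML dict, falling back to the default."""
    strategy = raw.get("strategy") or {}
    value = strategy.get("clob_switch_threshold_bps", DEFAULT_CLOB_SWITCH_THRESHOLD_BPS)
    return StrategyConfig(clob_switch_threshold_bps=float(value))


class StrategyEngine:
    def __init__(self, strategy_config: StrategyConfig):
        # Read once at init; hot paths use the cached attribute, never the config.
        self._clob_switch_threshold_bps = strategy_config.clob_switch_threshold_bps

    def exceeds_switch_threshold(self, error_bps: float) -> bool:
        # Assumption: the switch keys off an absolute error in bps, inclusive.
        return abs(error_bps) >= self._clob_switch_threshold_bps
```

Freezing the dataclass makes an accidental mid-session mutation of the threshold raise instead of silently changing behavior.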


2D — Distance-to-touch = PRIMARY Phase 7.3 metric

Accepted and moved. The revised branch plan (Section 6 below) pulls feat/distance-to-touch-diagnostic up to merge #6, ahead of the analysis phase. It becomes a Phase 7.3 prerequisite, not a nice-to-have.


2E — Halt taxonomy addition: config_mismatch

Accepted. Taxonomy updated:

| Reason | Emitted by | Trigger |
|---|---|---|
| config_mismatch | shutdown invariant check, startup config validation | runtime config ≠ expected, or inventory invariant drift |

All other entries from the 2026-04-18 audit memo unchanged. This row goes into the halt_reason classification table in fix/halt-reason-lifecycle.


3 — WAL hardening constraints

Accepted.

  • Target average checkpoint latency: < 50ms. Will log elapsed_ms on every checkpoint.
  • Adding p50/p95 rollup: StateManager keeps a bounded deque (last 100 checkpoints). Session summary emits wal_checkpoint_p50_ms, wal_checkpoint_p95_ms, wal_checkpoint_slow_count (> 50ms).
  • The branch's concurrency test already verifies that checkpoints do not block the main loop; adding an assertion on the max elapsed_ms observed during the 1000-write stress path (expect well under 50ms on SSDs).
  • If any checkpoint exceeds 200ms, log WARNING with the busy/log_frames return values — that's the signal of a stuck reader holding the WAL open.
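
The latency bookkeeping in the list above can be sketched with the standard sqlite3 module. CheckpointTimer and percentile are hypothetical helpers, not the StateManager API, and the nearest-rank percentile is one reasonable choice among several.

```python
import logging
import math
import sqlite3
import time
from collections import deque

logger = logging.getLogger("wal_checkpoint")

SLOW_MS = 50.0    # latency target from the list above
STUCK_MS = 200.0  # WARNING threshold from the list above
_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list; None when empty."""
    if not sorted_samples:
        return None
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


class CheckpointTimer:
    """Times WAL checkpoints and keeps only the most recent `window` samples."""

    def __init__(self, conn: sqlite3.Connection, window: int = 100):
        self._conn = conn
        self._elapsed_ms = deque(maxlen=window)

    def checkpoint(self, mode: str = "PASSIVE") -> float:
        if mode not in _MODES:
            raise ValueError(f"unknown checkpoint mode: {mode}")
        start = time.perf_counter()
        # The pragma returns one row: (busy, log_frames, checkpointed_frames).
        busy, log_frames, _ = self._conn.execute(
            f"PRAGMA wal_checkpoint({mode})"
        ).fetchone()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self._elapsed_ms.append(elapsed_ms)
        if elapsed_ms > STUCK_MS:
            logger.warning("slow checkpoint: %.1f ms busy=%s log_frames=%s",
                           elapsed_ms, busy, log_frames)
        return elapsed_ms

    def summary(self) -> dict:
        samples = sorted(self._elapsed_ms)
        return {
            "wal_checkpoint_p50_ms": percentile(samples, 50),
            "wal_checkpoint_p95_ms": percentile(samples, 95),
            "wal_checkpoint_slow_count": sum(1 for ms in samples if ms > SLOW_MS),
        }
```

Because the deque is bounded, the rollup describes the last 100 checkpoints rather than the whole session; a session-wide slow count would need a separate counter.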

4 — Async safety: fail-fast, no degraded mode

Accepted. Agreed: silent fallback was never in scope. Concretely:

  • inspect.iscoroutinefunction(submit_and_wait) smoke check at gateway init. If True, raise RuntimeError("xrpl-py submit_and_wait is now async; engine requires sync path — pin version or migrate") before any engine start.
  • _submit_and_wait_safe wrapper: if the return value is a coroutine (detected via inspect.iscoroutine), close it and raise the same error; do not let the cancel path degrade to a silent failure.
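
Both checks can be sketched with the standard inspect module alone. The helper names below are illustrative, the submit function is passed in rather than imported so the sketch does not depend on xrpl-py, and the error text is shortened from the message quoted above.

```python
import inspect

# Shortened from the memo's message; the real text may differ.
ASYNC_SUBMIT_ERROR = "xrpl-py submit_and_wait is now async; engine requires sync path"


def assert_sync_callable(submit_fn) -> None:
    """Init-time smoke check: refuse to start if the submit function is async."""
    if inspect.iscoroutinefunction(submit_fn):
        raise RuntimeError(ASYNC_SUBMIT_ERROR)


def submit_and_wait_safe(submit_fn, *args, **kwargs):
    """Call-time guard: a returned coroutine means nothing was actually submitted."""
    result = submit_fn(*args, **kwargs)
    if inspect.iscoroutine(result):
        # Close it so Python does not emit "coroutine was never awaited".
        result.close()
        raise RuntimeError(ASYNC_SUBMIT_ERROR)
    return result
```

The call-time guard covers the case the init check misses: a synchronous wrapper that internally returns a coroutine object.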


5 — Archive/ excluded from grep-based audits

Accepted. Documentation change in chore/archive-cleanup:

  • docs/AUDIT_CONVENTIONS.md (new): states that all audits use grep -rn PATTERN neo_engine/ tests/ config/ run_paper_session.py summarize_paper_run.py and explicitly exclude Archive/, INTEL/, NEO Back up/, neo_simulator/simulation_runner.bak.py.
  • .gitignore gains entries for the leaked .fuse_hidden* and <MagicMock ...> patterns.
  • AGENTS.md gets a one-line pointer to the new conventions doc so Vesper and I reach the same conclusion next time.


6 — Branch plan — revised

Re-ordered per your 2D and new invariant scope:

| # | Branch | Risk | Notes |
|---|---|---|---|
| 1 | fix/halt-reason-lifecycle | low | 1 commit, 3 tests. Includes config_mismatch taxonomy entry. |
| 2 | fix/summarize-paper-run-capital-overlay | low-med | 2 commits (overlay fix + three-source invariant & preflight gate), 4 tests. |
| 3 | chore/archive-cleanup | low | File moves + AUDIT_CONVENTIONS.md + .gitignore. |
| 4 | fix/flag-029-async-pin-and-orphan | low | Fail-fast smoke + orphan backfill. |
| 5 | audit/config-wiring-pass | low | Traceability matrix + clob_switch_threshold_bps promotion + any orphan removals. |
| 6 | feat/distance-to-touch-diagnostic | medium (PRIMARY 7.3 METRIC) | Moved up; columns on market_snapshots + session summary histogram. |
| 7 | fix/wal-checkpoint-hardening | medium-high | Periodic PASSIVE + shutdown TRUNCATE + p50/p95 + concurrency test. |

Individual PRs, reviewed before the next is cut (your ruling on Q1).


7 — Phase 7.3 Go/No-Go — locked

Proceed only when all of the following are true:

  • fix/halt-reason-lifecycle merged
  • fix/summarize-paper-run-capital-overlay merged
  • fix/flag-029-async-pin-and-orphan merged
  • audit/config-wiring-pass merged + clob_switch_threshold_bps promoted + traceability matrix committed
  • feat/distance-to-touch-diagnostic merged (metric available for Phase 7.3 analysis)
  • S40 completes clean (≥30 min, post-merge config): ended_at populated, inventory_invariant.status == "ok", no config_mismatch halt, no silent failures in logs

chore/archive-cleanup and fix/wal-checkpoint-hardening are not strict gates for Phase 7.3 — they can ship in parallel with the first 7.3 session if Vesper/Katja approve.


8 — Operational transition noted

Your closing framing is accepted:

functionally correct system  →  operationally trustworthy system

Every branch in this plan is scoped so invariants, not vibes, decide whether the engine is trusted to trade. Distance-to-touch as the primary metric for Phase 7.3 is the capstone of that transition: it moves Phase 7 from "did we adjust the parameter?" to "did the parameter measurably move the quote relative to where the market is?"


Execution signal

Awaiting Vesper sign-off and Katja's go. On green-light, I cut branch #1 first and return it as a patch for Katja's terminal per the standing git rule — no direct filesystem commits. One branch at a time; I pause after each for review before the next.

— Orion