Atlas Alignment — Pre-7.3 Audit Approved + Additions

To: Atlas
CC: Vesper, Katja
From: Orion
Re: Your review of 2026-04-18 — approved with additions

Acknowledged. All additions accepted as specified. Concrete changes to the plan below, organized by your section.

2A — Hard Inventory Invariant at Shutdown

Accepted. This expands the scope of fix/summarize-paper-run-capital-overlay from a two-source to a three-source invariant:

engine_snapshot.total_value_in_rlusd
  == summary_total_value_rlusd
  == xrpl_settled_value_rlusd
  (± tolerance)

Tolerance proposal. The larger of an absolute 1e-4 RLUSD or a relative 5 bps of total_value. Reason: float rounding across 1000+ fills can accumulate error on the order of 1e-6, and the mid-price used for valuation at shutdown is noisy. Flag if you want it tighter.
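
A minimal sketch of the three-source comparison under the proposed tolerance. The function names, the returned dict shape, and the choice to take the relative band against the engine snapshot total are assumptions for illustration, not the existing code.

```python
from typing import Optional

ABS_TOL_RLUSD = 1e-4   # absolute floor from the proposal
REL_TOL_BPS = 5.0      # relative band from the proposal


def invariant_tolerance(total_value: float) -> float:
    """Return the larger of the absolute floor and the relative band."""
    return max(ABS_TOL_RLUSD, abs(total_value) * REL_TOL_BPS / 10_000)


def check_inventory_invariant(
    engine_total: float,
    summary_total: float,
    xrpl_total: Optional[float],
) -> dict:
    """Compare the three sources; xrpl_total is None when the RPC fetch failed."""
    if xrpl_total is None:
        # Assumption: an unavailable XRPL balance yields "unverified", not "drift".
        return {"status": "unverified", "max_delta_rlusd": None}
    totals = (engine_total, summary_total, xrpl_total)
    max_delta = max(totals) - min(totals)
    # Assumption: the relative band is taken against the engine snapshot total.
    tolerance = invariant_tolerance(engine_total)
    status = "ok" if max_delta <= tolerance else "drift"
    return {
        "status": status,
        "max_delta_rlusd": max_delta,
        "tolerance_rlusd": tolerance,
    }
```

At a total of 1000 RLUSD the relative band is 0.5 RLUSD, so under this reading the absolute floor only governs for totals below 0.2 RLUSD.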

Implementation notes.

xrpl_settled_value_rlusd: new _fetch_settled_balances() call in _shutdown(), before close_session(). It uses the existing gateway account-lines path (the same code path as the _startup balance fetch). If the call fails (RPC down), log ERROR and persist inventory_invariant.status = "unverified" instead of blocking; session close should not be held hostage by an RPC outage.
engine_state keys written: inventory_invariant.status ∈ {ok, drift, unverified}, inventory_invariant.engine_total, inventory_invariant.summary_total, inventory_invariant.xrpl_total, inventory_invariant.max_delta_rlusd.
Next-run gate: run_paper_session.py preflight checks inventory_invariant.status. If drift, block with a clear message and require manual reset-invariant acknowledgment (writes inventory_invariant.override = "<timestamp>" into engine_state). If unverified, warn but allow.
Halt reason config_mismatch (per 2E) fires on drift.
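
The next-run gate could look roughly like the sketch below. It assumes engine_state is readable as a flat dict with dotted keys, that a missing status means a first run, and that an acknowledged drift is surfaced as a warning. All three are assumptions, and the function name is hypothetical.

```python
from __future__ import annotations


def preflight_invariant_gate(engine_state: dict) -> tuple[str, str]:
    """Return (decision, message); decision is "allow", "warn", or "block"."""
    status = engine_state.get("inventory_invariant.status")
    override = engine_state.get("inventory_invariant.override")

    if status == "drift":
        if override:
            # Assumption: any recorded override timestamp counts as acknowledgment.
            return "warn", f"drift acknowledged via override at {override}"
        return "block", "inventory invariant drift; run reset-invariant to acknowledge"
    if status == "unverified":
        return "warn", "previous shutdown could not verify XRPL balances"
    # "ok", or no recorded status (assumed here to mean a first run).
    return "allow", "inventory invariant ok"
```

One open design point this sketch does not settle: whether an override should be cleared after the next clean shutdown so that it cannot silently acknowledge a later drift.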

Branch: still fix/summarize-paper-run-capital-overlay. Commit count rises to 2 (overlay fix, then invariant + preflight gate). Tests: 4 (existing 3 + one three-source drift scenario with mocked XRPL response).

2B — Config Traceability End-to-End

Accepted. Upgrading audit/config-wiring-pass from a loader table to a traceability matrix. For each runtime-critical key, verify:

yaml_value → parsed_value → engine_instance_attr → observable_metric

Spot checks per your guidance.

| Config key | Observable metric | Verification approach |
| --- | --- | --- |
| `strategy.bid_offset_bps` | `distance_to_clob_bid_bps` shifts with the setting | unit test: set to 10 bps, verify `our_bid = mid × (1 − 10/10000)` |
| `strategy.ask_offset_bps` | `distance_to_clob_ask_bps` shifts with the setting | unit test: set to 14 bps, verify `our_ask = mid × (1 + 14/10000)` |
| `strategy.clob_switch_threshold_bps` | `reference_source` telemetry flips at threshold | unit test: set to 3.0, simulate 2.9 / 3.1 bps spread error |
| `risk.max_xrp_exposure` | halt with `risk_xrp_exposure` at correct boundary | unit test exists in test_main_loop; re-run |
| `order_size.base_size_rlusd` | order `quantity` in orders table | unit test exists |
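
The two offset checks reduce to simple arithmetic, sketched here. The helper names are hypothetical, and this covers only the quote-price formula from the table, not the `distance_to_clob_*` telemetry, which is measured against the CLOB touch and not the mid.

```python
import math
from typing import Tuple

BPS = 10_000


def quote_prices(mid: float, bid_offset_bps: float, ask_offset_bps: float) -> Tuple[float, float]:
    """Place the bid below and the ask above mid by the configured offsets."""
    our_bid = mid * (1 - bid_offset_bps / BPS)
    our_ask = mid * (1 + ask_offset_bps / BPS)
    return our_bid, our_ask


def offset_from_mid_bps(price: float, mid: float) -> float:
    """Signed distance of a price from mid, in basis points."""
    return (price - mid) / mid * BPS


# Example with the table's settings and an arbitrary mid of 2.0.
bid, ask = quote_prices(mid=2.0, bid_offset_bps=10, ask_offset_bps=14)
assert math.isclose(bid, 1.998)
assert math.isclose(ask, 2.0028)
```

Recovering the offset from the resulting price, as `offset_from_mid_bps` does, is one way to assert that the yaml value actually reached the quote and was not shadowed by a hard-coded default.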

Wiring failure definition. If any yaml key parses successfully but its value cannot be traced to an observable metric, treat it as a wiring failure: log ERROR, set engine_state.config_wiring.<key> = "orphaned", and escalate for removal (same bar as max_inventory_usd: if it doesn't affect behavior, it doesn't belong in config).
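
As a rough illustration of the orphan rule, the sketch below diffs the parsed keys against the keys the traceability matrix can account for. The function names and the example keys are hypothetical, and engine_state is again assumed to accept flat dotted keys.

```python
import logging
from typing import Dict, Iterable, List

log = logging.getLogger("config_wiring")


def find_orphaned_keys(parsed_keys: Iterable[str], traced_keys: Iterable[str]) -> List[str]:
    """Keys that parsed successfully but have no observable metric in the matrix."""
    return sorted(set(parsed_keys) - set(traced_keys))


def orphan_state_entries(orphans: Iterable[str]) -> Dict[str, str]:
    """Build the engine_state entries for orphaned keys and log each at ERROR."""
    entries = {}
    for key in orphans:
        log.error("config key %s parsed but is not wired to any observable metric", key)
        entries[f"config_wiring.{key}"] = "orphaned"
    return entries
```

For example, with parsed keys `strategy.bid_offset_bps` and a made-up `example.unused_key`, and only the first present in the matrix, the result is a single entry `config_wiring.example.unused_key = "orphaned"`.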

Branch: audit/config-wiring-pass now produces (a) the traceability matrix as a markdown artifact committed to docs/, (b) any promoted constants (see 2C), (c) any orphan removals discovered. Commit count estimate: 2–3 depending on findings.

2C — CLOB Switch Threshold Configurable

Accepted. strategy.clob_switch_threshold_bps: 3.0 goes into config_live_stage1.yaml and every sibling YAML. Loader in config.py with default 3.0 to preserve current behavior. Rolled into audit/config-wiring-pass as the primary promoted constant.

Pre-verification step. Before I promote, I'll grep strategy_engine.py and main_loop.py for the 3 bps constant to confirm its exact location and count. Any call sites get the parameter threaded through, not a re-read of config in hot paths — set once at engine init from self._config.strategy.clob_switch_threshold_bps.
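
A sketch of the intended wiring, with the default preserving current behavior and the engine caching the value once at init. The class and function names are placeholders and do not mirror config.py or strategy_engine.py.

```python
from dataclasses import dataclass

DEFAULT_CLOB_SWITCH_THRESHOLD_BPS = 3.0  # matches the current hard-coded behavior


@dataclass(frozen=True)
class StrategyConfig:
    clob_switch_threshold_bps: float = DEFAULT_CLOB_SWITCH_THRESHOLD_BPS


def load_strategy_config(raw: dict) -> StrategyConfig:
    """Read the strategy section of a parsed yaml dict, falling back to the default."""
    strategy = raw.get("strategy") or {}
    value = strategy.get("clob_switch_threshold_bps", DEFAULT_CLOB_SWITCH_THRESHOLD_BPS)
    return StrategyConfig(clob_switch_threshold_bps=float(value))


class EngineSketch:
    """Hypothetical stand-in for the engine: reads the threshold once, at init."""

    def __init__(self, strategy: StrategyConfig) -> None:
        # Cached attribute; hot paths use this and never re-read config.
        self._clob_switch_threshold_bps = strategy.clob_switch_threshold_bps
```

A YAML file that omits the key then behaves exactly as before, which keeps the promotion a no-op until someone changes the value deliberately.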

2D — Distance-to-Touch = PRIMARY Phase 7.3 Metric

Accepted and moved. Revised branch plan (see Section 6 below) pulls feat/distance-to-touch-diagnostic up to merge #6, ahead of the WAL hardening branch. It becomes a Phase 7.3 prerequisite, not a nice-to-have.

2E — Halt Taxonomy Addition: `config_mismatch`

Accepted. Taxonomy updated:

| Reason | Emitted by | Trigger |
| --- | --- | --- |
| `config_mismatch` | shutdown invariant check, startup config validation | runtime config ≠ expected, or inventory invariant `drift` |

All other entries from the 2026-04-18 audit memo unchanged. This row goes into the halt_reason classification table in fix/halt-reason-lifecycle.

3 — WAL Hardening Constraints

Accepted.

Target average checkpoint latency: < 50ms. Will log elapsed_ms on every checkpoint.
Adding p50/p95 rollup: StateManager keeps a bounded deque (last 100 checkpoints). Session summary emits wal_checkpoint_p50_ms, wal_checkpoint_p95_ms, wal_checkpoint_slow_count (> 50ms).
Concurrency test in the branch already covered non-blocking of the main loop; adding an assertion on max elapsed_ms observed during the 1000-write stress path (expect well under 50ms on SSDs).
If any checkpoint exceeds 200ms, log WARNING with the busy/log_frames return values — that's the signal of a stuck reader holding the WAL open.
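
The timing and rollup described above might be sketched as follows. `CheckpointStats` is a hypothetical helper and not the existing StateManager; the nearest-rank percentile and counting slow checkpoints over the 100-sample window only are assumptions. The three-column result of `PRAGMA wal_checkpoint` (busy, log frames, checkpointed frames) is standard SQLite behavior.

```python
import logging
import math
import sqlite3
import time
from collections import deque

log = logging.getLogger("wal_checkpoint")
_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def nearest_rank(ordered, pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class CheckpointStats:
    def __init__(self, window: int = 100, slow_ms: float = 50.0, stuck_ms: float = 200.0):
        self.samples = deque(maxlen=window)  # bounded: keeps the last `window` timings
        self.slow_ms = slow_ms
        self.stuck_ms = stuck_ms

    def timed_checkpoint(self, conn: sqlite3.Connection, mode: str = "PASSIVE") -> float:
        """Run a WAL checkpoint and record its elapsed time in milliseconds."""
        if mode not in _MODES:
            raise ValueError(f"unknown checkpoint mode: {mode}")
        start = time.perf_counter()
        busy, log_frames, _checkpointed = conn.execute(
            f"PRAGMA wal_checkpoint({mode})"
        ).fetchone()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.record(elapsed_ms, busy, log_frames)
        return elapsed_ms

    def record(self, elapsed_ms: float, busy: int = 0, log_frames: int = 0) -> None:
        self.samples.append(elapsed_ms)
        if elapsed_ms > self.stuck_ms:
            log.warning(
                "slow WAL checkpoint: elapsed_ms=%.1f busy=%s log_frames=%s",
                elapsed_ms, busy, log_frames,
            )

    def summary(self) -> dict:
        """Session-summary fields; empty if no checkpoint has been recorded."""
        if not self.samples:
            return {}
        ordered = sorted(self.samples)
        return {
            "wal_checkpoint_p50_ms": nearest_rank(ordered, 50),
            "wal_checkpoint_p95_ms": nearest_rank(ordered, 95),
            "wal_checkpoint_slow_count": sum(1 for s in ordered if s > self.slow_ms),
        }
```

If the slow count should cover the whole session instead of the last 100 checkpoints, a separate running counter would be needed alongside the deque.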

4 — Async Safety: Fail-Fast, No Degraded Mode

Accepted. Agreed — silent fallback was never in scope. Concretely:

inspect.iscoroutinefunction(submit_and_wait) smoke check at gateway init. If True, raise RuntimeError("xrpl-py submit_and_wait is now async; engine requires sync path — pin version or migrate") before any engine start.
_submit_and_wait_safe wrapper: if the return value is a coroutine (detected via inspect.iscoroutine), close it and raise the same error — do not let the cancel path degrade to a silent failure.
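
Both checks can be sketched with the standard `inspect` module. The function names below are illustrative, the callable is passed in so that no xrpl-py import is needed, and the error text is abbreviated from the message quoted above.

```python
import inspect

# Abbreviated stand-in for the RuntimeError message specified above.
ASYNC_ERROR = "submit_and_wait is async; engine requires the sync path"


def assert_sync_entrypoint(fn) -> None:
    """Gateway-init smoke check: refuse to start if fn is declared async."""
    if inspect.iscoroutinefunction(fn):
        raise RuntimeError(ASYNC_ERROR)


def submit_and_wait_safe(fn, *args, **kwargs):
    """Call fn and fail fast if it unexpectedly hands back a coroutine."""
    result = fn(*args, **kwargs)
    if inspect.iscoroutine(result):
        # Close the coroutine so it is not left un-awaited, then raise.
        result.close()
        raise RuntimeError(ASYNC_ERROR)
    return result
```

The second check matters because a plain function that wraps an async call passes `iscoroutinefunction` yet still returns a coroutine at call time.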

5 — Archive/ Excluded from Grep-Based Audits

Accepted. Documentation change in chore/archive-cleanup:

docs/AUDIT_CONVENTIONS.md (new): states that all audits use grep -rn PATTERN neo_engine/ tests/ config/ run_paper_session.py summarize_paper_run.py and explicitly exclude Archive/, INTEL/, NEO Back up/, neo_simulator/simulation_runner.bak.py.
.gitignore gains entries for the leaked .fuse_hidden* and <MagicMock ...> patterns.
AGENTS.md gets a one-line pointer to the new conventions doc so Vesper and I reach the same conclusion next time.

6 — Branch Plan — Revised (Final)

Re-ordered per 2D (distance-to-touch elevated) and new invariant scope on branch 2:

| # | Branch | Risk | Notes |
| --- | --- | --- | --- |
| 1 | `fix/halt-reason-lifecycle` | low | 1 commit, 3 tests. Includes `config_mismatch` taxonomy entry. |
| 2 | `fix/summarize-paper-run-capital-overlay` | low-med | 2 commits (overlay fix + three-source invariant & preflight gate), 4 tests. |
| 3 | `chore/archive-cleanup` | low | File moves + `AUDIT_CONVENTIONS.md` + `.gitignore`. |
| 4 | `fix/flag-029-async-pin-and-orphan` | low | Fail-fast smoke + orphan backfill. |
| 5 | `audit/config-wiring-pass` | low | Traceability matrix + `clob_switch_threshold_bps` promotion + any orphan removals. |
| 6 | `feat/distance-to-touch-diagnostic` | medium | PRIMARY 7.3 METRIC — Moved up; columns on `market_snapshots` + session summary histogram. |
| 7 | `fix/wal-checkpoint-hardening` | medium-high | Periodic PASSIVE + shutdown TRUNCATE + p50/p95 + concurrency test. |

Individual PRs, reviewed before the next is cut.

7 — Phase 7.3 Go/No-Go — Locked

Proceed only when all of the following are true:

fix/halt-reason-lifecycle merged
fix/summarize-paper-run-capital-overlay merged
fix/flag-029-async-pin-and-orphan merged
audit/config-wiring-pass merged + clob_switch_threshold_bps promoted + traceability matrix committed
feat/distance-to-touch-diagnostic merged (metric available for Phase 7.3 analysis)
S40 completes clean (≥30 min, post-merge config): ended_at populated, inventory_invariant.status == "ok", no config_mismatch halt, no silent failures in logs

chore/archive-cleanup and fix/wal-checkpoint-hardening are not strict gates for Phase 7.3 — they can ship in parallel with the first 7.3 session if Vesper/Katja approve.

8 — Operational Transition Noted

Your closing framing is accepted:

functionally correct system  →  operationally trustworthy system

Every branch in this plan is scoped so invariants, not vibes, decide whether the engine is trusted to trade. Distance-to-touch as the primary metric for Phase 7.3 is the capstone of that transition: it moves Phase 7 from "did we adjust the parameter?" to "did the parameter measurably move the quote relative to where the market is?"

Execution Signal

Awaiting Vesper sign-off and Katja's go. On green-light, I cut branch #1 first and return it as a patch for Katja's terminal per the standing git rule — no direct filesystem commits. One branch at a time; I pause after each for review before the next.

— Orion