Orion Delivery — Branch #7 fix/wal-checkpoint-hardening

To: Katja CC: Vesper, Atlas From: Orion Date: 2026-04-19


Status

Branch #7 complete. Single commit on fix/wal-checkpoint-hardening, cut from 83b87e6 (main tip with Branch #6 merged). 4 new tests passing, zero regressions — +4 passing / 0 new failed vs the Branch #6 baseline (491/371 → 495/371; failure set byte-identical).

Commit 1 — 673fe20 — feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035)

Patch at 02 Projects/NEO Trading Engine/patches/branch-7-wal-checkpoint-hardening/0001-feat-state-periodic-WAL-checkpoint-TRUNCATE-at-shutd.patch.

Built against Vesper's Q1–Q5 rulings and all four Atlas additions. All are pinned by tests or by inline audit comments in the commit body.

What landed

neo_engine/state_manager.py — new imports (threading, time, statistics, collections.deque); StateManager.__init__ gains four fields (interval, thread handle, stop event, latency deque maxlen=512); new methods start_wal_checkpoint_loop, _checkpoint_loop, _run_checkpoint, _log_checkpoint_aggregate; close() extended to the documented 5-step shutdown (stop.set() → join(timeout=5.0) → aggregate → TRUNCATE → conn.close()), all with log.error(exc_info=True) on any exception.
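The loop and shutdown shape described above can be sketched as follows. Method and field names come from this memo; the bodies are illustrative assumptions, not the actual implementation (the real class carries far more state):

```python
import logging
import sqlite3
import statistics
import threading
import time
from collections import deque
from typing import Optional

log = logging.getLogger("neo_engine.state_manager")


class StateManager:
    """Illustrative skeleton of the checkpoint machinery only."""

    def __init__(self, db_path: str) -> None:
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._wal_interval: int = 60
        self._wal_thread: Optional[threading.Thread] = None
        self._wal_stop = threading.Event()
        # maxlen=512 per Vesper Q5: ~8.5h of samples at a 60s cadence
        self._checkpoint_latencies: deque = deque(maxlen=512)

    def start_wal_checkpoint_loop(self, interval_seconds: int) -> None:
        if interval_seconds <= 0:  # Vesper Q1: quiet no-op keeps one-shot scripts thread-free
            return
        self._wal_interval = interval_seconds
        self._wal_thread = threading.Thread(target=self._checkpoint_loop, daemon=True)
        self._wal_thread.start()

    def _checkpoint_loop(self) -> None:
        # Atlas #2: one sequential thread, so checkpoints can never overlap
        while not self._wal_stop.wait(self._wal_interval):
            try:
                self._run_checkpoint()
            except Exception:
                log.error("wal_checkpoint failed", exc_info=True)  # Atlas #4

    def _run_checkpoint(self) -> None:
        t0 = time.monotonic()
        busy, log_frames, ckpt_frames = self._conn.execute(
            "PRAGMA wal_checkpoint(PASSIVE)"
        ).fetchone()
        elapsed_ms = (time.monotonic() - t0) * 1000.0
        self._checkpoint_latencies.append(elapsed_ms)
        log.info("wal_checkpoint mode=PASSIVE busy=%s log_frames=%s "
                 "checkpointed_frames=%s elapsed_ms=%.1f",
                 busy, log_frames, ckpt_frames, elapsed_ms)
        if elapsed_ms > 200:  # Atlas #1
            log.warning("wal_checkpoint slow elapsed_ms=%.1f", elapsed_ms)

    def _log_checkpoint_aggregate(self) -> None:
        lat = list(self._checkpoint_latencies)
        if not lat:
            return
        if len(lat) >= 2:  # quantiles(method="inclusive") needs >= 2 points
            cuts = statistics.quantiles(lat, n=100, method="inclusive")
            p50, p95 = cuts[49], cuts[94]
        else:
            p50 = p95 = lat[0]
        log.info("wal_checkpoint aggregate n=%d p50_ms=%.1f p95_ms=%.1f max_ms=%.1f",
                 len(lat), p50, p95, max(lat))

    def close(self) -> None:
        # Atlas #3 ordering: stop.set() -> join -> aggregate -> TRUNCATE -> conn.close()
        self._wal_stop.set()
        if self._wal_thread is not None:
            self._wal_thread.join(timeout=5.0)
        try:
            self._log_checkpoint_aggregate()
            self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except Exception:
            log.error("shutdown wal maintenance failed", exc_info=True)
        self._conn.close()  # unconditional per Vesper Q3
```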

neo_engine/config.py — EngineConfig.wal_checkpoint_interval_seconds: int = 60. The loader reads the key with default 60 and coerces negatives to 0 per Vesper Q1 (no hard minimum; policy guards deferred).
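The coercion itself is tiny; a hedged sketch of the loader behavior (the helper name and dict shape here are hypothetical, only the default-60 / clamp-to-0 semantics come from the memo):

```python
def load_wal_interval(engine_cfg: dict) -> int:
    """Hypothetical loader helper: default 60, coerce negatives to 0 (Vesper Q1).

    No hard minimum is enforced; policy guards are deferred per the ruling.
    """
    raw = int(engine_cfg.get("wal_checkpoint_interval_seconds", 60))
    return max(0, raw)
```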

neo_engine/main_loop.py — one-line call inserted in _startup between _check_config_invariants() and set_engine_state("engine_status", RUNNING):

self._state.start_wal_checkpoint_loop(
    self._config.engine.wal_checkpoint_interval_seconds
)
An interval <= 0 is a quiet no-op inside the method, so one-shot scripts that construct StateManager directly (scripts/inject_capital.py, scripts/write_synthetic_initial_basis.py) stay thread-free as designed.

config/config.yaml + config/config.example.yaml — new key wal_checkpoint_interval_seconds: 60 under engine: in both, with comments pointing at FLAG-035 and the "set to 0 to disable" knob.
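For reference, the key as it would sit under engine: (comment wording approximated from the memo's description):

```yaml
engine:
  # FLAG-035: periodic WAL checkpoint cadence in seconds; set to 0 to disable
  wal_checkpoint_interval_seconds: 60
```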

tests/test_wal_checkpoint_hardening.py (new, 4 tests, all on-disk via tempfile.TemporaryDirectory(); an :memory: database cannot produce WAL files):

  1. test_periodic_checkpoint_logs_elapsed_and_counters — starts a 1s-interval loop, sleeps 2.5s, calls close(). Asserts ≥2 wal_checkpoint INFO records (PASSIVE) + 1 TRUNCATE + exactly 1 wal_checkpoint aggregate with monotone max_ms ≥ p95_ms ≥ p50_ms. Every per-checkpoint record carries mode, busy, log_frames, checkpointed_frames, elapsed_ms.
  2. test_concurrent_writes_and_checkpoint_preserve_integrity — 200 engine_state upserts racing a 1s-interval checkpoint loop. Pre-close PRAGMA quick_check == "ok" and exact row count intact.
  3. test_shutdown_truncate_leaves_empty_wal — 50 upserts generate a non-trivial -wal sidecar; after close() the sidecar is either removed or size zero.
  4. test_slow_checkpoint_emits_warning — connection wrapped in a proxy that sleeps 250 ms on every PRAGMA wal_checkpoint call; asserts both the INFO line and the WARNING line ("wal_checkpoint slow") are emitted with elapsed_ms > 200.

sqlite3.Connection.execute is a read-only C attribute that cannot be mock.patch.object'd directly, so test 4 uses a thin _SlowExecuteConn proxy (delegates every other attribute to the real connection). Documented in the test file.

tests/test_halt_reason_lifecycle.py fixture update — _make_startup_engine now sets engine._config.engine.wal_checkpoint_interval_seconds = 0. The existing MagicMock-based fixture returned a MagicMock for the new attribute, which fails the <= 0 comparison. Same pattern as the Branch #5 invariant-fields fixture update. Without this the two TestStartupHaltReasonLifecycle tests regress.

How it maps to the Vesper/Atlas rulings

  • Vesper Q1 — ≤0 disabled, no minimum. Implementation: loader coerces negatives to 0; start_wal_checkpoint_loop early-returns quietly on interval <= 0. Pinned by: not explicitly pinned by a test — a test would just assert a disabled-path no-op. Can add if desired.
  • Vesper Q2 — explicit opt-in. Implementation: __init__ initializes state but starts no thread; start_wal_checkpoint_loop is the only entry point, called from _startup. Pinned by: tests 2 and 3 rely on the explicit call; one-shot scripts work unchanged.
  • Vesper Q3 — TRUNCATE log-and-continue. Implementation: close() wraps TRUNCATE in try/except with log.error(exc_info=True); _conn.close() runs unconditionally after. Pinned by: commit body §close-sequence. Could add a test that injects a TRUNCATE exception and asserts _conn.close() still ran — flag if you want it.
  • Vesper Q4 — aggregate required. Implementation: _log_checkpoint_aggregate emitted once in close() after thread join; statistics.quantiles(method="inclusive") at n=100, indices 49/94. Pinned by: test 1 asserts the aggregate record plus monotone p50/p95/max.
  • Vesper Q5 — deque(maxlen=512). Implementation: self._checkpoint_latencies: Deque[float] = deque(maxlen=512). Pinned by: inline comment at __init__; ~8.5h of coverage at 60s cadence.
  • Atlas #1 — 200 ms WARNING. Implementation: per-checkpoint INFO always emitted; additional WARNING when elapsed_ms > 200. Pinned by: test 4.
  • Atlas #2 — no overlap. Implementation: single-thread sequential loop. Pinned by: class-level audit comment; the single-thread design is the mechanism itself.
  • Atlas #3 — shutdown ordering. Implementation: 5-step sequence documented in the close() docstring. Pinned by: test 1 (aggregate appears, then TRUNCATE log) plus commit body.
  • Atlas #4 — no silent except. Implementation: every except in the checkpoint path calls log.error(..., exc_info=True). Pinned by: inline across _checkpoint_loop, _run_checkpoint, _log_checkpoint_aggregate, close().
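The Q4 index arithmetic is easy to misread, so here is a quick standalone check. It assumes the aggregate computes exactly statistics.quantiles(latencies, n=100, method="inclusive"), per the ruling row above; with n=100 the call returns 99 cut points, so indices 49 and 94 are the 50th and 95th percentiles:

```python
import statistics

# uniform 1.0 .. 100.0 ms latencies make the expected percentiles obvious
latencies = [float(i) for i in range(1, 101)]
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
assert len(cuts) == 99           # 99 cut points for 100 buckets
p50, p95 = cuts[49], cuts[94]    # 50.5 and 95.05 for this data
```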

Test posture

  • Branch #7-specific tests: 4 new passing (test_wal_checkpoint_hardening.py) + 2 fixture-updated tests still passing (test_halt_reason_lifecycle.py TestStartupHaltReasonLifecycle).
  • Full suite vs Branch #6 baseline: 495 passed / 371 failed (was 491 / 371). Net +4 passed, 0 new failed. Zero regressions — the 371 failure set is byte-identical to baseline.
  • The 371 remaining failures are the pre-existing cluster carried through Branches #1–#6 (test_xrpl_gateway, test_execution_engine, etc.). Unchanged by Branch #7.

End-to-end smoke (sandbox Linux)

INFO:neo_engine.state_manager:wal_checkpoint started
INFO:neo_engine.state_manager:wal_checkpoint
INFO:neo_engine.state_manager:wal_checkpoint
INFO:neo_engine.state_manager:wal_checkpoint aggregate
INFO:neo_engine.state_manager:wal_checkpoint

Two PASSIVE ticks in a 2.5s run, aggregate on close, one TRUNCATE. Post-close tempdir contains only the main .db file — -wal and -shm sidecars both gone.
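The post-close sidecar claim reproduces standalone with nothing but sqlite3; this is a scratch-tempdir sketch mirroring test 3, not engine code (table name and paths are illustrative):

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
db = os.path.join(tmp, "neo.db")

conn = sqlite3.connect(db)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE engine_state (k TEXT PRIMARY KEY, v TEXT)")
for i in range(50):  # enough upserts for a non-trivial -wal sidecar
    conn.execute("INSERT OR REPLACE INTO engine_state VALUES (?, ?)", (f"k{i}", "v"))
conn.commit()
assert os.path.getsize(db + "-wal") > 0       # sidecar grew

conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
# after TRUNCATE + close, the -wal sidecar is removed or zero-length
gone_or_empty = (not os.path.exists(db + "-wal")) or os.path.getsize(db + "-wal") == 0
```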

Copy-paste for Windows VS Code terminal

# Mirror branch from commit (cut from main tip with Branch #6 merged)
git fetch origin
git checkout -b fix/wal-checkpoint-hardening 83b87e6

# Apply the patch
git am "02 Projects\NEO Trading Engine\patches\branch-7-wal-checkpoint-hardening\0001-feat-state-periodic-WAL-checkpoint-TRUNCATE-at-shutd.patch"

# Verify
git log --oneline -2
python -m pytest tests/test_wal_checkpoint_hardening.py tests/test_halt_reason_lifecycle.py -q

# After Vesper sign-off:
git push -u origin fix/wal-checkpoint-hardening

What this unlocks

  • Paper shakedown before live. The S38-shape corruption mode (a Windows CTRL_CLOSE_EVENT hard kill leaving the WAL mid-frame and corrupting the DB header) is now bounded at 60 seconds of writes. A clean shutdown additionally truncates the WAL to zero, so a graceful close leaves no WAL behind to lose.
  • Operational visibility. Every checkpoint logs elapsed_ms plus SQLite's own busy/log_frames/checkpointed_frames. The wal_checkpoint slow warning surfaces checkpoints over 200 ms directly in the dashboard log stream. The end-of-session aggregate line gives you n/p50/p95/max in one row at close so the Phase 7.3 runbook has a latency posture number to watch.
  • Config knob ready. config.engine.wal_checkpoint_interval_seconds is a first-class YAML knob — 60 in paper, tunable for live without a code change. Setting to 0 cleanly disables the thread for tests or one-shot scripts.

Branch queue after this

  • S40 — Branches #6 and #7 merge, then S40 paper-shakedown.
  • Phase 7.3 offset calibration — unlocks once S40 is clean.

Explicit non-scope for this branch (from the investigation memo, reiterated here so the audit trail is clear): catching Windows CTRL_CLOSE_EVENT itself is out. Python's signal module cannot intercept it — only ctypes.SetConsoleCtrlHandler can. If and when we want a best-effort catch-window, that's a separate branch with its own investigation.

Standing by for Vesper review.

— Orion