Orion Delivery — Branch #7 fix/wal-checkpoint-hardening¶
To: Katja CC: Vesper, Atlas From: Orion Date: 2026-04-19
Status¶
Branch #7 complete. Single commit on fix/wal-checkpoint-hardening, cut from 83b87e6 (main tip with Branch #6 merged). 4 new tests passing, zero regressions — +4 passed / 0 new failures vs the Branch #6 baseline (491/371 → 495/371; failure set byte-identical).
| Commit | SHA | Subject |
|---|---|---|
| 1 | 673fe20 | feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035) |
Patch at 02 Projects/NEO Trading Engine/patches/branch-7-wal-checkpoint-hardening/0001-feat-state-periodic-WAL-checkpoint-TRUNCATE-at-shutd.patch.
Built against Vesper's Q1–Q5 rulings and all four Atlas additions. All are pinned by tests or by inline audit comments in the commit body.
What landed¶
`neo_engine/state_manager.py` — new imports (`threading`, `time`, `statistics`, `collections.deque`); `StateManager.__init__` gains four fields (interval, thread handle, stop event, latency deque with `maxlen=512`); new methods `start_wal_checkpoint_loop`, `_checkpoint_loop`, `_run_checkpoint`, `_log_checkpoint_aggregate`; `close()` extended to the documented 5-step shutdown (`stop.set()` → `join(timeout=5.0)` → aggregate → TRUNCATE → `conn.close()`), all with `log.error(exc_info=True)` on any exception.
`neo_engine/config.py` — `EngineConfig.wal_checkpoint_interval_seconds: int = 60`. The loader reads the key with default 60 and coerces negatives to 0 per Vesper Q1 (no hard minimum; policy guards deferred).
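The coercion rule is small enough to show inline. A hedged sketch — the function name is hypothetical, not the project's actual loader:

```python
def load_checkpoint_interval(engine_section: dict) -> int:
    """Read wal_checkpoint_interval_seconds with default 60 and coerce
    negatives to 0 (Vesper Q1: 0 disables, no hard minimum)."""
    value = int(engine_section.get("wal_checkpoint_interval_seconds", 60))
    return max(value, 0)
```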
`neo_engine/main_loop.py` — a one-line call to `start_wal_checkpoint_loop()` inserted in `_startup` between `_check_config_invariants()` and `set_engine_state("engine_status", RUNNING)`. An interval `<= 0` is a quiet no-op inside the method, so one-shot scripts that construct `StateManager` directly (`scripts/inject_capital.py`, `scripts/write_synthetic_initial_basis.py`) stay thread-free as designed.
config/config.yaml + config/config.example.yaml — new key wal_checkpoint_interval_seconds: 60 under engine: in both, with comments pointing at FLAG-035 and the "set to 0 to disable" knob.
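For reference, the key as it plausibly appears under `engine:` in both files — the comment wording here is illustrative, not quoted from the repo:

```yaml
engine:
  # FLAG-035: periodic WAL checkpoint cadence in seconds.
  # Set to 0 to disable the checkpoint thread.
  wal_checkpoint_interval_seconds: 60
```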
`tests/test_wal_checkpoint_hardening.py` (new, 4 tests, all on-disk via `tempfile.TemporaryDirectory()` — `:memory:` cannot produce WAL files):

- `test_periodic_checkpoint_logs_elapsed_and_counters` — starts a 1s-interval loop, sleeps 2.5s, calls `close()`. Asserts ≥2 `wal_checkpoint` INFO records (PASSIVE) + 1 TRUNCATE + exactly 1 `wal_checkpoint aggregate` with monotone `max_ms ≥ p95_ms ≥ p50_ms`. Every per-checkpoint record carries `mode`, `busy`, `log_frames`, `checkpointed_frames`, `elapsed_ms`.
- `test_concurrent_writes_and_checkpoint_preserve_integrity` — 200 `engine_state` upserts racing a 1s-interval checkpoint loop. Pre-close `PRAGMA quick_check == "ok"` and exact row count intact.
- `test_shutdown_truncate_leaves_empty_wal` — 50 upserts generate a non-trivial `-wal` sidecar; after `close()` the sidecar is either removed or size zero.
- `test_slow_checkpoint_emits_warning` — connection wrapped in a proxy that sleeps 250 ms on every `PRAGMA wal_checkpoint` call; asserts both the INFO line and the WARNING line ("wal_checkpoint slow") are emitted with `elapsed_ms > 200`.
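Test 3's mechanism can be reproduced standalone with plain `sqlite3` — a sketch, not the project's test, and the helper name is hypothetical:

```python
import os
import sqlite3
import tempfile

def demo_truncate_leaves_empty_wal() -> int:
    """Upsert in WAL mode, TRUNCATE-checkpoint, return the -wal sidecar size."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "state.db")
        conn = sqlite3.connect(db)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE engine_state (k TEXT PRIMARY KEY, v TEXT)")
        for i in range(50):
            conn.execute(
                "INSERT INTO engine_state VALUES (?, ?) "
                "ON CONFLICT(k) DO UPDATE SET v = excluded.v",
                (f"k{i % 10}", str(i)),
            )
        conn.commit()
        wal = db + "-wal"
        assert os.path.getsize(wal) > 0  # non-trivial sidecar before checkpoint
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        size = os.path.getsize(wal) if os.path.exists(wal) else 0
        conn.close()
        return size
```

With a single connection and no concurrent readers, the TRUNCATE checkpoint leaves the sidecar at zero bytes, matching the test's "removed or size zero" assertion.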
`sqlite3.Connection.execute` is a read-only C-level attribute that cannot be patched with `mock.patch.object`, so test 4 uses a thin `_SlowExecuteConn` proxy that delegates every other attribute to the real connection. Documented in the test file.
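The proxy pattern looks roughly like this — a sketch; the real `_SlowExecuteConn` lives in the test file and may differ in detail:

```python
import sqlite3
import time

class SlowExecuteConnSketch:
    """Delegating proxy: sqlite3.Connection.execute is a read-only C-level
    attribute, so it cannot be replaced on the instance; wrapping the whole
    connection is the workaround."""

    def __init__(self, real: sqlite3.Connection, delay_s: float = 0.25):
        self._real = real
        self._delay_s = delay_s

    def execute(self, sql, *args, **kwargs):
        if "wal_checkpoint" in sql.lower():
            time.sleep(self._delay_s)  # push elapsed_ms over the threshold
        return self._real.execute(sql, *args, **kwargs)

    def __getattr__(self, name):
        # commit, close, cursor, ... all pass through to the real connection.
        return getattr(self._real, name)
```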
tests/test_halt_reason_lifecycle.py fixture update — _make_startup_engine now sets engine._config.engine.wal_checkpoint_interval_seconds = 0. The existing MagicMock-based fixture returned a MagicMock for the new attribute, which fails the <= 0 comparison. Same pattern as the Branch #5 invariant-fields fixture update. Without this the two TestStartupHaltReasonLifecycle tests regress.
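Why the pin is needed: MagicMock auto-creates attributes, but its ordering comparisons return NotImplemented, so a `<= 0` guard raises TypeError instead of quietly disabling. A minimal reproduction:

```python
from unittest import mock

cfg = mock.MagicMock()
try:
    _ = cfg.engine.wal_checkpoint_interval_seconds <= 0
    raised = False
except TypeError:
    raised = True
assert raised  # MagicMock <= int is unorderable

# The fixture's fix: pin a real int so the disabled-path guard works.
cfg.engine.wal_checkpoint_interval_seconds = 0
assert cfg.engine.wal_checkpoint_interval_seconds <= 0
```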
How it maps to the Vesper/Atlas rulings¶
| Ruling | Implementation | Pinned by |
|---|---|---|
| Vesper Q1 — ≤0 disabled, no minimum | Loader coerces negatives to 0; `start_wal_checkpoint_loop` early-returns quietly on `interval <= 0` | Not explicitly pinned by a test — a test would just assert a disabled-path no-op. Can add if desired. |
| Vesper Q2 — explicit opt-in | `__init__` initializes state but no thread; `start_wal_checkpoint_loop` is the only entry point; called from `_startup` | Tests 2 and 3 rely on the explicit call. One-shot scripts work unchanged. |
| Vesper Q3 — TRUNCATE log-and-continue | `close()` wraps TRUNCATE in try/except with `log.error(exc_info=True)`; `_conn.close()` runs unconditionally after | Commit body §close-sequence. Could add a test that injects a TRUNCATE exception and asserts `_conn.close()` still ran — flag if you want it. |
| Vesper Q4 — aggregate required | `_log_checkpoint_aggregate` emitted once in `close()` after thread join; `statistics.quantiles(method="inclusive")` at n=100, indices 49/94 | Test 1 asserts the aggregate record + monotone p50/p95/max |
| Vesper Q5 — `deque(maxlen=512)` | `self._checkpoint_latencies: Deque[float] = deque(maxlen=512)` | Inline comment at `__init__`; ~8.5h coverage at 60s cadence |
| Atlas #1 — 200ms WARNING | Per-checkpoint INFO always emitted; additional WARNING when `elapsed_ms > 200` | Test 4 |
| Atlas #2 — no overlap | Single-thread sequential loop; documented in the class-level audit comment | Class-level audit comment; the single-thread design is the mechanism itself |
| Atlas #3 — shutdown ordering | 5-step sequence documented in the `close()` docstring | Test 1 (aggregate appears, then TRUNCATE log) + commit body |
| Atlas #4 — no silent except | Every `except` in the checkpoint path: `log.error(..., exc_info=True)` | Inline across `_checkpoint_loop`, `_run_checkpoint`, `_log_checkpoint_aggregate`, `close()` |
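Vesper Q4's index arithmetic can be sanity-checked standalone: `statistics.quantiles(n=100)` returns 99 cut points, so p50 and p95 sit at 0-based indices 49 and 94. A hedged sketch of the aggregate computation (function name hypothetical):

```python
import statistics
from collections import deque

def checkpoint_aggregate(latencies_ms) -> dict:
    """n/p50/p95/max summary in the shape the memo describes."""
    data = list(latencies_ms)
    if len(data) < 2:
        # quantiles() needs >= 2 data points; degrade gracefully.
        only = data[0] if data else 0.0
        return {"n": len(data), "p50_ms": only, "p95_ms": only, "max_ms": only}
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return {
        "n": len(data),
        "p50_ms": cuts[49],   # 50th-percentile cut point
        "p95_ms": cuts[94],   # 95th-percentile cut point
        "max_ms": max(data),
    }
```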
Test posture¶
- Branch #7-specific tests: 4 new passing (`test_wal_checkpoint_hardening.py`) + 2 fixture-updated tests still passing (`test_halt_reason_lifecycle.py` `TestStartupHaltReasonLifecycle`).
- Full suite vs Branch #6 baseline: 495 passed / 371 failed (was 491 / 371). Net +4 passed, 0 new failed. Zero regressions — the 371 failure set is byte-identical to baseline.
- The 371 remaining failures are the pre-existing cluster carried through Branches #1–#6 (test_xrpl_gateway, test_execution_engine, etc.). Unchanged by Branch #7.
End-to-end smoke (sandbox Linux)¶
INFO:neo_engine.state_manager:wal_checkpoint started
INFO:neo_engine.state_manager:wal_checkpoint
INFO:neo_engine.state_manager:wal_checkpoint
INFO:neo_engine.state_manager:wal_checkpoint aggregate
INFO:neo_engine.state_manager:wal_checkpoint
Two PASSIVE ticks in a 2.5s run, aggregate on close, one TRUNCATE. Post-close tempdir contains only the main .db file — -wal and -shm sidecars both gone.
Copy-paste for Windows VS Code terminal¶
# Mirror branch from commit (cut from main tip with Branch #6 merged)
git fetch origin
git checkout -b fix/wal-checkpoint-hardening 83b87e6
# Apply the patch
git am "02 Projects\NEO Trading Engine\patches\branch-7-wal-checkpoint-hardening\0001-feat-state-periodic-WAL-checkpoint-TRUNCATE-at-shutd.patch"
# Verify
git log --oneline -2
python -m pytest tests/test_wal_checkpoint_hardening.py tests/test_halt_reason_lifecycle.py -q
# After Vesper sign-off:
git push -u origin fix/wal-checkpoint-hardening
What this unlocks¶
- Paper shakedown before live. The S38-shape corruption mode (a Windows `CTRL_CLOSE_EVENT` hard kill leaving the WAL mid-frame and corrupting the DB header) is now bounded at 60 seconds of writes. A clean shutdown additionally truncates the WAL to zero, so a graceful close leaves nothing recoverable to lose.
- Operational visibility. Every checkpoint logs `elapsed_ms` plus SQLite's own `busy`/`log_frames`/`checkpointed_frames`. The `wal_checkpoint slow` warning surfaces checkpoints over 200 ms directly in the dashboard log stream. The end-of-session aggregate line gives you `n`/`p50`/`p95`/`max` in one row at close, so the Phase 7.3 runbook has a latency-posture number to watch.
- Config knob ready. `config.engine.wal_checkpoint_interval_seconds` is a first-class YAML knob — 60 in paper, tunable for live without a code change. Setting it to 0 cleanly disables the thread for tests or one-shot scripts.
Branch queue after this¶
- S40 — Branches #6 and #7 merge, then S40 paper-shakedown.
- Phase 7.3 offset calibration — unlocks once S40 is clean.
Explicit non-scope for this branch (from the investigation memo, reiterated here so the audit trail is clear): catching Windows CTRL_CLOSE_EVENT itself is out. Python's signal module cannot intercept it — only ctypes.SetConsoleCtrlHandler can. If and when we want a best-effort catch-window, that's a separate branch with its own investigation.
Standing by for Vesper review.
— Orion