Orion Delivery — Branch #7 fix/wal-checkpoint-hardening

To: Katja CC: Vesper, Atlas From: Orion Date: 2026-04-19


Status

Branch #7 complete. Single commit on fix/wal-checkpoint-hardening, cut from 83b87e6 (main tip with Branch #6 merged). 4 new tests passing, zero regressions — +4 passing / 0 new failed vs the Branch #6 baseline (491/371 → 495/371; failure set byte-identical).

Commit 1 — 673fe20 — feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035)

Patch at 02 Projects/NEO Trading Engine/patches/branch-7-wal-checkpoint-hardening/0001-feat-state-periodic-WAL-checkpoint-TRUNCATE-at-shutd.patch.

Built against Vesper's Q1–Q5 rulings and all four Atlas additions. All are pinned by tests or by inline audit comments in the commit body.

What landed

neo_engine/state_manager.py — new imports (threading, time, statistics, collections.deque); StateManager.__init__ gains four fields (interval, thread handle, stop event, latency deque maxlen=512); new methods start_wal_checkpoint_loop, _checkpoint_loop, _run_checkpoint, _log_checkpoint_aggregate; close() extended to the documented 5-step shutdown (stop.set() → join(timeout=5.0) → aggregate → TRUNCATE → conn.close()), all with log.error(exc_info=True) on any exception.
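The loop and shutdown shape described above can be sketched as follows. Method and field names come from this memo; the bodies are illustrative assumptions, not the actual implementation (the real class carries far more state):

```python
import logging
import sqlite3
import statistics
import threading
import time
from collections import deque
from typing import Optional

log = logging.getLogger("neo_engine.state_manager")


class StateManager:
    """Illustrative skeleton of the checkpoint machinery only."""

    def __init__(self, db_path: str) -> None:
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._wal_interval: int = 60
        self._wal_thread: Optional[threading.Thread] = None
        self._wal_stop = threading.Event()
        # maxlen=512 per Vesper Q5: ~8.5h of samples at a 60s cadence
        self._checkpoint_latencies: deque = deque(maxlen=512)

    def start_wal_checkpoint_loop(self, interval_seconds: int) -> None:
        if interval_seconds <= 0:  # Vesper Q1: quiet no-op keeps one-shot scripts thread-free
            return
        self._wal_interval = interval_seconds
        self._wal_thread = threading.Thread(target=self._checkpoint_loop, daemon=True)
        self._wal_thread.start()

    def _checkpoint_loop(self) -> None:
        # Atlas #2: one sequential thread, so checkpoints can never overlap
        while not self._wal_stop.wait(self._wal_interval):
            try:
                self._run_checkpoint()
            except Exception:
                log.error("wal_checkpoint failed", exc_info=True)  # Atlas #4

    def _run_checkpoint(self) -> None:
        t0 = time.monotonic()
        busy, log_frames, ckpt_frames = self._conn.execute(
            "PRAGMA wal_checkpoint(PASSIVE)"
        ).fetchone()
        elapsed_ms = (time.monotonic() - t0) * 1000.0
        self._checkpoint_latencies.append(elapsed_ms)
        log.info("wal_checkpoint mode=PASSIVE busy=%s log_frames=%s "
                 "checkpointed_frames=%s elapsed_ms=%.1f",
                 busy, log_frames, ckpt_frames, elapsed_ms)
        if elapsed_ms > 200:  # Atlas #1
            log.warning("wal_checkpoint slow elapsed_ms=%.1f", elapsed_ms)

    def _log_checkpoint_aggregate(self) -> None:
        lat = list(self._checkpoint_latencies)
        if not lat:
            return
        if len(lat) >= 2:  # quantiles(method="inclusive") needs >= 2 points
            cuts = statistics.quantiles(lat, n=100, method="inclusive")
            p50, p95 = cuts[49], cuts[94]
        else:
            p50 = p95 = lat[0]
        log.info("wal_checkpoint aggregate n=%d p50_ms=%.1f p95_ms=%.1f max_ms=%.1f",
                 len(lat), p50, p95, max(lat))

    def close(self) -> None:
        # Atlas #3 ordering: stop.set() -> join -> aggregate -> TRUNCATE -> conn.close()
        self._wal_stop.set()
        if self._wal_thread is not None:
            self._wal_thread.join(timeout=5.0)
        try:
            self._log_checkpoint_aggregate()
            self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except Exception:
            log.error("shutdown wal maintenance failed", exc_info=True)
        self._conn.close()  # unconditional per Vesper Q3
```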

neo_engine/config.py — EngineConfig.wal_checkpoint_interval_seconds: int = 60. The loader reads the key with default 60 and coerces negatives to 0 per Vesper Q1 (no hard minimum; policy guards deferred).
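The coercion itself is tiny; a hedged sketch of the loader behavior (the helper name and dict shape here are hypothetical, only the default-60 / clamp-to-0 semantics come from the memo):

```python
def load_wal_interval(engine_cfg: dict) -> int:
    """Hypothetical loader helper: default 60, coerce negatives to 0 (Vesper Q1).

    No hard minimum is enforced; policy guards are deferred per the ruling.
    """
    raw = int(engine_cfg.get("wal_checkpoint_interval_seconds", 60))
    return max(0, raw)
```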

neo_engine/main_loop.py — one-line call inserted in _startup between _check_config_invariants() and set_engine_state("engine_status", RUNNING):

self._state.start_wal_checkpoint_loop(
    self._config.engine.wal_checkpoint_interval_seconds
)
An interval <= 0 is a quiet no-op inside the method, so one-shot scripts that construct StateManager directly (scripts/inject_capital.py, scripts/write_synthetic_initial_basis.py) stay thread-free as designed.

config/config.yaml + config/config.example.yaml — new key wal_checkpoint_interval_seconds: 60 under engine: in both, with comments pointing at FLAG-035 and the "set to 0 to disable" knob.
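For reference, the key as it would sit under engine: (comment wording approximated from the memo's description):

```yaml
engine:
  # FLAG-035: periodic WAL checkpoint cadence in seconds; set to 0 to disable
  wal_checkpoint_interval_seconds: 60
```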

tests/test_wal_checkpoint_hardening.py (new, 4 tests, all on-disk via tempfile.TemporaryDirectory(); an :memory: database cannot produce WAL files):

  1. test_periodic_checkpoint_logs_elapsed_and_counters — starts a 1s-interval loop, sleeps 2.5s, calls close(). Asserts ≥2 wal_checkpoint INFO records (PASSIVE) + 1 TRUNCATE + exactly 1 wal_checkpoint aggregate with monotone max_ms ≥ p95_ms ≥ p50_ms. Every per-checkpoint record carries mode, busy, log_frames, checkpointed_frames, elapsed_ms.
  2. test_concurrent_writes_and_checkpoint_preserve_integrity — 200 engine_state upserts racing a 1s-interval checkpoint loop. Pre-close PRAGMA quick_check == "ok" and exact row count intact.
  3. test_shutdown_truncate_leaves_empty_wal — 50 upserts generate a non-trivial -wal sidecar; after close() the sidecar is either removed or size zero.
  4. test_slow_checkpoint_emits_warning — connection wrapped in a proxy that sleeps 250 ms on every PRAGMA wal_checkpoint call; asserts both the INFO line and the WARNING line ("wal_checkpoint slow") are emitted with elapsed_ms > 200.

sqlite3.Connection.execute is a read-only C attribute that cannot be mock.patch.object'd directly, so test 4 uses a thin _SlowExecuteConn proxy (delegates every other attribute to the real connection). Documented in the test file.

tests/test_halt_reason_lifecycle.py fixture update — _make_startup_engine now sets engine._config.engine.wal_checkpoint_interval_seconds = 0. The existing MagicMock-based fixture returned a MagicMock for the new attribute, which fails the <= 0 comparison. Same pattern as the Branch #5 invariant-fields fixture update. Without this the two TestStartupHaltReasonLifecycle tests regress.

How it maps to the Vesper/Atlas rulings

  • Vesper Q1 — ≤0 disabled, no minimum. Implementation: loader coerces negatives to 0; start_wal_checkpoint_loop early-returns quietly on interval <= 0. Pinned by: not explicitly pinned by a test — a test would just assert a disabled-path no-op. Can add if desired.
  • Vesper Q2 — explicit opt-in. Implementation: __init__ initializes state but starts no thread; start_wal_checkpoint_loop is the only entry point, called from _startup. Pinned by: tests 2 and 3 rely on the explicit call; one-shot scripts work unchanged.
  • Vesper Q3 — TRUNCATE log-and-continue. Implementation: close() wraps TRUNCATE in try/except with log.error(exc_info=True); _conn.close() runs unconditionally after. Pinned by: commit body §close-sequence. Could add a test that injects a TRUNCATE exception and asserts _conn.close() still ran — flag if you want it.
  • Vesper Q4 — aggregate required. Implementation: _log_checkpoint_aggregate emitted once in close() after thread join; statistics.quantiles(method="inclusive") at n=100, indices 49/94. Pinned by: test 1 asserts the aggregate record plus monotone p50/p95/max.
  • Vesper Q5 — deque(maxlen=512). Implementation: self._checkpoint_latencies: Deque[float] = deque(maxlen=512). Pinned by: inline comment at __init__; ~8.5h of coverage at 60s cadence.
  • Atlas #1 — 200 ms WARNING. Implementation: per-checkpoint INFO always emitted; additional WARNING when elapsed_ms > 200. Pinned by: test 4.
  • Atlas #2 — no overlap. Implementation: single-thread sequential loop. Pinned by: class-level audit comment; the single-thread design is the mechanism itself.
  • Atlas #3 — shutdown ordering. Implementation: 5-step sequence documented in the close() docstring. Pinned by: test 1 (aggregate appears, then TRUNCATE log) plus commit body.
  • Atlas #4 — no silent except. Implementation: every except in the checkpoint path calls log.error(..., exc_info=True). Pinned by: inline across _checkpoint_loop, _run_checkpoint, _log_checkpoint_aggregate, close().
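The Q4 index arithmetic is easy to misread, so here is a quick standalone check. It assumes the aggregate computes exactly statistics.quantiles(latencies, n=100, method="inclusive"), per the ruling row above; with n=100 the call returns 99 cut points, so indices 49 and 94 are the 50th and 95th percentiles:

```python
import statistics

# uniform 1.0 .. 100.0 ms latencies make the expected percentiles obvious
latencies = [float(i) for i in range(1, 101)]
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
assert len(cuts) == 99           # 99 cut points for 100 buckets
p50, p95 = cuts[49], cuts[94]    # 50.5 and 95.05 for this data
```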

Test posture

  • Branch #7-specific tests: 4 new passing (test_wal_checkpoint_hardening.py) + 2 fixture-updated tests still passing (test_halt_reason_lifecycle.py TestStartupHaltReasonLifecycle).
  • Full suite vs Branch #6 baseline: 495 passed / 371 failed (was 491 / 371). Net +4 passed, 0 new failed. Zero regressions — the 371 failure set is byte-identical to baseline.
  • The 371 remaining failures are the pre-existing cluster carried through Branches #1–#6 (test_xrpl_gateway, test_execution_engine, etc.). Unchanged by Branch #7.

End-to-end smoke (sandbox Linux)

INFO:neo_engine.state_manager:wal_checkpoint started
INFO:neo_engine.state_manager:wal_checkpoint
INFO:neo_engine.state_manager:wal_checkpoint
INFO:neo_engine.state_manager:wal_checkpoint aggregate
INFO:neo_engine.state_manager:wal_checkpoint

Two PASSIVE ticks in a 2.5s run, aggregate on close, one TRUNCATE. Post-close tempdir contains only the main .db file — -wal and -shm sidecars both gone.
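The post-close sidecar claim reproduces standalone with nothing but sqlite3; this is a scratch-tempdir sketch mirroring test 3, not engine code (table name and paths are illustrative):

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
db = os.path.join(tmp, "neo.db")

conn = sqlite3.connect(db)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE engine_state (k TEXT PRIMARY KEY, v TEXT)")
for i in range(50):  # enough upserts for a non-trivial -wal sidecar
    conn.execute("INSERT OR REPLACE INTO engine_state VALUES (?, ?)", (f"k{i}", "v"))
conn.commit()
assert os.path.getsize(db + "-wal") > 0       # sidecar grew

conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()
# after TRUNCATE + close, the -wal sidecar is removed or zero-length
gone_or_empty = (not os.path.exists(db + "-wal")) or os.path.getsize(db + "-wal") == 0
```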

Copy-paste for Windows VS Code terminal

# Mirror branch from commit (cut from main tip with Branch #6 merged)
git fetch origin
git checkout -b fix/wal-checkpoint-hardening 83b87e6

# Apply the patch
git am "02 Projects\NEO Trading Engine\patches\branch-7-wal-checkpoint-hardening\0001-feat-state-periodic-WAL-checkpoint-TRUNCATE-at-shutd.patch"

# Verify
git log --oneline -2
python -m pytest tests/test_wal_checkpoint_hardening.py tests/test_halt_reason_lifecycle.py -q

# After Vesper sign-off:
git push -u origin fix/wal-checkpoint-hardening

What this unlocks

  • Paper shakedown before live. The S38-shape corruption mode (a Windows CTRL_CLOSE_EVENT hard kill leaving the WAL mid-frame and corrupting the DB header) is now bounded at 60 seconds of writes. A clean shutdown additionally truncates the WAL to zero, so a graceful close leaves no WAL behind to lose.
  • Operational visibility. Every checkpoint logs elapsed_ms plus SQLite's own busy/log_frames/checkpointed_frames. The wal_checkpoint slow warning surfaces checkpoints over 200 ms directly in the dashboard log stream. The end-of-session aggregate line gives you n/p50/p95/max in one row at close so the Phase 7.3 runbook has a latency posture number to watch.
  • Config knob ready. config.engine.wal_checkpoint_interval_seconds is a first-class YAML knob — 60 in paper, tunable for live without a code change. Setting to 0 cleanly disables the thread for tests or one-shot scripts.

Branch queue after this

  • S40 — Branches #6 and #7 merge, then S40 paper-shakedown.
  • Phase 7.3 offset calibration — unlocks once S40 is clean.

Explicit non-scope for this branch (from the investigation memo, reiterated here so the audit trail is clear): catching Windows CTRL_CLOSE_EVENT itself is out. Python's signal module cannot intercept it — only ctypes.SetConsoleCtrlHandler can. If and when we want a best-effort catch-window, that's a separate branch with its own investigation.

Standing by for Vesper review.

— Orion