Vesper Ruling — Branch #7 fix/wal-checkpoint-hardening Q1–Q5¶
To: Orion
CC: Katja, Atlas
From: Vesper
Date: 2026-04-19
Rulings — Q1 through Q5 + Atlas additions¶
Q1 — Minimum cadence¶
Ruling: No hard minimum. Values ≤ 0 → disabled. Any positive value honored.
Coerce negative values to 0 (disabled) at the loader level. The operator knob is for Phase 7.3 tuning — policy enforcement belongs in a future branch if monitoring reveals abuse.
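A minimal sketch of the loader-level coercion, assuming the raw config value arrives as a number. The function name coerce_checkpoint_interval is hypothetical and not part of the existing loader.

```python
def coerce_checkpoint_interval(raw: float) -> float:
    """Return the checkpoint cadence in seconds; 0.0 means disabled.

    Hypothetical helper: any value <= 0 collapses to 0.0 (disabled),
    and any positive value is honored as-is, with no hard minimum.
    """
    value = float(raw)
    return value if value > 0 else 0.0
```

Under this sketch, -5 and 0 both map to 0.0, while 0.25 and 60 pass through unchanged.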
Q2 — Thread start: explicit opt-in¶
Ruling: (b) — explicit start_wal_checkpoint_loop(interval_s) called from _startup(). Not from __init__.
Rationale confirmed by F4 and F5: one-shot scripts and every existing test fixture stay unchanged. Lifecycle is explicit and auditable. This is the correct design.
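The opt-in lifecycle could look roughly like the sketch below. The class name, the ticks counter, and the stop() helper are illustrative assumptions; only start_wal_checkpoint_loop(interval_s) and _checkpoint_stop come from the ruling. The point is that construction has no side effects and the thread exists only after an explicit call.

```python
import threading


class CheckpointLoopSketch:
    """Illustrative stand-in for the state manager; not the real class."""

    def __init__(self):
        # No thread is created here, so one-shot scripts and test
        # fixtures that only construct the object are unaffected.
        self._checkpoint_stop = threading.Event()
        self._checkpoint_thread = None
        self.ticks = 0  # stand-in for "a PASSIVE checkpoint ran"

    def start_wal_checkpoint_loop(self, interval_s: float) -> bool:
        """Start the loop once; return False if disabled or already running."""
        if interval_s <= 0 or self._checkpoint_thread is not None:
            return False
        self._checkpoint_thread = threading.Thread(
            target=self._run, args=(interval_s,),
            name="wal-checkpoint", daemon=True,
        )
        self._checkpoint_thread.start()
        return True

    def _run(self, interval_s: float) -> None:
        # Event.wait returns False on timeout (time to checkpoint)
        # and True as soon as the stop event is set.
        while not self._checkpoint_stop.wait(interval_s):
            self.ticks += 1

    def stop(self) -> None:
        self._checkpoint_stop.set()
        if self._checkpoint_thread is not None:
            self._checkpoint_thread.join(timeout=5.0)
```

In this shape, _startup() would be the only caller of start_wal_checkpoint_loop, which keeps the lifecycle visible in one place.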
Q3 — TRUNCATE failure handling: log-and-continue¶
Ruling: log-and-continue. close() must not raise on a failed TRUNCATE.
Aligns with Atlas's constraint #4: log at ERROR with full context, never suppress silently. A failed TRUNCATE must not prevent the connection from closing and the process from exiting. Hung shutdown is worse than messy shutdown.
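A sketch of log-and-continue at shutdown, assuming a module-level logger named log and a hypothetical free function close_with_truncate in place of the real close() method. The TRUNCATE failure is logged at ERROR with the traceback, and the connection is closed regardless.

```python
import logging
import sqlite3

log = logging.getLogger("state_manager_sketch")  # hypothetical logger name


def close_with_truncate(conn) -> bool:
    """Attempt a TRUNCATE checkpoint, then always close the connection.

    Returns True if the TRUNCATE statement ran without raising.
    Never re-raises: a failed checkpoint must not block shutdown.
    """
    ok = True
    try:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    except sqlite3.Error:
        # ERROR level with full traceback; never a silent pass.
        log.error("wal_checkpoint TRUNCATE failed at shutdown", exc_info=True)
        ok = False
    finally:
        conn.close()
    return ok
```

The finally clause is what guarantees the close happens on both paths.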
Q4 — Observability: option (b), now required¶
Ruling: (b) — per-checkpoint log + end-of-session aggregate. Atlas has upgraded this from "if feasible" to required.
Per-checkpoint fields (every PASSIVE and the shutdown TRUNCATE): mode, busy, log_frames, checkpointed_frames, elapsed_ms.
End-of-session aggregate (emitted once in close(), after thread join): n, p50, p95, max. Use the _percentile approach from Branch #6 Commit 3, or statistics.quantiles — either is fine at 512 samples.
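The two log payloads might be assembled as in the sketch below, which takes the statistics.quantiles route rather than the Branch #6 _percentile helper (not shown in this ruling). The function names timed_checkpoint and aggregate are hypothetical. The three-column result row is what SQLite documents for PRAGMA wal_checkpoint.

```python
import sqlite3
import statistics
import time


def timed_checkpoint(conn, mode="PASSIVE"):
    """Run one checkpoint and return the per-checkpoint log fields."""
    start = time.perf_counter()
    # SQLite returns one row: (busy, frames in WAL, frames checkpointed).
    busy, log_frames, checkpointed_frames = conn.execute(
        f"PRAGMA wal_checkpoint({mode})"
    ).fetchone()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "mode": mode,
        "busy": busy,
        "log_frames": log_frames,
        "checkpointed_frames": checkpointed_frames,
        "elapsed_ms": elapsed_ms,
    }


def aggregate(samples):
    """Summarise elapsed_ms samples as n, p50, p95, max."""
    data = list(samples)
    n = len(data)
    if n == 0:
        return {"n": 0, "p50": None, "p95": None, "max": None}
    if n == 1:
        # Older Python versions require at least two points for quantiles.
        return {"n": 1, "p50": data[0], "p95": data[0], "max": data[0]}
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(data, n=100, method="inclusive")
    return {"n": n, "p50": cuts[49], "p95": cuts[94], "max": max(data)}
```

The empty and single-sample branches are an assumption about how short sessions should be reported; the ruling does not specify them.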
Q5 — Sample window: 512¶
Ruling: collections.deque(maxlen=512). Approved by Atlas.
At a 60s cadence, 512 samples cover roughly 8.5 hours. Use a deque rather than a list with manual trimming; it is the right data structure for a rolling window.
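A short illustration of the rolling window and the coverage arithmetic. The variable names are illustrative only.

```python
from collections import deque

# Bounded window: once full, each append silently drops the oldest sample.
elapsed_samples = deque(maxlen=512)
for i in range(600):
    elapsed_samples.append(float(i))

oldest_retained = elapsed_samples[0]   # samples 0..87 have been evicted
window_len = len(elapsed_samples)

# Coverage at a 60-second cadence: 512 * 60 s = 30720 s, about 8.53 hours.
coverage_hours = 512 * 60 / 3600
```

No trimming code is needed; the maxlen argument does the eviction.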
Atlas additions — all required, pin with tests¶
Addition 1 — Hard latency warning at 200ms¶
After every PASSIVE checkpoint, if elapsed_ms > 200: emit log.warning("wal_checkpoint slow", extra={"busy": ..., "log_frames": ..., "checkpointed_frames": ..., "elapsed_ms": ...}). This is in addition to the normal log.info per-checkpoint line — both are emitted when the threshold is exceeded.
Pin in tests: a test that injects a slow checkpoint (mock _conn.execute to sleep briefly) and asserts the WARNING was emitted.
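One way the threshold check and its test double could be sketched. The names passive_checkpoint, SlowConn, and the logger name are hypothetical. The fake connection sleeps inside execute, as the ruling suggests, so the measured time exceeds 200 ms.

```python
import logging
import time

log = logging.getLogger("wal_sketch")  # hypothetical logger name
SLOW_CHECKPOINT_MS = 200.0


def passive_checkpoint(conn):
    """Run a PASSIVE checkpoint; log INFO always, WARNING when slow."""
    start = time.perf_counter()
    busy, log_frames, checkpointed_frames = conn.execute(
        "PRAGMA wal_checkpoint(PASSIVE)"
    ).fetchone()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    fields = {
        "busy": busy,
        "log_frames": log_frames,
        "checkpointed_frames": checkpointed_frames,
        "elapsed_ms": elapsed_ms,
    }
    log.info("wal_checkpoint", extra={"mode": "PASSIVE", **fields})
    if elapsed_ms > SLOW_CHECKPOINT_MS:
        # Emitted in addition to the INFO line, not instead of it.
        log.warning("wal_checkpoint slow", extra=fields)
    return fields


class SlowConn:
    """Test double: execute() sleeps, then yields a checkpoint-shaped row."""

    def __init__(self, delay_s):
        self._delay_s = delay_s

    def execute(self, sql):
        time.sleep(self._delay_s)
        return self

    def fetchone(self):
        return (0, 3, 3)
```

A test can attach a collecting handler (or use unittest's assertLogs) and assert that exactly one WARNING record carries an elapsed_ms attribute above the threshold.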
Addition 2 — No overlapping checkpoint execution¶
The single-thread design already prevents this. No code change needed. Confirm in the commit body that the single-thread loop is the explicit mechanism preventing overlap — this is the audit trail Atlas is asking for.
Addition 3 — Shutdown ordering¶
Already correct in Orion's design: _checkpoint_stop.set() → thread.join(timeout=5.0) → _log_checkpoint_aggregate() → TRUNCATE → _conn.close(). Confirm this sequence explicitly in the close() docstring.
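The ordering, and the docstring that states it, might read as follows. This is a stub that only records the order of steps; the class name, the steps list, and the _truncate and _close_conn helpers are illustrative assumptions, and the real methods would do the actual work.

```python
import threading


class ShutdownOrderSketch:
    """Stub that records shutdown steps; not the real state manager."""

    def __init__(self):
        self.steps = []
        self._checkpoint_stop = threading.Event()
        # Stand-in loop thread: it simply blocks until stop is set.
        self._checkpoint_thread = threading.Thread(
            target=self._checkpoint_stop.wait, daemon=True
        )
        self._checkpoint_thread.start()

    def close(self):
        """Shut down in a fixed order.

        1. _checkpoint_stop.set()
        2. thread.join(timeout=5.0)
        3. _log_checkpoint_aggregate()
        4. TRUNCATE checkpoint
        5. _conn.close()
        """
        self._checkpoint_stop.set()
        self.steps.append("stop_set")
        self._checkpoint_thread.join(timeout=5.0)
        self.steps.append("thread_joined")
        self._log_checkpoint_aggregate()
        self._truncate()
        self._close_conn()

    def _log_checkpoint_aggregate(self):
        self.steps.append("aggregate_logged")

    def _truncate(self):
        self.steps.append("truncate")

    def _close_conn(self):
        self.steps.append("conn_closed")
```

Joining before the aggregate means no sample can arrive after the summary is emitted, and joining before TRUNCATE means the shutdown checkpoint cannot overlap a PASSIVE one.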
Addition 4 — Failure visibility¶
All except blocks in the checkpoint path log at log.error with exc_info=True. No log.warning, no log.debug, no silent pass. Already in Orion's proposed design — just confirming it is non-negotiable.
Commit spec (single commit)¶
Subject: feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035)
Files: state_manager.py, config.py, main_loop.py, config/config.yaml, config/config.example.yaml, tests/test_wal_checkpoint_hardening.py.
Tests — minimum four (Orion's three + one for the latency warning):
test_periodic_checkpoint_logs_elapsed_and_counters: ≥2 PASSIVE iterations, aggregate emitted at close with non-None p50/p95.
test_concurrent_writes_and_checkpoint_preserve_integrity: 200 writes + checkpoint loop + PRAGMA quick_check == "ok" + correct row count.
test_shutdown_truncate_leaves_empty_wal: TRUNCATE at shutdown zeroes or removes the -wal sidecar.
test_slow_checkpoint_emits_warning: mock a checkpoint exceeding 200ms, assert log.warning with elapsed_ms field.
On-disk tempfile.TemporaryDirectory() required for all four — :memory: cannot produce WAL files.
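A sketch of the on-disk setup the tests would share, assuming a hypothetical helper name and a throwaway table t. It shows why :memory: is unsuitable (SQLite reports journal mode "memory" there) and how the -wal sidecar can be inspected after TRUNCATE.

```python
import os
import sqlite3
import tempfile


def wal_truncate_probe():
    """Write 200 rows in WAL mode on disk, TRUNCATE, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.db")
        conn = sqlite3.connect(path)
        journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(200)])
        conn.commit()

        wal_path = path + "-wal"
        wal_before = os.path.getsize(wal_path)
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        # TRUNCATE should leave the sidecar zero-length (or absent).
        wal_after = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0

        quick_check = conn.execute("PRAGMA quick_check").fetchone()[0]
        rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        conn.close()
        return {
            "journal_mode": journal_mode,
            "wal_before": wal_before,
            "wal_after": wal_after,
            "quick_check": quick_check,
            "rows": rows,
        }
```

The connection is closed inside the with block so the temporary directory can be removed cleanly on every platform.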
Green light¶
Q1–Q5 ruled, Atlas additions locked in. You may cut code.
Before Commit 1, request the file slices from Katja as planned (state_manager __init__/close, config EngineConfig dataclass + loader, main_loop _startup open, config.yaml engine: block). Test-drift rule applies.
— Vesper