
Vesper Ruling — Branch #7 fix/wal-checkpoint-hardening Q1–Q5

To: Orion
CC: Katja, Atlas
From: Vesper
Date: 2026-04-19


Rulings — Q1 through Q5 + Atlas additions

Q1 — Minimum cadence

Ruling: No hard minimum. Values ≤ 0 → disabled. Any positive value honored.

Coerce negative values to 0 (disabled) at the loader level. The operator knob exists for Phase 7.3 tuning; policy enforcement, such as a floor on the cadence, belongs in a future branch, and only if monitoring reveals abuse.
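A minimal sketch of the loader-level coercion, for illustration only. The function name coerce_checkpoint_interval is hypothetical, and treating the raw config value as anything float() accepts is an assumption; the real loader in config.py may differ.

```python
def coerce_checkpoint_interval(raw):
    """Return the checkpoint interval in seconds; 0.0 means disabled.

    Hypothetical helper: name and signature are not from the real loader.
    """
    value = float(raw)
    # Zero and negative values both mean "disabled"; normalise them to 0.0.
    # Any positive value is honored as-is (no hard minimum).
    return value if value > 0 else 0.0
```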

Q2 — Thread start: explicit opt-in

Ruling: (b) — explicit start_wal_checkpoint_loop(interval_s) called from _startup(). Not from __init__.

Rationale confirmed by F4 and F5: one-shot scripts and every existing test fixture stay unchanged. Lifecycle is explicit and auditable. This is the correct design.
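One way the explicit opt-in could look, as a sketch under stated assumptions. The class name StateManagerSketch, the _checkpoint_thread attribute, the _checkpoint_once placeholder and the boolean return value are all hypothetical; only start_wal_checkpoint_loop(interval_s) and _checkpoint_stop come from the ruling.

```python
import threading


class StateManagerSketch:
    """Illustrative stand-in for the real state manager (hypothetical)."""

    def __init__(self):
        # __init__ only prepares state; it never starts a thread.
        self._checkpoint_stop = threading.Event()
        self._checkpoint_thread = None

    def _checkpoint_once(self):
        # Placeholder for the PASSIVE checkpoint call.
        pass

    def start_wal_checkpoint_loop(self, interval_s):
        """Start the loop only when called explicitly, e.g. from _startup()."""
        if interval_s <= 0 or self._checkpoint_thread is not None:
            return False  # disabled, or already running

        def _loop():
            # Event.wait returns True once the stop event is set,
            # which ends the loop without an extra sleep.
            while not self._checkpoint_stop.wait(interval_s):
                self._checkpoint_once()

        self._checkpoint_thread = threading.Thread(
            target=_loop, name="wal-checkpoint", daemon=True
        )
        self._checkpoint_thread.start()
        return True
```

A one-shot script or test fixture that never calls start_wal_checkpoint_loop gets no background thread at all, which is the property the ruling relies on.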

Q3 — TRUNCATE failure handling: log-and-continue

Ruling: log-and-continue. close() must not raise on a failed TRUNCATE.

Aligns with Atlas's constraint #4: log at ERROR with full context, never suppress silently. A failed TRUNCATE must not prevent the connection from closing and the process from exiting. A hung shutdown is worse than a messy one.
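A sketch of the log-and-continue shape. The function name close_with_truncate, the logger name and the choice to catch sqlite3.Error specifically are assumptions for illustration; the real close() may catch a wider or narrower set of exceptions.

```python
import logging
import sqlite3

log = logging.getLogger("state_manager")  # logger name is an assumption


def close_with_truncate(conn):
    """Attempt a TRUNCATE checkpoint, then close the connection regardless."""
    try:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    except sqlite3.Error:
        # Log at ERROR with the traceback; never swallow silently, never re-raise.
        log.error("wal_checkpoint TRUNCATE failed at shutdown", exc_info=True)
    finally:
        # The connection is closed whether or not the TRUNCATE succeeded.
        conn.close()
```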

Q4 — Observability: option (b), now required

Ruling: (b) — per-checkpoint log + end-of-session aggregate. Atlas has upgraded this from "if feasible" to required.

Per-checkpoint fields (every PASSIVE and the shutdown TRUNCATE): mode, busy, log_frames, checkpointed_frames, elapsed_ms.

End-of-session aggregate (emitted once in close(), after thread join): n, p50, p95, max. Use the _percentile approach from Branch #6 Commit 3, or statistics.quantiles — either is fine at 512 samples.
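The aggregate could be computed roughly as follows with statistics.quantiles. This is a sketch, not the _percentile helper from Branch #6, which is not reproduced here; the function name summarize_elapsed, the returned dict shape and the handling of zero or one sample are assumptions.

```python
import statistics


def summarize_elapsed(samples):
    """Return n, p50, p95 and max for a sequence of elapsed_ms values."""
    n = len(samples)
    if n == 0:
        return {"n": 0, "p50": None, "p95": None, "max": None}
    if n == 1:
        # statistics.quantiles needs at least two points on some Python versions.
        only = samples[0]
        return {"n": 1, "p50": only, "p95": only, "max": only}
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 94 the 95th. "inclusive" treats the samples as the full population.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"n": n, "p50": cuts[49], "p95": cuts[94], "max": max(samples)}
```

The function accepts any sequence, so a deque of samples can be passed directly.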

Q5 — Sample window: 512

Ruling: collections.deque(maxlen=512). Approved by Atlas.

At a 60s cadence, 512 samples cover ~8.5 hours (512 × 60s = 30,720s). Use a deque, not a list-with-trim; it is the right data structure for a rolling window.
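A short illustration of the rolling-window behaviour and the coverage arithmetic. The names SAMPLE_WINDOW, CADENCE_S and samples are illustrative, and the appended values are synthetic.

```python
from collections import deque

SAMPLE_WINDOW = 512
CADENCE_S = 60

# A bounded deque drops the oldest sample automatically once it is full.
samples = deque(maxlen=SAMPLE_WINDOW)
for elapsed_ms in range(600):  # synthetic values, 600 appends
    samples.append(float(elapsed_ms))

# 512 samples at one per 60 s span 30,720 s, a little over 8.5 hours.
coverage_hours = SAMPLE_WINDOW * CADENCE_S / 3600
```

After 600 appends only the most recent 512 values remain, with no explicit trimming code.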


Atlas additions — all required, pin with tests

Addition 1 — Hard latency warning at 200ms

After every PASSIVE checkpoint, if elapsed_ms > 200: emit log.warning("wal_checkpoint slow", extra={"busy": ..., "log_frames": ..., "checkpointed_frames": ..., "elapsed_ms": ...}). This is in addition to the normal log.info per-checkpoint line — both are emitted when the threshold is exceeded.

Pin in tests: a test that injects a slow checkpoint (mock _conn.execute to sleep briefly) and asserts the WARNING was emitted.
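A sketch of the per-checkpoint logging with the 200ms warning. The function name passive_checkpoint_logged, the logger name and the SLOW_CHECKPOINT_MS constant are hypothetical; the field names and the "wal_checkpoint slow" message come from the ruling. The three-column result row (busy, log frames, checkpointed frames) is what SQLite's wal_checkpoint pragma returns.

```python
import logging
import time

log = logging.getLogger("state_manager.wal")  # logger name is an assumption
SLOW_CHECKPOINT_MS = 200.0


def passive_checkpoint_logged(conn):
    """Run one PASSIVE checkpoint, log it, and warn if it was slow."""
    started = time.perf_counter()
    busy, log_frames, checkpointed_frames = conn.execute(
        "PRAGMA wal_checkpoint(PASSIVE)"
    ).fetchone()
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    fields = {
        "busy": busy,
        "log_frames": log_frames,
        "checkpointed_frames": checkpointed_frames,
        "elapsed_ms": elapsed_ms,
    }
    # The INFO line is always emitted.
    log.info("wal_checkpoint", extra={"mode": "PASSIVE", **fields})
    # The WARNING is emitted in addition to it when the threshold is exceeded.
    if elapsed_ms > SLOW_CHECKPOINT_MS:
        log.warning("wal_checkpoint slow", extra=fields)
    return elapsed_ms
```

In a test, a stand-in connection whose execute() sleeps slightly longer than 200ms is enough to trigger the warning path without a real database.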

Addition 2 — No overlapping checkpoint execution

The single-thread design already prevents this. No code change needed. Confirm in the commit body that the single-thread loop is the explicit mechanism preventing overlap — this is the audit trail Atlas is asking for.

Addition 3 — Shutdown ordering

Already correct in Orion's design: _checkpoint_stop.set() → thread.join(timeout=5.0) → _log_checkpoint_aggregate() → TRUNCATE → _conn.close(). Confirm this sequence explicitly in the close() docstring.
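The shutdown sequence can be sketched as below, with the order stated in the docstring as requested. The class CloseOrderSketch, the calls list and the _truncate and _close_conn helpers are hypothetical stand-ins that only record the order; _checkpoint_stop, the 5.0s join timeout and _log_checkpoint_aggregate are the names used in the ruling.

```python
import threading


class CloseOrderSketch:
    """Records the order of shutdown steps (illustrative only)."""

    def __init__(self):
        self.calls = []
        self._checkpoint_stop = threading.Event()
        # Stand-in for the checkpoint loop: blocks until the stop event is set.
        self._thread = threading.Thread(
            target=self._checkpoint_stop.wait, daemon=True
        )
        self._thread.start()

    def _log_checkpoint_aggregate(self):
        self.calls.append("aggregate")

    def _truncate(self):
        self.calls.append("truncate")

    def _close_conn(self):
        self.calls.append("conn_close")

    def close(self):
        """Shutdown order: set stop event, join thread (5.0s timeout),
        log aggregate, TRUNCATE checkpoint, close connection."""
        self._checkpoint_stop.set()
        self._thread.join(timeout=5.0)
        self.calls.append("join_timeout" if self._thread.is_alive() else "joined")
        self._log_checkpoint_aggregate()
        self._truncate()
        self._close_conn()
```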

Addition 4 — Failure visibility

All except blocks in the checkpoint path log at log.error with exc_info=True. No log.warning, no log.debug, no silent pass. Already in Orion's proposed design — just confirming it is non-negotiable.


Commit spec (single commit)

Subject: feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035)

Files: state_manager.py, config.py, main_loop.py, config/config.yaml, config/config.example.yaml, tests/test_wal_checkpoint_hardening.py.

Tests — minimum four (Orion's three + one for the latency warning):

  1. test_periodic_checkpoint_logs_elapsed_and_counters — ≥2 PASSIVE iterations, aggregate emitted at close with non-None p50/p95.
  2. test_concurrent_writes_and_checkpoint_preserve_integrity — 200 writes + checkpoint loop + PRAGMA quick_check == "ok" + correct row count.
  3. test_shutdown_truncate_leaves_empty_wal — TRUNCATE at shutdown zeroes or removes the -wal sidecar.
  4. test_slow_checkpoint_emits_warning — mock a checkpoint exceeding 200ms, assert log.warning with elapsed_ms field.

On-disk tempfile.TemporaryDirectory() required for all four — :memory: cannot produce WAL files.
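A sketch of the on-disk fixture pattern, assuming a filesystem that supports WAL. The function name wal_truncate_probe, the table t and the file name state.db are illustrative. An in-memory database reports journal mode "memory" rather than "wal", which is why a temporary directory is needed.

```python
import os
import sqlite3
import tempfile


def wal_truncate_probe(n_rows=200):
    """Write rows in WAL mode on disk, TRUNCATE, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "state.db")
        wal_path = db_path + "-wal"  # SQLite's WAL sidecar file
        conn = sqlite3.connect(db_path)
        try:
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
            conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
            conn.executemany(
                "INSERT INTO t (v) VALUES (?)", [(str(i),) for i in range(n_rows)]
            )
            conn.commit()
            size_before = os.path.getsize(wal_path)
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
            # After TRUNCATE the sidecar is either zero bytes or absent.
            size_after = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
            quick_check = conn.execute("PRAGMA quick_check").fetchone()[0]
            rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
        finally:
            # Close before the temporary directory is removed.
            conn.close()
    return {
        "mode": mode,
        "wal_before": size_before,
        "wal_after": size_after,
        "quick_check": quick_check,
        "rows": rows,
    }
```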


Green light

Q1–Q5 ruled, Atlas additions locked in. You may cut code.

Before Commit 1, request the file slices from Katja as planned (state_manager __init__/close, config EngineConfig dataclass + loader, main_loop _startup open, config.yaml engine: block). Test-drift rule applies.

— Vesper