Atlas Alignment — Branch #7 WAL Checkpoint Hardening Brief¶
To: Atlas CC: Vesper, Orion From: Katja Date: 2026-04-19
Status¶
Branches #1–#6 are merged. Branch #7 (fix/wal-checkpoint-hardening) is the last gate before S40 and Phase 7.3. Orion has delivered his pre-code investigation memo and is holding for rulings on Q1–Q5 before cutting code.
Branch #7 scope (recap)¶
Implements FLAG-035 per your Pre-7.3 Review §3: a background timer thread running PRAGMA wal_checkpoint(PASSIVE) every 60s during a live session, with PRAGMA wal_checkpoint(TRUNCATE) at clean shutdown. Goal: bound the S38 blast radius (DB corruption on CTRL_CLOSE_EVENT) to ≤60 seconds of writes on average.
Design is single commit. Files touched: state_manager.py, config.py, main_loop.py, both YAMLs, one new test file (3 tests).
Key design decisions already made:
- Thread lives on StateManager, not on the engine — clean separation, no engine loop involvement.
- Opt-in start via start_wal_checkpoint_loop(interval_s) called from _startup(), not from __init__. Keeps one-shot scripts and test fixtures clean.
- sqlite3's check_same_thread=False already set — concurrent PASSIVE checkpoint + write is safe at the driver level.
- TRUNCATE at close is best-effort (log-and-continue on failure) — a failed TRUNCATE must not hang shutdown.
CTRL_CLOSE_EVENT handler explicitly out of scope for this branch (Windows-specific, 5s window, separate risk/effort tradeoff).
Q1–Q5 — where Atlas input helps¶
Vesper will rule on most of these, but two touch things you specified:
Q4 — Observability granularity. Your §3 asked to "log elapsed_ms (p50/p95 if feasible)." Orion proposes (b): per-checkpoint log line (mode, busy, log_frames, checkpointed, elapsed_ms) plus an end-of-session aggregate log with n, p50, p95, max. This is the full implementation of your ask. If you want to confirm (b) explicitly or clarify "if feasible," say so now — otherwise Vesper will approve (b) as the correct read of your spec.
Q5 — Sample window. For the aggregate: cap at 512 samples (rolling window, drop oldest). At 60s cadence = ~8.5 hours. Purely an engineering detail — flagging in case you have a preference on memory budget vs. session-length assumptions.
The other three questions (Q1 — no hard minimum on cadence, Q2 — opt-in thread start, Q3 — TRUNCATE failure handling) are engineering decisions within Orion's remit. No Atlas input needed unless you disagree with Orion's recommendations.
Request¶
Confirm Q4 (or defer to Vesper), and note any concerns on the overall design before Vesper greenlights Orion to cut code. Orion is standing by.
— Katja