Orion Investigation — Branch #7 fix/wal-checkpoint-hardening (FLAG-035)¶
From: Orion
To: Katja
CC: Vesper, Atlas
Date: 2026-04-19
TL;DR¶
Pre-code investigation memo. Branch #7 implements FLAG-035 per the design Atlas approved in the Pre-7.3 Review §3: a background timer thread that runs PRAGMA wal_checkpoint(PASSIVE) every 60s during a session, with PRAGMA wal_checkpoint(TRUNCATE) at clean shutdown. Goal is to bound the S38-shape blast radius — corruption on CTRL_CLOSE_EVENT — to ≤60 seconds of writes on average, down from "whatever the auto-checkpoint decided."
Design is mostly locked. I'm holding on five implementation choices I want you to rule on before I cut code. Nothing blocks Atlas; nothing conflicts with what's been approved. Commit layout is a single commit + 3 tests per the plan.
Scope¶
In scope
- New config field engine.wal_checkpoint_interval_seconds: int = 60. Zero disables (test affordance).
- New method on StateManager: start_wal_checkpoint_loop(interval_s: float) — spawns a daemon thread that runs PASSIVE checkpoints on a cadence.
- Amend StateManager.close(): stop the thread (if running), run a TRUNCATE checkpoint, then close the connection.
- NEOEngine._startup() calls self._state.start_wal_checkpoint_loop(self._config.engine.wal_checkpoint_interval_seconds) after the DB is initialized and before the main loop starts.
- Per-checkpoint structured log line with mode, busy, log_frames, checkpointed, elapsed_ms. End-of-session aggregate (p50/p95 of elapsed_ms) logged at close().
- config/config.yaml + config/config.example.yaml: new key in engine: block.
- Three new tests (see Proposed Tests).
Out of scope
- CTRL_CLOSE_EVENT handler on Windows (SetConsoleCtrlHandler via ctypes). Called out explicitly in the original FLAG-035 audit and confirmed out-of-scope for this branch. The shutdown sequence it would provide only gives 5s — it's a bandaid, not a replacement, and introduces a Windows-specific code path with real testing cost. Can land separately if Phase 7.3 reveals it's still a live concern after the checkpoint thread lands.
- Any change to the FLAG-027 backup cadence or retention.
- Historical DB repair (S38 already restored from backup).
Findings¶
F1 — StateManager currently has no threading¶
neo_engine/state_manager.py never imports threading. In fact, no file in the codebase does — this branch introduces the first background thread into the engine. That's worth flagging explicitly because it affects the review lens for thread-safety invariants and shutdown coordination.
Positive side: the sqlite3 connection already uses check_same_thread=False (state_manager.py:62), which was originally added for the dashboard read-access path. Python's sqlite3 driver serializes operations on a single connection — a PASSIVE checkpoint on thread B while record_* runs on thread A is safe at the driver level (worst case the PASSIVE call waits microseconds for a write commit to land). No explicit lock needed on our side.
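The driver-level claim above can be checked standalone with plain sqlite3 standing in for StateManager — a sketch, not Branch #7 code (table and file names are illustrative):

```python
import os
import sqlite3
import tempfile
import threading

# Standalone check of the F1 claim: with check_same_thread=False, PASSIVE
# checkpoints issued from a second thread coexist with commits on the main
# thread, serialized by the sqlite3 driver. No explicit lock on our side.
tmpdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(tmpdir, "check.db"), check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (v INTEGER)")

errors: list[Exception] = []

def hammer_checkpoints() -> None:
    # 50 PASSIVE checkpoints back-to-back; any driver error lands in `errors`.
    try:
        for _ in range(50):
            conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    except Exception as exc:
        errors.append(exc)

worker = threading.Thread(target=hammer_checkpoints, daemon=True)
worker.start()
for i in range(200):  # concurrent writes on the "engine" thread
    conn.execute("INSERT INTO t VALUES (?)", (i,))
    conn.commit()
worker.join()

final_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
conn.close()
```

Worst case the PASSIVE call reports busy and checkpoints fewer frames; it never errors out against a concurrent writer on the same connection.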
F2 — _current_session_id is not read by the checkpoint path¶
The only state the checkpoint thread touches is self._conn (to issue the PRAGMA) and self._checkpoint_* fields (its own stop-event, stats buffer). It does not read or write _current_session_id, engine_state, or any domain tables. This limits the coordination surface to the connection itself, which is already thread-safe per F1.
F3 — Shutdown ordering is already correct¶
NEOEngine._shutdown (main_loop.py:790–882) currently does:
- Cancel live orders, write valuation snapshot, etc.
- self._state.close_session(...) — the Branch-session-closure fix (line ~860).
- self._state.set_engine_state("engine_status", HALTED) — line 873.
- self._state.close() — line 877.
If close() gains "stop checkpoint thread + TRUNCATE checkpoint + close connection," that runs strictly after every session-accounting write. Correct ordering: we do not want TRUNCATE racing with close_session. The existing shutdown structure already serializes these correctly.
F4 — One-shot script callers do not need a checkpoint thread¶
scripts/inject_capital.py (line 122) and scripts/write_synthetic_initial_basis.py (line 77) instantiate StateManager(db_path) for a single transaction and close immediately. They should not start a checkpoint loop. This is the second reason to make loop-start an explicit opt-in method rather than something __init__ does automatically.
F5 — Tests currently instantiate StateManager(":memory:") in many files¶
If the checkpoint loop were started from __init__, every test in test_state_manager.py, test_halt_reason_lifecycle.py, test_config_invariants.py, test_distance_to_touch_summary.py, etc. would spin a daemon thread per test. That's fine threading-wise (daemon threads don't block process exit) but it pollutes the test harness and any test-side assertion on the connection behavior becomes racy. The opt-in method keeps every existing test run unchanged by default.
F6 — Atlas's "avg checkpoint latency < 50ms" is a post-hoc observability ask¶
I can't enforce 50ms in code without introducing alert logic that doesn't belong here. The deliverable is: per-checkpoint elapsed_ms logged, end-of-session p50/p95 aggregate logged, and a test that exercises a paper-shake-down cadence. If p95 exceeds 50ms in a live run, that's an operator-visible signal in the logs and we revisit cadence. Branch #7 does not self-police the 50ms budget.
F7 — Dashboard second connection¶
dashboard.py opens its own read-only sqlite3 connection to the same DB. That's fine: WAL mode allows concurrent readers, and a PASSIVE checkpoint on the writer's connection does not evict the reader's connection. No coordination with the dashboard process needed.
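The F7 claim can also be demonstrated with plain sqlite3 — a sketch of the writer/reader pairing, with illustrative paths and table names (the read-only URI form is one way a dashboard process might open the DB, not necessarily how dashboard.py does):

```python
import os
import sqlite3
import tempfile

# Illustration of F7: a second, read-only connection keeps reading across a
# PASSIVE checkpoint on the writer's connection. WAL mode allows this.
tmpdir = tempfile.mkdtemp()
db = os.path.join(tmpdir, "neo_state.db")

writer = sqlite3.connect(db)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE metrics (v INTEGER)")
writer.execute("INSERT INTO metrics VALUES (1)")
writer.commit()

# Read-only connection via a SQLite URI (mode=ro).
reader = sqlite3.connect(f"file:{db}?mode=ro", uri=True)
count_before = reader.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]

writer.execute("PRAGMA wal_checkpoint(PASSIVE)")  # does not evict the reader
writer.execute("INSERT INTO metrics VALUES (2)")
writer.commit()
count_after = reader.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]

reader.close()
writer.close()
```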
Proposed design¶
Single commit on fix/wal-checkpoint-hardening, cut off 99483c6 (Branch #6 tip).
neo_engine/config.py¶
Add field to EngineConfig:
@dataclass(frozen=True)
class EngineConfig:
    tick_interval_seconds: int
    dry_run: bool
    db_path: str
    wal_checkpoint_interval_seconds: int = 60  # FLAG-035
Loader fallback in config.py around line 345:
engine = EngineConfig(
    tick_interval_seconds=int(engine_raw.get("tick_interval_seconds", 4)),
    dry_run=bool(engine_raw.get("dry_run", False)),
    db_path=str(engine_raw.get("db_path", "neo_state.db")),
    wal_checkpoint_interval_seconds=int(
        engine_raw.get("wal_checkpoint_interval_seconds", 60)
    ),
)
YAML default 60 added to both config/config.yaml and config/config.example.yaml.
neo_engine/state_manager.py¶
New imports: threading, time, statistics (stdlib).
New instance fields initialized in __init__:
self._checkpoint_thread: Optional[threading.Thread] = None
self._checkpoint_stop = threading.Event()
self._checkpoint_elapsed_ms: list[float] = [] # rolling, for end-of-session p50/p95
self._checkpoint_interval_s: float = 0.0
New method start_wal_checkpoint_loop(interval_s: float):
- If interval_s <= 0: logs info("wal_checkpoint disabled", interval_s=…) and returns. No thread spawned.
- If already started: no-op + warning. Idempotent.
- Spawns daemon thread wal-checkpoint. Loop body:
  - self._checkpoint_stop.wait(interval_s) — exit on stop or interval elapsed.
  - t0 = time.monotonic()
  - row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
  - dt_ms = (time.monotonic() - t0) * 1000
  - self._checkpoint_elapsed_ms.append(dt_ms) (capped window, say last 120 samples = 2h at 60s)
  - log.info("wal_checkpoint", extra={"mode": "PASSIVE", "busy": row[0], "log_frames": row[1], "checkpointed": row[2], "elapsed_ms": round(dt_ms, 1)})
- On any exception: log.error("wal_checkpoint failed", ..., exc_info=True) and continue — the loop must never die silently.
close() amended:
def close(self) -> None:
    self._checkpoint_stop.set()
    if self._checkpoint_thread is not None:
        self._checkpoint_thread.join(timeout=5.0)
    self._log_checkpoint_aggregate()  # p50/p95/max/n
    try:
        self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    except Exception as exc:
        log.error("final wal_checkpoint(TRUNCATE) failed",
                  extra={"error": str(exc)}, exc_info=True)
    self._conn.close()
neo_engine/main_loop.py¶
In NEOEngine._startup(), after self._state.initialize_database() and before the main loop starts, one line:
self._state.start_wal_checkpoint_loop(self._config.engine.wal_checkpoint_interval_seconds)
No other main_loop changes. _shutdown() already calls self._state.close() in a finally block — that is now the single choke-point for thread teardown + TRUNCATE.
Proposed commit structure¶
Single commit (per Atlas's Pre-7.3 plan). Subject:
feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035)
Body covers: motivation (S38 blast radius), design choices (PASSIVE vs TRUNCATE, thread on StateManager, opt-in), concurrency posture (sqlite3 driver serialization + check_same_thread=False), observability (per-checkpoint + aggregate), default cadence and its disable-for-tests affordance.
Files touched:
- neo_engine/state_manager.py (add fields, method, amend close)
- neo_engine/config.py (new field + loader fallback)
- neo_engine/main_loop.py (one line in _startup())
- config/config.yaml + config/config.example.yaml (new key)
- tests/test_wal_checkpoint_hardening.py (new, 3 tests)
No changes outside this list. No schema, no halt taxonomy, no strategy, no fill path.
Proposed tests (tests/test_wal_checkpoint_hardening.py)¶
Three tests, per Atlas's spec. All use real on-disk SQLite files in tempfile.TemporaryDirectory() so we can inspect WAL file size and fire concurrent writes — :memory: is not viable for this suite.
- test_periodic_checkpoint_logs_elapsed_and_counters — Start loop at interval_s=0.1. Write a handful of rows (record_system_metric). Sleep long enough for ≥2 iterations. Call close(). Assert: log records with mode="PASSIVE" and numeric elapsed_ms were emitted ≥2 times, and the end-of-session aggregate log record was emitted once with non-None p50/p95.
- test_concurrent_writes_and_checkpoint_preserve_integrity — Start loop at interval_s=0.05. On the main thread, fire 200 record_system_metric calls back-to-back. Call close(). Re-open the DB with a fresh StateManager, run PRAGMA quick_check and SELECT COUNT(*) FROM system_metrics. Assert quick_check == "ok" and count == 200. No writes lost, no corruption.
- test_shutdown_truncate_leaves_empty_wal — Start loop at interval_s=9999 (effectively disabled — we don't want the periodic path interfering). Do a write (so the WAL file is non-empty mid-session). Call close(). Assert that either the -wal sidecar does not exist OR its size on disk is zero. This pins the TRUNCATE-at-shutdown contract.
If Q5 below resolves to "skip the aggregate log," drop the p50/p95 assertion in Test 1 but keep the rest.
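The Test-3 contract can be sanity-checked against bare sqlite3 (no StateManager needed) — a sketch of the assertion shape, with illustrative file and table names, not the proposed test code:

```python
import os
import sqlite3
import tempfile

# Sanity check: after PRAGMA wal_checkpoint(TRUNCATE) and close, the -wal
# sidecar is absent or zero bytes on disk.
tmpdir = tempfile.mkdtemp()
db = os.path.join(tmpdir, "state.db")
wal = db + "-wal"

conn = sqlite3.connect(db)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE system_metrics (v INTEGER)")
conn.execute("INSERT INTO system_metrics VALUES (1)")
conn.commit()

wal_size_mid_session = os.path.getsize(wal)  # non-zero: frames live in the WAL

conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()

wal_gone_or_empty = (not os.path.exists(wal)) or os.path.getsize(wal) == 0
```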
Open questions for ruling (Katja; Atlas CC'd; Vesper to audit)¶
Q1 — Minimum cadence?¶
Atlas approved the 60s default; FLAG-035 audit mentioned "0 = disabled for tests." Should there be a lower bound on non-zero values (say 1s)? Pro: prevents a YAML typo (writing 5 intending "five minutes" and getting 5 seconds) from hammering the DB. Con: hard-coded policy in a config field.
Recommendation: No hard minimum. Values < 0 coerced to 0 (disabled); any positive value is honored. Intent is that this is an operator knob for tuning during Phase 7.3 and beyond.
Q2 — Thread start: __init__ vs explicit start_wal_checkpoint_loop?¶
(a) StateManager.__init__ auto-starts the thread using a constructor arg wal_checkpoint_interval_seconds.
(b) Explicit opt-in method start_wal_checkpoint_loop(interval_s) called from NEOEngine._startup() after DB init.
Recommendation: (b). One-shot scripts (inject_capital.py, write_synthetic_initial_basis.py) don't want a background thread for a one-write session; every existing test fixture stays unchanged (F5); and the start/stop lifecycle is explicit and auditable in _startup() / close().
Q3 — TRUNCATE at close: best-effort or hard requirement?¶
If the final TRUNCATE fails (e.g., the connection is in a bad state), should close() raise, or log-and-continue?
Recommendation: log-and-continue (log.error + exc_info=True). The TRUNCATE is a hygiene op, not a correctness op. A failure here must not prevent the connection from closing and the process from exiting cleanly — that would regress the S38 scenario (hung shutdown is worse than messy shutdown). We'll see any failure in logs.
Q4 — Observability granularity: aggregate p50/p95 at close?¶
Atlas's review asks to "log elapsed_ms (p50/p95 if feasible)."
(a) Per-checkpoint elapsed_ms only (simplest — matches Atlas's "if feasible" caveat).
(b) Per-checkpoint log plus end-of-session aggregate log record with n, p50, p95, max.
Recommendation: (b). Adds ~10 lines of code (statistics.quantiles or an inlined percentile helper like the one from Branch #6 Commit 3), one extra test assertion, and gives the summarize/experiment-log path a natural hook if we want to surface checkpoint health per-session in the summary block later.
Q5 — Sample window size for the aggregate¶
If (b) on Q4: unbounded list, or last-N deque?
Recommendation: cap at 512 samples, drop oldest (simple rolling window). At 60s cadence that's ~8.5 hours of samples — more than any session we've run. Prevents unbounded memory growth for the (future) long-lived-process case without complicating the read.
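If (b) wins on Q4 with the 512-sample cap, the aggregate is ~10 lines of stdlib — a sketch, assuming the samples field becomes a deque(maxlen=512) and with an illustrative helper name:

```python
import statistics
from collections import deque

def checkpoint_aggregate(samples: "deque[float]") -> dict:
    """Summarize a rolling window of checkpoint latencies: n, p50, p95, max."""
    if not samples:
        return {"n": 0, "p50": None, "p95": None, "max": None}
    data = sorted(samples)
    if len(data) > 1:
        # quantiles(n=20) yields 19 cut points: index 9 is p50, index 18 is p95.
        q = statistics.quantiles(data, n=20, method="inclusive")
    else:
        q = [data[0]] * 19  # single sample: every percentile is that sample
    return {"n": len(data),
            "p50": round(q[9], 1),
            "p95": round(q[18], 1),
            "max": round(data[-1], 1)}
```

The deque cap makes the append in the loop body O(1) with bounded memory; the sort only happens once, at close().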
Green light request¶
I'm holding until Q1–Q5 are ruled. Once ruled, I proceed straight to code and deliver patch + memo. No Atlas hold needed — the design here is entirely inside what §3 and the Phase 7.3 Gate approved.
Before I cut Commit 1, per the test-drift rule I'll ask for pasted head-of-main slices from Katja's Windows repo of:
- neo_engine/state_manager.py — the __init__ and close() regions (lines ~196–208).
- neo_engine/config.py — the EngineConfig dataclass (lines 56–60) and its loader call (lines 343–348).
- neo_engine/main_loop.py — the _startup() opening so I can pick the right insertion point for the one new line.
- config/config.yaml — the engine: block.
That keeps Branch #7 cleanly rebased on what Katja actually has checked out, not what the sandbox has post-Branch-6.
— Orion