Orion Investigation — Branch #7 fix/wal-checkpoint-hardening (FLAG-035)¶
From: Orion
To: Katja
CC: Vesper, Atlas
Date: 2026-04-19
TL;DR¶
Pre-code investigation memo. Branch #7 implements FLAG-035 per the design Atlas approved in the Pre-7.3 Review §3: a background timer thread that runs PRAGMA wal_checkpoint(PASSIVE) every 60s during a session, with PRAGMA wal_checkpoint(TRUNCATE) at clean shutdown. Goal is to bound the S38-shape blast radius — corruption on CTRL_CLOSE_EVENT — to ≤60 seconds of writes on average, down from "whatever the auto-checkpoint decided."
Design is mostly locked. I'm holding on five implementation choices I want you to rule on before I cut code. Nothing blocks Atlas; nothing conflicts with what's been approved. Commit layout is a single commit + 3 tests per the plan.
Scope¶
In scope
- New config field engine.wal_checkpoint_interval_seconds: int = 60. Zero disables (test affordance).
- New method on StateManager: start_wal_checkpoint_loop(interval_s: float) — spawns a daemon thread that runs PASSIVE checkpoints on a cadence.
- Amend StateManager.close(): stop the thread (if running), run a TRUNCATE checkpoint, then close the connection.
- NEOEngine._startup() calls self._state.start_wal_checkpoint_loop(self._config.engine.wal_checkpoint_interval_seconds) after the DB is initialized and before the main loop starts.
- Per-checkpoint structured log line with mode, busy, log_frames, checkpointed, elapsed_ms. End-of-session aggregate (p50/p95 of elapsed_ms) logged at close().
- config/config.yaml + config/config.example.yaml: new key in engine: block.
- Three new tests (see Proposed Tests).
Out of scope
- CTRL_CLOSE_EVENT handler on Windows (SetConsoleCtrlHandler via ctypes). Called out explicitly in the original FLAG-035 audit and confirmed out-of-scope for this branch. The shutdown sequence it would provide only gives 5s — it's a bandaid, not a replacement, and introduces a Windows-specific code path with real testing cost. Can land separately if Phase 7.3 reveals it's still a live concern after the checkpoint thread lands.
- Any change to the FLAG-027 backup cadence or retention.
- Historical DB repair (S38 already restored from backup).
Findings¶
F1 — StateManager currently has no threading¶
neo_engine/state_manager.py never imports threading. In fact, no file in the codebase does — this branch introduces the first background thread into the engine. That's worth flagging explicitly because it affects the review lens for thread-safety invariants and shutdown coordination.
Positive side: the sqlite3 connection already uses check_same_thread=False (state_manager.py:62), which was originally added for the dashboard read-access path. Python's sqlite3 driver serializes operations on a single connection — a PASSIVE checkpoint on thread B while record_* runs on thread A is safe at the driver level (worst case the PASSIVE call waits microseconds for a write commit to land). No explicit lock needed on our side.
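The driver-level claim above can be checked standalone with plain sqlite3 standing in for StateManager — a sketch, not Branch #7 code (table and file names are illustrative):

```python
import os
import sqlite3
import tempfile
import threading

# Standalone check of the F1 claim: with check_same_thread=False, PASSIVE
# checkpoints issued from a second thread coexist with commits on the main
# thread, serialized by the sqlite3 driver. No explicit lock on our side.
tmpdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(tmpdir, "check.db"), check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (v INTEGER)")

errors: list[Exception] = []

def hammer_checkpoints() -> None:
    # 50 PASSIVE checkpoints back-to-back; any driver error lands in `errors`.
    try:
        for _ in range(50):
            conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    except Exception as exc:
        errors.append(exc)

worker = threading.Thread(target=hammer_checkpoints, daemon=True)
worker.start()
for i in range(200):  # concurrent writes on the "engine" thread
    conn.execute("INSERT INTO t VALUES (?)", (i,))
    conn.commit()
worker.join()

final_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
conn.close()
```

Worst case the PASSIVE call reports busy and checkpoints fewer frames; it never errors out against a concurrent writer on the same connection.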
F2 — _current_session_id is not read by the checkpoint path¶
The only state the checkpoint thread touches is self._conn (to issue the PRAGMA) and self._checkpoint_* fields (its own stop-event, stats buffer). It does not read or write _current_session_id, engine_state, or any domain tables. This limits the coordination surface to the connection itself, which is already thread-safe per F1.
F3 — Shutdown ordering is already correct¶
NEOEngine._shutdown (main_loop.py:790–882) currently does:
- Cancel live orders, write valuation snapshot, etc.
- self._state.close_session(...) — the Branch-session-closure fix (line ~860).
- self._state.set_engine_state("engine_status", HALTED) — line 873.
- self._state.close() — line 877.
If close() gains "stop checkpoint thread + TRUNCATE checkpoint + close connection," that runs strictly after every session-accounting write. Correct ordering: we do not want TRUNCATE racing with close_session. The existing shutdown structure already serializes these correctly.
F4 — One-shot script callers do not need a checkpoint thread¶
scripts/inject_capital.py (line 122) and scripts/write_synthetic_initial_basis.py (line 77) instantiate StateManager(db_path) for a single transaction and close immediately. They should not start a checkpoint loop. This is the second reason to make loop-start an explicit opt-in method rather than something __init__ does automatically.
F5 — Tests currently instantiate StateManager(":memory:") in many files¶
If the checkpoint loop were started from __init__, every test in test_state_manager.py, test_halt_reason_lifecycle.py, test_config_invariants.py, test_distance_to_touch_summary.py, etc. would spin a daemon thread per test. That's fine threading-wise (daemon threads don't block process exit) but it pollutes the test harness and any test-side assertion on the connection behavior becomes racy. The opt-in method keeps every existing test run unchanged by default.
F6 — Atlas's "avg checkpoint latency < 50ms" is a post-hoc observability ask¶
I can't enforce 50ms in code without introducing alert logic that doesn't belong here. The deliverable is: per-checkpoint elapsed_ms logged, end-of-session p50/p95 aggregate logged, and a test that exercises a paper-shake-down cadence. If p95 exceeds 50ms in a live run, that's an operator-visible signal in the logs and we revisit cadence. Branch #7 does not self-police the 50ms budget.
F7 — Dashboard second connection¶
dashboard.py opens its own read-only sqlite3 connection to the same DB. That's fine: WAL mode allows concurrent readers, and a PASSIVE checkpoint on the writer's connection does not evict the reader's connection. No coordination with the dashboard process needed.
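The F7 claim can also be demonstrated with plain sqlite3 — a sketch of the writer/reader pairing, with illustrative paths and table names (the read-only URI form is one way a dashboard process might open the DB, not necessarily how dashboard.py does):

```python
import os
import sqlite3
import tempfile

# Illustration of F7: a second, read-only connection keeps reading across a
# PASSIVE checkpoint on the writer's connection. WAL mode allows this.
tmpdir = tempfile.mkdtemp()
db = os.path.join(tmpdir, "neo_state.db")

writer = sqlite3.connect(db)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE metrics (v INTEGER)")
writer.execute("INSERT INTO metrics VALUES (1)")
writer.commit()

# Read-only connection via a SQLite URI (mode=ro).
reader = sqlite3.connect(f"file:{db}?mode=ro", uri=True)
count_before = reader.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]

writer.execute("PRAGMA wal_checkpoint(PASSIVE)")  # does not evict the reader
writer.execute("INSERT INTO metrics VALUES (2)")
writer.commit()
count_after = reader.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]

reader.close()
writer.close()
```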
Proposed design¶
Single commit on fix/wal-checkpoint-hardening, cut off 99483c6 (Branch #6 tip).
neo_engine/config.py¶
Add field to EngineConfig:
@dataclass(frozen=True)
class EngineConfig:
    tick_interval_seconds: int
    dry_run: bool
    db_path: str
    wal_checkpoint_interval_seconds: int = 60  # FLAG-035
Loader fallback in config.py around line 345:
engine = EngineConfig(
    tick_interval_seconds=int(engine_raw.get("tick_interval_seconds", 4)),
    dry_run=bool(engine_raw.get("dry_run", False)),
    db_path=str(engine_raw.get("db_path", "neo_state.db")),
    wal_checkpoint_interval_seconds=int(
        engine_raw.get("wal_checkpoint_interval_seconds", 60)
    ),
)
YAML default 60 added to both config/config.yaml and config/config.example.yaml.
neo_engine/state_manager.py¶
New imports: threading, time, statistics (stdlib).
New instance fields initialized in __init__:
self._checkpoint_thread: Optional[threading.Thread] = None
self._checkpoint_stop = threading.Event()
self._checkpoint_elapsed_ms: list[float] = [] # rolling, for end-of-session p50/p95
self._checkpoint_interval_s: float = 0.0
New method start_wal_checkpoint_loop(interval_s: float):
- If interval_s <= 0: logs info("wal_checkpoint disabled", interval_s=…) and returns. No thread spawned.
- If already started: no-op + warning. Idempotent.
- Spawns daemon thread wal-checkpoint. Loop body:
  - self._checkpoint_stop.wait(interval_s) — exit on stop or interval elapsed.
  - t0 = time.monotonic()
  - row = self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
  - dt_ms = (time.monotonic() - t0) * 1000
  - self._checkpoint_elapsed_ms.append(dt_ms) (capped window, say last 120 samples = 2h at 60s)
  - log.info("wal_checkpoint", extra={"mode": "PASSIVE", "busy": row[0], "log_frames": row[1], "checkpointed": row[2], "elapsed_ms": round(dt_ms, 1)})
- On any exception: log.error("wal_checkpoint failed", ..., exc_info=True) and continue — the loop must never die silently.
close() amended:
def close(self) -> None:
    self._checkpoint_stop.set()
    if self._checkpoint_thread is not None:
        self._checkpoint_thread.join(timeout=5.0)
    self._log_checkpoint_aggregate()  # p50/p95/max/n
    try:
        self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    except Exception as exc:
        log.error("final wal_checkpoint(TRUNCATE) failed",
                  extra={"error": str(exc)}, exc_info=True)
    self._conn.close()
neo_engine/main_loop.py¶
In NEOEngine._startup(), after self._state.initialize_database() and before the main loop starts, one line:
self._state.start_wal_checkpoint_loop(self._config.engine.wal_checkpoint_interval_seconds)
No other main_loop changes. _shutdown() already calls self._state.close() in a finally block — that is now the single choke-point for thread teardown + TRUNCATE.
Proposed commit structure¶
Single commit (per Atlas's Pre-7.3 plan). Subject:
feat(state): periodic WAL checkpoint + TRUNCATE at shutdown (FLAG-035)
Body covers: motivation (S38 blast radius), design choices (PASSIVE vs TRUNCATE, thread on StateManager, opt-in), concurrency posture (sqlite3 driver serialization + check_same_thread=False), observability (per-checkpoint + aggregate), default cadence and its disable-for-tests affordance.
Files touched:
- neo_engine/state_manager.py (add fields, method, amend close)
- neo_engine/config.py (new field + loader fallback)
- neo_engine/main_loop.py (one line in _startup())
- config/config.yaml + config/config.example.yaml (new key)
- tests/test_wal_checkpoint_hardening.py (new, 3 tests)
No changes outside this list. No schema, no halt taxonomy, no strategy, no fill path.
Proposed tests (tests/test_wal_checkpoint_hardening.py)¶
Three tests, per Atlas's spec. All use real on-disk SQLite files in tempfile.TemporaryDirectory() so we can inspect WAL file size and fire concurrent writes — :memory: is not viable for this suite.
- test_periodic_checkpoint_logs_elapsed_and_counters — Start loop at interval_s=0.1. Write a handful of rows (record_system_metric). Sleep long enough for ≥2 iterations. Call close(). Assert: log records with mode="PASSIVE" and numeric elapsed_ms were emitted ≥2 times, and the end-of-session aggregate log record was emitted once with non-None p50/p95.
- test_concurrent_writes_and_checkpoint_preserve_integrity — Start loop at interval_s=0.05. On the main thread, fire 200 record_system_metric calls back-to-back. Call close(). Re-open the DB with a fresh StateManager, run PRAGMA quick_check and SELECT COUNT(*) FROM system_metrics. Assert quick_check == "ok" and count == 200. No writes lost, no corruption.
- test_shutdown_truncate_leaves_empty_wal — Start loop at interval_s=9999 (effectively disabled — we don't want the periodic path interfering). Do a write (so the WAL file is non-empty mid-session). Call close(). Assert that either the -wal sidecar does not exist OR its size on disk is zero. This pins the TRUNCATE-at-shutdown contract.
If Q5 below resolves to "skip the aggregate log," drop the p50/p95 assertion in Test 1 but keep the rest.
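The Test-3 contract can be sanity-checked against bare sqlite3 (no StateManager needed) — a sketch of the assertion shape, with illustrative file and table names, not the proposed test code:

```python
import os
import sqlite3
import tempfile

# Sanity check: after PRAGMA wal_checkpoint(TRUNCATE) and close, the -wal
# sidecar is absent or zero bytes on disk.
tmpdir = tempfile.mkdtemp()
db = os.path.join(tmpdir, "state.db")
wal = db + "-wal"

conn = sqlite3.connect(db)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE system_metrics (v INTEGER)")
conn.execute("INSERT INTO system_metrics VALUES (1)")
conn.commit()

wal_size_mid_session = os.path.getsize(wal)  # non-zero: frames live in the WAL

conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
conn.close()

wal_gone_or_empty = (not os.path.exists(wal)) or os.path.getsize(wal) == 0
```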
Open questions for ruling (Katja; Atlas CC'd; Vesper to audit)¶
Q1 — Minimum cadence?¶
Atlas approved the 60s default; FLAG-035 audit mentioned "0 = disabled for tests." Should there be a lower bound on non-zero values (say 1s)? Pro: prevents a YAML typo (writing 5 intending "five minutes" and getting 5 seconds) from hammering the DB. Con: hard-coded policy in a config field.
Recommendation: No hard minimum. Values < 0 coerced to 0 (disabled); any positive value is honored. Intent is that this is an operator knob for tuning during Phase 7.3 and beyond.
Q2 — Thread start: __init__ vs explicit start_wal_checkpoint_loop?¶
(a) StateManager.__init__ auto-starts the thread using a constructor arg wal_checkpoint_interval_seconds.
(b) Explicit opt-in method start_wal_checkpoint_loop(interval_s) called from NEOEngine._startup() after DB init.
Recommendation: (b). One-shot scripts (inject_capital.py, write_synthetic_initial_basis.py) don't want a background thread for a one-write session; every existing test fixture stays unchanged (F5); and the start/stop lifecycle is explicit and auditable in _startup() / close().
Q3 — TRUNCATE at close: best-effort or hard requirement?¶
If the final TRUNCATE fails (e.g., the connection is in a bad state), should close() raise, or log-and-continue?
Recommendation: log-and-continue (log.error + exc_info=True). The TRUNCATE is a hygiene op, not a correctness op. A failure here must not prevent the connection from closing and the process from exiting cleanly — that would regress the S38 scenario (hung shutdown is worse than messy shutdown). We'll see any failure in logs.
Q4 — Observability granularity: aggregate p50/p95 at close?¶
Atlas's review asks to "log elapsed_ms (p50/p95 if feasible)."
(a) Per-checkpoint elapsed_ms only (simplest — matches Atlas's "if feasible" caveat).
(b) Per-checkpoint log plus end-of-session aggregate log record with n, p50, p95, max.
Recommendation: (b). Adds ~10 lines of code (statistics.quantiles or an inlined percentile helper like the one from Branch #6 Commit 3), one extra test assertion, and gives the summarize/experiment-log path a natural hook if we want to surface checkpoint health per-session in the summary block later.
Q5 — Sample window size for the aggregate¶
If (b) on Q4: unbounded list, or last-N deque?
Recommendation: cap at 512 samples, drop oldest (simple rolling window). At 60s cadence that's ~8.5 hours of samples — more than any session we've run. Prevents unbounded memory growth for the (future) long-lived-process case without complicating the read.
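If (b) wins on Q4 with the 512-sample cap, the aggregate is ~10 lines of stdlib — a sketch, assuming the samples field becomes a deque(maxlen=512) and with an illustrative helper name:

```python
import statistics
from collections import deque

def checkpoint_aggregate(samples: "deque[float]") -> dict:
    """Summarize a rolling window of checkpoint latencies: n, p50, p95, max."""
    if not samples:
        return {"n": 0, "p50": None, "p95": None, "max": None}
    data = sorted(samples)
    if len(data) > 1:
        # quantiles(n=20) yields 19 cut points: index 9 is p50, index 18 is p95.
        q = statistics.quantiles(data, n=20, method="inclusive")
    else:
        q = [data[0]] * 19  # single sample: every percentile is that sample
    return {"n": len(data),
            "p50": round(q[9], 1),
            "p95": round(q[18], 1),
            "max": round(data[-1], 1)}
```

The deque cap makes the append in the loop body O(1) with bounded memory; the sort only happens once, at close().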
Green light request¶
I'm holding until Q1–Q5 are ruled. Once ruled, I proceed straight to code and deliver patch + memo. No Atlas hold needed — the design here is entirely inside what §3 and the Phase 7.3 Gate approved.
Before I cut Commit 1, per the test-drift rule I'll ask for pasted head-of-main slices from Katja's Windows repo of:
- neo_engine/state_manager.py — the __init__ and close() regions (lines ~196–208).
- neo_engine/config.py — the EngineConfig dataclass (lines 56–60) and its loader call (lines 343–348).
- neo_engine/main_loop.py — the _startup() opening so I can pick the right insertion point for the one new line.
- config/config.yaml — the engine: block.
That keeps Branch #7 cleanly rebased on what Katja actually has checked out, not what the sandbox has post-Branch-6.
— Orion