# [C] Orion Audit — Session Closure + DB Corruption
## TL;DR
Issue 1 (CRITICAL) — found, fixed, tested. Root cause is a typo in _shutdown(): it read _end_inv.total_value_rlusd (missing the _in_) from an InventorySnapshot dataclass whose actual field is total_value_in_rlusd. The AttributeError fired on every shutdown and was silently swallowed by a log.debug handler, so close_session() was never executed. As a consequence, every sessions row from S1 through S37 landed with ended_at = NULL. The fix is 1 line of code + a log-level promotion + 4 regression tests. Patch attached.
Issue 2 (HIGH) — confirmed mechanism, hardening recommended, backup restore safe. S38 DB corruption is consistent with CTRL_CLOSE_EVENT on Windows (terminal X-button close). Python's signal module does not deliver this event — our FLAG-027 signal handlers catch SIGINT / SIGTERM / SIGBREAK but nothing else. The neo_live_stage1.db.bak.20260418T192119Z pre-S38 backup was taken via SQLite's atomic backup() API and is a consistent snapshot — safe to restore, with the caveat that anything written during S38 (which is already in a malformed DB you can't recover from anyway) is lost. Recommended hardening: add a periodic synchronous PRAGMA wal_checkpoint(TRUNCATE) on a 60-second timer during the session, to bound the blast radius of any future hard-kill.
Recommendation: clear to run S39 after applying the Issue 1 patch and restoring the backup. The Issue 2 hardening is additive defense-in-depth and can ship on a later branch.
## Issue 1 — Root Cause

### The bug (neo_engine/main_loop.py:668, pre-fix)
self._state.close_session(
    ending_xrp=_end_inv.xrp_balance if _end_inv else 0.0,
    ending_rlusd=_end_inv.rlusd_balance if _end_inv else 0.0,
    ending_value=_end_inv.total_value_rlusd if _end_inv else 0.0,  # ← WRONG
    halt_reason=_final_reason,
)
InventorySnapshot (neo_engine/models.py:176) has the fields:
xrp_balance, rlusd_balance, xrp_value_in_rlusd, total_value_in_rlusd,
xrp_pct, drift_pct, skew_tier, drift_velocity_pct_per_min
There is no total_value_rlusd attribute. _end_inv.total_value_rlusd raises AttributeError: 'InventorySnapshot' object has no attribute 'total_value_rlusd' every single time.
Verified in isolation:
$ python -c "from neo_engine.models import InventorySnapshot; \
s = InventorySnapshot(); s.total_value_rlusd"
AttributeError: 'InventorySnapshot' object has no attribute 'total_value_rlusd'
The confusion is real: the DB column in valuation_snapshots is named total_value_rlusd (no _in_), while the in-memory dataclass field has _in_. The two are inconsistent. _shutdown() picked the DB spelling on a dataclass access.
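The mismatch is easy to reproduce standalone; `Snapshot` below is an illustrative stand-in for `InventorySnapshot`, not the real class:

```python
from dataclasses import dataclass

# Minimal stand-in for InventorySnapshot: the field carries the `_in_`,
# mirroring the mismatch described above (class and values are illustrative).
@dataclass
class Snapshot:
    total_value_in_rlusd: float = 0.0

s = Snapshot(total_value_in_rlusd=1250.0)
try:
    s.total_value_rlusd  # the DB-column spelling, not the dataclass field
except AttributeError as exc:
    print(exc)  # 'Snapshot' object has no attribute 'total_value_rlusd' ...
```

Dataclasses don't intercept unknown attribute reads, so the DB-spelled access fails immediately, exactly as in the isolated reproduction above.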
### Why it has been silent since S1
The attribute access is inside a try/except block (main_loop.py:655–672) whose handler was:
except Exception as exc:
    log.debug("Session close failed — continuing shutdown", extra={"error": str(exc)})
log.debug is filtered out at default log levels. The failure is invisible in every file log, every console log, every telemetry export.
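A self-contained illustration of the filtering (logger name and message are placeholders): at the default effective level of WARNING, a debug record never reaches any handler, while an error record does.

```python
import io
import logging

buf = io.StringIO()
log = logging.getLogger("shutdown-demo")
log.addHandler(logging.StreamHandler(buf))

# Effective level inherits from the root logger: WARNING by default,
# so the debug record is dropped before any handler sees it.
log.debug("Session close failed")
log.error("Session close failed")

print(buf.getvalue())  # only the error line made it through
```

This is exactly why promoting the except handler to log.error makes any future regression visible.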
### Answers to your four questions (direct)

| # | Question | Answer |
|---|---|---|
| 1 | Is close_session raising an exception on every call? If so, what? | No — close_session is never reached. The exception (AttributeError) fires on the argument-evaluation line that computes ending_value, before the call. |
| 2 | Is _current_session_id None at shutdown time (early return)? | No. _startup() calls create_session(), which sets _current_session_id, and nothing else clears it. Grep confirms _current_session_id is only assigned in 3 places: init (None), create_session (set to new id), close_session (cleared after write). |
| 3 | Is the DB connection already closed when close_session is called? | No. _state.close() is in a finally block at line 678, which runs strictly after the close_session try/except block (lines 655–672). DB ordering is correct. |
| 4 | Is _transaction() failing in WAL mode at shutdown? | No. _transaction() is never entered because close_session() is never called. The context manager itself is a clean commit/rollback wrapper; it is not the cause here. |
The root cause is simpler than any of the hypotheses — it's a plain attribute typo in a silent-failure code path.
### Git-blame
$ git log -S "total_value_rlusd" -- neo_engine/main_loop.py
7eaed37 2026-04-12 Data: add session tracking for funded runs in stage1 DB
Introduced in the commit that first wired up close_session on shutdown. The bug has been present for every session the table has ever held.
### The fix
Patch delivered at:
02 Projects/NEO Trading Engine/orion-patches-2026-04-18-session-closure/0001-fix-session-closure-write-ended_at-total_value_in_rl.patch
One commit on branch fix/session-closure-ended-at. Two changes, one file of production code + one new test file:
neo_engine/main_loop.py (line 668) — attribute name corrected, comment explaining the failure mode added:
-        self._state.close_session(
-            ending_xrp=_end_inv.xrp_balance if _end_inv else 0.0,
-            ending_rlusd=_end_inv.rlusd_balance if _end_inv else 0.0,
-            ending_value=_end_inv.total_value_rlusd if _end_inv else 0.0,
-            halt_reason=_final_reason,
-        )
-    except Exception as exc:
-        log.debug("Session close failed — continuing shutdown", extra={"error": str(exc)})
+        # Note: the InventorySnapshot field is `total_value_in_rlusd`. The
+        # prior spelling `total_value_rlusd` (no `_in_`) matches the
+        # valuation_snapshots DB column, not the in-memory dataclass, and
+        # raised AttributeError on every shutdown — silently swallowed by
+        # the debug log below. That caused every session row from S1–S37
+        # to land with ended_at = NULL. Audit: 2026-04-18.
+        self._state.close_session(
+            ending_xrp=_end_inv.xrp_balance if _end_inv else 0.0,
+            ending_rlusd=_end_inv.rlusd_balance if _end_inv else 0.0,
+            ending_value=_end_inv.total_value_in_rlusd if _end_inv else 0.0,
+            halt_reason=_final_reason,
+        )
+    except Exception as exc:
+        # Promoted from log.debug on 2026-04-18: session close is a
+        # source-of-truth write for session accounting — failures must
+        # not be silent. Any future regression here surfaces in logs.
+        log.error(
+            "Session close failed — continuing shutdown",
+            extra={"error": str(exc)},
+            exc_info=True,
+        )
tests/test_shutdown_ended_at.py (new, 4 tests, all passing):
| # | Test | What it pins |
|---|---|---|
| 1 | test_shutdown_populates_ended_at_on_sessions_row | End-to-end: real StateManager(":memory:"), open session, run _shutdown, assert ended_at IS NOT NULL and the ending_* values round-trip correctly. |
| 2 | test_shutdown_passes_total_value_in_rlusd_as_ending_value | Spies on close_session kwargs to pin the attribute name. Any future reintroduction of the .total_value_rlusd typo fails this test. |
| 3 | test_shutdown_handles_missing_session_gracefully | The _current_session_id is None early-return path — no session row created, no raise. |
| 4 | test_close_session_failure_logs_at_error_level | Forces close_session to raise; asserts an ERROR record is emitted. Prevents future regression back to log.debug. |
Negative control: I stashed my main_loop.py edit and reran the same tests against the pre-fix code. 3 of 4 failed (tests 1, 2, 4), exactly as expected — the tests genuinely catch the bug. With the fix applied, all 4 pass.
Existing shutdown test class TestLiveShutdownCancelOrders (5 tests) continues to pass unchanged.
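For reference, the spy pattern behind test 2 looks roughly like this; every name and value below is a hypothetical stand-in, not the real engine API:

```python
from unittest.mock import MagicMock

# Hypothetical snapshot carrying the corrected field name.
class FakeSnapshot:
    xrp_balance = 10.0
    rlusd_balance = 500.0
    total_value_in_rlusd = 987.5

state = MagicMock()
inv = FakeSnapshot()
# The call under test, shaped like the fixed _shutdown() block.
state.close_session(
    ending_xrp=inv.xrp_balance,
    ending_rlusd=inv.rlusd_balance,
    ending_value=inv.total_value_in_rlusd,
    halt_reason="test",
)
# Pin the kwarg: a reintroduced `.total_value_rlusd` typo would raise
# AttributeError above, so this assertion would never see the right value.
assert state.close_session.call_args.kwargs["ending_value"] == 987.5
print("ending_value pinned")
```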
### Scope discipline

- Only `neo_engine/main_loop.py` is modified.
- The change touches 1 attribute name and 1 log-level upgrade. No behavioral changes to anything before or after the close_session block.
- No other typos of this shape exist in the codebase. I grepped every occurrence of `.total_value_rlusd`; the three other occurrences (lines 950, 1384, + dashboard) are dict keys or kwarg names that legitimately use that spelling. The only attribute access on an `InventorySnapshot` was the one bad line.
- Hold list preserved: `anchor_max_divergence_bps: 10.0`, `base_size_rlusd: 15.0`, `risk_engine.py`, fill paths, schema — none touched.
## Historical backfill — proposal, not built
You didn't ask for this and I didn't build it. Flagging in case you want a follow-up task: a one-shot script could set ended_at on historical session rows by taking the max of fills.executed_at (or inventory_snapshots.created_at, or valuation_snapshots.created_at) grouped by session_id. That's imprecise (it's the last write, not the true shutdown time, so it understates by the idle tail of the run), but it's strictly better than NULL for Dashboard/Experiment Log purposes. If you want it, I can ship it as a separate commit. I recommend doing it after this fix lands and at least one session closes cleanly — that gives us a clean point of reference to validate the backfill script against.
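If that follow-up happens, the core of the backfill is a single correlated-subquery UPDATE. A sketch against a toy schema (table and column names taken from the proposal above, everything else simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id INTEGER PRIMARY KEY, ended_at TEXT);
CREATE TABLE fills (session_id INTEGER, executed_at TEXT);
INSERT INTO sessions VALUES (1, NULL), (2, NULL);
INSERT INTO fills VALUES (1, '2026-04-01T10:00:00Z'), (1, '2026-04-01T11:30:00Z'),
                         (2, '2026-04-02T09:15:00Z');
""")

# Backfill: the last observed write per session stands in for the true
# shutdown time (understates by the idle tail, but beats NULL).
conn.execute("""
UPDATE sessions
SET ended_at = (SELECT MAX(f.executed_at) FROM fills f
                WHERE f.session_id = sessions.session_id)
WHERE ended_at IS NULL
""")
print(conn.execute("SELECT session_id, ended_at FROM sessions").fetchall())
# [(1, '2026-04-01T11:30:00Z'), (2, '2026-04-02T09:15:00Z')]
```

MAX over ISO-8601 text timestamps sorts correctly because the format is lexicographic; a real script would also coalesce across the other candidate tables.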
## Git commands for Katja — PowerShell, copy-paste
Run from the repo root in your VS Code terminal. Assumes current main is at the post-cleanup-branch tip (which it is).
Block 1 — branch off main
Block 2 — apply patch
# adjust path to wherever you have Claude Homebase Neo mounted
$patch = "C:\path\to\Claude Homebase Neo\02 Projects\NEO Trading Engine\orion-patches-2026-04-18-session-closure\0001-fix-session-closure-write-ended_at-total_value_in_rl.patch"
git am --3way "$patch"
I don't expect any conflicts — main_loop.py has not been touched by other branches since S37, and tests/test_shutdown_ended_at.py is new. If git am errors with "patch does not apply," abort and ping me; that would mean something else moved underneath.
Block 3 — verify
Expected: one commit (fix(session-closure): write ended_at — total_value_in_rlusd typo root cause); 4 passed.
Block 4 — push and merge
Merge via GitHub UI. Commit body covers rationale and test coverage — no PR description needed unless you want one.
## Issue 2 — DB Corruption Mechanism + Hardening

### What most likely happened to the live DB during S38
SQLite in WAL mode (which we run — state_manager.py:64 sets PRAGMA journal_mode=WAL) is robust against a clean process death: the WAL file preserves committed transactions and replay is automatic on next open. Corruption in WAL mode specifically requires the OS to drop or reorder writes that SQLite assumed were durable, which happens in a narrow window:
1. A WAL auto-checkpoint starts moving pages from the `-wal` file back into the main DB.
2. The process is terminated hard enough that the OS drops the kernel buffer for the file descriptor before the pages are fsynced.
3. On next open, SQLite's WAL recovery detects inconsistency between WAL content and main-DB state → `database is malformed`, the exact reported symptom.
On Windows, the typical way to produce step 2 is CTRL_CLOSE_EVENT — the event fired when a user clicks the X on a console window. Unlike SIGINT and SIGBREAK, Python's signal module does not deliver CTRL_CLOSE_EVENT (this is documented behavior: Python's signal module on Windows routes only a fixed set of C-runtime signals, and the CTRL_CLOSE_EVENT path goes through SetConsoleCtrlHandler, which Python does not expose). The process receives ~5 seconds from the OS to exit cleanly, then is hard-killed.
Our FLAG-027 signal handlers (run_paper_session.py:113–141) install routes for SIGINT, SIGTERM, and SIGBREAK only. The docstring itself calls this out:
"CTRL_CLOSE_EVENT (terminal-close) is NOT caught by Python's signal module; the pre-run backup is the defense for that exit path."
So the most plausible S38 failure sequence is: terminal closed via X button → OS gives 5s grace period → WAL auto-checkpoint was mid-flight → process killed → kernel drops some file buffers → next open sees malformed DB.
### Why the backup is safe
run_paper_session.py:_create_pre_run_backup() (line 46) uses SQLite's backup() API:
src_conn = sqlite3.connect(str(source))
dst_conn = sqlite3.connect(str(target))
src_conn.backup(dst_conn)
The backup() API is not a file copy — it iterates pages through SQLite's own page cache, respecting WAL state, and produces a target DB that is a consistent snapshot of the source at the moment backup began. This is the same mechanism sqlite3 uses for online backups and it is safe even if the source is actively written while the backup runs.
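A self-contained illustration of that property, using throwaway in-memory databases in place of the real files:

```python
import sqlite3

# Online-backup demo: backup() iterates pages through SQLite itself,
# yielding a consistent snapshot (DBs here are in-memory stand-ins).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sessions (session_id INTEGER, ended_at TEXT)")
src.execute("INSERT INTO sessions VALUES (37, '2026-04-17T20:00:00Z')")
src.commit()

dst = sqlite3.connect(":memory:")
src.backup(dst)  # page-level copy via SQLite, not a file copy

print(dst.execute("PRAGMA integrity_check").fetchone())        # ('ok',)
print(dst.execute("SELECT COUNT(*) FROM sessions").fetchone())  # (1,)
```

`sqlite3.Connection.backup` (Python 3.7+) is the same engine-level mechanism the pre-run backup uses, which is why the resulting file passes integrity checks even when the source is live.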
Concrete implication for neo_live_stage1.db.bak.20260418T192119Z:
- If that timestamp is pre-S38 (i.e., captured during run_paper_session.py startup for S38), the file is a clean, consistent snapshot of everything up to and including S37's close. Restoring it gives you the DB exactly as it stood before S38 began writing.
- The S38 writes are, by definition, not in the backup. They're also not recoverable from the malformed live DB (or we wouldn't be calling it malformed). So there is no restore strategy that recovers S38 writes — the relevant choice is restore-from-backup vs. repair-in-place, and repair-in-place on a WAL-corrupt DB is essentially .dump | .read through sqlite3, which is slow and lossy at best.
Recommendation: restore the backup and continue. Rename the malformed DB to neo_live_stage1.db.malformed_S38_20260418 for postmortem if you want, then rename the backup to neo_live_stage1.db. The pre-run backup sequence will create its own backup on S39 startup (FLAG-027 retention=10), so you don't lose defense-in-depth.
Verification you can run after restore:
# quick_check is structural; this is what FLAG-033 will also do at startup
sqlite3 neo_live_stage1.db "PRAGMA quick_check;"
# expected: "ok"
# sanity: what's the max session_id? should be 38 or 37 depending on whether
# S38's create_session row committed before corruption hit.
sqlite3 neo_live_stage1.db "SELECT MAX(session_id), COUNT(*) FROM sessions;"
If quick_check returns anything other than ok, don't run S39 on that DB — ping me.
### Hardening recommendation — periodic WAL checkpoint on a timer
Single most effective defense against this failure mode: add a synchronous PRAGMA wal_checkpoint(TRUNCATE) on a 60-second timer during the session. TRUNCATE mode forces all WAL pages into the main DB and then truncates the WAL file to zero length. After each successful checkpoint, the WAL file is empty, so the "killed mid-checkpoint" blast radius is bounded to the last ≤60 seconds of writes.
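What TRUNCATE mode does can be seen on a throwaway database (temp path, nothing here touches the live DB):

```python
import os
import sqlite3
import tempfile

# Standalone demo of TRUNCATE-mode checkpointing on a scratch DB file.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()  # committed frames now live in demo.db-wal

busy, _, _ = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
print(busy)                            # 0: checkpoint was not blocked
print(os.path.getsize(path + "-wal"))  # 0: WAL truncated, blast radius reset
```

After each successful TRUNCATE checkpoint the `-wal` file is zero bytes, so a hard kill mid-run can only lose whatever accumulated since the last 60-second tick.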
Sketch (not implemented — flagging for a follow-up branch):
# in NEOEngine._startup, after state manager is ready
def _wal_checkpoint_loop(stop_event: threading.Event, state: StateManager, cadence_s: int = 60):
    while not stop_event.wait(cadence_s):
        try:
            state._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        except Exception as exc:
            log.warning("WAL checkpoint failed", extra={"error": str(exc)})

self._wal_stop = threading.Event()
self._wal_thread = threading.Thread(
    target=_wal_checkpoint_loop,
    args=(self._wal_stop, self._state),
    name="wal-checkpoint",
    daemon=True,
)
self._wal_thread.start()
…and clean shutdown in _shutdown() before self._state.close():
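That step amounts to set-then-join. A standalone demo of the pattern, with plain stand-ins for the engine objects:

```python
import threading
import time

stop = threading.Event()

def checkpoint_loop(stop_event: threading.Event, cadence_s: float = 0.05) -> None:
    # wait() doubles as the cadence timer and returns True as soon as
    # the stop event is set, so shutdown latency is at most one tick.
    while not stop_event.wait(cadence_s):
        pass  # the real loop would run PRAGMA wal_checkpoint(TRUNCATE) here

t = threading.Thread(target=checkpoint_loop, args=(stop,), daemon=True)
t.start()
time.sleep(0.2)

# _shutdown() ordering: signal the loop, join it, then close the DB.
stop.set()
t.join(timeout=5.0)
print(t.is_alive())  # False: the loop exited within one wait() call
```

Because the thread is a daemon, a missed join cannot hang interpreter exit; the explicit set-then-join just guarantees no checkpoint races the `self._state.close()` that follows.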
Caveats and tradeoffs:
- PRAGMA wal_checkpoint serializes against active writers for the duration of the checkpoint — for our write volume this is ~milliseconds and unobservable, but I'd validate in a dry-run paper session before going to live.
- The checkpoint thread accesses state._conn from a non-main thread. StateManager opens the connection with check_same_thread=False (line 62), so this is technically safe, but every write path must hold the GIL-based serialization of the sqlite3 driver. Our writes already do (they all run through _transaction() context manager on the main thread). I'd still audit for any write path that holds a cursor across a time.sleep — if so, the checkpoint could block there.
- If we want to be extra careful on Windows, we can additionally register a SetConsoleCtrlHandler via ctypes, which does catch CTRL_CLOSE_EVENT — but the OS only gives you 5 seconds to clean up afterward (5s on Windows 8+, 10s pre-8), so that's a bandaid not a replacement for the checkpoint thread.
Recommendation: new flag FLAG-035 for "WAL checkpoint cadence + CTRL_CLOSE_EVENT catch on Windows", targeting a separate branch after S39 confirms Phase 7.2 behavior. The Issue 1 fix unblocks S39; the Issue 2 hardening is additive.
## Summary table — priorities as you requested

| Priority | Item | Status |
|---|---|---|
| 1 | Fix ended_at write failure (root cause + fix) | Done. Patch ready. Branch fix/session-closure-ended-at, 1 commit, 4 tests. |
| 2 | Change silent log.debug → log.error for shutdown failures | Done — in the same commit. Specifically on the close_session except handler (the critical one). The 6 telemetry-summary loggers earlier in _shutdown stay at debug — they're diagnostic summaries, not source-of-truth writes. |
| 3 | Recommend WAL checkpoint cadence or other hardening | Recommended above. Design sketched, not yet implemented. Proposed as FLAG-035 on a separate branch after Phase 7.2 runs clean. |
| 4 | Confirm backup restore is safe and complete | Confirmed. neo_live_stage1.db.bak.20260418T192119Z is an atomic SQLite backup — a consistent snapshot of the source at snapshot time. Restore procedure and quick_check verification above. |
## What stays unchanged

- `anchor_max_divergence_bps: 10.0` — untouched
- `base_size_rlusd: 15.0` — untouched
- `risk_engine.py` — untouched
- Fill calculation paths — untouched
- Schema — untouched (this fix writes existing columns; no DDL)
- Phase 7.2 CLOB switch (separate branch `feat/phase-7.2-clob-switch`) — untouched
## Clear to run S39?
From the Issue 1 side: yes, once the patch lands on main. Every future session will persist ended_at correctly, and the error-level log ensures any future regression in this path screams loudly instead of hiding.
From the Issue 2 side: yes, once the backup is restored and PRAGMA quick_check returns ok. The WAL-checkpoint hardening is additive and can wait for a separate branch.
— Orion