
Orion Pre-Code Findings — FLAG-041 halt.reason Taxonomy Leak

To: Vesper (she/her)
From: Orion (he/him)
CC: Katja (Captain), Atlas (he/him)
Date: 2026-04-21
Branch: fix/halt-reason-taxonomy-leak (not yet created)
Type: Pre-code investigation (Q1-Q3 plus one scope question flagged before code)


TL;DR

Q1 and Q2 pass. Q3 has a finding that affects the scope of the fix: the clobber pattern you spec'd for main_loop.py:4282 also exists at run_paper_session.py:415. Every paper session (S1-S42, and the upcoming S43) goes through that path rather than NEOEngine.run(), so fixing only main_loop.py will not change S43's halt.reason: the paper-session loop will still clobber it. I need your ruling on scope before writing code.

Detail inline. No code written, no branch created. Sitting on main.

Q1 — _shutdown fallback behavior with halt_reason=None

Finding: fallback is safe. Lines 1018–1021 of main_loop.py:

existing_reason = self._state.get_engine_state("halt.reason") or ""
_hr = halt_reason or existing_reason or HALT_REASON_UNEXPECTED
self._state.set_engine_state("halt.reason", _hr)

With halt_reason=None (or argument omitted, since the kwarg defaults to None):

| Prior state of halt.reason | Resulting write |
|---|---|
| specific token present (e.g. inventory_truth_halt) | specific token preserved |
| empty or missing | HALT_REASON_UNEXPECTED ("unexpected_halt") |

HALT_REASON_UNEXPECTED is defined at line 142:

HALT_REASON_UNEXPECTED          = "unexpected_halt"

No path writes None or an empty string. The or "" on line 1019 guarantees existing_reason is a string, and the cascade guarantees that at least the unexpected_halt literal lands in halt.reason. The outer try/except at line 1022 swallows errors from the StateManager itself (DB disconnected, etc.); in that case nothing is written and the pre-existing value (if any) stands. That is existing behavior, not affected by the proposed fix.
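The cascade can be exercised in isolation. The following is a minimal sketch, not the engine's code: FakeState and resolve_halt_reason are hypothetical stand-ins for the StateManager and for the three _shutdown lines quoted above. Only the or-cascade and the unexpected_halt literal are taken from main_loop.py.

```python
from typing import Optional

HALT_REASON_UNEXPECTED = "unexpected_halt"  # literal quoted from main_loop.py


class FakeState:
    """Hypothetical in-memory stand-in for the engine's StateManager."""

    def __init__(self) -> None:
        self._kv = {}

    def get_engine_state(self, key: str) -> Optional[str]:
        return self._kv.get(key)

    def set_engine_state(self, key: str, value: str) -> None:
        self._kv[key] = value


def resolve_halt_reason(state: FakeState, halt_reason: Optional[str] = None) -> str:
    """Mirror the quoted precedence: explicit kwarg, then stored value, then fallback."""
    existing_reason = state.get_engine_state("halt.reason") or ""
    _hr = halt_reason or existing_reason or HALT_REASON_UNEXPECTED
    state.set_engine_state("halt.reason", _hr)
    return _hr


# Case 1: a specific token is already stored and no kwarg is passed.
preserved = FakeState()
preserved.set_engine_state("halt.reason", "inventory_truth_halt")
assert resolve_halt_reason(preserved) == "inventory_truth_halt"

# Case 2: nothing stored and no kwarg, so the fallback literal is written.
empty = FakeState()
assert resolve_halt_reason(empty) == "unexpected_halt"

# Case 3: an explicit kwarg wins over a stored token (this is the clobber shape).
clobbered = FakeState()
clobbered.set_engine_state("halt.reason", "inventory_truth_halt")
assert resolve_halt_reason(clobbered, "engine_requested_halt") == "engine_requested_halt"
```

Cases 1 and 2 correspond to the two rows of the table above. Case 3 shows why passing halt_reason= from a caller overrides whatever the triggering path stored.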

Q1 verdict: safe to drop halt_reason= from the run() call.

Q2 — Non-duration _tick() return-False paths

Finding: every return-False site writes halt.reason before returning. Enumerating _tick() (lines 2247–2853) return sites:

| Line | Trigger | Writer of halt.reason | Token written |
|---|---|---|---|
| 2286 | DEGRADED timeout in _tick itself | _escalate_degraded_to_halt() at line 1453 (writes line 1469) | HALT_REASON_INVENTORY_TRUTH ("inventory_truth_halt") |
| 2293 | MODE_HALT observed after _maybe_run_periodic_truth_check | _apply_truth_check_result() at line 2151 → _escalate_degraded_to_halt() (line 2181) | HALT_REASON_INVENTORY_TRUTH |
| 2365 | RiskStatus.HALT (risk engine escalated) | Inline in _tick at lines 2327–2361 (kill-switch, exposure, RPC, ledger, gateway, generic) | HALT_REASON_KILL_SWITCH / HALT_REASON_RISK_XRP_EXPOSURE / HALT_REASON_RISK_RLUSD_EXPOSURE / HALT_REASON_RISK_RPC_FAILURE / HALT_REASON_RISK_STALE_LEDGER / HALT_REASON_RISK_GATEWAY / HALT_REASON_RISK_GENERIC |
| 2379 | ReplayExhausted caught from _market_data.fetch() | Inline at lines 2374–2378 | HALT_REASON_REPLAY_EXHAUSTED |
| 2475 | recon_result.engine_signal == EngineStatus.HALTED | Inline at lines 2458–2466 | HALT_REASON_RECONCILER ("reconciler_halt") |

Return-True sites (informational):

| Line | Outcome |
|---|---|
| 2450 | account_offers fetch failed → continue (no halt) |
| 2847 | tick completed normally → continue |

Each return-False has coverage. The inline writes at 2327–2361, 2374–2378, and 2458–2466 are all wrapped in try/except Exception: pass, so a StateManager write failure during halt classification would be swallowed silently, leaving halt.reason unset. In that edge case, the proposed fix (drop halt_reason= from run()) falls through to HALT_REASON_UNEXPECTED, which is the correct defensive behavior: the halt is still recorded rather than lost.
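The edge case can be sketched as follows. This is an illustration under assumptions, not the engine's code: FlakyState, classify_halt, and shutdown are hypothetical names. The sketch also assumes the state store has recovered by the time the shutdown write runs. If it has not, the outer try/except described under Q1 applies and nothing is written.

```python
HALT_REASON_UNEXPECTED = "unexpected_halt"  # literal quoted from main_loop.py
HALT_REASON_RECONCILER = "reconciler_halt"  # literal quoted from the Q2 table


class FlakyState:
    """Hypothetical state store whose first `failures` writes raise."""

    def __init__(self, failures: int) -> None:
        self._kv = {}
        self._failures_left = failures

    def get_engine_state(self, key):
        return self._kv.get(key)

    def set_engine_state(self, key, value):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise RuntimeError("simulated write failure")
        self._kv[key] = value


def classify_halt(state: FlakyState) -> bool:
    """Stand-in for an inline halt write in _tick(): write errors are swallowed."""
    try:
        state.set_engine_state("halt.reason", HALT_REASON_RECONCILER)
    except Exception:
        pass  # same shape as the try/except Exception: pass noted above
    return False  # tell the caller to stop


def shutdown(state: FlakyState, halt_reason=None) -> None:
    """Stand-in for the _shutdown cascade."""
    existing = state.get_engine_state("halt.reason") or ""
    state.set_engine_state(
        "halt.reason", halt_reason or existing or HALT_REASON_UNEXPECTED
    )


state = FlakyState(failures=1)  # only the classification write fails
should_continue = classify_halt(state)
if not should_continue:
    shutdown(state)  # post-fix call shape: no halt_reason kwarg
print(state.get_engine_state("halt.reason"))  # unexpected_halt
```

With failures=0 the same sequence leaves reconciler_halt in place, which is the normal, non-degraded case.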

Q2 verdict: no uncovered return-False path. Every triggering site writes a specific token before the clobber.

Q3 — Duration-elapsed path

Finding: duration-elapsed is unaffected by the proposed fix — but there is a second clobber site that IS material.

Q3.a — Duration-elapsed path (as asked)

HALT_REASON_DURATION_ELAPSED is written from exactly one site: run_paper_session.py:429 (inside the if not shutdown_called block after the duration-bounded while loop exits). It is NOT reached via _tick() → run() → _shutdown. It is an explicit direct call to engine._shutdown(reason, halt_reason=HALT_REASON_DURATION_ELAPSED) after the loop-exit condition.

while time.monotonic() < deadline:
    should_continue = engine._tick()
    ...

if not shutdown_called:
    engine._shutdown(
        "Paper session duration elapsed",
        halt_reason=HALT_REASON_DURATION_ELAPSED,
    )

Because this path explicitly passes the duration_elapsed token and nothing has previously written halt.reason (no halt trigger fired), _shutdown's precedence halt_reason or existing_reason writes duration_elapsed. Correct.

The proposed fix does not touch this path. Dropping halt_reason= from the run() if not should_continue branch does not affect the paper-session if not shutdown_called branch. Duration-elapsed behavior is preserved.
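A small sketch of the two exits from a duration-bounded loop, under stated assumptions: StubEngine, run_bounded, and the injected clock are hypothetical, and the "duration_elapsed" literal is assumed from the token name used above rather than quoted from the constant's definition. The halt branch is already written in its post-fix shape (no halt_reason kwarg) to show that the duration branch does not depend on it.

```python
import itertools

HALT_REASON_DURATION_ELAPSED = "duration_elapsed"  # assumed literal, not quoted from source
HALT_REASON_UNEXPECTED = "unexpected_halt"         # literal quoted from main_loop.py


class StubEngine:
    """Hypothetical engine whose ticks never request a halt."""

    def __init__(self) -> None:
        self.halt_reason = ""  # stands in for the stored halt.reason value
        self.ticks = 0

    def _tick(self) -> bool:
        self.ticks += 1
        return True

    def _shutdown(self, reason: str, halt_reason=None) -> None:
        self.halt_reason = halt_reason or self.halt_reason or HALT_REASON_UNEXPECTED


def run_bounded(engine: StubEngine, clock, duration: float) -> None:
    """Loop until the deadline; shut down exactly once on either exit."""
    deadline = clock() + duration
    shutdown_called = False
    while clock() < deadline:
        if not engine._tick():
            engine._shutdown("halted by engine")  # post-fix shape: no kwarg
            shutdown_called = True
            break
    if not shutdown_called:
        engine._shutdown(
            "Paper session duration elapsed",
            halt_reason=HALT_REASON_DURATION_ELAPSED,
        )


# A fake clock that advances by one unit per call keeps the sketch deterministic.
fake_clock = itertools.count(0).__next__
engine = StubEngine()
run_bounded(engine, fake_clock, duration=3)
print(engine.halt_reason)  # duration_elapsed
```

Injecting the clock is a choice made for this sketch only. The real loop, as quoted above, reads time.monotonic() directly.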

Q3.b — Second clobber site (new finding)

HALT_REASON_ENGINE_REQUESTED is written at two sites, not one. The second is the paper-session loop at run_paper_session.py:410–419:

while time.monotonic() < deadline:
    should_continue = engine._tick()
    tick_count += 1
    if not should_continue:
        engine._shutdown(
            "Paper session halted by engine",
            halt_reason=HALT_REASON_ENGINE_REQUESTED,
        )
        shutdown_called = True
        _engine_ref.shutdown_called = True
        break

This is structurally identical to the main_loop.py:4277–4284 clobber your tasking memo targets. The comment above that block (lines 4278–4279 of main_loop.py) asserts "halt.reason + halt.detail already written by the triggering halt path"; the same invariant holds for the paper-session loop, because both loops call _tick() and both receive the same return-False semantics.
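The clobber and the effect of dropping the kwarg can be reproduced with a stub. This is a sketch under assumptions, not the engine's code: HaltingEngine and halt_branch are hypothetical, and the stub's _shutdown copies only the precedence quoted under Q1.

```python
HALT_REASON_ENGINE_REQUESTED = "engine_requested_halt"  # literal from main_loop.py:123
HALT_REASON_INVENTORY_TRUTH = "inventory_truth_halt"    # literal from the Q2 table
HALT_REASON_UNEXPECTED = "unexpected_halt"              # literal from main_loop.py


class HaltingEngine:
    """Hypothetical engine whose tick records a specific token, then asks to stop."""

    def __init__(self) -> None:
        self.state = {}

    def _tick(self) -> bool:
        self.state["halt.reason"] = HALT_REASON_INVENTORY_TRUTH
        return False

    def _shutdown(self, reason: str, halt_reason=None) -> None:
        existing = self.state.get("halt.reason") or ""
        self.state["halt.reason"] = halt_reason or existing or HALT_REASON_UNEXPECTED


def halt_branch(engine: HaltingEngine, clobber: bool) -> str:
    """Run one tick and shut down, with or without the halt_reason kwarg."""
    if not engine._tick():
        if clobber:
            # Current shape at both sites: the kwarg overrides the stored token.
            engine._shutdown(
                "Paper session halted by engine",
                halt_reason=HALT_REASON_ENGINE_REQUESTED,
            )
        else:
            # Proposed shape: the stored token survives the cascade.
            engine._shutdown("Paper session halted by engine")
    return engine.state["halt.reason"]


print(halt_branch(HaltingEngine(), clobber=True))   # engine_requested_halt
print(halt_branch(HaltingEngine(), clobber=False))  # inventory_truth_halt
```

The same stub works for either loop, since the only difference between the two sites is which file hosts the call.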

Consequence for S42 / S43: S42 was a paper session. Paper sessions run through run_paper_session.py, not NEOEngine.run(). The engine_requested_halt in S42's DB row was written by line 415 of run_paper_session.py, not line 4282 of main_loop.py. If we fix only main_loop.py per your tasking, S43 will still surface engine_requested_halt in the DB because the paper-session loop will still clobber.

Grep evidence:

neo_engine/main_loop.py:123:HALT_REASON_ENGINE_REQUESTED    = "engine_requested_halt"
neo_engine/main_loop.py:4282:                        halt_reason=HALT_REASON_ENGINE_REQUESTED,
run_paper_session.py:24:    HALT_REASON_ENGINE_REQUESTED,
run_paper_session.py:415:                        halt_reason=HALT_REASON_ENGINE_REQUESTED,

My original investigation memo attributed the clobber to main_loop.py:4298 (line number drift; same block). That was wrong — S42 went through run_paper_session.py:415. Correction noted.

Scope Decision — Need Your Ruling

Two paths exist; both clobber; both are in scope for FLAG-041 ("halt.reason taxonomy leak"). Your tasking memo scopes only main_loop.py. Options:

Option A — Extend this branch to fix both sites. Two file edits, same shape:

- In main_loop.py:4280–4283, drop halt_reason=HALT_REASON_ENGINE_REQUESTED
- In run_paper_session.py:413–416, drop halt_reason=HALT_REASON_ENGINE_REQUESTED

Plus both tests already specified — token preservation test becomes a paper-session test (which matches S42's real path). Branch title and FLAG-041 description cover both. Minimal scope creep — same one-line idea applied in two places.

Option B — This branch fixes only main_loop.py; scope a follow-up for run_paper_session.py. Clean per your tasking. But S43 still surfaces engine_requested_halt in the DB — the dashboard bug isn't actually fixed until the follow-up lands. You'd want the follow-up to ship before S43.

Option C — Reject the finding / disagree with the analysis. Push back if I'm wrong about which path S42 took. I can supply the DB forensic trail or a diff run.

My recommendation is Option A. The fix is the same one-line change in two files; tests can cover both; the outcome is "S43's DB shows the authentic token" which is the actual point of FLAG-041. Scope creep is cosmetic — I wouldn't propose adding an unrelated change, just extending the same change to its second site.

Atlas-locked DEGRADED recovery semantics (Bug #2 from the S42 memo) remain out of scope and deferred, per your tasking.

Imports Needed for Tests

Either option:

- Test 1 (token preservation): if Option A, prefer exercising run_paper_session.py:410–419 in a test context (matches S42's real path). If Option B, exercise NEOEngine.run() in main_loop.py with a mocked _tick.
- Test 2 (fallback safety): hits _shutdown directly; unaffected by the branch decision.

Standing By

No branch pre-created. No code written. main is the current branch. Current working state: clean. Ready to write code once you rule A / B / C.

— Orion