Orion Investigation: S42 engine_requested_halt
To: Vesper (she/her)
From: Orion (he/him)
CC: Katja (Captain), Atlas (he/him)
Date: 2026-04-21
Type: Investigation only (no code, no branch)
TL;DR
S42 halted because the directional drift guard (condition C: no opposing fill for 15 ticks) re-triggered DEGRADED at 19:02:31Z, nothing recovered, and the shared 300 s DEGRADED timeout (in `_tick`, inherited from the wallet-truth reconciler design) auto-escalated to HALT once 300 s had elapsed since `degraded_since` (19:01:24Z). The session closed at 19:06:26Z. The anchor saturation guard had fired earlier, at 19:01:03Z (mean -7.90 bps, 100 % prevalence), and contributed to DEGRADED being active, but the drift guard's re-entry overwrote the `degraded_reason` label.
The halt was protective by the system's current rules, but there are two distinct issues worth flagging:
- `halt.reason` clobber bug. `_escalate_degraded_to_halt` wrote the specific token `inventory_truth_halt`, but `_shutdown` overwrote it with the generic `engine_requested_halt` via the `run()` call at line 4298. The authentic trail survives only in `halt.detail = degraded_timeout_exceeded_300s`.
- DEGRADED recovery semantics are missing for the non-truth guards. The anchor saturation, directional drift, and inventory corridor guards all call `_enter_degraded_mode` but never recover: no `_exit_degraded_mode` pathway exists from any of them. The 300 s timer that was originally a wallet-truth safety net is now inherited by all four DEGRADED triggers, and for the three non-truth guards it is the only exit: to HALT, not to OK.
Both of these are real bugs. The first is a halt-reason taxonomy leak (already the pattern the halt-reason-lifecycle branch was meant to fix). The second is an architectural gap — the protection guards were scoped as "cancel all + stop quoting, recoverable" but in practice are one-way gates to HALT after 5 minutes.
Evidence: S42 DB State
From neo_live_stage1.db (session_id=42), selected engine_state keys at shutdown:
| key | value |
|---|---|
| `halt.reason` | `engine_requested_halt` |
| `halt.detail` | `degraded_timeout_exceeded_300s` |
| `inventory_truth.mode` | `halt` |
| `inventory_truth.degraded_since` | `2026-04-20T19:01:24.767284+00:00` |
| `inventory_truth.degraded_reason` | `directional_drift_guard_C` |
| `inventory_truth.source` | `shutdown` |
| `inventory_truth.status` | `halt` |
From sessions (session_id=42):
| field | value |
|---|---|
| `started_at` | `2026-04-20T18:58:05.692073+00:00` |
| `ended_at` | `2026-04-20T19:06:26.586141+00:00` |
| `halt_reason` | `engine_requested_halt` |
| elapsed | 500.89 s (within 1 s of the 501.81 s reported) |
From circuit_breaker_events where session_id=42:
| id | created_at | breaker | context (summarized) |
|---|---|---|---|
| 1 | 2026-04-20T19:01:03.767Z | `anchor_saturation_guard` | mean=-7.903 bps, prevalence=100.0 %, 25-tick window (tail all -6.846) |
| 2 | 2026-04-20T19:02:31.047Z | `directional_drift_guard` | condition=C, ticks_since_opposing_fill=15, threshold=15, last_fill_side=sell, total_fills_seen=2 |
Timeline:
- 18:58:05 - S42 starts.
- 19:01:03 - Anchor saturation guard fires first (about 3 min in, right after the 25-tick rolling window fills with consistently negative anchor errors). Calls `_enter_degraded_mode("anchor_saturation_guard_exceeded")`.
- 19:01:24 - `degraded_since` reported here (21 s later than the saturation event). Implies a periodic truth check may have briefly exited DEGRADED between 19:01:03 and 19:01:24; the next DEGRADED entry anchored the timer here. (Out of scope for this investigation; flagged for follow-up.)
- 19:02:31 - Directional drift guard condition C fires. Re-enters DEGRADED and updates `degraded_reason` to `directional_drift_guard_C` (re-entry keeps the existing `degraded_since`).
- 19:06:24.767 - 300 s elapsed since `degraded_since`. The next tick evaluation at `main_loop.py:2282` triggers `_escalate_degraded_to_halt("degraded_timeout_exceeded_300s")` → `inventory_truth.mode = halt`, `halt.reason = inventory_truth_halt`, `halt.detail = degraded_timeout_exceeded_300s`.
- 19:06:26.586 - `_tick` returns False. `run()` at lines 4296–4299 calls `_shutdown("halt condition triggered", halt_reason=HALT_REASON_ENGINE_REQUESTED)`. `_shutdown` at line 1020 overwrites `halt.reason` with `engine_requested_halt`. Session closes.
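The timeline arithmetic can be checked directly from the recorded timestamps. The sketch below uses only values from the tables above; the variable names are illustrative, and the millisecond event times are padded to microseconds so that `datetime.fromisoformat` accepts them on older Python versions.

```python
from datetime import datetime, timedelta

DEGRADED_TIMEOUT_S = 300  # shared DEGRADED timeout observed in S42

# Timestamps copied from the sessions, engine_state and circuit_breaker_events rows.
started_at = datetime.fromisoformat("2026-04-20T18:58:05.692073+00:00")
ended_at = datetime.fromisoformat("2026-04-20T19:06:26.586141+00:00")
saturation_fired = datetime.fromisoformat("2026-04-20T19:01:03.767000+00:00")
degraded_since = datetime.fromisoformat("2026-04-20T19:01:24.767284+00:00")
drift_fired = datetime.fromisoformat("2026-04-20T19:02:31.047000+00:00")

# Wall-clock length of the session.
session_elapsed_s = (ended_at - started_at).total_seconds()

# Moment at which the shared timer expires, counted from degraded_since.
timeout_deadline = degraded_since + timedelta(seconds=DEGRADED_TIMEOUT_S)

# Gap between the saturation event and the recorded degraded_since.
gap_to_degraded_since_s = (degraded_since - saturation_fired).total_seconds()

# Counterfactual: when the timer would expire if it had restarted on drift re-entry.
drift_only_deadline = drift_fired + timedelta(seconds=DEGRADED_TIMEOUT_S)

print(f"session elapsed: {session_elapsed_s:.2f} s")
print(f"timeout deadline: {timeout_deadline.isoformat()}")
print(f"saturation to degraded_since gap: {gap_to_degraded_since_s:.1f} s")
print(f"deadline if timer restarted at drift re-entry: {drift_only_deadline.isoformat()}")
```

The deadline falls at 19:06:24.767, before the session end, while a timer restarted at the drift guard's re-entry would not have expired until 19:07:31. That is consistent with re-entry keeping the existing `degraded_since`.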
Q1: What triggered engine_requested_halt?
Not a single trigger — it was a composite.
Proximate cause (where the string engine_requested_halt landed in halt.reason): run() at main_loop.py:4296–4299 calls _shutdown with halt_reason=HALT_REASON_ENGINE_REQUESTED any time _tick() returns False:
```python
while True:
    should_continue = self._tick()
    if not should_continue:
        # halt.reason + halt.detail already written by the
        # triggering halt path (risk, reconciler, replay).
        self._shutdown(
            "halt condition triggered",
            halt_reason=HALT_REASON_ENGINE_REQUESTED,
        )
        break
```
The comment asserts that the specific halt path will have already written halt.reason. The problem is _shutdown at line 1020 makes the halt_reason parameter win over existing_reason:
```python
existing_reason = self._state.get_engine_state("halt.reason") or ""
_hr = halt_reason or existing_reason or HALT_REASON_UNEXPECTED
self._state.set_engine_state("halt.reason", _hr)
```
So the explicit HALT_REASON_ENGINE_REQUESTED argument always clobbers the specific token unless the caller passes halt_reason=None. run() hardcodes HALT_REASON_ENGINE_REQUESTED, so it always clobbers.
Ultimate cause (what actually made _tick() return False): the DEGRADED-timeout escalation at main_loop.py:2282–2286:
```python
if self._current_truth_mode() == MODE_DEGRADED and self._degraded_timeout_exceeded():
    self._escalate_degraded_to_halt(
        f"degraded_timeout_exceeded_{self._config.wallet_reconciliation.degraded_timeout_s}s"
    )
    return False
```
_escalate_degraded_to_halt (line 1453) persisted the authentic token halt.reason = inventory_truth_halt (the HALT_REASON_INVENTORY_TRUTH constant at line 132 — string value "inventory_truth_halt"). _shutdown then overwrote it.
What put the engine in DEGRADED in the first place: both guards.
- Anchor saturation guard entered DEGRADED at 19:01:03 (the saturation context shows 100 % prevalence, a mean of -7.903 bps, and the tail of the 25-tick window all at -6.846 bps; the live market was consistently off mid-price).
- Directional drift guard re-entered DEGRADED at 19:02:31 via condition C (no opposing fill for 15 ticks; only 2 fills all session, last side sell). That re-entry is the one that set the `degraded_reason` we see now.
The anchor saturation guard fired first and most forcefully (−7.9 bps mean, 100 % prevalence), which is what Vesper's tasking memo was observing. The drift guard's trigger is a secondary symptom of the same illiquid/one-sided market that was generating the anchor saturation. None of this was an uncaught exception — the code paths were the explicit guard-triggered ones.
Q2: Is there a DEGRADED→HALT escalation path inside _evaluate_anchor_saturation_guard?
No. The saturation guard itself does not escalate. _evaluate_anchor_saturation_guard (line 1480) on trigger:
- Logs a WARNING with the threshold context.
- Best-effort writes a `circuit_breaker_events` row (`breaker=anchor_saturation_guard`).
- Calls `self._enter_degraded_mode("anchor_saturation_guard_exceeded")`.
- Returns `([], triggered_now)`: intents cleared, guard flag set.
It never returns False, never sets halt.reason, never calls _escalate_degraded_to_halt. Same pattern holds for _evaluate_directional_drift_guard (line 1571) and _evaluate_inventory_corridor_guard (line 1819) — all three enter DEGRADED only.
What does the escalation: the FLAG-036 wallet-truth timeout guard at main_loop.py:2282. It is not scoped to any one guard — it evaluates self._current_truth_mode() == MODE_DEGRADED and escalates regardless of which path entered DEGRADED. The field KEY_MODE (aka inventory_truth.mode) is a shared lock across all four DEGRADED triggers (wallet truth, anchor saturation, directional drift, inventory corridor).
This is the architectural gap. The saturation guard's spec (and the others') say "DEGRADED is recoverable — the loop continues." The implementation says DEGRADED is recoverable only by the wallet truth check returning ok. If the wallet truth is fine (as in S42, where the guards tripped on market regime not on wallet divergence), there is no recovery path — the 300 s timer is the sole exit, and it exits to HALT.
Spec vs. implementation:
- Spec (Atlas-locked Apr 19): "DEGRADED mode (new, Atlas-mandated): intermediate state between WARN and HALT — cancel all orders, stop quoting, continue reconciliation, recoverable without restart."
- Implementation in S42: cancel all orders ✅, stop quoting ✅, continue tick loop ✅, continue reconciliation ✅, recoverable without restart ❌. Only the wallet truth check's `_exit_degraded_mode` can clear DEGRADED, and it runs only when the truth check returns `ok`. An anchor-saturation-triggered DEGRADED will never be cleared this way unless a truth check happens to run and return `ok` while in DEGRADED; even then, exiting DEGRADED does not re-arm the saturation guard's window, so the next stressed tick could re-trigger it.
So: the saturation guard does not escalate directly — but it inherits an escalation-to-HALT path from the wallet-truth timeout that was not explicitly designed into its spec.
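The inherited escalation can be reproduced with a toy model of the shared mode field. This is a sketch under assumptions, not the engine's code: `TruthModeModel` and its methods are hypothetical stand-ins for `_enter_degraded_mode` and the timeout check in `_tick`, the `>=` comparison is assumed, and the times are illustrative offsets in seconds rather than S42 values.

```python
from dataclasses import dataclass
from typing import Optional

MODE_OK, MODE_DEGRADED, MODE_HALT = "ok", "degraded", "halt"


@dataclass
class TruthModeModel:
    """Toy model of the shared inventory_truth.mode field (hypothetical class)."""

    mode: str = MODE_OK
    degraded_since: Optional[float] = None
    degraded_reason: Optional[str] = None

    def enter_degraded(self, reason: str, now_s: float) -> None:
        if self.mode == MODE_HALT:
            return  # HALT is terminal in this model
        if self.mode != MODE_DEGRADED:
            # First entry anchors the timer.
            self.mode = MODE_DEGRADED
            self.degraded_since = now_s
        # Re-entry keeps degraded_since but relabels the reason.
        self.degraded_reason = reason

    def tick(self, now_s: float, timeout_s: float = 300.0) -> bool:
        """Return False when the shared timeout escalates DEGRADED to HALT."""
        if self.mode == MODE_DEGRADED and now_s - self.degraded_since >= timeout_s:
            self.mode = MODE_HALT
            return False
        return True


model = TruthModeModel()
model.enter_degraded("anchor_saturation_guard_exceeded", now_s=0.0)
model.enter_degraded("directional_drift_guard_C", now_s=66.0)
still_running = model.tick(now_s=299.0)  # timer not yet expired
halted = not model.tick(now_s=300.0)     # escalates regardless of which guard entered
```

Neither guard contains an escalation of its own in this model; the timeout fires purely because the mode is DEGRADED, and the surviving label is whichever guard entered last.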
Q3: All engine_requested_halt call sites in main_loop.py
HALT_REASON_ENGINE_REQUESTED = "engine_requested_halt" is declared once at line 123 and referenced once as a value in line 4298.
Only one call site sets halt.reason = "engine_requested_halt":
main_loop.py:4296–4299 inside NEOEngine.run():
```python
if not should_continue:
    self._shutdown(
        "halt condition triggered",
        halt_reason=HALT_REASON_ENGINE_REQUESTED,
    )
    break
```
This is the only path that writes this token. It fires whenever _tick() returns False. The _tick() return-False paths in order of tick logic:
| Line | Return-False trigger | Specific halt.reason already written |
|---|---|---|
| 2286 | DEGRADED timeout exceeded (all 4 guards inherit this) | inventory_truth_halt (then clobbered) |
| 2293 | Periodic truth check just escalated into MODE_HALT | inventory_truth_halt (then clobbered) |
| 2365 | Risk engine triggered HALT (exposure caps) | specific risk token (then clobbered) |
| 2379 | Replay adapter exhausted | replay_exhausted (then clobbered) |
| 2475 | Reconciler FATAL | reconciler_fatal (then clobbered) |
Every return-False path currently gets its authentic token overwritten by engine_requested_halt when _shutdown runs. S42 is an instance of the DEGRADED-timeout path (line 2286). The fact that halt.detail = "degraded_timeout_exceeded_300s" survives in the DB is the smoking gun — only _escalate_degraded_to_halt writes that string (line 2284), and its companion halt.reason write is what got clobbered.
This means the engine_requested_halt label in the DB does not meaningfully distinguish which guard or subsystem triggered the halt. halt.detail is currently the only reliable attribution field.
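Until the label is fixed, attribution has to be read from `halt.detail` alongside `halt.reason`. A minimal sketch of such a lookup follows. The `engine_state(session_id, key, value)` schema is an assumption made for illustration (the memo only shows keys and values), `halt_attribution` is a hypothetical helper, and the demo runs against an in-memory database seeded with the S42 values rather than against `neo_live_stage1.db`.

```python
import sqlite3

ATTRIBUTION_KEYS = ("halt.reason", "halt.detail", "inventory_truth.degraded_reason")


def halt_attribution(conn: sqlite3.Connection, session_id: int) -> dict:
    """Collect the fields needed to attribute a halt (assumed schema)."""
    placeholders = ",".join("?" for _ in ATTRIBUTION_KEYS)
    rows = conn.execute(
        "SELECT key, value FROM engine_state "
        f"WHERE session_id = ? AND key IN ({placeholders})",
        (session_id, *ATTRIBUTION_KEYS),
    ).fetchall()
    state = dict(rows)
    return {
        "reported_reason": state.get("halt.reason"),
        "detail": state.get("halt.detail"),
        "degraded_reason": state.get("inventory_truth.degraded_reason"),
    }


# Demo data: the S42 values quoted in the evidence section.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engine_state (session_id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO engine_state VALUES (?, ?, ?)",
    [
        (42, "halt.reason", "engine_requested_halt"),
        (42, "halt.detail", "degraded_timeout_exceeded_300s"),
        (42, "inventory_truth.degraded_reason", "directional_drift_guard_C"),
    ],
)
s42 = halt_attribution(conn, 42)
print(s42)
```

For S42 the generic `reported_reason` says nothing about the trigger; `detail` identifies the DEGRADED-timeout path and `degraded_reason` names the last guard to enter.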
Q4: Was the halt protective or erroneous?
Protective by current rules, but the rules themselves have two distinct bugs.
Protective
Given how the system is wired today, the halt was correct. The engine had been in DEGRADED for > 300 s without recovery, consistent guard triggers indicated a sustained off-mid anchor + one-sided fill flow, and the 300 s timer is the intentional backstop. Letting the engine sit in DEGRADED indefinitely — cancelled orders, paused quoting, live reconciliation — is not cheaper than halting; the timer's job is to force an operator re-evaluation. If Katja had not been at the terminal, the halt ensured the engine was not silently paused for hours. That part works.
Bug #1: halt.reason taxonomy leak (halt-reason-lifecycle regression)
_shutdown's halt_reason or existing_reason precedence (line 1020) silently eats the authentic token that the specific halt path wrote. run() passes HALT_REASON_ENGINE_REQUESTED as a fallback (per its comment on line 4294), but the parameter order makes it a winner. Every non-duration halt reported through run() surfaces as engine_requested_halt in the DB, obscuring whether the trigger was risk, reconciler, replay, truth, or a guard-induced DEGRADED timeout.
The previous fix/halt-reason-lifecycle branch closed the case where duration-elapsed shutdowns silently preserved stale strings. This case is the mirror: specific strings silently overwritten by a generic one. The operator has to read halt.detail to find out what actually happened — which is exactly what the taxonomy was meant to avoid.
Fix shape (for Vesper scoping, not this branch): either (a) swap precedence in _shutdown to existing_reason or halt_reason or HALT_REASON_UNEXPECTED, or (b) drop halt_reason=HALT_REASON_ENGINE_REQUESTED from run()'s call and rely on the existing-reason path (the comment on line 4294 already promises this). Option (a) keeps the fallback safety net; option (b) is simpler.
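The two fix shapes can be compared with a small stand-alone model of the precedence expression. This is a sketch, not the engine's `_shutdown`: the function names are hypothetical, and the string value of `HALT_REASON_UNEXPECTED` is a placeholder because the memo does not quote it.

```python
from typing import Optional

HALT_REASON_ENGINE_REQUESTED = "engine_requested_halt"
HALT_REASON_INVENTORY_TRUTH = "inventory_truth_halt"
HALT_REASON_UNEXPECTED = "unexpected_halt"  # placeholder value, assumed


def resolve_current(halt_reason: Optional[str], existing_reason: str) -> str:
    """Current precedence: the explicit argument wins over the stored token."""
    return halt_reason or existing_reason or HALT_REASON_UNEXPECTED


def resolve_option_a(halt_reason: Optional[str], existing_reason: str) -> str:
    """Option (a): the stored token wins; the argument is only a fallback."""
    return existing_reason or halt_reason or HALT_REASON_UNEXPECTED


# S42: the halt path stored inventory_truth_halt, then run() passed the generic token.
clobbered = resolve_current(HALT_REASON_ENGINE_REQUESTED, HALT_REASON_INVENTORY_TRUTH)
fixed_a = resolve_option_a(HALT_REASON_ENGINE_REQUESTED, HALT_REASON_INVENTORY_TRUTH)

# Option (b): run() passes None, so the current expression already keeps the stored token.
fixed_b = resolve_current(None, HALT_REASON_INVENTORY_TRUTH)

print(clobbered, fixed_a, fixed_b)
```

Option (a) relies on `halt.reason` being empty unless a halt path wrote it during the current session; whether that holds should be verified against the earlier `fix/halt-reason-lifecycle` work before choosing it.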
Bug #2: Non-truth guards have no DEGRADED→OK recovery path
_exit_degraded_mode (line 1388) is called only from the wallet truth check's post-check handling (when the checker returns ok while in DEGRADED). There is no analogous "saturation guard reports mean error has returned to < bias_threshold_bps" or "drift guard reports opposing fills are back" recovery path. Consequence: every non-wallet guard that trips DEGRADED is in fact a one-way gate to HALT after 300 s, not a recoverable state.
This conflicts with Atlas's DEGRADED spec ("recoverable without restart") and is what makes the S42 halt feel like a bug even though the mechanism is working as coded. The guard spec calls for a recoverable state; the implementation provides only an auto-HALT after 300 s.
Fix shape (for Vesper/Atlas scoping): recovery hooks per guard. Anchor saturation: re-evaluate after window rotates and mean is back inside threshold with prevalence below threshold. Directional drift: re-evaluate after an opposing fill lands. Inventory corridor: re-evaluate after xrp_pct returns inside the corridor for corridor_lookback_ticks consecutive ticks. Each exits DEGRADED via a guard-specific path that calls _exit_degraded_mode. This is meaningful scope — it's multiple branches of real work, not a quick fix.
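One possible shape for those hooks is sketched below. Everything here is hypothetical: the predicate names, the corridor bounds, the prevalence threshold, and the `ActiveDegradedGuards` helper are illustrations and do not exist in `main_loop.py`. The helper tracks the set of guards currently holding DEGRADED, since S42 shows that a single `degraded_reason` label gets overwritten on re-entry and so cannot say whether every trigger has cleared.

```python
from dataclasses import dataclass, field
from typing import Sequence, Set


def saturation_recovered(mean_error_bps: float, prevalence_pct: float,
                         bias_threshold_bps: float,
                         prevalence_threshold_pct: float) -> bool:
    """Mean back inside the bias threshold and prevalence below its threshold."""
    return (abs(mean_error_bps) < bias_threshold_bps
            and prevalence_pct < prevalence_threshold_pct)


def drift_recovered(ticks_since_opposing_fill: int, threshold: int) -> bool:
    """An opposing fill has landed recently enough to reset condition C."""
    return ticks_since_opposing_fill < threshold


def corridor_recovered(xrp_pct_history: Sequence[float], corridor_low: float,
                       corridor_high: float, corridor_lookback_ticks: int) -> bool:
    """xrp_pct inside the corridor for the last corridor_lookback_ticks ticks."""
    if corridor_lookback_ticks <= 0:
        return False
    recent = list(xrp_pct_history)[-corridor_lookback_ticks:]
    return (len(recent) == corridor_lookback_ticks
            and all(corridor_low <= p <= corridor_high for p in recent))


@dataclass
class ActiveDegradedGuards:
    """Tracks which guards currently hold the engine in DEGRADED."""

    active: Set[str] = field(default_factory=set)

    def entered(self, guard: str) -> None:
        self.active.add(guard)

    def recovered(self, guard: str) -> bool:
        """Mark one guard as recovered; True means no guard still holds DEGRADED."""
        self.active.discard(guard)
        return not self.active


guards = ActiveDegradedGuards()
guards.entered("anchor_saturation_guard")
guards.entered("directional_drift_guard")
exit_after_drift = guards.recovered("directional_drift_guard")
exit_after_both = guards.recovered("anchor_saturation_guard")
```

Only when `recovered` returns True would a guard-specific path go on to call `_exit_degraded_mode`; in the S42 pattern, an opposing fill alone would not have cleared DEGRADED while the anchor saturation condition persisted.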
Shorter-term palliative (also in scope for Vesper decisions): decouple the 300 s timeout per DEGRADED trigger type, or raise the default, so a saturation-triggered DEGRADED doesn't escalate to HALT as fast as a wallet-truth DEGRADED. The current 300 s is spec'd for wallet truth where delta is binary (ok or halt-worthy); for market-regime guards it is aggressive.
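A per-trigger timeout could be as small as a lookup with the current 300 s as the default. The sketch below is illustrative only: the 900 s overrides are made-up numbers, not proposed values, the function names are hypothetical, and `wallet_truth_divergence` stands in for whatever reason string the wallet-truth path actually uses.

```python
DEFAULT_DEGRADED_TIMEOUT_S = 300.0  # current shared timeout

# Hypothetical overrides for market-regime guards; values are placeholders.
TIMEOUT_OVERRIDES_S = {
    "anchor_saturation_guard_exceeded": 900.0,
    "directional_drift_guard_C": 900.0,
}


def degraded_timeout_s(degraded_reason: str) -> float:
    """Timeout for a given DEGRADED reason, falling back to the shared default."""
    return TIMEOUT_OVERRIDES_S.get(degraded_reason, DEFAULT_DEGRADED_TIMEOUT_S)


def degraded_timeout_exceeded(degraded_reason: str, elapsed_s: float) -> bool:
    """True when DEGRADED has lasted at least as long as its reason-specific timeout."""
    return elapsed_s >= degraded_timeout_s(degraded_reason)


# S42 escalated at 300 s under the drift label; with the override it would not have.
s42_would_escalate = degraded_timeout_exceeded("directional_drift_guard_C", 300.0)
wallet_would_escalate = degraded_timeout_exceeded("wallet_truth_divergence", 300.0)
```

Because the lookup keys on `degraded_reason`, and S42 shows that label being rewritten on re-entry, any real version would need a rule for which timeout applies when several guards are active at once.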
Before S43
S43 would repeat S42 under similar market conditions unless at least one of the following is in place:
- Operator acceptance that S43 will halt if the regime persists beyond 300 s (no code change, just set expectations).
- Bug #1 fixed so the dashboard correctly labels the halt as a DEGRADED-timeout escalation from a specific guard rather than `engine_requested_halt`.
- Bug #2 (or its palliative) fixed so the non-truth guards have a recovery path, and DEGRADED is actually recoverable.
My recommendation is that Vesper and Atlas rule on whether S43 proceeds as-is (acceptance) or whether we scope a halt-lifecycle / guard-recovery branch first. I can scope either — just waiting for the call.
Deliverable Status
Investigation only — no code, no branch. Raw findings and my recommended fix shapes are above. Happy to convert either bug into a branch when Vesper scopes it.
— Orion