Orion Pre-Code Findings — FLAG-042 DEGRADED Recovery¶
To: Vesper (she/her)
From: Orion (he/him)
CC: Katja (Captain), Atlas (he/him)
Date: 2026-04-21
Branch: feat/flag-042-degraded-recovery (not yet created)
Type: Pre-code investigation — Q1–Q5 + scope recommendation before code
TL;DR¶
Q1–Q4 pass — every invariant the tasking assumes is in place. Q5 (scope of drift/corridor recovery) needs your ruling: both hooks are cleanly implementable, but they push the branch past the "<30 lines each" cap you quoted. My recommendation at the bottom is to ship anchor recovery fully this branch and ship drift + corridor with minimal exit hooks (one simple condition each, one test each) rather than defer — the infrastructure (episode counter, state-reset helper, _recovery_enabled disable flag) has to exist either way, and the incremental cost is small.
No branch created. No code written. On main; working tree clean apart from check_capital_baseline.py, an outstanding untracked file unrelated to this branch.
Q1 — How does _exit_degraded_mode() work?¶
Location: neo_engine/main_loop.py:1388–1417
Signature: def _exit_degraded_mode(self) -> None — no args, idempotent. Early-returns with a debug log if _current_truth_mode() != MODE_DEGRADED.
What it resets:
self._degraded_since_epoch = None # process cache
self._state.set_engine_state(KEY_MODE, MODE_OK) # DB
self._state.set_engine_state(KEY_DEGRADED_SINCE, "") # DB
self._state.set_engine_state(KEY_DEGRADED_REASON, "") # DB
That's all — no re-arm of orders, no reset of guard-specific rolling windows, no reset of one-shot flags.
Caller (the ONLY one today): _apply_truth_check_result at main_loop.py:2195 — the wallet-truth ok path. Anchor / drift / corridor guards have no exit path today: they transition one-way into DEGRADED and stay there until the 300s timeout escalates to HALT.
Re-enabling order flow: not explicit. After _exit_degraded_mode runs, the next _tick reaches Step 8 (strategy intent generation) and Step 9 (submit) normally because the pre-trade gate (C5) only blocks submits while MODE_DEGRADED / MODE_HALT. So "resume quoting" is implicit — once the mode flips back to MODE_OK, Step 9 stops refusing and quoting restarts on the very next tick.
Conclusion: _exit_degraded_mode is the correct building block. It clears DEGRADED bookkeeping cleanly and the tick loop resumes order flow without additional plumbing. For FLAG-042 the recovery path needs to layer state reset (rolling windows + guard one-shot flags + any guard counters) on top of the existing exit and not call _exit_degraded_mode directly from the guards — we'll wrap it.
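A minimal sketch of the wrapper shape proposed above, under stated assumptions: EngineSketch and _exit_degraded_via_recovery are stand-in names invented here, the mode constants are placeholders for the real ones, and only the attributes this memo names are modeled.

```python
from collections import deque

MODE_OK = "OK"              # stand-in constants; the real ones are
MODE_DEGRADED = "DEGRADED"  # re-exported by neo_engine.inventory_truth_checker

class EngineSketch:
    """Stand-in for the engine holding only the state this memo names."""

    def __init__(self, lookback_ticks: int = 25):
        self.mode = MODE_DEGRADED
        self._anchor_error_window = deque(maxlen=lookback_ticks)
        self._anchor_guard_triggered_this_session = True

    def _exit_degraded_mode(self) -> None:
        # Stand-in for the real method: clears DEGRADED bookkeeping only,
        # no guard-state reset (per Q1 findings).
        self.mode = MODE_OK

    def _exit_degraded_via_recovery(self) -> None:
        # Proposed wrapper: reset guard-local state FIRST, then reuse the
        # existing exit, so a re-trigger needs a freshly populated window.
        self._anchor_error_window.clear()
        self._anchor_guard_triggered_this_session = False
        self._exit_degraded_mode()
```

The ordering (reset before exit) matters only for log cleanliness; both effects land within the same tick either way.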
Q2 — Anchor guard state machine¶
Evaluator: _evaluate_anchor_saturation_guard at main_loop.py:~1480–1569.
Step sequence in _tick (top to bottom):
- Top of tick: DEGRADED timeout check → escalate to HALT if exceeded. Then MODE_HALT guard.
- Step 1–7: risk check, market data, reconciler, snapshot, cancels.
- Step 8: _strategy.calculate_quote → populates _anchor_error_window via div_bps = self._strategy.last_anchor_divergence_bps at main_loop.py:2563.
- Step 8.5: _evaluate_anchor_saturation_guard(intents) at main_loop.py:2676.
- Step 8.5b: drift guard.
- Step 8.5c: corridor guard.
- Step 9: submit.
Trigger path (line 1509–1569): requires (a) window fully populated, (b) abs(mean) >= bias_threshold_bps, (c) %(|x| > prevalence_threshold_bps) >= prevalence_pct. On first trigger: sets _anchor_guard_triggered_this_session = True (one-shot for log + circuit_breaker_events dedup), then unconditionally calls self._enter_degraded_mode("anchor_saturation_guard_exceeded") every tick while condition holds.
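The (a)/(b)/(c) trigger conditions can be sketched as a pure predicate. This is not the real evaluator; the entry bias threshold of 7 bps comes from this memo's hysteresis note, while prevalence_threshold_bps = 10 and prevalence_pct = 50 are illustrative assumptions.

```python
from typing import Sequence

def anchor_guard_should_trigger(
    window: Sequence[float],
    lookback_ticks: int = 25,
    bias_threshold_bps: float = 7.0,        # current entry threshold
    prevalence_threshold_bps: float = 10.0,  # assumed value
    prevalence_pct: float = 50.0,            # assumed value
) -> bool:
    # (a) window must be fully populated
    if len(window) < lookback_ticks:
        return False
    # (b) mean bias magnitude at or above threshold
    mean = sum(window) / len(window)
    if abs(mean) < bias_threshold_bps:
        return False
    # (c) prevalence of large-magnitude errors at or above the cutoff
    prevalent = sum(1 for x in window if abs(x) > prevalence_threshold_bps)
    return 100.0 * prevalent / len(window) >= prevalence_pct
```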
State persisted between ticks (all instance-local):
- self._anchor_error_window: deque[float] — maxlen = lookback_ticks (25 default), bounded.
- self._anchor_guard_triggered_this_session: bool — one-shot.
What "reset rolling windows and guard counters" means concretely:
- self._anchor_error_window.clear()
- self._anchor_guard_triggered_this_session = False
That's the full per-guard reset. Guard re-evaluation then needs lookback_ticks new ticks before it can re-trigger — which itself enforces some settling time post-recovery.
Q3 — Does the anchor error window keep updating in DEGRADED?¶
Yes. The window append at line 2563 is inside Step 8's if snapshot.is_valid(): block and runs every tick regardless of truth mode. The only short-circuit before Step 8 is either (a) 300s timeout → HALT → return False, or (b) MODE_HALT observed → return False. Neither gates Step 8 while MODE_DEGRADED.
Consequence: the recovery monitor does NOT need its own data-collection path. It can read the same _anchor_error_window the entry guard reads. The window is the source of truth for both entry evaluation and exit evaluation.
One wrinkle worth flagging: the window is bounded (deque(maxlen=lookback_ticks)). By the time a fresh tick reaches the recovery monitor, up to 25 DEGRADED-era ticks' worth of anchor data are already averaged in. That's actually fine for recovery — when conditions genuinely normalize, the new clean values will displace the saturated ones within lookback_ticks ticks, and the exit threshold (abs(mean) < 4 + prevalence < 30% — tighter than entry) will then fire. But it means the recovery_stability_ticks timer should be counted from the tick the exit conditions first hold, not from when DEGRADED was entered — otherwise we'd exit based on stale history.
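The stability-timer behavior argued for here can be sketched as a small counter: count from the first tick the exit conditions hold, and restart whenever they stop holding. RecoveryStabilityTimer is a hypothetical helper name, not existing code.

```python
class RecoveryStabilityTimer:
    """Counts consecutive ticks on which the exit conditions hold."""

    def __init__(self, recovery_stability_ticks: int = 30):
        self.required = recovery_stability_ticks
        self.ticks_stable = 0

    def update(self, exit_conditions_hold: bool) -> bool:
        """Call once per tick; returns True once exit is safe."""
        if exit_conditions_hold:
            self.ticks_stable += 1
        else:
            self.ticks_stable = 0  # any dirty tick restarts the clock
        return self.ticks_stable >= self.required
```

Because the counter only starts accumulating when conditions first hold, stale DEGRADED-era history in the bounded window cannot satisfy the timer on its own.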
Q4 — Episode tracking: what exists?¶
Nothing. Grep across the repo for degraded_entry_count, degraded_count, recovery_count, episode → zero hits. No existing counter, no engine_state key, no session-table column.
Proposed additions:
- engine_state key degraded.entry_count (int, stored as a string). Incremented in _enter_degraded_mode ONLY on the not-already_degraded branch (first entry, not idempotent re-entry).
- engine_state key degraded.recovery_count (int). Incremented by the recovery exit path on a successful exit (not by the wallet-truth ok exit path, which gets its own scoping below).
- Reset both to 0 in _startup()'s fresh-session clear block (next to the KEY_MODE / halt.reason resets I just added in fix/startup-mode-reset — same pattern, same block).
Enforcement point: in _enter_degraded_mode, before flipping to MODE_DEGRADED, check:
if (
not already_degraded
and cfg.degraded_recovery.max_recovery_attempts_per_episode > 0
and entry_count >= 1 + cfg.degraded_recovery.max_recovery_attempts_per_episode
):
# Second (or greater) entry in the episode -> escalate directly to HALT.
self._escalate_degraded_to_halt(f"second_degraded_entry;{reason}")
return
With max_recovery_attempts_per_episode = 1:
- Entry 1 → entry_count becomes 1. Proceeds normally. 1 < (1 + 1) so OK.
- Recovery → recovery_count becomes 1.
- Entry 2 → entry_count becomes 2 at check time. 2 >= 2 → escalate to HALT.
Wallet-truth exit path scoping: Vesper, do we want the wallet-truth ok exit to count as a "recovery" for the one-per-episode cap? Two choices:
- (a) Wallet-truth exit IS a recovery → counts against the cap. If the wallet check recovers, then an anchor guard fires, entry_count=2 → HALT. Symmetric, simplest rule.
- (b) Wallet-truth exit is NOT a recovery → only guard recoveries count. Wallet truth can loop indefinitely (reasonable — wallet mismatches aren't market-regime-driven and may transiently resolve of their own accord).
Atlas's ruling Section 5 ("one recovery per episode") doesn't distinguish the cause. I lean (a) — simplest, matches the ruling as literal — but this is a design call. Recommend you rule before code.
Q5 — Drift and corridor guard recovery hooks — in this branch, or defer?¶
Per Atlas's ruling "keep minimal for now" and your Secondary scope note:
Directional drift recovery¶
Spec (yours): exit DEGRADED when opposing fill observed OR N ticks with no same-side fills.
Reality check: Condition A (burst) and Condition B (net notional) auto-clear as their rolling windows expire (30s / 120s by default). Condition C (no opposing) does NOT auto-clear — _drift_ticks_since_opposing_fill keeps incrementing unless a fill arrives.
Minimal hook: while DEGRADED on any drift condition, re-run the three evaluators passively each tick. If NONE currently trigger AND we've been settled for recovery_stability_ticks_drift ticks → exit. ~25–30 lines including state + eval. Matches "mirror existing drift logic in reverse."
Reset on exit:
- self._drift_fill_events.clear(), self._drift_net_notional_events.clear()
- self._drift_ticks_since_opposing_fill = 0
- self._drift_guard_triggered_this_session = False
- self._drift_last_fill_side = None
- Leave _drift_fills_seen_this_session alone (session-cumulative counter, not guard state).
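The drift reset list above, sketched as code. The class is a stand-in holding only the attributes this memo names; the real reset would be a method on the engine.

```python
from collections import deque

class DriftGuardStateSketch:
    """Stand-in container for the drift guard's instance-local state."""

    def __init__(self):
        self._drift_fill_events = deque()
        self._drift_net_notional_events = deque()
        self._drift_ticks_since_opposing_fill = 0
        self._drift_guard_triggered_this_session = False
        self._drift_last_fill_side = None
        self._drift_fills_seen_this_session = 0  # session-cumulative

    def reset_on_recovery_exit(self) -> None:
        self._drift_fill_events.clear()
        self._drift_net_notional_events.clear()
        self._drift_ticks_since_opposing_fill = 0
        self._drift_guard_triggered_this_session = False
        self._drift_last_fill_side = None
        # Deliberately NOT touching _drift_fills_seen_this_session:
        # it is a session-cumulative counter, not guard state.
```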
Inventory corridor recovery¶
Spec (yours): exit DEGRADED when inventory % inside corridor for corridor_lookback_ticks consecutive ticks.
Reality check: existing guard already resets _corridor_ticks_outside to 0 every tick inventory is inside (line 1960). For recovery, we need the mirror counter: _corridor_ticks_inside incrementing when inside, resetting when outside. Exit when >= corridor_lookback_ticks (reuse same param per your spec). ~15–20 lines including state.
Reset on exit:
- self._corridor_ticks_outside = 0, self._corridor_ticks_inside = 0
- self._corridor_guard_triggered_this_session = False
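The mirror-counter idea, sketched. Reuses corridor_lookback_ticks for both directions per the spec; the class name and the default of 20 are assumptions for illustration.

```python
class CorridorCounters:
    """Tracks consecutive ticks inside/outside the inventory corridor."""

    def __init__(self, corridor_lookback_ticks: int = 20):  # default assumed
        self.lookback = corridor_lookback_ticks
        self.ticks_outside = 0
        self.ticks_inside = 0

    def update(self, inside_corridor: bool) -> bool:
        """Call once per tick; returns True when the recovery exit holds."""
        if inside_corridor:
            self.ticks_outside = 0   # mirrors the existing reset (line 1960)
            self.ticks_inside += 1
        else:
            self.ticks_inside = 0    # any outside tick restarts the clock
            self.ticks_outside += 1
        return self.ticks_inside >= self.lookback
```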
Recommendation¶
Option A — anchor only (per your "Implement hooks if < 30 lines; skip if they require significant investigation"): ship anchor recovery with all 10 tests. Drift and corridor stay one-way-to-HALT. Defer both to a follow-up branch.
Option B — ship all three in this branch: anchor fully tested; drift + corridor with one exit test each (so ≥12 tests total). The shared infrastructure (episode counter, YAML config loading, recovery monitor step) is reused by all three, making the incremental cost ~50 lines of code + 2 tests for drift/corridor combined. Still well within the "minimal" spirit.
Option C — reject part of the spec: drift/corridor exit conditions are too simple (single-tick no-opposing, single-tick inside-corridor) to be safe exit signals in mixed regimes. Require hysteresis + stability on those two too, which Atlas hasn't ruled on → defer.
My recommendation: Option B, with two deviations flagged explicitly. Drift and corridor each use recovery_stability_ticks_drift / recovery_stability_ticks_corridor as new YAML params (recommending 10 and 5 respectively — tighter than anchor's 30 because these guards are faster-moving). If you prefer strict parity with Atlas's "reuse existing corridor parameters," I'd drop the corridor stability param and just use corridor_lookback_ticks for both directions.
Scope Decision — Needed From You¶
Three rulings needed before code:
- Drift/corridor scope — Options A / B / C above. My default pick: B.
- Wallet-truth exit counts toward cap? — (a) count vs (b) don't count. My default pick: (a).
- Anchor recovery placement in tick — options:
  - (i) New Step 8.45 method _evaluate_anchor_saturation_recovery() that runs BEFORE _evaluate_anchor_saturation_guard when mode == DEGRADED. Clean separation.
  - (ii) Merge recovery and entry into the existing evaluator (single method handles both paths). Less plumbing, more intermingled logic.

  My default pick: (i) — matches the per-guard pattern, easier to test in isolation.
Config Additions (per your spec + my recommendations)¶
AnchorSaturationGuardConfig — new fields:
recovery_enabled: bool = True
recovery_exit_bias_threshold_bps: float = 4.0
recovery_exit_prevalence_pct: float = 30.0
recovery_stability_ticks: int = 30
New top-level DegradedRecoveryConfig:
max_recovery_attempts_per_episode: int = 1
If Option B ruled in — DirectionalDriftGuardConfig and InventoryCorridorGuardConfig get parallel recovery_enabled and recovery_stability_ticks_* fields.
YAML additions go into config/config.yaml, config/config.example.yaml, config/config_live_stage1.yaml under the existing guard blocks (lines 158/183/220 in config.yaml, mirrored in the others).
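For illustration, a sketch of how the anchor guard block might look after the additions. The block name, nesting, and the existence of other fields are assumptions; only the four new field names and defaults come from this memo.

```yaml
anchor_saturation_guard:
  # ...existing entry-side fields unchanged...
  recovery_enabled: true
  recovery_exit_bias_threshold_bps: 4.0
  recovery_exit_prevalence_pct: 30.0
  recovery_stability_ticks: 30
```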
Tests I'd Write (per your 10-test minimum, all apply to anchor)¶
- Single clean tick does NOT exit (time stability)
- Bias clean + prevalence still high → no exit
- Prevalence clean + bias still high → no exit
- Both clean for recovery_stability_ticks → exit
- Symmetric: negative-bias saturation recovers the same as positive
- State reset on exit: _anchor_error_window empty, _anchor_guard_triggered_this_session False
- Second DEGRADED entry (post-recovery) → escalates to HALT, halt.reason = second_degraded_entry_in_episode (or similar taxonomy token — flagging this for you)
- degraded.entry_count and degraded.recovery_count persist correctly across the transition
- recovery_enabled: false → even with clean conditions, guard stays DEGRADED until the 300s timeout
- Hysteresis: entering DEGRADED requires mean ≥ 7 bps (current entry threshold); after recovery with mean at 3.5 bps, engine is OK — re-entering DEGRADED still needs mean ≥ 7 (not 4). Asserted by running a post-recovery scenario where mean = 4.5 for N ticks and mode stays OK.
If Option B ruled in → add 2 more: one drift exit, one corridor exit (passing the stability window, asserting mode flips to OK). Total 12.
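The hysteresis assertion reduces to a dead band between the exit and entry thresholds. A self-contained sketch with thresholds from this memo; the two predicates are stand-ins, not the real guard or recovery evaluators.

```python
ENTRY_BIAS_BPS = 7.0  # current entry threshold
EXIT_BIAS_BPS = 4.0   # recovery_exit_bias_threshold_bps

def should_enter(mean_bps: float) -> bool:
    # Entry fires at or above the entry threshold.
    return abs(mean_bps) >= ENTRY_BIAS_BPS

def should_exit(mean_bps: float) -> bool:
    # Exit requires dropping strictly below the tighter exit threshold.
    return abs(mean_bps) < EXIT_BIAS_BPS

# A post-recovery mean of 4.5 sits in the 4..7 dead band: it neither
# re-enters DEGRADED (needs >= 7) nor satisfies the exit band (needs < 4).
# That dead band is exactly the hysteresis the test asserts.
```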
Halt.reason Taxonomy Addition¶
Second-entry escalation writes a halt.reason token. Candidates:
- HALT_REASON_SECOND_DEGRADED_ENTRY = "second_degraded_entry_halt" (descriptive)
- HALT_REASON_RECOVERY_EXHAUSTED = "recovery_exhausted_halt" (concise)
I'd go with recovery_exhausted_halt — matches the taxonomy style of replay_exhausted and inventory_truth_halt. Flagging for your ruling; trivial to switch.
Imports Needed for Tests¶
neo_engine.inventory_truth_checker already re-exports KEY_MODE, KEY_DEGRADED_SINCE, KEY_DEGRADED_REASON, MODE_OK, MODE_DEGRADED, MODE_HALT — used in tests/test_halt_reason_lifecycle.py as precedent. Episode-counter keys will be new string literals; may promote to module-level constants in main_loop.py if you prefer — flag it if you want that.
Standing By¶
No branch. No code. main is clean. Ready to write code once you rule:
- A / B / C on drift+corridor scope
- (a) / (b) on wallet-truth counting toward cap
- (i) / (ii) on recovery placement in tick
- Halt.reason taxonomy token for second-entry escalation
— Orion