Orion Pre-Code Findings — FLAG-042 DEGRADED Recovery¶
To: Vesper (she/her)
From: Orion (he/him)
CC: Katja (Captain), Atlas (he/him)
Date: 2026-04-21
Branch: feat/flag-042-degraded-recovery (not yet created)
Type: Pre-code investigation — Q1–Q5 + scope recommendation before code
TL;DR¶
Q1–Q4 pass — every invariant the tasking assumes is in place. Q5 (scope of drift/corridor recovery) needs your ruling: both hooks are cleanly implementable, but they push the branch past the "<30 lines each" cap you quoted. My recommendation at the bottom is to ship anchor recovery fully this branch and ship drift + corridor with minimal exit hooks (one simple condition each, one test each) rather than defer — the infrastructure (episode counter, state-reset helper, _recovery_enabled disable flag) has to exist either way, and the incremental cost is small.
No branch created. No code written. On main; working tree clean apart from check_capital_baseline.py, an outstanding untracked file unrelated to this branch.
Q1 — How does _exit_degraded_mode() work?¶
Location: neo_engine/main_loop.py:1388–1417
Signature: def _exit_degraded_mode(self) -> None — no args, idempotent. Early-returns with a debug log if _current_truth_mode() != MODE_DEGRADED.
What it resets:
self._degraded_since_epoch = None # process cache
self._state.set_engine_state(KEY_MODE, MODE_OK) # DB
self._state.set_engine_state(KEY_DEGRADED_SINCE, "") # DB
self._state.set_engine_state(KEY_DEGRADED_REASON, "") # DB
That's all — no re-arm of orders, no reset of guard-specific rolling windows, no reset of one-shot flags.
Caller (the ONLY one today): _apply_truth_check_result at main_loop.py:2195 — the wallet-truth ok path. Anchor / drift / corridor guards have no exit path today: they transition one-way into DEGRADED and stay there until the 300s timeout escalates to HALT.
Re-enabling order flow: not explicit. After _exit_degraded_mode runs, the next _tick reaches Step 8 (strategy intent generation) and Step 9 (submit) normally because the pre-trade gate (C5) only blocks submits while MODE_DEGRADED / MODE_HALT. So "resume quoting" is implicit — once the mode flips back to MODE_OK, Step 9 stops refusing and quoting restarts on the very next tick.
Conclusion: _exit_degraded_mode is the correct building block. It clears DEGRADED bookkeeping cleanly and the tick loop resumes order flow without additional plumbing. For FLAG-042 the recovery path needs to layer state reset (rolling windows + guard one-shot flags + any guard counters) on top of the existing exit and not call _exit_degraded_mode directly from the guards — we'll wrap it.
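A minimal sketch of the wrapper shape proposed above, under stated assumptions: EngineSketch and _exit_degraded_via_recovery are stand-in names invented here, the mode constants are placeholders for the real ones, and only the attributes this memo names are modeled.

```python
from collections import deque

MODE_OK = "OK"              # stand-in constants; the real ones are
MODE_DEGRADED = "DEGRADED"  # re-exported by neo_engine.inventory_truth_checker

class EngineSketch:
    """Stand-in for the engine holding only the state this memo names."""

    def __init__(self, lookback_ticks: int = 25):
        self.mode = MODE_DEGRADED
        self._anchor_error_window = deque(maxlen=lookback_ticks)
        self._anchor_guard_triggered_this_session = True

    def _exit_degraded_mode(self) -> None:
        # Stand-in for the real method: clears DEGRADED bookkeeping only,
        # no guard-state reset (per Q1 findings).
        self.mode = MODE_OK

    def _exit_degraded_via_recovery(self) -> None:
        # Proposed wrapper: reset guard-local state FIRST, then reuse the
        # existing exit, so a re-trigger needs a freshly populated window.
        self._anchor_error_window.clear()
        self._anchor_guard_triggered_this_session = False
        self._exit_degraded_mode()
```

The ordering (reset before exit) matters only for log cleanliness; both effects land within the same tick either way.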
Q2 — Anchor guard state machine¶
Evaluator: _evaluate_anchor_saturation_guard at main_loop.py:~1480–1569.
Step sequence in _tick (top to bottom):
- Top of tick: DEGRADED timeout check → escalate to HALT if exceeded. Then MODE_HALT guard.
- Step 1–7: risk check, market data, reconciler, snapshot, cancels.
- Step 8: _strategy.calculate_quote → populates _anchor_error_window via div_bps = self._strategy.last_anchor_divergence_bps at main_loop.py:2563.
- Step 8.5: _evaluate_anchor_saturation_guard(intents) at main_loop.py:2676.
- Step 8.5b: drift guard.
- Step 8.5c: corridor guard.
- Step 9: submit.
Trigger path (line 1509–1569): requires (a) window fully populated, (b) abs(mean) >= bias_threshold_bps, (c) %(|x| > prevalence_threshold_bps) >= prevalence_pct. On first trigger: sets _anchor_guard_triggered_this_session = True (one-shot for log + circuit_breaker_events dedup), then unconditionally calls self._enter_degraded_mode("anchor_saturation_guard_exceeded") every tick while condition holds.
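The (a)/(b)/(c) trigger conditions can be sketched as a pure predicate. This is not the real evaluator; the entry bias threshold of 7 bps comes from this memo's hysteresis note, while prevalence_threshold_bps = 10 and prevalence_pct = 50 are illustrative assumptions.

```python
from typing import Sequence

def anchor_guard_should_trigger(
    window: Sequence[float],
    lookback_ticks: int = 25,
    bias_threshold_bps: float = 7.0,        # current entry threshold
    prevalence_threshold_bps: float = 10.0,  # assumed value
    prevalence_pct: float = 50.0,            # assumed value
) -> bool:
    # (a) window must be fully populated
    if len(window) < lookback_ticks:
        return False
    # (b) mean bias magnitude at or above threshold
    mean = sum(window) / len(window)
    if abs(mean) < bias_threshold_bps:
        return False
    # (c) prevalence of large-magnitude errors at or above the cutoff
    prevalent = sum(1 for x in window if abs(x) > prevalence_threshold_bps)
    return 100.0 * prevalent / len(window) >= prevalence_pct
```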
State persisted between ticks (all instance-local):
- self._anchor_error_window: deque[float] — maxlen = lookback_ticks (25 default), bounded.
- self._anchor_guard_triggered_this_session: bool — one-shot.
What "reset rolling windows and guard counters" means concretely:
- self._anchor_error_window.clear()
- self._anchor_guard_triggered_this_session = False
That's the full per-guard reset. Guard re-evaluation then needs lookback_ticks new ticks before it can re-trigger — which itself enforces some settling time post-recovery.
Q3 — Does the anchor error window keep updating in DEGRADED?¶
Yes. The window append at line 2563 is inside Step 8's if snapshot.is_valid(): block and runs every tick regardless of truth mode. The only short-circuit before Step 8 is either (a) 300s timeout → HALT → return False, or (b) MODE_HALT observed → return False. Neither gates Step 8 while MODE_DEGRADED.
Consequence: the recovery monitor does NOT need its own data-collection path. It can read the same _anchor_error_window the entry guard reads. The window is the source of truth for both entry evaluation and exit evaluation.
One wrinkle worth flagging: the window is bounded (deque(maxlen=lookback_ticks)). By the time a fresh tick reaches the recovery monitor, up to 25 DEGRADED-era ticks' worth of anchor data are already averaged in. That's actually fine for recovery — when conditions genuinely normalize, the new clean values will displace the saturated ones within lookback_ticks ticks, and the exit threshold (abs(mean) < 4 + prevalence < 30% — tighter than entry) will then fire. But it means the recovery_stability_ticks timer should be counted from the tick the exit conditions first hold, not from when DEGRADED was entered — otherwise we'd exit based on stale history.
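The stability-timer behavior argued for here can be sketched as a small counter: count from the first tick the exit conditions hold, and restart whenever they stop holding. RecoveryStabilityTimer is a hypothetical helper name, not existing code.

```python
class RecoveryStabilityTimer:
    """Counts consecutive ticks on which the exit conditions hold."""

    def __init__(self, recovery_stability_ticks: int = 30):
        self.required = recovery_stability_ticks
        self.ticks_stable = 0

    def update(self, exit_conditions_hold: bool) -> bool:
        """Call once per tick; returns True once exit is safe."""
        if exit_conditions_hold:
            self.ticks_stable += 1
        else:
            self.ticks_stable = 0  # any dirty tick restarts the clock
        return self.ticks_stable >= self.required
```

Because the counter only starts accumulating when conditions first hold, stale DEGRADED-era history in the bounded window cannot satisfy the timer on its own.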
Q4 — Episode tracking: what exists?¶
Nothing. Grep across the repo for degraded_entry_count, degraded_count, recovery_count, episode → zero hits. No existing counter, no engine_state key, no session-table column.
Proposed additions:
- engine_state key degraded.entry_count (int, stored as a string). Incremented in _enter_degraded_mode ONLY on the not-already_degraded branch (first entry, not idempotent re-entry).
- engine_state key degraded.recovery_count (int). Incremented by the recovery exit path on a successful exit (not by the wallet-truth ok exit path, which gets its own scoping below).
- Reset both to 0 in _startup()'s fresh-session clear block (next to the KEY_MODE / halt.reason resets I just added in fix/startup-mode-reset — same pattern, same block).
Enforcement point: in _enter_degraded_mode, before flipping to MODE_DEGRADED, check:
if (
not already_degraded
and cfg.degraded_recovery.max_recovery_attempts_per_episode > 0
and entry_count >= 1 + cfg.degraded_recovery.max_recovery_attempts_per_episode
):
# Second (or greater) entry in the episode -> escalate directly to HALT.
self._escalate_degraded_to_halt(f"second_degraded_entry;{reason}")
return
With max_recovery_attempts_per_episode = 1:
- Entry 1 → entry_count becomes 1. Proceeds normally. 1 < (1 + 1) so OK.
- Recovery → recovery_count becomes 1.
- Entry 2 → entry_count becomes 2 at check time. 2 >= 2 → escalate to HALT.
Wallet-truth exit path scoping: Vesper, do we want the wallet-truth ok exit to count as a "recovery" for the one-per-episode cap? Two choices:
- (a) Wallet-truth exit IS a recovery → counts against the cap. If the wallet check recovers, then an anchor guard fires, entry_count=2 → HALT. Symmetric, simplest rule.
- (b) Wallet-truth exit is NOT a recovery → only guard recoveries count. Wallet truth can loop indefinitely (reasonable — wallet mismatches aren't market-regime-driven and may transiently resolve of their own accord).
Atlas's ruling Section 5 ("one recovery per episode") doesn't distinguish the cause. I lean (a) — simplest, matches the ruling as literal — but this is a design call. Recommend you rule before code.
Q5 — Drift and corridor guard recovery hooks — in this branch, or defer?¶
Per Atlas's ruling "keep minimal for now" and your Secondary scope note:
Directional drift recovery¶
Spec (yours): exit DEGRADED when opposing fill observed OR N ticks with no same-side fills.
Reality check: Condition A (burst) and Condition B (net notional) auto-clear as their rolling windows expire (30s / 120s by default). Condition C (no opposing) does NOT auto-clear — _drift_ticks_since_opposing_fill keeps incrementing unless a fill arrives.
Minimal hook: while DEGRADED on any drift condition, re-run the three evaluators passively each tick. If NONE currently trigger AND we've been settled for recovery_stability_ticks_drift ticks → exit. ~25–30 lines including state + eval. Matches "mirror existing drift logic in reverse."
Reset on exit:
- self._drift_fill_events.clear(), self._drift_net_notional_events.clear()
- self._drift_ticks_since_opposing_fill = 0
- self._drift_guard_triggered_this_session = False
- self._drift_last_fill_side = None
- Leave _drift_fills_seen_this_session alone (session-cumulative counter, not guard state).
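The drift reset list above, sketched as code. The class is a stand-in holding only the attributes this memo names; the real reset would be a method on the engine.

```python
from collections import deque

class DriftGuardStateSketch:
    """Stand-in container for the drift guard's instance-local state."""

    def __init__(self):
        self._drift_fill_events = deque()
        self._drift_net_notional_events = deque()
        self._drift_ticks_since_opposing_fill = 0
        self._drift_guard_triggered_this_session = False
        self._drift_last_fill_side = None
        self._drift_fills_seen_this_session = 0  # session-cumulative

    def reset_on_recovery_exit(self) -> None:
        self._drift_fill_events.clear()
        self._drift_net_notional_events.clear()
        self._drift_ticks_since_opposing_fill = 0
        self._drift_guard_triggered_this_session = False
        self._drift_last_fill_side = None
        # Deliberately NOT touching _drift_fills_seen_this_session:
        # it is a session-cumulative counter, not guard state.
```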
Inventory corridor recovery¶
Spec (yours): exit DEGRADED when inventory % inside corridor for corridor_lookback_ticks consecutive ticks.
Reality check: existing guard already resets _corridor_ticks_outside to 0 every tick inventory is inside (line 1960). For recovery, we need the mirror counter: _corridor_ticks_inside incrementing when inside, resetting when outside. Exit when >= corridor_lookback_ticks (reuse same param per your spec). ~15–20 lines including state.
Reset on exit:
- self._corridor_ticks_outside = 0, self._corridor_ticks_inside = 0
- self._corridor_guard_triggered_this_session = False
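The mirror-counter idea, sketched. Reuses corridor_lookback_ticks for both directions per the spec; the class name and the default of 20 are assumptions for illustration.

```python
class CorridorCounters:
    """Tracks consecutive ticks inside/outside the inventory corridor."""

    def __init__(self, corridor_lookback_ticks: int = 20):  # default assumed
        self.lookback = corridor_lookback_ticks
        self.ticks_outside = 0
        self.ticks_inside = 0

    def update(self, inside_corridor: bool) -> bool:
        """Call once per tick; returns True when the recovery exit holds."""
        if inside_corridor:
            self.ticks_outside = 0   # mirrors the existing reset (line 1960)
            self.ticks_inside += 1
        else:
            self.ticks_inside = 0    # any outside tick restarts the clock
            self.ticks_outside += 1
        return self.ticks_inside >= self.lookback
```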
Recommendation¶
Option A — anchor only (per your "Implement hooks if < 30 lines; skip if they require significant investigation"): ship anchor recovery with all 10 tests. Drift and corridor stay one-way-to-HALT. Defer both to a follow-up branch.
Option B — ship all three in this branch: anchor fully tested; drift + corridor with one exit test each (so ≥12 tests total). The shared infrastructure (episode counter, YAML config loading, recovery monitor step) is reused by all three, making the incremental cost ~50 lines of code + 2 tests for drift/corridor combined. Still well within the "minimal" spirit.
Option C — reject part of the spec: drift/corridor exit conditions are too simple (single-tick no-opposing, single-tick inside-corridor) to be safe exit signals in mixed regimes. Require hysteresis + stability on those two too, which Atlas hasn't ruled on → defer.
My recommendation: Option B, with two deviations flagged explicitly. Drift and corridor each use recovery_stability_ticks_drift / recovery_stability_ticks_corridor as new YAML params (recommending 10 and 5 respectively — tighter than anchor's 30 because these guards are faster-moving). If you prefer strict parity with Atlas's "reuse existing corridor parameters," I'd drop the corridor stability param and just use corridor_lookback_ticks for both directions.
Scope Decision — Needed From You¶
Three rulings needed before code:
- Drift/corridor scope — Options A / B / C above. My default pick: B.
- Wallet-truth exit counts toward cap? — (a) count vs (b) don't count. My default pick: (a).
- Anchor recovery placement in tick — options:
  - (i) New Step 8.45 method _evaluate_anchor_saturation_recovery() that runs BEFORE _evaluate_anchor_saturation_guard when mode == DEGRADED. Clean separation.
  - (ii) Merge recovery and entry into the existing evaluator (single method handles both paths). Less plumbing, more intermingled logic.

  My default pick: (i) — matches the per-guard pattern, easier to test in isolation.
Config Additions (per your spec + my recommendations)¶
AnchorSaturationGuardConfig — new fields:
recovery_enabled: bool = True
recovery_exit_bias_threshold_bps: float = 4.0
recovery_exit_prevalence_pct: float = 30.0
recovery_stability_ticks: int = 30
New top-level DegradedRecoveryConfig:
max_recovery_attempts_per_episode: int = 1
If Option B ruled in — DirectionalDriftGuardConfig and InventoryCorridorGuardConfig get parallel recovery_enabled and recovery_stability_ticks_* fields.
YAML additions go into config/config.yaml, config/config.example.yaml, config/config_live_stage1.yaml under the existing guard blocks (lines 158/183/220 in config.yaml, mirrored in the others).
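For illustration, a sketch of how the anchor guard block might look after the additions. The block name, nesting, and the existence of other fields are assumptions; only the four new field names and defaults come from this memo.

```yaml
anchor_saturation_guard:
  # ...existing entry-side fields unchanged...
  recovery_enabled: true
  recovery_exit_bias_threshold_bps: 4.0
  recovery_exit_prevalence_pct: 30.0
  recovery_stability_ticks: 30
```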
Tests I'd Write (per your 10-test minimum, all apply to anchor)¶
- Single clean tick does NOT exit (time stability)
- Bias clean + prevalence still high → no exit
- Prevalence clean + bias still high → no exit
- Both clean for recovery_stability_ticks → exit
- Symmetric: negative-bias saturation recovers the same as positive
- State reset on exit: _anchor_error_window empty, _anchor_guard_triggered_this_session False
- Second DEGRADED entry (post-recovery) → escalates to HALT, halt.reason = second_degraded_entry_in_episode (or similar taxonomy token — flagging this for you)
- degraded.entry_count and degraded.recovery_count persist correctly across the transition
- recovery_enabled: false → even with clean conditions, guard stays DEGRADED until the 300s timeout
- Hysteresis: entering DEGRADED requires mean ≥ 7 bps (current entry threshold); after recovery with mean at 3.5 bps, engine is OK — re-entering DEGRADED still needs mean ≥ 7 (not 4). Asserted by running a post-recovery scenario where mean = 4.5 for N ticks and mode stays OK.
If Option B ruled in → add 2 more: one drift exit, one corridor exit (passing the stability window, asserting mode flips to OK). Total 12.
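The hysteresis assertion reduces to a dead band between the exit and entry thresholds. A self-contained sketch with thresholds from this memo; the two predicates are stand-ins, not the real guard or recovery evaluators.

```python
ENTRY_BIAS_BPS = 7.0  # current entry threshold
EXIT_BIAS_BPS = 4.0   # recovery_exit_bias_threshold_bps

def should_enter(mean_bps: float) -> bool:
    # Entry fires at or above the entry threshold.
    return abs(mean_bps) >= ENTRY_BIAS_BPS

def should_exit(mean_bps: float) -> bool:
    # Exit requires dropping strictly below the tighter exit threshold.
    return abs(mean_bps) < EXIT_BIAS_BPS

# A post-recovery mean of 4.5 sits in the 4..7 dead band: it neither
# re-enters DEGRADED (needs >= 7) nor satisfies the exit band (needs < 4).
# That dead band is exactly the hysteresis the test asserts.
```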
Halt.reason Taxonomy Addition¶
Second-entry escalation writes a halt.reason token. Candidates:
- HALT_REASON_SECOND_DEGRADED_ENTRY = "second_degraded_entry_halt" (descriptive)
- HALT_REASON_RECOVERY_EXHAUSTED = "recovery_exhausted_halt" (concise)
I'd go with recovery_exhausted_halt — matches the taxonomy style of replay_exhausted and inventory_truth_halt. Flagging for your ruling; trivial to switch.
Imports Needed for Tests¶
neo_engine.inventory_truth_checker already re-exports KEY_MODE, KEY_DEGRADED_SINCE, KEY_DEGRADED_REASON, MODE_OK, MODE_DEGRADED, MODE_HALT — used in tests/test_halt_reason_lifecycle.py as precedent. Episode-counter keys will be new string literals; may promote to module-level constants in main_loop.py if you prefer — flag it if you want that.
Standing By¶
No branch. No code. main is clean. Ready to write code once you rule:
- A / B / C on drift+corridor scope
- (a) / (b) on wallet-truth counting toward cap
- (i) / (ii) on recovery placement in tick
- Halt.reason taxonomy token for second-entry escalation
— Orion