Vesper Review — feat/flag-042-degraded-recovery

To: Orion (he/him), Katja (Captain)
From: Vesper (she/her)
CC: Atlas (he/him)
Date: 2026-04-21
Status: APPROVED — apply and merge

Verdict: APPROVED

5 commits, 16 new tests, 158/158 regression green. Atlas spec followed on all locked points. One documented deviation (drift condition C) reviewed and accepted below. Apply instructions correct. Cleared for merge.

Spec Compliance

Anchor recovery — full compliance. Hysteresis correct (exit 4 bps / 30% vs entry 7 bps / 40%), stability window 30 consecutive ticks, state reset on exit (_anchor_error_window cleared, _anchor_guard_triggered_this_session cleared), entry evaluator re-armed so re-entry is possible. ✅
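
For illustration, the exit side of the hysteresis could look roughly like the sketch below. This is not Orion's implementation: the class and field names are hypothetical, and reading the 30% figure as a prevalence fraction over the error window is an assumption. Only the exit thresholds (4 bps / 30%) and the 30-tick stability window come from the spec.

```python
from dataclasses import dataclass


@dataclass
class AnchorRecoveryEvaluator:
    """Hypothetical exit-side evaluator; names are illustrative only."""

    exit_bps: float = 4.0          # exit threshold (entry is 7 bps)
    exit_prevalence: float = 0.30  # assumed: fraction of window over threshold
    stability_ticks: int = 30      # consecutive clean ticks required
    clean_ticks: int = 0

    def on_tick(self, mean_error_bps: float, prevalence: float) -> bool:
        """Return True once the stability window has been satisfied."""
        clean = (
            abs(mean_error_bps) < self.exit_bps
            and prevalence < self.exit_prevalence
        )
        # A single dirty tick resets the run: the ticks must be consecutive.
        self.clean_ticks = self.clean_ticks + 1 if clean else 0
        return self.clean_ticks >= self.stability_ticks
```

Because the exit thresholds sit below the entry thresholds, a reading between 4 and 7 bps neither triggers entry nor counts toward recovery, which is what keeps the mode from flapping.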

Per-episode cap — full compliance. Per-source counters in engine_state[degraded_recovery.<source>.attempts], incremented atomically on each capped DEGRADED entry. attempts > 1 escalates to HALT with recovery_exhausted_halt. Wallet-truth and reconciler uncapped — matches my Q2 ruling (wallet-truth exits excluded from the guard recovery cap). DB read errors treated as "first attempt" to avoid spurious HALT — correct defensive behavior. ✅
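
A minimal sketch of the cap logic as I read it, under stated assumptions: the function name, the `read_state` / `write_state` callables, and the source labels in `UNCAPPED_SOURCES` are hypothetical stand-ins for the engine_state accessors, and the sketch does not reproduce the atomic increment of the real code.

```python
from typing import Callable, Optional

HALT_REASON_RECOVERY_EXHAUSTED = "recovery_exhausted_halt"
UNCAPPED_SOURCES = {"wallet_truth", "reconciler"}  # assumed source labels


def register_degraded_entry(
    source: str,
    read_state: Callable[[str], Optional[str]],
    write_state: Callable[[str, str], None],
) -> Optional[str]:
    """Count a DEGRADED entry; return a halt reason if the cap is exceeded."""
    if source in UNCAPPED_SOURCES:
        return None  # uncapped sources never register a counter
    key = f"degraded_recovery.{source}.attempts"
    try:
        previous = int(read_state(key) or 0)
    except Exception:
        previous = 0  # a read failure is treated as a first attempt
    attempts = previous + 1
    write_state(key, str(attempts))
    # The first entry is allowed to recover; the second escalates to HALT.
    return HALT_REASON_RECOVERY_EXHAUSTED if attempts > 1 else None
```

With a plain dict standing in for the state store, the first call for a source returns `None` and the second returns `recovery_exhausted_halt`.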

Halt reason taxonomy — full compliance. HALT_REASON_RECOVERY_EXHAUSTED = "recovery_exhausted_halt" matches Q4 ruling. ✅

Step 8.4 ordering — full compliance. Recovery evaluators run before entry guards (8.5/8.5b/8.5c). A tick that exits DEGRADED at 8.4 is immediately re-evaluated by guards at 8.5–8.5c. If the regime is still hostile the guard trips again; per-episode cap fires with recovery_exhausted_halt. This is the correct behavior for catching "recovered just to re-enter" in hostile regimes — matches Q3 ruling. ✅

Drift recovery (A+B only) — deviation reviewed and ACCEPTED. See section below. ✅

Corridor recovery — full compliance. Reuses corridor_lookback_ticks for both entry and exit direction per my ruling (no new stability param). Both conditions required (rlusd floor AND xrp_pct corridor). mid_price=0 and below-minimum-portfolio cases correctly reset counter — cannot confirm safe means no exit. ✅
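
The corridor exit rule can be sketched as a pure per-tick function. Everything except `corridor_lookback_ticks`, `mid_price`, and `xrp_pct` is a hypothetical name, and resetting the counter on an unsafe tick is my assumption about how "consecutive" is enforced.

```python
from typing import Tuple


def corridor_recovery_tick(
    counter: int,
    mid_price: float,
    portfolio_value: float,
    rlusd_balance: float,
    xrp_pct: float,
    *,
    min_portfolio_value: float,
    rlusd_floor: float,
    xrp_pct_low: float,
    xrp_pct_high: float,
    corridor_lookback_ticks: int,
) -> Tuple[int, bool]:
    """Return (new_counter, should_exit) for one tick."""
    # If safety cannot be confirmed, reset the counter and do not exit.
    if mid_price <= 0 or portfolio_value < min_portfolio_value:
        return 0, False
    # Both conditions are required: RLUSD floor AND XRP percentage corridor.
    safe = (
        rlusd_balance >= rlusd_floor
        and xrp_pct_low <= xrp_pct <= xrp_pct_high
    )
    counter = counter + 1 if safe else 0
    return counter, counter >= corridor_lookback_ticks
```

Keeping the function stateless makes the reset cases easy to test: the caller owns the counter and simply stores whatever comes back.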

Startup reset extension — full compliance. Three new recovery-attempt keys cleared in the fresh-session startup block alongside existing mode/degraded_since/degraded_reason clears. Pattern matches fix/startup-mode-reset exactly. ✅

Backward compatibility — full compliance. recovery_enabled=false restores pre-FLAG-042 one-way behavior with zero code path change. _escalate_degraded_to_halt gains optional halt_reason kwarg, default unchanged — existing call sites unaffected. No schema migrations. ✅

Test coverage — exceeds minimum. Minimum was 12 (10 anchor + 1 drift + 1 corridor). Delivered 16 with structured coverage: no-op gates (Part A), hysteresis counter behavior (Part B), exit transitions (Part C), drift reset contract (Part D), corridor conditions (Part E). Regression: 158/158 across all related suites. ✅

Apply instructions — correct. Defensive branch delete present. Get-ChildItem ... | Sort-Object Name | ForEach-Object { git am $_.FullName } form used. Patch path correctly references 08 Patches/patches-flag-042-degraded-recovery/. Prerequisite noted (fix/startup-mode-reset must be on main — it is, merged Apr 21). ✅

Deviation: Drift Condition C Omitted from Recovery — ACCEPTED

Orion's reasoning: During DEGRADED the engine has cancelled all orders and stopped quoting. No new fills can arrive. _drift_ticks_since_opposing_fill therefore grows monotonically every tick regardless of market behavior — not because directional flow is occurring, but because the engine is silent. Including condition C in the recovery evaluator would make drift recovery via condition C structurally impossible.

My ruling: Correct analysis. Condition C ("no opposing fill for N ticks") is meaningful as an entry trigger because it measures actual fill behavior during active quoting. During DEGRADED there is no active quoting, so the counter is meaningless as a recovery signal. Conditions A (burst window) and B (net notional window) are time-bounded rolling windows — they correctly decay to "no active flow" even during DEGRADED as the observation window advances. These are the right recovery signals.
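
The asymmetry is easy to see in a toy model. The sketch below is illustrative only: `RollingFillWindow`, the 10-tick window, and the fill sizes are hypothetical, not taken from the engine. It shows a time-bounded window decaying to zero while a "ticks since opposing fill" counter keeps growing during a silent period.

```python
from collections import deque


class RollingFillWindow:
    """Hypothetical time-bounded window of signed fill notionals."""

    def __init__(self, window_ticks: int) -> None:
        self.window_ticks = window_ticks
        self._fills: deque = deque()  # entries are (tick, signed_notional)

    def record(self, tick: int, signed_notional: float) -> None:
        self._fills.append((tick, signed_notional))

    def net_notional(self, now_tick: int) -> float:
        # Drop fills that have aged out of the observation window.
        while self._fills and now_tick - self._fills[0][0] >= self.window_ticks:
            self._fills.popleft()
        return sum((n for _, n in self._fills), 0.0)


window = RollingFillWindow(window_ticks=10)
for tick in range(3):
    window.record(tick, 100.0)  # one-sided flow before DEGRADED
net_before = window.net_notional(now_tick=2)

ticks_since_opposing_fill = 3
for tick in range(3, 20):  # DEGRADED: no quoting, so no fills arrive
    ticks_since_opposing_fill += 1
net_after = window.net_notional(now_tick=19)

print(net_before, net_after, ticks_since_opposing_fill)  # 300.0 0.0 20
```

The windowed signal clears on its own as time advances; the counter can only be cleared by a fill that cannot happen while the engine is silent.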

Deviation is accepted. Documenting it in the commit message is sufficient. No follow-up required.

Structural Note: Per-Source Counter Keys

My pre-code ruling specified degraded.entry_count, degraded.guard_entry_count, degraded.recovery_count as engine_state keys. Orion implemented per-source degraded_recovery.<source>.attempts instead.

This is a better design: it's more granular, wallet-truth exclusion falls out naturally (no counter registered for that source), and per-source counters give cleaner observability in session logs. The functional contract is identical. Accepting the deviation — it improves on my spec.

S45 Monitoring Note (not a blocker)

The corridor guard's corridor_lookback_ticks defaults to 3 ticks (~12 seconds at 4s cadence). This is compliant with my ruling ("reuse existing corridor_lookback_ticks"). But 3 ticks is a short stability window for recovery: inventory can oscillate at that frequency. Flag for S45 observation: if corridor recovery fires repeatedly in a single session and the per-episode cap catches it, we may want to decouple the entry and exit stability windows with a dedicated recovery_stability_ticks_corridor param in a follow-up. Not acting on this now; observe first.

Apply Instructions

cd "C:\Users\Katja\Documents\NEO GitHub\neo-2026"
git checkout main
git pull

git branch -D feat/flag-042-degraded-recovery 2>$null
git checkout -b feat/flag-042-degraded-recovery

Get-ChildItem "C:\Users\Katja\Documents\Claude Homebase Neo\02 Projects\NEO Trading Engine\08 Patches\patches-flag-042-degraded-recovery" -Filter "*.patch" |
    Sort-Object Name |
    ForEach-Object { git am $_.FullName }

python -m pytest tests/test_flag_042_degraded_recovery.py -v

Expected: 16 passed

python -m pytest tests/test_anchor_saturation_guard.py tests/test_directional_drift_guard.py `
  tests/test_inventory_corridor_guard.py tests/test_reconciler_conservative.py `
  tests/test_flag_036_wallet_truth_reconciliation.py tests/test_halt_reason_lifecycle.py `
  tests/test_reconciler_anomaly_log.py tests/test_config.py tests/test_config_invariants.py `
  tests/test_flag_042_degraded_recovery.py -q

Expected: 158 passed

Then merge to main and push.

What This Means for S45

After this merge, a session that enters DEGRADED on anchor saturation now has an exit path:

| Condition | Before FLAG-042 | After FLAG-042 |
| --- | --- | --- |
| Anchor cleans up mid-session | Sits DEGRADED until 300s → HALT | Exits DEGRADED after 30 clean ticks → resumes quoting |
| Regime stays hostile after recovery | N/A | Re-enters DEGRADED → per-episode cap → HALT (`recovery_exhausted_halt`) |
| Regime never cleans up (S43 pattern) | DEGRADED → HALT at 300s | Unchanged: 300s timeout still fires |

S44 is the model case: anchor mean +4.43 bps with range [−3.6, +10.0]. With FLAG-042 live, the anchor's sub-4 bps periods would have accumulated stability ticks and potentially exited DEGRADED. Whether they would have accumulated 30 consecutive clean ticks depends on the exact tick-by-tick sequence — S45 will show us the real behavior.
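
To check this against a recorded tick series, a small helper along these lines would do. It is a sketch under assumptions: the function name is hypothetical, it looks only at the 4 bps magnitude condition and ignores the prevalence condition, and it takes whatever per-tick anchor error series the session log provides.

```python
from typing import Iterable


def longest_clean_run(anchor_errors_bps: Iterable[float], exit_bps: float = 4.0) -> int:
    """Longest streak of consecutive ticks with |error| below the exit threshold."""
    best = run = 0
    for err in anchor_errors_bps:
        # A tick at or above the threshold breaks the streak.
        run = run + 1 if abs(err) < exit_bps else 0
        best = max(best, run)
    return best
```

A result of 30 or more on the S44 series would mean the stability window could have been met at least once; the session mean alone cannot tell us that.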

S45 pre-session checklist unchanged: realign with realign_inventory_to_onchain.py --confirm before starting.

— Vesper