Orion Delivery — feat/flag-042-degraded-recovery

Vesper — branch complete and all green per Atlas's 2026-04-21 locked recovery spec and your pre-code rulings. Ready for review + merge.

Branch

feat/flag-042-degraded-recovery (off main including fix/startup-mode-reset), 5 commits, +1391 / −45.

Commits

# Hash Subject
C1 e209370 feat(config): add FLAG-042 recovery config schema (anchor + drift + corridor + episode cap)
C2 00b2898 feat(main_loop): FLAG-042 infrastructure — recovery counter + cap + halt taxonomy + startup reset
C3 4bf8105 feat(main_loop): FLAG-042 C3 — anchor saturation recovery evaluator + Step 8.4 wiring
C4 c38dd41 feat(main_loop): FLAG-042 C4 — drift + corridor recovery evaluators + Step 8.4 wiring
C5 2d380a4 test(recovery): FLAG-042 C5 — 16 tests for DEGRADED recovery evaluators

Tests

  • New: 16 tests in tests/test_flag_042_degraded_recovery.py, all green.
  • Part A (6) — anchor recovery no-op gates: enabled=False, recovery_enabled=False, not-DEGRADED, wrong-source reason (directional_drift_guard_*), window not full, get_engine_state raise.
  • Part B (4) — anchor recovery hysteresis: bias excursion resets counter, prevalence excursion resets counter, both-conditions-clear advances counter, advance-then-excursion resets to 0.
  • Part C (2) — anchor recovery exit: mode transition + flag clear + counter reset; [ANCHOR_SAT] exit WARNING log.
  • Part D (2) — drift recovery: exit on stability (deques cleared, ticks-since-opposing zeroed, last-side/fills-seen cleared, one-shot flag cleared, watermark preserved); burst excursion resets counter.
  • Part E (2) — corridor recovery: exit on stability (both conditions hold, _corridor_ticks_outside reset, flag cleared); mid_price=0 safe-reset.
  • Regression: 158/158 green across guard + truth + reason + config + recovery suites.

Run command (sandbox reproducible):

python -m pytest tests/test_anchor_saturation_guard.py tests/test_directional_drift_guard.py \
  tests/test_inventory_corridor_guard.py tests/test_reconciler_conservative.py \
  tests/test_flag_036_wallet_truth_reconciliation.py tests/test_halt_reason_lifecycle.py \
  tests/test_reconciler_anomaly_log.py tests/test_config.py tests/test_config_invariants.py \
  tests/test_flag_042_degraded_recovery.py -q

Spec compliance — Atlas 2026-04-21

Anchor recovery (hysteresis):

exit bias threshold: recovery_exit_bias_threshold_bps=4.0    (entry 7.0)
exit prev threshold: recovery_exit_prevalence_pct=30.0       (entry 40.0)
stability window:    recovery_stability_ticks=30             (consecutive)
On a DEGRADED tick with reason starting anchor_saturation: if abs(mean(window)) < 4 bps AND %(|x|>5 bps) < 30%, increment counter; any excursion past either threshold resets to 0. Counter reaching 30 → _exit_degraded_mode(), reset counter, clear _anchor_guard_triggered_this_session so the entry evaluator can re-fire.
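
The anchor hysteresis rule can be sketched as a small pure function. This is an illustration only, not the code in neo_engine/main_loop.py: the function names are hypothetical, the 5 bps per-sample cutoff is taken from the rule as quoted above, and the caller is assumed to have already checked that the window is full.

```python
from statistics import mean

# Thresholds quoted in the spec above. PREVALENCE_CUTOFF_BPS is the per-sample
# cutoff used in the "%(|x|>5 bps)" condition.
EXIT_BIAS_BPS = 4.0
EXIT_PREVALENCE_PCT = 30.0
PREVALENCE_CUTOFF_BPS = 5.0
STABILITY_TICKS = 30


def anchor_conditions_clear(window):
    """True when both exit conditions hold for a full, non-empty window (bps)."""
    bias = abs(mean(window))
    over = sum(abs(x) > PREVALENCE_CUTOFF_BPS for x in window)
    prevalence_pct = 100.0 * over / len(window)
    return bias < EXIT_BIAS_BPS and prevalence_pct < EXIT_PREVALENCE_PCT


def step_anchor_counter(counter, window):
    """Advance the stability counter by one DEGRADED tick.

    Returns (new_counter, should_exit). Hypothetical helper for illustration.
    """
    if not anchor_conditions_clear(window):
        return 0, False  # an excursion past either threshold resets the streak
    counter += 1
    if counter >= STABILITY_TICKS:
        return 0, True  # stable long enough: exit DEGRADED, counter reset
    return counter, False
```

Because the exit thresholds (4 bps / 30%) sit below the entry thresholds (7 bps / 40%), a window hovering near the entry boundary cannot satisfy the exit rule, which is what keeps the mode from flapping.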

Drift recovery (minimal — no hysteresis):

stability window: recovery_stability_ticks_drift=10
On a DEGRADED tick with reason starting directional_drift_guard: evaluate conditions A (burst within live burst_window_seconds) and B (net notional within live net_notional_window_seconds). Condition C is deliberately excluded: _drift_ticks_since_opposing_fill grows monotonically during DEGRADED (no new fills), so including C would permanently latch the guard. A and B are time-bounded and correctly decay to "no active flow." On exit: burst deque cleared, net-notional deque cleared, ticks-since-opposing zeroed, last-side cleared, fills-seen zeroed. The watermark is preserved, because resetting it would replay every session fill on the next tick.
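
The reason A and B decay on their own while C does not can be seen in a minimal sketch of time-bounded rolling windows. Everything here is assumed for illustration: the function names, the (timestamp, value) deque layout, and the burst_limit / notional_limit parameters are hypothetical and are not taken from the branch.

```python
from collections import deque


def prune(events, now, window_seconds):
    """Drop (timestamp, value) entries that have aged out of the rolling window."""
    while events and now - events[0][0] > window_seconds:
        events.popleft()


def drift_conditions_clear(bursts, notional, now, burst_window_s,
                           notional_window_s, burst_limit, notional_limit):
    """True when neither condition A (burst) nor B (net notional) would trip.

    With no new fills arriving during DEGRADED, both deques only shrink as
    time passes, so this eventually returns True without outside input.
    """
    prune(bursts, now, burst_window_s)
    prune(notional, now, notional_window_s)
    burst_ok = len(bursts) < burst_limit
    net_ok = abs(sum(value for _, value in notional)) < notional_limit
    return burst_ok and net_ok
```

A tick counter such as the one behind condition C has no equivalent of prune: it only increases while fills are absent, which is the latching behavior described above.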

Corridor recovery (no new stability parameter):

stability window: corridor_lookback_ticks=3 (reused from entry-side)
On a DEGRADED tick with reason starting inventory_corridor_guard: require BOTH rlusd >= min_rlusd_floor AND xrp_pct ∈ [min_xrp_pct, max_xrp_pct]. Missing / zero mid_price OR total portfolio below min_portfolio_rlusd → cannot confirm safe → counter reset. Counter reaching corridor_lookback_ticks → exit, reset _corridor_ticks_outside, clear the one-shot flag.
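
A rough sketch of the corridor check follows. It assumes mid_price is quoted in RLUSD per XRP and that xrp_pct is the XRP share of total portfolio value in RLUSD; both are assumptions, as are the function names. The real evaluator may compute these differently.

```python
def corridor_is_safe(rlusd, xrp, mid_price, *, min_rlusd_floor, min_xrp_pct,
                     max_xrp_pct, min_portfolio_rlusd):
    """True only when safety can be positively confirmed."""
    if not mid_price or mid_price <= 0:
        return False  # missing / zero mid_price: cannot confirm safe
    xrp_value = xrp * mid_price
    total = rlusd + xrp_value
    if total <= 0 or total < min_portfolio_rlusd:
        return False  # portfolio too small to evaluate meaningfully
    xrp_pct = 100.0 * xrp_value / total
    return rlusd >= min_rlusd_floor and min_xrp_pct <= xrp_pct <= max_xrp_pct


def step_corridor_counter(counter, safe, corridor_lookback_ticks=3):
    """Return (new_counter, should_exit) for one DEGRADED tick."""
    if not safe:
        return 0, False
    counter += 1
    if counter >= corridor_lookback_ticks:
        return 0, True
    return counter, False
```

Treating "cannot evaluate" the same as "unsafe" means a data outage can delay recovery but can never cause a premature exit.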

Per-episode cap:

max_recovery_attempts_per_episode=1
RECOVERY_CAPPED_SOURCES = (anchor, drift, corridor)
HALT_REASON_RECOVERY_EXHAUSTED = "recovery_exhausted_halt"
Per-source counters live in engine_state[degraded_recovery.<source>.attempts]. On each DEGRADED entry for a capped source, the counter increments atomically; attempts > 1 (i.e. a second entry from the same source in one session) escalates to HALT with recovery_exhausted_halt. Wallet-truth and reconciler sources are uncapped (existing refusal behavior). DB errors on counter reads are treated as "first attempt" to avoid spurious HALT.
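
The cap logic can be approximated with a plain dict standing in for the engine_state K/V store. This is a sketch under assumptions: the source strings "anchor", "drift" and "corridor", the string-encoded counter, and the register_degraded_entry helper are all hypothetical; only the key pattern, the cap value and the halt token come from the spec above.

```python
MAX_RECOVERY_ATTEMPTS_PER_EPISODE = 1
RECOVERY_CAPPED_SOURCES = ("anchor", "drift", "corridor")  # assumed string values
HALT_REASON_RECOVERY_EXHAUSTED = "recovery_exhausted_halt"


def attempts_key(source):
    return f"degraded_recovery.{source}.attempts"


def register_degraded_entry(engine_state, source):
    """Record one DEGRADED entry; return a halt reason if the cap is exceeded."""
    if source not in RECOVERY_CAPPED_SOURCES:
        return None  # wallet-truth / reconciler sources are uncapped
    try:
        # "" (the startup-reset value) and a missing key both count as zero.
        previous = int(engine_state.get(attempts_key(source)) or 0)
    except Exception:
        previous = 0  # read failure is treated as "first attempt"
    attempts = previous + 1
    engine_state[attempts_key(source)] = str(attempts)
    if attempts > MAX_RECOVERY_ATTEMPTS_PER_EPISODE:
        return HALT_REASON_RECOVERY_EXHAUSTED
    return None
```

Failing open on read errors trades a possible extra recovery attempt for not halting a healthy engine on a transient DB fault, which matches the stated intent.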

Execution ordering

Step 8.4 (new) runs before Step 8.5 / 8.5b / 8.5c guards on purpose:

Step 8.4  — DEGRADED recovery evaluators (anchor → drift → corridor)
Step 8.5  — Anchor saturation guard
Step 8.5b — Directional drift guard
Step 8.5c — Inventory corridor guard

A tick that exits DEGRADED at 8.4 can be immediately re-evaluated by the guards at 8.5–8.5c in the same tick. If the regime is still hostile, the guard trips again and the per-episode cap escalates to HALT with recovery_exhausted_halt. This is the mechanism for catching "recovered just to re-enter" pathology in hostile regimes.

Each recovery evaluator is a no-op unless the DEGRADED reason matches its source prefix — routing through engine_state[KEY_DEGRADED_REASON]. Only one evaluator can exit DEGRADED on any given tick.
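
The prefix routing can be pictured as a single lookup. The sketch below is illustrative only: the mode string "DEGRADED", the table name and the function name are assumptions, while the three reason prefixes are the ones named in the spec sections above.

```python
# (reason prefix, recovery source) pairs, in Step 8.4 evaluation order.
RECOVERY_PREFIXES = (
    ("anchor_saturation", "anchor"),
    ("directional_drift_guard", "drift"),
    ("inventory_corridor_guard", "corridor"),
)


def select_recovery_source(mode, degraded_reason):
    """Return the single source whose evaluator may act this tick, or None."""
    if mode != "DEGRADED" or not degraded_reason:
        return None  # healthy session: every evaluator is a no-op
    for prefix, source in RECOVERY_PREFIXES:
        if degraded_reason.startswith(prefix):
            return source
    return None  # e.g. wallet-truth or reconciler reasons have no evaluator
```

Since a stored reason matches at most one prefix, at most one evaluator can exit DEGRADED on a given tick.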

Startup state reset

The startup reset block added in fix/startup-mode-reset is extended in C2 to clear the three new recovery-attempt keys on fresh session start:

degraded_recovery.anchor.attempts    -> ""
degraded_recovery.drift.attempts     -> ""
degraded_recovery.corridor.attempts  -> ""
Pattern matches the existing inventory_truth.mode / degraded_since / degraded_reason clears. Confirmed alongside the 3 FLAG-041-follow-up tests in tests/test_halt_reason_lifecycle.py (green on this branch; no behavior change to those tests).
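
As a rough picture of the reset, assuming engine_state behaves like a string-to-string mapping (the helper name and the source strings are hypothetical):

```python
RECOVERY_ATTEMPT_KEYS = tuple(
    f"degraded_recovery.{source}.attempts"
    for source in ("anchor", "drift", "corridor")
)


def reset_recovery_attempts(engine_state):
    """Blank the three per-source attempt counters on fresh session start."""
    for key in RECOVERY_ATTEMPT_KEYS:
        engine_state[key] = ""  # same "" sentinel as the existing clears
```

Any reader of these keys then has to treat "" as zero attempts rather than parsing it as an integer directly.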

Backward compatibility

  • _escalate_degraded_to_halt gained optional halt_reason kwarg (default HALT_REASON_INVENTORY_TRUTH). Existing call sites unchanged.
  • recovery_enabled defaults to true on each guard. Setting any of them to false restores pre-FLAG-042 one-way behavior (guard fires, engine stays DEGRADED until restart) with zero code path change elsewhere.
  • No schema migrations — all state uses existing engine_state K/V.
  • Existing guard tests updated only for the new source=... kwarg added to _enter_degraded_mode in C2. No semantic test changes.

Files touched

config/config.example.yaml                  |  30 ++
config/config.yaml                          |  30 ++
config/config_live_stage1.yaml              |  24 ++
neo_engine/config.py                        | 184 ++    (+validator + loader fields for all three guards
                                                         and DegradedRecoveryConfig)
neo_engine/main_loop.py                     | 599 ++    (recovery state fields, _escalate_* kwarg, cap
                                                         helpers, 3 recovery evaluators, Step 8.4 wiring,
                                                         startup reset extension, source kwargs on guard
                                                         _enter_degraded_mode calls)
tests/test_anchor_saturation_guard.py       |   8 ++   (source="anchor" on 4 assertions)
tests/test_directional_drift_guard.py       |  16 ++   (source="drift" on 7 assertions)
tests/test_inventory_corridor_guard.py      |  14 ++   (source="corridor" on 7 assertions)
tests/test_reconciler_conservative.py       |   8 ++   (source=SOURCE_RECONCILER source-level assertion)
tests/test_flag_042_degraded_recovery.py    | 523 ++   (new)

Operator impact

  • Healthy sessions (no DEGRADED): zero observable change. Recovery evaluators short-circuit on the mode != DEGRADED check.
  • Single DEGRADED episode that recovers (new behavior): a [<GUARD>] DEGRADED WARNING is logged on entry and the counter in engine_state[degraded_recovery.<source>.attempts] increments; the stability window then accumulates (30 ticks anchor / 10 ticks drift / 3 ticks corridor); on exit, a "[<GUARD>] recovery conditions stable — exiting DEGRADED" WARNING is logged. Mode returns to OK, and the guard's one-shot flag is cleared so a re-entry is possible.
  • Second entry from same source in one session: immediate HALT with halt_reason=recovery_exhausted_halt and detail recovery_exhausted:<source>. Matches the existing inventory_truth_halt escalation contract.
  • recovery_enabled=false on any guard: that source behaves exactly as before FLAG-042 (one-way into DEGRADED until restart). Useful for Phase 7.4 SR-AUDIT comparison runs.

Deviations from tasking

One — condition C omitted from drift recovery. Documented in both C4 commit message and _evaluate_drift_recovery docstring. During DEGRADED no new fills arrive, so _drift_ticks_since_opposing_fill grows monotonically and condition C would permanently latch the guard. Conditions A (burst) and B (net notional) are the correct recovery signals — both use time-bounded rolling windows that decay to "no active flow." This matches the spirit of your ruling ("none of the drift conditions A/B/C would re-trigger") without introducing the pathology. Flagging for your review — happy to rework if you want a different interpretation.

No other deviations. Hysteresis thresholds, stability windows, episode cap, source taxonomy, halt token, startup reset, and Step 8.4 ordering all match the locked spec.

Apply instructions (Windows / PowerShell)

Patches live at 02 Projects/NEO Trading Engine/08 Patches/patches-flag-042-degraded-recovery/ (5 files, 0001 → 0005). From Katja's VS Code terminal:

cd "C:\Users\Katja\Documents\NEO GitHub\neo-2026"
git checkout main
git pull

# Defensive: clear any pre-existing branch from a prior attempt.
git branch -D feat/flag-042-degraded-recovery 2>$null

git checkout -b feat/flag-042-degraded-recovery

Get-ChildItem "C:\Users\Katja\Documents\Claude Homebase Neo\02 Projects\NEO Trading Engine\08 Patches\patches-flag-042-degraded-recovery" -Filter "*.patch" |
    Sort-Object Name |
    ForEach-Object { git am $_.FullName }

# Verify
git log --oneline main..HEAD
# Expected (topmost 5; subjects should match, but the hashes will likely differ because git am re-commits each patch):
#   2d380a4 test(recovery): FLAG-042 C5 — 16 tests for DEGRADED recovery evaluators
#   c38dd41 feat(main_loop): FLAG-042 C4 — drift + corridor recovery evaluators + Step 8.4 wiring
#   4bf8105 feat(main_loop): FLAG-042 C3 — anchor saturation recovery evaluator + Step 8.4 wiring
#   00b2898 feat(main_loop): FLAG-042 infrastructure — recovery counter + cap + halt taxonomy + startup reset
#   e209370 feat(config): add FLAG-042 recovery config schema (anchor + drift + corridor + episode cap)

# Regression
python -m pytest tests/test_flag_042_degraded_recovery.py -v
# Expected: 16 passed

python -m pytest tests/test_anchor_saturation_guard.py tests/test_directional_drift_guard.py `
  tests/test_inventory_corridor_guard.py tests/test_reconciler_conservative.py `
  tests/test_flag_036_wallet_truth_reconciliation.py tests/test_halt_reason_lifecycle.py `
  tests/test_reconciler_anomaly_log.py tests/test_config.py tests/test_config_invariants.py `
  tests/test_flag_042_degraded_recovery.py -q
# Expected: 158 passed

Prerequisite: fix/startup-mode-reset must already be on main (per CLAUDE.md, it merged Apr 21). If main is behind that merge, apply the startup-mode-reset bundle first or the C2 startup reset block hunk will not apply cleanly.

Status

C1–C5 complete. Branch is clean and ready. Atlas's locked spec followed; one documented drift-condition-C deviation flagged above. Awaiting your review.

— Orion