Orion Tasking — feat/flag-042-degraded-recovery

To: Orion (he/him)
From: Vesper (she/her)
CC: Katja (Captain), Atlas (he/him)
Date: 2026-04-21
Branch: feat/flag-042-degraded-recovery
Priority: HIGH (Atlas-approved, blocks S45)


Mission

Implement DEGRADED recovery logic for the anchor saturation guard (FLAG-042). This is the direct next step after Atlas's ruling approving FLAG-042 for the current phase.

Context: S43 and S44 both halted at ~441s via DEGRADED→HALT timeout. S44 showed the anchor regime cycling mid-session (mean +4.43 bps, range [−3.6, +10.0]) — the conditions that triggered DEGRADED did partially resolve, but the engine had no exit path. It sat idle for ~5 minutes and halted. FLAG-042 adds that exit path.

Ruling file: 07 Agent Coordination/[C] Atlas Ruling — FLAG-042 Approved + Recovery Spec.md


Scope — What Ships in This Branch

Primary (required for S45)

Anchor saturation recovery — exit DEGRADED when anchor conditions normalize:

  • |mean(anchor_error_bps)| over the lookback window (lookback_ticks) < 4 bps
  • Share of ticks in the lookback window with |anchor_error_bps| > 5 bps is below 30% (the exit prevalence threshold)
  • Both conditions hold for recovery_stability_ticks consecutive ticks (recommend 20–40; make configurable)
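
A minimal sketch of the two window checks, assuming the rolling window holds signed anchor_error_bps values and that an empty window never counts as clean. The function name and its parameter names are hypothetical; the defaults are the exit values quoted above. The consecutive-tick requirement is deliberately left out here.

```python
from collections import deque
from statistics import fmean
from typing import Iterable


def anchor_exit_conditions_met(
    errors_bps: Iterable[float],
    exit_bias_threshold_bps: float = 4.0,
    prevalence_threshold_bps: float = 5.0,
    exit_prevalence_pct: float = 30.0,
) -> bool:
    """Return True when both exit conditions hold over the given window."""
    window = list(errors_bps)
    if not window:
        return False  # assumption: no data is never treated as "clean"

    # Condition 1: absolute value of the mean error is below the exit bias.
    abs_mean = abs(fmean(window))

    # Condition 2: share of ticks whose |error| exceeds the prevalence
    # threshold is below the exit prevalence percentage.
    saturated = sum(1 for e in window if abs(e) > prevalence_threshold_bps)
    prevalence_pct = 100.0 * saturated / len(window)

    return abs_mean < exit_bias_threshold_bps and prevalence_pct < exit_prevalence_pct


window = deque([1.0, -2.0, 6.0, 0.5], maxlen=25)
print(anchor_exit_conditions_met(window))  # mean 1.375 bps, prevalence 25% -> True
```

Note that a window alternating between +6 and −6 bps has a mean near zero but 100% prevalence, which is why the second condition cannot be dropped.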

Secondary (minimal — do not overengineer)

Directional drift recovery: exit DEGRADED when an opposing fill is observed OR N ticks have elapsed with no same-side fills. Keep the exit condition minimal — mirror the existing drift guard logic in reverse.

Inventory corridor recovery: exit DEGRADED when inventory % is inside the corridor for corridor_lookback_ticks consecutive ticks. Reuse existing corridor parameters.
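
The two secondary hooks could be as small as the sketch below. Everything except corridor_lookback_ticks is a hypothetical name, including the classes, drift_side, quiet_ticks_required, and the corridor bounds. It assumes at most one fill side is reported per tick and that the "N ticks with no same-side fills" must be consecutive; confirm both against the existing guards during investigation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriftRecovery:
    """Exit on an opposing fill, or after N quiet ticks without same-side fills."""
    drift_side: str            # "buy" or "sell": the side that tripped the guard
    quiet_ticks_required: int  # hypothetical config value (the N in the text)
    quiet_ticks: int = 0

    def on_tick(self, fill_side: Optional[str]) -> bool:
        if fill_side is not None and fill_side != self.drift_side:
            return True                  # opposing fill observed
        if fill_side == self.drift_side:
            self.quiet_ticks = 0         # same-side fill restarts the count
            return False
        self.quiet_ticks += 1            # no fill on this tick
        return self.quiet_ticks >= self.quiet_ticks_required


@dataclass
class CorridorRecovery:
    """Exit once inventory % has stayed inside the corridor long enough."""
    corridor_min_pct: float
    corridor_max_pct: float
    corridor_lookback_ticks: int
    in_corridor_ticks: int = 0

    def on_tick(self, inventory_pct: float) -> bool:
        if self.corridor_min_pct <= inventory_pct <= self.corridor_max_pct:
            self.in_corridor_ticks += 1
        else:
            self.in_corridor_ticks = 0   # any excursion restarts the count
        return self.in_corridor_ticks >= self.corridor_lookback_ticks
```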


Atlas-Locked Constraints (Non-Negotiable)

4.1 Hysteresis Required

Entry threshold ≠ exit threshold for anchor saturation:

Transition       Mean Error           Prevalence
Enter DEGRADED   abs(mean) > 6 bps    > 40%
Exit DEGRADED    abs(mean) < 4 bps    < 30%

This asymmetry prevents oscillation — the engine cannot immediately re-enter DEGRADED after recovering.
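
To make the dead band concrete, here is a simplified sketch that looks at the mean-error column only and ignores prevalence and the stability requirement. The state name ACTIVE and the function name are hypothetical; in the real engine the thresholds would come from config, not defaults.

```python
def next_state(
    state: str,
    mean_error_bps: float,
    enter_abs_mean_bps: float = 6.0,  # entry value from the table above
    exit_abs_mean_bps: float = 4.0,   # exit value from the table above
) -> str:
    """Two-threshold (hysteresis) transition on abs(mean) only."""
    abs_mean = abs(mean_error_bps)
    if state == "ACTIVE" and abs_mean > enter_abs_mean_bps:
        return "DEGRADED"
    if state == "DEGRADED" and abs_mean < exit_abs_mean_bps:
        return "ACTIVE"
    return state  # between 4 and 6 bps the current state is kept


state = "ACTIVE"
trace = []
for mean_bps in [5.0, 6.5, 5.0, 3.5, 5.0]:
    state = next_state(state, mean_bps)
    trace.append(state)

print(trace)
# ['ACTIVE', 'DEGRADED', 'DEGRADED', 'ACTIVE', 'ACTIVE']
```

The same 5.0 bps reading leaves the engine in whichever state it was already in, which is the behavior test 10 is meant to pin down.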

4.2 Time Stability Required

Exit must require persistence: N consecutive ticks (recommend 20–40, configurable) OR an equivalent time window (60–120s). A single tick where the anchor is clean is NOT sufficient to exit DEGRADED.
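
One way to express the persistence rule is a streak counter that any dirty tick resets to zero. This is a sketch with a hypothetical class name; required_ticks would map to the configurable N.

```python
class StabilityCounter:
    """Counts consecutive clean ticks; a single dirty tick resets the streak."""

    def __init__(self, required_ticks: int) -> None:
        if required_ticks < 1:
            raise ValueError("required_ticks must be at least 1")
        self.required_ticks = required_ticks
        self.streak = 0

    def update(self, conditions_clean: bool) -> bool:
        """Record one tick and return True once the streak is long enough."""
        self.streak = self.streak + 1 if conditions_clean else 0
        return self.streak >= self.required_ticks


counter = StabilityCounter(required_ticks=3)
results = [counter.update(c) for c in [True, True, False, True, True, True]]
print(results)  # [False, False, False, False, False, True]
```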

4.3 State Reset on Exit

On exiting DEGRADED (re-arming), reset:

  • Rolling anchor error window (deque)
  • Guard counters
  • Any in-session DEGRADED bookkeeping

Treat as a fresh regime after recovery. Do not inherit pre-DEGRADED state.
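
A sketch of what the reset might look like, assuming the guard keeps its window in a bounded deque. The class and the counter fields (clean_streak, breach_count) are hypothetical stand-ins for whatever the investigation finds in the real guard.

```python
from collections import deque


class AnchorGuardState:
    """Hypothetical container for the per-guard state that must be reset."""

    def __init__(self, lookback_ticks: int = 25) -> None:
        self.error_window = deque(maxlen=lookback_ticks)  # rolling anchor errors
        self.clean_streak = 0   # consecutive clean ticks seen while DEGRADED
        self.breach_count = 0   # stand-in for existing guard counters

    def reset_for_fresh_regime(self) -> None:
        """Drop all pre-recovery observations; keep only configuration."""
        self.error_window.clear()  # clear() keeps the deque's maxlen
        self.clean_streak = 0
        self.breach_count = 0


guard_state = AnchorGuardState(lookback_ticks=3)
guard_state.error_window.extend([7.0, 8.0, 6.5])
guard_state.breach_count = 2
guard_state.reset_for_fresh_regime()
print(len(guard_state.error_window), guard_state.error_window.maxlen)  # 0 3
```

After the reset, the guard has to fill its window again before it can fire, so stale saturated readings cannot trigger an immediate re-entry.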

4.4 One Recovery Attempt Per Episode

Track whether a recovery has already occurred in the current session episode. If the system:

  1. Enters DEGRADED
  2. Recovers (exits DEGRADED)
  3. Re-enters DEGRADED within the same episode

→ Do NOT loop. Escalate directly to HALT on the second entry.

This means the recovery logic must track recovery_attempt_count (or equivalent) per session. Second failure = HALT, no second recovery.
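
The episode cap reduces to one counter and one comparison. The sketch below uses a hypothetical EpisodeTracker class; how the counter is persisted in engine_state is not shown and depends on the investigation findings.

```python
from dataclasses import dataclass


@dataclass
class EpisodeTracker:
    """Tracks recoveries used in the current episode (hypothetical helper)."""
    max_recovery_attempts_per_episode: int = 1
    recovery_attempt_count: int = 0

    def on_guard_fired(self) -> str:
        """Return the state a guard trigger should lead to."""
        if self.recovery_attempt_count >= self.max_recovery_attempts_per_episode:
            return "HALT"      # recovery budget already spent: no second loop
        return "DEGRADED"

    def on_recovered(self) -> None:
        """Call once each time the engine exits DEGRADED via recovery."""
        self.recovery_attempt_count += 1


tracker = EpisodeTracker()
first = tracker.on_guard_fired()   # "DEGRADED"
tracker.on_recovered()
second = tracker.on_guard_fired()  # "HALT"
print(first, second)
```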


Pre-Code Investigation Required

Before writing any code, confirm:

  1. How does _exit_degraded_mode() work? Identify the exact method signature, what it resets, and what callers it expects. Confirm whether it already handles order re-enabling or whether the engine needs explicit re-arm logic to resume quoting.

  2. What is the anchor guard's current state machine? Identify where the anchor saturation guard evaluation lives in the tick loop, how it transitions to DEGRADED (what method it calls), and what guard state persists between ticks. Understand what "reset rolling windows and guard counters" means concretely for this guard.

  3. Where does DEGRADED mode persist across ticks? Confirm that while in DEGRADED, the anchor error window is still being updated each tick (required for recovery evaluation). If not, the recovery monitor must be a separate evaluation path that runs even in DEGRADED.

  4. Episode tracking — what exists? Check whether the engine tracks per-session DEGRADED entry/exit counts anywhere. If not, Orion implements a minimal counter: degraded_entry_count per session, written to engine_state. On second entry, escalate to HALT instead of looping.

  5. Directional drift and inventory corridor guards — exit conditions needed? Confirm whether these guards also need recovery hooks for this branch or whether Atlas's "keep minimal" means anchor only for S45. Per the ruling: "Keep minimal for now" on drift and corridor. Implement their hooks if they're < 30 lines; skip if they require significant investigation.

Report findings before writing any code.


YAML Config Changes

Add recovery parameters to the existing anchor_saturation_guard block in config_live_stage1.yaml and config.yaml:

anchor_saturation_guard:
  enabled: true
  lookback_ticks: 25
  bias_threshold_bps: 7.0
  prevalence_threshold_bps: 5.0
  prevalence_pct: 40.0
  # FLAG-042 recovery
  recovery_enabled: true
  recovery_exit_bias_threshold_bps: 4.0    # abs(mean) must drop below this
  recovery_exit_prevalence_pct: 30.0       # prevalence must drop below this
  recovery_stability_ticks: 30             # consecutive ticks required to exit

Add a top-level recovery constraint under the strategy key (or the equivalent top-level guard config):

degraded_recovery:
  max_recovery_attempts_per_episode: 1     # second DEGRADED entry → HALT

Exact YAML placement: match the nesting of existing guard config. Confirm exact key names with what's already in config.
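
A load-time sanity check can catch a config where the exit thresholds are not strictly tighter than the entry thresholds. The sketch below assumes the guard block has already been parsed into a plain dict with the key names shown above; validate_recovery_config is a hypothetical helper, not an existing function.

```python
def validate_recovery_config(guard_cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks sane."""
    problems = []
    if guard_cfg["recovery_exit_bias_threshold_bps"] >= guard_cfg["bias_threshold_bps"]:
        problems.append("exit bias threshold must be below the entry bias threshold")
    if guard_cfg["recovery_exit_prevalence_pct"] >= guard_cfg["prevalence_pct"]:
        problems.append("exit prevalence must be below the entry prevalence")
    if guard_cfg["recovery_stability_ticks"] < 1:
        problems.append("recovery_stability_ticks must be at least 1")
    return problems


# Values copied from the YAML block above.
guard_cfg = {
    "enabled": True,
    "lookback_ticks": 25,
    "bias_threshold_bps": 7.0,
    "prevalence_threshold_bps": 5.0,
    "prevalence_pct": 40.0,
    "recovery_enabled": True,
    "recovery_exit_bias_threshold_bps": 4.0,
    "recovery_exit_prevalence_pct": 30.0,
    "recovery_stability_ticks": 30,
}
print(validate_recovery_config(guard_cfg))  # []
```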


Implementation Notes

Recovery monitor runs while in DEGRADED. The existing DEGRADED loop (cancel orders, stop quoting, continue observation) must also evaluate exit conditions each tick. If the anchor saturation guard triggered DEGRADED, the anchor recovery check runs every tick until either: (a) exit conditions are met for recovery_stability_ticks ticks, or (b) the 300s timeout hits.
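
The per-tick decision while in DEGRADED might look like the sketch below. The function and enum names are hypothetical. It assumes the timeout is checked first, so that a tick where both conditions hold resolves conservatively to HALT; the ruling does not state a precedence, so confirm this before relying on it.

```python
from enum import Enum


class Decision(Enum):
    STAY_DEGRADED = "stay_degraded"
    RECOVER = "recover"
    HALT = "halt"


def degraded_tick_decision(
    clean_streak: int,
    stability_ticks: int,
    elapsed_s: float,
    timeout_s: float = 300.0,
    recovery_enabled: bool = True,
) -> Decision:
    """Decide what to do on one tick while the engine is in DEGRADED."""
    # Assumption: timeout wins ties (conservative choice, not from the ruling).
    if elapsed_s >= timeout_s:
        return Decision.HALT
    # Exit only after the exit conditions held for enough consecutive ticks.
    if recovery_enabled and clean_streak >= stability_ticks:
        return Decision.RECOVER
    return Decision.STAY_DEGRADED


print(degraded_tick_decision(clean_streak=30, stability_ticks=30, elapsed_s=120.0))
# Decision.RECOVER
```

With recovery_enabled set to False the function can only return STAY_DEGRADED or HALT, which matches the existing timeout-only behavior.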

Recovery exit: on triggering exit from DEGRADED:

  1. Reset rolling windows and guard counters (state reset)
  2. Call _exit_degraded_mode() (or equivalent)
  3. Increment recovery attempt counter in engine_state
  4. Resume normal quoting on the next tick
  5. Log recovery: [ANCHOR_SAT_RECOVERY] Exited DEGRADED — mean={x:.2f}bps prevalence={y:.1f}% stable for {n} ticks

Second DEGRADED entry (same episode): if degraded_entry_count >= 1 and the guard fires again, skip recovery logic entirely and escalate directly to HALT. Log: [DEGRADED_RECOVERY] Second DEGRADED entry in episode — escalating to HALT (no loop).

Do NOT: modify entry thresholds, touch offset logic, relax any existing guard parameters, or expand FLAG-037. Atlas ruling Section 9 is explicit on this.


Test Requirements

Minimum 10 tests for this branch:

  1. Recovery does NOT exit DEGRADED on a single clean tick (time stability)
  2. Recovery does NOT exit DEGRADED when bias is clean but prevalence still high (both conditions required)
  3. Recovery does NOT exit DEGRADED when prevalence is clean but bias still high
  4. Recovery exits DEGRADED when both conditions met for recovery_stability_ticks consecutive ticks
  5. Recovery exits DEGRADED for negative bias scenario (symmetry — anchor recovers from negative saturation)
  6. State reset on exit — rolling window is cleared, guard counters reset
  7. Second DEGRADED entry after recovery escalates to HALT (one-recovery-per-episode rule)
  8. Recovery counter persists correctly in engine_state through the transition
  9. recovery_enabled: false disables recovery path — DEGRADED never exits automatically
  10. Hysteresis: after exiting DEGRADED with mean at 3.5 bps, re-entering requires mean to cross 6 bps again (not 4 bps)

Commit Plan (suggested)

  1. feat: add FLAG-042 recovery config schema (anchor saturation + episode cap)
  2. feat: implement anchor saturation recovery monitor — stability window and exit conditions
  3. feat: add one-recovery-per-episode escalation to HALT
  4. feat: state reset on DEGRADED exit (anchor guard windows + counters)
  5. feat: add directional drift and inventory corridor recovery hooks (minimal)
  6. test: FLAG-042 recovery — stability, symmetry, hysteresis, episode cap, state reset

Fewer commits if the secondary guards are trivial; split further if episode tracking requires meaningful isolated work.


Constraints

  • Do not modify DEGRADED entry thresholds — hysteresis is asymmetric by design
  • Do not expand FLAG-037 — phantom fill realignment remains a standing manual procedure
  • Do not relax any existing guard parameters
  • Parameters must be configurable via YAML — no hardcoded thresholds
  • Recovery must be disableable (recovery_enabled: false) without restart
  • No strategy tuning in this branch — recovery state machine only

Standing Delivery Rules (enforced)

  1. No pre-creating branches. Orion must not create feat/flag-042-degraded-recovery until ready to commit. Investigation stays on main or a throwaway local branch deleted before delivery.
  2. No *.patch glob in PowerShell. Apply instructions must use the Get-ChildItem ... -Filter "*.patch" | Sort-Object Name | ForEach-Object { git am $_.FullName } form.
  3. Always include defensive branch delete. Apply instructions must include git branch -D feat/flag-042-degraded-recovery 2>$null before git checkout -b feat/flag-042-degraded-recovery.

Deliverable

Standard delivery format:

  • Branch name, commit list with hashes and messages
  • Test count and pass rate
  • Pre-code investigation findings (required; do not skip)
  • Patch files to: patches/feat-flag-042-degraded-recovery/
  • Deviations from spec, flagged explicitly

Vesper reviews before merge. S45 runs after merge.

— Vesper