Cross-Team Briefing — S54 Root Cause, FLAG-052, FLAG-051¶
1. FLAG-051 (Vesper delivery, merged a5897cc — Atlas/Orion not yet briefed)¶
What it was: Cross-session EMA staleness caused an immediate ANCHOR_IDLE lockout when the regime shifted between sessions. S53 confirmed it: the persisted EMA sat at ~+10 bps (carried over from the cap-locked S49/S50 sessions) against a fresh structural signal of −5 bps, producing a −15 bps residual → ANCHOR_IDLE on tick 5 with no exit path.
Fix: Added baseline_regime_drift_threshold_bps: float = 10.0 to AnchorDualSignalConfig. On first observe() call after persistence restore, if abs(structural − persisted_baseline) > threshold, discard the persisted baseline and cold-start the EMA warm-up. This is an emergency reset gate on session start only — normal operation is unchanged.
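A minimal sketch of the reset gate, assuming scaffolding around it: only the config field name and the observe()-time check come from the fix as described; the attribute and helper names here are illustrative, not the committed implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnchorDualSignalConfig:
    # Field added by the FLAG-051 fix (name and default per the briefing).
    baseline_regime_drift_threshold_bps: float = 10.0


class DualSignalCalculator:
    """Illustrative fragment; names other than the config field are assumed."""

    def __init__(self, config: AnchorDualSignalConfig):
        self._config = config
        self._persisted_baseline_bps: Optional[float] = None  # restored EMA
        self._first_observe_after_restore = True

    def observe(self, structural_bps: float) -> None:
        # Emergency reset gate: runs only on the first observe() after a
        # persistence restore; normal operation is unchanged.
        if self._first_observe_after_restore:
            self._first_observe_after_restore = False
            baseline = self._persisted_baseline_bps
            if baseline is not None:
                drift = abs(structural_bps - baseline)
                if drift > self._config.baseline_regime_drift_threshold_bps:
                    # Stale cross-session EMA: discard it and cold-start
                    # the warm-up instead of computing a bogus residual.
                    self._persisted_baseline_bps = None
        # ... normal EMA update / residual computation continues here ...
```

With the S53 numbers (persisted +10 bps, structural −5 bps), the drift is 15 bps > 10.0, so the baseline is discarded and the EMA cold-starts.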
Verification: S54 confirmed the fix worked. The EMA cold-started correctly (drift 12.5 bps > 10.0 threshold, as logged). The engine exited ANCHOR_IDLE at ~01:55–01:59 UTC as the 150-tick EMA converged toward the −11 bps structural signal; the residual went to ~0 and orders were placed.
Files changed: neo_engine/config.py, neo_engine/dual_signal_calculator.py, config/config_live_stage1.yaml, tests/test_flag_051_regime_drift.py (13 tests). Merged a5897cc.
2. S54 Root Cause — Corrected Analysis¶
What I initially diagnosed (WRONG)¶
My first analysis after S54 concluded: "The DEGRADED entry cancel path is missing FLAG-047's tecNO_TARGET detection." This was incorrect. I based that conclusion on production log analysis without reading the code.
What I found when I read the code¶
_cancel_all_live_orders is a single shared method called by both the DEGRADED entry path ("Degraded entry cancel") and the shutdown path ("Shutdown cancel"). FLAG-047's tecNO_TARGET → mark_cancel_race_unknown logic is already present in this shared method (lines 1637–1664 of the committed main_loop.py). Both paths are covered.
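Structurally, the shared coverage looks roughly like this (a minimal sketch; the order statuses, the simulated result lookup, and every helper here are assumptions standing in for the real XRPL submit path):

```python
class MainLoopSketch:
    """Hypothetical skeleton showing that both call sites funnel through
    one method that already contains the FLAG-047 handling."""

    def __init__(self, orders, results):
        self._orders = orders    # order_id -> status
        self._results = results  # order_id -> simulated cancel result

    def _on_degraded_entry(self):
        self._cancel_all_live_orders("Degraded entry cancel")

    def _shutdown(self):
        self._cancel_all_live_orders("Shutdown cancel")

    def _cancel_all_live_orders(self, reason: str):
        # Single shared method: both paths get tecNO_TARGET detection.
        for order_id, status in list(self._orders.items()):
            if status != "LIVE":
                continue
            result = self._results.get(order_id, "tesSUCCESS")
            if result == "tecNO_TARGET":
                # FLAG-047: the offer was already consumed on-chain,
                # possibly a fill racing our cancel. Hand it to the
                # reconciler instead of assuming a clean cancel.
                self._orders[order_id] = "CANCEL_RACE_UNKNOWN"
            else:
                self._orders[order_id] = "CANCELLED_BY_ENGINE"
```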
Actual root cause — tick ordering race¶
The S54 halt was caused by a timing race between the truth check and the reconciler. The tick structure is:
Tick start → _maybe_run_periodic_truth_check (runs if 60s elapsed)
→ Step 1: pre-trade inventory snapshot
→ Step 2: risk check
→ Step 3: market data
→ Step 4: account_offers fetch
→ Step 5: reconciler (resolves CANCEL_RACE_UNKNOWN via account_tx)
→ ...
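The ordering above can be made mechanical with a sketch (hypothetical skeleton; only the method name _maybe_run_periodic_truth_check, the 60s interval, and the Step 5 reconciler position come from the briefing):

```python
class TickSketch:
    """Minimal tick skeleton reproducing the race: the truth check runs
    at the top of the tick, the reconciler at Step 5."""

    def __init__(self, check_interval_s: float = 60.0):
        self.check_interval_s = check_interval_s
        self._last_truth_check_ts = 0.0
        self.halted = False
        self.pending_race = False       # CANCEL_RACE_UNKNOWN outstanding
        self.unexplained_delta = False  # on-chain fill not yet recorded

    def run_tick(self, now: float):
        self._maybe_run_periodic_truth_check(now)  # TOP of tick
        if self.halted:
            return                                 # S54 exited here
        # Steps 1-4 elided ...
        self._run_reconciler()                     # Step 5: too late

    def _maybe_run_periodic_truth_check(self, now: float):
        if now - self._last_truth_check_ts >= self.check_interval_s:
            self._last_truth_check_ts = now
            if self.unexplained_delta:
                self.halted = True  # inventory_truth_halt

    def _run_reconciler(self):
        if self.pending_race:
            # Resolves CANCEL_RACE_UNKNOWN via account_tx, explaining
            # the balance delta.
            self.pending_race = False
            self.unexplained_delta = False
```

With a pending race, an unexplained delta, and an elapsed interval, run_tick halts before the reconciler ever executes; without the delta, the same tick resolves the race cleanly.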
The sequence in S54:
- Tick N: DEGRADED entry fires → _cancel_all_live_orders("Degraded entry cancel") → buy order filled on-chain → tecNO_TARGET → mark_cancel_race_unknown succeeds → order is CANCEL_RACE_UNKNOWN
- Tick N+1: _maybe_run_periodic_truth_check runs at the TOP of the tick, and the 60-second interval had elapsed (the last check was ~60s before the DEGRADED entry). The truth check fetches on-chain balances and sees a +13.65 XRP / −19.5 RLUSD delta. The fill has not been recorded yet, because the reconciler runs at Step 5, AFTER the truth check. The truth check transitions to HALT and the tick exits before the reconciler ever runs: inventory_truth_halt.
The _handle_cancel_race_unknown implementation existed and was correct — it just never got a chance to execute before the truth check killed the session.
Why it wasn't caught by FLAG-047 tests¶
FLAG-047 tests mocked or disabled the truth check. They tested the reconciler's resolution logic in isolation. The specific failure requires: (1) a real 60-second timer that aligns with (2) a tick immediately following CANCEL_RACE_UNKNOWN creation. This integration timing case was not covered.
3. FLAG-052 — Fix Delivered by Vesper¶
Root cause (precise): After mark_cancel_race_unknown succeeds, _last_truth_check_ts was not reset. The next periodic truth check could fire before the reconciler resolved the race.
Fix: One line in the else: branch of _cancel_all_live_orders (the branch that executes only when mark_cancel_race_unknown succeeds):
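A minimal sketch of the patched branch, with assumed scaffolding around it; only the _last_truth_check_ts = time.time() line (quoted in the questions for Atlas) is the actual one-line change:

```python
import time


class CancelRaceTimerFix:
    """Illustrative only: a stand-in for the tecNO_TARGET branch of
    _cancel_all_live_orders; everything but the timer reset is assumed."""

    def __init__(self, db_ok: bool = True):
        self._db_ok = db_ok
        self._last_truth_check_ts = 0.0
        self.status = "CANCELLED_BY_ENGINE"

    def mark_cancel_race_unknown(self):
        if not self._db_ok:
            raise RuntimeError("DB write failed")
        self.status = "CANCEL_RACE_UNKNOWN"

    def on_tec_no_target(self):
        try:
            self.mark_cancel_race_unknown()
        except RuntimeError:
            # DB write failed: the order stays CANCELLED_BY_ENGINE and
            # there is no fill to wait for -- timer deliberately NOT reset.
            pass
        else:
            # FLAG-052 fix: reset the truth-check timer so the reconciler
            # gets one full check_interval_s to resolve the race.
            self._last_truth_check_ts = time.time()
```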
Why this is correct:
- Resets the timer to now, giving the reconciler one full check_interval_s (default 60s) before the next truth check
- Only fires on successful mark_cancel_race_unknown — if the DB write fails, the order stays CANCELLED_BY_ENGINE and there's no fill to wait for (timer correctly NOT reset)
- Only fires on tecNO_TARGET — normal tesSUCCESS cancels don't need the deferral (timer correctly NOT reset)
- The truth check is NOT permanently suppressed — it fires again after check_interval_s, and if the reconciler resolved the fill, it passes. If INCONCLUSIVE, it fails as expected
Patch location: 08 Patches/fix-flag-052-cancel-race-timer/
- 0001-fix-main-loop-FLAG-052-truth-timer-reset.patch — main_loop.py (1 line change)
- 0002-test-main-loop-FLAG-052-truth-timer-reset.patch — 3 new tests
- APPLY.md — apply instructions
Tests (3 invariants):
1. TIMER_RESET_ON_CONFIRMED_RACE — tecNO_TARGET + successful mark_cancel_race_unknown → _last_truth_check_ts reset to ≈time.time()
2. TIMER_NOT_RESET_ON_DB_FAILURE — mark_cancel_race_unknown raises → timer unchanged
3. TIMER_NOT_RESET_ON_NORMAL_CANCEL — tesSUCCESS → timer unchanged
Vesper requests an Atlas ruling on the fix before Katja applies it. See questions below.
4. Working Tree Truncation — Ongoing Concern¶
Finding: The Linux sandbox shows that working tree files are significantly shorter than committed versions:
- ledger_reconciler.py: 982 lines disk vs 1542 lines committed (560 missing)
- main_loop.py: 4974 lines disk vs 6250 lines committed (1276 missing)
- state_manager.py: 2819 lines disk vs 3173 committed (354 missing)
- xrpl_gateway.py: 1573 lines disk vs 1820 committed (247 missing)
- config.py: 1420 lines disk vs 1639 committed (219 missing)
The .pyc files appear to have been compiled from the correct full versions: each pyc header records the source size/mtime it was compiled against, and those values do not match the truncated files on disk. Python on Katja's Windows machine therefore likely loads the cached .pyc files and runs correctly today, but only for as long as those caches stay valid.
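The header comparison behind this inference can be sketched directly from the PEP 552 timestamp-based .pyc layout (magic, flags, source mtime, source size); this is an illustration, not engine code:

```python
import os
import struct


def pyc_matches_source(source_path: str, pyc_path: str) -> bool:
    """Return True if a timestamp-based .pyc still matches the .py on disk.

    The 16-byte PEP 552 header is: 4-byte magic, 4-byte flags, then (for
    timestamp-based pycs) 4-byte source mtime and 4-byte source size.
    """
    with open(pyc_path, "rb") as f:
        magic, flags, mtime, size = struct.unpack("<4sIII", f.read(16))
    if flags & 0b1:
        return True  # hash-based pyc: mtime/size fields are not used
    st = os.stat(source_path)
    return (size == st.st_size & 0xFFFFFFFF
            and mtime == int(st.st_mtime) & 0xFFFFFFFF)
```

A mismatch on a truncated source is exactly what forces the recompile (and the SyntaxError) described below.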
Risk: If any .pyc is invalidated (Python version change, __pycache__ cleared, disk cleanup, fresh checkout, etc.), Python will try to recompile from the truncated source and get SyntaxError: unterminated triple-quoted string literal. The engine will fail to start.
Katja needs to run (in VS Code terminal from repo root):
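A likely form of the command, based on the git checkout HEAD -- neo_engine/ invocation discussed in the questions for Atlas (exact pathspec scope is Vesper's assumption, pending confirmation):

```shell
# Restore all working-tree files under neo_engine/ from HEAD.
git checkout HEAD -- neo_engine/

# Verify nothing under neo_engine/ still differs from HEAD.
git status --short neo_engine/
```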
This restores all truncated working-tree files from HEAD and removes the dependency on potentially stale .pyc files. Note: the SMB mount (through which Vesper reads) may not accurately reflect Windows file state; the truncation may exist only on the Linux side. Katja should verify with git status on Windows before and after.
5. Questions for Atlas¶
- FLAG-052 fix approval: Is the _last_truth_check_ts = time.time() approach correct, or should the deferral logic live in _maybe_run_periodic_truth_check (e.g., check for CANCEL_RACE_UNKNOWN orders and skip the check if any are pending)? The timer reset is simpler but uses a coarse 60s grace period; the CANCEL_RACE_UNKNOWN check would be more precise but requires a DB query every tick.
- Working tree restoration: Should git checkout HEAD -- neo_engine/ be run before applying the FLAG-052 patch, or after? (It should not conflict, but Vesper wants Atlas to confirm sequencing.)
- Audit gap: The FLAG-047 test suite should have included an integration test covering the timing race (CANCEL_RACE_UNKNOWN created AND the truth check firing in the same tick sequence). Should this be added as a separate follow-up, or included in the FLAG-052 delivery?
6. Orion Status¶
Orion is on hold. Vesper is handling FLAG-052. No new tasking for Orion until FLAG-052 is merged and S55 validates.
— Vesper