Cross-Team Briefing — S54 Root Cause, FLAG-052, FLAG-051

1. FLAG-051 (Vesper delivery, merged a5897cc — Atlas/Orion not yet briefed)

What it was: Cross-session EMA staleness caused immediate ANCHOR_IDLE lockout when the regime shifted between sessions. S53 confirmed it: persisted EMA ~+10 bps from S49/S50 cap-locked sessions, fresh structural −5 bps → residual −15 bps → ANCHOR_IDLE on tick 5, no exit path.

Fix: Added baseline_regime_drift_threshold_bps: float = 10.0 to AnchorDualSignalConfig. On first observe() call after persistence restore, if abs(structural − persisted_baseline) > threshold, discard the persisted baseline and cold-start the EMA warm-up. This is an emergency reset gate on session start only — normal operation is unchanged.
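The gate itself is small. A minimal, hypothetical sketch of the restore-time logic (the class and field names follow this briefing; the EMA machinery is reduced to a single baseline value, so this is an illustration, not the committed code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnchorDualSignalConfig:
    # FLAG-051 addition; all other fields omitted for brevity.
    baseline_regime_drift_threshold_bps: float = 10.0

class DualSignalCalculator:
    """Toy stand-in for the real calculator: one persisted EMA baseline."""

    def __init__(self, config: AnchorDualSignalConfig,
                 persisted_baseline_bps: Optional[float] = None):
        self.config = config
        self.baseline_bps = persisted_baseline_bps
        self._first_observe_after_restore = persisted_baseline_bps is not None

    def observe(self, structural_bps: float) -> Optional[float]:
        # Emergency reset gate: first observe() after persistence restore only.
        if self._first_observe_after_restore:
            self._first_observe_after_restore = False
            drift = abs(structural_bps - self.baseline_bps)
            if drift > self.config.baseline_regime_drift_threshold_bps:
                # Discard the stale baseline and cold-start the EMA warm-up.
                self.baseline_bps = None
        # ... normal EMA warm-up / update would continue here ...
        return self.baseline_bps
```

With the S53 numbers (persisted ~+10 bps, structural −5 bps) the drift is 15 bps, above the 10 bps threshold, so the persisted baseline is discarded and the warm-up restarts.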

Verification: S54 confirmed the fix worked. EMA cold-started correctly (drift 12.5 bps > 10.0 threshold logged). Engine exited ANCHOR_IDLE at ~01:55–01:59 UTC as the 150-tick EMA converged to the −11 bps structural. Residual → ~0. Orders placed.

Files changed: neo_engine/config.py, neo_engine/dual_signal_calculator.py, config/config_live_stage1.yaml, tests/test_flag_051_regime_drift.py (13 tests). Merged a5897cc.


2. S54 Root Cause — Corrected Analysis

What I initially diagnosed (WRONG)

My first analysis after S54 concluded: "The DEGRADED entry cancel path is missing FLAG-047's tecNO_TARGET detection." This was incorrect. I based that conclusion on production log analysis without reading the code.

What I found when I read the code

_cancel_all_live_orders is a single shared method called by both the DEGRADED entry path ("Degraded entry cancel") and the shutdown path ("Shutdown cancel"). FLAG-047's tecNO_TARGET → mark_cancel_race_unknown logic is already present in this shared method (lines 1637–1664 of the committed main_loop.py). Both paths are covered.

Actual root cause — tick ordering race

The S54 halt was caused by a timing race between the truth check and the reconciler. The tick structure is:

Tick start → _maybe_run_periodic_truth_check (runs if 60s elapsed)
           → Step 1: pre-trade inventory snapshot
           → Step 2: risk check
           → Step 3: market data
           → Step 4: account_offers fetch
           → Step 5: reconciler (resolves CANCEL_RACE_UNKNOWN via account_tx)
           → ...

The sequence in S54:
- Tick N: DEGRADED entry fires → _cancel_all_live_orders("Degraded entry cancel") → buy order filled on-chain → tecNO_TARGET → mark_cancel_race_unknown succeeds → order is CANCEL_RACE_UNKNOWN.
- Tick N+1: _maybe_run_periodic_truth_check runs at the TOP of the tick. The 60-second interval had elapsed (the last check was ~60s before the DEGRADED entry). The truth check fetches on-chain balances and sees a +13.65 XRP / −19.5 RLUSD delta. The fill hasn't been recorded yet (the reconciler runs at Step 5, AFTER the truth check). The truth check transitions to HALT with reason inventory_truth_halt, and the tick exits before the reconciler runs.

The _handle_cancel_race_unknown implementation existed and was correct — it just never got a chance to execute before the truth check killed the session.
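The race is easiest to see in a toy tick loop. A deliberately simplified model — only the step ordering, interval, and halt reason follow this briefing; all other logic is placeholder:

```python
class ToyEngine:
    """Simplified model of the S54 tick ordering (illustration only)."""
    CHECK_INTERVAL_S = 60.0

    def __init__(self):
        self._last_truth_check_ts = 0.0
        self.unrecorded_fill = False  # fill awaiting the reconciler (Step 5)
        self.halted = None

    def tick(self, now):
        # Tick top: periodic truth check runs BEFORE the reconciler.
        if now - self._last_truth_check_ts >= self.CHECK_INTERVAL_S:
            self._last_truth_check_ts = now
            if self.unrecorded_fill:  # on-chain balances != recorded inventory
                self.halted = "inventory_truth_halt"
                return                # tick exits; reconciler never runs
        # ... Steps 1-4 (inventory snapshot, risk, market data, offers) ...
        self.unrecorded_fill = False  # Step 5: reconciler records the fill
```

Tick N leaves unrecorded_fill set; if tick N+1 lands at or past the 60 s boundary, the truth check halts before Step 5 can clear it. If the boundary hasn't elapsed, Step 5 resolves the fill and the next truth check passes.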

Why it wasn't caught by FLAG-047 tests

FLAG-047 tests mocked or disabled the truth check. They tested the reconciler's resolution logic in isolation. The specific failure requires: (1) a real 60-second timer that aligns with (2) a tick immediately following CANCEL_RACE_UNKNOWN creation. This integration timing case was not covered.


3. FLAG-052 — Fix Delivered by Vesper

Root cause (precise): After mark_cancel_race_unknown succeeds, _last_truth_check_ts was not reset. The next periodic truth check could fire before the reconciler resolved the race.

Fix: One line in the else: branch of _cancel_all_live_orders (the branch that executes only when mark_cancel_race_unknown succeeds):

self._last_truth_check_ts = time.time()

Why this is correct:
- Resets the timer to now, giving the reconciler one full check_interval_s (default 60s) before the next truth check.
- Only fires on a successful mark_cancel_race_unknown — if the DB write fails, the order stays CANCELLED_BY_ENGINE and there's no fill to wait for (timer correctly NOT reset).
- Only fires on tecNO_TARGET — normal tesSUCCESS cancels don't need the deferral (timer correctly NOT reset).
- The truth check is NOT permanently suppressed — it fires again after check_interval_s, and if the reconciler resolved the fill, it passes. If INCONCLUSIVE, it fails as expected.

Patch location: 08 Patches/fix-flag-052-cancel-race-timer/
- 0001-fix-main-loop-FLAG-052-truth-timer-reset.patch — main_loop.py (1-line change)
- 0002-test-main-loop-FLAG-052-truth-timer-reset.patch — 3 new tests
- APPLY.md — apply instructions

Tests (3 invariants):
1. TIMER_RESET_ON_CONFIRMED_RACE — tecNO_TARGET + successful mark_cancel_race_unknown → _last_truth_check_ts reset to ≈ time.time()
2. TIMER_NOT_RESET_ON_DB_FAILURE — mark_cancel_race_unknown raises → timer unchanged
3. TIMER_NOT_RESET_ON_NORMAL_CANCEL — tesSUCCESS → timer unchanged

Vesper requests Atlas ruling on the fix before Katja applies. See questions below.


4. Working Tree Truncation — Ongoing Concern

Finding: The Linux sandbox shows that working tree files are significantly shorter than their committed versions:
- ledger_reconciler.py: 982 lines on disk vs 1542 committed (560 missing)
- main_loop.py: 4974 lines on disk vs 6250 committed (1276 missing)
- state_manager.py: 2819 vs 3173; xrpl_gateway.py: 1573 vs 1820; config.py: 1420 vs 1639

The .pyc files appear to have been compiled from the correct full versions (the .pyc header records the source's size/mtime at compile time, and those values don't match the truncated disk files). Python on Katja's Windows machine likely uses the .pyc files and runs correctly today. BUT:

Risk: If any .pyc is invalidated (Python version change, __pycache__ cleared, disk cleanup, fresh checkout, etc.), Python will try to recompile from the truncated source and get SyntaxError: unterminated triple-quoted string literal. The engine will fail to start.

Katja needs to run (in VS Code terminal from repo root):

git checkout HEAD -- neo_engine/

This restores all truncated working tree files from HEAD and eliminates the dependency on potentially stale .pyc files.

Note: The SMB mount (which Vesper reads through) may not accurately reflect Windows file state — the truncation may appear only on the Linux side. Katja should verify with git status on Windows before and after.


5. Questions for Atlas

  1. FLAG-052 fix approval: Is the _last_truth_check_ts = time.time() approach correct, or should the deferral logic live in _maybe_run_periodic_truth_check (e.g., check for CANCEL_RACE_UNKNOWN orders and skip if any are pending)? The timer reset is simpler but uses a coarse 60s grace period. The CANCEL_RACE_UNKNOWN check would be more precise but requires a DB query every tick.

  2. Working tree restoration: Should git checkout HEAD -- neo_engine/ be run before applying the FLAG-052 patch, or after? (It should not conflict, but want Atlas to confirm sequencing.)

  3. Audit gap: The FLAG-047 test suite should have included an integration test covering the timing race (CANCEL_RACE_UNKNOWN created AND truth check fires in the same tick sequence). Should this be added as a separate follow-up, or included in the FLAG-052 delivery?
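For comparison, the alternative raised in question 1 would look roughly like this. A hypothetical sketch only: count_orders_in_state is an assumed DB helper, not an existing API, and the surrounding class is a stand-in for the real engine:

```python
class TruthCheckSketch:
    """Stand-in for the engine's periodic-truth-check plumbing."""

    def __init__(self, state_manager, check_interval_s=60.0):
        self.state_manager = state_manager
        self.check_interval_s = check_interval_s
        self._last_truth_check_ts = 0.0
        self.runs = 0

    def _run_truth_check(self):
        self.runs += 1  # placeholder for the real balance comparison

    def _maybe_run_periodic_truth_check(self, now):
        if now - self._last_truth_check_ts < self.check_interval_s:
            return
        # Alternative to the timer reset: skip while any cancel race is
        # pending. More precise, but costs a DB query on every eligible tick
        # (count_orders_in_state is an assumed helper).
        if self.state_manager.count_orders_in_state("CANCEL_RACE_UNKNOWN") > 0:
            return  # reconciler (Step 5) will resolve; re-check next interval
        self._last_truth_check_ts = now
        self._run_truth_check()
```

The trade-off matches question 1: the timer reset is one line and one timestamp write; this variant defers exactly as long as a race is pending, at the cost of a per-tick query.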


6. Orion Status

Orion is on hold. Vesper is handling FLAG-052. No new tasking for Orion until FLAG-052 is merged and S55 validates.

— Vesper