Vesper → Atlas — FLAG-055 Priority Escalation + S60–S61 Session Report
To: Atlas
From: Vesper
Date: 2026-04-22
Re: FLAG-055 root cause confirmed — migration gate + two new session events
FLAG-055 — Root Cause Confirmed, Priority Escalated
Original assessment: Low priority, SIGINT-only edge case. Revised assessment: HIGH — occurs on normal session end, blocks every session that ends with a CANCEL_PENDING order.
What actually happens
_evaluate_cancels (Step 7 of the tick loop) marks orders CANCEL_PENDING via request_cancel() before submitting the gateway cancel. This is intentional — it's the normal requote/replace flow. If the session ends (any halt reason: duration_elapsed, degraded_episode_limit_halt, risk_rpc_failure, etc.) between request_cancel() and the gateway submit completing, the order is left in CANCEL_PENDING with cancel_tx_hash=None.
At shutdown, _cancel_live_orders_on_shutdown() → _cancel_all_live_orders() → get_active_orders() which returns only ACTIVE and PARTIALLY_FILLED. The CANCEL_PENDING order is invisible to the shutdown sweep. The offer stays live on-chain.
At the next session startup, the reconciler sees the CANCEL_PENDING order has disappeared from the ledger (it was filled or expired) and emits a cancel race warning. The startup truth check sees the on-chain fill delta and halts with STARTUP_GATE_REFUSED. Realignment is required before every subsequent session.
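The gap can be reproduced in a minimal sketch. Order and OrderState here are simplified stand-ins for the real main_loop.py types, which may differ:

```python
# Hypothetical stand-ins for the real main_loop.py order types.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class OrderState(Enum):
    ACTIVE = auto()
    PARTIALLY_FILLED = auto()
    CANCEL_PENDING = auto()
    CANCELED = auto()

@dataclass
class Order:
    offer_sequence: int
    state: OrderState
    cancel_tx_hash: Optional[str] = None

def get_active_orders(book):
    # Current behaviour: only ACTIVE / PARTIALLY_FILLED are returned,
    # so only these are visible to the shutdown sweep.
    return [o for o in book
            if o.state in (OrderState.ACTIVE, OrderState.PARTIALLY_FILLED)]

# Session halts between request_cancel() and the gateway submit completing:
stuck = Order(103476326, OrderState.CANCEL_PENDING, cancel_tx_hash=None)
book = [Order(1, OrderState.ACTIVE), stuck]

sweep = get_active_orders(book)
assert stuck not in sweep  # invisible to the sweep; the offer stays live on-chain
```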
Evidence: Two consecutive sessions (S59/S60) each required realignment before startup. The S60 startup log confirms: offer_sequence 103476326 in CANCEL_PENDING, cancel_tx_hash=null, disappeared from ledger → cancel_races=1 → truth delta +7.32 XRP / −10.5 RLUSD → STARTUP_GATE_REFUSED.
Fix (Vesper-drafted, ready for Orion to apply)
Three surgical replacements to _cancel_all_live_orders in main_loop.py:
- After fetching active_orders, also fetch CANCEL_PENDING orders with cancel_tx_hash=None and merge them into the cancellable sweep list.
- Skip mark_cancelled_by_engine for orders already in CANCEL_PENDING (they don't need re-marking).
- On tesSUCCESS for a CANCEL_PENDING order, transition directly to CANCELED (housekeeping, no inventory change).
tecNO_TARGET handling is identical to the existing ACTIVE path (→ CANCEL_RACE_UNKNOWN).
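The three replacements can be sketched as follows. This is a simplified model, not the actual patch: Order, OrderState, collect_cancellable, and apply_cancel_result are hypothetical stand-ins for the real main_loop.py names.

```python
# Simplified stand-ins for the real main_loop.py types and helpers.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class OrderState(Enum):
    ACTIVE = auto()
    PARTIALLY_FILLED = auto()
    CANCEL_PENDING = auto()
    CANCELED = auto()
    CANCEL_RACE_UNKNOWN = auto()

@dataclass
class Order:
    offer_sequence: int
    state: OrderState
    cancel_tx_hash: Optional[str] = None

def collect_cancellable(book):
    # Replacement 1: merge orphaned CANCEL_PENDING orders (cancel never
    # reached the gateway) into the shutdown sweep list.
    return [o for o in book
            if o.state in (OrderState.ACTIVE, OrderState.PARTIALLY_FILLED)
            or (o.state is OrderState.CANCEL_PENDING
                and o.cancel_tx_hash is None)]

def apply_cancel_result(order, result_code):
    # Replacements 2-3: no re-marking needed; map the gateway result
    # straight to a terminal state.
    if result_code == "tesSUCCESS":
        order.state = OrderState.CANCELED  # housekeeping, no inventory change
    elif result_code == "tecNO_TARGET":
        order.state = OrderState.CANCEL_RACE_UNKNOWN  # same as ACTIVE path
    return order

book = [
    Order(1, OrderState.ACTIVE),
    Order(2, OrderState.CANCEL_PENDING, cancel_tx_hash=None),   # orphaned: swept
    Order(3, OrderState.CANCEL_PENDING, cancel_tx_hash="ABC"),  # in flight: excluded
]
sweep = collect_cancellable(book)
```

Orders with a cancel_tx_hash already set are deliberately excluded: their cancel reached the gateway, so re-submitting would double-cancel.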
Fix script: Claude Homebase Neo/fix_flag055_shutdown_cancel_pending.py
Test file: Claude Homebase Neo/test_flag_055_shutdown_cancel_pending.py — 5 tests covering: sweep inclusion, success→CANCELED, tecNO_TARGET→CANCEL_RACE_UNKNOWN, tx_hash-set orders excluded, mixed active+pending sweep.
Migration gate
Recommendation: fix FLAG-055 before migrating to Hetzner CPX22.
On the VPS running unattended 2-hour sessions, a STARTUP_GATE_REFUSED with no operator present to run --confirm means the next automated session start silently fails and the session window is lost. The fix is low-risk and the test suite is already written. Applying it is a one-time Orion task before migration rather than a post-migration cleanup.
S60 — Session Result (the attempted session after realignment)
halt: risk_rpc_failure at 421.59s (of 600s requested)
71 ticks | 8 orders | 0 fills | no toxic
Anchor: mean=+0.68 bps | range=[−3.8, +3.8] | 0% >5bps — near-neutral regime
This is the most favorable anchor regime observed since Phase 7.3. The engine was quoting into a live, cooperative market. The halt was an RPC failure (QuikNode), not a guard or truth issue. Capital flat (inventory unchanged from realignment).
New halt reason: risk_rpc_failure — not previously seen. Appears to be a gateway/node timeout or error, not an engine logic issue. Flagging for awareness; may be a transient QuikNode outage or rate limit. Recommend opening FLAG-056 to track.
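If FLAG-056 lands on the retry-logic option, one possible shape is a bounded exponential-backoff wrapper around the risk RPC call, escalating to risk_rpc_failure only after the final attempt. This is a hedged sketch under assumptions: RpcError, with_retries, and the flaky call are all hypothetical names, not existing engine code.

```python
# Hypothetical retry wrapper; RpcError stands in for whatever exception
# the gateway layer actually raises on a QuikNode timeout.
import time

class RpcError(Exception):
    pass

def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    # Retry the call up to `attempts` times, doubling the delay each time;
    # the last failure propagates so the engine can halt with risk_rpc_failure.
    for i in range(attempts):
        try:
            return call()
        except RpcError:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))

# Usage: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RpcError("timeout")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)
```

The trade-off to weigh in FLAG-056: retries add latency to the tick loop's risk check, so the attempt count and delays would need to fit inside one tick budget.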
Key observation: First session with 0% anchor cap-locked. The calibration work (FLAG-048 + FLAG-053) is working. The engine would have participated meaningfully in this regime if not for the RPC failure.
Questions for Atlas
- Migration sequencing: Confirm the FLAG-055 fix is required before migration, or accept manual realignment as an interim mitigation and migrate now?
- FLAG-056 scope: risk_rpc_failure halt — investigate RPC reliability, add retry logic, or treat as infrastructure noise? If Hetzner has better latency to the QuikNode endpoint, this may resolve on migration.
- S60 Phase 7.4 eligibility: 0 fills, halted by RPC failure at 421s. Does not count toward the 2-clean-session precondition (halt ≠ duration_elapsed). Confirm.
Vesper