Vesper → Atlas — FLAG-055 Priority Escalation + S60–S61 Session Report
To: Atlas
From: Vesper
Date: 2026-04-22
Re: FLAG-055 root cause confirmed — migration gate + two new session events
FLAG-055 — Root Cause Confirmed, Priority Escalated
Original assessment: Low priority, SIGINT-only edge case. Revised assessment: HIGH — occurs on normal session end, blocks every session that ends with a CANCEL_PENDING order.
What actually happens
_evaluate_cancels (Step 7 of the tick loop) marks orders CANCEL_PENDING via request_cancel() before submitting the gateway cancel. This is intentional — it's the normal requote/replace flow. If the session ends (any halt reason: duration_elapsed, degraded_episode_limit_halt, risk_rpc_failure, etc.) between request_cancel() and the gateway submit completing, the order is left in CANCEL_PENDING with cancel_tx_hash=None.
At shutdown, _cancel_live_orders_on_shutdown() → _cancel_all_live_orders() → get_active_orders() which returns only ACTIVE and PARTIALLY_FILLED. The CANCEL_PENDING order is invisible to the shutdown sweep. The offer stays live on-chain.
At the next session startup, the reconciler sees the CANCEL_PENDING order has disappeared from the ledger (it was filled or expired) and emits a cancel race warning. The startup truth check sees the on-chain fill delta and halts with STARTUP_GATE_REFUSED. Realignment is required before every subsequent session.
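The gap can be reproduced in a minimal sketch. Order and OrderState here are simplified stand-ins for the real main_loop.py types, which may differ:

```python
# Hypothetical stand-ins for the real main_loop.py order types.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class OrderState(Enum):
    ACTIVE = auto()
    PARTIALLY_FILLED = auto()
    CANCEL_PENDING = auto()
    CANCELED = auto()

@dataclass
class Order:
    offer_sequence: int
    state: OrderState
    cancel_tx_hash: Optional[str] = None

def get_active_orders(book):
    # Current behaviour: only ACTIVE / PARTIALLY_FILLED are returned,
    # so only these are visible to the shutdown sweep.
    return [o for o in book
            if o.state in (OrderState.ACTIVE, OrderState.PARTIALLY_FILLED)]

# Session halts between request_cancel() and the gateway submit completing:
stuck = Order(103476326, OrderState.CANCEL_PENDING, cancel_tx_hash=None)
book = [Order(1, OrderState.ACTIVE), stuck]

sweep = get_active_orders(book)
assert stuck not in sweep  # invisible to the sweep; the offer stays live on-chain
```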
Evidence: Two consecutive sessions (S59/S60) each required realignment before startup. The S60 startup log confirms: offer_sequence 103476326 in CANCEL_PENDING, cancel_tx_hash=null, disappeared from ledger → cancel_races=1 → truth delta +7.32 XRP / −10.5 RLUSD → STARTUP_GATE_REFUSED.
Fix (Vesper-drafted, ready for Orion to apply)
Three surgical replacements to _cancel_all_live_orders in main_loop.py:
- After fetching active_orders, also fetch CANCEL_PENDING orders with cancel_tx_hash=None and merge them into the cancellable sweep list.
- Skip mark_cancelled_by_engine for orders already in CANCEL_PENDING (they don't need re-marking).
- On tesSUCCESS for a CANCEL_PENDING order, transition directly to CANCELED (housekeeping, no inventory change).
tecNO_TARGET handling is identical to the existing ACTIVE path (→ CANCEL_RACE_UNKNOWN).
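The three replacements can be sketched as follows. This is a simplified model, not the actual patch: Order, OrderState, collect_cancellable, and apply_cancel_result are hypothetical stand-ins for the real main_loop.py names.

```python
# Simplified stand-ins for the real main_loop.py types and helpers.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class OrderState(Enum):
    ACTIVE = auto()
    PARTIALLY_FILLED = auto()
    CANCEL_PENDING = auto()
    CANCELED = auto()
    CANCEL_RACE_UNKNOWN = auto()

@dataclass
class Order:
    offer_sequence: int
    state: OrderState
    cancel_tx_hash: Optional[str] = None

def collect_cancellable(book):
    # Replacement 1: merge orphaned CANCEL_PENDING orders (cancel never
    # reached the gateway) into the shutdown sweep list.
    return [o for o in book
            if o.state in (OrderState.ACTIVE, OrderState.PARTIALLY_FILLED)
            or (o.state is OrderState.CANCEL_PENDING
                and o.cancel_tx_hash is None)]

def apply_cancel_result(order, result_code):
    # Replacements 2-3: no re-marking needed; map the gateway result
    # straight to a terminal state.
    if result_code == "tesSUCCESS":
        order.state = OrderState.CANCELED  # housekeeping, no inventory change
    elif result_code == "tecNO_TARGET":
        order.state = OrderState.CANCEL_RACE_UNKNOWN  # same as ACTIVE path
    return order

book = [
    Order(1, OrderState.ACTIVE),
    Order(2, OrderState.CANCEL_PENDING, cancel_tx_hash=None),   # orphaned: swept
    Order(3, OrderState.CANCEL_PENDING, cancel_tx_hash="ABC"),  # in flight: excluded
]
sweep = collect_cancellable(book)
```

Orders with a cancel_tx_hash already set are deliberately excluded: their cancel reached the gateway, so re-submitting would double-cancel.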
Fix script: Claude Homebase Neo/fix_flag055_shutdown_cancel_pending.py
Test file: Claude Homebase Neo/test_flag_055_shutdown_cancel_pending.py — 5 tests covering: sweep inclusion, success→CANCELED, tecNO_TARGET→CANCEL_RACE_UNKNOWN, tx_hash-set orders excluded, mixed active+pending sweep.
Migration gate
Recommendation: fix FLAG-055 before migrating to Hetzner CPX22.
On the VPS running unattended 2-hour sessions, a STARTUP_GATE_REFUSED with no operator present to run --confirm means the next automated session start silently fails and the session window is lost. The fix is low-risk and the test suite is already written. Applying it is a one-time Orion task before migration rather than a post-migration cleanup.
S60 — Session Result (the attempted session after realignment)
halt: risk_rpc_failure at 421.59s (of 600s requested)
71 ticks | 8 orders | 0 fills | no toxic
Anchor: mean=+0.68 bps | range=[−3.8, +3.8] | 0% >5bps — near-neutral regime
This is the most favorable anchor regime observed since Phase 7.3. The engine was quoting into a live, cooperative market. The halt was an RPC failure (QuikNode), not a guard or truth issue. Capital flat (inventory unchanged from realignment).
New halt reason: risk_rpc_failure — not previously seen. Appears to be a gateway/node timeout or error, not an engine logic issue. Flagging for awareness; may be a transient QuikNode outage or rate limit. Recommend opening FLAG-056 to track.
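If FLAG-056 lands on the retry-logic option, one possible shape is a bounded exponential-backoff wrapper around the risk RPC call, escalating to risk_rpc_failure only after the final attempt. This is a hedged sketch under assumptions: RpcError, with_retries, and the flaky call are all hypothetical names, not existing engine code.

```python
# Hypothetical retry wrapper; RpcError stands in for whatever exception
# the gateway layer actually raises on a QuikNode timeout.
import time

class RpcError(Exception):
    pass

def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    # Retry the call up to `attempts` times, doubling the delay each time;
    # the last failure propagates so the engine can halt with risk_rpc_failure.
    for i in range(attempts):
        try:
            return call()
        except RpcError:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))

# Usage: a flaky call that succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RpcError("timeout")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)
```

The trade-off to weigh in FLAG-056: retries add latency to the tick loop's risk check, so the attempt count and delays would need to fit inside one tick budget.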
Key observation: First session with 0% anchor cap-locked. The calibration work (FLAG-048 + FLAG-053) is working. The engine would have participated meaningfully in this regime if not for the RPC failure.
Questions for Atlas
- Migration sequencing: Confirm the FLAG-055 fix is required before migration, or accept manual realignment as an interim mitigation and migrate now?
- FLAG-056 scope: risk_rpc_failure halt — investigate RPC reliability, add retry logic, or treat as infrastructure noise? If Hetzner has better latency to the QuikNode endpoint, this may resolve on migration.
- S60 Phase 7.4 eligibility: 0 fills, halted by RPC failure at 421s. Does not count toward the 2-clean-session precondition (halt ≠ duration_elapsed). Confirm.
Vesper