Atlas Brief — DB Reliability and Server Migration
Atlas —
Two related topics for your input: a recurring infrastructure problem, and Katja's readiness to move the engine to a server once anchor calibration is resolved.
1. Database Corruption — Recurring Pattern
We have experienced SQLite database corruption multiple times across the life of this project. The most recent instance: the live database (neo_live_stage1.db) is currently corrupted — WAL file malformed, unreadable via standard SQLite access and .recover mode alike. This caused data loss: S49 and S50 session records are unrecoverable from any accessible backup.
What we know:
- Corruption has occurred multiple times, not once. This is a pattern, not a one-off event.
- The most recent trigger appears to be a CTRL+C (SIGINT) shutdown — but SIGINT should not corrupt a WAL-mode SQLite database. SQLite's WAL mode is explicitly designed to survive unclean shutdowns. A clean SIGINT with proper handler registration should leave the DB consistent.
- FLAG-007 (fix/wal-checkpoint-hardening) was merged Apr 21 and hardened checkpoint behavior going forward. It has not eliminated the problem.
- The engine currently runs on Katja's local Windows machine. The database file is accessed via SMB mount by Claude's Cowork environment.
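The "proper handler registration" point above can be made concrete. This is a minimal sketch only, assuming the engine holds a single sqlite3 connection; the handler name and setup are hypothetical, not the engine's actual code:

```python
import signal
import sqlite3
import sys

# Hypothetical single-connection setup; the filename matches the live DB
# from this brief, but any writable path works for the sketch.
conn = sqlite3.connect("neo_live_stage1.db")
conn.execute("PRAGMA journal_mode=WAL;")

def shutdown(signum, frame):
    # Checkpoint the WAL back into the main DB file, then close cleanly.
    # TRUNCATE copies all frames into the database and resets the WAL
    # to zero bytes, so nothing is left for crash recovery to replay.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
    conn.close()
    sys.exit(0)

# CTRL+C delivers SIGINT; this works on Windows console processes too,
# where Python otherwise turns CTRL+C into a KeyboardInterrupt.
signal.signal(signal.SIGINT, shutdown)
```

Even without this handler, a WAL-mode database on a local filesystem should survive SIGINT; the handler just makes the shutdown tidy. That it does not survive here is what points at the filesystem layer.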
Likely root cause: SMB.
SQLite's documentation explicitly warns against placing database files on network filesystems (SMB, NFS, CIFS). The reasons: SQLite's locking protocol relies on advisory file locks that these filesystems often implement incorrectly, and WAL mode additionally requires shared-memory coordination (the -shm file), which only works when every reader and writer is on the same host. The result is that checkpoint and recovery operations can leave the WAL in an inconsistent state, particularly under any process interruption.
The engine process itself (Python, running natively on Windows) may be writing correctly. The corruption may be occurring when the SMB-mounted path attempts WAL operations from the Cowork side, or when Windows and the SMB layer disagree on file lock state at shutdown.
Current mitigation gaps:
- One timestamped backup exists (neo_live_stage1.db.bak.20260421T165223Z), taken manually. It does not include S49/S50 data.
- No automated backup rotation. No point-in-time recovery.
- No DB health check at session start (beyond the existing truth reconciliation logic).
Questions for Atlas:
- Do you want a pre-session DB integrity check added to the startup gate (e.g., PRAGMA integrity_check before the engine begins)? This wouldn't prevent corruption but would surface it immediately rather than mid-session.
- Should we add automated pre-session backups (copy the DB before each run) as a short-term mitigation while the engine is still running locally?
- Any architectural guidance on the SMB access pattern — should Cowork be treating the DB as read-only and the engine as the sole writer?
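For scoping purposes, all three questions above can be sketched with Python's stdlib sqlite3. These are sketches under assumptions, with hypothetical function names, not a proposed implementation. One caveat worth flagging now: a raw file copy of a live WAL-mode database is itself unsafe; SQLite's online backup API avoids that.

```python
import sqlite3
import time
from pathlib import Path

DB_PATH = Path("neo_live_stage1.db")  # live DB path from this brief

def integrity_ok(path):
    """Startup-gate check: PRAGMA integrity_check returns a single row
    ('ok',) on a healthy database, or rows describing problems otherwise."""
    conn = sqlite3.connect(path)
    try:
        (result,) = conn.execute("PRAGMA integrity_check;").fetchone()
    finally:
        conn.close()
    return result == "ok"

def backup_db(path):
    """Timestamped pre-session backup via SQLite's online backup API,
    which is safe while a WAL is active, unlike copying the .db file."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    dest_path = path.parent / f"{path.name}.bak.{stamp}"
    src = sqlite3.connect(path)
    dest = sqlite3.connect(dest_path)
    try:
        with dest:
            src.backup(dest)
    finally:
        src.close()
        dest.close()
    return dest_path

def open_readonly(path):
    """One answer to the sole-writer question: the Cowork side opens the
    DB read-only via a URI, so it can never take write locks."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

A startup gate could run integrity_ok, abort loudly on failure, otherwise take a backup_db snapshot before the engine begins; open_readonly would be the Cowork-side access pattern if we adopt the sole-writer model. None of this fixes the SMB problem; it only shrinks the blast radius until the migration in item 2.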
2. Server Migration — Katja's Position
Katja is ready to move the engine to a VPS once anchor calibration (FLAG-048) is resolved and the engine is in a stable state. Her reasoning:
- A server provides a local filesystem, eliminating the SMB root cause entirely.
- Better options for backup automation (cron, rsync, snapshot), monitoring, and uptime.
- The current local-machine setup has served well for development but is not suitable for reliable ongoing operation.
FLAG-023 (VPS migration, currently marked low urgency / future) should be re-evaluated as a near-term post-FLAG-048 priority rather than a post-Phase-8 item.
Questions for Atlas:
- Do you have a preferred VPS provider or OS baseline for the server environment (e.g., Ubuntu LTS on DigitalOcean, Linode, Hetzner)?
- Any architectural changes you'd want in place before migration (e.g., separating the DB from the engine process, adding a lightweight API layer)?
- Should we plan the migration as a standalone branch/task after FLAG-048, or hold it until Phase 7.4 clean sessions are done?
Awaiting your guidance on both items before we scope any work.
— Vesper (COO)
BlueFly AI Enterprises
2026-04-22