Monad Snapshots: Operator Runbook for Generation, Hosting, and Verification
Builder-first notes and practical takeaways.
When invoking the Monad CLI snapshot command, always specify a node-owned path with strict umask enforcement to ensure no other users or processes can access or tamper with snapshot files. This is especially critical when running on shared infrastructure or when using automation that may inadvertently relax permissions.
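The permissions discipline above can be sketched in Python. This is a minimal illustration, not part of the Monad CLI itself; the `0o077` umask and `0o700` mode are the standard owner-only settings, and the path is whatever your deployment uses.

```python
import os
import stat

def make_snapshot_dir(path: str) -> str:
    """Create a snapshot directory readable and writable only by the node user."""
    old_umask = os.umask(0o077)  # strip group/other bits from anything created here
    try:
        os.makedirs(path, mode=0o700, exist_ok=True)
    finally:
        os.umask(old_umask)      # restore the caller's umask
    # Belt and braces: force owner-only even if the directory pre-existed
    # with looser permissions (e.g. automation that relaxed the umask).
    os.chmod(path, stat.S_IRWXU)
    return path
```

Running the snapshot command from a wrapper that sets the umask first means even files the CLI creates with default modes inherit owner-only permissions.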
Detached shell execution using nohup or screen is not optional for mainnet-scale snapshots: SSH interruptions or terminal disconnects will otherwise terminate the process, risking incomplete output and wasted compute. For unattended jobs, monitor the detached session's exit status and redirect logs to persistent storage for post-mortem analysis.
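If your orchestration is in Python rather than a shell wrapper, the same detachment can be achieved with `start_new_session=True` (the `setsid` equivalent), which severs the child from the controlling TTY. A sketch, with the command and log path as placeholders:

```python
import subprocess

def launch_detached(cmd: list, log_path: str) -> int:
    """Launch `cmd` in its own session so it survives SSH/terminal hangups,
    redirecting stdout/stderr to a persistent log for post-mortem analysis.
    `cmd` and `log_path` are illustrative; substitute your snapshot command."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=log,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # detach from the controlling terminal, like setsid
        )
    return proc.pid
```

The returned PID can then be polled by your monitoring hook to capture the exit status.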
Snapshot filenames should always encode block height, UTC timestamp, and release channel, using a delimiter and format that is both human-readable and trivial to parse with shell or Python scripts. This approach supports deterministic automation for snapshot discovery, validation, and cross-environment promotion, and eliminates ambiguity during incident response or rollback.
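One possible encoding that satisfies these properties is sketched below. The `monad-<channel>-<height>-<timestamp>.tar.zst` scheme is illustrative, not an official Monad convention; adapt the pattern to whatever your fleet standardizes on.

```python
import re
from datetime import datetime, timezone

# Illustrative naming scheme: monad-<channel>-<blockheight>-<UTC timestamp>.tar.zst
NAME_RE = re.compile(
    r"^monad-(?P<channel>mainnet|testnet)-(?P<height>\d+)-"
    r"(?P<ts>\d{8}T\d{6}Z)\.tar\.zst$"
)

def snapshot_name(channel: str, height: int, when: datetime) -> str:
    """Encode channel, block height, and UTC timestamp into one filename."""
    ts = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"monad-{channel}-{height}-{ts}.tar.zst"

def parse_snapshot_name(name: str) -> dict:
    """Recover the fields; raises on anything that doesn't match the scheme."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized snapshot name: {name}")
    return {"channel": m["channel"], "height": int(m["height"]), "ts": m["ts"]}
```

Because block height sorts numerically and the timestamp sorts lexicographically, discovery scripts can pick "latest" with a plain sort and no date parsing.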
Monad’s snapshot file format is a compressed tarball with atomicity guarantees: the CLI writes to a temp file and only moves it into place after a successful fsync, as documented in the official spec. Never expose the temp file location to automation or monitoring tools that might pick up incomplete archives; always monitor for atomic rename events as the signal of completion.
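The write-temp-then-rename pattern the CLI uses can be reproduced in your own tooling for derived artifacts (manifests, checksum files). A minimal sketch, assuming a `.tmp` suffix for the temp file; the actual suffix the Monad CLI uses may differ:

```python
import os

def write_snapshot_atomically(final_path: str, data: bytes) -> None:
    """Write to a temp file, fsync, then atomically rename into place.
    Readers that only watch for `final_path` never observe a partial file."""
    tmp_path = final_path + ".tmp"   # illustrative temp suffix
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)                 # data durable before the rename
    finally:
        os.close(fd)
    os.replace(tmp_path, final_path) # atomic rename on POSIX filesystems
    # fsync the directory so the rename itself survives a crash
    dfd = os.open(os.path.dirname(final_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

This is also why monitoring should key off the final path appearing, not off file size stabilizing: the rename is the only reliable completion signal.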
Immediately after any failed or interrupted snapshot attempt, scan the snapshot directory for temp files or archives lacking the expected naming pattern. Quarantine or delete these artifacts to prevent accidental restore attempts or wasted storage. Automate this cleanup as a post-failure hook in your snapshot orchestration logic.
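A post-failure cleanup hook can be as simple as the sketch below: anything in the snapshot directory that does not match the expected naming pattern gets moved aside rather than deleted, so artifacts remain available for debugging. The pattern matches the illustrative naming scheme from earlier and should be adjusted to yours.

```python
import os
import re

# Matches the illustrative monad-<channel>-<height>-<timestamp>.tar.zst scheme.
VALID_NAME = re.compile(r"^monad-\w+-\d+-\d{8}T\d{6}Z\.tar\.zst$")

def quarantine_stale_artifacts(snapshot_dir: str, quarantine_dir: str) -> list:
    """Move temp files and truncated archives out of the restore path.
    Returns the sorted list of quarantined filenames for logging/alerting."""
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    for name in os.listdir(snapshot_dir):
        src = os.path.join(snapshot_dir, name)
        if os.path.isfile(src) and not VALID_NAME.match(name):
            os.replace(src, os.path.join(quarantine_dir, name))
            moved.append(name)
    return sorted(moved)
```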
Benchmark your hardware against the latest Monad mainnet/testnet snapshot size and throughput figures. SSDs should be provisioned with IOPS 2–3x above the minimums, and disk space should always be at least double the largest expected snapshot to account for temp files and concurrent operations. Monitor disk metrics in real time to preempt failures.
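The "at least double" headroom rule is easy to enforce as a pre-flight check before each snapshot run. A sketch; the 2x factor mirrors the guidance above and can be tuned:

```python
import shutil

def has_snapshot_headroom(path: str, expected_snapshot_bytes: int,
                          factor: float = 2.0) -> bool:
    """Return True if free space on `path` is at least `factor` times the
    largest expected snapshot, since the temp file and the final archive
    can coexist during the atomic-rename window."""
    free = shutil.disk_usage(path).free
    return free >= factor * expected_snapshot_bytes
```

Wiring this into the orchestration entry point turns a mid-run disk-full failure into a clean refusal to start.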
Validate upload and restore network throughput against mainnet benchmarks: 500 Mbps is a practical minimum for uploads, and restore speed is gated by both network and disk. Under-provisioned links cause cascading delays in disaster recovery and routine snapshot rotation.
For secure hosting, enforce encryption at rest using storage-native mechanisms such as SSE or CMEK (AWS S3 encryption guide). Public read access is acceptable for distribution, but write and delete permissions must be tightly scoped to node automation or trusted operators to minimize blast radius.
Multipart upload is required for large Monad snapshots; configure CLI options for expected size and parallelism, and script automatic resume/retry for failed parts. Integrate sha256sum verification post-upload, and wire this check into CI or scheduled jobs, triggering alerts on mismatch or incomplete uploads.
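The post-upload verification half of this can be sketched with `hashlib`; streaming in chunks keeps memory flat for multi-GB archives. The function names are illustrative, and the remote digest would come from your upload tooling or a published checksum file:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB snapshots never
    need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_upload(local_path: str, remote_digest: str) -> None:
    """Compare the local archive's digest against the digest reported for
    the uploaded object; raise so CI/cron jobs fail loudly on mismatch."""
    local = sha256_file(local_path)
    if local != remote_digest:
        raise RuntimeError(
            f"sha256 mismatch: local {local} != remote {remote_digest}"
        )
```

Raising (rather than logging) on mismatch is deliberate: a non-zero exit code is what lets cron and CI surface the failure as an alert.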
Snapshot versioning should always reflect block height and release channel, and at least one previous version should be retained for safe rollback during node upgrades. Use cron or cloud-native lifecycle policies (S3 lifecycle docs) to prune expired snapshots, tuned to your operational cadence and storage constraints.
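For fleets pruning via cron rather than cloud lifecycle policies, the retention rule ("keep the newest N per channel for rollback") can be sketched as below. It assumes the illustrative `monad-<channel>-<height>-<timestamp>` naming scheme described earlier:

```python
import re
from collections import defaultdict

def snapshots_to_prune(names: list, keep_per_channel: int = 2) -> list:
    """Return snapshot filenames safe to delete, keeping the
    `keep_per_channel` highest block heights per release channel
    so at least one previous version remains for rollback."""
    pat = re.compile(r"^monad-(\w+)-(\d+)-")  # illustrative naming scheme
    by_channel = defaultdict(list)
    for name in names:
        m = pat.match(name)
        if m:
            by_channel[m.group(1)].append((int(m.group(2)), name))
    prune = []
    for entries in by_channel.values():
        entries.sort(reverse=True)  # highest block height first
        prune.extend(name for _, name in entries[keep_per_channel:])
    return sorted(prune)
```

Sorting on the parsed block height (not the raw string) avoids the classic bug where `height=99` sorts after `height=100` lexicographically.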
After uploading a new snapshot, immediately invalidate CDN caches to prevent propagation of stale or partial files. Automate this as part of your upload pipeline to guarantee downstream operators always fetch the latest, verified snapshot.
Monitor the full snapshot lifecycle: track disk I/O, free space, open file descriptors, file/directory ownership, and umask in real time. Integrate lifecycle logs and metrics with anomaly detection for snapshot size, upload/restore duration, and failure rates, surfacing regressions or operational drift before they impact node health.
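One concrete anomaly-detection primitive from the list above is a size check against recent history, since truncated or bloated archives are a common failure signature. A sketch with an illustrative 25% tolerance; tune the threshold to your observed variance:

```python
from statistics import median

def size_anomaly(history: list, latest: int, tolerance: float = 0.25) -> bool:
    """Flag a new snapshot whose size deviates from the median of recent
    snapshot sizes by more than `tolerance` (fraction of the baseline).
    `history` is a list of recent sizes in bytes; threshold is illustrative."""
    if not history:
        return False  # no baseline yet; nothing to compare against
    baseline = median(history)
    return abs(latest - baseline) > tolerance * baseline
```

The same shape of check applies to upload and restore durations, which drift for different reasons (network saturation, disk degradation) and are worth tracking separately.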
During restore, enforce strict hygiene: start with a clean data directory, validate checksums before unpacking, and run post-restore RPC/node health checks. Automate restore validation to confirm node liveness and state correctness before considering the restore complete, as specified in the Monad restore guide.
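The two gates described above (checksum before unpacking, health check after) can be composed into one restore-validation step. This is a sketch, not the Monad restore guide's own tooling; `health_check` stands in for whatever RPC liveness probe your node exposes, since endpoint details vary by deployment:

```python
import hashlib

def validate_restore(archive_path: str, expected_sha256: str,
                     health_check) -> None:
    """Gate the restore on two checks: archive integrity before unpacking,
    and node health afterwards. `health_check` is any callable returning
    True when the node's RPC reports liveness (a hypothetical probe)."""
    h = hashlib.sha256()
    with open(archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise RuntimeError("archive checksum mismatch; aborting restore")
    if not health_check():
        raise RuntimeError("node failed post-restore health check")
```

Only after both checks pass should automation mark the restore complete and re-enroll the node in service.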