Running Sui Infra: What to Monitor So You Don't Find Out on Twitter
Builder-first notes and practical takeaways.
Operators running Sui full nodes or RPC endpoints know the real outage starts long before the first Twitter thread. The earliest warning often comes from spikes in sui_node_rpc_error_count, a Sui-native metric that flags RPC layer trouble before endpoints go dark. Alerting on these spikes, especially when correlated with method-specific tags, is the first line of defense against user-facing failures.
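As a minimal sketch of that first line of defense, the snippet below flags per-method spikes from a cumulative error counter such as sui_node_rpc_error_count. The method names, sample format, and 1 error/sec threshold are illustrative assumptions, not official values.

```python
# Hypothetical sketch: flag spikes in a cumulative RPC error counter
# (e.g. sui_node_rpc_error_count), broken down by method tag.
# Metric names, sample data, and the threshold are illustrative.

def error_rate(samples):
    """Per-second error rate from (timestamp, cumulative_count) samples."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return max(c1 - c0, 0) / (t1 - t0)

def spiking_methods(per_method_samples, threshold=1.0):
    """Return method tags whose error rate exceeds the threshold (errors/sec)."""
    return sorted(
        method for method, samples in per_method_samples.items()
        if error_rate(samples) > threshold
    )

samples = {
    "sui_getObject":          [(0, 100), (60, 110)],  # ~0.17 err/s: fine
    "sui_executeTransaction": [(0, 50), (60, 500)],   # 7.5 err/s: spiking
}
print(spiking_methods(samples))  # ['sui_executeTransaction']
```

Keying the alert on rate rather than raw count avoids paging on a long-lived counter's absolute value.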
Latency is just as revealing. Monitoring sui_node_rpc_latency_seconds surfaces Sui-specific slowdowns that generic uptime checks miss. Early latency creep, even without outright errors, can indicate backend pressure or network issues, giving teams a chance to intervene before the backlog hits users.
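One way to catch that creep before a hard threshold trips is to compare a recent latency median against a baseline median. The window sizes and the 1.5x multiplier below are illustrative assumptions.

```python
# Hypothetical sketch: detect gradual creep in sui_node_rpc_latency_seconds
# before it trips any hard error threshold. Windows and the 1.5x factor
# are illustrative tuning knobs, not official values.

from statistics import median

def latency_creep(latencies, baseline_window=10, recent_window=5, factor=1.5):
    """True if the recent median latency exceeds factor * baseline median."""
    if len(latencies) < baseline_window + recent_window:
        return False  # not enough history to judge
    baseline = median(latencies[:baseline_window])
    recent = median(latencies[-recent_window:])
    return recent > factor * baseline

steady = [0.05] * 15
creeping = [0.05] * 10 + [0.06, 0.08, 0.09, 0.11, 0.12]
print(latency_creep(steady))    # False
print(latency_creep(creeping))  # True
```

Medians rather than means keep a single slow outlier request from masquerading as a trend.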
State growth is rarely linear on Sui, especially after upgrades or during NFT-heavy periods. Tracking sui_node_db_size_bytes is critical for detecting rapid, unexpected database expansion that can overwhelm storage and lead to cascading failures. Post-upgrade surges in this metric demand immediate attention.
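A growth-rate alert on that metric might look like the sketch below. The 2 GiB/hour limit is an illustrative placeholder; real limits depend on your disk headroom and snapshot cadence.

```python
# Hypothetical sketch: flag abnormal growth in sui_node_db_size_bytes,
# e.g. a post-upgrade surge. The 2 GiB/hour limit is illustrative.

def growth_rate_bytes_per_hour(samples):
    """Growth rate from (unix_ts, size_bytes) samples."""
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600
    return (s1 - s0) / hours if hours > 0 else 0.0

def db_growth_alert(samples, limit=2 * 1024**3):
    return growth_rate_bytes_per_hour(samples) > limit

GiB = 1024**3
normal = [(0, 100 * GiB), (3600, 100.5 * GiB)]  # +0.5 GiB/h
surge  = [(0, 100 * GiB), (3600, 105 * GiB)]    # +5 GiB/h
print(db_growth_alert(normal), db_growth_alert(surge))  # False True
```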
Pruning is supposed to keep database size in check, but lag can creep in. Correlating sui_node_pruning_events_total with sui_node_db_size_bytes helps detect when pruning falls behind, signaling storage risk and potential for state bloat. This is especially important after protocol upgrades or during periods of high NFT minting.
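The correlation itself can be as simple as the heuristic below: the DB grew past a threshold while the pruning counter stalled. The 1 GiB growth cutoff and the "zero pruning runs" rule are illustrative assumptions you would tune per deployment.

```python
# Hypothetical sketch: detect pruning falling behind by correlating
# sui_node_pruning_events_total with sui_node_db_size_bytes over a window.
# "DB grew but no pruning events fired" is an illustrative heuristic.

def pruning_lagging(db_sizes, pruning_totals, min_growth_bytes=1 * 1024**3):
    """True if the DB grew past the threshold while pruning stalled."""
    db_growth = db_sizes[-1] - db_sizes[0]
    pruning_runs = pruning_totals[-1] - pruning_totals[0]
    return db_growth > min_growth_bytes and pruning_runs == 0

GiB = 1024**3
print(pruning_lagging([100 * GiB, 103 * GiB], [40, 45]))  # False: pruning active
print(pruning_lagging([100 * GiB, 103 * GiB], [40, 40]))  # True: pruning stalled
```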
Async task overload is another Sui-specific risk. Alerting on sui_node_backpressure_events_total catches situations where transaction queues build up and async tasks start dropping work. This metric often jumps before outright RPC or consensus failures, making it a key pre-failure indicator.
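Because a healthy node should rarely drop async work, one defensible policy is to alert on essentially any sustained increase in the counter, as in this sketch. The zero-events-per-minute default is an illustrative assumption.

```python
# Hypothetical sketch: treat any sustained increase in
# sui_node_backpressure_events_total as a pre-failure signal.
# The default cap of 0 events/min is an illustrative policy choice.

def backpressure_alert(counter_samples, max_events_per_min=0):
    """(timestamp, cumulative_count) samples; alert if events/min exceeds cap."""
    (t0, c0), (t1, c1) = counter_samples[0], counter_samples[-1]
    minutes = (t1 - t0) / 60
    rate = (c1 - c0) / minutes if minutes > 0 else 0.0
    return rate > max_events_per_min

print(backpressure_alert([(0, 12), (300, 12)]))  # False: no new events
print(backpressure_alert([(0, 12), (300, 30)]))  # True: queues building
```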
Peer churn is a leading cause of network instability during upgrades or partitions. Watching both sui_network_active_peers and sui_network_disconnected_peers provides early signals of network health, especially when protocol changes roll out. Sudden drops or spikes in these metrics can precede broader consensus or RPC issues.
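Combining the two signals might look like the sketch below, where either a fractional drop in the active-peer gauge or a spike in the disconnect counter trips the alert. The 25% drop fraction and disconnect-spike cutoff are illustrative.

```python
# Hypothetical sketch: flag peer churn by combining sui_network_active_peers
# (a gauge) with sui_network_disconnected_peers (a counter). A sudden drop
# in active peers plus a disconnect spike is the churn signature;
# the cutoffs below are illustrative.

def peer_churn(active_peers, disconnects, drop_frac=0.25, disconnect_spike=5):
    """active_peers: gauge samples over a window; disconnects: counter samples."""
    drop = (active_peers[0] - active_peers[-1]) / max(active_peers[0], 1)
    new_disconnects = disconnects[-1] - disconnects[0]
    return drop >= drop_frac or new_disconnects >= disconnect_spike

print(peer_churn([40, 39, 40], [100, 101]))  # False: steady
print(peer_churn([40, 33, 28], [100, 112]))  # True: churn during upgrade
```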
Automated log pattern detection is essential for catching DB migration failures and peer drop events during version upgrades. Relying on manual review is too slow; real-time log scanning for Sui-specific schema change or migration errors is now table stakes for operational readiness.
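A minimal real-time scanner can be a handful of compiled regexes applied per log line, as sketched here. The patterns and sample log lines are illustrative; tune them to the exact messages your Sui version emits.

```python
# Hypothetical sketch: scan node logs in real time for migration/schema
# failures and peer drops during upgrades. The regexes and sample log
# lines are illustrative, not actual Sui log formats.

import re

PATTERNS = {
    "db_migration_failure": re.compile(r"migration.*(fail|error)", re.I),
    "schema_change_error":  re.compile(r"schema.*(mismatch|error)", re.I),
    "peer_drop":            re.compile(r"peer .* disconnected", re.I),
}

def scan_line(line):
    """Return the names of all alert patterns matching one log line."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]

log = [
    "INFO checkpoint executed",
    "ERROR db migration failed: column mismatch",
    "WARN peer 12D3Koo disconnected: timeout",
]
for line in log:
    hits = scan_line(line)
    if hits:
        print(hits, "<-", line)
```

In production this would sit behind a tail of the journal or log file, feeding matches straight into the alert pipeline rather than a dashboard someone has to watch.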
For indexers and data providers, gRPC stream health is non-negotiable. Monitoring for dropped or stalled streams—especially using Sui’s real-time streaming metrics—can surface issues that traditional polling misses, as outlined in Sui’s gRPC streaming docs.
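One simple pattern is a watchdog that marks a subscription as stalled when no message has arrived within a deadline, independent of whether the connection still looks open. The stream names, 30-second deadline, and injectable clock below are illustrative assumptions.

```python
# Hypothetical sketch: a watchdog that marks a gRPC subscription as stalled
# when no message arrives within a deadline, instead of polling the endpoint.
# Stream names and the 30-second deadline are illustrative.

import time

class StreamWatchdog:
    def __init__(self, stall_after_s=30.0, clock=time.monotonic):
        self.stall_after_s = stall_after_s
        self.clock = clock          # injectable for deterministic tests
        self.last_seen = {}

    def on_message(self, stream_name):
        """Call from each stream's message handler."""
        self.last_seen[stream_name] = self.clock()

    def stalled_streams(self):
        now = self.clock()
        return sorted(
            name for name, ts in self.last_seen.items()
            if now - ts > self.stall_after_s
        )

# Simulated clock so the example is deterministic.
fake_now = [0.0]
wd = StreamWatchdog(stall_after_s=30, clock=lambda: fake_now[0])
wd.on_message("checkpoints")
wd.on_message("transactions")
fake_now[0] = 45.0
wd.on_message("transactions")  # still flowing
print(wd.stalled_streams())    # ['checkpoints']
```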
Not all sync lag is created equal. Differentiating between Beluga synchronizer block sync lag and Mysticeti’s consensus_round_latency_seconds is critical for pinpointing whether a stall is due to block propagation or consensus progression. Alerting on consensus_round_latency_seconds spikes surfaces Mysticeti-specific round issues that generic block lag metrics can’t catch.
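The distinction can be encoded as a small classifier over the two readings, as in this sketch. Both thresholds are illustrative, and the precedence rule (check consensus first, since a consensus stall also backs up sync) is an assumption about how the two symptoms interact.

```python
# Hypothetical sketch: distinguish a block-propagation stall (synchronizer
# sync lag) from a Mysticeti consensus stall (consensus_round_latency_seconds).
# Metric readings, thresholds, and the precedence rule are illustrative.

def classify_stall(block_sync_lag_s, consensus_round_latency_s,
                   sync_lag_limit=10.0, round_latency_limit=2.0):
    if consensus_round_latency_s > round_latency_limit:
        return "consensus"    # rounds not progressing: consensus-side issue
    if block_sync_lag_s > sync_lag_limit:
        return "propagation"  # rounds fine, blocks late: synchronizer-side
    return "healthy"

print(classify_stall(block_sync_lag_s=45.0, consensus_round_latency_s=0.4))
# propagation
print(classify_stall(block_sync_lag_s=3.0, consensus_round_latency_s=6.0))
# consensus
```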
Alert thresholds should never be static. Tuning them to hardware, region, and peer topology is necessary to avoid false positives and missed incidents. Region-specific latency anomalies and network partitions must be handled in alerting logic, not ignored as noise.
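One common way to make thresholds non-static is to derive each region's cutoff from its own history, e.g. rolling mean plus k standard deviations. The k=3 choice and the sample latency histories are illustrative.

```python
# Hypothetical sketch: per-region adaptive thresholds instead of one static
# cutoff, using mean + k standard deviations over that region's own
# latency history. k=3 and the sample data are illustrative.

from statistics import mean, stdev

def adaptive_threshold(history, k=3.0):
    """Alert threshold tuned to one region/host's own metric history."""
    return mean(history) + k * stdev(history)

def is_anomalous(value, history, k=3.0):
    return value > adaptive_threshold(history, k)

eu_history = [0.050, 0.052, 0.049, 0.051, 0.050]  # tight, low-latency region
ap_history = [0.180, 0.250, 0.160, 0.300, 0.210]  # noisier, higher baseline

print(is_anomalous(0.120, eu_history))  # True: 0.12s is wild for this region
print(is_anomalous(0.120, ap_history))  # False: the same value is normal here
```

The same reading pages in one region and is noise in another, which is exactly the property a single static threshold cannot express.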
Polling-based monitoring misses too many async and streaming failure modes in Sui. Real-time, event-driven monitoring—especially for log and metric signals—captures issues as they emerge, not minutes later. The official Sui monitoring docs and metrics.rs are the baseline for covering operational edge cases, but every production deployment should extend coverage to Sui-specific DB migration and schema change log events.