Monitoring & Alerting

Setting up comprehensive monitoring with Discord, email, and dashboard alerts.

Effective monitoring is the backbone of a reliable Flux node hosting operation. Without proper monitoring and alerting, you're flying blind: node failures, benchmark degradations, and reward losses can go unnoticed for hours or days. This guide covers building a comprehensive monitoring stack.

Essential Metrics to Monitor

Every FluxNode generates metrics that indicate its health, performance, and reward status. Here are the critical ones:

| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Node status | CONFIRMED = earning rewards | Any status change |
| Benchmark status | Must pass to stay active | Any benchmark failure |
| Uptime % | Affects PNR eligibility | Below 97% |
| EPS score | CPU performance benchmark | Below minimum for tier |
| Disk usage | Full disks crash nodes | Above 85% |
| RAM usage | High RAM usage causes swapping | Above 90% |
| Node rank | Lower rank = more frequent rewards | Sudden jumps |
| FluxOS version | Outdated versions may fail | 1 or more versions behind latest |
| Last reward | Confirms node is earning | No reward for 2x expected interval |
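
Two of the thresholds above, disk and RAM usage, can be checked locally on a node with standard Linux tools. The following is a minimal sketch, not part of FluxOS: the function names are hypothetical, and it assumes a Linux host with `df` and a `/proc/meminfo` that reports `MemAvailable`.

```shell
#!/bin/sh
# Sketch: compare local disk and RAM usage with the alert thresholds above.
# Function names are illustrative; assumes Linux with df and /proc/meminfo.

DISK_LIMIT=85   # alert when disk usage is above 85%
RAM_LIMIT=90    # alert when RAM usage is above 90%

# Print the used percentage of the filesystem holding the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print the used RAM percentage, computed from MemTotal and MemAvailable.
ram_used_pct() {
  awk '/^MemTotal:/ { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' /proc/meminfo
}

# Succeed (exit 0) when the value is non-empty and strictly above the limit.
above_limit() {
  [ -n "$1" ] && [ "$1" -gt "$2" ]
}

disk=$(disk_used_pct /)
ram=$(ram_used_pct)

if above_limit "$disk" "$DISK_LIMIT"; then
  echo "ALERT: disk usage ${disk}% is above ${DISK_LIMIT}%"
fi
if above_limit "$ram" "$RAM_LIMIT"; then
  echo "ALERT: RAM usage ${ram}% is above ${RAM_LIMIT}%"
fi
```

The script prints nothing when both values are within limits, which makes it easy to pipe its output into an alerting step that fires only on non-empty output.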

Built-in FluxOS Monitoring

Every FluxNode has a built-in web UI accessible at http://YOUR_IP:16126. This dashboard provides real-time information about your node including:

  • Node Status: current blockchain confirmation status
  • Benchmark Results: latest EPS, RAM, SSD, and network scores
  • Connected Peers: how many other nodes you're connected to
  • Running Apps: Docker containers deployed on the node
  • Resource Usage: current CPU, RAM, and disk utilization
  • Flux Daemon Info: blockchain sync status and chain height

FluxOS also exposes an API on port 16127, so you can query the same information from scripts:

Query node status via FluxOS API

# Get node status
curl http://YOUR_IP:16127/flux/info

# Get benchmark status
curl http://YOUR_IP:16127/benchmark/getbenchmarks

# Get FluxOS version
curl http://YOUR_IP:16127/flux/version

# Get running apps
curl http://YOUR_IP:16127/apps/installedapps
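
To check all four endpoints in one pass, the calls above can be wrapped in a loop that reports the HTTP status of each. This is a sketch under assumptions: the function names are hypothetical, `YOUR_IP` is a placeholder, and it checks only that each endpoint answers, not what the response body contains.

```shell
#!/bin/sh
# Sketch: report the HTTP status code of each FluxOS API endpoint listed above.
# Function names are illustrative; NODE_IP is a placeholder to replace.

NODE_IP="${NODE_IP:-YOUR_IP}"
API_PORT=16127
ENDPOINTS="/flux/info /benchmark/getbenchmarks /flux/version /apps/installedapps"

# Build the full URL for an endpoint path on a node.
endpoint_url() {
  printf 'http://%s:%s%s\n' "$1" "$API_PORT" "$2"
}

# Print the HTTP status code for a URL; curl prints 000 when it cannot connect.
http_code() {
  curl -s -o /dev/null -w '%{http_code}' --connect-timeout 10 --max-time 20 "$1"
}

# Print one "code url" line per endpoint for the given node IP.
probe_node() {
  for endpoint in $ENDPOINTS; do
    url=$(endpoint_url "$1" "$endpoint")
    printf '%s %s\n' "$(http_code "$url")" "$url"
  done
}

# Usage: probe_node "$NODE_IP"
```

Any line that does not start with 200 points at an endpoint worth investigating.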

External Monitoring Tools

FluxNodes.net

FluxNodes.net is a community-operated network explorer that provides a comprehensive view of all Flux nodes. You can look up any node by IP address or Zel ID to check its status, benchmarks, rank, and reward history. It's invaluable for verifying node health from an external perspective.

UptimeRobot / HetrixTools

Third-party uptime monitoring services can ping your node's FluxOS API endpoint and alert you when it becomes unreachable. This catches network-level issues that the node itself can't report.

  1. Create an account
     Sign up for a free plan on UptimeRobot (50 monitors free) or HetrixTools (15 monitors free).

  2. Add HTTP monitors
     Monitor http://YOUR_IP:16127/flux/info. This endpoint returns node info when FluxOS is running.

  3. Set check interval
     5-minute intervals are sufficient for most providers. Premium plans offer 1-minute intervals.

  4. Configure alerts
     Set up email, SMS, or webhook notifications for downtime events.

Discord Webhook Alerts

Discord webhooks are a popular and free way to get real-time alerts in a team channel. Here's how to set them up:

  1. Create a Discord channel
     Create a dedicated #node-alerts channel in your Discord server.

  2. Create a webhook
     Channel Settings → Integrations → Webhooks → New Webhook. Copy the webhook URL.

  3. Write a monitoring script
     Create a bash script that checks node status via the FluxOS API and sends alerts to the webhook when issues are detected.

  4. Schedule with cron
     Run the script every 5 minutes via a cron job on a separate monitoring server (not the node itself).

Simple Discord alert script (check_node.sh)

#!/bin/bash
# Alert a Discord channel when the FluxOS API stops responding.
NODE_IP="YOUR_IP"
WEBHOOK_URL="YOUR_DISCORD_WEBHOOK_URL"

# Check if FluxOS API responds (curl reports 000 if the connection fails)
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  --connect-timeout 10 --max-time 20 \
  "http://$NODE_IP:16127/flux/info")

if [ "$STATUS" != "200" ]; then
  # Inner quotes are escaped so the payload is valid JSON
  curl -s -H "Content-Type: application/json" \
    -d "{\"content\": \"⚠️ **Node Alert**: $NODE_IP is unreachable (HTTP $STATUS)\"}" \
    "$WEBHOOK_URL"
fi

Cron job (runs every 5 minutes)

# Add to crontab: crontab -e
# The script must be executable, and the cron user must be able to write to the log file.
*/5 * * * * /home/user/check_node.sh >> /var/log/node-monitor.log 2>&1
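
Run every 5 minutes, check_node.sh posts a new message on every check for as long as the node stays down. One way to reduce the noise is to remember the previous result and alert only when it changes. The sketch below assumes a hypothetical state file path and hypothetical function names; the status code would come from the same `curl` check used in check_node.sh.

```shell
#!/bin/sh
# Sketch: alert only when the node's state changes between checks.
# STATE_FILE and the function names are illustrative assumptions.

STATE_FILE="${STATE_FILE:-/tmp/node_state}"

# Map an HTTP status code to "up" or "down".
state_from_code() {
  if [ "$1" = "200" ]; then echo "up"; else echo "down"; fi
}

# Record the new state and print a message only if it differs from the
# previous one. With no previous state, the node is assumed to have been up.
report_change() {
  new_state=$1
  old_state=$(cat "$STATE_FILE" 2>/dev/null || echo "up")
  echo "$new_state" > "$STATE_FILE"
  if [ "$new_state" != "$old_state" ]; then
    echo "Node went from $old_state to $new_state"
  fi
}

# Usage inside check_node.sh, after STATUS is set:
#   MESSAGE=$(report_change "$(state_from_code "$STATUS")")
#   If MESSAGE is non-empty, send it as the webhook content.
```

With this approach a node that stays down produces one alert when it fails and one when it recovers, instead of one every 5 minutes.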

Fleet-Wide Monitoring Dashboard

When managing 10+ nodes, individual monitoring becomes impractical. You need a centralized dashboard that shows the health of your entire fleet at a glance.

  • Fluxme.io Dashboard: the built-in monitoring on this platform shows fleet-wide status, alerts, and performance metrics
  • Custom Grafana setup: for advanced providers. Query the FluxOS API on each node, collect the metrics with Prometheus, and visualize them with Grafana.
  • Spreadsheet tracking: for smaller fleets, maintain a simple spreadsheet with node IPs, status, last benchmark, last reward, and expiry date
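
For a fleet, the same reachability check can be run over a list of nodes and condensed into a short summary. This is a minimal sketch, assuming a hypothetical nodes.txt file with one IP address per line; the function names are illustrative, and it is not a description of how the Fluxme.io dashboard works internally.

```shell
#!/bin/sh
# Sketch: check every node in a list and print a fleet summary.
# The node list file and function names are illustrative assumptions.

# Succeed when the node's FluxOS API answers with HTTP 200.
node_is_up() {
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    --connect-timeout 10 --max-time 20 \
    "http://$1:16127/flux/info" < /dev/null)
  [ "$code" = "200" ]
}

# Read one IP per line from the given file, list the nodes that are down,
# then print totals. Blank lines and lines starting with # are skipped.
fleet_summary() {
  total=0
  down=0
  while IFS= read -r ip || [ -n "$ip" ]; do
    case "$ip" in ''|'#'*) continue ;; esac
    total=$((total + 1))
    if ! node_is_up "$ip"; then
      down=$((down + 1))
      echo "DOWN: $ip"
    fi
  done < "$1"
  echo "Fleet: $((total - down))/$total nodes up"
}

# Usage: fleet_summary nodes.txt
```

The checks run one after another, so with the timeouts shown a large fleet with many unreachable nodes can take several minutes to scan.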

Automated Remediation

For common, well-understood issues, automated remediation can save significant time:

  • Auto-restart FluxOS: if the FluxOS service stops, a systemd watchdog or script can restart it automatically
  • Disk cleanup: automated scripts to clean Docker images, logs, and temporary files when disk usage exceeds 80%
  • Benchmark recovery: if a benchmark fails due to temporary load, a script can restart the daemon and force a re-benchmark
  • FluxOS auto-update: scripts that check for new FluxOS versions and apply updates during maintenance windows

Always test automated remediation scripts thoroughly before deploying to production. A buggy auto-restart script can cause more downtime than it prevents. Start with monitoring-only, then add automation gradually.
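
One way to follow the "monitoring-only first" advice is to give a remediation script a dry-run mode that only prints what it would do. The sketch below is an illustration rather than a FluxOS tool: the function names and the `DRY_RUN` variable are hypothetical, and `docker image prune -f` (which removes dangling images) is only one possible cleanup command.

```shell
#!/bin/sh
# Sketch: prune dangling Docker images when disk usage exceeds 80%.
# DRY_RUN and the function names are illustrative; dry run is the default.

DISK_LIMIT=80
DRY_RUN="${DRY_RUN:-1}"

# Print the used percentage of the filesystem holding the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Run a command, or only print it when DRY_RUN is 1.
run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY RUN, would run: $*"
  else
    "$@"
  fi
}

# Clean up only when the given usage percentage is above the limit.
cleanup_if_needed() {
  if [ "$1" -gt "$DISK_LIMIT" ]; then
    run_or_print docker image prune -f
  else
    echo "Disk usage ${1}% is within the limit, nothing to do"
  fi
}

# Usage: cleanup_if_needed "$(disk_used_pct /)"
```

Setting `DRY_RUN=0` enables the real command, so the switch from observation to automation is an explicit, reversible step.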

Escalation Procedures

Define clear escalation paths for different severity levels:

| Severity | Example | Response | Escalation |
|---|---|---|---|
| Critical | Multiple nodes offline | Immediate investigation | Wake up on-call engineer |
| High | Single node benchmark failure | Within 1 hour | Notify lead engineer |
| Medium | Disk usage above 85% | Within 4 hours | Standard ticket |
| Low | Minor version behind | Within 24 hours | Add to maintenance queue |
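
The escalation table can be encoded directly in an alerting script so that each alert carries its response window and escalation step. This is a sketch with a hypothetical function name; the severity keys are lowercase versions of the table's labels, and the sample alert text is made up for illustration.

```shell
#!/bin/sh
# Sketch: look up the response window and escalation step for a severity.
# The function name is illustrative; values mirror the escalation table.

escalation_for() {
  case "$1" in
    critical) echo "Immediate investigation; wake up on-call engineer" ;;
    high)     echo "Within 1 hour; notify lead engineer" ;;
    medium)   echo "Within 4 hours; standard ticket" ;;
    low)      echo "Within 24 hours; add to maintenance queue" ;;
    *)        echo "Unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example: append the escalation instructions to an alert message.
echo "[HIGH] Single node benchmark failure: $(escalation_for high)"
```

Keeping the mapping in one function means the policy changes in a single place when the escalation paths are revised.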