Netdata - Awesome System Monitoring Tool

#1

I’ve been trying Netdata on a few of my servers, and it’s awesome. It does real-time monitoring of Linux servers (collecting thousands of metrics every second), yet somehow has very low CPU usage. Really cool.

I was previously using Munin, which runs every 5 minutes and causes a big jump in CPU usage whenever it runs. Netdata is much more precise, useful and collects much more data, yet uses significantly less CPU power.

I have it up here if you want to see a demo: https://netdata.vps03.d.sb/

The only downside I’ve seen is that it’s focused purely on realtime monitoring, so it only keeps an hour of data by default. It keeps it all in RAM (~15 MB for an hour of data for 1000 metrics). It’s amazing for that use case though. For long-term monitoring, you can stream the data into a time series database. I’m using Prometheus with 120-day retention to track longer-term trends, and it’s working really well.

9 Likes

#2

Thanks for supplying a demo! I might have to give this a proper look when I have the time :slight_smile:

0 Likes

#3

They do also have a few demos on their site, but I thought I’d provide my one too :slight_smile:

Something I found amusing is that when I was running both Netdata and Munin, I could actually see the CPU usage spike every 5 minutes when Munin ran. That wasn’t visible at all on the Munin graphs, since it’s only measuring CPU usage every 5 minutes, and measures it before doing all its heavy processing.

1 Like

#4

Netdata is great :slight_smile:

0 Likes

#5

Interesting, as I have a unique use case that was going to be handled by Munin. I think this is a perfect time to try Netdata instead - will comment post-install.

3 Likes

#6

Did you like Netdata?

I ended up uninstalling munin-node from all my servers, and switched to Netdata everywhere (except on Windows servers, as unfortunately it doesn’t work on Windows yet). I installed Prometheus to archive historical data, and created some custom dashboards through Grafana (example: https://dash.d.sb/d/yLPMYDwik/all-servers)

4 Likes

#7

I got Netdata running on a couple personal servers of mine. It works well and I really love the Dark Theme & just the overall UI itself.

0 Likes

#8

I gotta tweak the Netdata config on my FreeNAS box. Keep getting spammed with the high packet/bandwidth notices when I move anything bigger than like a gig on my LAN.

Other than that, it’s quite nice.

0 Likes

#9

Which alerts are firing? They might actually be legitimate (for example, if your send/receive buffers are too small, the system could be dropping packets).

Having said that, I had to tweak a few alerts too. The emails mention the files that contain the alerts, and there’s a script in /etc/netdata to make a copy of the file that you can edit (sorry, not at my computer right now to check the exact path).

1 Like

#10

net_packets.em0 “10s received packets storm = XXX%” - just telling me the box was idle and then a transfer at full gigabit hit it.

So it does. I have just been ignoring them :joy: . The other one was the RAM usage warning… which doesn’t do much for me either - of course it’s going to eat RAM, it’s ZFS. It returns to normal after 5 minutes every time.

0 Likes

#11

The command to edit the alert config is something like this:

sudo /etc/netdata/edit-config health.d/net.conf

It’ll make a copy of the default config into /etc/netdata and open it in your editor. Then just silence the noisy ones by changing their notification recipient to "silent" (the "to:" line), or by tweaking the thresholds :slight_smile:
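For reference, the edited entry in the copied file ends up looking something like this (the alarm name is from memory, so match it against what’s actually in your health.d/net.conf - "to: silent" keeps the alarm evaluating but stops the notifications):

```ini
# /etc/netdata/health.d/net.conf (the copy created by edit-config)
# Alarm name recalled from memory - check it against your own file.
template: 10s_received_packets_storm
      on: net.packets
      to: silent
```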

1 Like

#12

So I was thinking about this today while configuring a new VPS… One of the interesting properties of Netdata is that the memory usage is easily computable: Each value is stored as a 32-bit number, so the amount of memory it uses for data is 4 bytes * number of metrics * number of historical values kept. Documented here: Database - Netdata Documentation

By default it keeps one hour of data at one-second granularity (so 3600 entries per metric). With 1000 metrics, that means it’ll use ~14.4 MB RAM. Changing it to collect metrics every five seconds (instead of every second) while still keeping one hour’s worth of data means it only uses ~2.8 MB RAM for the same number of metrics. With such small RAM usage, it runs fine even on little 128 MB VPSes.
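That back-of-the-envelope formula is easy to sanity-check. Here’s a quick sketch of the arithmetic (just the documented formula, not Netdata’s actual code):

```python
# Netdata's in-memory round-robin database stores each sample as a
# 32-bit (4-byte) value, so data memory is simply:
#   4 bytes * number of metrics * number of historical values kept.

def netdata_ram_bytes(metrics: int, history_seconds: int, update_every: int = 1) -> int:
    """Approximate RAM used for metric data (excludes per-chart overhead)."""
    entries_per_metric = history_seconds // update_every
    return 4 * metrics * entries_per_metric

# 1000 metrics, 1 hour at 1-second granularity -> 14,400,000 bytes (~14.4 MB)
print(netdata_ram_bytes(1000, 3600))
# Same hour at 5-second granularity -> 2,880,000 bytes (~2.8 MB)
print(netdata_ram_bytes(1000, 3600, update_every=5))
```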

2 Likes

#13

I’ve been playing with it and it really works very well. How difficult do you feel the Prometheus / Grafana setup would be for someone who’s very technically able but has never set this sort of thing up? I run a service and every once in a blue moon I get pinged for performance issues at times when I’m sleeping so it’s hard to debug where the issue was, keeping some of these graphs for say a couple of weeks would greatly help with that I feel.

1 Like

#14

You can adjust Netdata to keep data for a longer period of time. For example, you could set it to 24 hours, so you can go back and look at the data when an alert comes through. Obviously RAM usage will increase, but not by a lot.
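If you want to try that, the retention is controlled by the "history" setting (in seconds) in netdata.conf - something like the snippet below, though check the docs for your Netdata version, as option names have moved around:

```ini
# /etc/netdata/netdata.conf
[global]
    # keep 24 hours of per-second history instead of the default 3600
    history = 86400
```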

Grafana etc. is very simple to set up - there are loads of good articles that will get you going using the node_exporter plugin, which pulls in most of the metrics you’d want. There’s also a gallery of preconfigured Grafana dashboards to try, which you can edit and adjust to your preference.

This would be a good guide for what you want: Netdata, Prometheus, Grafana stack - Netdata Documentation

1 Like

#15

Prometheus is pretty easy! I had never set up anything similar before either, and I found it quite straightforward. For reference, this is what my config looks like, with some irrelevant stuff removed:

global:
  scrape_interval:     1m
  evaluation_interval: 1m

scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus_all_hosts]

    honor_labels: true

    static_configs:
      - targets: [
          'vps03.vpn.d.sb:19999',
          'vps07.vpn.d.sb:19999',
          'vps10.vpn.d.sb:19999',
          'vps11.vpn.d.sb:19999'
        ]

And the command I’m using to run it:

prometheus --storage.tsdb.retention=365d --log.level=info --web.listen-address=127.0.0.1:9090

I’m running it on a Windows server, but it’d be similar on Linux. You’d just configure the command line via systemd, or use a package that does that for you automatically.
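For the Linux/systemd route, a minimal unit wrapping that same command might look like the sketch below (the paths, user, and file locations here are assumptions for illustration - distro packages ship their own unit files):

```ini
# /etc/systemd/system/prometheus.service (hypothetical example)
[Unit]
Description=Prometheus time series database
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=365d \
  --web.listen-address=127.0.0.1:9090
Restart=on-failure

[Install]
WantedBy=multi-user.target
```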

Currently I’ve got data from 4th January until today, across four Netdata servers (some added recently though), one Windows server using wmi_exporter, and a few other things I’m monitoring, and my Prometheus data directory is around 2.9 GB in size. I’m scraping the data every minute. If needed, you could reduce the size even more by only scraping certain metrics - I’m scraping all of Netdata’s metrics into Prometheus.

Grafana is even easier - once it’s installed and running, everything is configured in its web UI. Alerting in particular is easier to set up - you configure it in the UI while looking at a graph, rather than having to edit a YAML file.

Here’s one of my Grafana dashboards, for inspiration. It shows CPU, RAM and disk usage across all my VPSes:
https://dash.d.sb/d/yLPMYDwik/all-servers

Here’s the JSON for that dashboard, if you want it (you can import it into your own Grafana instance):

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 0,
      "gridPos": {
        "h": 6,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 2,
      "legend": {
        "alignAsTable": true,
        "avg": false,
        "current": true,
        "max": true,
        "min": true,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "100 - avg(netdata_cpu_cpu_percentage_average{dimension=\"idle\"}) by (instance)",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "A"
        },
        {
          "expr": "100 - 100 / sum(rate(wmi_cpu_time_total[5m])) by (instance) * sum(rate(wmi_cpu_time_total{mode=\"idle\"}[5m])) by (instance)",
          "format": "time_series",
          "interval": "",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "CPU Usage",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "percent",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 1,
      "gridPos": {
        "h": 6,
        "w": 12,
        "x": 0,
        "y": 6
      },
      "id": 6,
      "legend": {
        "alignAsTable": true,
        "avg": false,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "netdata_mem_available_MiB_average",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "A"
        },
        {
          "expr": "wmi_os_physical_memory_free_bytes / 1024 / 1024",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Available Memory",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "mbytes",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "alert": {
        "conditions": [
          {
            "evaluator": {
              "params": [
                5
              ],
              "type": "lt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": [
                "A",
                "5m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "for": "5m",
        "frequency": "1m",
        "handler": 1,
        "message": "Free disk space is low!",
        "name": "Free Disk Space on Primary Disk alert",
        "noDataState": "no_data",
        "notifications": [
          {
            "id": 1
          }
        ]
      },
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 1,
      "gridPos": {
        "h": 6,
        "w": 12,
        "x": 12,
        "y": 6
      },
      "id": 4,
      "legend": {
        "alignAsTable": true,
        "avg": false,
        "current": true,
        "max": false,
        "min": false,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "netdata_disk_space_GiB_average{family=\"/\", dimension=\"avail\"}",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "A"
        },
        {
          "expr": "wmi_logical_disk_free_bytes{volume=\"C:\"} / 1024 / 1024 / 1024",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}}",
          "refId": "B"
        }
      ],
      "thresholds": [
        {
          "colorMode": "critical",
          "fill": true,
          "line": true,
          "op": "lt",
          "value": 5
        }
      ],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Free Disk Space on Primary Disk",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "decgbytes",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": false
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "fill": 1,
      "gridPos": {
        "h": 8,
        "w": 24,
        "x": 0,
        "y": 12
      },
      "id": 8,
      "legend": {
        "alignAsTable": true,
        "avg": true,
        "current": true,
        "max": true,
        "min": true,
        "rightSide": true,
        "show": true,
        "total": false,
        "values": true
      },
      "lines": true,
      "linewidth": 1,
      "links": [],
      "nullPointMode": "null",
      "percentage": false,
      "pointradius": 5,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [
        {
          "alias": "/.+ sent/",
          "transform": "negative-Y"
        }
      ],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "netdata_system_net_kilobits_persec_average{dimension=\"received\"}",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}} received",
          "refId": "A"
        },
        {
          "expr": "-netdata_system_net_kilobits_persec_average{dimension=\"sent\"}",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}} sent",
          "refId": "B"
        },
        {
          "expr": "sum(rate(wmi_net_bytes_sent_total[5m])) by (instance) / 1024",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}} sent",
          "refId": "C"
        },
        {
          "expr": "sum(rate(wmi_net_bytes_received_total[5m])) by (instance) / 1024",
          "format": "time_series",
          "intervalFactor": 1,
          "legendFormat": "{{instance}} received",
          "refId": "D"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Network Bandwidth",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "Kbits",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "schemaVersion": 16,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ],
    "time_options": [
      "5m",
      "15m",
      "1h",
      "6h",
      "12h",
      "24h",
      "2d",
      "7d",
      "30d"
    ]
  },
  "timezone": "",
  "title": "All Servers",
  "uid": "yLPMYDwik",
  "version": 15
}

I wouldn’t bother with node_exporter if you’re using Netdata too - Instead just configure Prometheus to scrape from Netdata.

5 Likes

#16

Much appreciated! So it looks like Prometheus pulls from the netdata servers, is that right? How does that translate in terms of utilized bandwidth for metrics? Have you noticed any performance hit either on the netdata machines or the Prometheus host?

Right now I have 3 Netdata servers on 3 frontend machines and 1 on a backend machine. I was hoping to put Prometheus and Grafana on the backend, but I’m not sure what to expect in terms of a possible performance penalty on the services running there (a small penalty would be fine CPU/disk/RAM/bandwidth-wise).

0 Likes

#17

I tried this and, I may be dumb, but I could never figure out how to go back in time - it seemed like the graphs were all real-time. Maybe I didn’t set it far back enough.

EDIT: LOL. I just noticed I could just drag the graphs back. I’m still gonna play with Prometheus and Grafana because it’d be good to have all the important graphs in a single place.

0 Likes

#18

Yeah, that’s right. This is a nice model because it can warn you if any of the servers are inaccessible, and you can configure alerts around that (eg. if it can’t scrape Netdata for more than 10 minutes, fire an alert). You get a nice dashboard page showing all your scrape targets and whether they’re currently up (eg. mine is at https://prom.d.sb/targets)

On one of my servers, all the metrics from Netdata (/api/v1/allmetrics?format=prometheus_all_hosts) are around 22 KB. That’s about 1.3 MB per hour if you scrape Netdata once per minute (60 times per hour).
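In other words, the traffic is just payload size times scrape frequency. A quick sketch of that arithmetic (the 22 KB figure is from my own server - yours will differ with the number of metrics):

```python
# Scrape traffic estimate: payload size x scrapes per hour.
def scrape_mb_per_hour(payload_kb: float, interval_seconds: int) -> float:
    """Approximate scrape traffic in decimal MB per hour."""
    scrapes_per_hour = 3600 / interval_seconds
    return payload_kb * scrapes_per_hour / 1000

# ~22 KB per scrape, once a minute -> ~1.3 MB per hour
print(round(scrape_mb_per_hour(22, 60), 2))
```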

Netdata is barely noticeable, even on very tiny VPSes with small CPU allocations. I’m running it on a 128 MB RAM VPS from MrVM @mikho and even there it still uses less than 2% CPU. I did disable the Python collectors to save a bit of RAM on that one. Netdata stores data entirely in RAM by default, so you don’t get any disk I/O at all.

Prometheus is pretty light too, unless you use it in a very high-load scenario (scraping hundreds of servers). Looking at the data, my Prometheus instance is using less than 1% CPU on average. It does write to the disk of course, so there’s definitely more I/O than Netdata, but it’s not excessive.

My Prometheus server is on a VirMach VPS, and they have a reputation for suspending VPSes that use too much CPU power or I/O. Haven’t had an issue yet.

Netdata is mostly written in C, and Prometheus is written in Go… Both languages have a reputation for being really lightweight in terms of RAM and CPU usage.

You can also hold Alt and use your mouse scrollwheel to zoom out :slight_smile: There’s some buttons at the bottom right of the graph too, under the legend.

1 Like

#19

I just set prometheus and grafana up and I’m surprised about how easy it was.

The way the metrics are found and the queries generated is quite intuitive.

On to build some more cute dashboards! :slight_smile:

2 Likes

#20

Finally figured out why SSH connections take a very long time on my VPS in Romania :rofl:

I’ll let you guess when I connected to the server

0 Likes