- Use a stacked graph instead of a gauge as development over time is
especially useful for disk space usage.
- By only taking one metric per device into account, we avoid
double-counting for devices that are mounted multiple times.
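
The de-duplication idea above can be sketched in Go. This is only an illustration, not the dashboard query itself: the `mount` type, the function names, and the choice to keep the largest value per device are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// mount is a hypothetical record of one mounted filesystem.
type mount struct {
	Device     string
	Mountpoint string
	UsedBytes  float64
}

// usedPerDevice keeps a single value per device, so a device that is
// mounted at several mountpoints contributes only once.
func usedPerDevice(mounts []mount) map[string]float64 {
	perDevice := make(map[string]float64)
	for _, m := range mounts {
		// Assumption: when a device shows up more than once, keep the
		// largest value seen for it.
		if cur, ok := perDevice[m.Device]; !ok || m.UsedBytes > cur {
			perDevice[m.Device] = m.UsedBytes
		}
	}
	return perDevice
}

// totalUsed sums the de-duplicated per-device values.
func totalUsed(mounts []mount) float64 {
	var total float64
	for _, v := range usedPerDevice(mounts) {
		total += v
	}
	return total
}

func main() {
	mounts := []mount{
		{Device: "/dev/sda1", Mountpoint: "/", UsedBytes: 40e9},
		{Device: "/dev/sda1", Mountpoint: "/var/lib/docker", UsedBytes: 40e9}, // same device again
		{Device: "/dev/sdb1", Mountpoint: "/data", UsedBytes: 10e9},
	}

	per := usedPerDevice(mounts)
	devices := make([]string, 0, len(per))
	for d := range per {
		devices = append(devices, d)
	}
	sort.Strings(devices) // stable output order

	for _, d := range devices {
		fmt.Printf("%s: %.0f bytes\n", d, per[d])
	}
	fmt.Printf("total: %.0f bytes\n", totalUsed(mounts))
}
```

Summing over mountpoints instead would count `/dev/sda1` twice in this sample data.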
Signed-off-by: beorn7 <beorn@grafana.com>
* node-mixin: Improve nodes dashboard
- Use stacking where it makes sense.
- Normalize idle CPU so that stacking is more meaningful.
- Consistently fill where stacking is used but don't fill where not.
- Fix y axis max value for Idle CPU panel.
- Fix y axis min value for memory usage panel.
- Use `$__interval` for range where applicable (and set min step
to 1m).
- Make the right Y axis for disk I/O actually work.
This is just an incremental improvement. It doesn't touch the more
involved TODOs.
Signed-off-by: beorn7 <beorn@grafana.com>
This addresses the blissful scenario where single-node failures are
unproblematic. No reason to wake somebody up if a node is about to
screw itself up by filling the disk.
Signed-off-by: beorn7 <beorn@grafana.com>
- Normalize cluster memory utilisation.
- Fix missing `1m` in memory saturation.
- Have both disk-related rows next to each other instead of having the
network row in between.
- Correctly render transmit network traffic as negative, using
`seriesOverrides` and `min: null` for the y-axis.
- Make panel and row naming consistent.
- Remove legend where it would just display a single entry with
exactly the title of the panel.
- Fix metric name in individual node CPU Saturation panel.
- Break up disk space utilisation by device in the panel for an
individual node.
NB: None of that touches the more subtle issues captured in the
various TODOs.
Signed-off-by: beorn7 <beorn@grafana.com>
This makes the query valid even if the selector values are
empty.
Additionally, fix the query responsible for disk space usage.
Signed-off-by: paulfantom <pawel@krupa.net.pl>
The only deviation that happened so far is to use format="percentunit"
in a Grafana gauge. This change wasn't even properly used in this repo
so far, so I opted to stick with "upstream" for now. If changes are
really needed, we can try to change upstream first.
Another change was done in parallel here and upstream, but it was
"more correct" upstream. (Change datasource to $datasource
variable, only partially applied here.) That is another point for
using upstream rather than copying it here.
Signed-off-by: beorn7 <beorn@grafana.com>
* Replace supervisord xmlrpc library
* Use `github.com/mattn/go-xmlrpc` that doesn't leak goroutines.
* Fix uptime metric
* Use Prometheus best practices for uptime metric.
* Use "start time" rather than "uptime".
* Don't emit a start time if the process is down.
* Add changelog entry.
* Add example compatibility rules.
Signed-off-by: Ben Kochie <superq@gmail.com>
* Add metrics from SNTPv4 packet to ntp collector & add ntpd sanity check
1. Checking the local clock against a remote NTP daemon is a bad idea.
A local ntpd acting as a client does it better and avoids excessive load
on the remote NTP server, so the collector is refactored to query the
local NTP server.
2. Checking the local clock against a remote one does not check the local
ntpd itself. The local ntpd may be down or out of sync due to network
issues while the clock is still OK.
3. Checking an NTP server by the sanity of its response is tricky and
depends on the ntpd implementation; that's why a common `node_ntp_sanity`
variable is exported.
* `govendor add golang.org/x/net/ipv4`, it is a dependency of github.com/beevik/ntp
* Update github.com/beevik/ntp to include boring SNTP fix
* Use variable name from RFC5905
* ntp: move code to make export of raw metrics more explicit
* Move NTP math to `github.com/beevik/ntp`
* Make `golint` happy
* Add some brief docs explaining `ntp` #655 and `timex` #664 modules
* ntp: drop XXX comment that got its decision
* ntp: add `_seconds` suffix to relevant metrics
* Better `node_ntp_leap` comment
* s/node_ntp_reftime/node_ntp_reference_timestamp_seconds/ as requested by @discordianfish
* Extract subsystem name to const as suggested by @SuperQ