Commit Graph

228 Commits

Author SHA1 Message Date
Paul Cuzner
4750ac0d77 mgr/prometheus: add test cases and validation using tox
Focus all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. tox tests also use
promtool if available to provide rule checks and unittest runs.

In addition to these checks a validate_rules script provides the
format, and content checks against all rules - which is also
called via tox (but can be run independently too)

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-22 13:36:40 +13:00
Paul Cuzner
e0dfc02063 mgr/prometheus: track individual healthchecks as metrics
This patch creates a health history object maintained in
the module's kvstore.  The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear

In addition to the new commands, the additional metrics
have been used to update the prometheus alerts

Fixes: https://tracker.ceph.com/issues/52638

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-22 13:32:39 +13:00
Aashish Sharma
ed954b0e6c mgr/dashboard: monitoring: grafonnet refactoring for cephfs dashboards
This PR intends to refactor cephfs dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:36:31 +05:30
Aashish Sharma
e490e2f3ab mgr/dashboard: monitoring: grafonnet refactoring for osds dashboards
This PR intends to refactor osds dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:13:50 +05:30
Aashish Sharma
8c48821c21 mgr/dashboard: monitoring: grafonnet refactoring for pools dashboards
This PR intends to refactor pools dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:10:56 +05:30
Aashish Sharma
e737aaa000 mgr/dashboard: monitoring: grafonnet refactoring for rbd dashboards
This PR intends to refactor rbd dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:09:04 +05:30
Aashish Sharma
eb01954cd9 mgr/dashboard: monitoring: grafonnet refactoring for radosgw dashboards
This PR intends to refactor radosgw dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 11:57:28 +05:30
Ernesto Puerta
19535b1d0e
Merge pull request #43469 from rhcs-dashboard/hosts-grafana-dashboards
mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-18 17:14:03 +02:00
Ernesto Puerta
9b40c9df26
Merge pull request #43377 from rhcs-dashboard/fix-clients-connection-query
mgr/dashboard: replace "Ceph-cluster" Client connections with active-standby MGRs

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-13 13:37:51 +02:00
Sebastian Wagner
53382d70eb
Merge pull request #43274 from pcuzner/add-mib
monitoring:Adding the Ceph MIB

Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-12 22:29:06 +02:00
Aashish Sharma
f7714de294 mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards
This PR intends to refactor hosts dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-12 11:05:02 +05:30
Avan Thakkar
d388c5e958 mgr/dashboard: replace Client connections with active-stdby mgrs
Fixes: https://tracker.ceph.com/issues/52121
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-10-11 21:53:23 +05:30
Paul Cuzner
b96aa5d184 monitoring:Updated README
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-06 14:32:47 +13:00
Ernesto Puerta
ba9e17d2d2
Merge pull request #43132 from p-se/monitoring-grafana-piechart-update
monitoring: update grafana-piechart-panel plugin

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-09-28 18:37:45 +02:00
Paul Cuzner
f9213ad9cf monitoring:Adding the Ceph MIB
The ceph MIB has been created and maintained in a
separate repo:
https://github.com/SUSE/prometheus-webhook-snmp

This patch brings this MIB into the main ceph repo, so
alert changes can target prometheus and potentially
SNMP environments within the same PR.

Kudos to Volker Theile for creating the MIB.

Fixes: https://tracker.ceph.com/issues/52708

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-09-23 11:06:19 +12:00
Patrick Seidensal
af94237621
monitoring: update grafana-piechart-panel plugin
Fixes: https://tracker.ceph.com/issues/51211

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-09-10 15:28:17 +02:00
Aashish Sharma
58d635455d mgr/dashboard: Incorrect MTU mismatch warning
The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue.

Fixes:https://tracker.ceph.com/issues/52028
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-09-02 15:34:36 +05:30
Kefu Chai
1835fd86dd cmake: exclude "grafonnet-lib" target from "all"
so we don't build this target when running "make", and hence avoid
accessing the internet in a build environment where internet
access is not allowed.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-08-20 22:50:42 +08:00
Kefu Chai
1fdd632d0c cmake: silence build output when building external deps
when downloading/building grafonnet-lib, dpdk, spdk, liburing and fio,
they dump lots of output during the configuration and building phases,
all of which is irrelevant to us. so let's just silence it.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-08-16 21:27:57 +08:00
Ernesto Puerta
559afae0b9
Merge pull request #41570 from jhrcz-ls/wip-cephfs-overview-use-rate
mgr/dashboard: cephfs MDS Workload to use rate for counter type metric
2021-08-12 20:53:07 +02:00
Aashish Sharma
4907c78bb7 mgr/dashboard: fix grafonnet build error
This PR intends to fix the issue caused by #42194

Fixes:https://tracker.ceph.com/issues/52238
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-08-12 17:48:33 +05:30
Ernesto Puerta
afadfede0d
Merge pull request #42194 from rhcs-dashboard/add-grafonnet-grafana
mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based code
2021-08-11 18:11:59 +02:00
Aashish Sharma
e9bd94515f mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based Code
This PR intends to add grafonnet to generate grafana JSON files

Fixes: https://tracker.ceph.com/issues/45184
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-08-11 19:23:54 +05:30
Ernesto Puerta
cc6b18a92c
Merge pull request #41880 from david-caro/fix_cluster_grafana_dashboard
monitoring/grafana/cluster: use per-unit max and limit values

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-08-02 13:03:46 +02:00
Jan Horáček
5bf516dcc7 [mgr/dashboard] cephfs metrics in MDS Workload panels to use rate because of counter type metric
Fixes: https://tracker.ceph.com/issues/51954
Signed-off-by: Jan Horacek <jan.horacek@livesport.eu>
2021-07-29 10:09:41 +02:00
Seena Fallah
feb8f784d2 monitoring: fix Physical Device Latency unit
Based on the expr it should be seconds

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-07-07 17:00:30 +04:30
Ernesto Puerta
62e3a5c41c
Merge pull request #41838 from p-se/grafana-clean-up
monitoring: Clean up Grafana dashboards

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: jan--f <NOT@FOUND>
Reviewed-by: p-se <NOT@FOUND>
Reviewed-by: Paul Cuzner <pcuzner@redhat.com>
2021-06-25 20:45:28 +02:00
David Caro
c981298039
monitoring/grafana/cluster: use per-unit max and limit values
The value we get is a per-unit value, so the limits and the max value should
be over 1, not 100. Note that the value being shown was correct, it
was the gauge that was not showing the correct indicators.

Signed-off-by: David Caro <david@dcaro.es>
2021-06-16 10:38:41 +02:00
Patrick Seidensal
037410713f
monitoring: remove instance label from ceph-cluster.json completely
The `instance` label is only useful if

- the exporter returns only data about its node or instance
- the exporter provides an instance label and then may return data about
  other nodes

In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, so not only data related
to the node (or instance) the mgr module is running on.  It is
completely irrelevant which node the exporter runs on; the data
provided doesn't change.  The exporter also doesn't provide `instance`
labels (which Prometheus wouldn't change due to our configuration, see
"honor_labels" setting).

(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)

Note that keeping that instance label on this particular dashboard would
enable the user to switch between a previously failed mgr instance and
the data collected from there and the currently running mgr instance (on
which the Prometheus mgr module runs).  That'd split the data, which
I don't think is a useful feature, but rather looks broken.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:11:30 +02:00
Patrick Seidensal
4270a13d6c
mgr/dashboard: Fix Grafana Ceph Cluster health status widget
The health status widget doesn't show any status because it requires its
query to return a single result. But in case a mgr instance had failed,
it would return more, provided the incident has happened in the
requested time frame.

This is simply an issue of the `instant` switch being disabled for that
widget. As only one mgr instance can ever be providing data at a time,
enabling `instant` completely solves that issue.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
f51cab109d
mgr/dashboard: Fix decimals in OSC Capacity Utilization widget
Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
5527c1c54f
mgr/dashboard: Remove hard-coded timezone off Grafana dashboards
Remove hard-coded timezone off Grafana dashboards to enable the Grafana
administrator to decide which timezone should be used for dashboards.

If we hard-coded those values, changing the global settings in Grafana
wouldn't have an effect. And the administrators can't change the
automatically imported Grafana dashboards provided by us.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
8218d43e5f
monitoring: convert newline character to LF
Convert newline character from CRLF in `rbd-details.json` to LF, so that
it will be consistent with all the other dashboard JSON files.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
a709abf8bf mgr/dashboard: deprecated variable usage in Grafana dashboards
Fixes: https://tracker.ceph.com/issues/50059

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-07 14:31:53 +02:00
Dan Mick
de491c128a monitoring/grafana/build/Makefile: work around buildah bug
Workaround https://github.com/containers/buildah/issues/3253
by pushing to a local OCI-format image to clear out erroneously-left
'parent' field in buildah commit --squash output.  Can be removed
when the fix for the above is available.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
b56ff43232 monitoring/grafana/build/Makefile: use --authfile
podman login caches auth tokens in auth.json; for sudo, it may be
placed in /run/containers/0 or it may be in /run/users/0/containers;
the latter directory is removed when root "logs out", and it isn't
clear what that means with sudo/su.  Several builds failed because
they couldn't find the cached auth between sudo podman login and sudo
podman push.  Sidestep the confusion by just using a local file for
the auth cache.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
a3b4bc73f7 monitoring/grafana/build/Makefile: cleanup, ready for jenkins
- allow env setting of versions of components
- add docker/quay username/password variables
- derive container version from grafana version
- make arch-specific tags
- expand clean target to remove container images
- remove release-specific targets, "all" target
- move push operations to separate "push" target

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
0fdbe673c8 monitoring/grafana/build/Makefile: use curl instead of wget
build machines tend to already have curl installed

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
2faadc2d5c monitoring/grafana/build/Makefile: use "sudo buildah"
Some build machines don't have /etc/sub{u,g}id set up for
so-called "rootless" (non-privileged) operation.  Use sudo
to avoid the need for "rootless".

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
9d37c6efbd monitoring/grafana/build/Makefile: pull dashboards from local dir
Use the dashboard definition files in this workspace directly

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
444d6f6623 monitoring/grafana/build/Makefile: Add ARCH variable
Allow building for other archs, in particular arm64

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
508b1d387f monitoring/grafana/build/Makefile: fully qualify source image
Some build machines may not have a default docker repo configured.
Specify docker.io.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:24 -07:00
Ernesto Puerta
ac5d24e5ca
mgr/dashboard: remove non-null id in Grafana dashb
Testing added to prevent this situation.

Fixes: https://tracker.ceph.com/issues/50918
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2021-05-21 13:54:48 +02:00
Alfonso Martínez
7d79efb025 mgr/dashboard: fix OSDs Host details/overview grafana graphs
Fixes: https://tracker.ceph.com/issues/50686
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2021-05-07 15:38:07 +02:00
Ernesto Puerta
458ad48024
Merge pull request #40715 from pcuzner/pool-overview-enhancement
mgr/dashboard:include compression stats on pool dashboard

Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-05-05 18:08:58 +02:00
Paul Cuzner
81788b1f21 mgr/dashboard:include compression stats on pool dashboard
This is a replacement dashboard configuration for the
pool overview page. It provides a cluster wide view of
capacity consumed and compression effectiveness, and
breaks this down by each pool within the configuration.

Fixes: https://tracker.ceph.com/issues/50226

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-05-03 12:26:06 +12:00
Ernesto Puerta
381685f17f
Merge pull request #40072 from wornet-mwo/dashboard--grafana-hostname-corrections
mgr/dashboard: Fixed name clash when hostname similar to another

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-04-29 19:40:57 +02:00
Michael Wodniok
e97e27ebdb dashboard: Fixed name clash when hostname similar to another
Fixes: #49769
Signed-off-by: Michael Wodniok <wodniok@wor.net>
2021-04-27 08:42:59 +02:00
Malcolm Holmes
382e293656 monitoring/grafana: Remove erroneous elements in hosts-overview Grafana dashboard
The hosts-overview Grafana dashboard json file contains a repeated element, making
it invalid JSON. Some JSON parsers handle this. However, this prevents Jsonnet
from parsing the dashboard, which prevents the deployment of this dashboard via
Jsonnet.

Fixes: https://tracker.ceph.com/issues/50410
Signed-off-by: Malcolm Holmes <mdh@odoko.co.uk>
2021-04-17 23:11:48 +01:00
Aashish Sharma
8d2f39e6c5 mgr/dashboard:Simplify some complex calculations in test_alerts.yml
run-promtool-unittests is failing because of differences in floating-point values in some complex calculations. This PR intends to simplify those calculations and fix this issue.

Fixes: https://tracker.ceph.com/issues/49952
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-03-25 12:05:07 +05:30
Aashish Sharma
53a5816ded mgr/dashboard:test prometheus rules through promtool
This PR intends to add unit testing for prometheus rules using promtool. To run the tests, run the 'run-promtool-unittests.sh' file.

Fixes: https://tracker.ceph.com/issues/45415
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-03-08 10:16:22 +05:30
Ernesto Puerta
dff5b78d3b
Merge pull request #39462 from rhcs-dashboard/fix-alerts-mtuMismatch
mgr/dashboard: fix MTU Mismatch alert

Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-02-17 14:14:17 +01:00
Ernesto Puerta
e2d73297cf
Merge pull request #38030 from p-se/prom-alert-package-drops-leeway
mgr/dashboard: prometheus alerting: add some leeway for package drops and errors

Reviewed-by: Stephan Müller <smueller@suse.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-02-16 20:45:44 +01:00
Patrick Seidensal
9ac248b0c3 mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%)
Fixes: https://tracker.ceph.com/issues/48201

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-02-16 14:43:00 +01:00
Aashish Sharma
8527489b91 mgr/dashboard:fix MTU Mismatch alert
This PR intends to fix the expression used for MTU Mismatch alert in prometheus

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-02-15 10:13:39 +05:30
Aashish Sharma
06cc0d8743 mgr/dashboard: trigger alert if some nodes have a MTU different than the median value
This PR intends to alert a user if a specific network is configured with a custom MTU

Fixes: https://tracker.ceph.com/issues/48748
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-01-22 11:20:13 +05:30
Alfonso Martínez
9441fda4dc mgr/dashboard/monitoring: upgrade Grafana version due to CVE-2020-13379
Fixes: https://tracker.ceph.com/issues/48685
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2021-01-07 16:53:26 +01:00
Kefu Chai
30487c755c
Merge pull request #38282 from vosdev/ceph-pool-alert
mgr/prometheus: Fix 'pool filling up' with >50% usage

Reviewed-by: Patrick Seidensal <pseidensal@suse.com>
2020-12-12 12:10:44 +08:00
Daniël Vos
79568d51c6 mgr/prometheus: Fix 'pool filling up' with >50% usage
Fixes: https://tracker.ceph.com/issues/48354
Signed-off-by: Daniël Vos <danielvos@outlook.com>
2020-12-01 16:31:09 +01:00
haoyixing
0e7e036aa7 doc/dev: use http://docs.ceph.com/en/latest/ instead of /docs/master/ for docs
Several links under http://docs.ceph.com/docs/master/ were inaccessible.
Change them to http://docs.ceph.com/en/latest so we can access them directly.

Signed-off-by: haoyixing <haoyixing@kuaishou.com>
2020-11-24 12:49:47 +08:00
Paul Cuzner
2010432b50 mgr/prometheus: Add healthcheck metric for SLOW_OPS
SLOW_OPS is triggered by op tracker, and generates a health
alert but healthchecks do not create metrics for prometheus to
use as alert triggers. This change adds SLOW_OPS metric, and
provides a simple means to extend to other relevant health
checks in the future

If the extraction of the value from the health check message fails
we log an error and remove the metric from the metric set. In
addition the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2020-11-02 15:30:49 +13:00
Seena Fallah
0fd28f646c monitoring: Use null yaxes min for OSD read latency
Because seriesOverrides plots the read series on the negative Y axis, there shouldn't be a minimum for yaxes

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2020-10-12 19:56:18 +03:30
Patrick Seidensal
fe64b9d176 mgr/dashboard: Fix many-to-many issue in host-details dashboard
The labels on one side do not match the labels of the other side, where
a label_replace is used. The fix uses the same label_replace on the
missing side.

Fixes: https://tracker.ceph.com/issues/47334

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-09-07 12:37:40 +02:00
Avan Thakkar
f039e5585d mgr/dashboard: cpu stats incorrectly displayed
Fixes: https://tracker.ceph.com/issues/46683

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2020-07-23 11:57:32 +05:30
pcuzner
0021dd278b
Merge pull request #35610 from pcuzner/wip-grafana-container
monitoring: add grafana container build file
2020-07-06 13:06:55 +12:00
Lenz Grimmer
399521d66b
Merge pull request #34532 from rhcs-dashboard/wip-45068-fix-parse-error
mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
2020-06-30 10:50:59 +02:00
Paul Cuzner
3c813729dc monitoring:add grafana container build file
This commit provides the Makefile to create the
ceph-grafana containers for nautilus, octopus and
master releases.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2020-06-17 17:20:45 +12:00
Kiefer Chang
b963b7fbe9
monitoring: fixing some issues in RBD detail dashboard
- Exchange read/write legends in The `I/O Bytes per second` panel.
- Rename `I/O Bytes per second` to `Throughput`.
- Rename `IOPS Count` to just `IOPS`.
- Remove instance name from legends.
- Fixes typos: `Averange` -> `Average`.

Fixes: https://tracker.ceph.com/issues/45735
Signed-off-by: Kiefer Chang <kiefer.chang@suse.com>
2020-05-28 14:49:31 +08:00
Alfonso Martínez
cf4ff7d2f0 mgr/dashboard: grafana panels for rgw multisite sync performance
* RGW sync perf. counters are now exposed through grafana panels.
* Sync Performance tab is only shown if rgw realm is detected.
* Prometheus module: added metrics suitable for prometheus consumption (from existing ones, not replacing for backward compatibility).

Fixes: https://tracker.ceph.com/issues/45310
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2020-05-22 13:36:10 +02:00
Benoît Knecht
653c3f6682 monitoring: Fix "10% OSDs down" alert description
The alert was triggered when less than 90% of OSDs were _up_, but then the
description took that value and described it as the percentage of OSDs being
_down_. So with 12% of OSDs down, the alert description would read:

```
88% or 88 of 100 OSDs are down (>=10%).
```

which can be panic-inducing.

This commit changes the alert expression to actually compute the ratio of OSDs
being down, which makes the correct value appear in the description.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2020-05-06 18:49:26 +02:00
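
The arithmetic behind the commit above can be sketched in a few lines of Python. This is only an illustration of the ratio being described, not the PromQL expression used in the alert rule; the function names and the 10% default threshold parameter are assumptions made for the example.

```python
def osd_down_percent(osds_up: int, osds_total: int) -> float:
    """Return the percentage of OSDs that are down (hypothetical helper)."""
    if osds_total <= 0:
        raise ValueError("osds_total must be positive")
    return 100.0 * (osds_total - osds_up) / osds_total


def should_alert(osds_up: int, osds_total: int, threshold: float = 10.0) -> bool:
    """Fire when at least `threshold` percent of OSDs are down."""
    return osd_down_percent(osds_up, osds_total) >= threshold


if __name__ == "__main__":
    # With 88 of 100 OSDs up, the value reported should be the down ratio,
    # not the up ratio that the old description displayed.
    print(osd_down_percent(88, 100), should_alert(88, 100))
```

Computing the down ratio directly means the same number can drive both the threshold comparison and the text shown to the operator.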
Lenz Grimmer
9334471340
Merge pull request #33991 from SchoolGuy/monitoring/rbd-image-details
mgr/dashboard/grafana: Add rbd-image details dashboard

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Laura Paduano <lpaduano@suse.com>
Reviewed-by: Patrick Seidensal <pnawracay@suse.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
2020-05-04 09:59:53 +02:00
Enno Gotthold
dfb1e0020e
mgr/dashboard: Remove additional unneeded steps for the metrics calculation
Signed-off-by: Enno Gotthold <egotthold@suse.de>
2020-04-28 13:34:16 +02:00
Ernesto Puerta
3fd804f10b
monitoring: fix decimal precision in Grafana %
Set decimal precision to 2 positions for charts using percentunits.

Fixes: https://tracker.ceph.com/issues/45183
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2020-04-22 13:39:16 +02:00
Avan Thakkar
47b515c094 mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images
Fixes: https://tracker.ceph.com/issues/45068

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2020-04-21 23:03:09 +05:30
Volker Theile
e197e4d7f4 monitoring: alert for pool fill up broken
Fixes: https://tracker.ceph.com/issues/44991
Signed-off-by: Volker Theile <vtheile@suse.com>
2020-04-08 15:02:45 +02:00
Volker Theile
a5ade11a31
Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert
monitoring: root volume full alert fires false positives

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
2020-04-06 14:17:17 +02:00
Lenz Grimmer
b6ad9a804b
Merge pull request #34240 from krig/grafana-dashboards-fixes
mgr/dashboard: Repair broken grafana panels

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Stephan Müller <smueller@suse.com>
2020-04-06 10:55:20 +02:00
Patrick Seidensal
6935dc5592 monitoring: alert for prediction of disk and pool fill up broken
Fixes: https://tracker.ceph.com/issues/44776

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 13:44:28 +01:00
Kristoffer Grönlund
b7abaab5bd dashboard: Convert FQDN to hostname in grafana panels
The $ceph_hosts variable contained the FQDN for hosts
while the instance label created by ceph only has
the hostname.

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:33:15 +01:00
Kristoffer Grönlund
136d21e21d dashboard: Resolve FQDN / hostname mismatch in hosts overview panel
In the AVG Disk Utilization panel, the result is calculated
by combining the output of node_disk_io_time_seconds_total
with the output of ceph_disk_occupation. However, the
first vector encodes the instance label with the full FQDN
while the ceph label only contains the hostname:port. In
order for these to match correctly, the domain name and port
have to be stripped from the labels.

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:33:09 +01:00
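
The normalization described in the commit above (reducing both an FQDN and a hostname:port value to the bare hostname) can be illustrated in Python. This is a sketch of the idea only; the dashboard itself works on Prometheus labels, and the helper name and sample values below are hypothetical.

```python
def short_hostname(instance: str) -> str:
    """Strip an optional ':port' suffix and any domain part from a label value.

    Assumes the value is a hostname; IP addresses would need separate handling.
    """
    host = instance.split(":", 1)[0]   # drop ':port' if present
    return host.split(".", 1)[0]       # drop the domain if present


if __name__ == "__main__":
    # Hypothetical label values in the two formats the commit mentions.
    print(short_hostname("ceph-node1.example.com"))
    print(short_hostname("ceph-node1:9283"))
```

Once both sides are reduced to the same short name, the two vectors can be matched on that label.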
Kristoffer Grönlund
8b61b8d3d7 dashboard: Use exported_instance to identify OSDs
When moving to LVM-based ceph-volume setups, several
grafana dashboards stopped working. The problem is that
(device, instance) no longer results in unique labels
which causes errors like:

"many-to-many matching not allowed: matching labels must be unique on one side"

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:33:01 +01:00
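
The "many-to-many matching" error quoted in the commit above arises when the labels used for matching do not identify a single series on either side. A minimal Python sketch of that uniqueness check follows; the sample series and the `ceph_daemon` label used here are illustrative assumptions, not data taken from the commit.

```python
from collections import Counter


def duplicate_keys(series, labels):
    """Return the label-value tuples that occur on more than one series."""
    counts = Counter(tuple(s[name] for name in labels) for s in series)
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical samples: two OSDs reported against the same device and host.
    samples = [
        {"device": "sda", "instance": "host1", "ceph_daemon": "osd.0"},
        {"device": "sda", "instance": "host1", "ceph_daemon": "osd.1"},
    ]
    print(duplicate_keys(samples, ("device", "instance")))  # non-unique key
    print(duplicate_keys(samples, ("ceph_daemon",)))        # unique key
```

Any non-empty result means the chosen labels cannot be used on the "one" side of a match, which is why a more specific label is needed.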
Kristoffer Grönlund
4444333243 dashboard: AVG RAM Utilization panel always showed "N/A"
The references to `$osd_hosts` etc. were encoded as
`[[osd_hosts]]` in the PromQL expression divisor, and
the panel always displayed N/A as the result of the
query.

Replacing the `[[...]]` with `$...` makes the expression
work again.

Fixes: https://tracker.ceph.com/issues/44784
Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>
2020-03-27 12:32:52 +01:00
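
The substitution described in the commit above, replacing `[[name]]` references with `$name`, can be sketched with a regular expression. This is an illustrative helper, not the tool used to edit the dashboards, and the metric name in the example expression is made up.

```python
import re

# Matches the older [[variable]] reference syntax.
_LEGACY_VAR = re.compile(r"\[\[(\w+)\]\]")


def modernize_variables(expr: str) -> str:
    """Rewrite every [[name]] reference in `expr` as $name."""
    return _LEGACY_VAR.sub(r"$\1", expr)


if __name__ == "__main__":
    # Hypothetical expression; only the variable syntax matters here.
    print(modernize_variables('some_metric{instance=~"[[osd_hosts]]"}'))
```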
Patrick Seidensal
f8e347f771 monitoring: root volume full alert fires false positives
Fixes: https://tracker.ceph.com/issues/44780

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 11:06:08 +01:00
Kefu Chai
a12f9f19e0
Merge pull request #32749 from james58899/fix-capacity
monitoring: Fix pool capacity incorrect

Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
2020-03-27 16:13:29 +08:00
Enno Gotthold
9707cb30cb
mgr/dashboard: Add grafana chart for rbd image details
Fixes: https://tracker.ceph.com/issues/44623
Signed-off-by: Enno Gotthold <egotthold@suse.de>

By default this dashboard will be empty, like the already existing
dashboard with the summary for all rbd images.
2020-03-26 08:21:30 +01:00
Alfonso Martínez
1f0cddfafc monitoring: fix RGW grafana chart 'Average GET/PUT Latencies'
Fixes: https://tracker.ceph.com/issues/44538
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2020-03-10 12:05:26 +01:00
Patrick Seidensal
1794b55e64 monitoring: restore lost pool full alert
Fixes: https://tracker.ceph.com/issues/44366

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-02 11:43:03 +01:00
James Cheng
1b980ef88c
monitoring: Fix pool capacity incorrect
Signed-off-by: James Cheng <james59988@gmail.com>
2020-02-18 19:19:13 +08:00
Avan Thakkar
dd8cb9d2d6 mgr/dashboard: UI fixes
Fixes: https://tracker.ceph.com/issues/42914

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2020-02-10 22:57:57 +05:30
Aleksei Zakharov
a37cf380ad mgr/grafana: sum pg states for cluster
Also, revert table formatting.

Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>
2020-01-29 17:28:36 +03:00
Aleksei Zakharov
4eb58f7ccc monitoring/grafana,prometheus: add per-pool pg states support
Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>
2020-01-29 17:28:36 +03:00
Patrick Seidensal
fb51c589b5 monitoring: add details to Prometheus' alerts
Fixes: https://tracker.ceph.com/issues/43764

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-01-24 14:21:31 +01:00
Jan Fajerski
e098536acc
Merge pull request #32325 from Kriechi/fix-42982
monitoring: fix prometheus alert for full pools
2020-01-20 10:42:36 +01:00
Bryan Stillwell
8eafb09acb Switch spelling of utilization
Prefer the non-British spelling of utilization since that's what the majority
of the code base seems to use.

Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>
2020-01-07 16:57:36 -07:00
Thomas Kriechbaumer
9abddc0dd3 monitoring: fix prometheus alert for full pools
The existing alert (introduced via
https://tracker.ceph.com/issues/24977) already triggers while 50%
of storage space is still available.

Fixes: https://tracker.ceph.com/issues/42982
Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>
2019-12-18 15:04:51 +01:00
Lenz Grimmer
11a1708e19
mgr/dashboard: grafana charts match time picker selection. (#31964)
mgr/dashboard: grafana charts match time picker selection.

Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Laura Paduano <lpaduano@suse.com>
Reviewed-by: Patrick Seidensal <pnawracay@suse.com>
2019-12-03 17:09:00 +00:00
Alfonso Martínez
5ba114330e mgr/dashboard: grafana charts match time picker selection.
Fixes: https://tracker.ceph.com/issues/43097
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
2019-12-03 14:15:10 +01:00
Ernesto Puerta
1182073f0c
mgr/dashboard,grafana: remove shortcut menu
Remove shortcut menu (links) and add check in grafana CI script.

Fixes: https://tracker.ceph.com/issues/43091
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2019-12-03 10:21:35 +01:00
Patrick Seidensal
d262adeb21 monitoring: fix indentation of ceph default alerts
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:40:55 +01:00
Patrick Seidensal
e923af3430 monitoring: wait before firing osd full alert
Fixes: https://tracker.ceph.com/issues/42862

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:39:27 +01:00
Radu Toader
3beaf63761
mgr/dashboard: fix grafana dashboards
Fixes: https://tracker.ceph.com/issues/42542

Sort order was wrong for some dashboards,
fixed empty / buggy Top 3 clients IOPS by pool / Throughput - in Pools
Overall performance
fixed Avg utilization Multiple series found - in Host Overall
performance
Fixed invalid dimensions for plot - in OSD Overall performance

Signed-off-by: Radu Toader <radu.m.toader@gmail.com>
2019-10-30 11:03:03 +02:00
Volker Theile
8e6838c740 monitoring: SNMP OID per every Prometheus alert rule
Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md.

Example OID:
1.3.6.1.4.1.50495.15.1.2.2.1

All alert rule OIDs are located below the object identifier 15 (15 for p, which is the first character of prometheus). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details.

Signed-off-by: Volker Theile <vtheile@suse.com>
2019-05-28 09:59:50 +02:00
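As a rough illustration of the OID layout described in the commit above, the following Python sketch splits the example OID into its fixed base and its trailing sub-identifiers. The function and constant names are hypothetical and not part of Ceph or prometheus-webhook-snmp. The only facts taken from the commit message are the enterprise number 50495, the sub-identifier 15, and the example OID; 1.3.6.1.4.1 is the standard private enterprises arc. The meaning of the trailing components is defined by the linked MIB and is not interpreted here.

```python
# Hypothetical helper, for illustration only.
PRIVATE_ENTERPRISES = (1, 3, 6, 1, 4, 1)  # iso.org.dod.internet.private.enterprises
CEPH_ENTERPRISE = 50495                   # Ceph enterprise number (from the commit message)
PROMETHEUS_SUBTREE = 15                   # subtree holding all alert rule OIDs

ALERT_BASE = PRIVATE_ENTERPRISES + (CEPH_ENTERPRISE, PROMETHEUS_SUBTREE)


def split_alert_oid(oid: str) -> tuple:
    """Return the sub-identifiers that follow the alert base OID.

    Raises ValueError if the OID is not located below the alert base.
    """
    parts = tuple(int(p) for p in oid.strip(".").split("."))
    if parts[:len(ALERT_BASE)] != ALERT_BASE:
        raise ValueError(f"{oid} is not below {'.'.join(map(str, ALERT_BASE))}")
    return parts[len(ALERT_BASE):]


if __name__ == "__main__":
    # Example OID taken from the commit message.
    print(split_alert_oid("1.3.6.1.4.1.50495.15.1.2.2.1"))
```

Checking the prefix before slicing rejects OIDs that belong to a different enterprise or subtree, rather than silently returning their tail.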
Jan Fajerski
e7a4437fdc monitoring: update Grafana dashboards
Fix various panels that used outdated metric names or clunky or
unnecessary label_replace calls. Also unify the style of many panels.

Fixes: http://tracker.ceph.com/issues/39652

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-05-14 13:47:55 +02:00
Jan Fajerski
c0e58bd8ae monitoring: add a few prometheus alerts
Alerts are from
https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
but updated for the mgr module and node_exporter >= 0.15.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-04-26 11:21:39 +02:00
Jan Fajerski
287e209351 monitoring/grafana: fix typo in README
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-04-16 14:19:51 +02:00
Neha Gupta
739fdbad37 mgr/dashboard: Fixed performance details context for host list row selection
Fixes: http://tracker.ceph.com/issues/37854

Signed-off-by: Neha Gupta <gnehapk@gmail.com>
2019-01-18 13:36:49 +09:00
Jason Dillaman
f4ac899950 monitoring/grafana: new RBD overview dashboard page
This page pulls RBD stats from the Nautilus prometheus exporter.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
2019-01-11 16:41:46 -05:00
Boris Ranto
1ade714910 cmake: Support grafana dashboard installation
We are currently hosting the grafana dashboards in our repo but we do
not install them. This patch adds the cmake support.

Signed-off-by: Boris Ranto <branto@redhat.com>
2018-10-25 17:09:02 +02:00
Lenz Grimmer
94aefee3b0
Merge pull request #24314 from rhcs-dashboard/dashboards
mgr/dashboard: Grafana dashboard updates and additions

Reviewed-by: Boris Ranto <branto@redhat.com>
2018-10-19 12:42:23 +02:00
Paul Cuzner
a848411bd8 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
a99618ce41 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
b64289ca3d MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
5432470914 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
bc5eea09c8 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
ba1a3b3a09 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
f97fee3a83 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
02b5414d19 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
7c04098e68 MGR/dashboard: make grafana datasource selectable
Grafana dashboard updated to use a templating
variable for the datasource

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
2c346efd12 Fix linewidth issue in pools overview dashboard
Linewidth was set to two, but the idea is that
a linewidth of >1 is reserved for eye-catcher
plot lines like maximums

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
b84f0ce45f Refresh of the dashboards
Fixes some minor anomalies and tested against
node_exporter 0.15 and 0.16

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
7d97bb28a8 Updated requirements information
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
0e655f8400 Added new Overview dashboards
These new dashboard definitions provide the high
level views for the hosts in the cluster and the
OSDs.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
4292a7a357 Screenshots added for all dashboards
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
3c7c32f2ed Add Host level details dashboard
The host-details.json file provides a view of host
level metrics. The panels are arranged in two
rows:
Overview: CPU/RAM/Network related stats
OSD Performance: OSD physical drive stats

The overview row is shown by default. Click on
the OSD Performance row to show the remaining
graphs

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Paul Cuzner
a0d9325c4d Document the current state of the dashboards
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:26:08 +13:00
Paul Cuzner
8ebf2ede7f Initial grafana dashboard definitions
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-10-09 08:23:39 +13:00
Maxime
68b044a75e [grafana] Fix OSD Capacity Utilization graph
Signed-off-by: Maxime <maxime@root314.com>
2018-10-04 13:44:12 +02:00
Jan Fajerski
7e7ae7a0fe add monitoring subdir and Grafana cluster dashboard
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2018-05-07 14:25:29 +02:00