142 Commits

Author SHA1 Message Date
Ernesto Puerta
978d5829f2
Merge pull request #44294 from rhcs-dashboard/feature-bluestore-onode
mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-11 11:24:21 +01:00
Aashish Sharma
15aa4dffa9 mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard
Provide the details pulled from Bluestore stats in order to display the onode hit/miss counters

Fixes: https://tracker.ceph.com/issues/53577
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-01-05 14:22:53 +05:30
Ernesto Puerta
cdc9f742df
Merge pull request #44190 from rhcs-dashboard/grafana-regex
monitoring/grafana: improve grafana unit tests variable substitution

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-21 17:58:17 +01:00
Pere Diaz Bou
bbbdf8e6a2 monitoring/grafana: doctest util regex
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-15 09:36:08 +01:00
Pere Diaz Bou
2286ddc1c2 monitoring/grafana: rename tox promql test
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-14 09:36:23 +01:00
Pere Diaz Bou
5ebdb746e8 monitoring/grafana: improve grafana unit tests variable substitution
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-14 09:36:23 +01:00
Ernesto Puerta
d10b0b7e72
mgr/dashboard: disable Promql test in ARM
Temporarily disable this test while debugging the issue (since https://github.com/ceph/ceph/pull/43669
originally passed the ARM check).

Fixes: https://tracker.ceph.com/issues/53451
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2021-12-13 20:20:44 +01:00
Avan Thakkar
8d83126e51 mgr/dashboard: introduce HAProxy metrics for RGW
Fixes: https://tracker.ceph.com/issues/53311
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-12-09 20:03:03 +05:30
Pere Diaz Bou
44d3e4c264 monitoring/grafana: Grafana query tester
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-11-16 10:30:49 +01:00
Paul Cuzner
7ffcbd7f79 mgr/prometheus: Update rule format and enhance SNMP support
Rules now adhere to the format defined by Prometheus.io.
This changes alert naming, and each alert now includes a
summary description to provide a quick one-liner.

In addition to the reformatting, some missing alerts for MDS and
cephadm have been added, along with corresponding tests.

The MIB has also been refactored, so it now passes standard
lint tests, and a README is included for devs to understand the
OID schema.

Fixes: https://tracker.ceph.com/issues/53111

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-11-05 11:24:25 +13:00
Sebastian Wagner
aae2ea3897
Merge pull request #43293 from pcuzner/granular-alerts
mgr/prometheus: expose ceph healthchecks as metrics

Reviewed-by: Boris Ranto <branto@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-29 00:23:24 +02:00
Pere Diaz Bou
e1bc6f24ff monitoring: ethernet bonding filter in Network Load
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-10-27 09:08:20 +02:00
Paul Cuzner
37b82b8793 mgr/prometheus: remove cmake tests
Temporary removal of the cmake test integration

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-27 09:58:17 +13:00
Sebastian Wagner
b830c555d2 monitoring/prometheus: Add cmake integration
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-22 13:37:31 +13:00
Paul Cuzner
4750ac0d77 mgr/prometheus: add test cases and validation using tox
Focus all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. tox tests also use
promtool if available to provide rule checks and unittest runs.

In addition to these checks, a validate_rules script provides
format and content checks against all rules. It is also
called via tox (but can be run independently too).

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-22 13:36:40 +13:00
Paul Cuzner
e0dfc02063 mgr/prometheus: track individual healthchecks as metrics
This patch creates a health history object maintained in
the module's kvstore.  The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear

In addition to the new commands, the additional metrics
have been used to update the prometheus alerts

Fixes: https://tracker.ceph.com/issues/52638

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-22 13:32:39 +13:00
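
As a rough illustration of the idea in commit e0dfc02063 above (not the actual mgr/prometheus code), the sketch below keeps a history of health checks that have been seen and emits one sample per check: 1 while the check is active, 0 once it has cleared. The class name, the metric name `healthcheck_active`, and the data layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HealthHistory:
    """Hypothetical stand-in for a health history kept in a key/value store."""
    seen: dict = field(default_factory=dict)  # check name -> polls seen active

    def update(self, active_checks):
        # Record every currently active check in the history.
        for name in active_checks:
            self.seen[name] = self.seen.get(name, 0) + 1

    def clear(self):
        # Mirrors the idea of a "history clear" command: forget everything.
        self.seen.clear()

    def metrics(self, active_checks):
        # One sample per known check: 1 if active now, 0 if only in history.
        active = set(active_checks)
        return {
            f'healthcheck_active{{name="{name}"}}': int(name in active)
            for name in sorted(self.seen)
        }


history = HealthHistory()
history.update(["OSD_DOWN", "MON_DOWN"])   # first poll: two checks active
history.update(["OSD_DOWN"])               # second poll: MON_DOWN has cleared
samples = history.metrics(["OSD_DOWN"])
```

Keeping cleared checks in the history is what lets a per-check metric drop to 0 instead of disappearing, which is easier to alert on.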
Aashish Sharma
ed954b0e6c mgr/dashboard: monitoring: grafonnet refactoring for cephfs dashboards
This PR intends to refactor cephfs dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:36:31 +05:30
Aashish Sharma
e490e2f3ab mgr/dashboard: monitoring: grafonnet refactoring for osds dashboards
This PR intends to refactor osds dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:13:50 +05:30
Aashish Sharma
8c48821c21 mgr/dashboard: monitoring: grafonnet refactoring for pools dashboards
This PR intends to refactor pools dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:10:56 +05:30
Aashish Sharma
e737aaa000 mgr/dashboard: monitoring: grafonnet refactoring for rbd dashboards
This PR intends to refactor rbd dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:09:04 +05:30
Aashish Sharma
eb01954cd9 mgr/dashboard: monitoring: grafonnet refactoring for radosgw dashboards
This PR intends to refactor radosgw dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 11:57:28 +05:30
Ernesto Puerta
19535b1d0e
Merge pull request #43469 from rhcs-dashboard/hosts-grafana-dashboards
mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-18 17:14:03 +02:00
Ernesto Puerta
9b40c9df26
Merge pull request #43377 from rhcs-dashboard/fix-clients-connection-query
mgr/dashboard: replace "Ceph-cluster" Client connections with active-standby MGRs

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-13 13:37:51 +02:00
Sebastian Wagner
53382d70eb
Merge pull request #43274 from pcuzner/add-mib
monitoring:Adding the Ceph MIB

Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
2021-10-12 22:29:06 +02:00
Aashish Sharma
f7714de294 mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards
This PR intends to refactor hosts dashboards using grafonnet

Fixes: https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-12 11:05:02 +05:30
Avan Thakkar
d388c5e958 mgr/dashboard: replace Client connections with active-stdby mgrs
Fixes: https://tracker.ceph.com/issues/52121
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-10-11 21:53:23 +05:30
Paul Cuzner
b96aa5d184 monitoring:Updated README
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-10-06 14:32:47 +13:00
Ernesto Puerta
ba9e17d2d2
Merge pull request #43132 from p-se/monitoring-grafana-piechart-update
monitoring: update grafana-piechart-panel plugin

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-09-28 18:37:45 +02:00
Paul Cuzner
f9213ad9cf monitoring:Adding the Ceph MIB
The ceph MIB has been created and maintained in a
separate repo:
https://github.com/SUSE/prometheus-webhook-snmp

This patch brings this MIB into the main ceph repo, so
alert changes can target prometheus and potentially
SNMP environments within the same PR.

Kudos to Volker Theile for creating the MIB.

Fixes: https://tracker.ceph.com/issues/52708

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2021-09-23 11:06:19 +12:00
Patrick Seidensal
af94237621
monitoring: update grafana-piechart-panel plugin
Fixes: https://tracker.ceph.com/issues/51211

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-09-10 15:28:17 +02:00
Aashish Sharma
58d635455d mgr/dashboard: Incorrect MTU mismatch warning
The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue.

Fixes: https://tracker.ceph.com/issues/52028
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-09-02 15:34:36 +05:30
Kefu Chai
1835fd86dd cmake: exclude "grafonnet-lib" target from "all"
so we don't build this target when running "make", and hence avoid
accessing the internet in a build environment where internet
access is not allowed.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-08-20 22:50:42 +08:00
Kefu Chai
1fdd632d0c cmake: silence build output when building external deps
when downloading/building grafonnet-lib, dpdk, spdk, liburing and fio,
they dump lots of output during the configuration and building phases,
all of which is irrelevant to us. so let's just silence it.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-08-16 21:27:57 +08:00
Ernesto Puerta
559afae0b9
Merge pull request #41570 from jhrcz-ls/wip-cephfs-overview-use-rate
mgr/dashboard: cephfs MDS Workload to use rate for counter type metric
2021-08-12 20:53:07 +02:00
Aashish Sharma
4907c78bb7 mgr/dashboard: fix grafonnet build error
This PR intends to fix the issue caused by #42194

Fixes: https://tracker.ceph.com/issues/52238
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-08-12 17:48:33 +05:30
Ernesto Puerta
afadfede0d
Merge pull request #42194 from rhcs-dashboard/add-grafonnet-grafana
mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based code
2021-08-11 18:11:59 +02:00
Aashish Sharma
e9bd94515f mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based Code
This PR intends to add grafonnet to generate grafana JSON files

Fixes: https://tracker.ceph.com/issues/45184
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-08-11 19:23:54 +05:30
Ernesto Puerta
cc6b18a92c
Merge pull request #41880 from david-caro/fix_cluster_grafana_dashboard
monitoring/grafana/cluster: use per-unit max and limit values

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-08-02 13:03:46 +02:00
Jan Horáček
5bf516dcc7 [mgr/dashboard] cephfs metrics in MDS Workload panels to use rate because of counter type metric
Fixes: https://tracker.ceph.com/issues/51954
Signed-off-by: Jan Horacek <jan.horacek@livesport.eu>
2021-07-29 10:09:41 +02:00
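
To illustrate why commit 5bf516dcc7 above switches counter-type metrics to a rate, here is a simplified sketch of a per-second rate computed from raw counter samples. The function name and the sample data are made up for the example; Prometheus' own `rate()` additionally handles counter resets and extrapolation, which this sketch ignores.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in
    time order. Counter resets and extrapolation are ignored here.
    """
    if len(samples) < 2:
        return None  # not enough data to compute a rate
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return None
    return (v1 - v0) / (t1 - t0)


# A counter that only ever grows: the raw value is not very informative...
window = [(0, 1000), (15, 1030), (30, 1060), (45, 1090), (60, 1120)]
ops_per_second = simple_rate(window)  # ...but its rate is: 120 ops / 60 s
```

Plotting the raw counter would show an ever-rising line, while the rate shows the actual workload at each point in time.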
Seena Fallah
feb8f784d2 monitoring: fix Physical Device Latency unit
Based on the expr it should be seconds

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-07-07 17:00:30 +04:30
Ernesto Puerta
62e3a5c41c
Merge pull request #41838 from p-se/grafana-clean-up
monitoring: Clean up Grafana dashboards

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: jan--f <NOT@FOUND>
Reviewed-by: p-se <NOT@FOUND>
Reviewed-by: Paul Cuzner <pcuzner@redhat.com>
2021-06-25 20:45:28 +02:00
David Caro
c981298039
monitoring/grafana/cluster: use per-unit max and limit values
The value we get is per-unit (a fraction of 1), so the limits and the max
value should be relative to 1, not 100. Note that the value being shown was
correct; it was the gauge that was not showing the correct indicators.

Signed-off-by: David Caro <david@dcaro.es>
2021-06-16 10:38:41 +02:00
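
The mismatch described in commit c981298039 above can be sketched as follows: thresholds written in percent never trigger when the value is on a 0..1 scale. The helper names and the threshold values are hypothetical and are not taken from the dashboard itself.

```python
def to_per_unit(percent_thresholds):
    """Convert thresholds expressed in percent to the 0..1 per-unit scale."""
    return [p / 100 for p in percent_thresholds]


def gauge_zone(value, thresholds):
    """Return how many thresholds `value` has reached (0 = lowest zone)."""
    return sum(value >= t for t in thresholds)


used_fraction = 0.87            # per-unit value, i.e. 87 % used
percent_thresholds = [70, 80]   # hypothetical warning/critical limits

# Thresholds on the wrong scale: the gauge never leaves its lowest zone.
wrong_zone = gauge_zone(used_fraction, percent_thresholds)
# Thresholds on the same scale as the value: both limits are reached.
right_zone = gauge_zone(used_fraction, to_per_unit(percent_thresholds))
```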
Patrick Seidensal
037410713f
monitoring: remove instance label from ceph-cluster.json completely
The `instance` label is only useful if

- the exporter returns only data about its node or instance
- the exporter provides an instance label and then may return data about
  other nodes

In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, so not only data related
to the node (or instance) the mgr module is running on.  It is
completely irrelevant which node the exporter runs on; the data
provided doesn't change.  The exporter also doesn't provide `instance`
labels (which Prometheus wouldn't change due to our configuration, see
"honor_labels" setting).

(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)

Note that keeping that instance label on this particular dashboard would
enable the user to switch between a previously failed mgr instance and
the data collected from there and the currently running mgr instance (on
which the Prometheus mgr module runs).  That'd split the data, which
I don't think is a useful feature, but rather looks broken.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:11:30 +02:00
Patrick Seidensal
4270a13d6c
mgr/dashboard: Fix Grafana Ceph Cluster health status widget
The health status widget doesn't show any status because it requires its
query to return a single result. But in case a mgr instance had failed,
it would return more, provided the incident has happened in the
requested time frame.

This is simply an issue of the `instant` switch being disabled for that
widget. As only one mgr instance can ever be providing data at a time,
enabling `instant` completely solves that issue.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
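
The difference between a range query and an instant query that commit 4270a13d6c above relies on can be modelled with a small simulation. This is a simplified sketch, not Prometheus code: the series names and sample data are hypothetical, and real Prometheus also uses staleness markers in addition to the lookback window (5 minutes by default).

```python
LOOKBACK = 300  # seconds; Prometheus' default lookback window is 5 minutes


def range_query(series, start, end):
    """Names of all series with at least one sample in [start, end]."""
    return sorted(
        name for name, samples in series.items()
        if any(start <= ts <= end for ts, _ in samples)
    )


def instant_query(series, at):
    """Names of series that still have a recent sample at time `at`."""
    return sorted(
        name for name, samples in series.items()
        if any(at - LOOKBACK < ts <= at for ts, _ in samples)
    )


# Hypothetical health samples as (timestamp, value): mgr "a" failed at
# t=1800, after which mgr "b" took over as the exporting instance.
series = {
    'health{instance="a"}': [(t, 0) for t in range(0, 1800, 60)],
    'health{instance="b"}': [(t, 0) for t in range(1800, 3601, 60)],
}

over_last_hour = range_query(series, 0, 3600)   # both instances show up
right_now = instant_query(series, 3600)         # only the live instance
```

A widget that needs exactly one result works with `right_now` but not with `over_last_hour`, which is the situation the commit message describes.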
Patrick Seidensal
f51cab109d
mgr/dashboard: Fix decimals in OSC Capacity Utilization widget
Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
5527c1c54f
mgr/dashboard: Remove hard-coded timezone off Grafana dashboards
Remove hard-coded timezone off Grafana dashboards to enable the Grafana
administrator to decide which timezone should be used for dashboards.

If we hard-coded those values, changing the global settings in Grafana
wouldn't have an effect. And the administrators can't change the
automatically imported Grafana dashboards provided by us.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
8218d43e5f
monitoring: convert newline character to LF
Convert newline character from CRLF in `rbd-details.json` to LF, so that
it will be consistent with all the other dashboard JSON files.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
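
A line-ending conversion like the one in commit 8218d43e5f above can be sketched in a few lines. The function name and the sample content are made up for the example; the commit does not say which tool was actually used.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows-style CRLF line endings with Unix-style LF."""
    return data.replace(b"\r\n", b"\n")


original = b'{\r\n  "title": "RBD Details"\r\n}\r\n'  # hypothetical file content
converted = crlf_to_lf(original)
```

Working on bytes rather than text avoids any implicit newline translation by the file layer.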
Patrick Seidensal
a709abf8bf mgr/dashboard: deprecated variable usage in Grafana dashboards
Fixes: https://tracker.ceph.com/issues/50059

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-07 14:31:53 +02:00
Dan Mick
de491c128a monitoring/grafana/build/Makefile: work around buildah bug
Work around https://github.com/containers/buildah/issues/3253
by pushing to a local OCI-format image to clear out the erroneously left
'parent' field in buildah commit --squash output.  Can be removed
when the fix for the above is available.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
b56ff43232 monitoring/grafana/build/Makefile: use --authfile
podman login caches auth tokens in auth.json; for sudo, it may be
placed in /run/containers/0 or it may be in /run/users/0/containers;
the latter directory is removed when root "logs out", and it isn't
clear what that means with sudo/su.  Several builds failed because
they couldn't find the cached auth between sudo podman login and sudo
podman push.  Sidestep the confusion by just using a local file for
the auth cache.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00