Commit Graph

116 Commits

Author SHA1 Message Date
Pere Diaz Bou
57c26311de monitoring/grafana: replace filestore osd count
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 14:14:41 +01:00
Pere Diaz Bou
a3cf5c5e9f monitoring/grafana: use Path class instead of split
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
1e4d85d04f monitoring/grafana: remove explicit str casting
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
2b4f3561d2 monitoring/grafana: add generated json files
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
b381a83e9b monitoring/grafana: ValueError instead of RuntimeError
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:12 +01:00
Pere Diaz Bou
4c302234ff monitoring/grafana: Replace missing legendFormat warning with error
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-18 13:24:10 +01:00
Patrick Seidensal
7d7488018e monitoring: Add unit tests for OSD panels in ceph-cluster dashboard
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Patrick Seidensal
4a6b2c1dfb monitoring: fix display ceph_osd_in in Grafana panel
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Patrick Seidensal
18d3a71618 mgr/prometheus: Fix regression with OSD/host details/overview dashboards
Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.

As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, as there seem to be some edge cases for users that have
several OSDs on a single disk.  This leads to issues which cannot be
solved by PromQL alone (many-to-many PromQL errors).  The data simply
differs from what we expected in some rare cases.

I have not found a PromQL-only solution to this issue. What we basically
need is the following.

1. Match on labels `host` and `instance` to get one or more OSD names
   from a metadata metric (`ceph_disk_occupation`) to let a user know
   about which OSDs belong to which disk.

2. Match on labels `ceph_daemon` of the `ceph_disk_occupation` metric,
   in which case the value of `ceph_daemon` must not refer to more than
   a single OSD. This is the exact opposite of requirement 1.

As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, the intention
of this commit is to extend the metric by providing a similar metric
that satisfies one of the requirements. This enables the queries to
differentiate between a vector matching operation to show a string to
the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and to match a vector by having a single `ceph_daemon` in
the condition for the matching.

Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk).  This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.

`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.

    foo * on(ceph_daemon) group_left ceph_disk_occupation

`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).

    foo * on(device,instance)
    group_left(ceph_daemon) ceph_disk_occupation_human
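
To illustrate the split described above, here is a minimal Python sketch
of how the two sets of series could be derived from the same OSD-to-disk
mapping.  It is not the mgr/prometheus implementation: the function name
`split_occupation` and the input tuple layout are assumptions made for
this example, and only the `ceph_daemon`, `device` and `instance` labels
from the queries above are modelled.

```python
from collections import defaultdict


def split_occupation(rows):
    """Build both label sets from (ceph_daemon, device, instance) tuples.

    Hypothetical helper for illustration only.
    """
    # ceph_disk_occupation: one series per OSD, so ceph_daemon is always
    # a single name and on(ceph_daemon) matching stays one-to-one.
    per_daemon = [
        {"ceph_daemon": daemon, "device": device, "instance": instance}
        for daemon, device, instance in rows
    ]

    # ceph_disk_occupation_human: one series per (device, instance); all
    # OSDs sharing that disk are joined into a display string.
    grouped = defaultdict(list)
    for daemon, device, instance in rows:
        grouped[(device, instance)].append(daemon)
    human = [
        {
            "ceph_daemon": "+".join(sorted(daemons)),
            "device": device,
            "instance": instance,
        }
        for (device, instance), daemons in sorted(grouped.items())
    ]
    return per_daemon, human
```

With two OSDs sharing `/dev/sda`, the first list keeps one entry per OSD,
while the second collapses them into a single entry labelled
`osd.1+osd.2`.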

Fixes: https://tracker.ceph.com/issues/52974

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2022-01-13 13:27:55 +01:00
Ernesto Puerta
978d5829f2
Merge pull request #44294 from rhcs-dashboard/feature-bluestore-onode
mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: neha-ojha <NOT@FOUND>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-01-11 11:24:21 +01:00
Aashish Sharma
15aa4dffa9 mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard
Provide the details pulled from BlueStore stats in order to display the onode hit/miss counters.

Fixes: https://tracker.ceph.com/issues/53577
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-01-05 14:22:53 +05:30
Ernesto Puerta
cdc9f742df
Merge pull request #44190 from rhcs-dashboard/grafana-regex
monitoring/grafana: improve grafana unit tests variable substitution

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-21 17:58:17 +01:00
Pere Diaz Bou
bbbdf8e6a2 monitoring/grafana: doctest util regex
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-15 09:36:08 +01:00
Pere Diaz Bou
2286ddc1c2 monitoring/grafana: rename tox promql test
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-14 09:36:23 +01:00
Pere Diaz Bou
5ebdb746e8 monitoring/grafana: improve grafana unit tests variable substitution
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-12-14 09:36:23 +01:00
Ernesto Puerta
d10b0b7e72
mgr/dashboard: disable Promql test in ARM
Temporarily disable this test while debugging the issue (since https://github.com/ceph/ceph/pull/43669
originally passed the ARM check).

Fixes: https://tracker.ceph.com/issues/53451
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
2021-12-13 20:20:44 +01:00
Avan Thakkar
8d83126e51 mgr/dashboard: introduce HAProxy metrics for RGW
Fixes: https://tracker.ceph.com/issues/53311
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-12-09 20:03:03 +05:30
Pere Diaz Bou
44d3e4c264 monitoring/grafana: Grafana query tester
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-11-16 10:30:49 +01:00
Pere Diaz Bou
e1bc6f24ff monitoring: ethernet bonding filter in Network Load
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2021-10-27 09:08:20 +02:00
Aashish Sharma
ed954b0e6c mgr/dashboard: monitoring: grafonnet refactoring for cephfs dashboards
This PR intends to refactor cephfs dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:36:31 +05:30
Aashish Sharma
e490e2f3ab mgr/dashboard: monitoring: grafonnet refactoring for osds dashboards
This PR intends to refactor osds dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:13:50 +05:30
Aashish Sharma
8c48821c21 mgr/dashboard: monitoring: grafonnet refactoring for pools dashboards
This PR intends to refactor pools dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:10:56 +05:30
Aashish Sharma
e737aaa000 mgr/dashboard: monitoring: grafonnet refactoring for rbd dashboards
This PR intends to refactor rbd dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 12:09:04 +05:30
Aashish Sharma
eb01954cd9 mgr/dashboard: monitoring: grafonnet refactoring for radosgw dashboards
This PR intends to refactor radosgw dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-19 11:57:28 +05:30
Ernesto Puerta
19535b1d0e
Merge pull request #43469 from rhcs-dashboard/hosts-grafana-dashboards
mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-10-18 17:14:03 +02:00
Aashish Sharma
f7714de294 mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards
This PR intends to refactor hosts dashboards using grafonnet

Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-10-12 11:05:02 +05:30
Avan Thakkar
d388c5e958 mgr/dashboard: replace Client connections with active-stdby mgrs
Fixes: https://tracker.ceph.com/issues/52121
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2021-10-11 21:53:23 +05:30
Patrick Seidensal
af94237621
monitoring: update grafana-piechart-panel plugin
Fixes: https://tracker.ceph.com/issues/51211

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-09-10 15:28:17 +02:00
Kefu Chai
1835fd86dd cmake: exclude "grafonnet-lib" target from "all"
so we don't build this target when running "make", and hence avoid
accessing the internet in a build environment where internet
access is not allowed.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-08-20 22:50:42 +08:00
Kefu Chai
1fdd632d0c cmake: silence build output when building external deps
when downloading/building grafonnet-lib, dpdk, spdk, liburing and fio,
they dump lots of output during the configuration and building phases,
all of which is irrelevant to us. so let's just silence it.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-08-16 21:27:57 +08:00
Ernesto Puerta
559afae0b9
Merge pull request #41570 from jhrcz-ls/wip-cephfs-overview-use-rate
mgr/dashboard: cephfs MDS Workload to use rate for counter type metric
2021-08-12 20:53:07 +02:00
Aashish Sharma
4907c78bb7 mgr/dashboard: fix grafonnet build error
This PR intends to fix the issue caused by #42194

Fixes:https://tracker.ceph.com/issues/52238
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-08-12 17:48:33 +05:30
Ernesto Puerta
afadfede0d
Merge pull request #42194 from rhcs-dashboard/add-grafonnet-grafana
mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based code
2021-08-11 18:11:59 +02:00
Aashish Sharma
e9bd94515f mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based Code
This PR intends to add grafonnet to generate grafana JSON files

Fixes: https://tracker.ceph.com/issues/45184
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-08-11 19:23:54 +05:30
Ernesto Puerta
cc6b18a92c
Merge pull request #41880 from david-caro/fix_cluster_grafana_dashboard
monitoring/grafana/cluster: use per-unit max and limit values

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: p-se <NOT@FOUND>
2021-08-02 13:03:46 +02:00
Jan Horáček
5bf516dcc7 [mgr/dashboard] cephfs metrics in MDS Workload panels to use rate because of counter type metric
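
Why a counter-type metric needs `rate`: the raw value only ever grows
(until a restart resets it), so plotting it directly shows a total rather
than a workload.  The following is a simplified Python sketch of a
per-second rate over a window.  The function name `simple_rate` is made
up for this example, and it deliberately ignores the extrapolation to
the window boundaries that Prometheus' `rate()` performs.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified illustration, not Prometheus' exact algorithm.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a daemon restart), so
        # the new value itself is the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```
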
Fixes: https://tracker.ceph.com/issues/51954
Signed-off-by: Jan Horacek <jan.horacek@livesport.eu>
2021-07-29 10:09:41 +02:00
Seena Fallah
feb8f784d2 monitoring: fix Physical Device Latency unit
Based on the expr it should be seconds

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-07-07 17:00:30 +04:30
Ernesto Puerta
62e3a5c41c
Merge pull request #41838 from p-se/grafana-clean-up
monitoring: Clean up Grafana dashboards

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: jan--f <NOT@FOUND>
Reviewed-by: p-se <NOT@FOUND>
Reviewed-by: Paul Cuzner <pcuzner@redhat.com>
2021-06-25 20:45:28 +02:00
David Caro
c981298039
monitoring/grafana/cluster: use per-unit max and limit values
The value we get is per-unit (a fraction between 0 and 1), so the limits
and the max value should be relative to 1, not 100. Note that the value
being shown was correct; it was the gauge that was not showing the
correct indicators.
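
A small Python sketch of the effect, under assumed numbers: the function
`gauge_band` and the threshold values (70/80 versus 0.7/0.8) are
hypothetical and only serve to show why percent-scaled limits never
trigger for a per-unit value.

```python
def gauge_band(value, thresholds, max_value):
    """Return (band_index, fill_fraction) for a gauge reading.

    band_index counts how many thresholds the value has reached;
    fill_fraction is how far the gauge needle travels (0.0 to 1.0).
    """
    band = sum(value >= t for t in thresholds)
    fill = min(value / max_value, 1.0)
    return band, fill


# Per-unit value against percent-scaled limits: stays in band 0 and the
# gauge is almost empty, although 85% of capacity is in use.
wrong = gauge_band(0.85, thresholds=(70, 80), max_value=100)

# Same value against per-unit limits: reaches the highest band.
right = gauge_band(0.85, thresholds=(0.7, 0.8), max_value=1.0)
```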

Signed-off-by: David Caro <david@dcaro.es>
2021-06-16 10:38:41 +02:00
Patrick Seidensal
037410713f
monitoring: remove instance label from ceph-cluster.json completely
The `instance` label is only useful if

- the exporter returns only data about its node or instance
- the exporter provides an instance label and then may return data about
  other nodes

In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, so not only data related
to the node (or instance) the mgr module is running on.  It is
completely irrelevant which node the exporter runs on; the data
provided doesn't change.  The exporter also doesn't provide `instance`
labels (which Prometheus wouldn't change due to our configuration, see
"honor_labels" setting).

(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)

Note that keeping that instance label on this particular dashboard would
enable the user to switch between the data collected from a previously
failed mgr instance and the currently running mgr instance (on which the
Prometheus mgr module runs).  That'd split the data, which I don't think
is a useful feature, but rather looks broken.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:11:30 +02:00
Patrick Seidensal
4270a13d6c
mgr/dashboard: Fix Grafana Ceph Cluster health status widget
The health status widget doesn't show any status because it requires its
query to return a single result. But in case a mgr instance had failed,
it would return more, provided the incident has happened in the
requested time frame.

This is simply an issue of the `instant` switch being disabled for that
widget. As only one mgr instance can ever be providing data at a time,
enabling `instant` completely solves that issue.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
f51cab109d
mgr/dashboard: Fix decimals in OSC Capacity Utilization widget
Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
5527c1c54f
mgr/dashboard: Remove hard-coded timezone off Grafana dashboards
Remove hard-coded timezone off Grafana dashboards to enable the Grafana
administrator to decide which timezone should be used for dashboards.

If we hard-coded those values, changing the global settings in Grafana
wouldn't have an effect. And the administrators can't change the
automatically imported Grafana dashboards provided by us.

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
8218d43e5f
monitoring: convert newline character to LF
Convert newline character from CRLF in `rbd-details.json` to LF, so that
it will be consistent with all the other dashboard JSON files.
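
For reference, such a conversion can be sketched in a few lines of
Python.  The helper name `crlf_to_lf` is an assumption for this example
and not necessarily how the file was converted in this commit.

```python
from pathlib import Path


def crlf_to_lf(path):
    """Rewrite the file with LF line endings; return True if it changed."""
    original = Path(path).read_bytes()
    converted = original.replace(b"\r\n", b"\n")
    if converted != original:
        Path(path).write_bytes(converted)
    return converted != original
```

Working on bytes avoids any implicit newline translation by the
platform's text mode.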

Fixes: https://tracker.ceph.com/issues/51212

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-16 09:10:32 +02:00
Patrick Seidensal
a709abf8bf mgr/dashboard: deprecated variable usage in Grafana dashboards
Fixes: https://tracker.ceph.com/issues/50059

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-06-07 14:31:53 +02:00
Dan Mick
de491c128a monitoring/grafana/build/Makefile: work around buildah bug
Work around https://github.com/containers/buildah/issues/3253
by pushing to a local OCI-format image to clear out the erroneously left
'parent' field in buildah commit --squash output.  This can be removed
when the fix for the above is available.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
b56ff43232 monitoring/grafana/build/Makefile: use --authfile
podman login caches auth tokens in auth.json; for sudo, it may be
placed in /run/containers/0 or it may be in /run/users/0/containers;
the latter directory is removed when root "logs out", and it isn't
clear what that means with sudo/su.  Several builds failed because
they couldn't find the cached auth between sudo podman login and sudo
podman push.  Sidestep the confusion by just using a local file for
the auth cache.

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
a3b4bc73f7 monitoring/grafana/build/Makefile: cleanup, ready for jenkins
- allow env setting of versions of components
- add docker/quay username/password variables
- derive container version from grafana version
- make arch-specific tags
- expand clean target to remove container images
- remove release-specific targets, "all" target
- move push operations to separate "push" target

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
0fdbe673c8 monitoring/grafana/build/Makefile: use curl instead of wget
build machines tend to already have curl installed

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00
Dan Mick
2faadc2d5c monitoring/grafana/build/Makefile: use "sudo buildah"
Some build machines don't have /etc/sub{u,g}id set up for
so-called "rootless" (non-privileged) operation.  Use sudo
to avoid the need for "rootless".

Signed-off-by: Dan Mick <dmick@redhat.com>
2021-05-26 13:37:25 -07:00