Mixin is a way to bundle dashboards, prometheus rules and alerts into
jsonnet package. Shifting to mixin will allow easier integration with
monitoring automation that some users may use.
This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. Prometheus alerts
was also converted to Jsonnet using an automated way (from yaml to json
to jsonnet). This commit minimises any change made to the generated files
and should not change neithers the dashboards nor the Prometheus alerts.
In the future some configuration will also be added to jsonnet to add
more functionalities to the dashboards or alerts (i.e.: multi cluster).
Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
Fix issues with PromQL expressions and vector matching with the
`ceph_disk_occupation` metric.
As it turns out, `ceph_disk_occupation` cannot simply be used as
expected, as there seem to be some edge cases for users that have
several OSDs on a single disk. This leads to issues which cannot be
approached by PromQL alone (many-to-many PromQL erros). The data we
have expected is simply different in some rare cases.
I have not found a sole PromQL solution to this issue. What we basically
need is the following.
1. Match on labels `host` and `instance` to get one or more OSD names
from a metadata metric (`ceph_disk_occupation`) to let a user know
about which OSDs belong to which disk.
2. Match on labels `ceph_daemon` of the `ceph_disk_occupation` metric,
in which case the value of `ceph_daemon` must not refer to more than
a single OSD. The exact opposite to requirement 1.
As both operations are currently performed on a single metric, and there
is no way to satisfy both requirements on a single metric, the intention
of this commit is to extend the metric by providing a similar metric
that satisfies one of the requirements. This enables the queries to
differentiate between a vector matching operation to show a string to
the user (where `ceph_daemon` could possibly be `osd.1` or
`osd.1+osd.2`) and to match a vector by having a single `ceph_daemon` in
the condition for the matching.
Although the `ceph_daemon` label is used on a variety of daemons, only
OSDs seem to be affected by this issue (only if more than one OSD is run
on a single disk). This means that only the `ceph_disk_occupation`
metadata metric seems to need to be extended and provided as two
metrics.
`ceph_disk_occupation` is supposed to be used for matching the
`ceph_daemon` label value.
foo * on(ceph_daemon) group_left ceph_disk_occupation
`ceph_disk_occupation_human` is supposed to be used for anything where
the resulting data is displayed to be consumed by humans (graphs, alert
messages, etc).
foo * on(device,instance)
group_left(ceph_daemon) ceph_disk_occupation_human
Fixes: https://tracker.ceph.com/issues/52974
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
Some of the expressions modified in c40290390d7 were not covered by any tests,
especially those in the `radosgw-detail.json` dashboard.
This commit fills in those gaps.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*`
metrics, we need to update Grafana dashboards get that label back from
`ceph_rgw_metadata` using this type of construct:
```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Provide the details pulled from Bluestore stats in order to display the onode hit/miss counters
Fixes: https://tracker.ceph.com/issues/53577
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor cephfs dashboards using grafonnet
Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor osds dashboards using grafonnet
Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor pools dashboards using grafonnet
Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor rbd dashboards using grafonnet
Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor radosgw dashboards using grafonnet
Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
This PR intends to refactor hosts dashboards using grafonnet
Fixes:https://tracker.ceph.com/issues/52777
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
so we don't build this target when running "make", and hence avoid
accessing the internet in a building envronment where the internest
access is not allowed.
Signed-off-by: Kefu Chai <kchai@redhat.com>
when download/building grafonnet-lib, dpdk, spdk, liburing and fio,
they dump lots of output during configuration and building phrases,
all of which is irrelevant to us. so let's just silence it.
Signed-off-by: Kefu Chai <kchai@redhat.com>
The value we get is a perunit, so the limits and the max value should
be over 1, not 100. Note that the value being shown was correct, it
was the gauge that was not showing the correct indicators.
Signed-off-by: David Caro <david@dcaro.es>
The `instance` label is only useful if
- the exporter returns only data about its node or instance
- the exporter provides an instance label and then may return data about
other nodes
In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, so not only data related
to the node (or instance) the mgr module is running on. It is
completely irrelevant on which node the exporter runs on, the data
provided doesn't change. The exporter also doesn't provide `instance`
labels (which Prometheus wouldn't change due to our configuration, see
"honor_labels" setting).
(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)
Note that keeping that instance label on this particular dashboard would
enable the user to switch between a previously failed mgr instance and
the data collected from there and the currently running mgr instance (on
which the Prometheus mgr module runs on). That'd split the data, which
I don't think is a useful feature, but rather looks broken.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
The health status widget doesn't show any status because it requires its
query to return a single result. But in case a mgr instance had failed,
it would return more, provided the incident has happened in the
requested time frame.
This is simply an issue of the `instant` switch being disabled for that
widget. As only one mgr instance can ever be providing data at a time,
enabling `instant` completely solves that issue.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>