Rendering the dashboards with showMultiCluster=True allows for
them to work with multiple clusters storing their metrics in a single
Prometheus instance. This works via the cluster label and that functionality
already existed. This just fixes some inconsistencies in applying the label
filters.
Additionally this contains updates to the tests so that they succeed
with both configurations and guard against future regressions in
multi-cluster support.
There also are some consistency cleanups here and there:
* `datasource` was not used consistently
* `cluster` label_values are determined from `ceph_health_status`
* `job` template and filters on this label were removed to align multi cluster
support solely via the `cluster` label
* `ceph_hosts` filter now uses label_values from any ceph_metadata metric
to not show all instance values, but only those of hosts with some Ceph
component / daemon.
* Enable showMultiCluster=True since `cluster` label is now always present,
via https://github.com/ceph/ceph/pull/54964
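
As a rough illustration of the label filtering described above, the sketch below adds a `cluster` matcher to a metric selector only when multi-cluster mode is on. The helper name `with_cluster_matcher` and its parameters are hypothetical and are not taken from the dashboard sources.

```python
def with_cluster_matcher(metric, matchers=(), show_multi_cluster=True):
    """Build a PromQL selector string for a dashboard panel.

    Hypothetical helper for illustration only; not part of the actual mixin.
    """
    parts = list(matchers)
    if show_multi_cluster:
        # $cluster is the Grafana template variable for the cluster label.
        parts.insert(0, 'cluster=~"$cluster"')
    if not parts:
        return metric
    return "%s{%s}" % (metric, ", ".join(parts))


print(with_cluster_matcher("ceph_osd_up"))
print(with_cluster_matcher("ceph_osd_up", show_multi_cluster=False))
```

With the flag off, the selector carries no `cluster` matcher at all, which is the single-cluster behaviour the tests are meant to keep working.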
Improves: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
'timeseries' panel
The graph panel type is deprecated, and disappears after Grafana v9.1 (current version is 10.0) to prevent more old type panels being created. These should be migrated to the timeseries panel type, to avoid potential problems with future Grafana versions.
Fixes: https://tracker.ceph.com/issues/61720
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
mgr/dashboard: Show the OSDs Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard
Reviewed-by: Nizamudeen A <nia@redhat.com>
Those RBD IO statistics graphs are empty out of the box and it's on
purpose. Instead of giving an impression that those graphs are broken,
point users to the documentation that explains the optional steps to
enable those statistics.
https://docs.ceph.com/en/latest/mgr/prometheus/#rbd-io-statistics
Signed-off-by: Nobuto Murata <nobuto.murata@canonical.com>
Currently the alert generator is broken if you try to run `tox
-ealerts-fix`. I fixed it and ran the command and it built a new json
file as well.
Signed-off-by: Nizamudeen A <nia@redhat.com>
After upgrading from RHCS4 to RHCS5, some of the Grafana charts broke.
This is because RHCS5 does not generate a metric when its value is
zero; the resulting null value breaks the Grafana charts or graphs.
This PR fixes that issue.
Fixes: https://tracker.ceph.com/issues/63088
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
monitoring: grafana mons out of quorum should be count - sum
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
The 'prometheus_alerts.yml' file should not be edited directly.
Changes should be added to the 'prometheus_alerts.libsonnet' file
(and/or any other appropriate lib/jsonnet files) and generated
using the 'make generate' command.
Adding all the changes to 'prometheus_alerts.libsonnet' file and
building/generating the prometheus_alerts YAML file.
PS: all the changes seen in the 'prometheus_alerts.yml' file are due
to the re-arrangement of lines. The file remains the same.
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
Made the following changes to files:
Makefile:
Add needed 'tox' target to generate alert files
Now we can do 'make generate' OR 'make test'
to generate all the yaml files (and run tests)
alerts.jsonnet:
Added an 'import' line to include 'config.libsonnet' file.
This fixes the errors in generating the 'prometheus_alerts.yml' file
tox.ini:
Added all the existing 'alerts-' targets to 'envlist'
Added the missing 'alerts-test' target to 'testenv'
Added 'jsonnet' to 'allowlist_externals', which prevents a
deprecation warning
A minor spelling correction
lint-jsonnet.sh:
Made errors more verbose.
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
not count / sum
For example, with 3 mons total, all in quorum, original
will do 3/3 = 1, showing 1 out of quorum (likely typo fix)
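
The arithmetic can be checked with a small sketch. It assumes one 0/1 quorum flag per monitor (1 = in quorum); the function names are hypothetical and only mirror the two expressions being compared.

```python
def out_of_quorum_fixed(quorum_flags):
    """count - sum: number of monitors that are not in quorum."""
    return len(quorum_flags) - sum(quorum_flags)


def out_of_quorum_original(quorum_flags):
    """count / sum: the original expression, kept for comparison."""
    return len(quorum_flags) / sum(quorum_flags)


all_in = [1, 1, 1]   # 3 mons total, all in quorum
one_out = [1, 1, 0]  # 3 mons total, one out of quorum

print(out_of_quorum_fixed(all_in))     # 0
print(out_of_quorum_original(all_in))  # 1.0
print(out_of_quorum_fixed(one_out))    # 1
```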
Fixes: https://tracker.ceph.com/issues/61923
Signed-off-By: Paul Reece <paulreece42@gmail.com>
fixing case sensitive
Signed-off-by: Paul Reece <paulreece42@gmail.com>
Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
as allowlist_externals was introduced in
tox v4.0, see
5e33fda1a4, but
this option was backported to 3.18 as an alias of whitelist_externals, so we don't need
to set minversion to 4.0 in this change.
Since we started using tox 4.0 and up (v4.0.2 specifically), tox complains
and fails like:
alerts-lint: failed with promtool is not allowed, use allowlist_externals to allow it
alerts-lint: FAIL code 1 (9.25 seconds)
see https://tox.wiki/en/latest/faq.html#tox-4-removed-tox-ini-keys
and https://tox.wiki/en/latest/config.html#allowlist_externals
It'd also be nice to use more inclusive language, so, in this change,
s/whitelist_externals/allowlist_externals/ in all tox.ini in this
project.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Prometheus reports the error `many-to-many matching not allowed: matching labels must be unique on one side` for CephPoolGrowthWarning if the same pool IDs exist on two different instances.
Fixes: https://tracker.ceph.com/issues/58017
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
Only use `instance` to query for hostnames in single-cluster mode.
Consider the cluster matcher only in multi-cluster mode. In this case
the query will look like:
`"label_values({cluster=~\"$cluster\"}, instance)"`.
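
A minimal sketch of that switch follows. The function name `hostname_query` is hypothetical, and the single-cluster form `label_values(instance)` is an assumption based on Grafana's template query syntax, not a quote from the dashboard sources.

```python
def hostname_query(show_multi_cluster):
    """Return the Grafana template query used to list hostnames.

    Hypothetical helper; the single-cluster string is an assumed form.
    """
    if show_multi_cluster:
        # Multi-cluster mode: restrict instances to the selected cluster.
        return 'label_values({cluster=~"$cluster"}, instance)'
    # Single-cluster mode: no cluster matcher, only the instance label.
    return "label_values(instance)"


print(hostname_query(True))
print(hostname_query(False))
```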
Fixes: https://tracker.ceph.com/issues/57987
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
Currently there is no alert for a network interface card that is part of a
network bond being misconfigured or failed.
This could lead to redundancy and performance being degraded unnoticed.
To solve this, I use node exporter metrics to look at the number of total peers
of the bond and the ones that are active. If the numbers differ, something is up
and should be looked at.
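
The comparison can be sketched as follows, assuming the per-bond peer counts have already been read from node exporter into two dictionaries. The function name and the sample data are hypothetical.

```python
def degraded_bonds(total_peers, active_peers):
    """Return the names of bonds whose active peer count differs from the total.

    Both arguments map a bond name to a peer count (illustrative input shape).
    """
    return sorted(
        bond
        for bond, total in total_peers.items()
        if active_peers.get(bond, 0) != total
    )


total = {"bond0": 2, "bond1": 2}
active = {"bond0": 2, "bond1": 1}  # one peer of bond1 is down

print(degraded_bonds(total, active))  # ['bond1']
```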
Fixes: https://tracker.ceph.com/issues/57962
Signed-off-by: Christian Kugler <syphdias+git@gmail.com>