Commit Graph

86 Commits

Author SHA1 Message Date
Aashish Sharma
2e54c9a01e mgr/dashboard: Add a new chart for replication delta per shard in rgw sync overview grafana dashboard
Fixes: https://tracker.ceph.com/issues/66994

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-07-17 15:49:15 +05:30
Nizamudeen A
d11b25ed0f
Merge pull request #56014 from badone/wip-tracker-63591-pyyaml-cython_sources
install-deps: Update Pyyaml version

Reviewed-by: Ankush Behl <cloudbehl@gmail.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2024-05-21 12:00:53 -04:00
Aashish Sharma
1622ad8f76 mgr/dashboard: fix cluster filter typo in multi-cluster-overview
grafana dashboard

Fixes: https://tracker.ceph.com/issues/65760

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-05-02 17:12:26 +05:30
Aashish Sharma
fce7c520f3
Merge pull request #56575 from cloudbehl/ceph-cluster-json-update
monitoring/ceph-mixin: Add cluster variable to ceph-cluster.json

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
2024-05-02 15:34:50 +05:30
Nizamudeen A
a8d01fff00
Merge pull request #55495 from frittentheke/issue_64321
monitoring/ceph-mixin: Cleanup of variables, queries and tests (to fix showMultiCluster=True)

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ankush Behl <cloudbehl@gmail.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2024-05-02 13:55:37 +05:30
Adam King
c6871bbaf5 monitoring/ceph-mixin: set NVMeoFMaxGatewaysPerGroup to 4
Recommendation from the nvmeof team

Signed-off-by: Adam King <adking@redhat.com>
2024-04-22 08:48:15 -04:00
Christian Rohmann
090b8e17f1 Cleanup of variables, queries and tests to enable showMultiCluster=True
Rendering the dashboards with showMultiCluster=True allows for
them to work with multiple clusters storing their metrics in a single
Prometheus instance. This works via the cluster label and that functionality
already existed. This just fixes some inconsistencies in applying the label
filters.

Additionally, this contains updates to the tests so that they succeed with
both configurations and avoid introducing regressions with
regard to multiCluster in the future.

There also are some consistency cleanups here and there:
 * `datasource` was not used consistently
 * `cluster` label_values are determined from `ceph_health_status`
 * `job` template and filters on this label were removed to align multi cluster
    support solely via the `cluster` label
 * `ceph_hosts` filter now uses label_values from any ceph_metadata metric
    to not show all instance values, but only those of hosts with some Ceph
    component / daemon.
 * Enable showMultiCluster=True since `cluster` label is now always present,
    via https://github.com/ceph/ceph/pull/54964

Improves: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
2024-04-22 08:29:37 +02:00
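
The label filtering described in commit 090b8e17f1 can be illustrated with a minimal sketch. The helper names `matchers` and `query` are hypothetical and are not taken from the ceph-mixin sources; the only parts grounded in the commit message are the `cluster` label, the `showMultiCluster` switch and the `ceph_health_status` metric.

```python
def matchers(show_multi_cluster: bool, extra: str = "") -> str:
    """Return the label matcher list for a dashboard query.

    Hypothetical helper: when rendering for multiple clusters, the
    `cluster` matcher is always added; otherwise only `extra` remains.
    """
    parts = []
    if show_multi_cluster:
        parts.append('cluster=~"$cluster"')
    if extra:
        parts.append(extra)
    return ", ".join(parts)


def query(metric: str, show_multi_cluster: bool, extra: str = "") -> str:
    """Build `metric{matchers}` for either rendering mode."""
    return f"{metric}{{{matchers(show_multi_cluster, extra)}}}"


# Render the same query in both configurations, as the tests are said to do.
print(query("ceph_health_status", True))
print(query("ceph_health_status", False))
```

Keeping the matcher in one place is one way to avoid the kind of inconsistency the commit says it cleans up, since every query then gets the same filter in both modes.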
Aashish Sharma
c2f4aa7887 mgr/dashboard: replace deprecated table panel in grafana with a newer
table panel

Fixes: https://tracker.ceph.com/issues/65174

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-04-02 12:22:53 +05:30
cloudbehl
2df3ce1902 monitoring/ceph-mixin: Add cluster variable to ceph-cluster.json
Fixes: https://tracker.ceph.com/issues/65218

Signed-off-by: cloudbehl <cloudbehl@gmail.com>
2024-03-29 13:29:33 +05:30
Brad Hubbard
7863d297ea install-deps: Update Pyyaml version
Move to 6.0.1 to overcome https://github.com/yaml/pyyaml/issues/601

Fixes: https://tracker.ceph.com/issues/63591

Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
2024-03-07 14:13:11 +10:00
Nizamudeen A
278daa4ba2
Merge pull request #55574 from ceph/feature-multi-cluster-management-monitoring
mgr/dashboard: introduce multi cluster management and monitoring in ceph dashboard 

Reviewed-by: Nizamudeen A <nia@redhat.com>
2024-03-06 10:26:29 +05:30
Nizamudeen A
b8811c844f mgr/dashboard: introduce multi-cluster overview page
https://tracker.ceph.com/issues/64530
Signed-off-by: Nizamudeen A <nia@redhat.com>
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-03-05 19:05:37 +05:30
Aashish Sharma
d9c92e562d
Merge pull request #55510 from pcuzner/add-nvmeof-alerts
ceph-mixin: Update mixin to include alerts for the nvmeof gateway(s)


Reviewed-by: Aashish Sharma <aasharma@redhat.com>
2024-02-29 10:36:58 +05:30
Aashish Sharma
6e5efb626f mgr/dashboard: replace piechart plugin charts with native pie chart
panel

Fixes: https://tracker.ceph.com/issues/64579

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-02-27 14:20:31 +05:30
Paul Cuzner
c2534a6dba ceph-mixins: Add test cases for nvmeof alerts
Signed-off-by: Paul Cuzner <pcuzner@ibm.com>
2024-02-27 09:51:11 +13:00
Paul Cuzner
e7d25482d1 ceph-mixins: nvmeof alerts added
Signed-off-by: Paul Cuzner <pcuzner@ibm.com>
2024-02-27 09:51:11 +13:00
Paul Cuzner
f1573b76f3 ceph-mixins: Add nvmeof alerts
Signed-off-by: Paul Cuzner <pcuzner@ibm.com>
2024-02-27 09:51:04 +13:00
Paul Cuzner
feb1e69034 ceph-mixins: Add vars to support nvmeof alerts
Signed-off-by: Paul Cuzner <pcuzner@ibm.com>
2024-02-26 11:15:15 +13:00
Aashish Sharma
495f669faf mgr/dashboard: Add a manage clusters page to the multi-cluster nav to
list/connect/disconnect/edit clusters in multi-cluster setup

Fixes: https://tracker.ceph.com/issues/64530
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-02-22 10:42:01 +05:30
Aashish Sharma
a85baa89da
Merge pull request #55314 from cloudbehl/rgw-dashboard-json
mgr/dashboard: Fixing RGW graph panels


Reviewed-by: Aashish Sharma <aasharma@redhat.com>
2024-02-13 12:00:33 +05:30
Aashish Sharma
a572a0c167 mgr/dashboard: Add RGW per user/bucket panels in grafana
Fixes: https://tracker.ceph.com/issues/64359

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-02-09 21:14:27 +05:30
Aashish Sharma
65e6714720 mgr/dashboards: add generated json files
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2024-02-07 14:27:31 +05:30
Guillaume Abrioux
76d8e0bbbf monitoring: add new alerts
This adds new hardware monitoring alerts.

Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
2024-01-25 14:43:30 +00:00
cloudbehl
191fda84b3 mgr/dashboard: Fixing RGW graph panels
- Fixing grafana panels for rgw dashboards
- Fixing RGW overview dashboard queries

fixes https://tracker.ceph.com/issues/64177

Signed-off-by: cloudbehl <cloudbehl@gmail.com>
2024-01-25 18:41:03 +05:30
Aashish Sharma
2573426f54 mgr/dashboard: upgrade from old 'graph' type panels to the new
'timeseries' panel

The graph panel type is deprecated, and disappears after Grafana v9.1 (current version is 10.0) to prevent more old type panels being created. These should be migrated to the timeseries panel type, to avoid potential problems with future Grafana versions.

Fixes: https://tracker.ceph.com/issues/61720

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2023-12-22 11:19:40 +05:30
Nizamudeen A
a42e286fc0
Merge pull request #54355 from nobuto-m/info-rbd-stats-pools
mgr/dashboard: info on why RBD graphs are empty

Reviewed-by: Ankush Behl <cloudbehl@gmail.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-11-30 13:38:54 +05:30
Aashish Sharma
39fea8f71c
Merge pull request #51340 from Javlopez/feature/12087-upgrade-and-generate-grafana-dashboards
monitoring: add new dashboards

Fixes: https://tracker.ceph.com/issues/63592

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
2023-11-20 11:33:07 +05:30
Aashish Sharma
70d8c5b565
Merge pull request #53650 from rhcs-dashboard/fix-62969-main
mgr/dashboard: Show the OSDs Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard

Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-11-17 11:24:45 +05:30
Nobuto Murata
9c026fa18c mgr/dashboard: info on why RBD graphs are empty
Those RBD IO statistics graphs are empty out of the box and it's on
purpose. Instead of giving an impression that those graphs are broken,
point users to the documentation that explains the optional steps to enable
those statistics:
https://docs.ceph.com/en/latest/mgr/prometheus/#rbd-io-statistics

Signed-off-by: Nobuto Murata <nobuto.murata@canonical.com>
2023-11-06 15:50:50 +09:00
Aashish Sharma
88d0a9f45d
Merge pull request #53807 from rhcs-dashboard/fix-63088-main
mgr/dashboard: Consider null values as zero in grafana panels


Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-10-25 13:01:03 +05:30
Javier
f0e8565b49 monitoring: update libsonnet files for generate ceph-cluster.json
add ceph-cluster.libsonnet file to generate ceph-cluster.json

Fixes: https://tracker.ceph.com/issues/61443
Signed-off-by: Javier <sjavierlopez@gmail.com>
2023-10-20 18:07:33 -06:00
Nizamudeen A
a5027e37ec mgr/dashboard: fix broken alert generator
Currently the alert generator is broken if you try to run `tox
-ealerts-fix`. I fixed it and ran the command and it built a new json
file as well.

Signed-off-by: Nizamudeen A <nia@redhat.com>
2023-10-13 12:42:50 +05:30
Aashish Sharma
a29e6a8673 mgr/dashboard: Show the OSD's Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard
Fixes: https://tracker.ceph.com/issues/62969

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2023-10-11 11:46:03 +05:30
Juan Miguel Olmo
b7b7ef90f4
Merge pull request #50132 from aruniiird/add-rbd-mirror-mon-alerts
ceph-mixin: Add RBD Mirror monitoring alerts
2023-10-10 13:37:01 +02:00
Aashish Sharma
6f3f58cb8e mgr/dashboard: Consider null values as zero in grafana panels
After upgrading from RHCS4 to RHCS5, some of the grafana charts broke.
This is because RHCS5 does not generate a metric if its value is
zero; as a result, the null value from that metric breaks the grafana
charts or graphs. This PR fixes that issue.

Fixes: https://tracker.ceph.com/issues/63088

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2023-10-04 12:31:42 +05:30
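
The idea behind commit 6f3f58cb8e, treating a missing sample as zero, can be sketched as follows. This is only an illustration of the concept; the commit message does not say how the panels implement it, and the function name `nulls_as_zero` is hypothetical.

```python
from typing import List, Optional, Sequence


def nulls_as_zero(samples: Sequence[Optional[float]]) -> List[float]:
    """Replace missing samples (None) with 0.0 so a chart can be drawn.

    Assumption: a metric that is not generated shows up as None in the
    series handed to the chart.
    """
    return [0.0 if sample is None else sample for sample in samples]


# A series with a gap where the metric was not generated.
series = [4.0, None, 2.5]
print(nulls_as_zero(series))
```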
Nizamudeen A
b5bf9d70cb
Merge pull request #52150 from paulreece42/wip-grafana-quorum-fix
monitoring: grafana mons out of quorum should be count - sum

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-09-21 12:36:21 +05:30
Josh Soref
73479a1e05 dashboard: fix spelling errors
* access
* availability
* dashboard
* depth
* dimless
* evaluation
* executing
* existing
* facts
* gigabytes
* idempotent
* independent
* initial
* inventory
* managed
* must not
* notification
* notifications
* orchestrator
* previously
* promises
* purging
* queried
* repetitive
* split
* subdirectories
* tenant
* the
* timestamp
* transformed
* unavailable
* visibility
* yourself

Signed-off-by: Josh Soref <2119212+jsoref@users.noreply.github.com>
2023-08-09 11:14:20 -04:00
Arun Kumar Mohan
5c21134064 ceph-mixin: add RBD Mirror monitoring alerts
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
2023-08-09 12:19:04 +05:30
Arun Kumar Mohan
e9d803d608 ceph-mixin: fix manually edited 'prometheus_alerts.yml' file
The 'prometheus_alerts.yml' file should not be edited directly.
The changes should be added to the 'prometheus_alerts.libsonnet' file
(and/or any other appropriate libsonnet/jsonnet files) and generated
using the 'make generate' command.

Adding all the changes to 'prometheus_alerts.libsonnet' file and
building/generating the prometheus_alerts YAML file.

PS: all the changes seen in the 'prometheus_alerts.yml' file are due
to the re-arrangement of lines. The file content remains the same.

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
2023-08-09 12:19:04 +05:30
Arun Kumar Mohan
86d040e2fc ceph-mixin: fix ceph-mixin setup
Made the following changes to files:

Makefile:
    Add needed 'tox' target to generate alert files
    Now we can do 'make generate' OR 'make test'
    to generate all the yaml files (and run tests)

alerts.jsonnet:
    Added an 'import' line to include 'config.libsonnet' file.
    This fixes the errors in generating 'prometheus_alerts.yml' file

tox.ini:
    Added all the existing 'alerts-' targets to 'envlist'
    Added the missing 'alerts-test' target to 'testenv'
    Added 'jsonnet' to 'allowlist_externals', which prevents a
    deprecation warning
    A minor spelling correction

lint-jsonnet.sh:
    Made errors more verbose.

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
2023-08-09 12:19:04 +05:30
Paul Reece
6ff02381a3 monitoring: grafana mons out of quorum should be count - sum
not count / sum

For example, with 3 mons total, all in quorum, the original expression
computes 3/3 = 1, showing 1 out of quorum (likely a typo)

Fixes: https://tracker.ceph.com/issues/61923

Signed-off-By: Paul Reece <paulreece42@gmail.com>

fixing case sensitive
Signed-off-by: Paul Reece <paulreece42@gmail.com>
2023-08-07 16:16:18 +00:00
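
The arithmetic in commit 6ff02381a3 can be checked with a short sketch. It assumes each monitor reports a 0/1 quorum flag, so that "count" is the number of monitors and "sum" is the number in quorum; the function names are hypothetical.

```python
from typing import Sequence


def out_of_quorum_original(flags: Sequence[int]) -> float:
    """Original expression: count / sum (fails if no monitor is in quorum)."""
    return len(flags) / sum(flags)


def out_of_quorum_fixed(flags: Sequence[int]) -> int:
    """Corrected expression: count - sum."""
    return len(flags) - sum(flags)


all_in_quorum = [1, 1, 1]  # 3 mons total, all in quorum
one_missing = [1, 1, 0]    # 3 mons total, one out of quorum

print(out_of_quorum_original(all_in_quorum), out_of_quorum_fixed(all_in_quorum))
print(out_of_quorum_original(one_missing), out_of_quorum_fixed(one_missing))
```

With all three monitors in quorum, the division yields 1 while the subtraction yields 0, which matches the example in the commit message.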
Aashish Sharma
3063c8a4fb
Merge pull request #48783 from rhcs-dashboard/fix-grafana-stat-panel
mgr/dashboard: Replace vonage-status-panel with native grafana stat panel


Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-02-08 18:34:30 +05:30
Pere Diaz Bou
8e07fbd2ea
Merge pull request #48843 from rhcs-dashboard/expose_slow_ops
mgr/prometheus: expose daemon health metrics

Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-12-20 12:25:32 +01:00
Pere Diaz Bou
5a2b7c25b6 mgr/prometheus: expose daemon health metrics
Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS for OSDs and MONs. This
commit exposes it to provide fine-grained metrics for finding
troublesome OSDs, instead of relying on a single cluster-wide
healthcheck for slow ops.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-12-20 09:44:49 +01:00
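
As a rough illustration of why per-daemon values from commit 5a2b7c25b6 are useful, the sketch below ranks daemons by their slow-op count. The daemon names and counts are made up, and `daemons_with_slow_ops` is a hypothetical helper, not part of mgr/prometheus.

```python
from typing import Dict, List, Tuple


def daemons_with_slow_ops(slow_ops: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (daemon, count) pairs with at least one slow op, worst first."""
    affected = [(name, count) for name, count in slow_ops.items() if count > 0]
    return sorted(affected, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-daemon SLOW_OPS values.
sample = {"osd.0": 0, "osd.1": 12, "mon.a": 0, "osd.2": 3}
print(daemons_with_slow_ops(sample))
```

A single cluster-wide healthcheck would only say that slow ops exist; the per-daemon view points at the daemons responsible.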
Aashish Sharma
3e08b81b40 mgr/dashboard: Replace vonage-status-panel with native grafana stat panel
Fixes: https://tracker.ceph.com/issues/58295
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-12-16 10:51:47 +05:30
Kefu Chai
34e2e33870 *: s/whitelist_externals/allowlist_externals/
as allowlist_externals was introduced in
tox v4.0. see
5e33fda1a4 , but
this option was backported to 3.18 as an alias of whitelist_externals, so we don't need
to specify the minversion to 4.0 in this change.

as we started using tox 4.0 and up (v4.0.2 specifically), tox complains
and fails like:

alerts-lint: failed with promtool is not allowed, use allowlist_externals to allow it
  alerts-lint: FAIL code 1 (9.25 seconds)

see https://tox.wiki/en/latest/faq.html#tox-4-removed-tox-ini-keys
and https://tox.wiki/en/latest/config.html#allowlist_externals

it'd be nice to use a more inclusive language also. so, in this change,
s/whitelist_externals/allowlist_externals/ in all tox.ini in this
project.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2022-12-08 15:07:00 +08:00
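
The substitution named in commit 34e2e33870 can be sketched as a small text transformation. This is an illustration only, not the way the change was actually applied; `rename_externals` is a hypothetical helper, and the sample section reuses the `alerts-lint` environment and `promtool` from the error message above.

```python
import re


def rename_externals(tox_ini: str) -> str:
    """Rename the `whitelist_externals` key to `allowlist_externals`.

    Only keys at the start of a line (after optional indentation) are
    renamed, so mentions inside comments are left alone.
    """
    return re.sub(
        r"^([ \t]*)whitelist_externals\b",
        r"\1allowlist_externals",
        tox_ini,
        flags=re.MULTILINE,
    )


sample = "[testenv:alerts-lint]\nwhitelist_externals =\n    promtool\n"
print(rename_externals(sample))
```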
Nizamudeen A
3f1c1b6376
Merge pull request #48526 from rhcs-dashboard/fix-cephPoolGrowth-alert
mgr/dashboard: Fix CephPoolGrowthWarning alert

Reviewed-by: Pegonzal <NOT@FOUND>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-11-29 18:29:01 +05:30
Aashish Sharma
97189b66af mgr/dashboard: Fix CephPoolGrowthWarning alert
Prometheus reports an error for CephPoolGrowthWarning, "many-to-many matching not allowed: matching labels must be unique on one side", if the same pool ids exist on two different instances.

Fixes: https://tracker.ceph.com/issues/58017
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-11-22 11:55:41 +05:30
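
The error addressed by commit 97189b66af can be reproduced with a simplified model of label matching. This is not Prometheus's implementation: it only rejects a key that is duplicated on both sides. The sample labels are made up, and matching on `instance` as well is just one possible way to make the key unique; the commit message does not say how the alert was changed.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sample = Dict[str, str]


def match_on(left: List[Sample], right: List[Sample],
             on: Tuple[str, ...]) -> List[Tuple[Sample, Sample]]:
    """Pair samples whose `on` labels are equal (simplified model)."""
    def key(sample: Sample) -> Tuple[str, ...]:
        return tuple(sample[label] for label in on)

    left_keys = Counter(key(s) for s in left)
    right_keys = Counter(key(s) for s in right)
    for k, count in left_keys.items():
        if count > 1 and right_keys.get(k, 0) > 1:
            raise ValueError(
                "many-to-many matching not allowed: "
                "matching labels must be unique on one side"
            )
    return [(l, r) for l in left for r in right if key(l) == key(r)]


# The same pool id reported by two different instances (hypothetical data).
usage = [{"instance": "a", "pool_id": "1"}, {"instance": "b", "pool_id": "1"}]
metadata = [{"instance": "a", "pool_id": "1"}, {"instance": "b", "pool_id": "1"}]

try:
    match_on(usage, metadata, ("pool_id",))
except ValueError as err:
    print(err)

print(len(match_on(usage, metadata, ("instance", "pool_id"))))
```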
Tatjana Dehler
08352b6540
ceph-mixin: fix ceph_hosts variable
Only use `instance` to query for hostnames in single-cluster-mode.
Consider the cluster matcher only in multi-cluster-mode. In this case
the query will look like:
`"label_values({cluster=~\"$cluster\"}, instance)"`.

Fixes: https://tracker.ceph.com/issues/57987
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-11-11 16:35:05 +01:00
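
The two forms of the `ceph_hosts` query described in commit 08352b6540 can be sketched like this. The multi-cluster string is the one quoted in the commit message; the single-cluster form `label_values(instance)` is an assumption based on "only use `instance`", and `ceph_hosts_query` is a hypothetical helper.

```python
import json


def ceph_hosts_query(multi_cluster: bool) -> str:
    """Return the template variable query for the ceph_hosts variable."""
    if multi_cluster:
        # Form quoted in the commit message.
        return 'label_values({cluster=~"$cluster"}, instance)'
    # Assumed single-cluster form: no cluster matcher at all.
    return "label_values(instance)"


# json.dumps shows the escaped form as it appears in a dashboard JSON file.
print(json.dumps(ceph_hosts_query(True)))
print(json.dumps(ceph_hosts_query(False)))
```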
Christian Kugler
4aecdad350
ceph-mixin: Add Prometheus Alert for Degraded Bond
Currently there is no alert for a misconfigured or failed network interface
card that is part of a network bond.

This could lead to redundancy and performance being degraded unnoticed.

To solve this, I use node exporter metrics to look at the number of total peers
of the bond and the ones that are active. If the numbers differ, something is up
and should be looked at.

Fixes: https://tracker.ceph.com/issues/57962
Signed-off-by: Christian Kugler <syphdias+git@gmail.com>
2022-11-02 14:48:57 +01:00
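
The check described in commit 4aecdad350, comparing a bond's total peers with its active peers, can be sketched as below. The bond names and numbers are made up, and `degraded_bonds` is a hypothetical helper; in the alert itself the values would come from node exporter metrics.

```python
from typing import Dict, List, Tuple


def degraded_bonds(bonds: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return bonds whose active peer count differs from the total.

    `bonds` maps a bond name to (total peers, active peers).
    """
    return sorted(
        name for name, (total, active) in bonds.items() if total != active
    )


# bond1 has lost one of its two peers.
sample = {"bond0": (2, 2), "bond1": (2, 1)}
print(degraded_bonds(sample))
```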