228 Commits

Author SHA1 Message Date
Aashish Sharma
2573426f54 mgr/dashboard: upgrade from old 'graph' type panels to the new 'timeseries' panel

The graph panel type is deprecated and disappears after Grafana v9.1 (the current version is 10.0), to prevent more panels of the old type from being created. Existing graph panels should be migrated to the timeseries panel type to avoid potential problems with future Grafana versions.
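
A minimal sketch of how the legacy panels could be located before migrating, assuming a dashboard JSON with a top-level `panels` list in which row panels may nest further `panels`. The helper name `find_graph_panels` and the sample panel titles are hypothetical. The sketch only reports panels of type `graph`; a real migration also has to translate panel options, which is not attempted here.

```python
import json


def find_graph_panels(dashboard):
    """Return the sorted titles of panels that still use the 'graph' type."""
    found = []
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        if panel.get("type") == "graph":
            found.append(panel.get("title", "<untitled>"))
        # Row panels may carry nested panels; visit those as well.
        stack.extend(panel.get("panels", []))
    return sorted(found)


# Hypothetical dashboard fragment, for illustration only.
sample = json.loads('''
{"panels": [
  {"type": "graph", "title": "OSD latency"},
  {"type": "timeseries", "title": "IOPS"},
  {"type": "row", "title": "Pools",
   "panels": [{"type": "graph", "title": "Pool usage"}]}
]}
''')

print(find_graph_panels(sample))
```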

Fixes: https://tracker.ceph.com/issues/61720

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2023-12-22 11:19:40 +05:30
Nizamudeen A
dd0a5aac96 monitoring: upgrade grafana container to 9.4.12
Fixes the CVEs mentioned here: https://grafana.com/blog/2023/06/06/grafana-security-release-new-grafana-versions-with-security-fixes-for-cve-2023-2183-and-cve-2023-2801/

Signed-off-by: Nizamudeen A <nia@redhat.com>
2023-12-06 11:56:44 +05:30
Nizamudeen A
a42e286fc0
Merge pull request #54355 from nobuto-m/info-rbd-stats-pools
mgr/dashboard: info on why RBD graphs are empty

Reviewed-by: Ankush Behl <cloudbehl@gmail.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-11-30 13:38:54 +05:30
Aashish Sharma
39fea8f71c
Merge pull request #51340 from Javlopez/feature/12087-upgrade-and-generate-grafana-dashboards
monitoring: add new dashboards

Fixes: https://tracker.ceph.com/issues/63592

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
2023-11-20 11:33:07 +05:30
Aashish Sharma
70d8c5b565
Merge pull request #53650 from rhcs-dashboard/fix-62969-main
mgr/dashboard: Show the OSDs Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard

Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-11-17 11:24:45 +05:30
Nobuto Murata
9c026fa18c mgr/dashboard: info on why RBD graphs are empty
Those RBD IO statistics graphs are empty out of the box, and this is on
purpose. Instead of giving the impression that those graphs are broken,
point users to the documentation that explains the optional steps to enable
those statistics.
https://docs.ceph.com/en/latest/mgr/prometheus/#rbd-io-statistics

Signed-off-by: Nobuto Murata <nobuto.murata@canonical.com>
2023-11-06 15:50:50 +09:00
Aashish Sharma
88d0a9f45d
Merge pull request #53807 from rhcs-dashboard/fix-63088-main
mgr/dashboard: Consider null values as zero in grafana panels


Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-10-25 13:01:03 +05:30
Javier
f0e8565b49 monitoring: update libsonnet files to generate ceph-cluster.json
add ceph-cluster.libsonnet file to generate ceph-cluster.json

Fixes: https://tracker.ceph.com/issues/61443
Signed-off-by: Javier <sjavierlopez@gmail.com>
2023-10-20 18:07:33 -06:00
Nizamudeen A
a5027e37ec mgr/dashboard: fix broken alert generator
Currently the alert generator is broken if you try to run
`tox -ealerts-fix`. I fixed it and ran the command, and it built a new JSON
file as well.

Signed-off-by: Nizamudeen A <nia@redhat.com>
2023-10-13 12:42:50 +05:30
Aashish Sharma
a29e6a8673 mgr/dashboard: Show the OSD's Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard
Fixes: https://tracker.ceph.com/issues/62969

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2023-10-11 11:46:03 +05:30
Juan Miguel Olmo
b7b7ef90f4
Merge pull request #50132 from aruniiird/add-rbd-mirror-mon-alerts
ceph-mixin: Add RBD Mirror monitoring alerts
2023-10-10 13:37:01 +02:00
Aashish Sharma
6f3f58cb8e mgr/dashboard: Consider null values as zero in grafana panels
After upgrading from RHCS4 to RHCS5, some of the Grafana charts broke.
This is because RHCS5 does not generate a metric if its value is
zero; as a result, the null value from that metric breaks the Grafana
charts or graphs. This PR fixes that issue.
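
To illustrate the failure mode: when a zero-valued metric is never emitted, a panel receives gaps instead of zeros. The sketch below shows the "treat null as zero" idea on a hypothetical list of samples in which `None` stands for a missing point. It is an illustration only; the helper name `nulls_as_zero` is made up, and the actual change is made in the Grafana panels rather than in Python.

```python
def nulls_as_zero(samples):
    """Replace missing samples (None) with 0 so the series has no gaps."""
    return [0 if value is None else value for value in samples]


# Hypothetical series: the metric was not emitted at the 2nd and 3rd scrape.
series = [3, None, None, 5]
print(nulls_as_zero(series))
```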

Fixes: https://tracker.ceph.com/issues/63088

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2023-10-04 12:31:42 +05:30
Nizamudeen A
b5bf9d70cb
Merge pull request #52150 from paulreece42/wip-grafana-quorum-fix
monitoring: grafana mons out of quorum should be count - sum

Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-09-21 12:36:21 +05:30
Josh Soref
73479a1e05 dashboard: fix spelling errors
* access
* availability
* dashboard
* depth
* dimless
* evaluation
* executing
* existing
* facts
* gigabytes
* idempotent
* independent
* initial
* inventory
* managed
* must not
* notification
* notifications
* orchestrator
* previously
* promises
* purging
* queried
* repetitive
* split
* subdirectories
* tenant
* the
* timestamp
* transformed
* unavailable
* visibility
* yourself

Signed-off-by: Josh Soref <2119212+jsoref@users.noreply.github.com>
2023-08-09 11:14:20 -04:00
Arun Kumar Mohan
5c21134064 ceph-mixin: add RBD Mirror monitoring alerts
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
2023-08-09 12:19:04 +05:30
Arun Kumar Mohan
e9d803d608 ceph-mixin: fix manually edited 'prometheus_alerts.yml' file
The 'prometheus_alerts.yml' file should not be edited directly.
The changes should be added to the 'prometheus_alerts.libsonnet' file
(and/or any other appropriate lib/jsonnet files) and generated
using the 'make generate' command.

Adding all the changes to the 'prometheus_alerts.libsonnet' file and
building/generating the prometheus_alerts YAML file.

PS: all the changes seen in the 'prometheus_alerts.yml' file are due
to the re-arrangement of lines. The file content remains the same.

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
2023-08-09 12:19:04 +05:30
Arun Kumar Mohan
86d040e2fc ceph-mixin: fix ceph-mixin setup
Made the following changes to files:

Makefile:
    Add needed 'tox' target to generate alert files
    Now we can do 'make generate' OR 'make test'
    to generate all the yaml files (and run tests)

alerts.jsonnet:
    Added an 'import' line to include 'config.libsonnet' file.
    This fixes the errors in generating the 'prometheus_alerts.yml' file

tox.ini:
    Added all the existing 'alerts-' targets to 'envlist'
    Added the missing 'alerts-test' target to 'testenv'
    Added 'jsonnet' to 'allowlist_externals', which prevents a
    deprecation warning
    A minor spelling correction

lint-jsonnet.sh:
    Made errors more verbose.

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
2023-08-09 12:19:04 +05:30
Paul Reece
6ff02381a3 monitoring: grafana mons out of quorum should be count - sum
not count / sum

For example, with 3 mons total, all in quorum, original
will do 3/3 = 1, showing 1 out of quorum (likely typo fix)
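
The arithmetic can be checked with a short sketch. It assumes each monitor reports a flag of 1 when it is in quorum and 0 otherwise; the function names are hypothetical and this is not the dashboard's actual query.

```python
def out_of_quorum(flags):
    """Corrected form: count - sum."""
    return len(flags) - sum(flags)


def out_of_quorum_original(flags):
    """Original form: count / sum (the likely typo)."""
    return len(flags) / sum(flags)


all_in = [1, 1, 1]    # 3 mons, all in quorum
one_out = [1, 1, 0]   # 3 mons, one out of quorum

print(out_of_quorum(all_in), out_of_quorum_original(all_in))
print(out_of_quorum(one_out), out_of_quorum_original(one_out))
```

With all three mons in quorum, the subtraction yields 0 while the division yields 1, which is the misleading value described above.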

Fixes: https://tracker.ceph.com/issues/61923

Signed-off-By: Paul Reece <paulreece42@gmail.com>

fixing case sensitive
Signed-off-by: Paul Reece <paulreece42@gmail.com>
2023-08-07 16:16:18 +00:00
Nizamudeen A
fb4a2f2b4e monitoring/grafana: update the grafana version
Signed-off-by: Nizamudeen A <nia@redhat.com>
2023-04-03 21:53:39 +05:30
Aashish Sharma
3063c8a4fb
Merge pull request #48783 from rhcs-dashboard/fix-grafana-stat-panel
mgr/dashboard: Replace vonage-status-panel with native grafana stat panel


Reviewed-by: Nizamudeen A <nia@redhat.com>
2023-02-08 18:34:30 +05:30
Pere Diaz Bou
8e07fbd2ea
Merge pull request #48843 from rhcs-dashboard/expose_slow_ops
mgr/prometheus: expose daemon health metrics

Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-12-20 12:25:32 +01:00
Pere Diaz Bou
5a2b7c25b6 mgr/prometheus: expose daemon health metrics
Until now daemon health metrics were stored without being used. One of
the most helpful metrics there is SLOW_OPS with respect to OSDs and MONs
which this commit tries to expose to bring fine grained metrics to find
troublesome OSDs instead of having a lone healthcheck of slow ops in the
whole cluster.

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-12-20 09:44:49 +01:00
Aashish Sharma
3e08b81b40 mgr/dashboard: Replace vonage-status-panel with native grafana stat panel
Fixes: https://tracker.ceph.com/issues/58295
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-12-16 10:51:47 +05:30
Kefu Chai
34e2e33870 *: s/whitelist_externals/allowlist_externals/
as allowlist_externals was introduced in tox v4.0 (see 5e33fda1a4), but
this option was backported to 3.18 as an alias of whitelist_externals, so we don't need
to specify the minversion to 4.0 in this change.

as we started using tox 4.0 and up (v4.0.2 in specific). tox complains
and fails like:

alerts-lint: failed with promtool is not allowed, use allowlist_externals to allow it
  alerts-lint: FAIL code 1 (9.25 seconds)

see https://tox.wiki/en/latest/faq.html#tox-4-removed-tox-ini-keys
and https://tox.wiki/en/latest/config.html#allowlist_externals

it'd be nice to use a more inclusive language also. so, in this change,
s/whitelist_externals/allowlist_externals/ in all tox.ini in this
project.
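
The substitution can be sketched in Python as follows. The helper name `rename_externals_key` and the sample tox.ini fragment are hypothetical; only the key name at the start of a line is rewritten, so other text is left untouched.

```python
import re


def rename_externals_key(tox_ini_text):
    """Rename the whitelist_externals key to allowlist_externals."""
    pattern = r"(?m)^([ \t]*)whitelist_externals([ \t]*=)"
    return re.sub(pattern, r"\1allowlist_externals\2", tox_ini_text)


# Hypothetical fragment modelled on the failing alerts-lint environment.
old = "[testenv:alerts-lint]\nwhitelist_externals =\n    promtool\n"
new = rename_externals_key(old)
print(new)
```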

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2022-12-08 15:07:00 +08:00
Nizamudeen A
3f1c1b6376
Merge pull request #48526 from rhcs-dashboard/fix-cephPoolGrowth-alert
mgr/dashboard: Fix CephPoolGrowthWarning alert

Reviewed-by: Pegonzal <NOT@FOUND>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-11-29 18:29:01 +05:30
Aashish Sharma
97189b66af mgr/dashboard: Fix CephPoolGrowthWarning alert
Prometheus reports an error - many-to-many matching not allowed: matching labels must be unique on one side for CephPoolGrowthWarning if we have same pool ids on two different instances.
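
A simplified sketch of why duplicate pool ids trigger this error: when two vectors are joined on a label, the label value must be unique on at least one side. The function below flags values that repeat on both sides. The helper name and sample label sets are hypothetical, and this is not how Prometheus implements vector matching.

```python
from collections import Counter


def many_to_many_keys(left, right, on):
    """Return label values that occur more than once on both sides."""
    left_counts = Counter(sample[on] for sample in left)
    right_counts = Counter(sample[on] for sample in right)
    return sorted(
        key for key, count in left_counts.items()
        if count > 1 and right_counts.get(key, 0) > 1
    )


# Hypothetical series: the same pool_id is reported by two instances.
growth = [{"pool_id": "1", "instance": "a"}, {"pool_id": "1", "instance": "b"}]
metadata = [{"pool_id": "1", "instance": "a"}, {"pool_id": "1", "instance": "b"}]

print(many_to_many_keys(growth, metadata, on="pool_id"))
```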

Fixes: https://tracker.ceph.com/issues/58017
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2022-11-22 11:55:41 +05:30
Tatjana Dehler
08352b6540
ceph-mixin: fix ceph_hosts variable
Only use `instance` to query for hostnames in single-cluster-mode.
Consider the cluster matcher only in multi-cluster-mode. In this case
the query will look like:
`"label_values({cluster=~\"$cluster\"}, instance)"`.
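
The mode-dependent choice can be sketched as below. The multi-cluster query string is the one quoted above; the single-cluster form `label_values(instance)` and the function name are assumptions made for illustration, not taken from the mixin source.

```python
def ceph_hosts_query(multi_cluster):
    """Build the Grafana template query for the ceph_hosts variable."""
    if multi_cluster:
        # Quoted in the commit message: restrict hostnames to the selected cluster.
        return 'label_values({cluster=~"$cluster"}, instance)'
    # Assumed single-cluster form: query on `instance` only.
    return "label_values(instance)"


print(ceph_hosts_query(multi_cluster=False))
print(ceph_hosts_query(multi_cluster=True))
```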

Fixes: https://tracker.ceph.com/issues/57987
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-11-11 16:35:05 +01:00
Christian Kugler
4aecdad350
ceph-mixin: Add Prometheus Alert for Degraded Bond
Currently there is no alert for a network interface card that is part of a
network bond being misconfigured or having failed.

This could lead to redundancy and performance being degraded unnoticed.

To solve this, I use node exporter metrics to look at the number of total peers
of the bond and the ones that are active. If the numbers differ, something is up
and should be looked at.
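
The comparison can be sketched as follows, assuming the two counts are available as per-bond mappings. The function name, the bond names, and the values are hypothetical; the actual alert is written as a Prometheus rule over node exporter metrics.

```python
def degraded_bonds(total_peers, active_peers):
    """Return bonds whose active peer count differs from the total."""
    return sorted(
        bond for bond, total in total_peers.items()
        if active_peers.get(bond, 0) != total
    )


# Hypothetical readings: bond1 has lost one of its two peers.
total = {"bond0": 2, "bond1": 2}
active = {"bond0": 2, "bond1": 1}

print(degraded_bonds(total, active))
```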

Fixes: https://tracker.ceph.com/issues/57962
Signed-off-by: Christian Kugler <syphdias+git@gmail.com>
2022-11-02 14:48:57 +01:00
zdover23
23aa2be306
Merge pull request #47305 from zdover23/wip-doc-2022-07-25-pr4600-cleanup
doc/monitoring: add min vers of apps in mon stack

Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
2022-09-13 13:44:43 +10:00
Nizamudeen A
d84a03e989
Merge pull request #47700 from s0nea/wip-rgw-overview-labels
monitoring/ceph-mixin: add RGW host to label info

Reviewed-by: MrFreezeex <NOT@FOUND>
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
2022-09-09 17:36:40 +05:30
Tatjana Dehler
15fa97d49d
monitoring/ceph-mixin: add RGW host to label info
Add the missing information about the RGW instance to the labels of the
"Average GET/PUT Latencies" panel on the "RGW Overview" dashboard.

Fixes: https://tracker.ceph.com/issues/57166
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-09-06 16:19:19 +02:00
Zac Dover
367695f5b0 doc/monitoring: add min vers of apps in mon stack
https://tracker.ceph.com/issues/45447

This PR adds recommended versions of grafana and
prometheus and alert manager.

This PR is a second attempt at getting the information
in the following PR into the docs:
https://github.com/ceph/ceph/pull/46000/files

Himadri Maheshwari deserves the credit for the work
in this commit.

Signed-off-by: Zac Dover <zac.dover@gmail.com>
Signed-off-by: Himadri Maheshwari <himadri.maheshwari7915@gmail.com>
2022-09-05 07:36:52 +10:00
Arthur Outhenin-Chalandre
f744a93ef1
Merge pull request #47707 from bosc0/fix_alert
Ceph-mixin: Fix CephNodeNetworkPacket alerts
2022-08-30 12:49:23 +02:00
Arthur Outhenin-Chalandre
4909e795c9
Merge pull request #47669 from MrFreezeex/jb-path
ceph-mixin: fix PATH issues with jsonnet-bundler
2022-08-30 08:35:04 +02:00
Aswin Toni
351e1ac639 ceph-mixin: fix CephNodeNetworkPacket alerts
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-23 15:26:52 +02:00
Tatjana Dehler
42ff9370a0
monitoring/ceph-mixin: add entries to envlist
Add the missing entries `jsonnet-bundler-install` and
`jsonnet-bundler-update` to envlist.

Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-08-19 12:08:56 +02:00
Aswin Toni
35183140f6 ceph-mixin: fix config inheritance
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-18 16:21:36 +02:00
Arthur Outhenin-Chalandre
d46e14c71b
ceph-mixin: fix PATH issues with jsonnet-bundler
In 4a3afcf, the $PATH is set for the test, but we cannot set multiple
properties with a single `set_property()` cmake command. We fix that by
adding the installation path of jsonnet-bundler
(CMAKE_CURRENT_BINARY_DIR) to the $PATH used for every tox test.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
Co-Authored-By: Kefu Chai <tchaikov@gmail.com>
2022-08-18 13:43:34 +02:00
Aswin Toni
2e0e684fc2 ceph-mixin: Remove jsonnet building
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-17 12:08:56 +02:00
Aswin Toni
5cdc1c62c5 prometheus: add multicluster support to alerts
Signed-off-by: Aswin Toni <aswin.toni@cern.ch>
2022-08-17 12:08:56 +02:00
Kefu Chai
4a3afcf277 cmake: set $PATH for tests using jsonnet tools
otherwise they would not be able to find executables installed into
${CMAKE_CURRENT_BINARY_DIR}.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2022-08-16 10:53:29 +08:00
Nizamudeen A
e9d361f621
Merge pull request #47334 from s0nea/wip-osd-objectstore-types-fix
monitoring/ceph-mixin: OSD overview typo fix

Reviewed-by: MrFreezeex <NOT@FOUND>
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2022-08-01 13:47:03 +05:30
Anthony D'Atri
9b65974468 monitoring/ceph-mixin: clean up prometheus_alerts.yml
Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
2022-07-28 19:17:51 -07:00
Tatjana Dehler
8faaca2082
monitoring/ceph-mixin: OSD overview typo fix
Correct a wrongly set bracket on ceph-dashboard -> OSD Overview ->
OSD Objectstore Types resulting in a parser error.

Fixes: https://tracker.ceph.com/issues/56948
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
2022-07-28 15:15:32 +02:00
Arthur Outhenin-Chalandre
37add644d1
ceph-mixin: remove timepicker override in every dashboard
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-24 11:54:26 +02:00
Arthur Outhenin-Chalandre
5db37300fd
ceph-mixin: rationalize local helper functions to utils
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-24 11:50:49 +02:00
Arthur Outhenin-Chalandre
0b7cc6bc99
ceph-mixin: fix typos
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-18 10:02:54 +02:00
Arthur Outhenin-Chalandre
c8f086c182
ceph-mixin: fix test with rate and label changes
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-17 09:42:29 +02:00
Arthur Outhenin-Chalandre
3b6356c872
ceph-mixin: don't add cluster matcher if showcluster is disabled
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-17 09:41:21 +02:00
Arthur Outhenin-Chalandre
fd4f484d22
ceph-mixin: refactor the structure of _config and utils
Before this refactor we couldn't override the config externally. Now the
_config is correctly propagated and not only taken from the
config.libsonnet file.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
2022-05-16 15:26:56 +02:00