24 Commits

Author SHA1 Message Date
Aashish Sharma
58d635455d mgr/dashboard: Incorrect MTU mismatch warning
The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue.

Fixes: https://tracker.ceph.com/issues/52028
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-09-02 15:34:36 +05:30
Aashish Sharma
8d2f39e6c5 mgr/dashboard: Simplify some complex calculations in test_alerts.yml
run-promtool-unittests is failing because of differences in floating-point values in some complex calculations. This PR intends to simplify those calculations and fix this issue.

Fixes: https://tracker.ceph.com/issues/49952
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-03-25 12:05:07 +05:30
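The kind of failure described in commit 8d2f39e6c5 can be reproduced outside promtool. The following is a minimal Python sketch, not taken from test_alerts.yml, of why comparing computed floating-point values for exact equality is brittle. The numbers are illustrative only.

```python
import math

# Two values that are equal on paper but not in binary floating point.
computed = 0.1 + 0.2
expected = 0.3

# Exact comparison fails because of binary rounding.
exact_match = computed == expected

# A tolerance-based comparison succeeds.
close_match = math.isclose(computed, expected, rel_tol=1e-9)

print(exact_match, close_match)
```

Test inputs whose expected results are exactly representable (for example, whole numbers) sidestep this kind of mismatch, which is presumably one motivation for simplifying the calculations.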
Aashish Sharma
53a5816ded mgr/dashboard: test prometheus rules through promtool
This PR intends to add unit testing for prometheus rules using promtool. To run the tests, run the 'run-promtool-unittests.sh' script.

Fixes: https://tracker.ceph.com/issues/45415
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-03-08 10:16:22 +05:30
Ernesto Puerta
dff5b78d3b Merge pull request #39462 from rhcs-dashboard/fix-alerts-mtuMismatch
mgr/dashboard: fix MTU Mismatch alert

Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-02-17 14:14:17 +01:00
Ernesto Puerta
e2d73297cf Merge pull request #38030 from p-se/prom-alert-package-drops-leeway
mgr/dashboard: prometheus alerting: add some leeway for package drops and errors

Reviewed-by: Stephan Müller <smueller@suse.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
2021-02-16 20:45:44 +01:00
Patrick Seidensal
9ac248b0c3 mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%)
Fixes: https://tracker.ceph.com/issues/48201

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2021-02-16 14:43:00 +01:00
Aashish Sharma
8527489b91 mgr/dashboard: fix MTU Mismatch alert
This PR intends to fix the expression used for the MTU Mismatch alert in prometheus.

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-02-15 10:13:39 +05:30
Aashish Sharma
06cc0d8743 mgr/dashboard: trigger alert if some nodes have an MTU different from the median value
This PR intends to alert a user if a specific network is configured with a custom MTU.

Fixes: https://tracker.ceph.com/issues/48748
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
2021-01-22 11:20:13 +05:30
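The idea behind commit 06cc0d8743, flagging nodes whose MTU differs from the median, can be sketched in Python. This is an illustration under assumptions, not the Prometheus expression used by the alert: the function name `mtu_outliers` and the sample host names are hypothetical.

```python
from statistics import median

def mtu_outliers(mtu_by_host):
    """Return the hosts whose MTU differs from the median MTU."""
    if not mtu_by_host:
        return {}
    typical = median(mtu_by_host.values())
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != typical}

# Hypothetical sample data: three nodes at 1500, one at 9000.
sample = {"node1": 1500, "node2": 1500, "node3": 9000, "node4": 1500}
outliers = mtu_outliers(sample)
print(outliers)
```

Note that with an even number of hosts split evenly between two MTU values, the median falls between them and every host is reported, so a sketch like this is only meaningful when a clear majority shares one value.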
Daniël Vos
79568d51c6 mgr/prometheus: Fix 'pool filling up' with >50% usage
Fixes: https://tracker.ceph.com/issues/48354
Signed-off-by: Daniël Vos <danielvos@outlook.com>
2020-12-01 16:31:09 +01:00
Paul Cuzner
2010432b50 mgr/prometheus: Add healthcheck metric for SLOW_OPS
SLOW_OPS is triggered by the op tracker and generates a health
alert, but healthchecks do not create metrics for prometheus to
use as alert triggers. This change adds a SLOW_OPS metric and
provides a simple means to extend to other relevant health
checks in the future.

If extracting the value from the health check message fails,
we log an error and remove the metric from the metric set. In
addition, the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2020-11-02 15:30:49 +13:00
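The failure handling described in commit 2010432b50 (log an error and drop the metric when the value cannot be extracted) can be sketched as follows. This is not the mgr/prometheus code: the metric name `healthcheck_slow_ops`, the function `update_slow_ops`, and the assumed message shape ("<count> slow ops, ...") are all hypothetical and chosen for illustration.

```python
import logging
import re

log = logging.getLogger("slow_ops_sketch")

# Assumed message shape: a leading integer count, e.g.
# "12 slow ops, oldest one blocked for 35 sec".
_LEADING_COUNT = re.compile(r"^\s*(\d+)\s")

def update_slow_ops(metrics, message, name="healthcheck_slow_ops"):
    """Set the metric from the message, or remove it if parsing fails."""
    match = _LEADING_COUNT.match(message)
    if match is None:
        log.error("could not extract slow ops count from %r", message)
        metrics.pop(name, None)  # drop the stale value rather than keep it
        return metrics
    metrics[name] = int(match.group(1))
    return metrics

metrics = {}
update_slow_ops(metrics, "12 slow ops, oldest one blocked for 35 sec")
good = dict(metrics)
update_slow_ops(metrics, "unexpected message")
bad = dict(metrics)
```

Removing the metric on failure means an alert rule never evaluates a stale value left over from an earlier, successful parse.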
Benoît Knecht
653c3f6682 monitoring: Fix "10% OSDs down" alert description
The alert was triggered when less than 90% of OSDs were _up_, but then the
description took that value and described it as the percentage of OSDs being
_down_. So with 12% of OSDs down, the alert description would read:

```
88% or 88 of 100 OSDs are down (>=10%).
```

which can be panic-inducing.

This commit changes the alert expression to actually compute the ratio of OSDs
being down, which makes the correct value appear in the description.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2020-05-06 18:49:26 +02:00
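The arithmetic behind commit 653c3f6682 can be checked with a small Python sketch. It uses the numbers from the commit message (88 of 100 OSDs up) and is not the alert expression itself; the function name `osd_down_percent` is hypothetical.

```python
def osd_down_percent(up, total):
    """Percentage of OSDs that are down, given the up count and the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - up) * 100 / total

up, total = 88, 100

# The old description reported the share of OSDs that are up ...
up_percent = up * 100 / total
# ... while the alert is about the share that is down.
down_percent = osd_down_percent(up, total)

# The alert threshold from the description: at least 10% down.
should_fire = down_percent >= 10
print(up_percent, down_percent, should_fire)
```

The alert fires in both versions; only the value shown in the description differs, 88 before the fix and 12 after it.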
Volker Theile
e197e4d7f4 monitoring: alert for pool fill up broken
Fixes: https://tracker.ceph.com/issues/44991
Signed-off-by: Volker Theile <vtheile@suse.com>
2020-04-08 15:02:45 +02:00
Volker Theile
a5ade11a31 Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert
monitoring: root volume full alert fires false positives

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Jan Fajerski <jfajerski@suse.com>
Reviewed-by: Volker Theile <vtheile@suse.com>
2020-04-06 14:17:17 +02:00
Patrick Seidensal
6935dc5592 monitoring: alert for prediction of disk and pool fill up broken
Fixes: https://tracker.ceph.com/issues/44776

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 13:44:28 +01:00
Patrick Seidensal
f8e347f771 monitoring: root volume full alert fires false positives
Fixes: https://tracker.ceph.com/issues/44780

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-03-27 11:06:08 +01:00
James Cheng
1b980ef88c monitoring: Fix pool capacity incorrect
Signed-off-by: James Cheng <james59988@gmail.com>
2020-02-18 19:19:13 +08:00
Aleksei Zakharov
4eb58f7ccc monitoring/grafana,prometheus: add per-pool pg states support
Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>
2020-01-29 17:28:36 +03:00
Patrick Seidensal
fb51c589b5 monitoring: add details to Prometheus' alerts
Fixes: https://tracker.ceph.com/issues/43764

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2020-01-24 14:21:31 +01:00
Jan Fajerski
e098536acc Merge pull request #32325 from Kriechi/fix-42982
monitoring: fix prometheus alert for full pools
2020-01-20 10:42:36 +01:00
Thomas Kriechbaumer
9abddc0dd3 monitoring: fix prometheus alert for full pools
The existing alert (introduced via
https://tracker.ceph.com/issues/24977) already triggers while 50%
of the storage space is still available.

Fixes: https://tracker.ceph.com/issues/42982
Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>
2019-12-18 15:04:51 +01:00
Patrick Seidensal
d262adeb21 monitoring: fix indentation of ceph default alerts
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:40:55 +01:00
Patrick Seidensal
e923af3430 monitoring: wait before firing osd full alert
Fixes: https://tracker.ceph.com/issues/42862

Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
2019-11-18 12:39:27 +01:00
Volker Theile
8e6838c740 monitoring: SNMP OID per every Prometheus alert rule
Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md.

Example OID:
1.3.6.1.4.1.50495.15.1.2.2.1

All alert rule OIDs are located below the object identifier 15 (15 for p, which is the first character of prometheus). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details.

Signed-off-by: Volker Theile <vtheile@suse.com>
2019-05-28 09:59:50 +02:00
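The OID layout described in commit 8e6838c740 can be illustrated with a short Python sketch. It only checks whether an OID sits below the alert subtree and returns the remaining components; it does not interpret them, since their meaning is defined by the schema linked in the commit message. The names `ALERT_ROOT`, `parse_oid`, and `alert_suffix` are hypothetical, and deriving 15 as the zero-based alphabet index of "p" is one reading of the commit message, not something it states.

```python
# iso.org.dod.internet.private.enterprises
ENTERPRISES = (1, 3, 6, 1, 4, 1)
# Ceph enterprise number, as given in the commit message.
CEPH_ENTERPRISE = 50495
# Assumption: 15 is the zero-based position of "p" in the alphabet.
PROMETHEUS_SUBTREE = ord("p") - ord("a")

ALERT_ROOT = ENTERPRISES + (CEPH_ENTERPRISE, PROMETHEUS_SUBTREE)

def parse_oid(text):
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def alert_suffix(text):
    """Return the components below the alert root, or None if outside it."""
    oid = parse_oid(text)
    if oid[:len(ALERT_ROOT)] != ALERT_ROOT:
        return None
    return oid[len(ALERT_ROOT):]

# Example OID from the commit message.
suffix = alert_suffix("1.3.6.1.4.1.50495.15.1.2.2.1")
# An OID under a different enterprise number is rejected.
other = alert_suffix("1.3.6.1.4.1.2021.15.1")
print(suffix, other)
```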
Jan Fajerski
c0e58bd8ae monitoring: add a few prometheus alerts
Alerts are from
https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
but updated for the mgr module and node_exporter >= 0.15.

Signed-off-by: Jan Fajerski <jfajerski@suse.com>
2019-04-26 11:21:39 +02:00