RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2025-03-25 11:48:05 +00:00

Author	SHA1	Message	Date
Aashish Sharma	58d635455d	mgr/dashboard: Incorrect MTU mismatch warning The MTU mismatch warning was being fired for those NIC's as well that are in down state. This PR intends to fix this issue Fixes:https://tracker.ceph.com/issues/52028 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-09-02 15:34:36 +05:30
Aashish Sharma	8d2f39e6c5	mgr/dashboard:Simplify some complex calculations in test_alerts.yml run-promtool-unittests is failing with difference in floating point values in some complex calculations. This PR intends to simplify those calculations and fix this issue. Fixes: https://tracker.ceph.com/issues/49952 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-25 12:05:07 +05:30
Aashish Sharma	53a5816ded	mgr/dashboard:test prometheus rules through promtool This PR intends to add unit testing for prometheus rules using promtool. To run the tests run 'run-promtool-unittests.sh' file. Fixes: https://tracker.ceph.com/issues/45415 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-08 10:16:22 +05:30
Ernesto Puerta	dff5b78d3b	Merge pull request #39462 from rhcs-dashboard/fix-alerts-mtuMismatch mgr/dashboard: fix MTU Mismatch alert Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-17 14:14:17 +01:00
Ernesto Puerta	e2d73297cf	Merge pull request #38030 from p-se/prom-alert-package-drops-leeway mgr/dashboard: prometheus alerting: add some leeway for package drops and errors Reviewed-by: Stephan Müller <smueller@suse.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-16 20:45:44 +01:00
Patrick Seidensal	9ac248b0c3	mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%) Fixes: https://tracker.ceph.com/issues/48201 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-02-16 14:43:00 +01:00
Aashish Sharma	8527489b91	mgr/dashboard:fix MTU Mismatch alert This PR intends to fix the expression used for MTU Mismatch alert in prometheus Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-02-15 10:13:39 +05:30
Aashish Sharma	06cc0d8743	mgr/dashboard: trigger alert if some nodes have a MTU different than the median value This PR intends to alert a user if a specific network is configured with a custom MTU Fixes: https://tracker.ceph.com/issues/48748 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-01-22 11:20:13 +05:30
Daniël Vos	79568d51c6	mgr/prometheus: Fix 'pool filling up' with >50% usage Fixes: https://tracker.ceph.com/issues/48354 Signed-off-by: Daniël Vos <danielvos@outlook.com>	2020-12-01 16:31:09 +01:00
Paul Cuzner	2010432b50	mgr/prometheus: Add healthcheck metric for SLOW_OPS SLOW_OPS is triggered by op tracker, and generates a health alert but healthchecks do not create metrics for prometheus to use as alert triggers. This change adds SLOW_OPS metric, and provides a simple means to extend to other relevant health checks in the future If the extract of the value from the health check message fails we log an error and remove the metric from the metric set. In addition the metric description has changed to better reflect the scenarios where SLOW_OPS can be triggered. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-11-02 15:30:49 +13:00
Benoît Knecht	653c3f6682	monitoring: Fix "10% OSDs down" alert description The alert was triggered when less than 90% of OSDs were _up_, but then the description took that value and described it as the percentage of OSDs being _down_. So with 12% of OSDs down, the alert description would read: `88% or 88 of 100 OSDs are down (>=10%).` which can be panic-inducing. This commit changes the alert expression to actually compute the ratio of OSDs being down, which makes the correct value appear in the description. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>	2020-05-06 18:49:26 +02:00
Volker Theile	e197e4d7f4	monitoring: alert for pool fill up broken Fixes: https://tracker.ceph.com/issues/44991 Signed-off-by: Volker Theile <vtheile@suse.com>	2020-04-08 15:02:45 +02:00
Volker Theile	a5ade11a31	Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert monitoring: root volume full alert fires false positives Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-04-06 14:17:17 +02:00
Patrick Seidensal	6935dc5592	monitoring: alert for prediction of disk and pool fill up broken Fixes: https://tracker.ceph.com/issues/44776 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-27 13:44:28 +01:00
Patrick Seidensal	f8e347f771	monitoring: root volume full alert fires false positives Fixes: https://tracker.ceph.com/issues/44780 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-27 11:06:08 +01:00
James Cheng	1b980ef88c	monitoring: Fix pool capacity incorrect Signed-off-by: James Cheng <james59988@gmail.com>	2020-02-18 19:19:13 +08:00
Aleksei Zakharov	4eb58f7ccc	monitoring/grafana,prometheus: add per-pool pg states support Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>	2020-01-29 17:28:36 +03:00
Patrick Seidensal	fb51c589b5	monitoring: add details to Prometheus' alerts Fixes: https://tracker.ceph.com/issues/43764 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-01-24 14:21:31 +01:00
Jan Fajerski	e098536acc	Merge pull request #32325 from Kriechi/fix-42982 monitoring: fix prometheus alert for full pools	2020-01-20 10:42:36 +01:00
Thomas Kriechbaumer	9abddc0dd3	monitoring: fix prometheus alert for full pools The existing alert (introduced via https://tracker.ceph.com/issues/24977) already triggers when still 50% of storage space are available. Fixes: https://tracker.ceph.com/issues/42982 Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>	2019-12-18 15:04:51 +01:00
Patrick Seidensal	d262adeb21	monitoring: fix indentation of ceph default alerts Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2019-11-18 12:40:55 +01:00
Patrick Seidensal	e923af3430	monitoring: wait before firing osd full alert Fixes: https://tracker.ceph.com/issues/42862 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2019-11-18 12:39:27 +01:00
Volker Theile	8e6838c740	monitoring: SNMP OID per every Prometheus alert rule Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md. Example OID: 1.3.6.1.4.1.50495.15.1.2.2.1 All alert rule OIDs are located below the object identifier 15 (15 for p which is the first character of prometheus). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details. Signed-off-by: Volker Theile <vtheile@suse.com>	2019-05-28 09:59:50 +02:00
Jan Fajerski	c0e58bd8ae	monitoring: add a few prometheus alerts Alerts are from https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml but updated for the mgr module and node_exporter >= 0.15. Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-04-26 11:21:39 +02:00
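
The arithmetic behind commit 653c3f6682 ("10% OSDs down" alert description) can be illustrated with a minimal Python sketch. This is not the Prometheus rule from the repository. The function names and the description template are hypothetical, and only the figures (88 of 100 OSDs up, a 10% threshold) come from the commit message.

```python
# Hypothetical illustration of the bug described in commit 653c3f6682.
# The real alert is a Prometheus rule; none of these names exist in Ceph.

def up_percent(up: int, total: int) -> float:
    """Percentage of OSDs that are up (what the old expression computed)."""
    return 100.0 * up / total


def down_percent(up: int, total: int) -> float:
    """Percentage of OSDs that are down (what the description should report)."""
    return 100.0 * (total - up) / total


def describe(value: float, count: int, total: int) -> str:
    """Assumed description template, modelled on the text quoted in the commit."""
    return f"{value:g}% or {count} of {total} OSDs are down (>=10%)."


up, total = 88, 100

# Old behaviour: the "up" ratio was inserted into a sentence about "down" OSDs.
misleading = describe(up_percent(up, total), up, total)

# Fixed behaviour: compute the "down" ratio and alert on that value directly.
correct = describe(down_percent(up, total), total - up, total)
fires = down_percent(up, total) >= 10.0

print(misleading)
print(correct)
print(fires)
```

Both versions fire under the same condition (fewer than 90% of OSDs up is the same as at least 10% down). The difference is only in which value is carried into the description.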

24 Commits
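
The MTU mismatch check introduced in 06cc0d8743 and narrowed in 58d635455d (ignore NICs that are down) can be sketched in Python as follows. This is an assumption-laden illustration, not the Prometheus expression used in the repository: the record layout, the function name, and the choice of `median_low` (so that the reference value is always an MTU that actually occurs) are all hypothetical.

```python
from statistics import median_low

# Hypothetical sketch of the idea in commits 06cc0d8743 and 58d635455d.
# The real check is a Prometheus alert rule; this data layout is invented.


def mtu_mismatches(nics: list[dict]) -> list[str]:
    """Return the names of NICs that are up and whose MTU differs from the median.

    Each entry is assumed to look like {"name": str, "mtu": int, "up": bool}.
    NICs in the down state are skipped entirely, so they neither influence
    the median nor trigger a warning.
    """
    active = [nic for nic in nics if nic["up"]]
    if not active:
        return []
    # median_low always returns a value present in the data, which avoids
    # an averaged "median" such as 5250 for the pair (1500, 9000).
    reference = median_low([nic["mtu"] for nic in active])
    return [nic["name"] for nic in active if nic["mtu"] != reference]


nics = [
    {"name": "node1:eth0", "mtu": 1500, "up": True},
    {"name": "node2:eth0", "mtu": 1500, "up": True},
    {"name": "node3:eth0", "mtu": 1500, "up": True},
    {"name": "node4:eth0", "mtu": 9000, "up": True},   # genuine mismatch
    {"name": "node5:eth1", "mtu": 9000, "up": False},  # down, must be ignored
]

print(mtu_mismatches(nics))
```

Filtering on the link state before computing the median mirrors the intent stated in 58d635455d, where the warning was incorrectly fired for NICs in the down state.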