RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2025-01-02 09:02:34 +00:00

Author	SHA1	Message	Date
Paul Cuzner	4750ac0d77	mgr/prometheus: add test cases and validation using tox Focus all tests inside a tests directory, and use pytest/tox to perform validation of the overall content. tox tests also use promtool if available to provide rule checks and unittest runs. In addition to these checks a validate_rules script provides the format, and content checks against all rules - which is also called via tox (but can be run independently too) Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-22 13:36:40 +13:00
Paul Cuzner	e0dfc02063	mgr/prometheus: track individual healthchecks as metrics This patch creates a health history object maintained in the modules kvstore. The history and current health checks are used to create a metric per healthcheck whilst also providing a history feature. Two new commands are added: ceph healthcheck history ls ceph healthcheck history clear In addition to the new commands, the additional metrics have been used to update the prometheus alerts Fixes: https://tracker.ceph.com/issues/52638 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-22 13:32:39 +13:00
Aashish Sharma	ed954b0e6c	mgr/dashboard: monitoring: grafonnet refactoring for cephfs dashboards This PR intends to refactor cephfs dashboards using grafonnet Fixes:https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:36:31 +05:30
Aashish Sharma	e490e2f3ab	mgr/dashboard: monitoring: grafonnet refactoring for osds dashboards This PR intends to refactor osds dashboards using grafonnet Fixes:https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:13:50 +05:30
Aashish Sharma	8c48821c21	mgr/dashboard: monitoring: grafonnet refactoring for pools dashboards This PR intends to refactor pools dashboards using grafonnet Fixes:https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:10:56 +05:30
Aashish Sharma	e737aaa000	mgr/dashboard: monitoring: grafonnet refactoring for rbd dashboards This PR intends to refactor rbd dashboards using grafonnet Fixes:https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:09:04 +05:30
Aashish Sharma	eb01954cd9	mgr/dashboard: monitoring: grafonnet refactoring for radosgw dashboards This PR intends to refactor radosgw dashboards using grafonnet Fixes:https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 11:57:28 +05:30
Ernesto Puerta	19535b1d0e	Merge pull request #43469 from rhcs-dashboard/hosts-grafana-dashboards mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-10-18 17:14:03 +02:00
Ernesto Puerta	9b40c9df26	Merge pull request #43377 from rhcs-dashboard/fix-clients-connection-query mgr/dashboard: replace "Ceph-cluster" Client connections with active-standby MGRs Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Greg Farnum <gfarnum@redhat.com> Reviewed-by: neha-ojha <NOT@FOUND> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-10-13 13:37:51 +02:00
Sebastian Wagner	53382d70eb	Merge pull request #43274 from pcuzner/add-mib monitoring:Adding the Ceph MIB Reviewed-by: Sebastian Wagner <sewagner@redhat.com>	2021-10-12 22:29:06 +02:00
Aashish Sharma	f7714de294	mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards This PR intends to refactor hosts dashboards using grafonnet Fixes:https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-12 11:05:02 +05:30
Avan Thakkar	d388c5e958	mgr/dashboard: replace Client connections with active-stdby mgrs Fixes: https://tracker.ceph.com/issues/52121 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2021-10-11 21:53:23 +05:30
Paul Cuzner	b96aa5d184	monitoring:Updated README Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-06 14:32:47 +13:00
Ernesto Puerta	ba9e17d2d2	Merge pull request #43132 from p-se/monitoring-grafana-piechart-update monitoring: update grafana-piechart-panel plugin Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-09-28 18:37:45 +02:00
Paul Cuzner	f9213ad9cf	monitoring:Adding the Ceph MIB The ceph MIB has been created and maintained in a separate repo: https://github.com/SUSE/prometheus-webhook-snmp This patch brings this MIB into the main ceph repo, so alert changes can target prometheus and potentially SNMP environments within the same PR. Kudos to Volker Theile for creating the MIB. Fixes: https://tracker.ceph.com/issues/52708 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-09-23 11:06:19 +12:00
Patrick Seidensal	af94237621	monitoring: update grafana-piechart-panel plugin Fixes: https://tracker.ceph.com/issues/51211 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-09-10 15:28:17 +02:00
Aashish Sharma	58d635455d	mgr/dashboard: Incorrect MTU mismatch warning The MTU mismatch warning was being fired for those NIC's as well that are in down state. This PR intends to fix this issue Fixes:https://tracker.ceph.com/issues/52028 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-09-02 15:34:36 +05:30
Kefu Chai	1835fd86dd	cmake: exclude "grafonnet-lib" target from "all" so we don't build this target when running "make", and hence avoid accessing the internet in a building environment where internet access is not allowed. Signed-off-by: Kefu Chai <kchai@redhat.com>	2021-08-20 22:50:42 +08:00
Kefu Chai	1fdd632d0c	cmake: silence build output when building external deps when downloading/building grafonnet-lib, dpdk, spdk, liburing and fio, they dump lots of output during configuration and building phases, all of which is irrelevant to us. so let's just silence it. Signed-off-by: Kefu Chai <kchai@redhat.com>	2021-08-16 21:27:57 +08:00
Ernesto Puerta	559afae0b9	Merge pull request #41570 from jhrcz-ls/wip-cephfs-overview-use-rate mgr/dashboard: cephfs MDS Workload to use rate for counter type metric	2021-08-12 20:53:07 +02:00
Aashish Sharma	4907c78bb7	mgr/dashboard: fix grafonnet build error This PR tends to fix the issue caused by #42194 Fixes:https://tracker.ceph.com/issues/52238 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-08-12 17:48:33 +05:30
Ernesto Puerta	afadfede0d	Merge pull request #42194 from rhcs-dashboard/add-grafonnet-grafana mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based code	2021-08-11 18:11:59 +02:00
Aashish Sharma	e9bd94515f	mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based Code This PR intends to add grafonnet to generate grafana JSON files Fixes: https://tracker.ceph.com/issues/45184 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-08-11 19:23:54 +05:30
Ernesto Puerta	cc6b18a92c	Merge pull request #41880 from david-caro/fix_cluster_grafana_dashboard monitoring/grafana/cluster: use per-unit max and limit values Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-08-02 13:03:46 +02:00
Jan Horáček	5bf516dcc7	[mgr/dashboard] cephfs metrics in MDS Workload panels to use rate because of counter type metric Fixes: https://tracker.ceph.com/issues/51954 Signed-off-by: Jan Horacek <jan.horacek@livesport.eu>	2021-07-29 10:09:41 +02:00
Seena Fallah	feb8f784d2	monitoring: fix Physical Device Latency unit Based on the expr it should be seconds Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2021-07-07 17:00:30 +04:30
Ernesto Puerta	62e3a5c41c	Merge pull request #41838 from p-se/grafana-clean-up monitoring: Clean up Grafana dashboards Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: jan--f <NOT@FOUND> Reviewed-by: p-se <NOT@FOUND> Reviewed-by: Paul Cuzner <pcuzner@redhat.com>	2021-06-25 20:45:28 +02:00
David Caro	c981298039	monitoring/grafana/cluster: use per-unit max and limit values The value we get is a perunit, so the limits and the max value should be over 1, not 100. Note that the value being shown was correct, it was the gauge that was not showing the correct indicators. Signed-off-by: David Caro <david@dcaro.es>	2021-06-16 10:38:41 +02:00
Patrick Seidensal	037410713f	monitoring: remove instance label from ceph-cluster.json completely The `instance` label is only useful if - the exporter returns only data about its node or instance - the exporter provides an instance label and then may return data about other nodes In this case, it's about the Prometheus mgr module, which is a single exporter providing data about a whole cluster, so not only data related to the node (or instance) the mgr module is running on. It is completely irrelevant on which node the exporter runs on, the data provided doesn't change. The exporter also doesn't provide `instance` labels (which Prometheus wouldn't change due to our configuration, see "honor_labels" setting). (Actually there's one exception where `instance` labels are provided by the Ceph mgr module, but that doesn't affect the Ceph Cluster dashboard.) Note that keeping that instance label on this particular dashboard would enable the user to switch between a previously failed mgr instance and the data collected from there and the currently running mgr instance (on which the Prometheus mgr module runs on). That'd split the data, which I don't think is a useful feature, but rather looks broken. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:11:30 +02:00
Patrick Seidensal	4270a13d6c	mgr/dashboard: Fix Grafana Ceph Cluster health status widget The health status widget doesn't show any status because it requires its query to return a single result. But in case a mgr instance had failed, it would return more, provided the incident has happened in the requested time frame. This is simply an issue of the `instant` switch being disabled for that widget. As only one mgr instance can ever be providing data at a time, enabling `instant` completely solves that issue. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	f51cab109d	mgr/dashboard: Fix decimals in OSC Capacity Utilization widget Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	5527c1c54f	mgr/dashboard: Remove hard-coded timezone off Grafana dashboards Remove hard-coded timezone off Grafana dashboards to enable the Grafana administrator to decide which timezone should be used for dashboards. If we hard-coded those values, changing the global settings in Grafana wouldn't have an effect. And the administrators can't change the automatically imported Grafana dashboards provided by us. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	8218d43e5f	monitoring: convert newline character to LF Convert newline character from CRLF in `rbd-details.json` to LF, so that it will be consistent with all the other dashboard JSON files. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	a709abf8bf	mgr/dashboard: deprecated variable usage in Grafana dashboards Fixes: https://tracker.ceph.com/issues/50059 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-07 14:31:53 +02:00
Dan Mick	de491c128a	monitoring/grafana/build/Makefile: work around buildah bug Workaround https://github.com/containers/buildah/issues/3253 by pushing to a local OCI-format image to clear out erroneously-left 'parent' field in buildah commit --squash output. Can be removed when the fix for the above is available. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	b56ff43232	monitoring/grafana/build/Makefile: use --authfile podman login caches auth tokens in auth.json; for sudo, it may be placed in /run/containers/0 or it may be in /run/users/0/containers; the latter directory is removed when root "logs out", which isn't clear what it means with sudo/su. Several builds failed because they couldn't find the cached auth between sudo podman login and sudo podman push. Sidestep the confusion by just using a local file for the auth cache. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	a3b4bc73f7	monitoring/grafana/build/Makefile: cleanup, ready for jenkins - allow env setting of versions of components - add docker/quay username/password variables - derive container version from grafana version - make arch-specific tags - expand clean target to remove container images - remove release-specific targets, "all" target - move push operations to separate "push" target Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	0fdbe673c8	monitoring/grafana/build/Makefile: use curl instead of wget build machines tend to already have curl installed Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	2faadc2d5c	monitoring/grafana/build/Makefile: use "sudo buildah" Some build machines don't have /etc/sub{u,g}id set up for so-called "rootless" (non-privileged) operation. Use sudo to avoid the need for "rootless". Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	9d37c6efbd	monitoring/grafana/build/Makefile: pull dashboards from local dir Use the dashboard definition files in this workspace directly Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	444d6f6623	monitoring/grafana/build/Makefile: Add ARCH variable Allow building for other archs, in particular arm64 Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	508b1d387f	monitoring/grafana/build/Makefile: fully qualify source image Some build machines may not have a default docker repo configured. Specify docker.io. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:24 -07:00
Ernesto Puerta	ac5d24e5ca	mgr/dashboard: remove non-null id in Grafana dashb Testing added to prevent this situation. Fixes: https://tracker.ceph.com/issues/50918 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2021-05-21 13:54:48 +02:00
Alfonso Martínez	7d79efb025	mgr/dashboard: fix OSDs Host details/overview grafana graphs Fixes: https://tracker.ceph.com/issues/50686 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2021-05-07 15:38:07 +02:00
Ernesto Puerta	458ad48024	Merge pull request #40715 from pcuzner/pool-overview-enhancement mgr/dashboard:include compression stats on pool dashboard Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-05-05 18:08:58 +02:00
Paul Cuzner	81788b1f21	mgr/dashboard:include compression stats on pool dashboard This is a replacement dashboard configuration for the pool overview page. It provides a cluster wide view of capacity consumed and compression effectiveness, and breaks this down by each pool within the configuration. Fixes: https://tracker.ceph.com/issues/50226 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-05-03 12:26:06 +12:00
Ernesto Puerta	381685f17f	Merge pull request #40072 from wornet-mwo/dashboard--grafana-hostname-corrections mgr/dashboard: Fixed name clash when hostname similar to another Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-04-29 19:40:57 +02:00
Michael Wodniok	e97e27ebdb	dashboard: Fixed name clash when hostname similar to another Fixes: #49769 Signed-off-by: Michael Wodniok <wodniok@wor.net>	2021-04-27 08:42:59 +02:00
Malcolm Holmes	382e293656	monitoring/grafana: Remove erroneous elements in hosts-overview Grafana dashboard The hosts-overview Grafana dashboard json file contains a repeated element, making it invalid JSON. Some JSON parsers handle this. However, this prevents Jsonnet from parsing the dashboard, which prevents the deployment of this dashboard via Jsonnet. Fixes: https://tracker.ceph.com/issues/50410 Signed-off-by: Malcolm Holmes <mdh@odoko.co.uk>	2021-04-17 23:11:48 +01:00
Aashish Sharma	8d2f39e6c5	mgr/dashboard:Simplify some complex calculations in test_alerts.yml run-promtool-unittests is failing with difference in floating point values in some complex calculations. This PR intends to simplify those calculations and fix this issue. Fixes: https://tracker.ceph.com/issues/49952 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-25 12:05:07 +05:30
Aashish Sharma	53a5816ded	mgr/dashboard:test prometheus rules through promtool This PR intends to add unit testing for prometheus rules using promtool. To run the tests run 'run-promtool-unittests.sh' file. Fixes: https://tracker.ceph.com/issues/45415 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-08 10:16:22 +05:30
Ernesto Puerta	dff5b78d3b	Merge pull request #39462 from rhcs-dashboard/fix-alerts-mtuMismatch mgr/dashboard: fix MTU Mismatch alert Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-17 14:14:17 +01:00
Ernesto Puerta	e2d73297cf	Merge pull request #38030 from p-se/prom-alert-package-drops-leeway mgr/dashboard: prometheus alerting: add some leeway for package drops and errors Reviewed-by: Stephan Müller <smueller@suse.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-16 20:45:44 +01:00
Patrick Seidensal	9ac248b0c3	mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%) Fixes: https://tracker.ceph.com/issues/48201 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-02-16 14:43:00 +01:00
Aashish Sharma	8527489b91	mgr/dashboard:fix MTU Mismatch alert This PR intends to fix the expression used for MTU Mismatch alert in prometheus Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-02-15 10:13:39 +05:30
Aashish Sharma	06cc0d8743	mgr/dashboard: trigger alert if some nodes have a MTU different than the median value This PR intends to alert a user if a specific network is configured with a custom MTU Fixes: https://tracker.ceph.com/issues/48748 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-01-22 11:20:13 +05:30
Alfonso Martínez	9441fda4dc	mgr/dashboard/monitoring: upgrade Grafana version due to CVE-2020-13379 Fixes: https://tracker.ceph.com/issues/48685 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2021-01-07 16:53:26 +01:00
Kefu Chai	30487c755c	Merge pull request #38282 from vosdev/ceph-pool-alert mgr/prometheus: Fix 'pool filling up' with >50% usage Reviewed-by: Patrick Seidensal <pseidensal@suse.com>	2020-12-12 12:10:44 +08:00
Daniël Vos	79568d51c6	mgr/prometheus: Fix 'pool filling up' with >50% usage Fixes: https://tracker.ceph.com/issues/48354 Signed-off-by: Daniël Vos <danielvos@outlook.com>	2020-12-01 16:31:09 +01:00
haoyixing	0e7e036aa7	doc/dev: use http://docs.ceph.com/en/latest/ instead of /docs/master/ for docs Several links under http://docs.ceph.com/docs/master/ could not be accessed. Change them to http://docs.ceph.com/en/latest so we can access them directly. Signed-off-by: haoyixing <haoyixing@kuaishou.com>	2020-11-24 12:49:47 +08:00
Paul Cuzner	2010432b50	mgr/prometheus: Add healthcheck metric for SLOW_OPS SLOW_OPS is triggered by op tracker, and generates a health alert but healthchecks do not create metrics for prometheus to use as alert triggers. This change adds SLOW_OPS metric, and provides a simple means to extend to other relevant health checks in the future If the extract of the value from the health check message fails we log an error and remove the metric from the metric set. In addition the metric description has changed to better reflect the scenarios where SLOW_OPS can be triggered. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-11-02 15:30:49 +13:00
Seena Fallah	0fd28f646c	monitoring: Use null yaxes min for OSD read latency According to seriesOverrides that negative-Y for read param there shouldn't be a minimum for yaxes Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2020-10-12 19:56:18 +03:30
Patrick Seidensal	fe64b9d176	mgr/dashboard: Fix many-to-many issue in host-details dashboard The labels on one side do not match the labels of the other side, where a label_replace is used. The fix uses the same label_replace on the missing side. Fixes: https://tracker.ceph.com/issues/47334 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-09-07 12:37:40 +02:00
Avan Thakkar	f039e5585d	mgr/dashboard: cpu stats incorrectly displayed Fixes: https://tracker.ceph.com/issues/46683 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-07-23 11:57:32 +05:30
pcuzner	0021dd278b	Merge pull request #35610 from pcuzner/wip-grafana-container monitoring: add grafana container build file	2020-07-06 13:06:55 +12:00
Lenz Grimmer	399521d66b	Merge pull request #34532 from rhcs-dashboard/wip-45068-fix-parse-error mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-06-30 10:50:59 +02:00
Paul Cuzner	3c813729dc	monitoring:add grafana container build file This commit provides the Makefile to create the ceph-grafana containers for nautilus, octopus and master releases. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-06-17 17:20:45 +12:00
Kiefer Chang	b963b7fbe9	monitoring: fixing some issues in RBD detail dashboard - Exchange read/write legends in The `I/O Bytes per second` panel. - Rename `I/O Bytes per second` to `Throughput`. - Rename `IOPS Count` to just `IOPS`. - Remove instance name from legends. - Fixes typos: `Averange` -> `Average`. Fixes: https://tracker.ceph.com/issues/45735 Signed-off-by: Kiefer Chang <kiefer.chang@suse.com>	2020-05-28 14:49:31 +08:00
Alfonso Martínez	cf4ff7d2f0	mgr/dashboard: grafana panels for rgw multisite sync performance * RGW sync perf. counters are now exposed through grafana panels. * Sync Performance tab is only shown if rgw realm is detected. * Prometheus module: added metrics suitable for prometheus consumption (from existing ones, not replacing for backward compatibility). Fixes: https://tracker.ceph.com/issues/45310 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2020-05-22 13:36:10 +02:00
Benoît Knecht	653c3f6682	monitoring: Fix "10% OSDs down" alert description The alert was triggered when less than 90% of OSDs were _up_, but then the description took that value and described it as the percentage of OSDs being _down_. So with 12% of OSDs down, the alert description would read: ``` 88% or 88 of 100 OSDs are down (>=10%). ``` which can be panic-inducing. This commit changes the alert expression to actually compute the ratio of OSDs being down, which makes the correct value appear in the description. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>	2020-05-06 18:49:26 +02:00
Lenz Grimmer	9334471340	Merge pull request #33991 from SchoolGuy/monitoring/rbd-image-details mgr/dashboard/grafana: Add rbd-image details dashboard Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Laura Paduano <lpaduano@suse.com> Reviewed-by: Patrick Seidensal <pnawracay@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-05-04 09:59:53 +02:00
Enno Gotthold	dfb1e0020e	mgr/dashboard: Remove additional unneeded steps for the metrics calculation Signed-off-by: Enno Gotthold <egotthold@suse.de>	2020-04-28 13:34:16 +02:00
Ernesto Puerta	3fd804f10b	monitoring: fix decimal precision in Grafana % Set decimal precision to 2 positions for charts using percentunits. Fixes: https://tracker.ceph.com/issues/45183 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2020-04-22 13:39:16 +02:00
Avan Thakkar	47b515c094	mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Fixes: https://tracker.ceph.com/issues/45068 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-04-21 23:03:09 +05:30
Volker Theile	e197e4d7f4	monitoring: alert for pool fill up broken Fixes: https://tracker.ceph.com/issues/44991 Signed-off-by: Volker Theile <vtheile@suse.com>	2020-04-08 15:02:45 +02:00
Volker Theile	a5ade11a31	Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert monitoring: root volume full alert fires false positives Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-04-06 14:17:17 +02:00
Lenz Grimmer	b6ad9a804b	Merge pull request #34240 from krig/grafana-dashboards-fixes mgr/dashboard: Repair broken grafana panels Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Stephan Müller <smueller@suse.com>	2020-04-06 10:55:20 +02:00
Patrick Seidensal	6935dc5592	monitoring: alert for prediction of disk and pool fill up broken Fixes: https://tracker.ceph.com/issues/44776 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-27 13:44:28 +01:00
Kristoffer Grönlund	b7abaab5bd	dashboard: Convert FQDN to hostname in grafana panels The $ceph_hosts variable contained the FQDN for hosts while the instance label created by ceph only has the hostname. Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:33:15 +01:00
Kristoffer Grönlund	136d21e21d	dashboard: Resolve FQDN / hostname mismatch in hosts overview panel In the AVG Disk Utilization panel, the result is calculated by combining the output of node_disk_io_time_seconds_total with the output of ceph_disk_occupation. However, the first vector encodes the instance label with the full FQDN while the ceph label only contains the hostname:port. In order for these to match correctly, the domain name and port has to be stripped from the labels. Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:33:09 +01:00
Kristoffer Grönlund	8b61b8d3d7	dashboard: Use exported_instance to identify OSDs When moving to LVM-based ceph-volume setups, several grafana dashboards stopped working. The problem is that (device, instance) no longer results in unique labels which causes errors like: "many-to-many matching not allowed: matching labels must be unique on one side" Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:33:01 +01:00
Kristoffer Grönlund	4444333243	dashboard: AVG RAM Utilization panel always showed "N/A" The references to `$osd_hosts` etc. were encoded as `[[osd_hosts]]` in the PromQL expression divisor, and the panel always displayed N/A as the result of the query. Replacing the `[[...]]` with `$...` makes the expression work again. Fixes: https://tracker.ceph.com/issues/44784 Signed-off-by: Kristoffer Grönlund <kgronlund@suse.com>	2020-03-27 12:32:52 +01:00
Patrick Seidensal	f8e347f771	monitoring: root volume full alert fires false positives Fixes: https://tracker.ceph.com/issues/44780 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-27 11:06:08 +01:00
Kefu Chai	a12f9f19e0	Merge pull request #32749 from james58899/fix-capacity monitoring: Fix pool capacity incorrect Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com>	2020-03-27 16:13:29 +08:00
Enno Gotthold	9707cb30cb	mgr/dashboard: Add grafana chart for rbd image details Fixes: https://tracker.ceph.com/issues/44623 Signed-off-by: Enno Gotthold <egotthold@suse.de> This dashboard will per default be empty as the already existing dashboard with the summary for all rbd images.	2020-03-26 08:21:30 +01:00
Alfonso Martínez	1f0cddfafc	monitoring: fix RGW grafana chart 'Average GET/PUT Latencies' Fixes: https://tracker.ceph.com/issues/44538 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2020-03-10 12:05:26 +01:00
Patrick Seidensal	1794b55e64	monitoring: restore lost `pool full` alert Fixes: https://tracker.ceph.com/issues/44366 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-03-02 11:43:03 +01:00
James Cheng	1b980ef88c	monitoring: Fix pool capacity incorrect Signed-off-by: James Cheng <james59988@gmail.com>	2020-02-18 19:19:13 +08:00
Avan Thakkar	dd8cb9d2d6	mgr/dashboard: UI fixes Fixes: https://tracker.ceph.com/issues/42914 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-02-10 22:57:57 +05:30
Aleksei Zakharov	a37cf380ad	mgr/grafana: sum pg states for cluster Also, revert table formatting. Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>	2020-01-29 17:28:36 +03:00
Aleksei Zakharov	4eb58f7ccc	monitoring/grafana,prometheus: add per-pool pg states support Signed-off-by: Aleksei Zakharov <zaharov@selectel.ru>	2020-01-29 17:28:36 +03:00
Patrick Seidensal	fb51c589b5	monitoring: add details to Prometheus' alerts Fixes: https://tracker.ceph.com/issues/43764 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-01-24 14:21:31 +01:00
Jan Fajerski	e098536acc	Merge pull request #32325 from Kriechi/fix-42982 monitoring: fix prometheus alert for full pools	2020-01-20 10:42:36 +01:00
Bryan Stillwell	8eafb09acb	Switch spelling of utilization Prefer the non-British spelling of utilization since that's what the majority of the code base seems to use. Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>	2020-01-07 16:57:36 -07:00
Thomas Kriechbaumer	9abddc0dd3	monitoring: fix prometheus alert for full pools The existing alert (introduced via https://tracker.ceph.com/issues/24977) already triggers when still 50% of storage space are available. Fixes: https://tracker.ceph.com/issues/42982 Signed-off-by: Thomas Kriechbaumer <thomas@kriechbaumer.name>	2019-12-18 15:04:51 +01:00
Lenz Grimmer	11a1708e19	mgr/dashboard: grafana charts match time picker selection. (#31964 ) mgr/dashboard: grafana charts match time picker selection. Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Laura Paduano <lpaduano@suse.com> Reviewed-by: Patrick Seidensal <pnawracay@suse.com>	2019-12-03 17:09:00 +00:00
Alfonso Martínez	5ba114330e	mgr/dashboard: grafana charts match time picker selection. Fixes: https://tracker.ceph.com/issues/43097 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2019-12-03 14:15:10 +01:00
Ernesto Puerta	1182073f0c	mgr/dashboard,grafana: remove shortcut menu Remove shortcut menu (links) and add check in grafana CI script. Fixes: https://tracker.ceph.com/issues/43091 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2019-12-03 10:21:35 +01:00
Patrick Seidensal	d262adeb21	monitoring: fix indentation of ceph default alerts Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2019-11-18 12:40:55 +01:00
Patrick Seidensal	e923af3430	monitoring: wait before firing osd full alert Fixes: https://tracker.ceph.com/issues/42862 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2019-11-18 12:39:27 +01:00
Radu Toader	3beaf63761	mgr/dashboard: fix grafana dashboards Fixes: https://tracker.ceph.com/issues/42542 Sort order was wrong for some dashboards, fixed empty / buggy Top 3 clients IOPS by pool / Throughput - in Pools Overall performance fixed Avg utilization Multiple series found - in Host Overall performance Fixed invalid dimensions for plot - in OSD Overall performance Signed-off-by: Radu Toader <radu.m.toader@gmail.com>	2019-10-30 11:03:03 +02:00
Volker Theile	8e6838c740	monitoring: SNMP OID per every Prometheus alert rule Use the Ceph enterprise OID 50495 (https://www.iana.org/assignments/enterprise-numbers/enterprise-numbers) and create OIDs for every Prometheus alert rule according to the schema at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/README.md. Example OID: 1.3.6.1.4.1.50495.15.1.2.2.1 All alert rule OIDs are located below the object identifier 15 (15 for p which is the first character of prometheus). Check out the MIB at https://github.com/SUSE/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt for more details. Signed-off-by: Volker Theile <vtheile@suse.com>	2019-05-28 09:59:50 +02:00
Jan Fajerski	e7a4437fdc	monitoring: update Grafana dashboards Fix various panels that used outdated metric names or clunky or unnecessary label_replace calls. Also unify the style of many panels. Fixes: http://tracker.ceph.com/issues/39652 Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-05-14 13:47:55 +02:00
Jan Fajerski	c0e58bd8ae	monitoring: add a few prometheus alerts Alerts are from https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml but updated for the mgr module and node_exporter >= 0.15. Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-04-26 11:21:39 +02:00
Jan Fajerski	287e209351	monitoring/grafana: fix typo in README Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2019-04-16 14:19:51 +02:00
Neha Gupta	739fdbad37	mgr/dashboard: Fixed performance details context for host list row selection Fixes: http://tracker.ceph.com/issues/37854 Signed-off-by: Neha Gupta <gnehapk@gmail.com>	2019-01-18 13:36:49 +09:00
Jason Dillaman	f4ac899950	monitoring/grafana: new RBD overview dashboard page This page pulls RBD stats from the Nautilus prometheus exporter. Signed-off-by: Jason Dillaman <dillaman@redhat.com>	2019-01-11 16:41:46 -05:00
Boris Ranto	1ade714910	cmake: Support grafana dashboard installation We are currently hosting the grafana dashboards in our repo but we do not install them. This patch adds the cmake support. Signed-off-by: Boris Ranto <branto@redhat.com>	2018-10-25 17:09:02 +02:00
Lenz Grimmer	94aefee3b0	Merge pull request #24314 from rhcs-dashboard/dashboards mgr/dashboard: Grafana dashboard updates and additions Reviewed-by: Boris Ranto <branto@redhat.com>	2018-10-19 12:42:23 +02:00
Paul Cuzner	a848411bd8	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	a99618ce41	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	b64289ca3d	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	5432470914	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	bc5eea09c8	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	ba1a3b3a09	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	f97fee3a83	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	02b5414d19	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	7c04098e68	MGR/dashboard: make grafana datasource selectable Grafana dashboard updated to use a templating variable for the datasource Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	2c346efd12	Fix linewidth issue in pools overview dashboard Linewidth was set to two, but the idea is that a linewidth of >1 is reserved for eye-catcher plot lines like maximums Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	b84f0ce45f	Refresh of the dashboards Fixes some minor anomalies and tested against node_exporter 0.15 and 0.16 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	7d97bb28a8	Updated requirements information Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	0e655f8400	Added new Overview dashboards These new dashboard definitions provide the high level views for the hosts in the cluster and the OSDs. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	4292a7a357	Screenshots added for all dashboards Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	3c7c32f2ed	Add Host level details dashboard The host-details.json file provides a view of host level metrics. The panels are arranged in two rows: Overview (CPU/RAM/network related stats) and OSD Performance (OSD physical drive stats). The Overview row is shown by default; click on the OSD Performance row to show the remaining graphs. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Paul Cuzner	a0d9325c4d	Document the current state of the dashboards Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:26:08 +13:00
Paul Cuzner	8ebf2ede7f	Initial grafana dashboard definitions Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2018-10-09 08:23:39 +13:00
Maxime	68b044a75e	[grafana] Fix OSD Capacity Utilization graph Signed-off-by: Maxime <maxime@root314.com>	2018-10-04 13:44:12 +02:00
Jan Fajerski	7e7ae7a0fe	add monitoring subdir and Grafana cluster dashboard Signed-off-by: Jan Fajerski <jfajerski@suse.com>	2018-05-07 14:25:29 +02:00


228 Commits