RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2025-01-01 08:32:24 +00:00

Author	SHA1	Message	Date
Seena Fallah	feb8f784d2	monitoring: fix Physical Device Latency unit Based on the expr it should be seconds Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2021-07-07 17:00:30 +04:30
Ernesto Puerta	62e3a5c41c	Merge pull request #41838 from p-se/grafana-clean-up monitoring: Clean up Grafana dashboards Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: jan--f <NOT@FOUND> Reviewed-by: p-se <NOT@FOUND> Reviewed-by: Paul Cuzner <pcuzner@redhat.com>	2021-06-25 20:45:28 +02:00
Patrick Seidensal	037410713f	monitoring: remove instance label from ceph-cluster.json completely The `instance` label is only useful if - the exporter returns only data about its node or instance - the exporter provides an instance label and then may return data about other nodes In this case, it's about the Prometheus mgr module, which is a single exporter providing data about a whole cluster, so not only data related to the node (or instance) the mgr module is running on. It is completely irrelevant which node the exporter runs on, the data provided doesn't change. The exporter also doesn't provide `instance` labels (which Prometheus wouldn't change due to our configuration, see "honor_labels" setting). (Actually there's one exception where `instance` labels are provided by the Ceph mgr module, but that doesn't affect the Ceph Cluster dashboard.) Note that keeping that instance label on this particular dashboard would enable the user to switch between a previously failed mgr instance and the data collected from there and the currently running mgr instance (on which the Prometheus mgr module runs). That'd split the data, which I don't think is a useful feature, but rather looks broken. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:11:30 +02:00
Patrick Seidensal	4270a13d6c	mgr/dashboard: Fix Grafana Ceph Cluster health status widget The health status widget doesn't show any status because it requires its query to return a single result. But in case a mgr instance had failed, it would return more, provided the incident has happened in the requested time frame. This is simply an issue of the `instant` switch being disabled for that widget. As only one mgr instance can ever be providing data at a time, enabling `instant` completely solves that issue. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	f51cab109d	mgr/dashboard: Fix decimals in OSC Capacity Utilization widget Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	5527c1c54f	mgr/dashboard: Remove hard-coded timezone off Grafana dashboards Remove hard-coded timezone off Grafana dashboards to enable the Grafana administrator to decide which timezone should be used for dashboards. If we hard-coded those values, changing the global settings in Grafana wouldn't have an effect. And the administrators can't change the automatically imported Grafana dashboards provided by us. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	8218d43e5f	monitoring: convert newline character to LF Convert newline character from CRLF in `rbd-details.json` to LF, so that it will be consistent with all the other dashboard JSON files. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	a709abf8bf	mgr/dashboard: deprecated variable usage in Grafana dashboards Fixes: https://tracker.ceph.com/issues/50059 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-07 14:31:53 +02:00
Dan Mick	de491c128a	monitoring/grafana/build/Makefile: work around buildah bug Workaround https://github.com/containers/buildah/issues/3253 by pushing to a local OCI-format image to clear out erroneously-left 'parent' field in buildah commit --squash output. Can be removed when the fix for the above is available. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	b56ff43232	monitoring/grafana/build/Makefile: use --authfile podman login caches auth tokens in auth.json; for sudo, it may be placed in /run/containers/0 or it may be in /run/users/0/containers; the latter directory is removed when root "logs out", which isn't clear what it means with sudo/su. Several builds failed because they couldn't find the cached auth between sudo podman login and sudo podman push. Sidestep the confusion by just using a local file for the auth cache. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	a3b4bc73f7	monitoring/grafana/build/Makefile: cleanup, ready for jenkins - allow env setting of versions of components - add docker/quay username/password variables - derive container version from grafana version - make arch-specific tags - expand clean target to remove container images - remove release-specific targets, "all" target - move push operations to separate "push" target Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	0fdbe673c8	monitoring/grafana/build/Makefile: use curl instead of wget build machines tend to already have curl installed Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	2faadc2d5c	monitoring/grafana/build/Makefile: use "sudo buildah" Some build machines don't have /etc/sub{u,g}id set up for so-called "rootless" (non-privileged) operation. Use sudo to avoid the need for "rootless". Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	9d37c6efbd	monitoring/grafana/build/Makefile: pull dashboards from local dir Use the dashboard definition files in this workspace directly Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	444d6f6623	monitoring/grafana/build/Makefile: Add ARCH variable Allow building for other archs, in particular arm64 Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	508b1d387f	monitoring/grafana/build/Makefile: fully qualify source image Some build machines may not have a default docker repo configured. Specify docker.io. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:24 -07:00
Ernesto Puerta	ac5d24e5ca	mgr/dashboard: remove non-null id in Grafana dashb Testing added to prevent this situation. Fixes: https://tracker.ceph.com/issues/50918 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2021-05-21 13:54:48 +02:00
Alfonso Martínez	7d79efb025	mgr/dashboard: fix OSDs Host details/overview grafana graphs Fixes: https://tracker.ceph.com/issues/50686 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2021-05-07 15:38:07 +02:00
Ernesto Puerta	458ad48024	Merge pull request #40715 from pcuzner/pool-overview-enhancement mgr/dashboard:include compression stats on pool dashboard Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-05-05 18:08:58 +02:00
Paul Cuzner	81788b1f21	mgr/dashboard:include compression stats on pool dashboard This is a replacement dashboard configuration for the pool overview page. It provides a cluster wide view of capacity consumed and compression effectiveness, and breaks this down by each pool within the configuration. Fixes: https://tracker.ceph.com/issues/50226 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-05-03 12:26:06 +12:00
Ernesto Puerta	381685f17f	Merge pull request #40072 from wornet-mwo/dashboard--grafana-hostname-corrections mgr/dashboard: Fixed name clash when hostname similar to another Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-04-29 19:40:57 +02:00
Michael Wodniok	e97e27ebdb	dashboard: Fixed name clash when hostname similar to another Fixes: #49769 Signed-off-by: Michael Wodniok <wodniok@wor.net>	2021-04-27 08:42:59 +02:00
Malcolm Holmes	382e293656	monitoring/grafana: Remove erroneous elements in hosts-overview Grafana dashboard The hosts-overview Grafana dashboard json file contains a repeated element, making it invalid JSON. Some JSON parsers handle this. However, this prevents Jsonnet from parsing the dashboard, which prevents the deployment of this dashboard via Jsonnet. Fixes: https://tracker.ceph.com/issues/50410 Signed-off-by: Malcolm Holmes <mdh@odoko.co.uk>	2021-04-17 23:11:48 +01:00
Aashish Sharma	8d2f39e6c5	mgr/dashboard:Simplify some complex calculations in test_alerts.yml run-promtool-unittests is failing with difference in floating point values in some complex calculations. This PR intends to simplify those calculations and fix this issue. Fixes: https://tracker.ceph.com/issues/49952 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-25 12:05:07 +05:30
Aashish Sharma	53a5816ded	mgr/dashboard:test prometheus rules through promtool This PR intends to add unit testing for prometheus rules using promtool. To run the tests run 'run-promtool-unittests.sh' file. Fixes: https://tracker.ceph.com/issues/45415 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-08 10:16:22 +05:30
Ernesto Puerta	dff5b78d3b	Merge pull request #39462 from rhcs-dashboard/fix-alerts-mtuMismatch mgr/dashboard: fix MTU Mismatch alert Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-17 14:14:17 +01:00
Ernesto Puerta	e2d73297cf	Merge pull request #38030 from p-se/prom-alert-package-drops-leeway mgr/dashboard: prometheus alerting: add some leeway for package drops and errors Reviewed-by: Stephan Müller <smueller@suse.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-16 20:45:44 +01:00
Patrick Seidensal	9ac248b0c3	mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%) Fixes: https://tracker.ceph.com/issues/48201 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-02-16 14:43:00 +01:00
Aashish Sharma	8527489b91	mgr/dashboard:fix MTU Mismatch alert This PR intends to fix the expression used for MTU Mismatch alert in prometheus Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-02-15 10:13:39 +05:30
Aashish Sharma	06cc0d8743	mgr/dashboard: trigger alert if some nodes have a MTU different than the median value This PR intends to alert a user if a specific network is configured with a custom MTU Fixes: https://tracker.ceph.com/issues/48748 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-01-22 11:20:13 +05:30
Alfonso Martínez	9441fda4dc	mgr/dashboard/monitoring: upgrade Grafana version due to CVE-2020-13379 Fixes: https://tracker.ceph.com/issues/48685 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2021-01-07 16:53:26 +01:00
Kefu Chai	30487c755c	Merge pull request #38282 from vosdev/ceph-pool-alert mgr/prometheus: Fix 'pool filling up' with >50% usage Reviewed-by: Patrick Seidensal <pseidensal@suse.com>	2020-12-12 12:10:44 +08:00
Daniël Vos	79568d51c6	mgr/prometheus: Fix 'pool filling up' with >50% usage Fixes: https://tracker.ceph.com/issues/48354 Signed-off-by: Daniël Vos <danielvos@outlook.com>	2020-12-01 16:31:09 +01:00
haoyixing	0e7e036aa7	doc/dev: use http://docs.ceph.com/en/latest/ instead of /docs/master/ for docs Several links under http://docs.ceph.com/docs/master/ were inaccessible. Change them to http://docs.ceph.com/en/latest so we can access them directly. Signed-off-by: haoyixing <haoyixing@kuaishou.com>	2020-11-24 12:49:47 +08:00
Paul Cuzner	2010432b50	mgr/prometheus: Add healthcheck metric for SLOW_OPS SLOW_OPS is triggered by op tracker, and generates a health alert but healthchecks do not create metrics for prometheus to use as alert triggers. This change adds SLOW_OPS metric, and provides a simple means to extend to other relevant health checks in the future If the extract of the value from the health check message fails we log an error and remove the metric from the metric set. In addition the metric description has changed to better reflect the scenarios where SLOW_OPS can be triggered. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-11-02 15:30:49 +13:00
Seena Fallah	0fd28f646c	monitoring: Use null yaxes min for OSD read latency According to seriesOverrides that negative-Y for read param there shouldn't be a minimum for yaxes Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2020-10-12 19:56:18 +03:30
Patrick Seidensal	fe64b9d176	mgr/dashboard: Fix many-to-many issue in host-details dashboard The labels on one side do not match the labels of the other side, where a label_replace is used. The fix uses the same label_replace on the missing side. Fixes: https://tracker.ceph.com/issues/47334 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-09-07 12:37:40 +02:00
Avan Thakkar	f039e5585d	mgr/dashboard: cpu stats incorrectly displayed Fixes: https://tracker.ceph.com/issues/46683 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-07-23 11:57:32 +05:30
pcuzner	0021dd278b	Merge pull request #35610 from pcuzner/wip-grafana-container monitoring: add grafana container build file	2020-07-06 13:06:55 +12:00
Lenz Grimmer	399521d66b	Merge pull request #34532 from rhcs-dashboard/wip-45068-fix-parse-error mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-06-30 10:50:59 +02:00
Paul Cuzner	3c813729dc	monitoring:add grafana container build file This commit provides the Makefile to create the ceph-grafana containers for nautilus, octopus and master releases. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-06-17 17:20:45 +12:00
Kiefer Chang	b963b7fbe9	monitoring: fixing some issues in RBD detail dashboard - Exchange read/write legends in The `I/O Bytes per second` panel. - Rename `I/O Bytes per second` to `Throughput`. - Rename `IOPS Count` to just `IOPS`. - Remove instance name from legends. - Fixes typos: `Averange` -> `Average`. Fixes: https://tracker.ceph.com/issues/45735 Signed-off-by: Kiefer Chang <kiefer.chang@suse.com>	2020-05-28 14:49:31 +08:00
Alfonso Martínez	cf4ff7d2f0	mgr/dashboard: grafana panels for rgw multisite sync performance * RGW sync perf. counters are now exposed through grafana panels. * Sync Performance tab is only shown if rgw realm is detected. * Prometheus module: added metrics suitable for prometheus consumption (from existing ones, not replacing for backward compatibility). Fixes: https://tracker.ceph.com/issues/45310 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2020-05-22 13:36:10 +02:00
Benoît Knecht	653c3f6682	monitoring: Fix "10% OSDs down" alert description The alert was triggered when less than 90% of OSDs were _up_, but then the description took that value and described it as the percentage of OSDs being _down_. So with 12% of OSDs down, the alert description would read: ``` 88% or 88 of 100 OSDs are down (>=10%). ``` which can be panic-inducing. This commit changes the alert expression to actually compute the ratio of OSDs being down, which makes the correct value appear in the description. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>	2020-05-06 18:49:26 +02:00
Lenz Grimmer	9334471340	Merge pull request #33991 from SchoolGuy/monitoring/rbd-image-details mgr/dashboard/grafana: Add rbd-image details dashboard Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Laura Paduano <lpaduano@suse.com> Reviewed-by: Patrick Seidensal <pnawracay@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-05-04 09:59:53 +02:00
Enno Gotthold	dfb1e0020e	mgr/dashboard: Remove additional unneeded steps for the metrics calculation Signed-off-by: Enno Gotthold <egotthold@suse.de>	2020-04-28 13:34:16 +02:00
Ernesto Puerta	3fd804f10b	monitoring: fix decimal precision in Grafana % Set decimal precision to 2 positions for charts using percentunits. Fixes: https://tracker.ceph.com/issues/45183 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2020-04-22 13:39:16 +02:00
Avan Thakkar	47b515c094	mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Fixes: https://tracker.ceph.com/issues/45068 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-04-21 23:03:09 +05:30
Volker Theile	e197e4d7f4	monitoring: alert for pool fill up broken Fixes: https://tracker.ceph.com/issues/44991 Signed-off-by: Volker Theile <vtheile@suse.com>	2020-04-08 15:02:45 +02:00
Volker Theile	a5ade11a31	Merge pull request #34239 from p-se/wip-pse-fix-false-root-vol-full-alert monitoring: root volume full alert fires false positives Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Jan Fajerski <jfajerski@suse.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-04-06 14:17:17 +02:00

102 Commits