RepoMirrors/ceph

mirror of https://github.com/ceph/ceph synced 2025-01-03 09:32:43 +00:00

Author	SHA1	Message	Date
Rishabh Dave	a6f5efb620	monitoring: mention PyYAML only once in requirements The following error occurs while running "sudo install-deps.sh" - ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML') PyYAML is mentioned twice as a requirement, once in each of the following files - monitoring/ceph-mixin/requirements-lint.txt monitoring/ceph-mixin/requirements-alerts.txt These requirements were added in commits `44d3e4c264` and `4750ac0d77`. Fixes: https://tracker.ceph.com/issues/54185 Signed-off-by: Rishabh Dave <ridave@redhat.com>	2022-02-08 11:19:15 +05:30
Nizamudeen A	27592b7561	cephadm: change shared_folder directory for prometheus and grafana After https://github.com/ceph/ceph/pull/44059 the monitoring/prometheus and monitoring/grafana/dashboards directories were moved to monitoring/ceph-mixin. This broke the shared_folders in the cephadm bootstrap script. Changed all the instances of monitoring/prometheus and monitoring/grafana/dashboards to monitoring/ceph-mixin. Also renamed all the instances of prometheus_alerts.yaml to prometheus_alerts.yml. Fixes: https://tracker.ceph.com/issues/54176 Signed-off-by: Nizamudeen A <nia@redhat.com>	2022-02-07 16:34:37 +05:30
Ernesto Puerta	6a4b1e148d	Merge pull request #44796 from pcuzner/remove-old-mib monitoring: remove old MIB Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com> Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-02-04 17:42:08 +01:00
Arthur Outhenin-Chalandre	8ff1e6b399	monitoring: build jsonnet/jb only for testing Build jsonnet and jb in the tests so that we can build ceph without internet access and still be able to run the monitoring tests that need the jsonnet tools. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>	2022-02-03 13:08:37 +01:00
Arthur Outhenin-Chalandre	ecaf9070ae	spec: debian: monitoring: build jsonnet from source to use 0.18.0 As this new version was only recently released, it is not yet available in every distro we use. We now build jsonnet from source so that we can use this new version of jsonnet. This commit could be reverted later on, once the new version is available everywhere. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>	2022-02-03 13:08:36 +01:00
Arthur Outhenin-Chalandre	98236e3a1d	mgr/dashboard: monitoring: refactor into ceph-mixin Mixin is a way to bundle dashboards, prometheus rules and alerts into a jsonnet package. Shifting to mixin will allow easier integration with monitoring automation that some users may use. This commit moves `/monitoring/grafana/dashboards` and `/monitoring/prometheus` to `/monitoring/ceph-mixin`. Prometheus alerts were also converted to Jsonnet in an automated way (from YAML to JSON to Jsonnet). This commit minimises changes to the generated files and should change neither the dashboards nor the Prometheus alerts. In the future, some configuration will also be added to jsonnet to add more functionality to the dashboards or alerts (e.g. multi-cluster). Fixes: https://tracker.ceph.com/issues/53374 Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>	2022-02-03 13:08:20 +01:00
Ernesto Puerta	c47ace9215	Merge pull request #43707 from BenoitKnecht/ceph-mgr-service-id mgr: Fix ceph_daemon label in ceph_rgw_* metrics Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-02-02 18:39:57 +01:00
Paul Cuzner	cbeab5c566	monitoring: remove old MIB The MIB file that matches the OID definitions in the alerts is CEPH-MIB.txt. The old MIB from the original SuSE SNMP gateway work therefore needs to be removed to avoid confusion. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2022-01-27 11:24:34 +13:00
Pere Diaz Bou	57c26311de	monitoring/grafana: replace filestore osd count Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-18 14:14:41 +01:00
Pere Diaz Bou	a3cf5c5e9f	monitoring/grafana: use Path class instead of split Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-18 13:24:12 +01:00
Pere Diaz Bou	1e4d85d04f	monitoring/grafana: remove explicit str casting Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-18 13:24:12 +01:00
Pere Diaz Bou	2b4f3561d2	monitoring/grafana: add generated json files Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-18 13:24:12 +01:00
Pere Diaz Bou	b381a83e9b	monitoring/grafana: ValueError instead of RuntimeError Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-18 13:24:12 +01:00
Pere Diaz Bou	4c302234ff	monitoring/grafana: Replace missing legendFormat warning with error Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-18 13:24:10 +01:00
Patrick Seidensal	7d7488018e	monitoring: Add unit tests for OSD panels in ceph-cluster dashboard Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2022-01-13 13:27:55 +01:00
Patrick Seidensal	4a6b2c1dfb	monitoring: fix display ceph_osd_in in Grafana panel Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2022-01-13 13:27:55 +01:00
Patrick Seidensal	18d3a71618	mgr/prometheus: Fix regression with OSD/host details/overview dashboards Fix issues with PromQL expressions and vector matching with the `ceph_disk_occupation` metric. As it turns out, `ceph_disk_occupation` cannot simply be used as expected, as there seem to be some edge cases for users that have several OSDs on a single disk. This leads to issues which cannot be approached by PromQL alone (many-to-many PromQL errors). In some rare cases, the data is simply different from what we expected. I have not found a pure PromQL solution to this issue. What we basically need is the following. 1. Match on labels `host` and `instance` to get one or more OSD names from a metadata metric (`ceph_disk_occupation`) to let a user know which OSDs belong to which disk. 2. Match on the `ceph_daemon` label of the `ceph_disk_occupation` metric, in which case the value of `ceph_daemon` must not refer to more than a single OSD. This is the exact opposite of requirement 1. As both operations are currently performed on a single metric, and there is no way to satisfy both requirements on a single metric, the intention of this commit is to extend the metric by providing a similar metric that satisfies one of the requirements. This enables the queries to differentiate between a vector matching operation that shows a string to the user (where `ceph_daemon` could possibly be `osd.1` or `osd.1+osd.2`) and one that matches a vector by having a single `ceph_daemon` in the matching condition. Although the `ceph_daemon` label is used on a variety of daemons, only OSDs seem to be affected by this issue (and only if more than one OSD is run on a single disk). This means that only the `ceph_disk_occupation` metadata metric seems to need to be extended and provided as two metrics. `ceph_disk_occupation` is supposed to be used for matching the `ceph_daemon` label value: `foo * on(ceph_daemon) group_left ceph_disk_occupation`. `ceph_disk_occupation_human` is supposed to be used for anything where the resulting data is displayed to be consumed by humans (graphs, alert messages, etc.): `foo * on(device,instance) group_left(ceph_daemon) ceph_disk_occupation_human`. Fixes: https://tracker.ceph.com/issues/52974 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2022-01-13 13:27:55 +01:00
Benoît Knecht	2daaa052ea	monitoring/grafana: Add tests for radosgw panels Some of the expressions modified in c40290390d7 were not covered by any tests, especially those in the `radosgw-detail.json` dashboard. This commit fills in those gaps. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>	2022-01-11 13:17:48 +01:00
Benoît Knecht	adc36dea7f	monitoring/grafana: Update radosgw dashboards With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_` metrics, we need to update Grafana dashboards to get that label back from `ceph_rgw_metadata` using this type of construct: ``` ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata ``` Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>	2022-01-11 13:17:20 +01:00
Ernesto Puerta	978d5829f2	Merge pull request #44294 from rhcs-dashboard/feature-bluestore-onode mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Laura Flores <lflores@redhat.com> Reviewed-by: neha-ojha <NOT@FOUND> Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>	2022-01-11 11:24:21 +01:00
Aashish Sharma	15aa4dffa9	mgr/dashboard: monitoring:Implement BlueStore onode hit/miss counters into the dashboard Provide the details pulled from Bluestore stats in order to display the onode hit/miss counters Fixes: https://tracker.ceph.com/issues/53577 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2022-01-05 14:22:53 +05:30
Ernesto Puerta	cdc9f742df	Merge pull request #44190 from rhcs-dashboard/grafana-regex monitoring/grafana: improve grafana unit tests variable substitution Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com> Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>	2021-12-21 17:58:17 +01:00
Pere Diaz Bou	bbbdf8e6a2	monitoring/grafana: doctest util regex Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2021-12-15 09:36:08 +01:00
Pere Diaz Bou	2286ddc1c2	monitoring/grafana: rename tox promql test Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2021-12-14 09:36:23 +01:00
Pere Diaz Bou	5ebdb746e8	monitoring/grafana: improve grafana unit tests variable substitution Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2021-12-14 09:36:23 +01:00
Ernesto Puerta	d10b0b7e72	mgr/dashboard: disable Promql test in ARM Temporarily disable this test while debugging the issue (since https://github.com/ceph/ceph/pull/43669 originally passed the ARM check). Fixes: https://tracker.ceph.com/issues/53451 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2021-12-13 20:20:44 +01:00
Avan Thakkar	8d83126e51	mgr/dashboard: introduce HAProxy metrics for RGW Fixes: https://tracker.ceph.com/issues/53311 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2021-12-09 20:03:03 +05:30
Pere Diaz Bou	44d3e4c264	monitoring/grafana: Grafana query tester Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2021-11-16 10:30:49 +01:00
Paul Cuzner	7ffcbd7f79	mgr/prometheus: Update rule format and enhance SNMP support Rules now adhere to the format defined by Prometheus.io. This changes alert naming, and each alert now includes a summary description to provide a quick one-liner. In addition to reformatting, some missing alerts for MDS and cephadm have been added, along with corresponding tests. The MIB has also been refactored so that it now passes standard lint tests, and a README is included for devs to understand the OID schema. Fixes: https://tracker.ceph.com/issues/53111 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-11-05 11:24:25 +13:00
Sebastian Wagner	aae2ea3897	Merge pull request #43293 from pcuzner/granular-alerts mgr/prometheus: expose ceph healthchecks as metrics Reviewed-by: Boris Ranto <branto@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Sebastian Wagner <sewagner@redhat.com>	2021-10-29 00:23:24 +02:00
Pere Diaz Bou	e1bc6f24ff	monitoring: ethernet bonding filter in Network Load Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>	2021-10-27 09:08:20 +02:00
Paul Cuzner	37b82b8793	mgr/prometheus: remove cmake tests Temporary removal of the cmake test integration Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-27 09:58:17 +13:00
Sebastian Wagner	b830c555d2	monitoring/prometheus: Add cmake integration Signed-off-by: Sebastian Wagner <sewagner@redhat.com>	2021-10-22 13:37:31 +13:00
Paul Cuzner	4750ac0d77	mgr/prometheus: add test cases and validation using tox Focus all tests inside a tests directory, and use pytest/tox to perform validation of the overall content. tox tests also use promtool, if available, to provide rule checks and unittest runs. In addition to these checks, a validate_rules script provides format and content checks against all rules; it is also called via tox but can be run independently too. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-22 13:36:40 +13:00
Paul Cuzner	e0dfc02063	mgr/prometheus: track individual healthchecks as metrics This patch creates a health history object maintained in the module's kvstore. The history and current health checks are used to create a metric per healthcheck whilst also providing a history feature. Two new commands are added: `ceph healthcheck history ls` and `ceph healthcheck history clear`. In addition to the new commands, the additional metrics have been used to update the prometheus alerts. Fixes: https://tracker.ceph.com/issues/52638 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-22 13:32:39 +13:00
Aashish Sharma	ed954b0e6c	mgr/dashboard: monitoring: grafonnet refactoring for cephfs dashboards This PR intends to refactor cephfs dashboards using grafonnet Fixes: https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:36:31 +05:30
Aashish Sharma	e490e2f3ab	mgr/dashboard: monitoring: grafonnet refactoring for osds dashboards This PR intends to refactor osds dashboards using grafonnet Fixes: https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:13:50 +05:30
Aashish Sharma	8c48821c21	mgr/dashboard: monitoring: grafonnet refactoring for pools dashboards This PR intends to refactor pools dashboards using grafonnet Fixes: https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:10:56 +05:30
Aashish Sharma	e737aaa000	mgr/dashboard: monitoring: grafonnet refactoring for rbd dashboards This PR intends to refactor rbd dashboards using grafonnet Fixes: https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 12:09:04 +05:30
Aashish Sharma	eb01954cd9	mgr/dashboard: monitoring: grafonnet refactoring for radosgw dashboards This PR intends to refactor radosgw dashboards using grafonnet Fixes: https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-19 11:57:28 +05:30
Ernesto Puerta	19535b1d0e	Merge pull request #43469 from rhcs-dashboard/hosts-grafana-dashboards mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-10-18 17:14:03 +02:00
Ernesto Puerta	9b40c9df26	Merge pull request #43377 from rhcs-dashboard/fix-clients-connection-query mgr/dashboard: replace "Ceph-cluster" Client connections with active-standby MGRs Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Greg Farnum <gfarnum@redhat.com> Reviewed-by: neha-ojha <NOT@FOUND> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-10-13 13:37:51 +02:00
Sebastian Wagner	53382d70eb	Merge pull request #43274 from pcuzner/add-mib monitoring:Adding the Ceph MIB Reviewed-by: Sebastian Wagner <sewagner@redhat.com>	2021-10-12 22:29:06 +02:00
Aashish Sharma	f7714de294	mgr/dashboard: monitoring: grafonnet refactoring for hosts dashboards This PR intends to refactor hosts dashboards using grafonnet Fixes: https://tracker.ceph.com/issues/52777 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-10-12 11:05:02 +05:30
Avan Thakkar	d388c5e958	mgr/dashboard: replace Client connections with active-stdby mgrs Fixes: https://tracker.ceph.com/issues/52121 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2021-10-11 21:53:23 +05:30
Paul Cuzner	b96aa5d184	monitoring:Updated README Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-10-06 14:32:47 +13:00
Ernesto Puerta	ba9e17d2d2	Merge pull request #43132 from p-se/monitoring-grafana-piechart-update monitoring: update grafana-piechart-panel plugin Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-09-28 18:37:45 +02:00
Paul Cuzner	f9213ad9cf	monitoring:Adding the Ceph MIB The ceph MIB has been created and maintained in a separate repo: https://github.com/SUSE/prometheus-webhook-snmp This patch brings this MIB into the main ceph repo, so alert changes can target prometheus and potentially SNMP environments within the same PR. Kudos to Volker Theile for creating the MIB. Fixes: https://tracker.ceph.com/issues/52708 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-09-23 11:06:19 +12:00
Patrick Seidensal	af94237621	monitoring: update grafana-piechart-panel plugin Fixes: https://tracker.ceph.com/issues/51211 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-09-10 15:28:17 +02:00
Aashish Sharma	58d635455d	mgr/dashboard: Incorrect MTU mismatch warning The MTU mismatch warning was also being fired for NICs that are in the down state. This PR intends to fix this issue. Fixes: https://tracker.ceph.com/issues/52028 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-09-02 15:34:36 +05:30
Kefu Chai	1835fd86dd	cmake: exclude "grafonnet-lib" target from "all" so we don't build this target when running "make", and hence avoid accessing the internet in a build environment where internet access is not allowed. Signed-off-by: Kefu Chai <kchai@redhat.com>	2021-08-20 22:50:42 +08:00
Kefu Chai	1fdd632d0c	cmake: silence build output when building external deps When downloading/building grafonnet-lib, dpdk, spdk, liburing and fio, they dump lots of output during the configuration and build phases, all of which is irrelevant to us. So let's just silence it. Signed-off-by: Kefu Chai <kchai@redhat.com>	2021-08-16 21:27:57 +08:00
Ernesto Puerta	559afae0b9	Merge pull request #41570 from jhrcz-ls/wip-cephfs-overview-use-rate mgr/dashboard: cephfs MDS Workload to use rate for counter type metric	2021-08-12 20:53:07 +02:00
Aashish Sharma	4907c78bb7	mgr/dashboard: fix grafonnet build error This PR intends to fix the issue caused by #42194 Fixes: https://tracker.ceph.com/issues/52238 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-08-12 17:48:33 +05:30
Ernesto Puerta	afadfede0d	Merge pull request #42194 from rhcs-dashboard/add-grafonnet-grafana mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based code	2021-08-11 18:11:59 +02:00
Aashish Sharma	e9bd94515f	mgr/dashboard: monitoring: replace Grafana JSON with Grafonnet based Code This PR intends to add grafonnet to generate grafana JSON files Fixes: https://tracker.ceph.com/issues/45184 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-08-11 19:23:54 +05:30
Ernesto Puerta	cc6b18a92c	Merge pull request #41880 from david-caro/fix_cluster_grafana_dashboard monitoring/grafana/cluster: use per-unit max and limit values Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-08-02 13:03:46 +02:00
Jan Horáček	5bf516dcc7	[mgr/dashboard] cephfs metrics in MDS Workload panels to use rate because of counter type metric Fixes: https://tracker.ceph.com/issues/51954 Signed-off-by: Jan Horacek <jan.horacek@livesport.eu>	2021-07-29 10:09:41 +02:00
Seena Fallah	feb8f784d2	monitoring: fix Physical Device Latency unit Based on the expr it should be seconds Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2021-07-07 17:00:30 +04:30
Ernesto Puerta	62e3a5c41c	Merge pull request #41838 from p-se/grafana-clean-up monitoring: Clean up Grafana dashboards Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: jan--f <NOT@FOUND> Reviewed-by: p-se <NOT@FOUND> Reviewed-by: Paul Cuzner <pcuzner@redhat.com>	2021-06-25 20:45:28 +02:00
David Caro	c981298039	monitoring/grafana/cluster: use per-unit max and limit values The value we get is a per-unit value, so the limits and the max value should be relative to 1, not 100. Note that the value being shown was correct; it was the gauge that was not showing the correct indicators. Signed-off-by: David Caro <david@dcaro.es>	2021-06-16 10:38:41 +02:00
Patrick Seidensal	037410713f	monitoring: remove instance label from ceph-cluster.json completely The `instance` label is only useful if - the exporter returns only data about its node or instance - the exporter provides an instance label and then may return data about other nodes. In this case, it's about the Prometheus mgr module, which is a single exporter providing data about a whole cluster, not only data related to the node (or instance) the mgr module is running on. It is completely irrelevant which node the exporter runs on; the data provided doesn't change. The exporter also doesn't provide `instance` labels (which Prometheus wouldn't change due to our configuration, see the "honor_labels" setting). (Actually, there's one exception where `instance` labels are provided by the Ceph mgr module, but that doesn't affect the Ceph Cluster dashboard.) Note that keeping the instance label on this particular dashboard would enable the user to switch between the data collected from a previously failed mgr instance and the data from the currently running mgr instance (on which the Prometheus mgr module runs). That would split the data, which I don't think is a useful feature; it rather looks broken. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:11:30 +02:00
Patrick Seidensal	4270a13d6c	mgr/dashboard: Fix Grafana Ceph Cluster health status widget The health status widget doesn't show any status because it requires its query to return a single result. But if a mgr instance has failed within the requested time frame, the query returns more than one result. This is simply an issue of the `instant` switch being disabled for that widget. As only one mgr instance can ever be providing data at a time, enabling `instant` completely solves that issue. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	f51cab109d	mgr/dashboard: Fix decimals in OSC Capacity Utilization widget Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	5527c1c54f	mgr/dashboard: Remove hard-coded timezone from Grafana dashboards Remove hard-coded timezone from Grafana dashboards to enable the Grafana administrator to decide which timezone should be used for dashboards. If we hard-coded those values, changing the global settings in Grafana wouldn't have an effect, and administrators can't change the automatically imported Grafana dashboards provided by us. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	8218d43e5f	monitoring: convert newline character to LF Convert newline character from CRLF in `rbd-details.json` to LF, so that it will be consistent with all the other dashboard JSON files. Fixes: https://tracker.ceph.com/issues/51212 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-16 09:10:32 +02:00
Patrick Seidensal	a709abf8bf	mgr/dashboard: deprecated variable usage in Grafana dashboards Fixes: https://tracker.ceph.com/issues/50059 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-06-07 14:31:53 +02:00
Dan Mick	de491c128a	monitoring/grafana/build/Makefile: work around buildah bug Work around https://github.com/containers/buildah/issues/3253 by pushing to a local OCI-format image to clear out the erroneously left 'parent' field in buildah commit --squash output. This can be removed when the fix for the above is available. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	b56ff43232	monitoring/grafana/build/Makefile: use --authfile podman login caches auth tokens in auth.json; for sudo, it may be placed in /run/containers/0 or it may be in /run/users/0/containers; the latter directory is removed when root "logs out", and it isn't clear what that means with sudo/su. Several builds failed because they couldn't find the cached auth between sudo podman login and sudo podman push. Sidestep the confusion by just using a local file for the auth cache. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	a3b4bc73f7	monitoring/grafana/build/Makefile: cleanup, ready for jenkins - allow env setting of versions of components - add docker/quay username/password variables - derive container version from grafana version - make arch-specific tags - expand clean target to remove container images - remove release-specific targets, "all" target - move push operations to separate "push" target Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	0fdbe673c8	monitoring/grafana/build/Makefile: use curl instead of wget build machines tend to already have curl installed Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	2faadc2d5c	monitoring/grafana/build/Makefile: use "sudo buildah" Some build machines don't have /etc/sub{u,g}id set up for so-called "rootless" (non-privileged) operation. Use sudo to avoid the need for "rootless". Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	9d37c6efbd	monitoring/grafana/build/Makefile: pull dashboards from local dir Use the dashboard definition files in this workspace directly Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	444d6f6623	monitoring/grafana/build/Makefile: Add ARCH variable Allow building for other archs, in particular arm64 Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:25 -07:00
Dan Mick	508b1d387f	monitoring/grafana/build/Makefile: fully qualify source image Some build machines may not have a default docker repo configured. Specify docker.io. Signed-off-by: Dan Mick <dmick@redhat.com>	2021-05-26 13:37:24 -07:00
Ernesto Puerta	ac5d24e5ca	mgr/dashboard: remove non-null id in Grafana dashb Testing added to prevent this situation. Fixes: https://tracker.ceph.com/issues/50918 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>	2021-05-21 13:54:48 +02:00
Alfonso Martínez	7d79efb025	mgr/dashboard: fix OSDs Host details/overview grafana graphs Fixes: https://tracker.ceph.com/issues/50686 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2021-05-07 15:38:07 +02:00
Ernesto Puerta	458ad48024	Merge pull request #40715 from pcuzner/pool-overview-enhancement mgr/dashboard:include compression stats on pool dashboard Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-05-05 18:08:58 +02:00
Paul Cuzner	81788b1f21	mgr/dashboard:include compression stats on pool dashboard This is a replacement dashboard configuration for the pool overview page. It provides a cluster-wide view of capacity consumed and compression effectiveness, and breaks this down by each pool within the configuration. Fixes: https://tracker.ceph.com/issues/50226 Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2021-05-03 12:26:06 +12:00
Ernesto Puerta	381685f17f	Merge pull request #40072 from wornet-mwo/dashboard--grafana-hostname-corrections mgr/dashboard: Fixed name clash when hostname similar to another Reviewed-by: Aashish Sharma <aasharma@redhat.com> Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: p-se <NOT@FOUND>	2021-04-29 19:40:57 +02:00
Michael Wodniok	e97e27ebdb	dashboard: Fixed name clash when hostname similar to another Fixes: #49769 Signed-off-by: Michael Wodniok <wodniok@wor.net>	2021-04-27 08:42:59 +02:00
Malcolm Holmes	382e293656	monitoring/grafana: Remove erroneous elements in hosts-overview Grafana dashboard The hosts-overview Grafana dashboard json file contains a repeated element, making it invalid JSON. Some JSON parsers handle this. However, this prevents Jsonnet from parsing the dashboard, which prevents the deployment of this dashboard via Jsonnet. Fixes: https://tracker.ceph.com/issues/50410 Signed-off-by: Malcolm Holmes <mdh@odoko.co.uk>	2021-04-17 23:11:48 +01:00
Aashish Sharma	8d2f39e6c5	mgr/dashboard:Simplify some complex calculations in test_alerts.yml run-promtool-unittests is failing due to differences in floating point values in some complex calculations. This PR intends to simplify those calculations and fix this issue. Fixes: https://tracker.ceph.com/issues/49952 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-25 12:05:07 +05:30
Aashish Sharma	53a5816ded	mgr/dashboard:test prometheus rules through promtool This PR intends to add unit testing for prometheus rules using promtool. To run the tests run 'run-promtool-unittests.sh' file. Fixes: https://tracker.ceph.com/issues/45415 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-03-08 10:16:22 +05:30
Ernesto Puerta	dff5b78d3b	Merge pull request #39462 from rhcs-dashboard/fix-alerts-mtuMismatch mgr/dashboard: fix MTU Mismatch alert Reviewed-by: Avan Thakkar <athakkar@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-17 14:14:17 +01:00
Ernesto Puerta	e2d73297cf	Merge pull request #38030 from p-se/prom-alert-package-drops-leeway mgr/dashboard: prometheus alerting: add some leeway for package drops and errors Reviewed-by: Stephan Müller <smueller@suse.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Nizamudeen A <nia@redhat.com>	2021-02-16 20:45:44 +01:00
Patrick Seidensal	9ac248b0c3	mgr/dashboard: prometheus alerting: add some leeway for package drops and errors (1%) Fixes: https://tracker.ceph.com/issues/48201 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2021-02-16 14:43:00 +01:00
Aashish Sharma	8527489b91	mgr/dashboard:fix MTU Mismatch alert This PR intends to fix the expression used for MTU Mismatch alert in prometheus Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-02-15 10:13:39 +05:30
Aashish Sharma	06cc0d8743	mgr/dashboard: trigger alert if some nodes have a MTU different than the median value This PR intends to alert a user if a specific network is configured with a custom MTU Fixes: https://tracker.ceph.com/issues/48748 Signed-off-by: Aashish Sharma <aasharma@redhat.com>	2021-01-22 11:20:13 +05:30
Alfonso Martínez	9441fda4dc	mgr/dashboard/monitoring: upgrade Grafana version due to CVE-2020-13379 Fixes: https://tracker.ceph.com/issues/48685 Signed-off-by: Alfonso Martínez <almartin@redhat.com>	2021-01-07 16:53:26 +01:00
Kefu Chai	30487c755c	Merge pull request #38282 from vosdev/ceph-pool-alert mgr/prometheus: Fix 'pool filling up' with >50% usage Reviewed-by: Patrick Seidensal <pseidensal@suse.com>	2020-12-12 12:10:44 +08:00
Daniël Vos	79568d51c6	mgr/prometheus: Fix 'pool filling up' with >50% usage Fixes: https://tracker.ceph.com/issues/48354 Signed-off-by: Daniël Vos <danielvos@outlook.com>	2020-12-01 16:31:09 +01:00
haoyixing	0e7e036aa7	doc/dev: use http://docs.ceph.com/en/latest/ instead of /docs/master/ for docs Several links under http://docs.ceph.com/docs/master/ were inaccessible. Change them to http://docs.ceph.com/en/latest so we can access them directly. Signed-off-by: haoyixing <haoyixing@kuaishou.com>	2020-11-24 12:49:47 +08:00
Paul Cuzner	2010432b50	mgr/prometheus: Add healthcheck metric for SLOW_OPS SLOW_OPS is triggered by op tracker and generates a health alert, but healthchecks do not create metrics for prometheus to use as alert triggers. This change adds a SLOW_OPS metric and provides a simple means to extend to other relevant health checks in the future. If extracting the value from the health check message fails, we log an error and remove the metric from the metric set. In addition, the metric description has changed to better reflect the scenarios where SLOW_OPS can be triggered. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-11-02 15:30:49 +13:00
Seena Fallah	0fd28f646c	monitoring: Use null yaxes min for OSD read latency Because seriesOverrides applies negative-Y to the read series, there shouldn't be a minimum for yaxes. Signed-off-by: Seena Fallah <seenafallah@gmail.com>	2020-10-12 19:56:18 +03:30
Patrick Seidensal	fe64b9d176	mgr/dashboard: Fix many-to-many issue in host-details dashboard The labels on one side do not match the labels of the other side, where a label_replace is used. The fix uses the same label_replace on the missing side. Fixes: https://tracker.ceph.com/issues/47334 Signed-off-by: Patrick Seidensal <pseidensal@suse.com>	2020-09-07 12:37:40 +02:00
Avan Thakkar	f039e5585d	mgr/dashboard: cpu stats incorrectly displayed Fixes: https://tracker.ceph.com/issues/46683 Signed-off-by: Avan Thakkar <athakkar@redhat.com>	2020-07-23 11:57:32 +05:30
pcuzner	0021dd278b	Merge pull request #35610 from pcuzner/wip-grafana-container monitoring: add grafana container build file	2020-07-06 13:06:55 +12:00
Lenz Grimmer	399521d66b	Merge pull request #34532 from rhcs-dashboard/wip-45068-fix-parse-error mgr/dashboard: Prometheus query error in the metrics of Pools, OSDs and RBD images Reviewed-by: Alfonso Martínez <almartin@redhat.com> Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Volker Theile <vtheile@suse.com>	2020-06-30 10:50:59 +02:00
Paul Cuzner	3c813729dc	monitoring:add grafana container build file This commit provides the Makefile to create the ceph-grafana containers for nautilus, octopus and master releases. Signed-off-by: Paul Cuzner <pcuzner@redhat.com>	2020-06-17 17:20:45 +12:00
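
Several commits above (18d3a71618, adc36dea7f) rely on PromQL's `on(...) group_left(...)` vector matching. The following Python sketch is a toy model of that matching, written only to illustrate why two OSDs sharing one disk produce a many-to-many error when `ceph_disk_occupation` is joined on `device`/`instance`, while a per-disk `ceph_disk_occupation_human` series joins cleanly. It is not how Prometheus implements matching: the series representation, the `group_left_mul` helper and all sample data (`disk_io`, `osd_latency`, the label values) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: a series is (labels, value); metric names are omitted.
Series = Tuple[Dict[str, str], float]


def group_left_mul(left: List[Series], right: List[Series],
                   on: Sequence[str], include: Sequence[str] = ()) -> List[Series]:
    """Toy model of `left * on(<on>) group_left(<include>) right`."""
    def key(labels: Dict[str, str]) -> Tuple[str, ...]:
        return tuple(labels.get(name, "") for name in on)

    # The right-hand ("one") side must have at most one series per matching key.
    one_side: Dict[Tuple[str, ...], Series] = {}
    for labels, value in right:
        k = key(labels)
        if k in one_side:
            raise ValueError(f"many-to-many matching on {dict(zip(on, k))}")
        one_side[k] = (labels, value)

    result: List[Series] = []
    for labels, value in left:
        match = one_side.get(key(labels))
        if match is None:
            continue  # left-hand series without a partner are dropped
        right_labels, right_value = match
        out = dict(labels)
        for name in include:  # extra labels copied over from the "one" side
            if name in right_labels:
                out[name] = right_labels[name]
        result.append((out, value * right_value))
    return result


# Hypothetical data: osd.1 and osd.2 share the disk sda on node1.
ceph_disk_occupation = [  # one series per OSD
    ({"ceph_daemon": "osd.1", "device": "sda", "instance": "node1"}, 1.0),
    ({"ceph_daemon": "osd.2", "device": "sda", "instance": "node1"}, 1.0),
]
ceph_disk_occupation_human = [  # one series per disk, OSD names joined
    ({"ceph_daemon": "osd.1+osd.2", "device": "sda", "instance": "node1"}, 1.0),
]
disk_io = [({"device": "sda", "instance": "node1"}, 42.0)]
osd_latency = [({"ceph_daemon": "osd.1"}, 0.5), ({"ceph_daemon": "osd.2"}, 0.25)]
```

Joining `disk_io` against `ceph_disk_occupation` on `device` and `instance` raises, because two right-hand series share the same key. Joining against `ceph_disk_occupation_human` yields a single series labelled `osd.1+osd.2`, which suits display, while `osd_latency` can still be matched one-to-one on `ceph_daemon` against `ceph_disk_occupation`.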
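
Commits 06cc0d8743 and 58d635455d describe an alert that fires when some nodes have an MTU different from the median value, and a later fix that stops NICs in the down state from triggering it. The Python sketch below models that rule on plain data so the logic can be checked in isolation; it is not the PromQL expression shipped in Ceph. Grouping by interface name, the `mtu_mismatches` helper and the sample hosts are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical input shape: (host, device) -> (mtu_bytes, is_up)
Interfaces = Dict[Tuple[str, str], Tuple[int, bool]]


def mtu_mismatches(interfaces: Interfaces) -> List[Tuple[str, str, int, float]]:
    """Return (host, device, mtu, median_mtu) for every up interface whose MTU
    differs from the median MTU of up interfaces with the same device name."""
    mtus_by_device: Dict[str, List[int]] = {}
    for (host, device), (mtu, is_up) in interfaces.items():
        if is_up:  # interfaces that are down are ignored entirely
            mtus_by_device.setdefault(device, []).append(mtu)

    flagged = []
    for (host, device), (mtu, is_up) in sorted(interfaces.items()):
        if not is_up:
            continue
        expected = median(mtus_by_device[device])
        if mtu != expected:
            flagged.append((host, device, mtu, expected))
    return flagged


# Hypothetical cluster: node3 was left at the default MTU; node4's NIC is down.
interfaces: Interfaces = {
    ("node1", "eth0"): (9000, True),
    ("node2", "eth0"): (9000, True),
    ("node3", "eth0"): (1500, True),
    ("node4", "eth0"): (1500, False),
}
print(mtu_mismatches(interfaces))  # [('node3', 'eth0', 1500, 9000)]
```

Dropping the `is_up` filter would let node4's down NIC both shift the median and be reported itself, which is the kind of spurious warning 58d635455d addresses.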

211 Commits