Commit Graph

39 Commits

Author SHA1 Message Date
Joshua Baergen c01b5fb37e monitors: Remove stats that have been dead since Luminous
Ceph commit reference e170405fd873723bec6ce691afad82641bab2ef1
2023-12-15 15:50:21 -07:00
Tyler Brekke 3c403081b5 osd: Add new collector for osd metadata
scrape created_at, ceph_version_when_created, and osd_objectstore
from ceph osd metadata
2023-08-24 17:37:43 -07:00
Alex Marangone 2fb4e78d4a ceph/osd: fix root/rack labels 2023-05-23 08:06:31 -07:00
Joshua Baergen 5b487fad9c osd: Internally poll PG dump for oldest active PG tracking
Without this, the granularity of the oldest active PG is based on
external scrape frequency, and an unlucky sequence of scrapes could see
the same PG inactive two scrapes in a row even though it was active in
between.

Preferably, we would update this even more often than 10 seconds, but PG
dumps can take a while.
2023-05-08 16:33:13 -06:00
Daniel R 46b06f317f Fix timeouts and use goroutines for collectors/commands (#234)
* rados: timeouts on Mon/Mgr command & connections

* rados: remove unneeded timeouts

* make all collectors async

* fix osd collector

* only add 1 in waitgroups

* ceph: don't pass waitgroups to collectors

* monitors.go: use errgroup instead of waitgroup

* rados: add comment, pass arg & close channel
2023-03-14 14:00:33 -04:00
Alex Marangone 52ad633440 move to collector interface to avoid ugly switch 2023-02-14 12:24:34 -08:00
Alex Marangone 2235817fc4 move common mocking to a func 2023-02-14 11:36:09 -08:00
Alex Marangone abbe4444ef update test to support version being passed in Collect() 2023-02-14 11:17:51 -08:00
Alex Marangone ba15bf50a3 pass version to collectors when calling Collect() 2023-02-14 11:10:54 -08:00
Alex Marangone 69edc55596 exporter: do not reinitialize collectors on every collect
We store all the collectors in a map keyed by string in order to
dynamically load/unload the rbd mirror collector
2023-02-14 09:27:21 -08:00
Daniel R d8bf71a8fc Split cluster health state by plus sign
PR #226
2023-01-24 17:42:34 -05:00
Daniel R 50874e99af revert health_status_interp to gauge 2022-10-12 18:03:20 -04:00
Daniel R ae64dae6f8 add a comment indicating gauge deprecation 2022-10-11 11:58:42 -04:00
Daniel R b9af3ab29f bugfixes; stop defaulting map flags to 0 in the constmetric 2022-10-07 14:53:46 -04:00
Daniel R 1a741d7606 introduce constmetrics for osdmap flags 2022-10-07 14:17:29 -04:00
Daniel R c3a3d581aa migrate health checks from gauges to constmetrics 2022-10-06 16:20:24 -04:00
Daniel R 362cb4b8dd fix health unit tests 2022-10-06 16:20:19 -04:00
Daniel R 5dd16fe875 migrate pool_usage.go to constmetrics 2022-10-06 16:20:10 -04:00
Tyler Brekke 957b06df91 Add user to exporter for use with rbd/rgw commands 2022-08-25 15:20:57 -07:00
Tyler Brekke 19a3cd5c7e Add rbd-mirror health status 2022-08-24 14:23:23 -07:00
Xavier Villaneau ae09ffe3fe Add `hostname` label to `ceph_crash_reports` 2022-06-16 13:20:29 -04:00
Xavier Villaneau 2faa6cb82d Fix comments and docstring in getCrashLs 2022-06-15 17:04:04 -04:00
Xavier Villaneau 3141fef319 Use JSON output from `ceph crash ls` instead of plain output 2022-06-15 17:04:04 -04:00
Xavier Villaneau adf792c3e8 Use ConstMetrics for ceph_crash_reports
Makes the code simpler since we're not tracking state anymore.
Also rewrote the tests to be more in line with the rest.
2022-06-15 17:04:04 -04:00
Xavier Villaneau 74c89af225 Implement new gauge counting crash reports
New metric: `ceph_crash_reports` which counts the entries returned by
`ceph crash ls` by daemon name and archival status.

This is not the same as `ceph_new_crash_reports`, which is the value of
the `RECENT_CRASH` health check and only counts the non-archived
errors of the past two weeks. The new metric counts errors as long as
they are not purged (which is done after 1 year by default).
2022-06-15 17:04:04 -04:00
AKYD 763e5ecd21 Normalize ceph-ansible version format 2022-05-25 11:49:04 +03:00
Joshua Baergen ebd166be2d ceph: Support the Octopus+ mgrmap format. 2022-04-12 08:52:04 -06:00
Joshua Baergen 4e0f8910a4 Add missing tests for Octopus+ osdmap format.
In TestClusterHealthCollector, test all supported versions by default,
and split the osdmap tests for Nautilus vs. Octopus+. There were a
number of tests that included an osdmap that didn't need it, and the
osdmap was removed from them so that version-specific testing would not
be required.
2022-04-12 08:52:01 -06:00
haoyixing 407248ce1d feat: add misplaced ratio metric
The misplaced ratio equals misplaced_objects divided by misplaced_total, not misplaced_objects / num_objects.
This adds a separate metric to show the misplaced ratio.

Signed-off-by: haoyixing <haoyixing@kuaishou.com>
2022-03-29 18:38:15 -07:00
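The ratio described in the commit above can be written out as a small helper. This is only a sketch: the function name `misplacedRatio` and the choice to return 0 when the denominator is 0 are assumptions made for illustration, not taken from the exporter's code.

```go
package main

import "fmt"

// misplacedRatio returns misplaced_objects divided by misplaced_total.
// Returning 0 for a zero denominator is an assumption made here to avoid
// a division by zero; the real exporter may handle that case differently.
func misplacedRatio(misplacedObjects, misplacedTotal float64) float64 {
	if misplacedTotal == 0 {
		return 0
	}
	return misplacedObjects / misplacedTotal
}

func main() {
	// Illustrative values only, not real cluster data.
	fmt.Println(misplacedRatio(25, 100))
}
```

Note that the denominator is misplaced_total, so the result can differ from misplaced_objects / num_objects whenever the two totals differ.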
Kyle 917a468065 update deps and reduce a warn to debug 2022-03-29 17:44:50 -07:00
Kyle 1d7bac531d update license headers 2022-03-23 14:02:21 -07:00
Kyle 4d817f487d fix staticcheck errors 2022-03-23 12:24:28 -07:00
Kyle d6b67a77c3 removed down osd duplicate filtering 2022-03-22 12:59:51 -07:00
Kyle 3a0b289eda filter duplicate OSD nodes for down health check and fix health tests 2022-03-21 15:28:20 -07:00
Kyle b806cf51bb remove pre-nautilus health check code 2022-03-21 14:52:34 -07:00
Kyle df7435b259 add DAEMON_OLD_VERSION health check, update readme, remove makefile 2022-03-21 13:56:19 -07:00
Kyle 2122a3331f support flattened osdmap format added in octopus 2022-03-16 14:13:57 -07:00
Xavier Villaneau 6f83fdd300 Restructure so that tests do not depend on go-ceph
- `ceph.Conn` interface no longer depends on go-ceph/rados,
  now defines its own `PoolStat` structure for our use.
- New separate `rados` package that implements the interface
- Merged `mocks` package into `ceph` to avoid circular import
2022-02-24 15:57:00 -05:00
Kyle 566f1fa5d3 a ton of refactoring 2022-02-23 15:43:46 -08:00