Joshua Baergen
c01b5fb37e
monitors: Remove stats that have been dead since Luminous
...
Ceph commit reference e170405fd873723bec6ce691afad82641bab2ef1
2023-12-15 15:50:21 -07:00
Tyler Brekke
3c403081b5
osd: Add new collector for osd metadata
...
scrape created_at, ceph_version_when_created, and osd_objectstore
from ceph osd metadata
2023-08-24 17:37:43 -07:00
Alex Marangone
2fb4e78d4a
ceph/osd: fix root/rack labels
2023-05-23 08:06:31 -07:00
Joshua Baergen
5b487fad9c
osd: Internally poll PG dump for oldest active PG tracking
...
Without this, the granularity of the oldest active PG is based on
external scrape frequency, and an unlucky sequence of scrapes could see
the same PG inactive two scrapes in a row even though it was active in
between.
Preferably, we would update this even more often than every 10 seconds,
but PG dumps can take a while.
2023-05-08 16:33:13 -06:00
Daniel R
46b06f317f
Fix timeouts and use goroutines for collectors/commands (#234)
...
* rados: timeouts on Mon/Mgr command & connections
* rados: remove unneeded timeouts
* make all collectors async
* fix osd collector
* only add 1 in waitgroups
* ceph: don't pass waitgroups to collectors
* monitors.go: use errgroup instead of waitgroup
* rados: add comment, pass arg & close channel
2023-03-14 14:00:33 -04:00
Alex Marangone
52ad633440
move to collector interface to avoid ugly switch
2023-02-14 12:24:34 -08:00
Alex Marangone
2235817fc4
move common mocking to a func
2023-02-14 11:36:09 -08:00
Alex Marangone
abbe4444ef
update test to support version being passed in Collect()
2023-02-14 11:17:51 -08:00
Alex Marangone
ba15bf50a3
pass version to collectors when calling Collect()
2023-02-14 11:10:54 -08:00
Alex Marangone
69edc55596
exporter: do not reinitialize collectors on every collect
...
We store all the collectors in a map keyed by string in order to
dynamically load/unload the rbd mirror collector
2023-02-14 09:27:21 -08:00
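The commits above describe keeping collectors in a string-keyed map, passing the version into `Collect()`, and loading or unloading the rbd mirror collector at runtime. A rough sketch of that shape follows; the `collector` interface, the `exporter` type, the `"rbd_mirror"` key, and `setRBDMirror` are all illustrative assumptions rather than the project's actual definitions.

```go
package main

import (
	"fmt"
	"sort"
)

// collector (hypothetical) is the common interface; the Ceph version
// is passed on every call instead of being stored at construction.
type collector interface {
	Collect(version string) []string
}

type staticCollector struct{ name string }

func (c *staticCollector) Collect(version string) []string {
	return []string{c.name + " collected for " + version}
}

// exporter builds its collectors once and reuses them across scrapes.
type exporter struct {
	collectors map[string]collector
}

func newExporter() *exporter {
	return &exporter{collectors: map[string]collector{
		"health": &staticCollector{name: "health"},
		"osd":    &staticCollector{name: "osd"},
	}}
}

// setRBDMirror loads or unloads the optional collector by key,
// leaving every other collector untouched.
func (e *exporter) setRBDMirror(enabled bool) {
	_, loaded := e.collectors["rbd_mirror"]
	switch {
	case enabled && !loaded:
		e.collectors["rbd_mirror"] = &staticCollector{name: "rbd_mirror"}
	case !enabled && loaded:
		delete(e.collectors, "rbd_mirror")
	}
}

// collect walks the map in sorted key order; no switch on type is needed.
func (e *exporter) collect(version string) []string {
	names := make([]string, 0, len(e.collectors))
	for name := range e.collectors {
		names = append(names, name)
	}
	sort.Strings(names)
	var out []string
	for _, name := range names {
		out = append(out, e.collectors[name].Collect(version)...)
	}
	return out
}

func main() {
	e := newExporter()
	e.setRBDMirror(true)
	for _, line := range e.collect("quincy") {
		fmt.Println(line)
	}
}
```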
Daniel R
d8bf71a8fc
Split cluster health state by plus sign
...
PR #226
2023-01-24 17:42:34 -05:00
Daniel R
50874e99af
revert health_status_interp to gauge
2022-10-12 18:03:20 -04:00
Daniel R
ae64dae6f8
add a comment indicating gauge deprecation
2022-10-11 11:58:42 -04:00
Daniel R
b9af3ab29f
bugfixes; stop defaulting map flags to 0 in the constmetric
2022-10-07 14:53:46 -04:00
Daniel R
1a741d7606
introduce constmetrics for osdmap flags
2022-10-07 14:17:29 -04:00
Daniel R
c3a3d581aa
migrate health checks from gauges to constmetrics
2022-10-06 16:20:24 -04:00
Daniel R
362cb4b8dd
fix health unit tests
2022-10-06 16:20:19 -04:00
Daniel R
5dd16fe875
migrate pool_usage.go to constmetrics
2022-10-06 16:20:10 -04:00
Tyler Brekke
957b06df91
Add user to exporter for use with rbd/rgw commands
2022-08-25 15:20:57 -07:00
Tyler Brekke
19a3cd5c7e
Add rbd-mirror health status
2022-08-24 14:23:23 -07:00
Xavier Villaneau
ae09ffe3fe
Add hostname label to ceph_crash_reports
2022-06-16 13:20:29 -04:00
Xavier Villaneau
2faa6cb82d
Fix comments and docstring in getCrashLs
2022-06-15 17:04:04 -04:00
Xavier Villaneau
3141fef319
Use JSON output from ceph crash ls instead of plain output
2022-06-15 17:04:04 -04:00
Xavier Villaneau
adf792c3e8
Use ConstMetrics for ceph_crash_reports
...
Makes the code simpler since we're not tracking state anymore.
Also rewrote the tests to be more in-line with the rest.
2022-06-15 17:04:04 -04:00
Xavier Villaneau
74c89af225
Implement new gauge counting crash reports
...
New metric: `ceph_crash_reports` which counts the entries returned by
`ceph crash ls` by daemon name and archival status.
This is not the same as `ceph_new_crash_reports` which is the value of
the `RECENT_CRASH` health check, and that only counts the non-archived
errors of the past two weeks. The new metric counts errors as long as
they are not purged (which is done after 1 year by default).
2022-06-15 17:04:04 -04:00
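To illustrate the counting described above (entries grouped by daemon name and archival status), here is a small sketch that works on JSON input, as the later commits switch to. The field names `entity_name` and `archived`, and the rule that a non-empty `archived` value means the report is archived, are assumptions about the `ceph crash ls` output and should be checked against a real cluster; `countCrashes` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crashEntry holds the only fields this sketch reads.
// Field names are assumed, not confirmed by the commit log.
type crashEntry struct {
	Entity   string `json:"entity_name"`
	Archived string `json:"archived"` // assumed non-empty once archived
}

// crashKey is the label pair the metric would be grouped by.
type crashKey struct {
	Entity   string
	Archived bool
}

// countCrashes groups crash reports by daemon and archival status.
func countCrashes(raw []byte) (map[crashKey]int, error) {
	var entries []crashEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return nil, err
	}
	counts := make(map[crashKey]int)
	for _, e := range entries {
		counts[crashKey{Entity: e.Entity, Archived: e.Archived != ""}]++
	}
	return counts, nil
}

func main() {
	// Made-up sample input for illustration only.
	raw := []byte(`[
		{"entity_name": "osd.0"},
		{"entity_name": "osd.0"},
		{"entity_name": "osd.0", "archived": "2022-06-01 00:00:00"},
		{"entity_name": "mgr.a"}
	]`)
	counts, err := countCrashes(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for key, n := range counts {
		fmt.Printf("daemon=%s archived=%t count=%d\n", key.Entity, key.Archived, n)
	}
}
```

Because the counts are recomputed from the full listing on each call, no state has to be kept between scrapes.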
AKYD
763e5ecd21
Normalize ceph-ansible version format
2022-05-25 11:49:04 +03:00
Joshua Baergen
ebd166be2d
ceph: Support the Octopus+ mgrmap format.
2022-04-12 08:52:04 -06:00
Joshua Baergen
4e0f8910a4
Add missing tests for Octopus+ osdmap format.
...
In TestClusterHealthCollector, test all supported versions by default,
and split the osdmap tests for Nautilus vs. Octopus+. There were a
number of tests that included an osdmap that didn't need it, and the
osdmap was removed from them so that version-specific testing would not
be required.
2022-04-12 08:52:01 -06:00
haoyixing
407248ce1d
feat: add misplaced ratio metric
...
The misplaced ratio equals misplaced_objects divided by misplaced_total, not misplaced_objects / num_objects.
So add a separate metric to show misplaced ratio.
Signed-off-by: haoyixing <haoyixing@kuaishou.com>
2022-03-29 18:38:15 -07:00
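The ratio from the commit above can be written as a one-line computation. In this sketch the parameter names mirror the commit message, `misplacedRatio` is a hypothetical helper, and returning 0 when the total is 0 is an assumption made here to avoid dividing by zero.

```go
package main

import "fmt"

// misplacedRatio divides misplaced objects by the misplaced total,
// not by the overall object count.
func misplacedRatio(misplacedObjects, misplacedTotal float64) float64 {
	if misplacedTotal == 0 {
		return 0 // assumption: nothing to misplace means a ratio of 0
	}
	return misplacedObjects / misplacedTotal
}

func main() {
	fmt.Println(misplacedRatio(25, 100)) // 0.25
}
```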
Kyle
917a468065
update deps and reduce a warn to debug
2022-03-29 17:44:50 -07:00
Kyle
1d7bac531d
update license headers
2022-03-23 14:02:21 -07:00
Kyle
4d817f487d
fix staticcheck errors
2022-03-23 12:24:28 -07:00
Kyle
d6b67a77c3
removed down osd duplicate filtering
2022-03-22 12:59:51 -07:00
Kyle
3a0b289eda
filter duplicate OSD nodes for down health check and fix health tests
2022-03-21 15:28:20 -07:00
Kyle
b806cf51bb
remove pre-nautilus health check code
2022-03-21 14:52:34 -07:00
Kyle
df7435b259
add DAEMON_OLD_VERSION health check, update readme, remove makefile
2022-03-21 13:56:19 -07:00
Kyle
2122a3331f
support flattened osdmap format added in octopus
2022-03-16 14:13:57 -07:00
Xavier Villaneau
6f83fdd300
Restructure so that tests do not depend on go-ceph
...
- `ceph.Conn` interface no longer depends on go-ceph/rados,
now defines its own `PoolStat` structure for our use.
- New separate `rados` package that implements the interface
- Merged `mocks` package into `ceph` to avoid circular import
2022-02-24 15:57:00 -05:00
Kyle
566f1fa5d3
a ton of refactoring
2022-02-23 15:43:46 -08:00