52 Commits

Author SHA1 Message Date
David Zafman
661996d434 mgr: Warn when too many reads are repaired on an OSD
Include test case
Configurable by setting mon_osd_warn_num_repaired (default 10)
Ignore new health warning with random eio injection test

Fixes: https://tracker.ceph.com/issues/41564

Signed-off-by: David Zafman <dzafman@redhat.com>
2020-06-16 17:45:27 -07:00
Igor Fedotov
deb0af6347 doc/rados/operations/health-checks: document bluestore spurious read
errors alert.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
2020-04-14 12:08:30 +03:00
Josh Durgin
772d7c1d3c mgr/pg_autoscaler: add warning when target bytes and ratio are both set
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
2020-02-10 10:08:36 +08:00
Josh Durgin
d62c121ee3 mgr/pg_autoscaler: remove target ratio warning
Since the ratios are normalized, they cannot exceed 1.0 or overcommit
combined with target_bytes.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
2020-02-10 10:08:36 +08:00
Tsung-Ju Lii
253cb9903e doc/rados/operations: fix OSD_OUT_OF_ORDER_FULL fullness ordering
Signed-off-by: Tsung-Ju Lii <usefulalgorithm@gmail.com>
2019-11-13 17:43:48 +08:00
Sage Weil
6e46b1c0e5 osd/OSDMap: health alert for non-power-of-two pg_num
Fixes: https://tracker.ceph.com/issues/41647
Signed-off-by: Sage Weil <sage@redhat.com>
2019-09-24 09:26:33 -05:00
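
The condition behind this alert can be illustrated with a short Python sketch. It is not Ceph's implementation: the function names is_power_of_two and nearest_power_of_two are hypothetical, and rounding ties upward is an assumption made only for this example.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two (1, 2, 4, 8, ...)."""
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


def nearest_power_of_two(n: int) -> int:
    """Return the power of two closest to n; ties round up (an assumption)."""
    if n < 1:
        raise ValueError("pg_num must be positive")
    lower = 1 << (n.bit_length() - 1)  # largest power of two <= n
    upper = lower << 1                 # smallest power of two > n
    return lower if (n - lower) < (upper - n) else upper


if __name__ == "__main__":
    for pg_num in (64, 100, 128):
        status = "ok" if is_power_of_two(pg_num) else "not a power of two"
        print(pg_num, status, "nearest:", nearest_power_of_two(pg_num))
```

A pg_num of 100 would fail this check, and the helper suggests 128 as the closest power of two.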
Sage Weil
2a1b58b5ac doc/rados/operations/monitoring: document muting health alerts
I think someday the docs for how health alerts work (here) and the
enumeration of all actual alerts should be restructured.  For now this
is the simplest place to fit this!

Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
95b8e9fa0d doc/rados/operations/health-checks: document MON_DISK_{LOW,CRIT,BIG}
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
dd5e985614 doc/rados/operations/health-checks: document OSD_NO_DOWN_OUT_INTERVAL
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
0eba993fad doc/rados/operations/health-checks: document AUTH_BAD_CAPS
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
7e9ba0a1c1 doc/rados/operations/health-checks: document PG_SLOW_SNAP_TRIMMING
The mitigation steps are weak, but it's not clear what concrete guidance to
provide.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
078ef210d5 doc/rados/operations/health-checks: document MGR_DOWN
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
1b6745efb4 doc/rados/operations/health-alerts: document BLUESTORE_NO_COMPRESSION
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-14 20:40:08 -05:00
Sage Weil
f011c13547 Merge PR #29292 into master
* refs/pull/29292/head:
	os/bluestore: warn on no per-pool omap
	os/bluestore: fsck: warning (not error) by default on no per-pool omap
	os/bluestore: fsck: int64_t for error count
	os/bluestore: default size of 1 TB for testing
	os/bluestore: behave if we *do* set PGMETA and PERPOOL flags
	os/bluestore: do not set both PGMETA_OMAP and PERPOOL_OMAP
	os/bluestore: fsck: only generate 1 error per omap_head
	os/bluestore: make fsck repair convert to per-pool omap
	os/bluestore: teach fsck to tolerate per-pool omap
	os/bluestore: ondisk format change to 3 for per-pool omap
	mon/PGMap: add data/omap breakouts for 'df detail' view
	osd/osd_types: separate get_{user,allocated}_bytes() into data and omap variants
	mon/PGMap: fix stored_raw calculation
	mon/PGMap: add in actual omap usage into per-pool stats
	osd: report per-pool omap support via store_statfs_t
	os/bluestore: set per_pool_omap key on mkfs
	osd/osd_types: count per-pool omap capable OSDs
	os/bluestore: report omap_allocated per-pool
	os/bluestore: add pool prefix to omap keys
	kv/KeyValueDB: take key_prefix for estimate_prefix_size()
	os/bluestore: fix manual omap key manipulation to use Onode::get_omap_key()
	os/bluestore: make omap key helpers Onode methods
	os/bluestore: add Onode::get_omap_prefix() helper
	os/bluestore: change _do_omap_clear() args

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2019-08-09 10:40:45 -05:00
Sage Weil
b8501164ef os/bluestore: warn on no per-pool omap
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-09 08:21:18 -05:00
Neha Ojha
c9d2833b25
Merge pull request #29425 from aclamk/wip-bluestore-monitor-allocations
[bluestore][tools] Inspect allocations in bluestore

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-08-07 11:37:34 -07:00
Adam Kupczyk
713f9b4d09 doc/rados/operations/health-checks: document BlueStore fragmentation and BlueFS space available features
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
2019-08-07 19:18:21 +02:00
Sage Weil
143e1f0469 mgr/telemetry: force re-opt-in if the report contents change
Signed-off-by: Sage Weil <sage@redhat.com>
2019-07-31 20:33:19 -05:00
Sage Weil
c885ee7f0c mgr/crash: raise RECENT_CRASH warning for recent (new) crashes
Signed-off-by: Sage Weil <sage@redhat.com>
2019-07-19 09:43:04 -05:00
David Zafman
fa698e18e1 mon: Improve health status for backfill_toofull and recovery_toofull
Treat backfill_toofull as a warning condition because it can resolve itself.
Includes test case for PG_BACKFILL_FULL
Includes test case for recovery_toofull / PG_RECOVERY_FULL

Fixes: https://tracker.ceph.com/issues/39555

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-06-20 02:22:01 +00:00
Xie Xingguo
302d7bcdd8
Merge pull request #27735 from xiexingguo/wip-device-class-noout
osd: revamp {noup,nodown,noin,noout} related commands

Reviewed-by: Sage Weil <sage@redhat.com>
2019-06-05 14:17:06 +08:00
xie xingguo
a3b0dc29b9 doc: refresh {noup,nodown,noin,noout} changes
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
2019-05-30 10:52:38 +08:00
zjh
94237d3693 osd: Better error message when OSD count is less than osd_pool_default_size
Fixes: http://tracker.ceph.com/issues/38617

Signed-off-by: zjh <jhzeng93@foxmail.com>
2019-04-28 20:09:13 +08:00
Sage Weil
c2190c1ff8 Merge PR #27519 into master
* refs/pull/27519/head:
	doc/rados/operations/health-checks: document new bluestore warnings
	os/bluestore: alert on fm/bdev size mismatch
	os/bluestore: introduce legacy statfs alert

Reviewed-by: Sage Weil <sage@redhat.com>
2019-04-16 14:31:49 -05:00
Sage Weil
b29495954b doc/rados/operations/health-checks: document new bluestore warnings
Signed-off-by: Sage Weil <sage@redhat.com>
2019-04-15 17:42:48 +03:00
Sage Weil
9aa9893b8f osd/OSDMap: raise OSD_FLAGS health alert for crush node flags, too
Signed-off-by: Sage Weil <sage@redhat.com>
2019-04-12 11:10:35 -05:00
Vangelis Tasoulas
24131fc59a
doc: Update documentation for the MANY_OBJECTS_PER_PG warning
The current documentation for the MANY_OBJECTS_PER_PG warning
states that the threshold can be raised to silence the health
warning by adjusting the mon_pg_warn_max_object_skew config
option on the monitors. This appears to be untrue since at least
Luminous: the option should be adjusted on the managers.

I encountered this problem and spent quite some time injecting
mon_pg_warn_max_object_skew into the monitors, adding the option
to ceph.conf, and restarting the monitors several times, but the
warning did not go away. I had to download the code to see what
was happening, and I found this:

$ git grep -A 3 mon_pg_warn_max_object_skew src/common/options.cc
src/common/options.cc:1480:    Option("mon_pg_warn_max_object_skew", Option::TYPE_FLOAT, Option::LEVEL_ADVANCED)
src/common/options.cc-1481-    .set_default(10.0)
src/common/options.cc-1482-    .set_description("max skew few average in objects per pg")
src/common/options.cc-1483-    .add_service("mgr"),

After I restarted the ceph-mgr service, the warning went away.

Signed-off-by: Vangelis Tasoulas <vangelis@tasoulas.net>
2019-04-05 19:53:35 +02:00
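
As a rough illustration of what a skew threshold like mon_pg_warn_max_object_skew expresses, the Python sketch below flags pools whose objects-per-PG figure exceeds the cluster-wide average by more than a given factor. This is an assumption-laden sketch, not the ceph-mgr code: the function name pools_with_object_skew, the input layout, and the strict greater-than comparison are all hypothetical. Only the default of 10.0 comes from the options.cc excerpt above.

```python
def pools_with_object_skew(pools, max_skew=10.0):
    """Return the names of pools whose objects-per-PG exceeds max_skew times
    the cluster average.

    pools maps a pool name to an (object_count, pg_num) tuple. This layout
    is hypothetical and chosen only for the example.
    """
    total_objects = sum(objects for objects, _ in pools.values())
    total_pgs = sum(pg_num for _, pg_num in pools.values())
    if total_objects == 0 or total_pgs == 0:
        return []  # nothing to compare against

    cluster_avg = total_objects / total_pgs
    flagged = []
    for name, (objects, pg_num) in sorted(pools.items()):
        if pg_num == 0:
            continue  # skip pools without PGs to avoid dividing by zero
        skew = (objects / pg_num) / cluster_avg
        if skew > max_skew:  # strict comparison is an assumption
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    example = {"small": (100, 100), "dense": (100000, 8)}
    print(pools_with_object_skew(example))
```

In the example data, the pool named dense holds far more objects per PG than the cluster average and is the only one flagged. Raising max_skew far enough would clear it, which mirrors the effect of raising the option on the managers.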
Sage Weil
242ef7824d doc/rados/operations: document BLUEFS_SPILLOVER
Signed-off-by: Sage Weil <sage@redhat.com>
2019-04-02 11:13:31 -05:00
Ashish Singh
7108e6a3c7 doc: Fix incorrect mention of 'osd_deep_mon_scrub_interval'
Fixed the incorrect mention of 'osd_deep_mon_scrub_interval' in health-checks.rst.
Changed it to 'osd_deep_scrub_interval'.

Fixes: https://tracker.ceph.com/issues/38310

Signed-off-by: Ashish Singh <assingh@redhat.com>
2019-02-21 12:10:41 +05:30
David Zafman
6a9895b97a mon: Fix scrub health warning handling and change config to a ratio
Make this mon_warn code clearer since it involves 2 values
Code used mon scrub interval instead of pg scrub interval
Rename config values to include _pg_ and ratio to make it more clear
Fix scrub warning handling to use per-pool intervals when specified

Fixes: http://tracker.ceph.com/issues/37264

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-01-23 16:49:33 -08:00
Sage Weil
b5e5ee6f40 Merge PR #25849 into master
* refs/pull/25849/head:
	qa/suites/rados/upgrade: one mon per node, and enable-msgr2 at end
	qa/rados/thrash-old-clients: avoid msgr2
	mon: make bootstrap rank check more robust
	mon: clean up probe debug output a bit
	msg/async: use v1 for v1 <-> [v2,v1] peers
	msg/async/AsyncMessenger: drop single-use _send_to
	mon/HealthMonitor: raise MON_MSGR2_NOT_ENABLED if mons not bound to msgr2
	doc/rados/operations/health-checks: document MON_* health warnings
	mon/MonMapMonitor: add 'mon enable-msgr2' command
	mon: respawn if rank addr changes
	mon/MonMap: calc_addr_mons() after setting rank addrvec

Reviewed-by: Ricardo Dias <rdias@suse.com>
2019-01-17 11:04:30 -06:00
Sage Weil
6ba8db68cd mon/HealthMonitor: raise MON_MSGR2_NOT_ENABLED if mons not bound to msgr2
If the ms_bind_msgr2 option is enabled, and all mons are nautilus,
raise a health alert if any mons aren't bound to msgr2 addresses.

Whitelist tests that set mon_bind_addrvec=false or mon_bind_msgr2=false.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-01-15 10:42:29 -06:00
Sage Weil
57c4795c00 doc/rados/operations/health-checks: document MON_* health warnings
Signed-off-by: Sage Weil <sage@redhat.com>
2019-01-15 10:42:29 -06:00
Sage Weil
94620be57c Merge PR #25273 into master
* refs/pull/25273/head:
	doc/rados/operations/health-checks: Add LARGE_OMAP_OBJECTS

Reviewed-by: Sage Weil <sage@redhat.com>
2019-01-12 05:56:41 -06:00
Brad Hubbard
522a21ec62 doc/rados/operations/health-checks: Add LARGE_OMAP_OBJECTS
Document LARGE_OMAP_OBJECTS health check

Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
2019-01-12 12:16:47 +10:00
Sage Weil
f490fd0130 doc/rados/operations: document autoscaler and its health warnings
Signed-off-by: Sage Weil <sage@redhat.com>
2018-12-18 13:30:54 -06:00
Bryan Stillwell
791b00daa1 doc: Multiple spelling fixes
I ran a lot of the docs through aspell and found a number of spelling problems.

Signed-off-by: Bryan Stillwell <bstillwell@godaddy.com>
2018-08-09 14:51:25 -06:00
Sage Weil
7ab8675fdf doc/rados/operations/health-checks: document DEVICE_HEALTH* messages
Signed-off-by: Sage Weil <sage@redhat.com>
2018-07-31 14:08:53 -05:00
John Spray
191cce74e1 doc: note new mgr module error codes
Signed-off-by: John Spray <john.spray@redhat.com>
2018-01-24 13:08:21 -05:00
Kefu Chai
f5f2ced624 mgr/PGMap: drop REQUEST_{SLOW,STUCK} HEALTH_WARNs in mimic
SLOW_OPS unifies both of them since mimic

Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-11-23 17:41:47 +08:00
Sage Weil
027672b777 doc/rados/operations/health-checks: fix TOO_MANY_PGS discussion
Fiddling with pgp_num doesn't help with TOO_MANY_PGS.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-09-14 16:01:14 -04:00
Alfredo Deza
3ec44df21d doc/rados/operations use new ref labels in health-checks
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2017-08-16 08:20:01 -04:00
Alfredo Deza
5a3da3acaf doc/rados/operations use new ref label in health-checks
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2017-08-16 08:20:01 -04:00
Alfredo Deza
d8932d62bf doc/rados/operations use the new ref label for crush map tunables
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2017-08-16 08:20:00 -04:00
Patrick Donnelly
81be13b34c
doc: remove duplicate CephFS health check doc
These are documented in doc/cephfs/health-messages.rst.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2017-08-04 12:28:38 -07:00
Kefu Chai
f273712e1b doc: document bluestore compression settings
Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-08-02 16:42:08 +08:00
Sage Weil
0afffa5c58 Merge pull request #16611 from liewegas/wip-doc-health
doc/rados/operations/health-checks: osd section

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
2017-08-01 08:26:24 -05:00
Sage Weil
dbb1dd33e6 doc/rados/operations/health-checks: add PG health check commentary
Include a link to pg-repair.rst, although there is no
content there yet.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-08-01 09:25:42 -04:00
Sage Weil
6bac77e960 doc/rados/operations/health-checks: osd section
First paragraph: explain what the error means.

Second or later paragraph: describe steps to fix or mitigate.

Signed-off-by: Sage Weil <sage@redhat.com>
2017-08-01 09:25:41 -04:00
Kefu Chai
2670d244fd doc: various fixes
- radosgw/s3/bucketops.rst: fix Malformed table.
- operations/health-checks.rst: Title underline too short
- rbd/rados-rbd-cmds.rst: Title underline too short
- rados/operations/index.rst: include health-checks in toc

Signed-off-by: Kefu Chai <kchai@redhat.com>
2017-08-01 17:31:36 +08:00