110 Commits

Author SHA1 Message Date
Yuri Weinstein
b8f632327f
Merge pull request #35279 from badone/wip-py2-fix-osd-scrub-repair.sh
qa/*/osd-scrub-repair.sh: Convert to python3 print syntax

Reviewed-by: Kefu Chai <kchai@redhat.com>
2020-06-03 11:12:21 -07:00
Neha Ojha
3a06af5af5 qa/standalone/scrub/osd-scrub-snaps.sh: fix grep pattern
The error looks like this:

2020-05-28T20:56:30.214+0000 7f66cdecf700 -1 log_channel(cluster) log [ERR] : scrub 1.0 1:ab946124:::obj15:head : can't decode 'snapset' attr void SnapSet::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer understand old encoding version 3 < 97: Malformed input

Fixes: https://tracker.ceph.com/issues/45760
Signed-off-by: Neha Ojha <nojha@redhat.com>
2020-05-28 22:41:38 +00:00
Neha Ojha
f72b19d09c qa/standalone/scrub/osd-scrub-repair.sh: fix grep pattern to match decode exception
We fail because the error message in the log looks like:

2020-05-27T21:02:48.447+0000 7fbfc4e60700 -1 log_channel(cluster) log [ERR] : scrub 3.0 3:5c7b2c47:::ROBJ16:head : can't decode 'snapset' attr void SnapSet::decode(ceph::buffer::v15_2_0::list::const_iterator&) no longer understand old encoding version 3 < 97: Malformed input

Fixes: https://tracker.ceph.com/issues/45660
Signed-off-by: Neha Ojha <nojha@redhat.com>
2020-05-28 00:38:17 +00:00
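
The log line quoted in commit f72b19d09c above can be matched with a pattern along these lines. This is only an illustrative sketch: the regular expression and the helper name `is_snapset_decode_error` are assumptions made for this example, not the grep pattern actually used in osd-scrub-repair.sh.

```python
import re

# Hypothetical pattern for illustration only; the real test uses grep in a shell script.
# It anchors on the stable parts of the message and tolerates the C++ signature in between.
DECODE_ERR = re.compile(
    r"\[ERR\] : scrub \S+ \S+ : can't decode 'snapset' attr .*Malformed input"
)

# Sample line copied from the commit message above.
SAMPLE = (
    "2020-05-27T21:02:48.447+0000 7fbfc4e60700 -1 log_channel(cluster) log [ERR] : "
    "scrub 3.0 3:5c7b2c47:::ROBJ16:head : can't decode 'snapset' attr "
    "void SnapSet::decode(ceph::buffer::v15_2_0::list::const_iterator&) "
    "no longer understand old encoding version 3 < 97: Malformed input"
)


def is_snapset_decode_error(line: str) -> bool:
    """Return True if the line looks like a scrub 'snapset' decode failure."""
    return DECODE_ERR.search(line) is not None
```

Matching on the fixed prefix and the trailing "Malformed input" keeps the check independent of the exact exception text in the middle, which is the part that changed.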
Brad Hubbard
80e7b7c19b qa/*/osd-scrub-repair.sh: Convert to python3 print syntax
Fixes: https://tracker.ceph.com/issues/45733

Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
2020-05-28 08:32:54 +10:00
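
As a rough illustration of why commit 80e7b7c19b was needed: the Python 2 print statement does not parse under Python 3, while the function-call form does. The helper `parses_as_python3` is a hypothetical name used only for this sketch and is not part of the Ceph test scripts.

```python
def parses_as_python3(snippet: str) -> bool:
    """Return True if the snippet compiles under the running Python 3 interpreter."""
    try:
        # compile() only parses the code; nothing is executed.
        compile(snippet, "<snippet>", "exec")
    except SyntaxError:
        return False
    return True


# Python 2 statement form: rejected by the Python 3 parser.
old_style = 'print "ok"'
# Python 3 function form: accepted.
new_style = 'print("ok")'
```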
Neha Ojha
7c8b627eaa qa/*/osd-scrub-repair.sh: don't fail if PG is in active+clean+wait
a0b453ad33 added the wait state, which can
make PGs stay in active+clean+wait for a while instead of going into
active+clean directly. As far as TEST_auto_repair_bluestore_failed is
concerned, we only care about the repair state being cleared.

Fixes: https://tracker.ceph.com/issues/45075
Signed-off-by: Neha Ojha <nojha@redhat.com>
2020-04-23 20:24:28 +00:00
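
The idea behind commit 7c8b627eaa, checking only that the repair state has been cleared rather than requiring an exact active+clean state, can be sketched as follows. The function `repair_cleared` is a hypothetical helper for illustration; the actual test is a shell script and its check may differ.

```python
def repair_cleared(pg_state: str) -> bool:
    """Return True if the '+'-separated PG state no longer contains 'repair'.

    Other flags such as 'wait' are deliberately ignored.
    """
    flags = pg_state.split("+")
    return "repair" not in flags
```

With this check, `active+clean` and `active+clean+wait` are treated the same, while a state that still carries `repair` is not accepted.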
Neha Ojha
4f82ebf41b qa/standalone/scrub/osd-scrub-repair.sh: fix race in TEST_auto_repair_bluestore_failed
We need to flush_pg_stats before checking for active+clean.

Fixes: https://tracker.ceph.com/issues/45075
Signed-off-by: Neha Ojha <nojha@redhat.com>
2020-04-20 18:29:51 +00:00
Kefu Chai
b1738cd1ef qa/standalone/scrub: s/$(pgid)/${pgid}/
to address the test failures like
```
2020-04-07T15:44:58.693 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh:498: TEST_auto_repair_bluestore_failed:  ceph pg dump pgs
2020-04-07T15:44:58.694 INFO:tasks.workunit.client.0.smithi049.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh:498: TEST_auto_repair_bluestore_failed:  pgid
2020-04-07T15:44:58.694 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/scrub/osd-scrub-repair.sh: line 498: pgid: command not found
```

Signed-off-by: Kefu Chai <kchai@redhat.com>
2020-04-08 00:54:46 +08:00
Sage Weil
3212932ba1 Merge PR #33809 into octopus
* refs/pull/33809/head:
	qa/standalone/scrub/osd-scrub-repair: force osdmap prop to osds
	qa/standalone/scrub/osd-scrub-test: wait longer for update

Reviewed-by: David Zafman <dzafman@redhat.com>
2020-03-09 15:28:19 -05:00
Sage Weil
0447ed0ff9 qa/standalone/scrub/osd-scrub-repair: force osdmap prop to osds
flush_pg_stats isn't sufficient to ensure that OSDs have the latest
OSDMap.

Signed-off-by: Sage Weil <sage@redhat.com>
2020-03-08 14:52:10 -05:00
Sage Weil
ac9befd450 qa/standalone/scrub/osd-scrub-test: wait longer for update
Fixes: https://tracker.ceph.com/issues/43865
Signed-off-by: Sage Weil <sage@redhat.com>
2020-03-08 14:45:00 -05:00
David Zafman
e509b7c7d0 test: Add flush_pg_stats to avoid race with getting num_shards_repaired
Fixes: https://tracker.ceph.com/issues/44439

Signed-off-by: David Zafman <dzafman@redhat.com>
2020-03-06 04:25:37 +00:00
Sage Weil
acd4f5bc43 qa/standalone: python -> python3
Signed-off-by: Sage Weil <sage@redhat.com>
2019-12-20 13:33:21 -06:00
David Zafman
43f6218993 test: Use activate_osd() when restarting OSDs
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-12-05 15:13:31 -08:00
David Zafman
cca541d0f9 test: osd-scrub-snaps.sh: Fix race with osd restart and doing a scrub
Fixes: https://tracker.ceph.com/issues/43150

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-12-05 15:12:43 -08:00
Sage Weil
1e44d86b2c osd: change trigger_[deep_]scrub commands to a pg tell command
This is cleaner.  All users are currently standalone tests; updated.

It also means that *all* commands that have a name=pgid arg are pg tell
commands.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-10-04 09:07:02 -05:00
David Zafman
b3e1c58b0e osd: Replace active/pending scrub tracking for local/remote
This is similar to how recovery reservations are split between
local and remote.

It was the case that scrubs_pending was used for reservations at
the replicas as well as at the primary while requesting reservations
from the replicas.  There was no need for scrubs_pending to turn
into scrubs_active at the primary as nothing treated that value
as special.  scrubber.active = true when scrubbing is
actually going.

Now scrubber.local_reserved indicates that scrubs_local was incremented
Now scrubber.remote_reserved indicates that scrubs_remote was incremented

Fixes: https://tracker.ceph.com/issues/41669

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-09-10 13:33:27 -07:00
Sage Weil
f5a1c57c94 qa/standalone/scrub/osd-scrub-snaps: snapmapper omap is now 'm'
...due to per-pool omap.

Fixes 91f533be71

Fixes: https://tracker.ceph.com/issues/41353
Signed-off-by: Sage Weil <sage@redhat.com>
2019-08-20 16:18:41 -05:00
Kefu Chai
fc55a51a87
Merge pull request #29579 from liewegas/wip-big-vs-bluestore
osd: scrub error on big objects; make bluestore refuse to start on big objects

Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-08-16 20:24:43 +08:00
David Zafman
5928fe8ca0 osd/PG: scrub error when objects are larger than osd_max_object_size
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-08-14 20:25:12 -05:00
Kefu Chai
f13c7c83d9
Merge pull request #29342 from Jeegn-Chen/wip-scrub-extended-sleep
osd: support osd_scrub_extended_sleep

Reviewed-by: David Zafman <dzafman@redhat.com>
2019-08-13 09:09:52 +08:00
Jeegn Chen
3bfb5c2621 osd: support osd_scrub_extended_sleep
1. always take osd_scrub_sleep for manually initiated
   scrubs
2. when scrub_time_permit() returns true for scheduled
   ones, the existing osd_scrub_sleep is used
3. when scrub_time_permit() returns false for scheduled
   ones, there may be 2 scenarios
   3.1 if osd_scrub_extended_sleep <= osd_scrub_sleep,
       let's take osd_scrub_sleep
   3.2 otherwise, let's take osd_scrub_extended_sleep

Fixes: http://tracker.ceph.com/issues/40955
Signed-off-by: Jeegn Chen <jeegnchen@tencent.com>
2019-08-12 16:54:36 +08:00
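
The sleep selection rules listed in commit 3bfb5c2621 above can be sketched as a small function. This is a simplified model under stated assumptions, not the OSD's C++ implementation: the function name `effective_scrub_sleep` and its parameters are made up for this example, with `time_permitted` standing in for the result of scrub_time_permit().

```python
def effective_scrub_sleep(manual: bool, time_permitted: bool,
                          scrub_sleep: float, extended_sleep: float) -> float:
    """Pick the sleep between scrub chunks, following rules 1-3 of the commit message."""
    # Rule 1: manually initiated scrubs always use osd_scrub_sleep.
    if manual:
        return scrub_sleep
    # Rule 2: scheduled scrub inside the permitted time window.
    if time_permitted:
        return scrub_sleep
    # Rule 3: scheduled scrub outside the window. Rules 3.1 and 3.2 together
    # amount to taking the larger of the two values.
    return max(scrub_sleep, extended_sleep)
```

For example, with osd_scrub_sleep at 0.1 and osd_scrub_extended_sleep at 5.0, only a scheduled scrub running outside the permitted window would get the 5.0 value.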
David Zafman
74d294d70b test: Bump sleep time for slower machines
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-08-05 07:40:09 -07:00
Sage Weil
1b46267cf7 Merge PR #28839 into master
* refs/pull/28839/head:
	osd: support osd_repair_during_recovery

Reviewed-by: David Zafman <dzafman@redhat.com>
2019-07-16 10:07:53 -05:00
Sage Weil
ff7813aa14 qa/standalone/scrub/osd-scrub-snaps.sh: adjust expected output
SnapSet now dumps just seq, not a (fake) SnapContext.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-07-12 09:55:06 -05:00
Sage Weil
23eaf7c498 qa/standalone/scrub/osd-scrub-snaps: fix kv grep
SnapMapper keys are now SNA_, not MAP_.

Fixes: http://tracker.ceph.com/issues/40725
Signed-off-by: Sage Weil <sage@redhat.com>
2019-07-12 08:11:21 -05:00
Sage Weil
b2eb5232de Merge PR #28901 into master
* refs/pull/28901/head:
	qa/standalone/scrub/osd-scrub-repair: fix 'scrub ok' grep
	osd/osd_types: remove 'snap_context' from SnapSet::dump()

Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2019-07-08 08:36:05 -05:00
Jeegn Chen
80f4e1f677 osd: support osd_repair_during_recovery
osd_repair_during_recovery=true allows an explicitly requested repair
to be scheduled on OSDs with active recovery.

Fixes: http://tracker.ceph.com/issues/40620
Signed-off-by: Jeegn Chen <jeegnchen@tencent.com>
2019-07-08 09:26:27 +08:00
Sage Weil
a960f2faa7 qa/standalone/scrub/osd-scrub-repair: fix 'scrub ok' grep
The log now also has a 'purged_snaps scrub ok' message that (generally)
precedes the first scrubbed PG.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-07-04 18:27:37 -05:00
Sage Weil
70ad54a0b3 osd/osd_types: remove 'snap_context' from SnapSet::dump()
We no longer have a snaps field with real values, so dumping this as a
"snap_context" is silly.  Instead, just dump the seq.

Adjust qa/standalone/scrub/osd-scrub-repair.sh accordingly.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-07-04 18:24:41 -05:00
David Zafman
fe3b693d0f
Merge pull request #28334 from dzafman/wip-40073
osd: Fix the way that auto repair triggers after regular scrub

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2019-07-03 15:27:27 -07:00
David Zafman
27918bb906 osd: Handle scrub interval changes
Global changes reschedule all PG scrubs
Pool changes reschedule pool PG scrubs

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-06-27 14:20:54 -07:00
David Zafman
893d227c82 test: Make sure that extra scheduled scrubs don't confuse test
Fixes: http://tracker.ceph.com/issues/40078

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-05-29 14:03:57 -07:00
David Zafman
39cc14bdc1
Merge pull request #27503 from dzafman/wip-39099
osd: Give recovery for inactive PGs a higher priority

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-04-25 15:06:56 -07:00
David Zafman
71d254647a test: osd-recovery-scrub.sh ignore error from kill_daemons()
Another workaround for http://tracker.ceph.com/issues/38195

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-25 13:53:27 -07:00
David Zafman
7e77898001 test: Divergent testing of _merge_object_divergent_entries() cases
Case 1: A more recent update exists
Case 2: The first entry in the divergent sequence is a create
Case 3: NOT TESTED - Object currently missing
Case 4: We can rollback all of the entries
Case 5: We cannot rollback at least 1 of the entries

Support starting OSDs even when "noup" is set (don't wait for up).
Move create_ec_pool() to ceph-helpers.sh

Fixes: https://tracker.ceph.com/issues/39162

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-22 18:50:24 -07:00
David Zafman
69fa515c95 test: Make most tests use default objectstore bluestore
Change run_osd() to default objectstore bluestore
Use run_osd_filestore() to use the non-default objectstore
Fix inject_eio to handle any objectstore if config prefixed with type

Remaining tests using filestore:
	osd-pool-create.sh TEST_pool_create_rep_expected_num_objects
		Test filestore directory creation
	qa/standalone/osd/osd-dup.sh TEST_filestore_to_bluestore
		Obvious
	qa/standalone/osd/osd-rep-recov-eio.sh TEST_rep_read_unfound
		Requires data digest in object info
	qa/standalone/scrub/osd-scrub-repair.sh multiple tests
		Erasure code pools append mode for filestore is tested
	qa/standalone/special/ceph_objectstore_tool.py
		Test code verifies COT by directly examining filestore contents

Fixes: https://tracker.ceph.com/issues/39162

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-10 08:55:04 -07:00
David Zafman
57abdb11fa osd, test: Add num_shards_repaired to osd_stat_t for pushes with repair set 3(3)
Fixes: http://tracker.ceph.com/issues/38616

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-03-25 16:03:36 -07:00
David Zafman
d2ca3d2feb osd: Track num_objects_repaired in pg stats 2(3)
Leave repair pg state on until recovery finishes or a new scrub starts

Fixes: http://tracker.ceph.com/issues/38616

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-03-25 16:03:36 -07:00
David Zafman
2202e5d0b1 test, osd: Improvements to auto_repair 1(3)
Allow auto_repair for replicated bluestore pools
Regular scrub within auto repair parameters will trigger deep scrub
New state failed_repair if PG repair attempt could not fix everything
Set failed_repair if not possible to repair anything

Fixes: http://tracker.ceph.com/issues/38616

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-03-23 09:52:40 -07:00
David Zafman
315d324889 test: osd-scrub-repair.sh: use corrupt_and_repair_lrc for lrc tests
Fix for argument handling of create_ec_pool()
Always pass a value for allow_overwrites for consistency

Caused by: 3ca750d41d

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-03-23 09:52:40 -07:00
David Zafman
d4915ee503 qa: Don't create rbd pool because it creates an object
This also reverts commit 10b9626ea7.

Fixes: http://tracker.ceph.com/issues/38631

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-03-11 16:57:51 -07:00
Sage Weil
10b9626ea7 qa/standalone/scrub/osd-scrub-repair: fix unfound grep
It's now "1/2 unfound":

             1/2 objects unfound (50.000%)

..presumably due to the rbd pool init creating the rbd_directory.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-08 18:23:48 -06:00
David Zafman
ef2dc05de0 osd, test: Add test case with osd support for overdue PG scrubs and deep scrubs
Add trigger_deep_scrub osd command for testing
Publish stats when trigger_scrub/trigger_deep_scrub is used for testing
Add optional argument to trigger_scrub/trigger_deep_scrub
for amount of extra time to change last scrub stamps

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-01-23 16:49:33 -08:00
David Zafman
879d89aace test: Correct typo trying to call flush_pg_stats
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-01-23 16:49:33 -08:00
Vikhyat Umrao
8a694fc2f9 qa: specify filestore for misc tests
Signed-off-by: Vikhyat Umrao <vumrao@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
2019-01-16 13:09:19 -06:00
David Zafman
554ea73cb5 test: Disable duplicate request command test during scrub testing
Scrub testing requires orderly control of scrubbing.  Most of the time,
but not always, the duplicate scrub request is ignored because the first
request hasn't finished.  Teuthology enables this environment variable
in the workunit handling.

Fixes: https://tracker.ceph.com/issues/36525

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-12-21 18:28:23 -08:00
David Zafman
975dbc5841 test: Minor improvement to create_ec_pool()
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-12-10 20:16:01 -08:00
David Zafman
1841928e28 test: Add test for requested scrub priority
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-11-14 23:57:20 -08:00
David Zafman
a159f162c5 test: osd-scrub-snaps.sh: After snapshot removal wait for snaptrim to complete
Due to deliberate corruptions snaptrim_error means snaptrim is done

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-11-08 14:48:20 -08:00
David Zafman
e37f95ac27 test: osd-scrub-snaps.sh: Testing with new --rmtype in ceph-objectstore-tool
Use --rmtype snapmap with new obj16 to remove snapmap only, check for repair message
Use --rmtype nosnapmap to remove obj5 while leaving snapmap behind

Signed-off-by: David Zafman <dzafman@redhat.com>
2018-11-08 14:48:20 -08:00