112 Commits

Author SHA1 Message Date
xie xingguo
023524a26d osd/PeeringState: restart peering on any previous down acting member coming back
One of our customers wants to verify the data safety of Ceph while scaling
the cluster up, and the test case looks like:
- keep checking the status of a specified pg, whose up set is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5],
  pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2: acting stays unchanged because 2 belongs to neither the current up set nor
  the acting set, leaving the corresponding pg undersized for a long time, until all backfill
  targets complete

It does not pose any critical problem -- we'll end up getting that pg back into active + clean,
except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data
safety more than anything else.

The right way to achieve the above goal is for:

	boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)

to check whether the newly booted node could be validly chosen for the acting set and
request a new temp mapping. The new temp mapping would then trigger a real interval change
that will get rid of the DEGRADED warning.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>
2020-02-21 17:52:52 +08:00
Sage Weil
f10cc22c60 Merge PR #32961 into master
* refs/pull/32961/head:
	qa/standalone/osd/osd-bench: debug bluestore

Reviewed-by: Neha Ojha <nojha@redhat.com>
2020-01-30 10:42:17 -06:00
Sage Weil
b99e506a3f qa/standalone/osd/osd-bench: debug bluestore
Looking for https://tracker.ceph.com/issues/43888

Signed-off-by: Sage Weil <sage@redhat.com>
2020-01-29 07:43:41 -06:00
David Zafman
e18519ad09 test: Update pg log test for new trimming behavior
Fixes: https://tracker.ceph.com/issues/43864

Signed-off-by: David Zafman <dzafman@redhat.com>
2020-01-28 15:23:45 -08:00
Neha
b20817795a qa/standalone/osd/osd-backfill-recovery-log.sh: fix TEST_backfill_log_2
Fixes: https://tracker.ceph.com/issues/43807
Signed-off-by: Neha Ojha <nojha@redhat.com>
2020-01-24 22:42:04 +00:00
Neha
994698277b qa/standalone/osd/osd-backfill-recovery-log.sh: fix TEST_backfill_log_1
Fixes: https://tracker.ceph.com/issues/43807
Signed-off-by: Neha Ojha <nojha@redhat.com>
2020-01-24 22:20:21 +00:00
David Zafman
9f7aabbe9f test: Fix wait_for_state() to wait for a PG to get into a state
To avoid confusion, rename the functions in osd-backfill-space.sh to match how
they actually work.
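
A bounded wait of this kind could look roughly like the sketch below. It is written in Python for illustration only; the real helper is a shell function in the test scripts, and the signature, the substring match on the state string, and the injectable `sleep` and `clock` parameters are all assumptions.

```python
import time


def wait_for_state(get_state, wanted, timeout=10.0, interval=0.1,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until its result contains `wanted` or the timeout expires.

    get_state is a hypothetical callable returning a pg state string such as
    "active+clean". Returns True on success, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if wanted in get_state():
            return True
        if clock() >= deadline:
            return False  # give up instead of looping forever
        sleep(interval)


# Usage with a fake state source that becomes clean on the third poll.
states = iter(["peering", "active+remapped+backfilling", "active+clean"])
print(wait_for_state(lambda: next(states), "clean", timeout=5, interval=0))
```

Checking the state before checking the deadline means the state is polled at least once even when the timeout is zero.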

Fixes: https://tracker.ceph.com/issues/43592

Signed-off-by: David Zafman <dzafman@redhat.com>
2020-01-13 18:39:38 -08:00
David Zafman
676d882649 test: Improve races by using kill_daemons which waits for OSDs to terminate
osd-backfill-space.sh: More sleep time to make sure the backfill gets started

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-12-06 19:44:06 -08:00
David Zafman
43f6218993 test: Use activate_osd() when restarting OSDs
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-12-05 15:13:31 -08:00
Sage Weil
8994a65242 qa/standalone/osd/divergent-priors: add reproducer for bug 41816
Reproducer for https://tracker.ceph.com/issues/41816

Signed-off-by: Sage Weil <sage@redhat.com>
2019-09-21 10:09:15 -05:00
David Zafman
b98950e707 osd: Rename dump_reservations to dump_recovery_reservations
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-09-10 13:32:29 -07:00
David Zafman
fa698e18e1 mon: Improve health status for backfill_toofull and recovery_toofull
Treat backfill_toofull as a warning condition because it can resolve itself.
Includes test case for PG_BACKFILL_FULL
Includes test case for recovery_toofull / PG_RECOVERY_FULL

Fixes: https://tracker.ceph.com/issues/39555

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-06-20 02:22:01 +00:00
David Zafman
7959159e83 test: Adding standalone test of log copy handling
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-05-10 15:31:51 -07:00
sjust@redhat.com
252d5c20cf osd/: move stat updates and publishing to PeeringState
Signed-off-by: Samuel Just <sjust@redhat.com>
2019-05-01 11:22:24 -07:00
David Zafman
66b041fa4a Merge pull request #27769 from dzafman/wip-39333
osd-backfill-space.sh test failed in TEST_backfill_multi_partial()

Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-04-26 11:55:04 -07:00
David Zafman
9931023457 test: osd-backfill-space.sh doesn't matter which PG wins the race
Fixes: http://tracker.ceph.com/issues/39333

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-26 10:11:00 -07:00
David Zafman
39cc14bdc1 Merge pull request #27503 from dzafman/wip-39099
osd: Give recovery for inactive PGs a higher priority

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-04-25 15:06:56 -07:00
David Zafman
444aa9f9fe osd, mon: New pool recovery priority range -10 to 10
Use OSD_POOL_PRIORITY_MAX and OSD_POOL_PRIORITY_MIN constants
Scale legacy priorities if they exceed the maximum

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-25 13:53:27 -07:00
David Zafman
3a234164d0 Merge pull request #27279 from dzafman/wip-divergent
Improvements to standalone tests

Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-04-24 10:58:11 -07:00
David Zafman
7e77898001 test: Divergent testing of _merge_object_divergent_entries() cases
Case 1: A more recent update exists
Case 2: The first entry in the divergent sequence is a create
Case 3: NOT TESTED - Object currently missing
Case 4: We can rollback all of the entries
Case 5: We cannot rollback at least 1 of the entries

Support starting OSDs even when "noup" is set (don't wait for up).
Move create_ec_pool() to ceph-helpers.sh

Fixes: https://tracker.ceph.com/issues/39162

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-22 18:50:24 -07:00
Sage Weil
755e8c4ef2 Merge PR #27595 into master
* refs/pull/27595/head:
	osd: add 'ceph osd stop <osd.nnn>' command

Reviewed-by: Sage Weil <sage@redhat.com>
2019-04-20 08:52:01 -05:00
xie xingguo
5dbae13ce0 osd: add 'ceph osd stop <osd.nnn>' command
The stop command forces a specified osd daemon to stop, so you don't have to
figure out in advance where it is located.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
2019-04-18 13:55:02 +08:00
Sage Weil
dc97651cbd Merge PR #27499 into master
* refs/pull/27499/head:
	qa/standalone/osd/osd-markdown: fix dup command disabling

Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-04-12 06:54:58 -05:00
Sage Weil
f7216d0b2c qa/standalone/osd/osd-markdown: fix dup command disabling
The ceph cli tool checks for the presence of the variable, not its value.
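
The difference between a presence check and a value check can be shown with a small sketch. The variable name `EXAMPLE_SEND_DUPS` and both helper functions are hypothetical and are not taken from the ceph cli source.

```python
FLAG = "EXAMPLE_SEND_DUPS"  # hypothetical variable name, for illustration only


def enabled_by_presence(env):
    """Feature is on whenever the variable exists, whatever it contains."""
    return FLAG in env


def enabled_by_value(env):
    """Feature is on only when the variable holds a truthy-looking value."""
    return env.get(FLAG, "").lower() not in ("", "0", "false", "no")


env = {FLAG: "0"}
print(enabled_by_presence(env))  # True: setting it to "0" does not disable it
print(enabled_by_value(env))     # False

env.pop(FLAG)                    # the variable must be unset to disable it
print(enabled_by_presence(env))  # False
```

With a presence check, a test has to unset the variable rather than assign a false-looking value to it.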

Fixes: http://tracker.ceph.com/issues/38359
Signed-off-by: Sage Weil <sage@redhat.com>
2019-04-10 16:44:38 -05:00
David Zafman
69fa515c95 test: Make most tests use default objectstore bluestore
Change run_osd() to default objectstore bluestore
Use run_osd_filestore() to use the non-default objectstore
Fix inject_eio to handle any objectstore if config prefixed with type

Remaining tests using filestore:
	osd-pool-create.sh TEST_pool_create_rep_expected_num_objects
		Test filestore directory creation
	qa/standalone/osd/osd-dup.sh TEST_filestore_to_bluestore
		Obvious
	qa/standalone/osd/osd-rep-recov-eio.sh TEST_rep_read_unfound
		Requires data digest in object info
	qa/standalone/scrub/osd-scrub-repair.sh multiple tests
		Erasure code pools append mode for filestore is tested
	qa/standalone/special/ceph_objectstore_tool.py
		Test code verifies COT by directly examining filestore contents

Fixes: https://tracker.ceph.com/issues/39162

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-10 08:55:04 -07:00
xie xingguo
6a8aedc107 qa: add new test case for pulling error
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
2019-04-04 11:04:43 +08:00
David Zafman
11f072fee1 Add checking of num_shards_repaired in osd stats
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-04-04 11:04:42 +08:00
Sage Weil
420edba243 Merge PR #27169 into master
* refs/pull/27169/head:
	common/config: parse --default-$option as a default value

Reviewed-by: Sébastien Han <seb@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-03-27 09:48:33 -05:00
Sage Weil
fdd2000631 common/config: parse --default-$option as a default value
Sometimes it is useful to specify an alternative default value for an
option via the command line such that it has a lower priority than the
mon config database, config file, the rest of the command line, or the
environment.
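
A layered lookup along these lines can be sketched as follows. This is an illustration only: the source names in `PRECEDENCE`, their relative order above the command-line default, and the option names are assumptions, and the parsing is far simpler than a real argument parser.

```python
# Sources from lowest to highest precedence. Only the position of
# "cmdline_default" below the later entries comes from the commit message;
# the order among the remaining sources is assumed for this sketch.
PRECEDENCE = ["compiled_default", "cmdline_default", "mon_db",
              "config_file", "cmdline", "env"]


def split_args(argv):
    """Separate --default-<option> <value> pairs from --<option> <value> pairs."""
    defaults, overrides = {}, {}
    it = iter(argv)
    for arg in it:
        if not arg.startswith("--"):
            continue
        name = arg[2:]
        value = next(it, None)
        if name.startswith("default-"):
            defaults[name[len("default-"):]] = value
        else:
            overrides[name] = value
    return defaults, overrides


def resolve(option, layers):
    """Return the value from the highest-precedence layer that sets the option."""
    for source in reversed(PRECEDENCE):
        if option in layers.get(source, {}):
            return layers[source][option]
    return None


defaults, overrides = split_args(["--default-log-level", "5", "--osd-count", "3"])
layers = {"compiled_default": {"log-level": "1"},
          "cmdline_default": defaults, "cmdline": overrides}
print(resolve("log-level", layers))   # "5": beats the compiled default
layers["config_file"] = {"log-level": "10"}
print(resolve("log-level", layers))   # "10": the config file outranks it
```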

Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-26 11:00:27 -05:00
David Zafman
d2ca3d2feb osd: Track num_objects_repaired in pg stats 2(3)
Leave repair pg state on until recovery finishes or a new scrub starts

Fixes: http://tracker.ceph.com/issues/38616

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-03-25 16:03:36 -07:00
Sage Weil
be1187575b Merge PR #27021 into master
* refs/pull/27021/head:
	msg: remove XioMessenger
	qa/suites/rados/thrash-old-clients: add nautilus
	qa/suites/rados/thrash-old-clients: add mimic v1 variant
	qa/suites/rados/thrash-old-clients: add mimic
	qa/suites/rados/thrash-old-clients: collapse msgr and client choice
	qa: remove simplemessenger tests
	ceph_test_msgr: remove simple
	msg: remove SimpleMessenger

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Matt Benjamin <mbenjami@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
2019-03-22 04:42:30 -05:00
Sage Weil
28b4392a71 qa: remove simplemessenger tests
Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-20 06:10:25 -05:00
Sage Weil
fb915c4805 osd/PG: invalidate PG if merging with unexpected version
If the source or target PG version is 0'0, we may silently take the max
of the source and target and still leave the PG complete.  This
specifically can happen with an empty PG, as seen with bug 38655.  In
theory we could encounter one of the PGs with some other last_update
that doesn't match what we expect.  If that ever happens, make sure the
result is incomplete so that backfill can clean up.

Additionally check that the pool metadata for the last merge matches the
PGs at all.  This could mismatch if we have an osdmap gap and are forced
to do some merge without merge info at all... in which case we should
definitely invalidate: there should be newer copies of the PG(s), and we
have no idea whether the PGs we are merging are what we want.  If this is
some disaster recovery situation, an operator is always free to use
ceph-objectstore-tool to re-mark a PG complete (at their own peril!).
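
The decision described above can be modelled in a few lines. This is a simplified sketch, not the OSD code: `Version`, `merge_result_is_complete`, and the idea of passing the expected versions as arguments are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Version(NamedTuple):
    """Toy stand-in for a PG version written as epoch'seq, e.g. 0'0."""
    epoch: int
    seq: int


def merge_result_is_complete(source: Version, target: Version,
                             expected_source: Optional[Version],
                             expected_target: Optional[Version]) -> bool:
    """Decide whether a merged PG may stay complete or must be invalidated."""
    # No recorded merge info at all (e.g. after an osdmap gap): do not trust it.
    if expected_source is None or expected_target is None:
        return False
    # Any mismatch, including an unexpected empty 0'0 PG, invalidates the result
    # so that backfill can clean up.
    return source == expected_source and target == expected_target


exp_src, exp_tgt = Version(20, 7), Version(20, 9)
print(merge_result_is_complete(Version(20, 7), Version(20, 9), exp_src, exp_tgt))  # True
print(merge_result_is_complete(Version(0, 0), Version(20, 9), exp_src, exp_tgt))   # False
print(merge_result_is_complete(Version(20, 7), Version(20, 9), None, None))        # False
```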

Fixes: http://tracker.ceph.com/issues/38655
Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-12 10:08:46 -05:00
Sage Weil
f978b27d2b qa/standalone/osd/pg-split-merge.sh: reproduce pg merge problem with empty pgs
This reproduces http://tracker.ceph.com/issues/38655

Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-11 17:10:28 -05:00
Sage Weil
bf74c1adc4 qa/standalone/osd/osd-rep-recov-eio: fix better
- no need for the default pool size
- no initial osds or it will collide with setup_osds later
- no need for rbd pool at all

Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-08 17:41:11 -06:00
Sage Weil
b59ff3860f qa/standalone/osd/osd-force-create-pg: create more pgs
Avoid warnings about too few pgs.

Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-06 16:27:56 -06:00
Sage Weil
cba0483b09 qa/standalone: make sure an osd is running before create_rbd_pool
'rbd pool init' now does IO.  Drop the pool, or change the pool size to 1.

Fixes: http://tracker.ceph.com/issues/38585
Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-06 16:27:56 -06:00
Sage Weil
01316aa7bd qa/standalone/osd/pg-split-merge: fix import_after_merge_and_gap
This test introduces a map gap.  What *should* happen is that when there is
such a gap, we cannot import.  Previously, the test didn't reliably produce
a map gap at all, and didn't check that import failed--it verified that it
passed.

Fix the test so that it reliably produces a gap *and* reports
min_last_epoch_clean to the mon so we can trim.  Then verify we fail to
import, but can with --force.  But remove the pg again, because if we
force an import with a map gap the osd will refuse to start.

Fixes: http://tracker.ceph.com/issues/38525
Signed-off-by: Sage Weil <sage@redhat.com>
2019-03-03 10:23:27 -06:00
Sage Weil
c6a7b2cbd1 qa/standalone/osd/osd-markdown: disable CLI command dups
The markdown test is based on marking down a specific number of times, but
the duplicate commands from the CLI may not get absorbed/batched by the
mon, breaking the test.  Override the default qa/tasks/workunit.py
behavior of sending dups.

Fixes: http://tracker.ceph.com/issues/38359
Signed-off-by: Sage Weil <sage@redhat.com>
2019-02-18 15:02:25 -06:00
David Zafman
64beabc4c6 test: Limit loops waiting for force-backfill/force-recovery to happen
Fixes: http://tracker.ceph.com/issues/38309

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-02-13 17:44:53 -08:00
David Zafman
910a95b9c8 test: osd-backfill-stats.sh Fix check of multi backfill OSDs, skip remapped test
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-02-07 20:05:58 -08:00
David Zafman
690ff9a21f Merge pull request #26213 from dzafman/wip-38041
osd: Fix recovery and backfill priority handling

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2019-02-07 17:26:34 -08:00
David Zafman
ca5cf14fa8 test: Add scripts to test backfill/recovery priority handling
Signed-off-by: David Zafman <dzafman@redhat.com>
2019-02-07 15:46:23 -08:00
David Zafman
36e305c4b6 test: Ignore kill_daemons() error
Workaround for: http://tracker.ceph.com/issues/38195

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-02-05 11:31:32 -08:00
David Zafman
cc6339c0cd test: Increase timeouts in osd-backfill-space.sh because of failure seen
Fixes: http://tracker.ceph.com/issues/38027

Signed-off-by: David Zafman <dzafman@redhat.com>
2019-02-05 11:29:32 -08:00
David Zafman
99ddd3666b Merge pull request #22797 from dzafman/wip-19753
osd: Deny reservation if expected backfill size would put us over bac…

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2019-01-18 07:42:00 -08:00
Vikhyat Umrao
8a694fc2f9 qa: specify filestore for misc tests
Signed-off-by: Vikhyat Umrao <vumrao@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
2019-01-16 13:09:19 -06:00
Sage Weil
b92be2ca9b qa/standalone/osd/osd-fast-mark-down: use v1 addr w/ simplemessenger
Signed-off-by: Sage Weil <sage@redhat.com>
2019-01-03 11:17:31 -06:00
David Zafman
094d39aa09 test: Add testing for erasure code backfill out of space detection
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-12-18 09:30:44 -08:00
David Zafman
3b8f86c8b0 test: Add testing for backfill out of space detection
Signed-off-by: David Zafman <dzafman@redhat.com>
2018-12-18 09:30:44 -08:00