Commit Graph

599 Commits

Author SHA1 Message Date
Yuri Weinstein
e419a29be5
Merge pull request #42735 from amathuria/wip-amathuria-scrub-stats
osd/scrub: Add stats to PG dump for number of objects scrubbed

Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
2022-01-14 10:46:28 -08:00
Gabriel BenHanokh
a39b1f3cf7 tools/ceph-bluestore-tool: Fix bluefs-bdev-expand command
Update allocation file when we expand-device
Add the expended space to the allocator and then force an update to the allocation file

There is also a new standalone test case for expand

Fixes: https://tracker.ceph.com/issues/53699
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
2022-01-12 18:07:59 +02:00
Aishwarya Mathuria
91885f1a87 qa/standalone: add test to check if objects_scrubbed is equal to number of objects in a PG once a scrub finishes
Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
2022-01-12 14:57:40 +05:30
Yuri Weinstein
58faf5712e
Merge pull request #43919 from ronen-fr/wip-rf-test-nodeep
osd/scrub (& qa/standalone): test for scrub behavior when no-scrub is set but no-deep-scrub is not


Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Matan Breizman <Matan.Brz@gmail.com>
2021-12-08 13:04:57 -08:00
Neha Ojha
89d5b2a79e
Merge pull request #43336 from ifed01/wip-fix-bluefs-volumes-ops
qa/osd-bluefs-volume-ops: fix bluefs volumes ops test case

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
2021-12-02 08:39:41 -08:00
Ronen Friedman
20dd022715 qa/standalone: osd-scrub-repair.sh: fix expected "not scrubbed since" warnings count
Following PR#43244, the 'ceph tell pg deep_scrub' now sets both
deep-scrub and "regular" scrub time-stamps. This necessitated a modification
to TEST_scrub_warning, as more PGs in this test are late for their regular scrubbing.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-11-23 16:43:21 +00:00
Ronen Friedman
7008b85fc5 qa/standalone: test for scrub behavior when noscrub is set but nodeepscrub is not
A bug (https://tracker.ceph.com/issues/52901 - now fixed) resulted in
this combination of conditions leaving the PG in "scrubbing" state
forever. That bug was fixed by PR#43521. The patch here adds a
test to detect the (now fixed) wrong behavior.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-11-23 15:08:42 +00:00
Ronen Friedman
9dda986bd5 qa/standalone: fix scrub/osd-scrub-dump following changes to 'pg dump pgs' output
Make osd-scrub-dump test ignore the 'scrubbing' that might be late to disappear
from the modified (PR #43403) 'pg dump' output.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-11-12 18:43:41 +00:00
Ronen Friedman
10909c3cba osd/scrub: update the stand-alone tests to check 'scrub scheduling' entries
Analyzing and verifying the relevant entries in 'pg query' and
'pg dump' output.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-11-05 17:07:57 +02:00
Igor Fedotov
efb67445c2 qa/osd-bluefs-volume-ops: retry data writing if spillover hasn't
happened.

Fixes: https://tracker.ceph.com/issues/52676
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
2021-11-02 17:26:39 +03:00
Ronen Friedman
52e9fa16ef tests: modify osd-scrub-repair to match PR #43239 changes
PR #43239 has modified ECBackend::get_hash_info() behavior.
Modified the standalone scrub test to match.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-10-20 06:42:51 +00:00
Zack Cerza
b57539dc94 Revert "qa: support isal ec test for aarch64"
This commit has been causing scheduled jobs to request e.g. aarch64
smithi machines, which don't exist. The dispatcher then tries to find them forever, requiring the dispatcher to be killed and restarted. The queue
will sit idle until someone notices the problem.

Signed-off-by: Zack Cerza <zack@redhat.com>
2021-10-12 12:53:58 -06:00
Dai Zhiwei
eaa385f3da qa: support isal ec test for aarch64
modified:   qa/standalone/erasure-code/test-erasure-code-plugins.sh
	new file:   qa/suites/rados/thrash-erasure-code-isa/arch/aarch64.yaml

Signed-off-by: Dai Zhiwei <daizhiwei3@huawei.com>
2021-10-08 14:37:25 +08:00
Aishwarya Mathuria
1b4e416f81 osd/scrub: Add scrub duration to pg dump stats
Addition of a new column, SCRUB_DURATION, to the pg stats that stores the time taken for a PG scrub.

Fixes: https://tracker.ceph.com/issues/52605
Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
2021-10-01 13:27:27 +05:30
Neha Ojha
e273418bbb
Merge pull request #42604 from sseshasa/wip-skip-osd-benchmark
osd: Add config option to skip running the osd benchmark during init and update documentation.

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2021-09-08 11:03:09 -07:00
Sridhar Seshasayee
f539bedc96 qa/standalone: Add standalone test to validate osd-mclock-skip-benchmark option
Add a standalone test - test_activate_osd_skip_benchmark() in ceph-helpers.sh
that exercises the osd-mclock-skip-benchmark option.

Fixes: https://tracker.ceph.com/issues/52025
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-09-01 14:19:03 +05:30
Igor Fedotov
0b0f8ef12f qa/osd-bluefs-volume-ops: reproduce bluefs migrate bug
Reproduces: https://tracker.ceph.com/issues/40434
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
2021-08-31 16:23:22 +03:00
Sridhar Seshasayee
464e9ea6c0 qa/standalone/misc: ver-health.sh: Increase wait_for_health_string() timeout
Modified test cases:

1. ver-health.sh:
  a. TEST_check_version_health_1():
    To avoid intermittent timeouts observed in wait_for_health_string(),
    increase the wait time to 20 secs.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-30 18:16:00 +05:30
Sridhar Seshasayee
33d2a2c93b qa/standalone/scrub: Force a subset of scrub tests to use "wpq" scheduler
The following tests in the test files mentioned below use the
"osd_scrub_sleep" option to introduce delays during scrubbing to help
determine scrubbing states, validate reservations during scrubbing etc..
This works when using the "wpq" scheduler.

But when the "mclock_scheduler" is enabled, the "osd_scrub_sleep" is
disabled and overridden to 0. This is done to delegate the scheduling of
the background scrubs to the "mclock_scheduler" based on the set QoS
parameters. Due to this, the checks to verify the scrub states,
reservations etc. fail since the window to check them is very short
due to scrubs completing very quickly. This affects a small subset of
scrub tests mentioned below,

1. osd-scrub-dump.sh -> TEST_recover_unexpected()
2. osd-scrub-repair.sh -> TEST_auto_repair_bluestore_tag()
3. osd-scrub-test.sh -> TEST_scrub_abort(), TEST_deep_scrub_abort()

Only for the above tests, until there's a reliable way to query scrub
states with "--osd-scrub-sleep" set to 0, the "osd_op_queue" config
option is set to "wpq".

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-30 18:16:00 +05:30
Sridhar Seshasayee
f658ff3511 qa/standalone/erasure-code: Modify erasure-code tests for mclock scheduler
Modified test cases:

1. test-erasure-eio.sh:
  a. Test_ec_backfill_unfound():
    - Set osd_mclock_profile to high_recovery_ops profile.
    - Increase the wait for backfill_unfound timeout to 240 secs.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-30 18:16:00 +05:30
Sridhar Seshasayee
bdf36cf045 qa/standalone/osd-backfill: Modify backfill tests for mclock scheduler
Modified test cases:

1. osd-backfill-prio.sh:
  Set osd_op_queue = wpq for all tests since the mclock doesn't
  consider recovery priority as part of its scheduling algorithm.

2. osd-backfill-space.sh:
  Set osd_mclock_profile to high_recovery_ops and increase the wait
  for backfills timeout to 1200 secs for the following tests:
  - TEST_backfill_test_simple()
  - TEST_backfill_test_multi()
  - TEST_backfill_test_sametarget()
  - TEST_backfill_multi_partial()
  - TEST_ec_backfill_simple()
  - TEST_ec_backfill_multi()
  - SKIP_TEST_ec_backfill_multi_partial()
  - SKIP_TEST_ec_backfill_multi_partial()

3. osd-backfill-stats:
  - TEST_backfill_ec_down_all_out():
   Set osd_mclock_profile to high_recovery_ops and increase the wait
   for recovery timeout to 240 secs.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-30 18:16:00 +05:30
Sridhar Seshasayee
2c577040cb qa/standalone/osd: Modify osd tests for mclock scheduler
Modified test cases:
1. osd-recovery-prio.sh:
   Set osd_op_queue = wpq for all tests since mclock
   doesn't consider recovery priority as part of its
   scheduling algorithm.

2. osd-recovery-stats.sh:
   a. TEST_recovery_undersized():
     - Set osd_mclock_profile to high_recovery_ops profile.
     - Increase wait for recovery timeout to 300 secs.

3. osd-rep-recov-eio.sh:
   a. TEST_rep_backfill_unfound():
     - Set osd_mclock_profile to high_recovery_ops profile.
     - Increase wait for backfill_unfound to 360 secs.

4. repeer-on-acting-back.sh:
   a. TEST_repeer_on_down_act():
     - Set osd_mclock_profile to high_recovery_ops profile.
       (To improve the test duration)

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-30 18:16:00 +05:30
Sridhar Seshasayee
5a85a6a035 qa/standalone: Modify ceph-helpers.sh tests for mclock scheduler.
List of changes:

1. Remove the enforcement to use osd_op_queue=wpq when an osd is brought
   up in the following functions:
   - run_osd()
   - run_osd_filestore() and
   - activate_osd()

2. New functions:
   - get_op_scheduler() - Get the current osd_op_queue for an osd.

3. Modified test cases:
   - test_run_osd() - Add check for osd_max_backfill count.
     The mclock scheduler overrides the count to 1000.

4. New test cases:
   - test_activate_osd_after_mark_down()
   - test_get_op_scheduler()

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-30 18:16:00 +05:30
Neha Ojha
2c528248df
Merge pull request #42410 from ronen-fr/wip-ronenf-standalone-repair
qa/standalone: fixing the timings when waiting for deep-scrub to start

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-21 06:57:41 -07:00
Ronen Friedman
ed45acee34 qa/standalone: fixing the timings when waiting for deep-scrub to start
initiate_and_fetch_state() initiates a scrub, then polls the published
PG state looking for 'scrubbing'. Calling flush_pg_stats() as part of
the polling process might cause the scrub and the following recovery to
be missed altogether.

Note: this polling mechanism is definitely not robust. Will be
redesigned in the future.

Fixes: https://tracker.ceph.com/issues/51581
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-07-20 08:57:37 +03:00
Sage Weil
01c006c2de Merge PR #42041 into master
* refs/pull/42041/head:
	mgr/restful: ignore min/max_size
	test/crush: drop min/max_size refs
	qa/workunits/mon/pool_ops: remove test for min/max_size check
	qa: scrub a few remaining mentions of ruleset
	qa/standalone/mon/osd-*: fix tests
	PendingReleaseNotes: note min/max_size removal
	mgr/dashboard: remove max/min_size and ruleset
	mon/OSDMonitor: fix calls to CrushTester
	crush: eliminate min_size and max_size
	test/cli/crushtool: reunumber rulesets in test maps
	crushtool: require min/max or num-rep for --test
	crush: remove last traces of 'ruleset'
	test/cli/crushtool: use 'id' instead of 'ruleset' in crush inputs
	crushtool: take --min-rep and --max-rep explicitly
	crush/CrushTester: drop --ruleset
	doc: scrub 'ruleset' from docs
	src/erasure-code: rule, not ruleset
	mon/OSDMonitor: remove check_crush_rule() callers
	mon/OSDMonitor: rule, not ruleset
	crushtool: remove check for overlapped ruels
	crush/CrushWrapper: get_osd_pool_default_crush_replicated_ruleset -> rule
	crush: remove find_rule()
	mon/OSDMonitor: use pool's crush rule directly
	osd/OSDMap: drop checks for ruleset == ruleid
	osd/OSDMap: use pool's crush rule_id directly
	mon/PGMap: use pool's crush_rule directly
	mon/OSDMonitor: remove crush ruleset->rule rewrite

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
2021-07-14 14:38:59 -04:00
Sridhar Seshasayee
84cab65e3a qa/standalone: Add missing teardowns to a subset of osd-scrub-repair tests
Tests identified with missing teardown within osd-scrub-repair.sh:
  1. TEST_periodic_scrub_replicated()
  2. TEST_scrub_warning()
  3. TEST_request_scrub_priority()

Centralize setup and teardown within the run() function for all the tests.

Fixes: https://tracker.ceph.com/issues/51580
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-08 13:31:42 +05:30
Sridhar Seshasayee
a96c34f0ee qa/standalone: Add missing teardowns to a subset of osd tests
The following files and tests in them did not teardown the
cluster after a test completed.

1. osd/osd-force-create.sh
2. osd/osd-reuse-id.sh
3. osd/pg-split-merge.sh

This wouldn't cause issues if the tests are run individually. But when
running all the tests in the files mentioned above, it could introduce
unexpected test failures down the line. For e.g., multiple tests may
create pools with same name and if they are not cleaned up properly, this
could result in unexpected failures in a subsequent test.

Fixes: https://tracker.ceph.com/issues/51580
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-07-08 13:28:31 +05:30
Sage Weil
4fc7c3093c qa/standalone/mon/osd-*: fix tests
Signed-off-by: Sage Weil <sage@newdream.net>
2021-07-07 10:31:57 -04:00
Patrick Donnelly
d6c66f3fa6
qa,pybind/mgr: allow disabling .mgr pool
This is mostly for testing: a lot of tests assume that there are no
existing pools. These tests relied on a config to turn off creating the
"device_health_metrics" pool which generally exists for any new Ceph
cluster. It would be better to make these tests tolerant of the new .mgr
pool but clearly there's a lot of these. So just convert the config to
make it work.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
2021-06-11 19:35:17 -07:00
Sridhar Seshasayee
94826eaadc qa/standalone: Use osd op queue = wpq in activate_osd()
This change is a follow-up to commit
b6e9c0903d5ad9a699b675f9fa7739e9cce9a5f3 that set the scheduler to wpq in
run_osd() and run_osd_filestore(). In addition, activate_osd() too has to
set the scheduler type to 'wpq' in order to be consistent and avoid test
failures.

The above is a temporary measure until all the standalone tests are
modified to run well with the mclock_scheduler.

Fixes: https://tracker.ceph.com/issues/51074
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
2021-06-09 15:02:58 +05:30
Ronen Friedman
d6eb3e3a3c test: recovery_scrub: do not display 'repair' status on auto-repair deep-scrub
A new test: auto_repair_bluestore_tag.

Based on auto_repair_bluestore_basic. Sets auto-repair, starts a periodic
deep-scrub, then verifies that the PG state, while scrubbing, is 'scrubbing+deep'
and not 'scrubbing+deep+repair'.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-05-18 17:43:28 +03:00
Neha Ojha
b6e9c0903d qa/standalone: use osd op queue = wpq
mclock_scheduler is now the default and some of these tests need to be modified
to run well with it. Continue using wpq until
https://tracker.ceph.com/issues/50574 is addressed.

Signed-off-by: Neha Ojha <nojha@redhat.com>
2021-05-06 17:54:38 +00:00
Loïc Dachary
7fe0ac7c11 qa: verify the benefits of mempool cacheline optimization
There already is a test to verify the mempool sharding works, in the sense that
it uses at least half of the variables available to count the number of
allocated objects and their total size. This new test verifies that, with
sharding, object counting is at least twice faster than without sharding. It
also collects cacheline contention data with the perf c2c tool. The manual
analysis of this data shows the optimization gain is indeed related to cacheline
contention.

Fixes: https://tracker.ceph.com/issues/49896

Signed-off-by: Loïc Dachary <loic@dachary.org>
2021-04-30 12:11:13 +08:00
Ilya Dryomov
7eb9c5ddb2 Merge branch 'master' into wip-unauthorized-gids
Sync up with master up to commit 3d8e73b266 ("Merge pull request
#40731 from tchaikov/wip-yamlize-options").  Specifically, bring in
src/common/options.cc yamlization and move new auth-related options
into src/common/options/global.yaml.in.

Conflicts:
	src/common/options.cc
	src/common/options/global.yaml.in

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-13 15:42:06 +02:00
Sage Weil
dcd90a1c8d Merge PR #40626 into master
* refs/pull/40626/head:
	qa/suites/rados/objectstore: separate store_test tests
	qa/standalone: split osd/ into 2 directories

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2021-04-12 22:38:49 -04:00
Sage Weil
0f65e5cffa qa/standalone: split osd/ into 2 directories
The whole osd/ directory takes 3 hours to run.  Of that, about half is
osd-backfill*:

2021-04-05T20:38:55.932 INFO:tasks.workunit:Running workunit osd/osd-backfill-prio.sh...
2021-04-05T20:47:27.184 INFO:tasks.workunit:Running workunit osd/osd-backfill-recovery-log.sh...
2021-04-05T20:55:59.497 INFO:tasks.workunit:Running workunit osd/osd-backfill-space.sh...
2021-04-05T21:48:47.549 INFO:tasks.workunit:Running workunit osd/osd-backfill-stats.sh...
2021-04-05T22:17:09.197 INFO:tasks.workunit:Running workunit osd/osd-bench.sh...

Signed-off-by: Sage Weil <sage@newdream.net>
2021-04-12 09:59:17 -05:00
Ronen Friedman
b8045f7b18 Revert "test: Add test for scrub parallelism"
This reverts commit dd63577ab3.

As 08c3ede084 (the tested functionality) is reverted.
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
2021-04-07 08:37:03 +03:00
Sage Weil
72c4fc75ad qa/standalone: default to disable insecure global id reclaim
Signed-off-by: Sage Weil <sage@newdream.net>
2021-04-06 17:29:23 -04:00
Prashant D
92eb39ee6f crush/CrushCompiler: print weight with uniform precision
Fixes: https://tracker.ceph.com/issues/48508

Signed-off-by: Prashant D <pdhange@redhat.com>
2021-03-29 14:44:49 +11:00
David Zafman
eec821b6e5 test: osd-recovery-scrub.sh: Test fails if no scrubs happened for a recovering pg
Change TEST_recovery_scrub_2 to create more objects and use
osd_recovery_sleep to prevent recovery from finihing before
we start to scrub.  Verify that at least 1 scrub was started
while the pg was reovering.

Fixes: https://tracker.ceph.com/issues/49779

Signed-off-by: David Zafman <dzafman@redhat.com>
2021-03-14 16:19:46 -07:00
David Zafman
a4fd1d650e Revert "qa/standalone/scrub/osd-recovery-scrub: fix unnoticed recovery state"
This reverts commit 1323bdb839.

The tests needs to scrub while recovery is in progress, so catching
recovery from the logs after the fact isn't the proper setup.
We can use osd_recovery_sleep config.

Signed-off-by: David Zafman <dzafman@redhat.com>
2021-03-13 11:40:55 -08:00
David Zafman
dd63577ab3 test: Add test for scrub parallelism
Signed-off-by: David Zafman <dzafman@redhat.com>
2021-03-05 11:41:26 -08:00
Sage Weil
5e197a21e6 Merge PR #39455 into master
* refs/pull/39455/head:
	doc/man/8/ceph: document --max option
	src/test/osd/safe-to-destroy: adjust test
	ceph: print command output to stdout even on error
	mgr/DaemonServer: include details in 'osd ok-to-stop' output
	mgr: add --max <n> to 'osd ok-to-stop' command
	mgr: relax osd ok-to-stop condition on degraded pgs

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
2021-02-27 10:15:27 -05:00
Sage Weil
33dee7d7bf crush/CrushWrapper: update shadow trees on update_item()
insert_item() already does this, but update_item did not.

Fixes: https://tracker.ceph.com/issues/48065
Signed-off-by: Sage Weil <sage@newdream.net>
2021-02-22 14:21:04 -06:00
Sage Weil
722f57dee1 mgr: add --max <n> to 'osd ok-to-stop' command
Given and initial (set of) osd(s), if provide up to N OSDs that can be
stopped together without making PGs become unavailable.

This can be used to quickly identify large(r) batches of OSDs that can be
stopped together to (for example) upgrade.

Signed-off-by: Sage Weil <sage@newdream.net>
2021-02-20 09:53:51 -05:00
Kefu Chai
8dc097ff46 qa/standalone/mon/misc: verify that len(monmap.features.persistent) == 9
in beb62c029a, FEATURE_QUINCY was added to
ceph::features::mon::get_persistent(), so update the test accordingly.

Signed-off-by: Kefu Chai <kchai@redhat.com>
2021-01-30 22:45:20 +08:00
Sage Weil
7bbc92eda3 mon: updates for quincy
Signed-off-by: Sage Weil <sage@newdream.net>
2021-01-28 13:29:28 -06:00
Neha Ojha
5c11f40c12
Merge pull request #38856 from dzafman/wip-48789
test: Fix osd-scrub-scaps.sh to handle DB format change

Reviewed-by: Ronen Friedman <rfriedma@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
2021-01-15 16:27:59 -08:00
Neha Ojha
6fc9166af4
Merge pull request #38726 from ronen-fr/wip-ronenf-48720
qa/standalone/scrub/osd-recovery-scrub: handle primary change when waiting for scrub

Reviewed-by: David Zafman <dzafman@redhat.com>
2021-01-15 13:46:30 -08:00