Modified test cases:
1. ver-health.sh:
a. TEST_check_version_health_1():
To avoid intermittent timeouts observed in wait_for_health_string(),
increase the wait time to 20 secs.
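The change amounts to widening the polling window. A minimal sketch
of the pattern involved, assuming a simple retry helper (the body
below is illustrative, not the test's actual code, and the health
code is assumed from the test's subject):

    # Poll 'ceph health detail' for an expected string, for up to a
    # configurable number of seconds (default 20).
    wait_for_health_string() {
        local want=$1 timeout=${2:-20}
        local i
        for ((i = 0; i < timeout; i++)); do
            ceph health detail | grep -q "$want" && return 0
            sleep 1
        done
        return 1
    }

    # e.g. allow up to 20 secs for the warning to surface:
    wait_for_health_string "DAEMON_OLD_VERSION" 20 || return 1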
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The following tests use the "osd_scrub_sleep" option to introduce
delays during scrubbing, which helps determine scrub states, validate
reservations during scrubbing, and so on. This works with the "wpq"
scheduler.
But when the "mclock_scheduler" is enabled, "osd_scrub_sleep" is
disabled and overridden to 0. This is done to delegate the scheduling
of background scrubs to the "mclock_scheduler" based on the configured
QoS parameters. As a result, the checks that verify scrub states,
reservations, etc. fail, because scrubs complete so quickly that the
window in which to observe them is very short. This affects a small
subset of scrub tests, listed below:
1. osd-scrub-dump.sh -> TEST_recover_unexpected()
2. osd-scrub-repair.sh -> TEST_auto_repair_bluestore_tag()
3. osd-scrub-test.sh -> TEST_scrub_abort(), TEST_deep_scrub_abort()
For the above tests only, the "osd_op_queue" config option is set to
"wpq" (as sketched below) until there is a reliable way to query scrub
states with "--osd-scrub-sleep" set to 0.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Modified test cases:
1. test-erasure-eio.sh:
a. TEST_ec_backfill_unfound():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase the wait for backfill_unfound timeout to 240 secs.
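The profile switch itself is a single config knob. A hedged example
(the exact invocation in the test may differ):

    # Favor recovery/backfill traffic in the mclock scheduler:
    ceph config set osd osd_mclock_profile high_recovery_ops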
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Modified test cases:
1. osd-backfill-prio.sh:
Set osd_op_queue = wpq for all tests, since the mclock scheduler
doesn't consider recovery priority as part of its scheduling
algorithm.
2. osd-backfill-space.sh:
Set osd_mclock_profile to high_recovery_ops and increase the wait
for backfills timeout to 1200 secs (see the sketch after this list)
for the following tests:
- TEST_backfill_test_simple()
- TEST_backfill_test_multi()
- TEST_backfill_test_sametarget()
- TEST_backfill_multi_partial()
- TEST_ec_backfill_simple()
- TEST_ec_backfill_multi()
- SKIP_TEST_ec_backfill_multi_partial()
3. osd-backfill-stats.sh:
- TEST_backfill_ec_down_all_out():
Set osd_mclock_profile to high_recovery_ops and increase the wait
for recovery timeout to 240 secs.
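A sketch of what the widened backfill wait amounts to; the loop shape
is illustrative and the suite's own helper differs in detail:

    # Poll 'ceph pg stat' until no PG reports a backfill state,
    # for up to 1200 secs.
    wait_for_no_backfill() {
        local timeout=${1:-1200}
        local i
        for ((i = 0; i < timeout; i += 5)); do
            ceph pg stat | grep -q 'backfill' || return 0
            sleep 5
        done
        return 1
    }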
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Modified test cases:
1. osd-recovery-prio.sh:
Set osd_op_queue = wpq for all tests, since the mclock scheduler
doesn't consider recovery priority as part of its scheduling
algorithm.
2. osd-recovery-stats.sh:
a. TEST_recovery_undersized():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase wait for recovery timeout to 300 secs.
3. osd-rep-recov-eio.sh:
a. TEST_rep_backfill_unfound():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase wait for backfill_unfound to 360 secs.
4. repeer-on-acting-back.sh:
a. TEST_repeer_on_down_act():
- Set osd_mclock_profile to high_recovery_ops profile.
(to reduce the test duration)
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
List of changes:
1. Remove the enforced use of osd_op_queue=wpq when an OSD is brought
up in the following functions:
- run_osd()
- run_osd_filestore()
- activate_osd()
2. New functions:
- get_op_scheduler() - Get the current osd_op_queue for an OSD
(a sketch follows below).
3. Modified test cases:
- test_run_osd() - Add a check for the osd_max_backfills count.
The mclock scheduler overrides the count to 1000.
4. New test cases:
- test_activate_osd_after_mark_down()
- test_get_op_scheduler()
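One plausible shape for the new helper, reading the live value over
the admin socket (illustrative; the real helper may obtain it
differently):

    function get_op_scheduler() {
        local id=$1
        # Prints e.g. "mclock_scheduler" or "wpq".
        ceph daemon osd.$id config get osd_op_queue | jq -r '.osd_op_queue'
    }

    # Usage, alongside the osd_max_backfills check noted above:
    test "$(get_op_scheduler 0)" = "mclock_scheduler" || return 1
    ceph daemon osd.0 config get osd_max_backfills  # expect 1000 under mclock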
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
* refs/pull/42349/head:
mon/MDSMonitor: propose if FSMap struct_v is too old
mon/MDSMonitor: give a proper error message if FSMap struct_v is too old
mds/FSMap: use DECODE_OLDEST to gate FSMap version
qa: add tests for fs dump of epoch and trimming
qa: add file system support for dumping epoch
mon/MDSMonitor: return mon_mds_force_trim_to even if equal to current epoch
mon: add debugging for trimming methods
mon: fix debug spacing
qa: add nofs upgrade suite
Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Ramana Raja <rraja@redhat.com>
* refs/pull/41025/head:
qa: wait for pgs to be clean before using the pools
qa: ignore PG_RECOVERY_FULL and PG_DEGRADED for mds-full
qa: wait more time since there are many more pgs than before
qa: do not multiply the full ratio twice
qa: do not raise for kclient for _fsync test
qa: use the pg autoscale mode to calculate the pg_num
qa: set the object_size to 1M
qa: move is_full() to the parent class
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
This adds an upgrade suite to ensure that a Ceph cluster without a
CephFS file system does not blow up on upgrade (in particular, that the
MDSMonitor does not trip). This was developed to potentially reproduce
tracker 51673, but the actual cause of that issue was an old encoding
for the MDSMap which was obsoleted in Pacific. You must create a cluster
older than the FSMap (~Hammer or Infernalis) to reproduce. In any case,
this upgrade suite may be useful in the future so let's keep it!
Related-to: https://tracker.ceph.com/issues/51673
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
The cluster has already multiplied in the full ratio before returning
"max_avail".
Fixes: https://tracker.ceph.com/issues/50984
Signed-off-by: Xiubo Li <xiubli@redhat.com>
For kclient, the -ENOSPC error is returned by write() instead of
fsync().
Fixes: https://tracker.ceph.com/issues/45434
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Setting pg_num to 8 is too small: some OSDs may not be covered by the
pools, while others may be overloaded. Remove the hardcoded pg_num
here and let the pg autoscale mode calculate it as needed, while
setting pg_num_min to 64 to keep pg_num from becoming too small.
If an EC pool is used, most of the data for these test cases will go
to the EC pool, and the primary replicated pool will only store a
small amount of metadata for all the files, so setting the target
size ratio to 0.05 should be enough. (See the pool commands sketched
below.)
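The knobs described above map onto standard pool commands, roughly as
follows (pool name illustrative):

    ceph osd pool set cephfs_data pg_autoscale_mode on
    ceph osd pool set cephfs_data pg_num_min 64
    ceph osd pool set cephfs_data target_size_ratio 0.05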
Fixes: https://tracker.ceph.com/issues/45434
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Set the object_size to 1MB to make the objects distributed more
evenly among the OSDs.
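One way to pin a 1MB object size on CephFS is the file layout xattr;
the mount path here is illustrative, and the file must still be empty
when the layout is set:

    setfattr -n ceph.file.layout.object_size -v 1048576 /mnt/cephfs/testfile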
Fixes: https://tracker.ceph.com/issues/45434
Signed-off-by: Xiubo Li <xiubli@redhat.com>
These overrides are standard for all configurations. The config to
enable fragmentation was also removed long ago.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
* refs/pull/42431/head:
cmake: add "mypy" back to tox envlist of "qa""
qa/tasks/vstart_runner: add optional "sudo" param to _run_python()
Reviewed-by: Sebastian Wagner <swagner@suse.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
to silence mypy warnings like:
tasks/vstart_runner.py:691: error: Definition of "_run_python" in base class "LocalCephFSMount" is incompatible with definition in base class "CephFSMount"
tasks/vstart_runner.py:705: error: Definition of "_run_python" in base class "LocalCephFSMount" is incompatible with definition in base class "CephFSMount"
Signed-off-by: Kefu Chai <kchai@redhat.com>
Otherwise we have the following warning in the health report:
{"status":"HEALTH_WARN","checks":{"RECENT_MGR_MODULE_CRASH":{"severity":"HEALTH_WARN","summary":{"message":"1 mgr modules have recently crashed","count":1},"muted":false}},"mutes":[]}
The warning does not disappear even after the test waits for 30
seconds, and the tasks.mgr.test_module_selftest.TestModuleSelftest
test fails like:
2021-07-21T09:59:52.560 INFO:tasks.cephfs_test_runner:======================================================================
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-07-21T09:59:52.561 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/mgr/test_module_selftest.py", line 201, in test_module_commands
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner: self.wait_for_health_clear(timeout=30)
2021-07-21T09:59:52.562 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/ceph_test_case.py", line 172, in wait_for_health_clear
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner: self.wait_until_true(is_clear, timeout)
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner: File "/home/teuthworker/src/git.ceph.com_ceph-c_6a5d5abc027f706687dec92f92ff6fc6f074d2ae/qa/tasks/ceph_test_case.py", line 209, in wait_until_true
2021-07-21T09:59:52.563 INFO:tasks.cephfs_test_runner: raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2021-07-21T09:59:52.564 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
In this change, the crash reports are nuked right after we see the
warning, so that we get a clean health report.
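One way to clear the reports from the CLI (archiving silences the
RECENT_MGR_MODULE_CRASH warning); the test may do the equivalent
through its own helpers:

    ceph crash archive-all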
Fixes: https://tracker.ceph.com/issues/51743
Signed-off-by: Kefu Chai <kchai@redhat.com>
qa/standalone: fixing the timings when waiting for deep-scrub to start
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Sridhar Seshasayee <sseshasa@redhat.com>
qa/*/test_envlibrados_for_rocksdb.sh: remove OS specific configuration
Reviewed-by: David Galloway <dgallowa@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
This change partially reverts 81305b0da9; otherwise we have the
following errors:
tasks/vstart_runner.py:691: error: Definition of "_run_python" in base class "LocalCephFSMount" is incompatible with definition in base class "CephFSMount"
tasks/vstart_runner.py:705: error: Definition of "_run_python" in base class "LocalCephFSMount" is incompatible with definition in base class "CephFSMount"
Signed-off-by: Kefu Chai <kchai@redhat.com>