Test: osd-recovery-space.sh extends the wait time for "recovery toofull"
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
pg-split-merge uses the ceph daemon command to check the merge,
but it doesn't use the asok path, which causes the check to
return incorrect output. Change the command to use the asok path.
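An illustrative sketch of the change (the exact command the test runs may differ; get_asok_path() is the ceph-helpers.sh helper assumed here):

    # Addressing the daemon by name may not resolve to the standalone
    # cluster's admin socket:
    ceph daemon osd.0 status

    # Addressing the admin socket by its path queries the intended daemon:
    ceph daemon $(get_asok_path osd.0) status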
Fixes: https://tracker.ceph.com/issues/65737
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
The osd-recovery-space test involves writing objects and expecting to receive
the "toofull" flag.
If we don't wait long enough, we might check the "toofull" flag before all objects
have finished writing, in which case the "toofull" status has not been set yet.
The change extends the waiting time and also adds checks of the return code
from the status wait.
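A minimal sketch of the intended wait, inside a TEST_ function (hypothetical, not the exact test code):

    # Poll the pg states for "toofull" with a longer timeout, and let the
    # wait's failure fail the test instead of being silently ignored.
    timeout=300
    until ceph pg dump pgs 2>/dev/null | grep -q toofull ; do
        timeout=$((timeout - 1))
        test $timeout -gt 0 || return 1
        sleep 1
    done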
Fixes: https://tracker.ceph.com/issues/44510
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
This command will allow us to clear the OSD_TOO_MANY_REPAIRS alert
by setting the shard repair count to 0. This helps in cases where
the alert was a false positive, or where the underlying condition has
since cleared at the disk level. Often, zeroing out the repair count is
better than muting the alert or restarting the OSD.
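Assumed invocation of the new command (shown as a sketch):

    # Reset the repaired-reads counter on osd.0 so the
    # OSD_TOO_MANY_REPAIRS warning can clear without a restart or mute.
    ceph tell osd.0 clear_shards_repaired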
Fixes: https://tracker.ceph.com/issues/54182
Co-authored-by: David Zafman <dzafman@redhat.com>
Signed-off-by: Daniel Radjenovic <dradjenovic@digitalocean.com>
When creating a new pool, the current code picks the divergent osd from
the first pg in the "pg dump pgs" output; that pg can be in "unknown" status,
which means up_primary = -1, and that will fail the test.
We need to wait until the first pg is active+clean.
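A minimal sketch, assuming the wait_for_clean() helper from ceph-helpers.sh:

    # Let the pgs settle before reading up_primary from the dump, so the
    # first pg is active+clean rather than unknown.
    wait_for_clean || return 1
    ceph pg dump pgs | head -2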
Fixes: https://tracker.ceph.com/issues/56034
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
The qa tests are not client I/O centric and mostly focus on triggering
recovery/backfills and monitoring them for completion within a finite amount
of time. The same holds true for scrub operations.
Therefore, an mClock profile that optimizes background operations is a
better fit for qa related tests. The osd_mclock_profile is globally
overridden to the 'high_recovery_ops' profile for the Rados suite, as
it fits this requirement.
Also, many standalone tests expect recovery and scrub operations to
complete within a finite time. To ensure this, the osd_mclock_profile
option is set to 'high_recovery_ops' as part of the run_osd() function
in ceph-helpers.sh.
A subset of standalone tests explicitly used 'high_recovery_ops' profile.
Since the profile is now set as part of run_osd(), the earlier overrides
are redundant and therefore removed from the tests.
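For reference, the per-test override now folded into run_osd() looked roughly like this (illustrative sketch, not the exact test code):

    # Explicit per-OSD override, now redundant because run_osd() applies
    # the profile to every standalone OSD by default:
    run_osd $dir 0 --osd_mclock_profile=high_recovery_ops || return 1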
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Set the osd_mclock_override_recovery_settings option to true for tests that
modify recovery/backfill configuration options. This prevents the cluster
warning from being logged when modifying recovery/backfill limits.
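A sketch of the pattern (illustrative values; osd_max_backfills stands in for whatever limit a given test tweaks):

    # Allow the test to change mClock-capped recovery/backfill limits
    # without the cluster logging a warning about the override.
    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 4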
Fixes: https://tracker.ceph.com/issues/57529
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Update the allocation file when we expand a device.
Add the expanded space to the allocator and then force an update to the allocation file.
There is also a new standalone test case for expansion.
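A rough sketch of the flow such a test exercises (assumed commands and paths; the real test may differ):

    # With the OSD stopped, grow the backing file and let BlueFS/BlueStore
    # pick up the extra space, which should now refresh the allocation file.
    truncate -s +2G $dir/0/block
    ceph-bluestore-tool --path $dir/0 bluefs-bdev-expand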
Fixes: https://tracker.ceph.com/issues/53699
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Modified test cases (see the sketch after this list):
1. osd-recovery-prio.sh:
Set osd_op_queue = wpq for all tests since mclock
doesn't consider recovery priority as part of its
scheduling algorithm.
2. osd-recovery-stats.sh:
a. TEST_recovery_undersized():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase wait for recovery timeout to 300 secs.
3. osd-rep-recov-eio.sh:
a. TEST_rep_backfill_unfound():
- Set osd_mclock_profile to high_recovery_ops profile.
- Increase wait for backfill_unfound to 360 secs.
4. repeer-on-acting-back.sh:
a. TEST_repeer_on_down_act():
- Set osd_mclock_profile to high_recovery_ops profile.
(To improve the test duration)
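Illustrative forms of the overrides referenced above (sketch; the actual tests may pass these differently):

    # osd-recovery-prio.sh: fall back to the WeightedPriorityQueue scheduler
    run_osd $dir 0 --osd_op_queue=wpq || return 1
    # other tests: keep mClock but favour background recovery work
    run_osd $dir 0 --osd_mclock_profile=high_recovery_ops || return 1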
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The following files, and the tests in them, did not tear down the
cluster after a test completed.
1. osd/osd-force-create.sh
2. osd/osd-reuse-id.sh
3. osd/pg-split-merge.sh
This wouldn't cause issues if the tests were run individually. But when
running all the tests in the files mentioned above, it could introduce
unexpected test failures down the line. For example, multiple tests may
create pools with the same name, and if they are not cleaned up properly, this
could result in unexpected failures in a subsequent test.
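One possible shape of the cleanup, assuming the setup()/teardown() helpers from ceph-helpers.sh (sketch only):

    function TEST_example() {
        local dir=$1
        setup $dir || return 1
        # ... test body: create pools, run osds, make assertions ...
        teardown $dir || return 1
    }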
Fixes: https://tracker.ceph.com/issues/51580
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Test that the osd doesn't crash when it gets a bad incremental osdmap.
Related-to: https://tracker.ceph.com/issues/46443
Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
- Include a test case
- Configurable by setting mon_osd_warn_num_repaired (default 10)
- Ignore the new health warning in the random eio injection test
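A sketch of adjusting the threshold (set globally here as an assumption about which daemon consumes it):

    # Raise the number of repaired reads an OSD may report before the
    # OSD_TOO_MANY_REPAIRS health warning is raised.
    ceph config set global mon_osd_warn_num_repaired 30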
Fixes: https://tracker.ceph.com/issues/41564
Signed-off-by: David Zafman <dzafman@redhat.com>
It is possible for the pg dump to not be the latest when we check for newprimary
in _common_test(). This is because mgr_stats_period is 5 seconds, and we may not
have fetched the latest stats just yet. This causes the test to look at the same
stats before and after wait_for_clean.
Fixes: https://tracker.ceph.com/issues/43807 (2)
Signed-off-by: Neha Ojha <nojha@redhat.com>
Mon might fail to share the newest map with any of the up osds, e.g.
due to an injected broken pipe. Since we don't have any client
activity during the osd-markdown tests, osds might be unaware of
the map changes made through the CLI. Make sure osds have pulled the
newest map down before we test their reaction.
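One possible way to check this from a test (sketch; assumes the get_asok_path() helper, jq, and that the osd status output reports the expected fields):

    # Wait (bounded) until osd.0 reports the cluster's latest osdmap epoch.
    latest=$(ceph osd dump -f json | jq .epoch)
    for i in $(seq 60) ; do
        seen=$(ceph daemon $(get_asok_path osd.0) status | jq .newest_map)
        test "$seen" -ge "$latest" && break
        sleep 1
    done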
Fixes: https://tracker.ceph.com/issues/44662
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Adds the option `mon_allow_pool_size_one`, which is disabled by default
to ensure pools are not configured without replicas.
If the user still wants to use pool size 1, they will have to change the
value of `mon_allow_pool_size_one` to true and then pass the flag
`--yes-i-really-mean-it` to the cli command:
Example:
`ceph osd pool set test size 1 --yes-i-really-mean-it`
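The full two-step flow then looks roughly like this (sketch; the option is set globally here for illustration):

    # 1. Opt in to single-replica pools cluster-wide.
    ceph config set global mon_allow_pool_size_one true
    # 2. Explicitly confirm the risky change on the pool itself.
    ceph osd pool set test size 1 --yes-i-really-mean-it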
Fixes: https://tracker.ceph.com/issues/44025
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
One of our customers wants to verify the data safety of Ceph while scaling
the cluster up, and the test case looks like:
- keep checking the status of a specified pg, whose up set is [1, 2, 3]
- add more osds: up [1, 2, 3] -> up [1, 4, 5], acting = [1, 2, 3], backfill_targets = [4, 5],
  pg is remapped
- stop osd.2: up [1, 4, 5], acting = [1, 3], backfill_targets = [4, 5], pg is undersized
- restart osd.2: acting will stay unchanged, as 2 belongs to neither the current up nor acting set,
  hence leaving the corresponding pg stuck undersized for a long time until all backfill
  targets complete
It does not pose any critical problem -- we'll end up getting that pg back into active + clean --
except that the long-lived DEGRADED warnings keep bothering our customer, who cares about data
safety more than anything else.
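A rough reproduction sketch using the standalone helpers (hypothetical; $pgid is the pg being watched, osd ids as in the scenario above):

    ceph pg $pgid query | jq '.state, .up, .acting'   # up [1,2,3], acting [1,2,3]
    # ... add osds 4 and 5 so the pg remaps: up becomes [1,4,5] ...
    kill_daemons $dir TERM osd.2 || return 1          # pg goes undersized
    activate_osd $dir 2 || return 1                   # osd.2 returns, but is not
                                                      # pulled back into acting
    ceph pg $pgid query | jq .state                   # stays undersized until
                                                      # backfill to 4,5 completes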
The right way to achieve the above goal is for:
boost::statechart::result PeeringState::Active::react(const MNotifyRec& notevt)
to check whether the newly booted node could be validly chosen for the acting set and
request a new temp mapping. The new temp mapping would then trigger a real interval change
that will get rid of the DEGRADED warning.
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: Yan Jun <yan.jun8@zte.com.cn>
To avoid confusion, fix the function names in osd-backfill-space.sh to reflect
how they actually work.
Fixes: https://tracker.ceph.com/issues/43592
Signed-off-by: David Zafman <dzafman@redhat.com>
Treat backfill_toofull as a warning condition because it can resolve itself.
Includes a test case for PG_BACKFILL_FULL
Includes a test case for recovery_toofull / PG_RECOVERY_FULL
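For instance, what the new test cases can assert (sketch):

    # With backfill/recovery blocked on a full OSD, the corresponding
    # warnings should appear in the health output.
    ceph health detail | grep PG_BACKFILL_FULL
    ceph health detail | grep PG_RECOVERY_FULL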
Fixes: https://tracker.ceph.com/issues/39555
Signed-off-by: David Zafman <dzafman@redhat.com>