All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfill` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
This way, for downgrades to whatever versions
this lands in onward, having added new parameters to
UpgradeState shouldn't break anything. Can't do much
about downgrades to older versions from this one
but this should help in the future.
Signed-off-by: Adam King <adking@redhat.com>
This function was around 500 lines and difficult to work
with. Splitting it into sub functions should hopefully make
it a bit easier to understand and make changes to.
Signed-off-by: Adam King <adking@redhat.com>
Cmd2ArgparseError is available only cmd2 version 1.0.1 onwards. Before
that, SystemExit(2) is raised. This commit creates an empty class
Cmd2ArgparseError for earlier version so that similar error won't creep
up again.
Fixes: https://tracker.ceph.com/issues/55716
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Not doing so, sets the exit code to zero which is not desired in case of
a command failure.
Fixes: https://tracker.ceph.com/issues/55710
Signed-off-by: Rishabh Dave <ridave@redhat.com>
rgw/qa: enable s3-tests related to cloud-transition feature
Reviewed-by: casey Bodley <cbodley@redhat.com>
Reviewed-by: Maredia, Ali <amaredia@redhat.com>
Run cloudtier tests with parameter 'retain_head_object'
set to true and false.
However having multiple cloudtier storage classes in the same task
is increasing the transition time and resulting in spurious failures.
Hence until there is a consistent way of running the tests, without
having to depend on lc_debug_interval, disabled one of the config for
now.
Signed-off-by: Soumya Koduri <skoduri@redhat.com>