ceph/qa/suites/rados
Laura Flores 40062676c2 qa/suites/rados/thrash-erasure-code-big/thrashers: add osd max backfills setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.

What sets pggrow and mapgap apart from the non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override for `osd max backfills`, the maximum number of backfill operations allowed to or from a single OSD. The higher the number, the quicker the recovery. By default, this value is 1. Each of the non-offending thrashers overrides that default in its .yaml file with a value > 1; pggrow and mapgap do not.
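
For illustration, an override of this kind in a thrasher .yaml might look like the sketch below. The `overrides` / `ceph` / `conf` / `osd` nesting is the usual qa suite layout. The value shown is only a placeholder for "something > 1", not necessarily the value used in this change, and the rest of the thrasher file (log-ignorelist, tasks, and so on) is omitted.

```yaml
# Sketch only: an `osd max backfills` override in a thrasher .yaml.
overrides:
  ceph:
    conf:
      osd:
        # Placeholder value; the default is 1, and any value > 1
        # allows more concurrent backfills per OSD.
        osd max backfills: 2
```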

The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So when “debug_random” selects an op queue other than mclock, the low default for `osd max backfills` stays in effect, and this is causing some tests to time out in recovery.

WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/

WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/

I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/

Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
2022-05-19 18:29:00 -05:00
..
basic
cephadm qa/suites/rados: reduce the number of cephadm tests 2022-01-21 23:38:53 +00:00
dashboard Merge pull request #43987 from rhcs-dashboard/53123-dashboard-nfs-cleanup 2021-11-19 20:40:41 +01:00
mgr qa: fix or add missing .qa links 2022-02-03 10:08:30 -05:00
monthrash
multimon
objectstore qa: Use osd_op_queue=wpq for tests using filestore backend. 2021-09-02 18:15:54 +05:30
perf qa: fix or add missing .qa links 2022-02-03 10:08:30 -05:00
rest
singleton qa: Added workunit test for noautoscale flag 2021-12-22 21:42:28 +00:00
singleton-bluestore
singleton-nomsgr qa/suites/rados: add crushdiff test 2021-08-27 17:45:40 +03:00
standalone
thrash qa: fix or add missing .qa links 2022-02-03 10:08:30 -05:00
thrash-erasure-code
thrash-erasure-code-big qa/suites/rados/thrash-erasure-code-big/thrashers: add osd max backfills setting to mapgap and pggrow 2022-05-19 18:29:00 -05:00
thrash-erasure-code-isa Revert "qa: support isal ec test for aarch64" 2021-10-12 12:53:58 -06:00
thrash-erasure-code-overwrites
thrash-erasure-code-shec
thrash-old-clients qa/suites/rados/thrash-old-clients: remove centos_8.3_container_tools_3.0 2022-02-02 23:26:54 +00:00
upgrade
valgrind-leaks qa: fix or add missing .qa links 2022-02-03 10:08:30 -05:00
verify
.qa
rook