- remove upgrades from octopus
- stubs for completing upgrade to reef
Still missing the quincy-x upgrade tests.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Maintain the prefix_itr between calls to SnapMapper::get_next_objects_to_trim() to prevent searching depleted prefixes.
We have 8 distinct hash prefixes used for searching objects owned by a given PG.
On each call to SnapMapper::get_next_objects_to_trim() we start from the first prefix, even after all objects mapped to it have been depleted.
This means we search 1 depleted prefix after the first prefix is exhausted, 2 after the first two are exhausted, and so on, up to 7 depleted prefixes after the first 7 are exhausted.
This is a performance improvement PR only!
It maintains the existing behavior and does not try to fix/change any of the TRIM logic.
I added an extra step after the last object is trimmed: a full scan of the DB is performed, and ENOENT is returned only if no object is found.
This should make the new code no worse than the existing code, which returns ENOENT after a full scan finds no object.
It should not impact performance with real-life snaps, as the full scan happens only once per snap.
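Illustratively, the resume-and-rescan idea looks roughly like the Python sketch below (the real implementation is C++ inside SnapMapper; the class, field, and helper names here are hypothetical, not the actual API):

```python
# Illustrative sketch only -- hypothetical names, not the SnapMapper API.
class ObjectLister:
    def __init__(self, prefixes, db):
        self.prefixes = list(prefixes)  # e.g. the 8 hash prefixes owned by this PG
        self.db = db                    # prefix -> list of objects still mapped to the snap
        self.prefix_itr = 0             # kept between calls instead of resetting to 0

    def get_next_objects_to_trim(self, max_objs):
        out = []
        # Resume from the prefix where the previous call stopped; prefixes that
        # were already depleted are never rescanned.
        while self.prefix_itr < len(self.prefixes) and len(out) < max_objs:
            objs = self.db.get(self.prefixes[self.prefix_itr], [])
            if objs:
                out.extend(objs[:max_objs - len(out)])
                break                   # stay on this prefix until it is depleted
            self.prefix_itr += 1        # depleted prefix, skip it from now on
        if out:
            return out
        # Extra step once prefix_itr runs past the last prefix: do one full scan
        # and report ENOENT only if the scan really finds nothing.
        for i, prefix in enumerate(self.prefixes):
            if self.db.get(prefix):
                self.prefix_itr = i
                return self.db[prefix][:max_objs]
        raise FileNotFoundError('ENOENT: no objects left to trim for this snap')
```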
Added snap-mapper tests to the rados test suite.
Disabled osd_debug_trim_objects when running (SnapMapperTest, prefix_itr) to prevent asserts, as this code does illegal inserts into DELETED snaps.
Code beautifying
Disabled the assert, as there is a corner case when we retrieve the last valid object(s) in a snap:
The prefix_itr is advanced past the last valid value (as we completed a full scan).
If the OSD calls get_next_objects_to_trim() before the retrieved object(s) have been processed and removed from the SnapMapper DB, they won't be found by the next call (as the prefix_itr is invalid).
The object will then be found in the second pass, which makes it seem as if it was added after the trim started (which is illegal) and triggers an ASSERT.
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
RGW daemons register in the servicemap by gid, which allows multiple radosgw instances to share an auth key/identity. The daemon name is sent as part of the metadata (84c265238b).
All other daemons register by the daemon name, and the manager stores all daemon state information with the daemon name as the key. The 'config show' command looks up the daemon_state map using the daemon name the user provides as the key (for example: 'osd.0', 'client.rgw', 'mon.a').
Due to the change in RGW daemon registration, the key used for storing daemon state has become rgw.gid, and 'config show client.rgw' no longer works.
This change goes through the daemon metadata to look for the RGW daemon name when a user enters the 'config show' command for an RGW daemon. Once the correct daemon is found, we retrieve the corresponding daemon key (rgw.gid) and use that to query the daemon_state map.
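Roughly, the lookup amounts to something like the sketch below (illustrative Python only, not the actual mgr code; the helper name and the metadata field used here are assumptions):

```python
# Hypothetical helper: map a user-supplied RGW daemon name (e.g. 'client.rgw.foo')
# to the daemon_state key ('rgw.<gid>') by scanning each RGW daemon's metadata.
def resolve_rgw_key(daemon_state, requested_name):
    for key, state in daemon_state.items():       # keys look like 'rgw.<gid>'
        if not key.startswith('rgw.'):
            continue
        daemon_id = state.get('metadata', {}).get('id', '')  # name sent at registration
        if requested_name in (daemon_id, 'client.' + daemon_id):
            return key
    return None

# Usage: key = resolve_rgw_key(daemon_state, 'client.rgw.foo')
# and then query daemon_state[key] as before to serve 'config show'.
```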
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2011756
Signed-off-by: Aishwarya Mathuria <amathuri@redhat.com>
"unit_test_scan" optional config uses Remote.run_unit_test()
to scan xml files generated by unit tests to throw better failure
messages (for s3tests and gtests run by workunit)
It also creates "unit_test_summary.yaml" for more exception details
from xml files.
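As a rough illustration (not the actual teuthology code; the function name and XML-layout assumptions below are mine), such a scan could look like:

```python
# Sketch: collect failure/error details from junit-style XML files and dump a summary YAML.
import glob
import xml.etree.ElementTree as ET
import yaml

def scan_unit_test_xml(xml_dir, summary_path='unit_test_summary.yaml'):
    failures = []
    for path in glob.glob(f'{xml_dir}/*.xml'):
        root = ET.parse(path).getroot()
        for case in root.iter('testcase'):
            for node in case.findall('failure') + case.findall('error'):
                failures.append({
                    'file': path,
                    'test': case.get('name'),
                    'class': case.get('classname'),
                    'message': (node.get('message') or '').strip(),
                })
    with open(summary_path, 'w') as f:
        yaml.safe_dump({'failures': failures}, f)
    return failures
```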
Signed-off-by: Vallari Agrawal <val.agl002@gmail.com>
Maintain the prefix_itr between calls to SnapMapper::get_next_objects_to_trim() to prevent searching depleted prefixes.
We have 8 distinct hash prefixes used for searching objects owned by a given PG.
On each call to SnapMapper::get_next_objects_to_trim() we start from the first prefix, even after all objects mapped to it have been depleted.
This means we search 1 depleted prefix after the first prefix is exhausted, 2 after the first two are exhausted, and so on, up to 7 depleted prefixes after the first 7 are exhausted.
This is a performance improvement PR only!
It maintains the existing behavior and does not try to fix/change any of the TRIM logic.
I added an extra step after the last object is trimmed: a full scan of the DB is performed, and ENOENT is returned only if no object is found.
This should make the new code no worse than the existing code, which returns ENOENT after a full scan finds no object.
It should not impact performance with real-life snaps, as the full scan happens only once per snap.
Added snap-mapper tests to the rados test suite.
Disabled osd_debug_trim_objects when running (SnapMapperTest, prefix_itr) to prevent asserts, as this code does illegal inserts into DELETED snaps.
Code beautifying
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
Ceph status fails to report the pool application warning if
the pool is empty. Report the pool application warning
even if the pool has 0 objects stored in it.
Add POOL_APP_NOT_ENABLED cluster warnings to the log-ignorelist
to fix the rados suite.
Fixes: https://tracker.ceph.com/issues/57097
Signed-off-by: Prashant D <pdhange@redhat.com>
In rocksdb 7.0, all envlibrados files were moved to a separate repository (ref: https://github.com/facebook/rocksdb/pull/9206).
The new repo is temporary and serves as an example until it is decided where, and by whom, RADOS support will be hosted.
Since this new repo is outside of the rocksdb repo and in an uncertain state, we should remove support for it in the main
and Reef test suites. Quincy and below still use rocksdb 6.0, so the same does not apply.
Fixes: https://tracker.ceph.com/issues/59057
Signed-off-by: Laura Flores <lflores@redhat.com>
The rook team relies on a daily CI system to validate
rook changes. It doesn't seem that the teuthology tests
are maintained, so it makes sense to remove them from the
rados suite.
By removing this symlink, rook test coverage will remain
in the orch suite, and coverage will only be removed from the
rados suite.
Workaround for: https://tracker.ceph.com/issues/58585
Signed-off-by: Laura Flores <lflores@redhat.com>
Using the default pool size of 2 with random eio thrashing can cause
some of the objects to be marked as lost.
Fix the typo from 'osd default pool size: 3' to 'osd pool default size: 3'
so that we correctly get a pool size of 3.
Fixes: https://tracker.ceph.com/issues/49888
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
We have a few test suites that use 'override' in their yaml files
while the ceph.py task looks for 'overrides'; in that case
those config params do not take effect.
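A minimal illustration of why the settings are dropped (not the actual ceph.py code; the fragment below is made up):

```python
import yaml

fragment = yaml.safe_load("""
override:            # wrong key, should be 'overrides'
  ceph:
    conf:
      osd:
        osd max backfills: 3
""")

# The task reads the 'overrides' section, so the misspelled key is silently ignored.
print(fragment.get('overrides', {}))   # -> {}
```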
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
Separate `mon-stretch` from `mon`.
Renamed `mon-stretched-cluster.sh` to
`mon-stretch-fail-recovery.sh`.
Isolating the stretch cluster test will enable
developers to get results faster for stretch-cluster
related work.
Signed-off-by: Kamoltat <ksirivad@redhat.com>
mgr: Add one finisher thread per module
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Brad Hubbard <bhubbard@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Removing and changing all suites to no longer use filestore
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
ceph_volume: remove all filestore test suites
Since filestore has been removed, there is no need to test it.
Signed-off-by: Nitzan Mordechai <nmordech@redhat.com>
This test passes on centos and rhel, but fails on ubuntu due to an
invalid pointer. Since the envlibrados rocksdb tests are experimental
and don't have any actual users, we can just run them on rhel and
centos.
At the moment, the actual bug is not fully understood, but it was
decided that fixing it is low priority, and removing the test from
problematic distros is okay for the time being. This commit
is considered a workaround to the actual issue.
Related tracker: https://tracker.ceph.com/issues/57632
Signed-off-by: Laura Flores <lflores@redhat.com>
Some of the tests in Rados.sh can fail when testing the
watch_list return size if we hit the watch timeout.
Increase the watch timeout for the rados test.
Fixes: https://tracker.ceph.com/issues/47025
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
Set osd_mclock_override_recovery_settings option to true for tests that
modify recovery/backfill configuration options. This prevents logging of
the cluster warning when modifying recovery/backfill limits.
Fixes: https://tracker.ceph.com/issues/57529
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
- remove upgrades from octopus
- stubs for completing upgrade to reef
Still missing the quincy-x upgrade tests.
`c8e1f4c2b547a152e049af2b529bf415f6d76e59` has moved
the `thrash-old-clients` tests back to the rados suite.
This commit fixes the `release-checklists.rst` accordingly.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfills` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
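For reference, the suggested override roughly amounts to the following fragment in the pggrow/mapgap yaml (shown here parsed with Python; the exact value chosen per thrasher may differ):

```python
import yaml

fragment = yaml.safe_load("""
overrides:
  ceph:
    conf:
      osd:
        osd max backfills: 3
""")

# Anything > 1 lifts the default limit of one concurrent backfill per OSD.
assert fragment['overrides']['ceph']['conf']['osd']['osd max backfills'] > 1
```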
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
Currently, every rados run of ~400 jobs is running ~150 cephadm tests,
which is unnecessary and redundant. With this change, we will run some
basic cephadm tests within the rados suite. The following seems to be
a good start.
qa/suites/rados/cephadm/osds
qa/suites/rados/cephadm/smoke
qa/suites/rados/cephadm/smoke-singlehost
qa/suites/rados/cephadm/workunits
Signed-off-by: Neha Ojha <nojha@redhat.com>
Set and unset the noautoscale flag, and
evaluate whether the results are what
we expected. Also evaluate whether
the flag is correct when we
create new pools.
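A hedged sketch of the kind of check involved (not the test itself; the CLI calls follow the autoscaler flag commands as I understand them, and the pool name is made up):

```python
import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI; returns stdout as text.
    return subprocess.check_output(('ceph',) + args, text=True).strip()

ceph('osd', 'pool', 'set', 'noautoscale')          # set the global flag
ceph('osd', 'pool', 'create', 'test_noautoscale')  # pool created while the flag is set
mode = ceph('osd', 'pool', 'get', 'test_noautoscale', 'pg_autoscale_mode')
assert mode.endswith('off'), f'expected autoscaling off for the new pool, got: {mode}'
ceph('osd', 'pool', 'unset', 'noautoscale')        # restore the default
```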
Signed-off-by: Kamoltat <ksirivad@redhat.com>