This test passes on CentOS and RHEL, but fails on Ubuntu with an invalid
pointer error. Since the envlibrados RocksDB tests are experimental
and don't have any actual users, we can just run them on RHEL and
CentOS.
The actual bug is not yet fully understood, but fixing it was deemed
low priority, so removing the test from the problematic distro is
acceptable for the time being. This commit is a workaround rather than
a fix for the underlying issue.
Related tracker: https://tracker.ceph.com/issues/57632
Signed-off-by: Laura Flores <lflores@redhat.com>
Set the osd_mclock_override_recovery_settings option to true for tests that
modify recovery/backfill configuration options. This prevents the cluster
warning from being logged when the recovery/backfill limits are modified.
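For reference, a minimal sketch of the override, assuming the usual teuthology conf-override form (the exact placement in the suite files is not shown here):
```
overrides:
  ceph:
    conf:
      osd:
        osd_mclock_override_recovery_settings: true
```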
Fixes: https://tracker.ceph.com/issues/57529
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
- remove upgrades from octopus
- stubs for completing upgrade to reef
Still missing the quincy-x upgrade tests.
`c8e1f4c2b547a152e049af2b529bf415f6d76e59` has moved
the `thrash-old-clients` tests back to the rados suite.
This commit updates `release-checklists.rst` accordingly.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
What distinguishes pggrow and mapgap from all the non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the maximum number of backfill operations allowed to/from an OSD; the higher the number, the quicker the recovery. By default, this value is 1. Each of the non-offending thrashers overrides that default in its .yaml file with a value > 1; pggrow and mapgap do not.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfills` setting is causing some tests to time out in recovery.
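For illustration, the kind of per-thrasher override the non-offending thrashers already carry, and which this PR adds for pggrow and mapgap, looks roughly like this (the exact value shown is an assumption, not necessarily the one merged):
```
overrides:
  ceph:
    conf:
      osd:
        osd max backfills: 4
```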
WITHOUT an `osd max backfills` override, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
Currently, every rados run of ~400 jobs is running ~150 cephadm tests,
which is unnecessary and redundant. With this change, we will run some
basic cephadm tests within the rados suite. The following seem like
a good start:
qa/suites/rados/cephadm/osds
qa/suites/rados/cephadm/smoke
qa/suites/rados/cephadm/smoke-singlehost
qa/suites/rados/cephadm/workunits
Signed-off-by: Neha Ojha <nojha@redhat.com>
Set and unset the noautoscale flag and evaluate
whether the results are what we expect. Also
evaluate whether the flag behaves correctly when
we create new pools.
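A minimal sketch of the flow being tested, written as a teuthology exec task; the role name, pool name, and exact command forms are assumptions here (the real test lives in a Python task):
```
tasks:
- exec:
    mon.a:
      - ceph osd pool set noautoscale
      - ceph osd pool get noautoscale      # flag should report as set
      - ceph osd pool create test-pool     # a newly created pool should respect the flag
      - ceph osd pool unset noautoscale
      - ceph osd pool get noautoscale      # flag should report as unset
```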
Signed-off-by: Kamoltat <ksirivad@redhat.com>
After the https://github.com/ceph/ceph/pull/42526 and https://github.com/ceph/ceph/pull/43725 merges,
the following files no longer exist, but there were still references to them:
- src/pybind/mgr/dashboard/services/ganesha.py
- qa/tasks/mgr/dashboard/test_ganesha.py
The following files were renamed, but there were still references to their old names:
- src/pybind/mgr/dashboard/controllers/nfsganesha.py: nfsganesha.py --> nfs.py
- src/pybind/mgr/dashboard/tests/test_ganesha.py: test_ganesha.py --> test_nfs.py
Other changes in qa/suites/rados/dashboard/tasks/dashboard.yaml:
- Add missing task: tasks.mgr.dashboard.test_api
- Sort dashboard tasks alphabetically.
Fixes: https://tracker.ceph.com/issues/53123
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
I think we have enough coverage. Always testing all
objectstores is a bit excessive, in my opinion.
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
Add the host section of the cluster creation workflow.
1. Fix a bug in the modal where going forward one step in the wizard and coming back opens up the Add Host modal.
2. Rename Create Cluster to Expand Cluster, as per the discussions.
3. Add a skip confirmation modal to warn the user when they try to skip the
cluster creation.
4. Adapt all the tests.
5. Make some UI improvements, like fixing and aligning the styles and
colors.
- Use a routed modal for the host addition form.
- Rename Create to Add in the host form.
Fixes: https://tracker.ceph.com/issues/51517
Fixes: https://tracker.ceph.com/issues/51640
Fixes: https://tracker.ceph.com/issues/50336
Fixes: https://tracker.ceph.com/issues/50565
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
Signed-off-by: Nizamudeen A <nia@redhat.com>
This commit has been causing scheduled jobs to request e.g. aarch64
smithi machines, which don't exist. The dispatcher then tries to find
them forever and has to be killed and restarted; the queue will sit
idle until someone notices the problem.
Signed-off-by: Zack Cerza <zack@redhat.com>
modified: qa/standalone/erasure-code/test-erasure-code-plugins.sh
new file: qa/suites/rados/thrash-erasure-code-isa/arch/aarch64.yaml
Signed-off-by: Dai Zhiwei <daizhiwei3@huawei.com>
Force a subset of tests that explicitly employ the filestore backend to
use the WPQ scheduler. This is because the mclock scheduler will not be
optimized for filestore.
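A sketch of the override applied to those tests, assuming the standard teuthology conf-override form:
```
overrides:
  ceph:
    conf:
      osd:
        osd op queue: wpq
```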
Fixes: https://tracker.ceph.com/issues/52025
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
This is no longer required because we removed the cosbench workloads in
fd350fd015. Removing it is also needed to prevent failures like the
following, which can be triggered by any change that breaks the rgw task:
```
2021-08-06T20:13:25.812 INFO:teuthology.orchestra.run.smithi060.stderr:curl: (7) Failed to connect to smithi060.front.sepia.ceph.com port 80: Connection refused
2021-08-06T20:15:33.813 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_04c2febe7099917d97a71271f17abb5710030132/teuthology/contextutil.py", line 31, in nested
vars.append(enter())
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/rgw.py", line 191, in start_rgw
wait_for_radosgw(url, remote)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/util/rgw.py", line 94, in wait_for_radosgw
assert exit_status == 0
AssertionError
```
Signed-off-by: Neha Ojha <nojha@redhat.com>
Changes some of the tests in teuthology to make
them more deterministic.
Use `ceph osd set norecover` and
`ceph osd set nobackfill` when marking OSDs in
or out. This delays recovery and makes
sure the test cases get the chance to check
that events are actually popping up in
the progress module.
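As an illustration of the intended ordering, expressed as a teuthology exec task (the real checks live in tasks/mgr/test_progress.py; the role name and OSD id are assumptions):
```
tasks:
- exec:
    mon.a:
      - ceph osd set norecover
      - ceph osd set nobackfill
      - ceph osd out 0
      # recovery is held off at this point, so the progress module's
      # event for the marked-out OSD can be observed before it completes
      - ceph osd unset nobackfill
      - ceph osd unset norecover
```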
Took out test_osd_cannot_recover from
tasks/mgr/test_progress.py since it is no longer
a relevant test case: recovery will get
triggered regardless, even if the PG is unmoved.
Ignore `OSDMAP_FLAGS` in teuthology
because we are using norecover and nobackfill
to delay the recovery process; those flags
raise a health warning that would otherwise fail the
teuthology test.
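The ignore entry is a standard teuthology override; a minimal sketch:
```
overrides:
  ceph:
    log-ignorelist:
      - \(OSDMAP_FLAGS\)
```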
Signed-off-by: Kamoltat <ksirivad@redhat.com>
It's a regression introduced by the restructure of the test suites;
let's pin the test to CentOS 8.
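A sketch of the pin, assuming the usual teuthology distro keys (the exact version string is an assumption):
```
os_type: centos
os_version: '8'
```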
See-also: https://tracker.ceph.com/issues/49638
Signed-off-by: Kefu Chai <kchai@redhat.com>
This is mostly for testing: a lot of tests assume that there are no
existing pools. These tests relied on a config to turn off creating the
"device_health_metrics" pool, which generally exists for any new Ceph
cluster. It would be better to make these tests tolerant of the new .mgr
pool, but clearly there are a lot of them. So just convert the config to
make it work.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
"get_heap_property *" asock commands are exposed to operators
to check the tcmalloc internals for understanding the performance
of the memory subsystem. but crimson uses the builtin seastar allocator
which is not backed by tcmalloc. but we can dump the metrics using
the "dump_metrics" asock command which is only available from
crimson-osd.
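For example, the metrics could be fetched through the admin socket of a running crimson-osd; a sketch as a teuthology exec task (role and daemon name are assumptions):
```
tasks:
- exec:
    osd.0:
      - ceph daemon osd.0 dump_metrics
```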
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>