Add a standalone test - test_activate_osd_skip_benchmark() in ceph-helpers.sh
that exercises the osd-mclock-skip-benchmark option.
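A rough sketch of how such a test might look, reusing the existing ceph-helpers
primitives (illustrative only; the committed test's setup and assertions may
differ):

    function test_activate_osd_skip_benchmark() {
        local dir=$1

        setup $dir || return 1
        run_mon $dir a || return 1
        run_mgr $dir x || return 1
        # bring the osd up with the mclock benchmark skipped
        run_osd $dir 0 --osd-mclock-skip-benchmark=true || return 1
        # the running daemon should report the option as enabled
        test "$(get_config osd 0 osd_mclock_skip_benchmark)" = "true" || return 1
        teardown $dir || return 1
    }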
Fixes: https://tracker.ceph.com/issues/52025
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
List of changes:
1. Remove the enforcement to use osd_op_queue=wpq when an osd is brought
up in the following functions:
- run_osd()
- run_osd_filestore() and
- activate_osd()
2. New functions:
- get_op_scheduler() - Get the current osd_op_queue for an osd (see the sketch after this list).
3. Modified test cases:
- test_run_osd() - Add a check for the osd_max_backfills count.
The mclock scheduler overrides the count to 1000.
4. New test cases:
- test_activate_osd_after_mark_down()
- test_get_op_scheduler()
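A minimal sketch of what get_op_scheduler() amounts to, built on the existing
get_config() helper (illustrative; the committed version may differ in detail):

    function get_op_scheduler() {
        local id=$1
        # report the op queue scheduler the osd is currently running with
        get_config osd $id osd_op_queue
    }

test_get_op_scheduler() can then assert on its output, e.g. by comparing it
against the expected scheduler name.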
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
This is mostly for testing: a lot of tests assume that there are no
existing pools. These tests relied on a config to turn off creating the
"device_health_metrics" pool, which generally exists for any new Ceph
cluster. It would be better to make these tests tolerant of the new .mgr
pool, but there are clearly a lot of them, so just convert the config to
keep them working.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This change is a follow-up to commit
b6e9c0903d5ad9a699b675f9fa7739e9cce9a5f3 that set the scheduler to wpq in
run_osd() and run_osd_filestore(). activate_osd() also has to set the
scheduler type to 'wpq' in order to be consistent and avoid test failures.
The above is a temporary measure until all the standalone tests are
modified to run well with the mclock_scheduler.
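Concretely, the change amounts to appending in activate_osd() the same
override that run_osd() and run_osd_filestore() already pass to ceph-osd
(hedged sketch, not the verbatim diff):

    # inside activate_osd(), alongside the other ceph-osd arguments
    ceph_args+=" --osd-op-queue=wpq"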
Fixes: https://tracker.ceph.com/issues/51074
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
mclock_scheduler is now the default and some of these tests need to be modified
to run well with it. Continue using wpq until
https://tracker.ceph.com/issues/50574 is addressed.
Signed-off-by: Neha Ojha <nojha@redhat.com>
While the relevant comment says:
'# Execute the command and prepend the output with its pid'
the actual PID logged is the same for all background processes,
which isn't very helpful.
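The underlying bash behaviour: `$$` keeps the parent shell's PID even inside a
background subshell, while `$BASHPID` expands to the subshell's own PID, which
is the value worth logging. A minimal illustration (not the actual helper code):

    ( echo "\$\$ = $$   \$BASHPID = $BASHPID" ) &
    ( echo "\$\$ = $$   \$BASHPID = $BASHPID" ) &
    wait
    # both lines print the same $$ but different $BASHPID values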
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Adds option `mon_allow_pool_size_one` which will be disabled by default
to ensure pools are not configured without replicas.
If the user still wants to use pool size 1, they will have to change the
value of `mon_allow_pool_size_one` to true and then pass the
`--yes-i-really-mean-it` flag to the CLI command:
Example:
`ceph osd pool set test size 1 --yes-i-really-mean-it`
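For completeness, turning the option on first might look like this (hedged;
targeting the mons via `ceph config set`):

    ceph config set mon mon_allow_pool_size_one true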
Fixes: https://tracker.ceph.com/issues/44025
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
address the regression introduced by e62cfceb
in e62cfceb, we wanted to test the newly introduced TOO_FEW_OSDS
warning, so we increased the number of OSDs to the size of the pool, so
that the monitor sends a warning message whenever the number of OSDs is
less than the pool size.
but we need to bring all of the OSDs back if we are expecting a healthy
cluster. in this change, all OSDs are resurrected before
`wait_for_health_ok`.
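A hedged sketch of the idea, using the activate_osd() helper (variable names
are illustrative, not the exact test diff):

    # bring every osd back up before expecting a healthy cluster
    for id in $(seq 0 $(expr $OSDS - 1)); do
        activate_osd $dir $id || return 1
    done
    wait_for_health_ok || return 1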
Signed-off-by: Kefu Chai <kchai@redhat.com>
Case 1: A more recent update exists
Case 2: The first entry in the divergent sequence is a create
Case 3: NOT TESTED - Object currently missing
Case 4: We can rollback all of the entries
Case 5: We cannot rollback at least 1 of the entries
Support starting OSDs even when "noup" is set (don't wait for up).
Move create_ec_pool() to ceph-helpers.sh
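For reference, create_ec_pool() wraps the usual erasure-coded pool setup,
roughly this kind of sequence (the profile name and k/m values are
illustrative):

    ceph osd erasure-code-profile set myprofile k=2 m=1 crush-failure-domain=osd
    ceph osd pool create ecpool 8 8 erasure myprofile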
Fixes: https://tracker.ceph.com/issues/39162
Signed-off-by: David Zafman <dzafman@redhat.com>
Change run_osd() to default to the bluestore objectstore
Use run_osd_filestore() to use the non-default objectstore
Fix inject_eio to handle any objectstore when the config is prefixed with the
objectstore type
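Usage under the new defaults looks like this (illustrative sketch):

    run_osd $dir 0 || return 1             # bluestore, the default objectstore
    run_osd_filestore $dir 1 || return 1   # explicitly use filestore instead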
Remaining tests using filestore:
- osd-pool-create.sh TEST_pool_create_rep_expected_num_objects
  Test filestore directory creation
- qa/standalone/osd/osd-dup.sh TEST_filestore_to_bluestore
  Obvious
- qa/standalone/osd/osd-rep-recov-eio.sh TEST_rep_read_unfound
  Requires data digest in object info
- qa/standalone/scrub/osd-scrub-repair.sh multiple tests
  Erasure code pools append mode for filestore is tested
- qa/standalone/special/ceph_objectstore_tool.py
  Test code verifies COT by directly examining filestore contents
Fixes: https://tracker.ceph.com/issues/39162
Signed-off-by: David Zafman <dzafman@redhat.com>
'rbd pool init' now does IO. Drop the pool, or change the pool size to 1.
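Either approach looks roughly like this in the affected tests (hedged example
using the default rbd pool name):

    ceph osd pool rm rbd rbd --yes-i-really-really-mean-it
    # or
    ceph osd pool set rbd size 1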
Fixes: http://tracker.ceph.com/issues/38585
Signed-off-by: Sage Weil <sage@redhat.com>
Stopping the osd daemon won't reliably get you HEALTH_WARN or ERR; you have
to make sure it is also marked down.
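In test terms that means something along these lines (hedged; the exact
helpers and health check vary per test):

    kill_daemons $dir TERM osd.0 || return 1   # stop the daemon ...
    ceph osd down 0                            # ... and mark it down explicitly
    wait_for_health "HEALTH_WARN" || return 1  # only now is the state reliable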
Signed-off-by: Sage Weil <sage@redhat.com>
The awk code uses some constructs that the native FreeBSD awk does not
support, like: BEGIN{print 0 < 90}
And TESTDIR is not set when calling ceph-helpers from smoke.sh.
So fix this by keeping the archive in /tmp.
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
Callers of get_python_path were not passing in a $1 parameter, so
ceph_lib was an empty string, resulting in an invalid path to the built
cython modules. Assume this is called from the `lib` parent directory.
Pass the path to the manager modules when starting ceph-mgr.
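Concretely, starting ceph-mgr now includes an extra argument along these lines
(the path and variable are assumptions about the build layout):

    ceph-mgr -i x \
        --mgr-module-path=$CEPH_ROOT/src/pybind/mgr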
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
If ulimit is set to a value as low as 1024, ceph-osd will segfault with the
following error:
filestore(td/smoke/0) error (24) Too many open files not handled on operation 0x55565d1fd004 (2182.1.0, or op 0, counting from 0)
This patch ensures that a valid ulimit value is set before starting the Ceph
daemons in tests.
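A hedged sketch of the kind of guard this adds before starting daemons (the
threshold shown is illustrative, not necessarily the value the patch uses):

    open_files=$(ulimit -n)
    if [ "$open_files" != "unlimited" ] && [ "$open_files" -lt 4096 ]; then
        # raise the soft limit for this shell and the daemons it spawns
        ulimit -n 4096
    fi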
Signed-off-by: Erwan Velu <erwan@redhat.com>
get_timeout_delays() is a generic function to compute delays over a long
period of time without saturating the CPU in busy loops.
It works fine when the delay is short; requesting a 20-second timeout, for
example, produces the following series: "0.1 0.2 0.4 0.8 1.6 3.2 6.4 7.3".
Here the maximum wait between two loops is 7.3 seconds, which is perfectly fine.
When the timeout reaches 300 seconds, the same code produces the following
series: "0.1 0.2 0.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 102.4 95.3"
In this example some of the delays are nearly 2 minutes!
That is not efficient: the expected event could arrive just after one of
these long sleeps begins, wasting a minute or more for nothing. On a local
system that might be acceptable, but on a CI where many jobs run this way,
the overall result is quite inefficient, generating useless waits.
This patch adds a maximum acceptable delay between two loops while keeping
the same ramp-up behavior.
On the same 300-second example, with MAX_TIMEOUT set to 10, we now get the
following series: "0.1 0.2 0.4 0.8 1.6 3.2 6.4 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 7.3"
The long 12.8/25.6/51.2/102.4/95.3 values are gone, replaced by a series of
10-second waits. It is up to each test to choose a cap based on how soon the
expected event is likely to complete.
MAX_TIMEOUT is set to 15 seconds.
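A minimal sketch of the capped ramp-up described above (illustrative only, not
the exact ceph-helpers.sh implementation; bc handles the fractional arithmetic):

    MAX_TIMEOUT=${MAX_TIMEOUT:-15}

    capped_delays() {
        local timeout=$1
        local step=0.1
        local total=0
        while [ "$(echo "$total + $step <= $timeout" | bc -l)" = "1" ]; do
            echo -n "$step "
            total=$(echo "$total + $step" | bc -l)
            step=$(echo "$step * 2" | bc -l)
            # cap the ramp-up so no single wait exceeds MAX_TIMEOUT
            if [ "$(echo "$step > $MAX_TIMEOUT" | bc -l)" = "1" ]; then
                step=$MAX_TIMEOUT
            fi
        done
        # emit the remainder so the waits still add up to the requested timeout
        echo "$(echo "$timeout - $total" | bc -l)"
    }

Calling capped_delays 300 with MAX_TIMEOUT=10 reproduces the shape of the
capped series shown above.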
Signed-off-by: Erwan Velu <erwan@redhat.com>