address the regression introduced by e62cfceb
in e62cfceb, we wanted to test the newly introduced TOO_FEW_OSDS
warning, so we increased the number of OSDs to match the pool size, so
that if the number of OSDs drops below the pool size, the monitor will
send a warning message.
but we need to bring all OSDs back if we are expecting a healthy
cluster. in this change, all OSDs are resurrected before
`wait_for_health_ok`.
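a minimal sketch of the fix, assuming the usual ceph-helpers.sh helpers
(activate_osd() and wait_for_health_ok()); the OSD count is illustrative:

    # bring every stopped OSD back before expecting HEALTH_OK
    for id in $(seq 0 2); do          # 3 OSDs, count illustrative
        activate_osd $dir $id || return 1
    done
    wait_for_health_ok || return 1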
Signed-off-by: Kefu Chai <kchai@redhat.com>
Case 1: A more recent update exists
Case 2: The first entry in the divergent sequence is a create
Case 3: NOT TESTED - Object currently missing
Case 4: We can rollback all of the entries
Case 5: We cannot rollback at least 1 of the entries
Support starting OSDs even when "noup" is set (don't wait for up).
Move create_ec_pool() to ceph-helpers.sh
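A minimal sketch of what the shared helper could look like once moved into
ceph-helpers.sh (parameter handling and pg counts are illustrative, not the
exact code):

    function create_ec_pool() {
        local poolname=$1
        local allow_overwrites=$2
        shift 2
        # remaining arguments (e.g. k=2 m=1) feed the erasure-code profile
        ceph osd erasure-code-profile set myprofile crush-failure-domain=osd "$@" || return 1
        ceph osd pool create $poolname 8 8 erasure myprofile || return 1
        if [ "$allow_overwrites" = "true" ]; then
            ceph osd pool set $poolname allow_ec_overwrites true || return 1
        fi
    }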
Fixes: https://tracker.ceph.com/issues/39162
Signed-off-by: David Zafman <dzafman@redhat.com>
Change run_osd() to default to the bluestore objectstore
Use run_osd_filestore() to use the non-default objectstore
Fix inject_eio to handle any objectstore if the config is prefixed with its type
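A minimal sketch of how the helpers could be used (the option name
${objectstore}_debug_inject_read_err and the injectargs mechanism are
assumptions for illustration, not necessarily the exact patch):

    run_osd $dir 0 || return 1               # defaults to bluestore now
    run_osd_filestore $dir 1 || return 1     # explicitly request filestore

    # inject_eio builds the debug option name from the objectstore type so
    # the same helper works with either backend
    objectstore=$(get_config osd 0 osd_objectstore)
    ceph tell osd.0 injectargs --${objectstore}_debug_inject_read_err=true || return 1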
Remaining tests using filestore:
  osd-pool-create.sh TEST_pool_create_rep_expected_num_objects
      Test filestore directory creation
  qa/standalone/osd/osd-dup.sh TEST_filestore_to_bluestore
      Obvious
  qa/standalone/osd/osd-rep-recov-eio.sh TEST_rep_read_unfound
      Requires data digest in object info
  qa/standalone/scrub/osd-scrub-repair.sh multiple tests
      Erasure code pools append mode for filestore is tested
  qa/standalone/special/ceph_objectstore_tool.py
      Test code verifies COT by directly examining filestore contents
Fixes: https://tracker.ceph.com/issues/39162
Signed-off-by: David Zafman <dzafman@redhat.com>
'rbd pool init' now does IO. Drop the pool, or change the pool size to 1.
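Either option can be scripted as below (a sketch; the pool name is
illustrative, and newer releases may require an extra confirmation flag for
size 1):

    # option 1: drop the pool entirely
    ceph osd pool delete rbd rbd --yes-i-really-really-mean-it
    # option 2: keep the pool but shrink it to size 1 so the init IO can complete
    ceph osd pool set rbd size 1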
Fixes: http://tracker.ceph.com/issues/38585
Signed-off-by: Sage Weil <sage@redhat.com>
Stopping the osd daemon won't reliably get you HEALTH_WARN or ERR; you have
to make sure it is also marked down.
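A minimal sketch, assuming ceph-helpers.sh's kill_daemons() and
wait_for_health():

    kill_daemons $dir TERM osd.0 || return 1
    ceph osd down 0                       # make sure the monitor marks it down
    wait_for_health "HEALTH_WARN" || return 1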
Signed-off-by: Sage Weil <sage@redhat.com>
awk is used with expressions that the native FreeBSD awk does not
support, like: BEGIN{print 0 < 90}
Also, TESTDIR is not set when calling ceph-helpers from smoke.sh.
So fix this by keeping the archive in /tmp.
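A minimal sketch of the fallback and a portable alternative for the
comparison (variable names are illustrative):

    # smoke.sh does not set TESTDIR, so fall back to /tmp for the archive
    : ${TESTDIR:=/tmp}
    archive_dir=${TESTDIR}/archive

    # a bc-based equivalent of awk's BEGIN{print 0 < 90}
    echo "0 < 90" | bc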
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
callers of get_python_path were not passing in a $1 parameter, so
ceph_lib was an empty string, resulting in an invalid path to the built
cython modules. assume instead that this is called from the parent of
the `lib` directory.
also pass the path to the manager modules when starting ceph-mgr.
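a minimal sketch of the idea (paths and the mgr flag are illustrative
assumptions, not the exact patch):

    function get_python_path() {
        # assume we are called from the parent of the build `lib` directory
        echo $(pwd)/lib/cython_modules/lib.3
    }

    # point ceph-mgr at the in-tree manager modules when starting it
    run_mgr $dir x --mgr-module-path=$CEPH_ROOT/src/pybind/mgr || return 1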
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
If ulimit is set to a value as low as 1024, ceph-osd will segfault with
the following error:
filestore(td/smoke/0) error (24) Too many open files not handled on operation 0x55565d1fd004 (2182.1.0, or op 0, counting from 0)
This patch ensures that a valid ulimit value is set up before the ceph daemons are started in tests.
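A minimal sketch, assuming a hypothetical helper run before the daemons are
started (the name ensure_ulimit and the 4096 threshold are illustrative):

    function ensure_ulimit() {
        local nofile=$(ulimit -n)
        # raise the open-files limit if it is too low for ceph-osd
        if [ "$nofile" != "unlimited" ] && [ "$nofile" -lt 4096 ]; then
            ulimit -n 4096 || return 1
        fi
    }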
Signed-off-by: Erwan Velu <erwan@redhat.com>
get_timeout_delays() is a generic function to compute sleep delays over
a long period of time without saturating the CPU in busy loops.
It works fine when the timeout is short: requesting a 20-second timeout
produces the series "0.1 0.2 0.4 0.8 1.6 3.2 6.4 7.3 ".
Here the maximum delay between two loops is 7.3 seconds, which is
perfectly fine.
When the timeout reaches 300 seconds, the same code produces the
following series: "0.1 0.2 0.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 102.4 95.3 "
In this example some of the delays are nearly 2 minutes!
That is not efficient: the expected event could arrive just after such a
long sleep begins, wasting a minute or more of waiting for nothing. On a
local system that might be acceptable, but on a CI where all jobs behave
like this, the overall run is pretty inefficient and generates useless
CPU waits.
This patch adds a maximum acceptable delay between two loops while
keeping the same ramp-up behavior.
On the same 300-second example, with MAX_TIMEOUT set to 10, we now get
the following series: "0.1 0.2 0.4 0.8 1.6 3.2 6.4 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 7.3"
The long 12.8/25.6/51.2/102.4/95.3 values vanish and are replaced by a
series of 10-second sleeps. It is up to each test to choose a cap that
matches how soon it expects the awaited event to complete.
MAX_TIMEOUT is set to 15 seconds.
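A minimal sketch of the capped ramp-up (not the exact helper; it assumes bc
is available, MAX_TIMEOUT is defined, and max_timeout is positive):

    function get_timeout_delays() {
        local timeout=$1
        local first_step=${2:-1}
        local max_timeout=${3:-$MAX_TIMEOUT}
        local i=$first_step
        local total=0
        # double the delay each iteration, capping it at max_timeout
        while [ "$(echo "$total + $i <= $timeout" | bc -l)" = "1" ]; do
            echo -n "$i "
            total=$(echo "$total + $i" | bc -l)
            i=$(echo "$i * 2" | bc -l)
            if [ "$(echo "$i > $max_timeout" | bc -l)" = "1" ]; then
                i=$max_timeout
            fi
        done
        # pad with the remainder so the series still sums to $timeout
        if [ "$(echo "$total < $timeout" | bc -l)" = "1" ]; then
            echo -n "$(echo "$timeout - $total" | bc -l) "
        fi
    }

Callers keep iterating over the series and sleeping between checks, e.g.
`for delay in $(get_timeout_delays 300 0.1 10); do ... ; sleep $delay; done`.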
Signed-off-by: Erwan Velu <erwan@redhat.com>
wait_for_clean() uses the default timeout, i.e. 300 sec = 5 min.
wait_for_clean() tries to reach a clean status within that timeout,
_or_ resets its counter if any progress was made between two loops.
When the cluster is sane, recovery should complete in much less than
5 min, but if the cluster died, waiting 5 min for nothing is
inefficient.
This patch defines a custom timeout for wait_for_clean() so it does not
wait much more than 1m30 (90 sec). If no progress is made within that
period, there is very little chance the cluster will ever reach a valid
state anyway.
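A minimal sketch of the call site, assuming wait_for_clean() accepts an
optional timeout argument (the variable name WAIT_FOR_CLEAN_TIMEOUT is
illustrative):

    # cap the wait at 90 seconds instead of the 300-second default TIMEOUT
    WAIT_FOR_CLEAN_TIMEOUT=90
    wait_for_clean $WAIT_FOR_CLEAN_TIMEOUT || return 1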
Signed-off-by: Erwan Velu <erwan@redhat.com>
some tests, like osd-backfill-stats.sh, use delete_pool() but do not
define it themselves. this function is defined separately in several
standalone tests, so it is simpler to consolidate it in
ceph-helpers.sh.
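a minimal sketch of the consolidated helper, using the usual pool-delete
confirmation flag:

    function delete_pool() {
        local poolname=$1
        ceph osd pool delete $poolname $poolname --yes-i-really-really-mean-it || return 1
    }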
Signed-off-by: Kefu Chai <kchai@redhat.com>