The qa tests are not client I/O centric; they mostly focus on triggering
recovery/backfills and monitoring them for completion within a finite
amount of time. The same holds true for scrub operations.
An mClock profile that prioritizes background operations is therefore a
better fit for qa related tests, so osd_mclock_profile is globally
overridden to the 'high_recovery_ops' profile for the Rados suite.
Also, many standalone tests expect recovery and scrub operations to
complete within a finite time. To ensure this, the osd_mclock_profile
option is set to 'high_recovery_ops' as part of the run_osd() function
in ceph-helpers.sh.
A subset of standalone tests explicitly set the 'high_recovery_ops'
profile. Since the profile is now set as part of run_osd(), those
earlier overrides are redundant and have been removed from the tests.
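As a rough illustration of the run_osd() change, a simplified sketch
(the surrounding helper code in ceph-helpers.sh is elided and the
ceph-osd invocation below is illustrative):

    # Inside run_osd() (sketch): favor background ops so recovery,
    # backfill and scrub complete within the tests' timeouts.
    local ceph_args="$CEPH_ARGS"
    ceph_args+=" --osd_mclock_profile=high_recovery_ops"
    CEPH_ARGS="$ceph_args" ceph-osd -i "$id" --osd-data "$dir/$id" || return 1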
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Set the osd_mclock_override_recovery_settings option to true for tests
that modify recovery/backfill configuration options. This prevents the
cluster warning that would otherwise be logged when the recovery/backfill
limits are modified.
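For example, a test can enable the override before touching the limits
(a sketch; the option values below are illustrative):

    # Allow recovery/backfill limits to be changed without triggering
    # the mClock-related cluster warning.
    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 4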
Fixes: https://tracker.ceph.com/issues/57529
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
This commit has been causing scheduled jobs to request e.g. aarch64
smithi machines, which don't exist. The dispatcher then tries to find
them forever and has to be killed and restarted, and the queue sits idle
until someone notices the problem.
Signed-off-by: Zack Cerza <zack@redhat.com>
modified: qa/standalone/erasure-code/test-erasure-code-plugins.sh
new file: qa/suites/rados/thrash-erasure-code-isa/arch/aarch64.yaml
Signed-off-by: Dai Zhiwei <daizhiwei3@huawei.com>
Modified test cases:
1. test-erasure-eio.sh:
   a. TEST_ec_backfill_unfound():
      - Set osd_mclock_profile to the high_recovery_ops profile.
      - Increase the wait for backfill_unfound timeout to 240 seconds.
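A hedged sketch of the longer wait (the test's actual polling helper may
differ; 'ceph pg ls' output is grepped here purely for illustration):

    # Wait up to 240 seconds for a PG to reach the backfill_unfound state.
    timeout=240
    until ceph pg ls | grep -q 'backfill_unfound'; do
        (( timeout-- > 0 )) || return 1
        sleep 1
    done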
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
The changes to the way EC/ReplicatedBackend communicate read
errors had a side effect of making the first eio on the object in
TEST_rados_get_subread_eio_shard_[01] repair itself, depending
on the timing of the killed osd recovering. The test should
be improved to actually test that behavior at some point.
Signed-off-by: Samuel Just <sjust@redhat.com>
Change run_osd() to default to the bluestore objectstore
Use run_osd_filestore() to use the non-default objectstore
Fix inject_eio to handle any objectstore if the config option is prefixed
with the objectstore type (see the sketch after the list below)
Remaining tests using filestore:
  osd-pool-create.sh TEST_pool_create_rep_expected_num_objects
    Test filestore directory creation
  qa/standalone/osd/osd-dup.sh TEST_filestore_to_bluestore
    Obvious
  qa/standalone/osd/osd-rep-recov-eio.sh TEST_rep_read_unfound
    Requires data digest in object info
  qa/standalone/scrub/osd-scrub-repair.sh multiple tests
    Erasure code pools append mode for filestore is tested
  qa/standalone/special/ceph_objectstore_tool.py
    Test code verifies COT by directly examining filestore contents
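A hedged sketch of the objectstore-aware inject_eio idea (the real helper
in ceph-helpers.sh takes more arguments and covers more cases; the option
names follow the existing <objectstore>_debug_inject_read_err pattern):

    # Pick the read-error injection option based on the OSD's objectstore
    # so the same helper works for both bluestore and filestore.
    inject_eio() {
        local objectstore=$1   # "bluestore" or "filestore"
        local osd_id=$2
        ceph tell osd."$osd_id" injectargs \
            "--${objectstore}_debug_inject_read_err=true" || return 1
    }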
Fixes: https://tracker.ceph.com/issues/39162
Signed-off-by: David Zafman <dzafman@redhat.com>
* refs/pull/26894/head:
qa/standalone/erasure-code/test-erasure-code: adjust test to avoid m=0
erasure-code: ensure m >= 1
mon/OSDMonitor: set ec min_size to k + min(1, m - 1)
Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
_DD is k=2 m=0, which we don't allow. Switch it to cDD.
I confess I don't fully understand why this was _DD to begin with, but
I'm pretty sure the mapping is there to control the order of results so that
it can be mapped to the CRUSH rule output sanely, and the coding portion
is not relevant to the test.
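For context, the mapping string comes from the lrc erasure-code plugin,
where 'D' marks a data chunk and other characters mark chunks that are
not data. A hedged sketch of a profile using the new mapping (the
parameters here are illustrative, not the test's exact profile):

    # cDD: two data chunks plus one coding chunk (k=2, m=1), whereas
    # _DD would have produced k=2, m=0, which is no longer allowed.
    ceph osd erasure-code-profile set lrc-cdd \
        plugin=lrc \
        mapping=cDD \
        layers='[ [ "cDD", "" ] ]'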
Signed-off-by: Sage Weil <sage@redhat.com>
Bluestore caused grep to fail with "grep: memory exhausted" due to the
size of the "block" storage file.
Fixes: http://tracker.ceph.com/issues/38678
Signed-off-by: David Zafman <dzafman@redhat.com>
'rbd pool init' now does IO. Drop the pool, or change the pool size to 1.
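For instance (a hedged sketch; the pool name and pg count are
illustrative), a single-OSD helper cluster can keep the rbd pool usable
by shrinking it before init:

    # With only one OSD up, 'rbd pool init' IO only completes if size=1.
    ceph osd pool create rbd 8
    ceph osd pool set rbd size 1   # newer releases also need --yes-i-really-mean-it
    rbd pool init rbd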
Fixes: http://tracker.ceph.com/issues/38585
Signed-off-by: Sage Weil <sage@redhat.com>
Also:
- Do not print **offset** until specified
- Count missing objects correctly (used to be primary's local missing)
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
In an EC recovery read, if reading the object's attrs failed or returned
errors, we erase the attrs we have read and try to read them again from
the remaining shards. This lets the primary osd obtain the object's
attrs correctly and avoids an assert.
Signed-off-by: xiaofei cui <cuixiaofei@sangfor.com>
When multiple objects are in flight for the same ReadOp, swap() on the
map<hobject_t, read_request_t> would remove requests for all objects.
We just want to replace the requests for the single object we're
dealing with in send_all_remaining_reads().
This prevents crashing trying to look up rop.to_read[hoid] when another
object in the same ReadOp gets an EIO and tries to send more requests.
Test this by using osd-recovery-max-single-start to bundle multiple
reads into one ReadOp. Save and restore CEPH_ARGS so custom settings
are reset for each test.
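A hedged sketch of the save/restore pattern (the test function name and
option values are illustrative):

    TEST_ec_multiple_objects_per_readop() {   # hypothetical, for illustration
        local dir=$1
        local saved_args=$CEPH_ARGS
        # Bundle multiple recovery reads into one ReadOp for this test.
        CEPH_ARGS+=" --osd-recovery-max-single-start=3 --osd-recovery-max-active=3"
        # ... set up the pool, inject EIO, and exercise recovery here ...
        # Restore the defaults so later tests are unaffected.
        CEPH_ARGS=$saved_args
    }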
Fixes: http://tracker.ceph.com/issues/23195 (the 2nd crash there)
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
Discount shards that already returned EIO, and use minimum_to_decode()
to request just what is necessary to recover or read the originally
requested extents of the object.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
1235810c2a allowed recovery to use
multiple passes of reads to handle EIO, but the end condition for
checking whether we finished reading requires the full data to be
decodable (this is what get_want_to_read_shards returns).
Normally this is just a loss of efficiency, since when there is only
one object the subsequent read works and grabs all the data
necessary. The crash comes from having multiple objects in the same
ReadOp - in this case the sequence of events is:
- start recovery of two objects (osd_recovery_max_single_start > 1)
- read object a shard 3
- read object b shard 3
- fail minimum_to_decode because shard 3 can't reconstruct all of object a
- re-read all of object a, marking more reads in progress
- fail minimum_to_decode because shard 3 can't reconstruct all of object b
- skip re-reading object because there are now reads in progress
- finish reading k shards of object a
- still fail minimum_to_decode for object b, so no extra data was read
- send_all_remaining_reads tries to lookup object b in ReadOp object
- crash dereferencing to_read[object b], since this was cleared after handling the original object b read reply
This patch fixes the immediate inefficiency and crash by only checking
for the missing shards that were requested, rather than the entire
object, for recovery reads.
Fixes: http://tracker.ceph.com/issues/23195 (first crash)
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
Some tests, like osd-backfill-stats.sh, use delete_pool() but do not
define it. The function is currently defined separately in individual
standalone tests, so it is simpler to consolidate it in
ceph-helpers.sh.
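The consolidated helper is essentially a thin wrapper; a minimal sketch
(the version in ceph-helpers.sh may carry extra bookkeeping):

    # Shared delete_pool() for the standalone tests.
    delete_pool() {
        local poolname=$1
        ceph osd pool delete "$poolname" "$poolname" --yes-i-really-really-mean-it
    }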
Signed-off-by: Kefu Chai <kchai@redhat.com>
wait_for_clean() can miss the new pool if it races with pool create.
Fixes: http://tracker.ceph.com/issues/20465
Signed-off-by: David Zafman <dzafman@redhat.com>
/bin/bash is a Linuxism. Other operating systems install bash to
different paths. Use /usr/bin/env in shebangs to find bash.
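Concretely, the shebang change is:

    # Before (Linux-specific):
    #!/bin/bash
    # After (portable):
    #!/usr/bin/env bash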
Signed-off-by: Alan Somers <asomers@gmail.com>
- stop running via make check
- add teuthology yamls to run them
- disable ceph_objectstore_tool.py for now (too slow for make check, and
we can't use vstart in teuthology via a package install)
- drop cephtool tests since those are already covered by other teuthology
tests
- leave a handful of (fast!) ceph-helpers tests in make check as minimal
  integration tests.
Signed-off-by: Sage Weil <sage@redhat.com>