The I/O workload in this test is xfstests (qa/run_xfstests_qemu.sh)
which isn't subject to any timeout other than the global max_job_time
limit in any other subsuite (e.g. qemu/workloads/qemu_xfstests.yaml).
But here, there is a parallel "op" workload defined as a workunit.
The workunit task has a default timeout of 3 hours which is effectively
imposed on the entire job. In the "rbd cache = false" configuration,
it's sometimes exceeded.
Fixes: https://tracker.ceph.com/issues/48038
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It doesn't really thrash anything, just repeatedly restarts the
workload on top of a dirty cache file. rbd_pwl_cache_recovery is
more on point and gets covered by existing CODEOWNERS.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
osd/SnapMapper: fix legacy key conversion in snapmapper class
Reviewed-by: Matan Breizman <mbreizma@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
Misplaced colons can result in radosgw thinking it has a bucket URL
but with no bucket name, leading to a crash later on.
Fixes: https://tracker.ceph.com/issues/55765
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
ceph.spec.in: fix path for mib file and properly mark in %files
Reviewed-by: Kefu Chai <tchaikov@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Currently, the following transaction exec sequence would lead to
loss of backref:
1. Trans `A` merges an alloc backref for extent `X`
2. Trans `B` adds a release backref for extent `X` to the backref cache,
during which it finds an in-cache alloc backref for extent `X` and
decides not to add the release backref to the cache
3. Trans `A` commits
In the above sequence, the release backref for extent `X` is lost.
This is a regression introduced when we tried to optimize the backref cache.
This commit fixes the issue by caching inflight backrefs in a multiset:
alloc/release ops that happen on the same paddr are queued in the order
in which they happen. When doing gc, all those backrefs are merged.
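
The ordering idea can be sketched as follows. This is a minimal illustration, not the seastore code: the class name, the method names and the use of a dict of lists in place of the multiset are all assumptions made for this example.

```python
from collections import defaultdict


class InflightBackrefs:
    """Hypothetical stand-in for the in-flight backref cache."""

    def __init__(self):
        # paddr -> list of "alloc" / "release" ops, kept in arrival order.
        self._ops = defaultdict(list)

    def add(self, paddr, op):
        if op not in ("alloc", "release"):
            raise ValueError(op)
        # Always queue the op; a release is never dropped just because
        # an alloc for the same paddr is already cached.
        self._ops[paddr].append(op)

    def merge_for_gc(self):
        """Replay queued ops in order; return paddrs still allocated."""
        live = set()
        for paddr, ops in self._ops.items():
            for op in ops:
                if op == "alloc":
                    live.add(paddr)
                else:
                    live.discard(paddr)
        self._ops.clear()
        return live
```

With this shape, an alloc followed by a release on the same paddr cancels out at merge time instead of the release being silently discarded.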
Fixes: https://tracker.ceph.com/issues/56519
Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>
This PR picks up the parts of
https://github.com/ceph/ceph/pull/44466
that were not merged back in January, when that
pull request was raised.
Matters added here:
* improved organization of matter
* emphasis of IOPs per core over cores per OSD
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc/cephadm: add note about OSDs being recreated to OSD removal section
Reviewed-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
Reviewed-by: Redouane Kachach <rkachach@redhat.com>
Octopus modified the SnapMapper key format from
<LEGACY_MAPPING_PREFIX><snapid>_<shardid>_<hobject_t::to_str()>
to
<MAPPING_PREFIX><pool>_<snapid>_<shardid>_<hobject_t::to_str()>
When this change was introduced, 94ebe0ea also introduced a conversion
with a crucial bug which essentially destroyed legacy keys by mapping them
to
<MAPPING_PREFIX><poolid>_<snapid>_
without the object-unique suffix. This commit fixes this conversion going
forward, but a fix for existing clusters still needs to be developed.
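
A rough sketch of why the truncated conversion is destructive, with loud assumptions: the prefix strings below are placeholders, the field encoding is simplified to plain decimal text, and the function names are invented for this example. None of it is the SnapMapper implementation.

```python
# Placeholder prefixes; the real values live in the SnapMapper source.
LEGACY_MAPPING_PREFIX = "LEGACY_"
MAPPING_PREFIX = "NEW_"


def legacy_key(snapid, shardid, hobj):
    # <LEGACY_MAPPING_PREFIX><snapid>_<shardid>_<hobject_t::to_str()>
    return f"{LEGACY_MAPPING_PREFIX}{snapid}_{shardid}_{hobj}"


def buggy_convert(key, pool):
    # Keeps only the snapid, dropping shard and object suffix.
    snapid = key[len(LEGACY_MAPPING_PREFIX):].split("_", 1)[0]
    return f"{MAPPING_PREFIX}{pool}_{snapid}_"


def fixed_convert(key, pool):
    # Carries the full <snapid>_<shardid>_<hobject> remainder over.
    rest = key[len(LEGACY_MAPPING_PREFIX):]
    return f"{MAPPING_PREFIX}{pool}_{rest}"
```

Under these assumptions, two different objects in the same snapshot collapse to the same key after the buggy conversion, while the fixed conversion keeps them distinct.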
Fixes: https://tracker.ceph.com/issues/56147
Signed-off-by: Manuel Lausch <manuel.lausch@1und1.de>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Prevent Alertmanager alerts from being redirected to the active mgr
dashboard instance. There are two reasons for it:
1. It doesn't bring any additional benefit. The Alertmanager config
includes all available mgr instances - active and passive ones. In
case of an alert, it will be sent to all of them. It ensures that
the active mgr dashboard will receive the alert in any case.
2. The redirect URL includes the mgr IP and NOT the FQDN. This leads
to issues in environments where an SSL certificate is configured and
matches only the FQDNs.
Fixes: https://tracker.ceph.com/issues/56401
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
With ./box.py --engine docker you can specify that you want to use docker
instead of podman. With docker, the box.py command should be run with sudo.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
crimson/os/seastore/cache: fine-grained lru cache control with GC
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Chunmei Liu <chunmei.liu@intel.com>
Reviewed-by: Yingxin Cheng <yingxin.cheng@intel.com>
The retired extent may exist as a RetiredExtentPlaceholder, casting
this extent to LogicalCachedExtent will cause undefined behavior.
Signed-off-by: Zhang Song <zhangsong325@gmail.com>
mgr/dashboard: ingress backend service should list all supported services
Reviewed-by: Aashish Sharma <aasharma@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>
Reviewed-by: sunilangadi2 <NOT@FOUND>
mgr/snap_schedule: Use rados.Ioctx.remove_object() instead of remove().
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Add a thrash test for the persistent write log cache: run rbd bench
on the persistent write log cache, thrash rbd bench, and test the
recovery function of the persistent write log cache.
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
GC transactions are not driven by user behavior, so extent reads
issued by GC transactions don’t satisfy the temporal locality
principle. These extents should not be added to the LRU cache.
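
The policy can be illustrated with a small LRU that takes a per-read source flag. This is only a sketch: `ExtentLRU`, `read`, `from_gc` and the loader callback are hypothetical names, and the choice not to promote an entry on a GC hit is an assumption of this example rather than something the commit states.

```python
from collections import OrderedDict


class ExtentLRU:
    """Hypothetical LRU that ignores reads issued by GC."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._lru = OrderedDict()  # addr -> extent, oldest first

    def read(self, addr, load, from_gc=False):
        if addr in self._lru:
            if not from_gc:
                # Only user reads refresh recency.
                self._lru.move_to_end(addr)
            return self._lru[addr]
        extent = load(addr)
        if not from_gc:
            # GC reads bypass the cache so they cannot evict hot extents.
            self._lru[addr] = extent
            if len(self._lru) > self.capacity:
                self._lru.popitem(last=False)
        return extent

    def cached(self):
        return list(self._lru)
```

A GC pass that scans many cold extents then leaves the user-driven working set in the cache untouched.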
Signed-off-by: Xinyu Huang <xinyu.huang@intel.com>
Image contexts are reopened even though we pass the context as an
argument. This commit changes that so you can forget about reopening
an rbd image context again.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Whenever we use serverSide (paginating through the backend) we should
debounce reloadData, since it might otherwise issue too many API calls.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>