The number of objects is not a significant indicator of when data
should be written out for rbd. Use the highest possible value for the
number of objects and rely on the dirty data limits alone to trigger
flushing. When the object limit is low, many objects start being
flushed before they accumulate many requests, which hurts average
request size and performance for many concurrent sequential writes.
Fixes: #7385
Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
close_image() among other things unregisters a watcher that's been
registered by open_image(). Even though it'll time out in 30 or so
seconds, it's not nice now that we check for watchers before starting
the removal process.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Check for watchers before trimming image data to try to avoid getting
into the following situation:
- user does 'rbd rm' on a mapped image with an fs mounted from it
- 'rbd rm' trims (removes) all image data, only header is left
- 'rbd rm' tries to remove a header and fails because krbd has a
watcher registered on the header
- at this point image cannot be unmapped because of the mounted fs
- fs cannot be unmounted because all its data and metadata is gone
Unfortunately, this fix doesn't make it impossible to happen (the
required atomicity isn't there), but it's a big improvement over the
status quo.
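
The intended ordering can be sketched as follows. This is a minimal
in-memory model, not librbd code: the Image class, its fields, and the
choice of EBUSY as the error code are all assumptions made for
illustration.

```python
import errno


class Image:
    """Hypothetical in-memory stand-in for an rbd image (not a librbd type)."""

    def __init__(self, data_objects, watchers):
        self.data_objects = set(data_objects)
        self.watchers = list(watchers)
        self.header_present = True


def remove_image(image):
    """Refuse to touch any data while a watcher is registered.

    Returns 0 on success or a negative errno. EBUSY is an assumption.
    The check and the trim are not atomic, so a watcher that registers
    after the check can still run into the situation described above.
    """
    if image.watchers:
        return -errno.EBUSY
    image.data_objects.clear()     # trim (remove) all image data
    image.header_present = False   # remove the header last
    return 0
```

With this ordering, a mapped image keeps both its data and its header
when removal is refused, so the filesystem on it can still be unmounted.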
Fixes: http://tracker.ceph.com/issues/7076
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
This will show up on the command line and logs, making it more
clear than EINVAL.
Fixes: #6851 and #4045
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This removes export-symbol-regex for installed libraries with C++
interfaces on non-Linux where the hidden symbols are not resolved. This
is a temporary fix.
See ceph-devel topic "Shared library symbol visibility" for discussion
about a permanent solution.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.
The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.
Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The parent is always a snapshot. We may want to treat it differently
than other snaps by virtue of it (likely) being a more highly-shared
image.
By default, localize parent reads.
Signed-off-by: Sage Weil <sage@inktank.com>
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
- Enabling subdir objects
- Created a Makefile-env.am with basic automake init
- Created .am files per subdir, included from src/Makefile.am
Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
- In "includes", inttypes.h was shadowing the system header of the same
name. This caused random build errors on some systems and under some
conditions. Rename it.
- Add fallback definitions of the PRI*64 macros when int_types.h does not
define them (which, unfortunately, can happen on some systems).
Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Add rados_ioctx_namespace_set_key() and librados::IoCtx::namespace_set_key()
Add namespace to admin-daemon operations
Support namespace in osd map command
Add namespace to object_locator_t and hobject_t
Add random namespaces to psim program
Feature: #4982 (OSD: namespaces pt 1 (librados/osd, not caps))
Signed-off-by: David Zafman <david.zafman@inktank.com>
Feed truncate_size through the striping algorithm so that it reflects the
correct per-object offset (as opposed to the file offset).
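
To illustrate why a file offset cannot be used directly as a per-object
offset, here is a minimal sketch of a RAID-0 style mapping from a file
offset to an (object number, offset within object) pair. The function
and parameter names are hypothetical, and the sketch covers only the
offset mapping, not the truncate_size calculation itself.

```python
def file_to_object_offset(off, stripe_unit, stripe_count, object_size):
    """Map a file offset to (object number, offset inside that object).

    Assumes object_size is a multiple of stripe_unit.
    """
    stripes_per_object = object_size // stripe_unit
    blockno = off // stripe_unit            # which stripe unit overall
    stripeno = blockno // stripe_count      # which stripe (row)
    stripepos = blockno % stripe_count      # which object within the set
    objectsetno = stripeno // stripes_per_object
    objectno = objectsetno * stripe_count + stripepos
    block_start = (stripeno % stripes_per_object) * stripe_unit
    return objectno, block_start + off % stripe_unit
```

For example, with stripe_unit=4, stripe_count=2 and object_size=8, file
offset 13 lands at offset 5 of object 1, so a truncate point expressed
as a file offset would be wrong for that object.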
Fixes: #5380
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Call release() on librados::AioCompletion to free storage before
leaving the loop or calling new again.
CID 1021213 (#1 of 1): Resource leak (RESOURCE_LEAK)
leaked_storage: Variable "rados_completion" going out of scope leaks
the storage it points to.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Programs using older versions of the image creation functions can't
set newer parameters like image format and fancier striping.
Setting these options lets them use all the new functionality without
being patched and recompiled to use e.g. rbd_create3().
This is particularly useful for things like qemu-img, which does not
know how to create format 2 images yet.
Refs: #5067
Backport: cuttlefish, bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Since snapshots never change, it's safe to read from replicas for them.
A common use for this would be reading from a parent snapshot shared by
many clones.
Convert LibrbdWriteback and AioRead to use the ObjectOperation api
so we can set flags. Fortunately the external wrapper holds no data,
so its lifecycle doesn't need to be managed.
Include a simple workunit that sets the flags in various combinations
and looks for their presence in the logs from 'rbd export'.
Fixes: #3064
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This is done by all callers right before constructing this.
Since C_SimpleThrottle is already responsible for calling ->end_op(),
it makes sense to call start_op() there too.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Instead of using read_iterate(), loop over each period of objects in
the source, read from them asynchronously, and then asynchronously
write to the destination.
The callbacks make this a bit more complex, but it can perform much
better.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
All the users of rbd_ctx_cb() do this separately right now, but
there's no reason to keep the completion around after the nested
completion has been called. Also declare rbd_ctx_cb() in the header
so it can be used before its definition.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Flattening reads the logical child object from the parent image, and
then does a copyup operation if the data is non-zero. This is
equivalent to doing a zero-length write to each object in the
child image. Do this instead, so that we can easily control how
many are in flight, and eliminate some code as well.
Since we no longer read from the parent within the flatten function,
the buffer is not needed. It would be leaked in some error conditions,
but since it's unnecessary we can just get rid of it.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
If the parent image is logically zero for the range of a child object,
it's equivalent to the object not existing. Save some I/O and network
bandwidth and don't send the useless zeroes.
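
A minimal sketch of the check, assuming the parent data for the child
object's range is already in memory. The helper name copyup_payload is
hypothetical and does not correspond to a librbd function.

```python
from typing import Optional


def copyup_payload(parent_data: bytes) -> Optional[bytes]:
    """Return the bytes worth sending for a copyup, or None to skip it.

    An all-zero (or empty) range is treated the same as the object not
    existing in the parent, so nothing needs to go over the network.
    """
    if not any(parent_data):
        return None
    return parent_data
```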
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Speed up deletions when resizing down or removing an image by keeping
up to 10 operations in flight by default.
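
The idea of bounding the number of in-flight operations can be sketched
as below. This is an illustrative model using threads, not the librbd
implementation: the SimpleThrottle class here, the remove_objects()
helper, and the remove_one callback are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SimpleThrottle:
    """Block start_op() while max_in_flight operations are outstanding."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._current = 0
        self._cond = threading.Condition()
        self.peak = 0  # highest concurrency observed, for inspection

    def start_op(self):
        with self._cond:
            while self._current >= self._max:
                self._cond.wait()
            self._current += 1
            self.peak = max(self.peak, self._current)

    def end_op(self):
        with self._cond:
            self._current -= 1
            self._cond.notify_all()

    def wait_for_all(self):
        with self._cond:
            while self._current > 0:
                self._cond.wait()


def remove_objects(object_names, remove_one, max_in_flight=10):
    """Remove every object, never exceeding max_in_flight at once."""
    throttle = SimpleThrottle(max_in_flight)

    def run(name):
        try:
            remove_one(name)
        finally:
            throttle.end_op()  # always release the slot

    with ThreadPoolExecutor(max_workers=32) as pool:
        for name in object_names:
            throttle.start_op()  # blocks once the limit is reached
            pool.submit(run, name)
        throttle.wait_for_all()
    return throttle.peak
```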
Refs: #2256
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(of the pointer)
CID 966634 (#1 of 1): Use after free (USE_AFTER_FREE)
2. use_after_free: Using freed pointer "ictx".
Signed-off-by: Sage Weil <sage@inktank.com>
The existing read_iterate takes a size_t for the length, which is only 4GB
on 32-bit machines. Instead, take a uint64_t length for the new
read_iterate2().
Return 0 instead of the number of bytes read; this makes the user-facing
API a bit simpler.
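
The size_t problem can be shown with plain arithmetic, followed by a
sketch of an iterate loop that takes a full 64-bit length and returns 0.
Both helpers are hypothetical models for illustration. They are not the
librbd functions, and read_chunk and callback are assumed
caller-supplied callables.

```python
SIZE_T_MAX_32 = 2**32 - 1  # largest value a 32-bit size_t can hold


def truncate_to_size_t32(length):
    """Model a 64-bit length being passed through a 32-bit size_t."""
    return length & SIZE_T_MAX_32


def read_iterate2_sketch(read_chunk, offset, length, callback,
                         chunk_size=4 << 20):
    """Walk [offset, offset + length) in chunks and return 0 when done.

    The callback receives (offset, length, data) for each chunk, so the
    caller does not need a byte count in the return value.
    """
    end = offset + length
    while offset < end:
        n = min(chunk_size, end - offset)
        callback(offset, n, read_chunk(offset, n))
        offset += n
    return 0
```

A 5 GiB request squeezed through a 32-bit size_t silently becomes a
1 GiB request, which is the failure the wider length type avoids.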
Fixes: #4665
Signed-off-by: Sage Weil <sage@inktank.com>
keep bytes return from internal method
The read() method returns the bytes read, trimmed to the end of the image;
use the other read() variant to do this (which uses aio_read()) instead of
read_iterate().
Signed-off-by: Sage Weil <sage@inktank.com>
The diff_iterate() tests fail when caching is enabled because recent writes
aren't visible to listsnaps. Flush from diff_iterate to ensure that they
are. Someday, maybe, we might make diff_iterate() inspect the cache
contents to make this more efficient, but for now that is not necessary.
Signed-off-by: Sage Weil <sage@inktank.com>
RADOS returns writes to the same object in the same order. The
ObjectCacher relies on this assumption to make sure previous writes
are complete and maintain consistency. Reads, however, may be
reordered with respect to each other. When writing to an rbd clone,
reads to the parent must be performed when the object does not exist
in the child yet. These reads may be reordered, resulting in the
original writes being reordered. This breaks the assumptions of the
ObjectCacher, causing an assert to fail.
To fix this, keep a per-object queue of outstanding writes to an
object in the LibrbdWriteback handler, and finish them in the order in
which they were sent.
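
A minimal sketch of such a per-object queue is shown below. It is an
illustrative model rather than the LibrbdWriteback code, and the class
and method names are hypothetical.

```python
from collections import defaultdict, deque


class OrderedWriteTracker:
    """Finish writes to each object in the order they were sent."""

    def __init__(self):
        self._queues = defaultdict(deque)  # object id -> pending writes

    def write_sent(self, oid, tid, on_finish):
        self._queues[oid].append(
            {"tid": tid, "done": False, "rc": None, "on_finish": on_finish})

    def write_completed(self, oid, tid, rc):
        queue = self._queues[oid]
        # Record the result, but do not finish out of order.
        for entry in queue:
            if entry["tid"] == tid:
                entry["done"] = True
                entry["rc"] = rc
                break
        # Finish every completed write at the head of the queue.
        while queue and queue[0]["done"]:
            entry = queue.popleft()
            entry["on_finish"](entry["rc"])
        if not queue:
            del self._queues[oid]
```

If the second write to an object completes first, its callback is held
back until the first write completes, and then both run in send order.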
Fixes: #4531
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The tid returned by reads is ignored, and would make tracking writes
internally more difficult by using the same id-space as them. Make read
void and update all implementations.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
The python interface is a bit awkward since it maps directly
to the C interface, but it'll work well enough and not use
tons of memory.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Use int instead of bool for the callback, and make it represent
whether the data exists, rather than the opposite, since callers
are likely to test for whether it's data instead of whether it's zeroes.
Change the return value to 0, since an int64_t will wrap around
for large reads, and there's no value in reporting the length
read when it will always be the length requested clipped to the
size of the image.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>