Commit Graph

251 Commits

Author SHA1 Message Date
Josh Durgin
0559d31db2 librbd: remove limit on number of objects in the cache
The number of objects is not a significant indicator of when data
should be written out for rbd. Use the highest possible value for
number of objects and just rely on the dirty data limits to trigger
flushing. When the number of objects is low, and many start being
flushed before they accumulate many requests, it hurts average request
size and performance for many concurrent sequential writes.

Fixes: #7385
Backport: emperor, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2014-02-11 12:14:13 -08:00
Ilya Dryomov
4ebc32f37a rbd: don't forget to call close_image() if remove_child() fails
close_image() among other things unregisters a watcher that's been
registered by open_image().  Even though it'll time out in 30 or so
seconds, it's not nice now that we check for watchers before starting
the removal process.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-01-30 14:47:45 +02:00
Ilya Dryomov
0a553cfa81 rbd: check for watchers before trimming an image on 'rbd rm'
Check for watchers before trimming image data to try to avoid getting
into the following situation:

  - user does 'rbd rm' on a mapped image with an fs mounted from it
  - 'rbd rm' trims (removes) all image data, only header is left
  - 'rbd rm' tries to remove a header and fails because krbd has a
    watcher registered on the header
  - at this point image cannot be unmapped because of the mounted fs
  - fs cannot be unmounted because all its data and metadata is gone

Unfortunately, this fix doesn't make it impossible to happen (the
required atomicity isn't there), but it's a big improvement over the
status quo.

Fixes: http://tracker.ceph.com/issues/7076

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-01-30 14:47:45 +02:00
Noah Watkins
4c4e1d0d47 libc++: use ceph:: namespaced data types
Switches the implementation of smart pointers and unordered map/set to
use the ceph:: versions.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2014-01-18 14:03:20 -08:00
Josh Durgin
e91fb91065 librbd: better error when unprotect fails on unprotected snap
This will show up on the command line and logs, making it more
clear than EINVAL.

Fixes #6851 and #4045
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-31 16:26:07 -08:00
Noah Watkins
5b77533404 make: avoid symbol exporting for C++ libs on non-Linux
This removes export-symbol-regex for installed libraries with C++
interfaces on non-Linux where the hidden symbols are not resolved. This
is a temporary fix.

See ceph-devel topic "Shared library symbol visibility" for discussion
about a permanent solution.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-30 12:58:37 -08:00
Josh Durgin
8f3ad4e3b9 Merge pull request #1000 from ceph/wip-rbd-tinc-5426
fix #5426 race in librbd

Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-26 18:53:02 -08:00
Josh Durgin
4cea7895da librbd: call user completion after incrementing perfcounters
The perfcounters (and the ictx) are only valid while the image is
still open.  If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.

The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.

Fixes: #5426
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-12-26 17:40:34 -08:00
Sage Weil
006449ddb5 librados: deprecate aio_operate() read variant that takes snapid
The argument was ignored.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-24 07:58:07 -08:00
Sage Weil
909f8a42b6 librbd: localize or distribute parent (snap) reads
The parent is always a snapshot.  We may want to treat it differently
than other snaps by virtue of it (likely) being a more highly-shared
image.

By default, localize parent reads.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-24 07:58:07 -08:00
Noah Watkins
ef4061f0ad librbd: remove unused private variable
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 18:07:03 -08:00
Noah Watkins
3b39a8a9f1 librbd: rename howmany to avoid conflict
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2013-12-07 18:07:03 -08:00
Sage Weil
ad4553a4fd librbd: fix build error
From a10703008f.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-10-21 15:48:42 -07:00
Sage Weil
a10703008f librbd: wire up flush counter
Fixes: #5668
Signed-off-by: Sage Weil <sage@inktank.com>
2013-10-21 14:40:03 -07:00
Roald J. van Loon
6949d221ad automake cleanup: implementing non-recursive make
- Enabling subdir objects
- Created a Makefile-env.am with basic automake init
- Created .am files per subdir, included from src/Makefile.am

Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
2013-09-08 00:11:09 +02:00
Roald J. van Loon
09b42c033f automake cleanup: renamed inttypes.h
- In "includes", inttypes.h was cluttering the system's one. This caused
  random build errors on some systems/in some conditions. Renaming it.
- Add emergency defs of PRI*64 headers when int_types.h does not define
  them (which, unfortunately, can happen on some systems).

Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
2013-09-07 22:41:10 +02:00
Sage Weil
a10ca4b5e0 librbd: fix debug print in aio_write
Reported-by: James Harper <james.harper@bendigoit.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-08-27 08:30:50 -07:00
Sage Weil
87affa2d1c Merge pull request #491 from kri5/wip-clang-compilation
Fix compilation -Wmismatched-tags warnings

Reviewed-by: Loic Dachary <loic@dachary.org>
2013-08-17 10:59:01 -07:00
Sage Weil
93ac92d85b librbd: remove mostly-useless assign_bid helper
Do it inline.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-08-15 17:21:11 -07:00
Christophe Courtaut
e1666d0400 Fix compilation -Wmismatched-tags warnings
Keep consistency in the code to not generate warnings of this type.

Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>
2013-08-09 11:58:58 +02:00
Samuel Just
2dbb273d13 src/*: make Context::finish private and switch all users to use complete
Signed-off-by: Samuel Just <sam.just@inktank.com>
Fixes: Sage Weil <sage@inktank.com>
2013-07-22 10:33:40 -07:00
David Zafman
e761e4e55f librados, os, osd, osdc, test: Add support for client specified namespaces
Add rados_ioctx_namespace_set_key() and librados::IoCtx::namespace_set_key()
Add namespace to admin-daemon operations
Support namespace in osd map command
Add namespace to object_locator_t and hobject_t
Add random namespaces to psim program

Feature: #4982 (OSD: namespaces pt 1 (librados/osd, not caps))

Signed-off-by: David Zafman <david.zafman@inktank.com>
2013-07-09 14:09:02 -07:00
Yan, Zheng
714f2128bd osdc: re-calculate truncate_size for striped objects
Feed truncate_size through the striping algorithm so that it reflects the
correct per-object offset (as opposed to the file offset).

Fixes #5380
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-06-20 12:26:30 -07:00
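The difference between a file offset and a per-object offset can be illustrated with a minimal sketch. It assumes the usual striping parameters (stripe unit, stripe count, object size); the function name is hypothetical and this is not the osdc code itself.

```python
def file_to_object_offset(file_off, stripe_unit, stripe_count, object_size):
    """Map a file offset to (object number, offset inside that object).

    Hypothetical helper for illustration; assumes object_size is a
    multiple of stripe_unit.
    """
    stripes_per_object = object_size // stripe_unit
    blockno = file_off // stripe_unit            # stripe unit index in the file
    stripeno = blockno // stripe_count           # row of stripe units
    stripepos = blockno % stripe_count           # column, i.e. object within the set
    objectsetno = stripeno // stripes_per_object
    objectno = objectsetno * stripe_count + stripepos
    block_start = (stripeno % stripes_per_object) * stripe_unit
    return objectno, block_start + file_off % stripe_unit
```

With a stripe unit of 4, a stripe count of 2 and an object size of 8, file offset 13 lands at offset 5 of object 1, which is why a truncate size expressed as a file offset cannot be passed to an object unchanged.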
Josh Durgin
bb64adb7ac Merge pull request #303 from ceph/wip-librbd-config-create
Reviewed-by: Sage Weil <sage.weil@inktank.com>
2013-05-21 11:16:53 -07:00
Danny Al-Gaaf
4ba70f8fb4 librbd/internal.cc: fix resource leak
Call release() on librados::AioCompletion to free storage before
leaving the loop or calling new again.

CID 1021213 (#1 of 1): Resource leak (RESOURCE_LEAK)
  leaked_storage: Variable "rados_completion" going out of scope leaks
  the storage it points to.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2013-05-17 13:54:09 +02:00
Josh Durgin
aacc9adc4e librbd: make image creation defaults configurable
Programs using older versions of the image creation functions can't
set newer parameters like image format and fancier striping.

Setting these options lets them use all the new functionality without
being patched and recompiled to use e.g. rbd_create3().
This is particularly useful for things like qemu-img, which does not
know how to create format 2 images yet.

Refs: #5067
backport: cuttlefish, bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-16 15:28:40 -07:00
Josh Durgin
13ae13a906 librbd: add options to enable balanced or localized reads for snapshots
Since snapshots never change, it's safe to read from replicas for them.
A common use for this would be reading from a parent snapshot shared by
many clones.

Convert LibrbdWriteback and AioRead to use the ObjectOperation api
so we can set flags. Fortunately the external wrapper holds no data,
so its lifecycle doesn't need to be managed.

Include a simple workunit that sets the flags in various combinations
and looks for their presence in the logs from 'rbd export'.

Fixes: #3064
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-12 19:31:22 -07:00
Sage Weil
b5e9b56fc9 Merge pull request #272 from ceph/wip-rbd-parallel
Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-10 17:13:12 -07:00
Josh Durgin
93f2794233 Throttle: move start_op() to C_SimpleThrottle constructor
This is done by all callers right before constructing this.
Since C_SimpleThrottle is already responsible for calling ->end_op(),
it makes sense to call start_op() there too.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 16:17:11 -07:00
Josh Durgin
613d7471a2 librbd: run copy in parallel
Instead of using read_iterate(), loop over each period of objects in
the source, read from them asynchronously, and then asynchronously
write to the destination.

The callbacks make this a bit more complex, but it can perform much
better.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 16:17:10 -07:00
Josh Durgin
fb299d3819 librbd: move completion release into rbd_ctx_cb()
All the users of rbd_ctx_cb() do this separately right now, but
there's no reason to keep the completion around after the nested
completion has been called. Also declare rbd_ctx_cb() in the header
so it can be used before its definition.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 16:17:10 -07:00
Josh Durgin
a6d0a25435 librbd: parallelize and simplify flatten
Flattening reads the logical child object from the parent image, and
then does a copyup operation if the data is non-zero. This is
equivalent to doing a zero-length write to each object in the
child image. Do this instead, so that we can easily control how
many are in flight, and eliminate some code as well.

Since we no longer read from the parent within the flatten function,
the buffer is not needed. It would be leaked in some error conditions,
but since it's unnecessary we can just get rid of it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 16:17:10 -07:00
Josh Durgin
bfa106694d librbd: only send non-zero copyup data
If the parent image is logically zero for the range of a child object,
it's equivalent to the object not existing. Save some I/O and network
bandwidth and don't send the useless zeroes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 16:17:10 -07:00
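A minimal sketch of the check described above, assuming the parent data for the child object's range has already been read into a bytes buffer. The function name is hypothetical and does not correspond to a librbd symbol.

```python
def copyup_payload(parent_data: bytes):
    """Return the data worth sending with a copyup, or None if it is all zeroes.

    An all-zero (or empty) parent range is equivalent to the object not
    existing, so there is nothing useful to send.
    """
    if not any(parent_data):
        return None
    return parent_data
```

A caller would skip the copyup payload when this returns None and send the buffer otherwise.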
Josh Durgin
cfece23d5c librbd: parallelize rollback
Use a SimpleThrottle like trim_image() to limit the number of
requests in flight.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 16:17:09 -07:00
Josh Durgin
4095641016 librbd: delete more than one object at once
Speed up deletions when resizing down or removing an image by keeping
up to 10 operations in flight by default.

Refs: #2256
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-10 12:00:11 -07:00
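The idea of bounding the number of operations in flight can be sketched as follows. This is an illustration in Python with threads, not the librbd SimpleThrottle; the class mirrors the start_op()/end_op() naming used in these commits, while remove_objects() and its remove_one parameter are hypothetical.

```python
import threading


class SimpleThrottle:
    """Blocks start_op() while max_in_flight operations are outstanding."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._current = 0
        self._ret = 0
        self._cond = threading.Condition()

    def start_op(self):
        with self._cond:
            while self._current >= self._max:
                self._cond.wait()
            self._current += 1

    def end_op(self, ret):
        with self._cond:
            self._current -= 1
            if ret < 0 and self._ret == 0:
                self._ret = ret          # remember the first error
            self._cond.notify_all()

    def wait_for_ret(self):
        """Wait for all outstanding operations; return 0 or the first error."""
        with self._cond:
            while self._current > 0:
                self._cond.wait()
            return self._ret


def remove_objects(names, remove_one, max_in_flight=10):
    """Call remove_one(name) for each name with bounded concurrency."""
    throttle = SimpleThrottle(max_in_flight)

    def worker(name):
        ret = -1                         # reported if remove_one raises
        try:
            ret = remove_one(name)
        finally:
            throttle.end_op(ret)

    threads = []
    for name in names:
        throttle.start_op()              # blocks once the limit is reached
        t = threading.Thread(target=worker, args=(name,))
        t.start()
        threads.append(t)
    ret = throttle.wait_for_ret()
    for t in threads:
        t.join()
    return ret
```

The default of 10 matches the figure in the commit message; everything else about the interface is an assumption made for the example.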
Sage Weil
2bc0883072 librbd: fix possible use-after-free
(of the pointer)

CID 966634 (#1 of 1): Use after free (USE_AFTER_FREE)
2. use_after_free: Using freed pointer "ictx".

Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-09 10:49:00 -07:00
Sage Weil
0093d704e6 librbd: fix i386 build
Signed-off-by: Sage Weil <sage@inktank.com>
2013-04-23 16:18:53 -07:00
Sage Weil
857c88e017 librbd: add read_iterate2 call with fixed argument type
The existing read_iterate takes a size_t for the length, which is only 4GB
on 32-bit machines.  Instead, take a uint64_t length for the new
read_iterate2().

Return 0 instead of the number of bytes read; this makes the user-facing
API a bit simpler.

Fixes: #4665
Signed-off-by: Sage Weil <sage@inktank.com>

keep bytes return from internal method
2013-04-23 15:57:26 -07:00
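The 4GB limit follows from the width of the type: a 32-bit size_t can only hold values below 2^32 bytes. A small sketch of what happens to a larger length, using a hypothetical helper that models conversion to an unsigned integer of a given width:

```python
GIB = 1 << 30


def truncate_to_size_t(length, size_t_bits=32):
    """Value a length takes after conversion to an unsigned integer of this width."""
    return length & ((1 << size_t_bits) - 1)
```

Under this model a 10 GiB request becomes 2 GiB with a 32-bit size_t, while a 64-bit length, as taken by read_iterate2(), preserves it.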
Sage Weil
6c798ed940 librbd: implement read not in terms of read_iterate
The read() method returns the bytes read, trimmed to the end of the image;
use the other read() variant to do this (which uses aio_read()) instead of
read_iterate().

Signed-off-by: Sage Weil <sage@inktank.com>
2013-04-23 15:45:19 -07:00
Sage Weil
4865fb73c6 Merge pull request #214 from ceph/wip-objectcacher-handler-ordered
keep write responses to clones in order

Reviewed-by: Sage Weil <sage@inktank.com>
2013-04-16 15:48:15 -07:00
Sage Weil
899456617f librbd: flush on diff_iterate
The diff_iterate() tests fail when caching is enabled because recent writes
aren't visible to listsnaps.  Flush from diff_iterate to ensure that they
are.  Someday, maybe, we might make diff_iterate() inspect the cache
contents to make this more efficient, but for now that is not necessary.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-04-16 15:46:32 -07:00
Josh Durgin
06d05e5ed7 LibrbdWriteback: complete writes strictly in order
RADOS returns writes to the same object in the same order. The
ObjectCacher relies on this assumption to make sure previous writes
are complete and maintain consistency. Reads, however, may be
reordered with respect to each other. When writing to an rbd clone,
reads to the parent must be performed when the object does not exist
in the child yet. These reads may be reordered, resulting in the
original writes being reordered. This breaks the assumptions of the
ObjectCacher, causing an assert to fail.

To fix this, keep a per-object queue of outstanding writes to an
object in the LibrbdWriteback handler, and finish them in the order in
which they were sent.

Fixes: #4531
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-10 16:57:08 -07:00
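The per-object queue described above can be sketched as follows. This is a simplified, single-threaded illustration; the class and method names are hypothetical and are not taken from LibrbdWriteback.

```python
from collections import defaultdict, deque


class OrderedWriteCompleter:
    """Finishes writes to each object in the order they were submitted."""

    def __init__(self):
        # object name -> queue of outstanding writes, oldest first
        self._queues = defaultdict(deque)

    def submit(self, oid, callback):
        """Record a write to object oid; callback(ret) runs when it may finish."""
        entry = {"done": False, "ret": None, "callback": callback}
        self._queues[oid].append(entry)
        return entry

    def on_reply(self, oid, entry, ret):
        """Handle a reply, which may arrive out of order."""
        entry["done"] = True
        entry["ret"] = ret
        queue = self._queues[oid]
        # Complete only from the front, so an early reply waits for older writes.
        while queue and queue[0]["done"]:
            finished = queue.popleft()
            finished["callback"](finished["ret"])
        if not queue:
            del self._queues[oid]
```

If the reply for the second write arrives first, its callback is held back until the first write's reply arrives, and then both run in submission order.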
Josh Durgin
909dfb7d18 LibrbdWriteback: removed unused and undefined method
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-10 12:22:02 -07:00
Josh Durgin
9d19961539 LibrbdWriteback: use a tid_t for tids
An int could be much smaller, leading to overflow and bad behavior.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-10 12:06:36 -07:00
Josh Durgin
870f9cd421 WritebackHandler: make read return nothing
The tid returned by reads is ignored, and would make tracking writes
internally more difficult by using the same id-space as them. Make read
void and update all implementations.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-10 12:03:04 -07:00
Sage Weil
4e847e8b2c librbd: simplify diff_iterate calls to list_snaps
We don't need the size.  Use the simpler API call.

Signed-off-by: Sage Weil <sage@inktank.com>
2013-04-02 18:13:01 -07:00
Josh Durgin
c0e3f642b1 librbd: add C and python bindings for diff_iterate
The python interface is a bit awkward since it maps directly
to the C interface, but it'll work well enough and not use
tons of memory.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-01 08:56:07 -07:00
Josh Durgin
33d1a2fc88 librbd: return -ENOENT from diff_iterate when the snap doesn't exist
This is a bit more helpful than -EINVAL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-01 08:56:07 -07:00
Josh Durgin
c680531e07 librbd: change diff_iterate interface to be more C-friendly
Use int instead of bool for the callback, and make it represent
whether the data exists, rather than the opposite, since callers
are likely to test for whether it's data instead of whether it's zeroes.

Change the return value to 0, since an int64_t will wrap around
for large reads, and there's no value in reporting the length
read when it will always be the length requested clipped to the
size of the image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-01 08:56:07 -07:00
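The callback convention described above can be sketched with a stand-in iterator. This is not the librbd implementation: the extents argument is a hypothetical precomputed list, and aborting on a negative callback return is an assumption of the sketch.

```python
def diff_iterate(extents, offset, length, callback):
    """Call callback(offset, length, exists) for each extent in the range.

    extents: iterable of (offset, length, exists) tuples (hypothetical input).
    Returns 0 on success rather than the number of bytes covered.
    """
    end = offset + length
    for ext_off, ext_len, exists in extents:
        start = max(ext_off, offset)
        stop = min(ext_off + ext_len, end)
        if start >= stop:
            continue                     # extent lies outside the requested range
        ret = callback(start, stop - start, 1 if exists else 0)
        if ret < 0:
            return ret                   # assumed: a negative return aborts
    return 0
```

Because the flag means "data exists", a caller collecting changed data can write `if exists:` directly instead of negating a "zeroed" flag.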
Sage Weil
5b0c68b928 doc/dev/rbd-diff: specify that metadata records come before data
Signed-off-by: Sage Weil <sage@inktank.com>
2013-03-31 23:32:41 -07:00