The refresh_seq is incremented in notify_change when calling
notify_async_complete after the locker owner completes the resize
request.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
If the item weight is 0 we don't want to divide; instead draw a minimal
value.
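A minimal sketch of the guard, assuming a straw-style draw of the form
log(u) / weight. The helper name weighted_draw and the use of doubles
are hypothetical and only illustrate the zero-weight branch:

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: compute a draw for one item.
// `u` is assumed to be a uniform value in (0, 1]; `weight` is the
// item's weight.
double weighted_draw(double u, std::uint32_t weight) {
  if (weight == 0) {
    // Never divide by a zero weight: hand back the smallest possible
    // draw so that this item cannot win against a weighted item.
    return std::numeric_limits<double>::lowest();
  }
  // log(u) <= 0, so a larger weight pulls the draw closer to zero.
  return std::log(u) / static_cast<double>(weight);
}
```

With this shape, a zero-weight item always loses to any item whose
weight is positive.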
Fixes: #11357
Reported-by: Yann Dupont <yd@objoo.org>
Tested-by: Yann Dupont <yd@objoo.org>
Signed-off-by: Sage Weil <sage@redhat.com>
Override the RBD default image format back to version 1
to ensure tests properly cover the old format.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
osd: fix PG::all_unfound_are_queried_or_lost for non-existent osds
Reviewed-by: Kefu Chai <tchaikov@gmail.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Fix#: 2862
Changes to some of the common files for command line parsing
Change to ceph_argparse.cc
-------------------------
Added function ceph_arg_value_type().
Given an input, it determines:
i) whether the input is an option
ii) whether the input is numeric.
It sets the flags bool_option and bool_numeric accordingly.
This function is called by ceph_argparse_witharg() to check whether
its input parameter is numeric rather than an option. If the input
parameter to ceph_argparse_witharg() turns out to be an option, the
user did not supply a value for the option.
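The classification described above could look roughly like the sketch
below. It is not the code from this change: the struct ArgValueType,
the function classify_arg, and the exact rules for what counts as
numeric are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type mirroring the two flags named above.
struct ArgValueType {
  bool is_option;   // looks like "-x" or "--name"
  bool is_numeric;  // looks like 12, -5 or 1.5
};

// Hypothetical classifier: numeric input wins over option syntax, so
// "-5" is reported as a number and not as an option.
ArgValueType classify_arg(const std::string& s) {
  ArgValueType t{false, false};
  if (s.empty()) return t;

  std::size_t i = (s[0] == '-' || s[0] == '+') ? 1 : 0;
  bool seen_digit = false, seen_dot = false, ok = true;
  for (; i < s.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    if (std::isdigit(c)) {
      seen_digit = true;
    } else if (c == '.' && !seen_dot) {
      seen_dot = true;
    } else {
      ok = false;
      break;
    }
  }
  t.is_numeric = ok && seen_digit;
  t.is_option = !t.is_numeric && s[0] == '-' && s.size() > 1;
  return t;
}
```

A caller that expects a value can then treat an input flagged as an
option as "no value supplied".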
Changes to strtol.cc
--------------------
Changes to strict_strtoll() and strict_strtol()
The responsibility of both functions is to convert a string to a long
or an int.
It is better not to print an error message inside these functions; the
caller, which understands the purpose of the call, is in a better
position to report the error.
These functions now only build a generic error message, and the caller
decides what to do with it.
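As an illustration of that division of labour, here is a small sketch
built on std::strtoll. The name parse_ll_strict and the message texts
are hypothetical; the point is only that the helper fills in an error
string and leaves reporting to the caller.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for a strict string-to-integer helper.
// On failure it fills *err and returns 0; it never prints anything.
long long parse_ll_strict(const char* str, int base, std::string* err) {
  char* end = nullptr;
  errno = 0;
  long long v = std::strtoll(str, &end, base);
  if (errno == ERANGE) {
    *err = std::string("value out of range: '") + str + "'";
    return 0;
  }
  if (end == str || *end != '\0') {
    // Nothing was parsed, or trailing garbage followed the number.
    *err = std::string("expected an integer, got: '") + str + "'";
    return 0;
  }
  err->clear();
  return v;
}
```

The caller checks whether the error string is empty and decides
whether to log it, print usage, or ignore it.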
Signed-off-by: Rajesh Nambiar <rajesh.n@msystechnologies.com>
Fixes: #2862
Changes related to rbd file
Changes to rbd.cc
-----------------
Change 1: line# 2744 to 2747
If the option is --order, check its value: if it is less than 12 or
greater than 25, report an error. Valid values for --order are 12 to
25.
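A sketch of such an early range check follows. The helper name
validate_order and the error text are hypothetical; only the bounds
12 and 25 come from this message.

```cpp
#include <string>

// Bounds stated in this commit message.
constexpr int kMinOrder = 12;
constexpr int kMaxOrder = 25;

// Hypothetical helper: returns true if `order` is acceptable,
// otherwise fills *err with a message for the caller to report.
bool validate_order(long long order, std::string* err) {
  if (order < kMinOrder || order > kMaxOrder) {
    *err = "order must be between 12 and 25";
    return false;
  }
  err->clear();
  return true;
}
```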
Change 2: Removal of validation from line# 3205 to 3209
Since the value of --order is now checked earlier, the check here is
no longer needed.
Signed-off-by: Rajesh Nambiar <rajesh.n@msystechnologies.com>
When handling a proxied snap_create operation, the client which
invoked the snap_create should send the header update notification
to avoid a possible race condition where snap_create completes but
the client doesn't see the new snapshot (since it didn't yet receive
the notification).
Fixes: #11342
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Moved all parent overlap computation to within AioRequest so that
callers don't need to independently compute the overlap. Also
removed the need to pass the snap_id for write operations since
it can only be CEPH_NOSNAP.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The docker image created by docker-tests.sh for a given operating system
is parameterized with the user name. If two users on the same machine
try to use the same image, they will compete and fail with an error
like:
... user get supplementary groups Unable to find user ...
Add $USER to the image name to reflect the fact that the image
contains an account for this user.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
The meta file is deleted only if the bucket meta data is not synced
Signed-off-by: Orit Wasserman <owasserm@redhat.com>
Fixes: #11149
Backport: hammer, firefly
When m_readahead_pos reaches the limit, there's no need to call
_compute_readahead to calculate the readahead. Just return with no
readahead.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
triggering request is big enough
If the read that triggers the continued readahead is so large that it
extends past m_readahead_pos, start the readahead from m_last_pos.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Instead of flagging the HEAD image object map as invalid when an
error occurs with a snapshot object map, properly flag the snapshot
as invalid.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Images no longer track per-snapshot features. snapshot_list
no longer needs to retrieve per-snapshot features.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
In preparation for dynamic feature bits, it probably doesn't
make sense to have snapshots have different features enabled.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Add a test for the activation of the memstore objectstore and verify
that it works without specifying a journal.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
Add the test_pool_read_write function to share the rados put / get test
that demonstrates the osd that has been created can actually be used.
Use it from both the regular device and dmcrypt tests.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
Instead of duplicating the device construction / destruction logic for
dmcrypt tests, use test_setup_dev_and_run to do it. It is now able to
recover from devmapper leftover which may occur when a cryptsetup test
fails.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
The activate_dmcrypt_plain_dev_body and activate_dmcrypt_dev_body
functions are almost identical, merge them and differentiate with an
argument.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
Move test_activate_dev to test_setup_dev_and_run and make it
run the function given in argument. test_activate_dev calls
test_setup_dev_and_run and no longer needs to implement device
allocation or destruction.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
Address all possible failure cases, when ceph-disk.sh completes or when
it starts with leftovers from a previous interrupted run. It is assumed
that ceph-disk.sh may crash at any point.
* umount all mount points that belong to ceph-disk.sh (check the
absolute path of the directory)
* dmsetup remove all device mapper nodes found to hold a loop device
that ceph-disk.sh no longer uses
* losetup --detach all loop devices that ceph-disk.sh no longer uses
Signed-off-by: Loic Dachary <ldachary@redhat.com>
The tests explicitly return on error when relevant. Add two error cases:
* detect when the allocation of a loop device fails.
* in the outer loop, return immediately if one of the tests fails
Signed-off-by: Loic Dachary <ldachary@redhat.com>
run-make-check.sh relies on ccache. If ~/.ccache is not bind mounted and
$HOME is not bind mounted either, ./configure will fail with an obscure
error because it cannot create the directory. Create the directory if it
does not exist already and avoid this problem. The worst that can happen
is that an empty .ccache directory is created and never used which
should not be a major inconvenience.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
When handling a GET request for a large object (with multiple chunks),
the code currently flushes the cached data first and then issues the
AIO request for the next chunk. This can serialize retrieving data from
the OSD and sending it to the client. This patch swaps the two
operations.
Fixes: 11322
Signed-off-by: Guang Yang <yguang@yahoo-inc.com>
If the attempt to deregister the snapshot from the parent
image fails with -ENOENT, ignore the error as it is safe
to assume that the child is not associated with the parent.
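The error handling amounts to the pattern sketched below. The
function deregister_child and the callable remove_child are
hypothetical stand-ins for the real deregistration call:

```cpp
#include <cerrno>
#include <functional>

// `remove_child` is a hypothetical callable standing in for the call
// that deregisters the child from its parent; it returns 0 on success
// or a negative errno value.
int deregister_child(const std::function<int()>& remove_child) {
  int r = remove_child();
  if (r == -ENOENT) {
    // No registration was found, so the child is already detached.
    return 0;
  }
  return r;  // success, or a real error for the caller to handle
}
```

Any other error is still propagated to the caller unchanged.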
Fixes: #11113
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit cf8094942c)
get_parent_info should return -ENOENT if the image does not
have an associated parent image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 21afd0ef8e)
...in order to avoid issuing lots of separate RADOS ops
if expiring a segment that contained requests for many
clients.
Signed-off-by: John Spray <john.spray@redhat.com>
If we want to discard a range of an object, we currently zero the range
(using fallocate to punch a hole). In general this introduces some
overhead (extra writes). If the filesystem on top of RBD holds lots of
small files, this behavior brings a big performance penalty.
Add a flag that allows the user to control whether the range is
zeroed.
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
- If the journal is required, require it.
- If the journal is not allowed, do not allow one to be specified.
- If the journal is not wanted, do not set one up by default when none is
provided.
See #9580
Signed-off-by: Sage Weil <sage@redhat.com>
Remove code duplication by generalizing ceph_argparse_with{int,float,longlong}
routines - make one template function for those cases.
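The idea can be sketched with a stream-based converter. This is an
illustration only: parse_value is a hypothetical name and the real
routines take different arguments.

```cpp
#include <sstream>
#include <string>

// Hypothetical generic converter: one template replaces separate
// int / float / long long variants. Returns false and fills *err if
// the text is not a complete value of type T.
template <typename T>
bool parse_value(const std::string& text, T* out, std::string* err) {
  std::istringstream in(text);
  T v{};
  if (!(in >> v)) {
    *err = "cannot parse '" + text + "'";
    return false;
  }
  if (in.peek() != std::istringstream::traits_type::eof()) {
    *err = "trailing characters in '" + text + "'";
    return false;
  }
  *out = v;
  err->clear();
  return true;
}
```

The value type is deduced from the output pointer, so callers do not
need to pick a type-specific function name.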
Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com>
Remove the erroneous arg from the ceph_argparse_witharg call made when
'--io-pattern' is parsed: name lookup makes the compiler resolve this
call to bool ceph_argparse_witharg(
std::vector<const char*> &args,
std::vector<const char*>::iterator &i, std::string *ret, ...).
The &err argument is then wrongly interpreted as a char * option name
to be compared with the argument pointed to by i.
Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com>
The librbd unit tests currently only test the old image format. Ensure
the new format and its possible features are also tested.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
FileJournal needs stuff from blkdev.c in libcommon.
./.libs/libosd.a(libos_la-FileJournal.o): In function `FileJournal::_open_block_device()':
/home/nwatkins/src/ceph/src/os/FileJournal.cc:139: undefined reference to `get_block_device_size(int, long*)'
/home/nwatkins/src/ceph/src/os/FileJournal.cc:161: undefined reference to `block_device_support_discard(char const*)'
./.libs/libosd.a(libos_la-FileJournal.o): In function `FileJournal::do_discard(long, long)':
/home/nwatkins/src/ceph/src/os/FileJournal.cc:1587: undefined reference to `block_device_discard(int, long, long)'
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
If we are in recovery_wait, we might not recover that object as part of
recover_primary for some time. Worse, if we are waiting on a backfill
which is blocked waiting on a copy_from on the missing object in
question, it can become a deadlock.
Fixes: 11244
Backport: firefly
Signed-off-by: Samuel Just <sjust@redhat.com>
This way, even empty objects have the hinfo key written. That way,
touch and touch->append->truncate end up with the same state.
Fixes: 11265
Signed-off-by: Samuel Just <sjust@redhat.com>
Add some performance-critical configuration options.
Also group and polish the description of each option to make it
clearer, and change the defaults from 0 to the actual values.
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
This will help prevent drift in the future. It also makes it clear
that the flags are supposed to have the same values.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
Just verify that the read gets the right data, to demonstrate that
passing a flag doesn't cause problems.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
We don't want to spit out the warning twice, and we don't have cct
anyway.
Also test_init is annoying; we should try to kill it.
Signed-off-by: Sage Weil <sage@redhat.com>
This reverts commit 4bd2bd6bb8.
These constants are the only way these flags are exposed through the C
interface. C users can't include librados.hpp. Ideally we would have
only one version of these (just the C ones), but the C++ ones came
first and need to stay for backwards compatibility.
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd: coalesce into single omap_setkeys for normal writes
Tested-by: Andreas Bluemle <andreas.bluemle@itxperts.de>
Reviewed-by: David Zafman <dzafman@redhat.com>
The user mtime and local_mtime are normally set in finish_ctx based on the
value of ctx->mtime; clear that to avoid this update.
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
[Sage: simplified]
Signed-off-by: Sage Weil <sage@redhat.com>
Ensure that if this was modified during
a segment, and the session is not
persisted for some other reason, we
go ahead and persist it at the end
of the segment.
Fixes: #11048
Signed-off-by: John Spray <john.spray@redhat.com>
When MDS is no longer laggy, it should process deferred messages
first, then process newly received messages.
Fix: #11258
Signed-off-by: Yan, Zheng <zyan@redhat.com>
ISA-L 2.13 brings better performance on Avoton (20%). There's no impact on Xeon
platform. The details are in the release notes.
There's a new API ec_encode_data_update() for incremental encoding
and decoding. The other high-level APIs remain the same as in 2.10.
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Nightly ran and encountered a situation in which fstat following
ftruncate reported a size not equal to the truncated size.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
The CLOG_CHANNEL_DEFAULT constant was being abused for two purposes:
- the default channel to log messages to
- the name of the config option key in the key/value pair string that is
used for the default option, e.g. "default=true foo=false bar=false"
Fix this by making the config option key CLOG_CONFIG_DEFAULT_KEY and
replacing throughout, and changing CLOG_CHANNEL_DEFAULT to "cluster" (as
it should be and has been historically).
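To make the two roles concrete, here is a sketch with two separate
constants. The constant values, the parser, and the lookup helper are
assumptions for illustration; in particular the key text "default" is
inferred from the example string above.

```cpp
#include <map>
#include <sstream>
#include <string>

// Two distinct roles, two distinct constants (values assumed here).
const std::string kChannelDefault = "cluster";    // channel to log to
const std::string kConfigDefaultKey = "default";  // fallback key

// Hypothetical parser for strings like "default=true foo=false".
std::map<std::string, std::string> parse_channel_options(
    const std::string& s) {
  std::map<std::string, std::string> out;
  std::istringstream in(s);
  std::string tok;
  while (in >> tok) {
    std::string::size_type eq = tok.find('=');
    if (eq == std::string::npos) continue;  // skip malformed tokens
    out[tok.substr(0, eq)] = tok.substr(eq + 1);
  }
  return out;
}

// Look up a channel, falling back to the entry under the default key.
std::string option_for_channel(
    const std::map<std::string, std::string>& kv,
    const std::string& channel) {
  auto it = kv.find(channel);
  if (it != kv.end()) return it->second;
  it = kv.find(kConfigDefaultKey);
  return it != kv.end() ? it->second : std::string();
}
```

Here a lookup for the "cluster" channel falls back to the value stored
under the default key, without the channel name and the key sharing a
constant.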
Fixes: #11177
Signed-off-by: Sage Weil <sage@redhat.com>
...and update it via wait_for_flush completions, so
that its updates are ordered with respect to the
callbacks that happen after a log event is persisted.
Fixes: #10368
Signed-off-by: John Spray <john.spray@redhat.com>
cls_rbd: fix read past end of bufferlist c_str() in debug log msg
Reviewed-by: Haomai Wang <haomaiwang@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
ceph-mon needs crushtool to be in PATH. Don't set if it is run
from ceph_vstart_wrapper, which already sets it as it needs.
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
When we want to get the mdsmap, we call get_version(); a return value
of err == 0 means success.
The assert checked r == 0, but r does not change in this flow, so the
assert always triggers and causes the mon to fail.
I think the check should be:
assert(err == 0)
so that it verifies the return value of get_version().
If you have any questions, feel free to let me know.
Thanks!
Signed-off-by: Vicente Cheng <freeze.bilsted@gmail.com>
Fixes and improvements this brings:
* Use _exit(2) instead of exit(3) if exec(3) failed (_exit does not call
the atexit functions, which would otherwise remove the asock and pid
files from the child process).
* Close all parent descriptors before exec(3).
* Log crushtool stderr.
* SubProcess is covered by unit tests.
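The first point can be illustrated with a plain fork/exec sketch. This
is not the SubProcess class; run_child is a hypothetical helper, and
the exit code 127 is an arbitrary choice for "exec failed".

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run `path` with no extra arguments and return
// the child's exit status, or -1 if it could not be obtained.
int run_child(const char* path) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    execl(path, path, static_cast<char*>(nullptr));
    // Only reached if exec failed. _exit() leaves immediately and
    // skips the atexit handlers inherited from the parent, so the
    // child cannot clean up files that belong to the parent.
    _exit(127);
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```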
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
...by calling get_down_mds_set instead of get_failed_mds_set.
Also remove a redundant if(true) around this region.
Signed-off-by: John Spray <john.spray@redhat.com>
For places where we would like to treat failed
and damaged MDS ranks the same, like detecting
when someone has dropped offline.
Signed-off-by: John Spray <john.spray@redhat.com>
The initial is_degraded() check guarantees that
the 'in' set is equal to the 'up' set. Later,
this calls get_mds_set and assigns it to a variable
called 'up'.
It's clearer to use get_up_mds_set into the variable
called up (this was confusing when debugging #11218
which was itself a result of is_degraded() ignoring
damaged ranks).
Signed-off-by: John Spray <john.spray@redhat.com>
This mismatch about whether pool IDs are signed or unsigned is
a persistent annoyance. I'm now casting the unsigned down to signed space
because apparently the OSD is using negative IDs for temporary object
namespaces.
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
If a directory is complete, we *really* want to keep the exclusive cap
so that we don't end up needing to do MDS lookup requests on every cache
miss.
Fixes: #11226
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
The pg_log.add() call already dirties the log such that the later
write_log() call will write it. There is no need to encode it separately
here and then explicitly omap_setkeys() it.
Signed-off-by: Sage Weil <sage@redhat.com>
Previously, we did not actually set it when we got a pg creation message from
the mon. It would actually get set on the first start_peering_interval after
that point. If we don't get that far, but do send a stat update to the mon, we
can end up with 11197. Instead, let's just set it and clear it upon entry into
and exit from the Primary state.
Fixes: 11197
Signed-off-by: Samuel Just <sjust@redhat.com>
A filelock in the LOCK_XSYN state does not allow the Fs cap, so the
client can't mark the directory as complete when handling the readdir
reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
A common mistake upon osd loss is to remove the osd from the crush map
before marking the osd lost. This tends to make it so that the user
can no longer mark the osd lost to satisfy all_unfound_are_queried_or_lost.
The simple solution is for all_unfound_are_queried_or_lost to ignore
the osd if it does not exist.
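A simplified sketch of the resulting check is shown below. The types
and names (OsdInfo, all_queried_or_lost, the plain map standing in for
the osdmap) are hypothetical and much simpler than the PG code:

```cpp
#include <map>
#include <set>

// Hypothetical, simplified view of one osdmap entry.
struct OsdInfo {
  bool lost;
};

// Returns true if every OSD that might hold unfound objects has been
// queried, is marked lost, or no longer exists in the map.
bool all_queried_or_lost(const std::set<int>& might_have_unfound,
                         const std::set<int>& queried,
                         const std::map<int, OsdInfo>& osdmap) {
  for (int osd : might_have_unfound) {
    if (queried.count(osd)) continue;
    auto it = osdmap.find(osd);
    if (it == osdmap.end()) continue;  // osd does not exist: ignore it
    if (it->second.lost) continue;
    return false;  // an existing, unqueried osd that is not lost
  }
  return true;
}
```

An osd that was removed from the map before being marked lost no
longer blocks the check.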
Fixes: #10976
Backports: firefly,giant
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Reads don't need to be ordered, so when a proxy read comes back from
the base tier, it is not necessarily at the front of the in-progress
list.
Fixes: #11211
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>