We're about to use this at a basic level, to identify when we have
"classic" monitors in-quorum, but could also do something more
sophisticated like a set intersection on the commands.
Signed-off-by: Greg Farnum <greg@inktank.com>
If the Elector doesn't receive a set of commands from the elected leader, it
assumes the monitor is "classic" and uses the Dumpling command set as
the leader set.
Signed-off-by: Greg Farnum <greg@inktank.com>
If we're the leader, just point to our local set. Disseminating these
will let peons advertise the full command set supported by the leader.
INCOMPLETE: does not yet handle winning Electors who do not send a command set.
Signed-off-by: Greg Farnum <greg@inktank.com>
We want to be able to validate commands against both the leader and
local command sets, so make that functionality generic.
Signed-off-by: Greg Farnum <greg@inktank.com>
--show-utilization* outputs only if --show-statistics is set, which is
confusing. Instead of failing, set --show-statistics to avoid the
confusion.
Signed-off-by: Loic Dachary <loic@dachary.org>
We want to actually encode each element and keep it, rather than
writing each one at the position after the array end!
Signed-off-by: Greg Farnum <greg@inktank.com>
CrushWrapper::start_choose_profile allocates map->choose_tries with
choose_total_tries elements. When crush_choose_firstn sets a value, it
tests against map->choose_local_tries which could lead to memory
corruption if map->choose_total_tries is smaller than
map->choose_local_tries.
Another indesirable but non fatal side effect is that the output crushtool
--show-choose-tries will be truncated to choose_local_tries which is
set to a lower value than choose_total_tries by the default tuneables.
Signed-off-by: Loic Dachary <loic@dachary.org>
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Checking for fdatasync uses the same approach as the qemu configure
script. The relevant commit is d1722a27f552a22561104210e0afad4577878e53.
Here is a copy of the commit message which explains the check:
Under Darwin, a symbol exists for the fdatasync() function, so that our
link test succeeds. However _POSIX_SYNCHRONIZED_IO is set to '-1'.
According to POSIX:2008, a value of -1 means the feature is not
supported.
A value of 0 means supported at compilation time, and a value greater 0
means supported at both compilation and run time.
Enable fdatasync() only if _POSIX_SYNCHRONIZED_IO is '>0'.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Ensure that a crush file always compiled deterministically, even though
the default values for *new* maps has changed.
Signed-off-by: Sage Weil <sage@inktank.com>
These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Since detach_bucket is a private helper solely used by move_bucket which
contains another ( correct ) safeguard, the code cannot be reached and
the problem can never happen. If another function uses detach_bucket,
it may happen.
Signed-off-by: Loic Dachary <loic@dachary.org>
The following was introduced in 2012 by a2d0cff1b0
// un-set the device name so we can use add_item later
build_rmap(name_map, name_rmap);
name_map.erase(id);
name_rmap.erase(id_name);
when insert_item refused to move a bucket for which a name already
exists. It was changed in 2013 by
4e2557a038 and now supports it. The
TestCrushWrapper unittest for move_bucket pass.
Signed-off-by: Loic Dachary <loic@dachary.org>
This is allows rbd-bench to detect http://tracker.ceph.com/issues/6938
when combined with rapidly changing the mon osd full ratio.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.
Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.
Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
There's a race between the client and osd with a newly marked full
osdmap. If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.
If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later. -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO. Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.
To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.
Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.
Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>