Now that the osd does not respond if it gets a map with the full flag
set first, clients need to resend all writes.
Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.
Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
There's a race between the client and osd with a newly marked full
osdmap. If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.
If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later. -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO. Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so that only some writes succeed.
To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.
Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.
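The client-side half of the fix can be pictured with a minimal sketch.
This is not the Objecter code; the class and method names below are
hypothetical, and it assumes the client sees the full map before the
map that clears the flag:

```python
class FullAwareClient:
    """Hypothetical client that holds writes while the map is full."""

    def __init__(self, send):
        self.send = send      # callable(op) that transmits a write to the osd
        self.map_full = False
        self.pending = []     # writes sent or queued but not yet acknowledged

    def write(self, op):
        self.pending.append(op)
        if not self.map_full:
            self.send(op)     # the osd may silently drop this if it is ahead of us

    def handle_ack(self, op):
        self.pending.remove(op)

    def handle_map(self, full):
        was_full = self.map_full
        self.map_full = full
        if was_full and not full:
            # The osd never answered writes it saw while full, so resend
            # everything that is still unacknowledged.
            for op in list(self.pending):
                self.send(op)
```

A write that the osd dropped stays in pending and goes out again once
the full flag clears, so the caller sees a delay rather than an error.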
Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.
Signed-off-by: Sage Weil <sage@inktank.com>
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.
Signed-off-by: Sage Weil <sage@inktank.com>
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:
If the tries value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.
In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.
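The two defaults described above can be summarized in a small sketch.
The function and parameter names are hypothetical, not the mapper's
own, and the descend_once tunable is deliberately left out:

```python
def recursive_tries(mode, rule_tries, normal_tries):
    """Pick the recursive attempt count for a chooseleaf step.

    mode         -- "firstn" or "indep"
    rule_tries   -- tries set by the rule, or 0 if the rule sets none
    normal_tries -- the ordinary (non-recursive) attempt count
    """
    if rule_tries:
        return rule_tries        # a rule-specified count applies to both modes
    if mode == "firstn":
        return normal_tries      # legacy behavior: pass the normal count through
    if mode == "indep":
        return 1                 # a single recursive attempt by default
    raise ValueError("unknown mode: %r" % (mode,))
```

For example, recursive_tries("firstn", 0, 50) passes 50 through, while
recursive_tries("indep", 0, 50) falls back to 1.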
Signed-off-by: Sage Weil <sage@inktank.com>
And reduce the depth of the hierarchy because three levels of buckets
capture the same cases as four levels.
Signed-off-by: Loic Dachary <loic@dachary.org>
Add the is_valid_crush_loc helper to test for invalid crush names in
insert_item and update_item, before performing any side
effect. Implement the associated unit tests.
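A minimal sketch of the validate-first pattern follows. The allowed
character set, the name-level helper and the flat bucket dictionary are
assumptions made for illustration; the real helper works on the crush
map itself:

```python
import errno
import re

# Assumed rule: names are non-empty and use only these characters.
_NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_crush_name(name):
    return _NAME_RE.fullmatch(name) is not None


def is_valid_crush_loc(loc):
    """loc maps a bucket type to a bucket name, e.g. {"host": "node1"}."""
    return all(is_valid_crush_name(t) and is_valid_crush_name(n)
               for t, n in loc.items())


def insert_item(buckets, item, loc):
    """Validate every name before touching `buckets` (name -> list of items)."""
    if not is_valid_crush_loc(loc):
        return -errno.EINVAL     # nothing has been modified yet
    for name in loc.values():
        buckets.setdefault(name, []).append(item)
    return 0
```

Because the check runs before any bucket is created, a bad name in loc
leaves the map exactly as it was.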
Signed-off-by: Loic Dachary <loic@dachary.org>
This is (as near to) a trivial ObjectStore backend for the OSD as we can
get at the moment. Everything is stored in memory. We are slightly
tricky with the locking, but not overly so.
On umount we dump everything out to disk, and on mount we load it all in
again, so we have some very coarse persistence/durability... just enough
to make this usable in a non-failure environment.
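The mount/umount persistence can be sketched as follows. This is a toy
illustration only: the class name, the JSON file format and the string
values are assumptions, and locking is omitted entirely:

```python
import json
import os


class MemStoreSketch:
    """Toy in-memory object store with dump-on-umount persistence."""

    def __init__(self, path):
        self.path = path      # file that holds the dump between mounts
        self.objects = {}     # object id -> data, all kept in memory

    def mount(self):
        # Load the previous dump, if there is one.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.objects = json.load(f)

    def write(self, oid, data):
        self.objects[oid] = data

    def read(self, oid):
        return self.objects[oid]

    def umount(self):
        # Dump everything in one go; a crash before this point loses all writes.
        with open(self.path, "w") as f:
            json.dump(self.objects, f)
```

Nothing reaches disk until umount, which is why this is only usable
when the process shuts down cleanly.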
Signed-off-by: Sage Weil <sage@inktank.com>
mon: MDSMonitor: trim versions and let PaxosService decide whether to propose
We were not trimming mdsmap versions and were generating a new map every time
we modified the pending value.
Now we make sure that MDSMonitor trims old maps (a configurable option
sets the maximum number of maps to keep, defaulting to 500, much like
other services do), and we also delegate to PaxosService the decision on
whether to propose our pending value.
We also make several modifications to 'ceph-kvstore-tool': the contents
of a given prefix:key can now be written to a file instead of stdout, and
the size of a given prefix:key's value can be queried.
'ceph report' was also modified so that we always output the first and last
committed versions for all services; up until this point, we would only output the
first committed version on all services, and only a few were also outputting the
last committed version.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This commit also adds two options to the MDSMonitor:
- mon_max_mdsmap_epochs: the maximum number of maps we'll keep (def: 500)
- mon_mds_force_trim: the version we want to trim to
As a result, 'get_trim_to()' returns one of the following values:
- if we have set mon_mds_force_trim, and this value is greater than the
  last committed version, trim to mon_mds_force_trim
- if we hold more than the max number of maps, trim to last - max
- if we have set mon_mds_force_trim and if we hold more than the max
  number of maps, and mon_mds_force_trim is lower than last - max,
  then trim to last - max
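How the forced value and the map limit interact can be sketched as
below. This is an illustration, not the MDSMonitor code: the parameter
names are hypothetical, 0 stands for "nothing to trim", and the check of
the forced value against the committed versions is left out:

```python
def get_trim_to(first_committed, last_committed, max_epochs=500, force_trim=0):
    """Return the version to trim to, or 0 if there is nothing to trim."""
    # A forced trim target, when set, is the starting point.
    floor = force_trim if force_trim > 0 else 0
    # Holding more than max_epochs maps: keep only the newest max_epochs,
    # unless the forced target already trims further than that.
    if last_committed - first_committed > max_epochs:
        keep_from = last_committed - max_epochs
        if floor < keep_from:
            return keep_from
    return floor
```

With versions 1 through 1000 and the default limit, the result is 500
whether the forced value is unset or set to 300, and 700 when the forced
value is 700.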
Backport: dumpling
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
A year after the last modification of the test that checks whether an
item was added twice to the same bucket, the subtree_contains test was
added a few lines above it, making the older test redundant.
Signed-off-by: Loic Dachary <loic@dachary.org>
A bucket name may be created as a side effect of insert_item. All names
in the loc argument are checked for validity at the beginning of the
method and an error is returned immediately if an invalid one is found.
This makes it unnecessary to check for errors when setting the name of
an item later on.
Signed-off-by: Loic Dachary <loic@dachary.org>
Call --mbrtogpt on journal run of sgdisk should the drive require a GPT ...
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Loic Dachary <loic@dachary.org>