There are times when users may need to make sure the client has the
latest osdmap, for example after sending a mon command modifying
pool properties.
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
squash "librados: add wait_for_latest_osdmap()"
The hashing is dependent on pool properties; capture (more of) it in a
method instead of having it in OSDMap.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
The hash value, if provided, becomes the ps (placement seed) portion of the
pg_t, skipping any hashing of the object name (or locator key).
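
As a rough illustration of the rule above, the sketch below picks the placement seed for an object. The function name `placement_seed` and its parameters are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash the pool actually uses.

```python
import zlib
from typing import Optional


def placement_seed(name: str, key: str = "", hash_pos: Optional[int] = None) -> int:
    """Return a 32-bit placement seed (ps) for an object.

    Assumption: zlib.crc32 stands in for the pool's real hash function.
    """
    if hash_pos is not None:
        # An explicit hash position is used as-is; neither the object
        # name nor the locator key is hashed.
        return hash_pos & 0xFFFFFFFF
    # Otherwise hash the locator key if one is set, else the object name.
    source = key if key else name
    return zlib.crc32(source.encode("utf-8")) & 0xFFFFFFFF
```

With an explicit position the name is irrelevant: `placement_seed("foo", hash_pos=7)` and `placement_seed("bar", hash_pos=7)` both yield 7.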
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Instead of hashing the object name or key, we allow the hash position to be
provided explicitly.
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
This way we can set the compatv preferentially depending on whether
we've actually encoded new information or not.
Signed-off-by: Greg Farnum <greg@inktank.com>
Return to caller at the end of each PG. This allows the caller to look at
the [pg_]hash_position and get something meaningful.
If there are no objects in the PG, we skip it so that every callback has
*some* data (unless the pool is totally empty!). So the real difference
here is that we don't move on to the next PG just to reach max_entries.
This gives the client some data sooner, but may mean more callbacks into
client code.
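
A minimal sketch of the listing behavior described above, using a toy in-memory layout. The function `list_chunk` and the `(pg, offset)` cursor are hypothetical and are not the librados interface.

```python
from typing import Dict, List, Tuple


def list_chunk(pgs: Dict[int, List[str]], pg_num: int, pg: int, offset: int,
               max_entries: int) -> Tuple[List[str], int, int]:
    """Return (objects, next_pg, next_offset) for one listing callback.

    Empty PGs are skipped so every result carries some data, but the
    listing never crosses into the next PG just to reach max_entries.
    """
    while pg < pg_num:
        objs = pgs.get(pg, [])
        chunk = objs[offset:offset + max_entries]
        if chunk:
            new_offset = offset + len(chunk)
            if new_offset >= len(objs):
                return chunk, pg + 1, 0  # reached the end of this PG
            return chunk, pg, new_offset
        pg, offset = pg + 1, 0  # nothing (left) here; try the next PG
    return [], pg_num, 0  # pool exhausted
```

Each returned chunk comes from exactly one PG, so the cursor seen by the caller always identifies a meaningful position.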
Signed-off-by: Sage Weil <sage@inktank.com>
The pgid field is used to store the pg the op mapped to. We were just
setting it directly for PGLS. Instead, fill in a new base_pgid, and copy that
to pgid in recalc_op_target(), the same way we do when we map an object
name to a PG.
In particular, we take this opportunity to map a raw pgid to an actual
pgid. This means the base_pg could come from a raw hash value (although
it doesn't, yet).
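
To illustrate what mapping a raw pgid to an actual pgid can look like, here is a sketch of a stable-mod fold of a raw seed into the range of existing PGs. It is modeled on the stable-mod idea; the helper names and the way the mask is derived here are assumptions, not a copy of the OSDMap code.

```python
def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold x into [0, b) so that most values keep their slot as b grows."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def raw_pg_to_pg(raw_ps: int, pg_num: int) -> int:
    """Map a raw placement seed to an actual PG index (assumes pg_num >= 1)."""
    # bmask is the smallest (2^n - 1) that covers pg_num - 1.
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return stable_mod(raw_ps, pg_num, bmask)
```

For example, with `pg_num = 12` the mask is 15, so a raw seed of 13 falls outside the valid range and is folded with the smaller mask to PG 5.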
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
We don't use preferred placements any more, so this will
make it easier to start dropping references to it in new code.
Signed-off-by: Sage Weil <sage@inktank.com>
This allows rbd-bench to detect http://tracker.ceph.com/issues/6938
when combined with rapidly changing the mon osd full ratio.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Now that the osd does not respond to writes when it gets a map with the
full flag set before the client does, clients need to resend all writes.
Clients talking to old osds are still subject to the race condition,
so both sides must be upgraded to avoid it.
Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
There's a race between the client and osd with a newly marked full
osdmap. If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.
If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later. -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO. Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.
To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.
Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.
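
The fix described above can be sketched with a toy model of both sides. `ToyOSD` and `ToyClient` are hypothetical classes written only for illustration; real clients track much more state per op.

```python
class ToyOSD:
    def __init__(self):
        self.full = False
        self.stored = []

    def handle_write(self, op):
        if self.full:
            # Drop the write silently; never answer with -ENOSPC.
            return None
        self.stored.append(op)
        return "ack"


class ToyClient:
    def __init__(self, osd):
        self.osd = osd
        self.full = False   # the client's own view of the map's full flag
        self.unacked = []   # writes not yet acknowledged by the OSD

    def write(self, op):
        self.unacked.append(op)
        if not self.full:
            self._send(op)  # otherwise hold the write until the map clears

    def _send(self, op):
        if self.osd.handle_write(op) == "ack":
            self.unacked.remove(op)

    def handle_map(self, full):
        self.full = full
        if not full:
            # Map is no longer marked full: resend every unacked write.
            for op in list(self.unacked):
                self._send(op)
```

If the OSD sees the full map first, a write sent by the client is dropped rather than failed, and it completes once the client receives a map that is no longer marked full.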
Refs: #6938
Backport: dumpling, emperor
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This aligns the internal identifier names with the user-visible names in
the decompiled crush map language.
Signed-off-by: Sage Weil <sage@inktank.com>
Since we can specify the recursive retries in a rule, we may as well also
specify the non-recursive tries too for completeness.
Signed-off-by: Sage Weil <sage@inktank.com>
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:
If the tries value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.
In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.
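
The two defaults described above can be summarized in a small sketch. The function `recursive_tries` and its parameter names are hypothetical, and the descend_once tunable is deliberately left out.

```python
from typing import Optional


def recursive_tries(mode: str, rule_tries: Optional[int], normal_tries: int) -> int:
    """Pick the recursive (chooseleaf) attempt count for one choose step.

    rule_tries is the count set by the rule, or None if the rule does
    not specify one; normal_tries is the regular attempt count.
    """
    if rule_tries:
        return rule_tries      # a rule-specified value wins in both modes
    if mode == "firstn":
        return normal_tries    # legacy behavior: pass through the normal count
    if mode == "indep":
        return 1               # default to a single recursive attempt
    raise ValueError(f"unknown mode: {mode}")
```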
Signed-off-by: Sage Weil <sage@inktank.com>
And reduce the depth of the hierarchy because three levels of buckets
capture the same cases as four levels.
Signed-off-by: Loic Dachary <loic@dachary.org>
Add the is_valid_crush_loc helper to test for invalid crush names in
insert_item and update_item, before performing any side
effect. Implement the associated unit tests.
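
A sketch of the validate-before-mutate pattern this describes, on a toy bucket map. The allowed character set and the dictionary layout are assumptions made for illustration and may not match CrushWrapper exactly.

```python
import re

# Assumption: names may contain only letters, digits, '_', '.' and '-'.
_VALID_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_crush_name(name: str) -> bool:
    return _VALID_NAME.fullmatch(name) is not None


def is_valid_crush_loc(loc: dict) -> bool:
    """Check every bucket type and bucket name in a location map."""
    return all(is_valid_crush_name(t) and is_valid_crush_name(n)
               for t, n in loc.items())


def insert_item(tree: dict, item: str, loc: dict) -> bool:
    """Add item under each bucket in loc, or reject without touching tree."""
    if not is_valid_crush_name(item) or not is_valid_crush_loc(loc):
        return False  # rejected before any side effect
    for bucket_type, bucket_name in loc.items():
        tree.setdefault((bucket_type, bucket_name), set()).add(item)
    return True
```

Because validation runs first, a location with one bad name leaves the tree exactly as it was, rather than half-updated.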
Signed-off-by: Loic Dachary <loic@dachary.org>
This is (as near to) a trivial ObjectStore backend for the OSD as we can
get at the moment. Everything is stored in memory. We are slightly
tricky with the locking, but not overly so.
On umount we dump everything out to disk, and on mount we load it all in
again, so we have some very coarse persistence/durability... just enough
to make this usable in a non-failure environment.
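
The mount/umount persistence described above can be sketched as follows. `ToyMemStore` is a hypothetical class that uses a single JSON file; it illustrates the idea only and says nothing about the real on-disk format or locking.

```python
import json
import os


class ToyMemStore:
    """Keep all objects in memory; persist only at mount/umount."""

    def __init__(self, path: str):
        self.path = path
        self.objects = {}
        self.mounted = False

    def mount(self):
        # Load the whole store back in, if a previous umount dumped it.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.objects = json.load(f)
        self.mounted = True

    def write(self, oid: str, data: str):
        if not self.mounted:
            raise RuntimeError("store is not mounted")
        self.objects[oid] = data  # memory only; lost if the process dies

    def read(self, oid: str) -> str:
        return self.objects[oid]

    def umount(self):
        # Dump everything to disk in one go.
        with open(self.path, "w") as f:
            json.dump(self.objects, f)
        self.objects = {}
        self.mounted = False
```

Anything written between mount and a clean umount survives a restart; anything written before a crash does not, which is why this is only usable in a non-failure environment.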
Signed-off-by: Sage Weil <sage@inktank.com>
mon: MDSMonitor: trim versions and let PaxosService decide whether to propose
We were not trimming mdsmap versions and were generating a new map every time
we modified the pending value.
Now MDSMonitor trims old maps (a configurable option sets the maximum
number of maps to keep, defaulting to 500, much like other services do),
and it delegates to PaxosService the decision on whether to propose the
pending value.
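
As a sketch of the trimming rule, the helper below computes the new first committed version once more than the maximum number of maps is retained. The name `trim_to` and the exact boundary handling are assumptions; the monitor's own arithmetic may differ by one.

```python
from typing import Optional


def trim_to(first_committed: int, last_committed: int,
            max_maps: int = 500) -> Optional[int]:
    """Return the new first committed version, or None if no trim is needed."""
    kept = last_committed - first_committed + 1
    if kept <= max_maps:
        return None
    # Keep only the newest max_maps versions.
    return last_committed - max_maps + 1
```

For instance, with versions 100 through 1000 committed and the default limit, everything before version 501 would be trimmed.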
We also make several modifications to 'ceph-kvstore-tool': the contents of a
given prefix:key can now be written to a file instead of stdout, and the tool
can report the size of a given prefix:key's value.
'ceph report' was also modified so that we always output the first and last
committed versions for all services; up until this point, we would only output the
first committed version on all services, and only a few were also outputting the
last committed version.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>