This chooses whether to use the original format (supported by krbd)
or the new format (which supports layering).
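For example, a layered (format 2) image could be created with something
like the following, assuming the option is exposed as --format on the
rbd command line (image name and size here are placeholders):

    rbd create --format 2 --size 1024 myimage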
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
If the following sequence of events occurred,
a clone could be created of an unprotected snapshot:
1. A: begin clone - check that snap foo is protected
2. B: rbd unprotect snap foo
3. B: check that all pools have no clones of foo
4. B: unprotect snap foo
5. A: finish creating clone of foo, add it as a child
To stop this from happening, check at the beginning and end of
cloning that the parent snapshot is protected. If it is not,
or checking protection status fails (possibly because the parent
snapshot was removed), remove the clone and return an error.
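A rough sketch of the resulting flow (the function names here are
illustrative stand-ins, not the actual librbd calls):

    // check 1: refuse to clone from an unprotected snapshot
    bool is_protected = false;
    int r = get_snap_protection_status(parent, snap, &is_protected);
    if (r < 0 || !is_protected)
      return (r < 0) ? r : -EINVAL;

    r = do_clone(parent, snap, child);
    if (r < 0)
      return r;

    // check 2: the snapshot may have been unprotected (or removed)
    // while we were cloning; if so, undo and return an error
    r = get_snap_protection_status(parent, snap, &is_protected);
    if (r < 0 || !is_protected) {
      remove_clone(child);
      return (r < 0) ? r : -EINVAL;
    }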
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
These iterate over all pools and check for children of a
particular snapshot.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Default to 0.3. Setting it to 0 effectively turns this off.
Also make OSDMap::osd_xinfo_t decode into a float to simplify the
arithmetic conversions.
Signed-off-by: Sage Weil <sage@inktank.com>
Scale the down/out interval the same way we do the heartbeat grace, so that
we give laggy osds a bit longer to recover.
See #3047.
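If the same scaling applies, the effective interval becomes roughly

    down_out_interval' = down_out_interval + laggy_probability * laggy_interval

(this expression is an assumption based on the heartbeat grace scaling
described below, with the same decay applied).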
Signed-off-by: Sage Weil <sage@inktank.com>
Add a configurable halflife for the laggy probability and duration and
apply it at the time those values are used to adjust the heartbeat grace
period. Both are multiplied together, so it doesn't matter which you
think is being decayed (the probability or the interval).
Default to an hour.
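A sketch of the intended effect, assuming a standard exponential halflife
decay (variable names here are illustrative):

    age    = now - time_laggy_values_were_last_updated
    decay  = exp(ln(0.5) * age / halflife)
    grace' = grace + decay * laggy_probability * laggy_interval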
Signed-off-by: Sage Weil <sage@inktank.com>
If, based on historical behavior, an observed osd failure is likely to be
due to unresponsiveness and not the daemon stopping, scale the heartbeat
grace period accordingly:
grace' = grace + laggy_probability * laggy_interval
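For example, assuming the default 20 second grace and illustrative values
of laggy_probability = 0.5 and laggy_interval = 60 seconds:

    grace' = 20 + 0.5 * 60 = 50 seconds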
This will avoid fruitlessly marking OSDs down and generating additional
map update overhead when the cluster is overloaded and potentially
struggling to keep up with map updates. See #3045.
Signed-off-by: Sage Weil <sage@inktank.com>
Currently we only trigger a failure on receipt of a failure report. Move
the checks into a helper and check during tick() too, so that we will
trigger failures even when the thresholds are not met at failure report
time. That case is rare now, but will become common once we locally scale
the grace period.
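A plausible shape for this (names and signatures here are assumptions
based on the description, not necessarily the actual code):

    // evaluated both on receipt of a failure report and periodically
    bool check_failure(utime_t now, int target_osd, failure_info_t& fi);

    void OSDMonitor::tick()
    {
      // ... existing periodic work ...
      utime_t now = ceph_clock_now(g_ceph_context);
      for (map<int, failure_info_t>::iterator p = failure_info.begin();
           p != failure_info.end(); ++p)
        check_failure(now, p->first, p->second);
    }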
Signed-off-by: Sage Weil <sage@inktank.com>
Track the latest report message for each reporter. When the osd is
eventually marked failed, send map updates to them all.
Signed-off-by: Sage Weil <sage@inktank.com>
Aggregate the failure reports into a single mon 'failed_since' value (the
max, currently), and wait until we have exceeded the grace period to
consider the osd failed.
WARNING: This slightly changes the semantics. Previously, the grace could
be adjusted in the [osd] section. Now, the [osd] option controls when the
failure messages are sent, and the [mon] option controls when it is marked
down, and sane users should set it once in [global].
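For example, setting the grace once in ceph.conf:

    [global]
            osd heartbeat grace = 20

keeps the osd-side report threshold and the mon-side mark-down threshold
consistent, whereas setting it only under [osd] or [mon] now affects just
one side.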
Signed-off-by: Sage Weil <sage@inktank.com>
This is a no-op if the client was talking to us, but in the forwarded
request case will clean up the request state (and request message) on the
forwarding monitor. Otherwise, MOSDFailure messages (and probably others)
can accumulate on the non-leader mon indefinitely.
Signed-off-by: Sage Weil <sage@inktank.com>
- use structs to track allegedly failed nodes, and reports against them.
- use methods to handle report, and failure threshold logic.
- calculate failed_since based on OSD's reported failed_for duration
This will make it simpler to extend the logic when we add dynamic
grace periods.
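Roughly the shape of the new bookkeeping (a sketch; names are illustrative
and plain types stand in for Ceph's utime_t and message types):

    #include <map>

    struct failure_reporter_t {
      double failed_since;    // when this reporter saw the osd fail
    };

    struct failure_info_t {
      std::map<int, failure_reporter_t> reporters;  // reporter osd id -> report
      double max_failed_since;                      // aggregated failed_since
    };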
Signed-off-by: Sage Weil <sage@inktank.com>
On each osd boot, determine whether the osd was laggy (wrongly marked down)
or newly booted. Either update the laggy probability and interval or
decay the values, as appropriate.
Signed-off-by: Sage Weil <sage@inktank.com>
Track information about laggy probabilities for each OSD. That is, the
probability that, if it is marked down, it is because it is laggy, and
the expected interval it will take to recover if it is laggy.
We store this in the OSDMap because it is not convenient to keep it
elsewhere in the monitor. Yet. When the new mon infrastructure is in
place, there is a bunch of stuff that can be moved out of the OSDMap
'extended' section into other mon data structures.
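Approximately, with plain types standing in for Ceph's utime_t and __u32
(a sketch of the new per-osd extended info, not necessarily the exact fields):

    #include <cstdint>

    struct osd_xinfo_t {
      double   down_stamp;         // when the osd was last marked down
      float    laggy_probability;  // chance a down event means laggy, not dead
      uint32_t laggy_interval;     // expected seconds to come back if laggy
    };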
Signed-off-by: Sage Weil <sage@inktank.com>
Allow thread pool sizes to be adjusted on the fly by telling the
ThreadPool which config option to monitor. Add some basic unit tests
for resizing.
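For example, if the OSD's op thread pool is told to monitor an option like
'osd op threads', something along these lines should resize the pool at
runtime (the exact option name and injectargs syntax here are assumptions):

    ceph tell osd.0 injectargs '--osd-op-threads 4'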
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
radosgw-admin bucket check [--fix] --bucket=<bucket>
The command will dump both the existing bucket header stats and the
recalculated stats. If --fix is provided, the bucket header stats will
be overwritten with the recalculated stats.
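For example (the bucket name is a placeholder):

    # report the existing vs. recalculated stats
    radosgw-admin bucket check --bucket=mybucket

    # also rewrite the bucket header with the recalculated stats
    radosgw-admin bucket check --fix --bucket=mybucket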
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
We now pass the object version returned by obj_stat, and use that epoch
when setting the object version through the index suggestion mechanism.
This was broken by a recent change that switched from reading the obj
stats by (wrongly) calling ioctx->stat() directly to calling
get_obj_state().
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
An update shouldn't be skipped if the epoch is zero. We'd see a zero
epoch if we tried to read an object and it didn't exist. That could
happen, e.g., when a delete object operation failed to call the
completion earlier, and we're now re-issuing the delete on the
(now non-existent) object.
However, note that the zero epoch is racy: we may end up racing
with an object creation. That will be taken care of by a new
rados change that will set the returned object version even if
the object didn't exist.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
We weren't setting the 'exists' flag on the bucket entry,
so we ended up not updating the index correctly.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Fixes: #3127
Bad variable scoping meant that certain variables weren't
reinitialized between iterations over the suggested changes.
This specifically affected a case where a single change
contained an update followed by a remove, and the remove
was for a non-existent key (e.g., one that had already been
removed earlier). We ended up re-subtracting the object stats,
as the entry wasn't reset between iterations (and we didn't
read it, because the key didn't exist).
backport: argonaut
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>