If, based on historical behavior, an observed osd failure is likely to be
due to unresponsiveness and not the daemon stopping, scale the heartbeat
grace period accordingly:
  grace' = grace + laggy_probability * laggy_interval
This will avoid fruitlessly marking OSDs down and generating additional
map update overhead when the cluster is overloaded and potentially
struggling to keep up with map updates. See #3045.
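
As a rough illustration of the scaling above, here is a minimal Python sketch. It is not the monitor's actual code; the function name and the sample values are hypothetical.

```python
def scaled_grace(grace, laggy_probability, laggy_interval):
    """Return grace' = grace + laggy_probability * laggy_interval (seconds)."""
    if not 0.0 <= laggy_probability <= 1.0:
        raise ValueError("laggy_probability must be in [0, 1]")
    return grace + laggy_probability * laggy_interval


# Hypothetical values: 20 s base grace, an OSD that was laggy on half of its
# past failures, and an expected 30 s to recover when it is laggy.
print(scaled_grace(20.0, 0.5, 30.0))  # prints 35.0
```

An OSD with no history of lagginess (probability 0) keeps the unscaled grace.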
Signed-off-by: Sage Weil <sage@inktank.com>
Currently we only trigger a failure on receipt of a failure report. Move
the checks into a helper and also call it from tick(), so that we still
trigger a failure when the thresholds are first met some time after the
failure report arrived. That case is rare now, but will occur once we
locally scale the grace period.
Signed-off-by: Sage Weil <sage@inktank.com>
Track the latest report message for each reporter. When the osd is
eventually marked failed, send map updates to them all.
Signed-off-by: Sage Weil <sage@inktank.com>
Aggregate the failure reports into a single mon 'failed_since' value (the
max, currently), and wait until we have exceeded the grace period to
consider the osd failed.
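
A minimal Python sketch of this aggregation, assuming each reporter supplies an absolute failed_since timestamp in seconds. The function names are hypothetical and this is not the monitor code itself.

```python
def aggregate_failed_since(reported_failed_since):
    """Collapse per-reporter timestamps into one value: the max (latest)."""
    return max(reported_failed_since)


def is_failed(reported_failed_since, now, grace):
    """True once the aggregated failure has lasted longer than the grace."""
    failed_since = aggregate_failed_since(reported_failed_since)
    return now - failed_since > grace


# Three reporters saw the osd stop responding at slightly different times.
reports = [100.0, 105.0, 103.0]
print(is_failed(reports, now=124.0, grace=20.0))  # 19 s since 105 -> False
print(is_failed(reports, now=126.0, grace=20.0))  # 21 s since 105 -> True
```

Taking the max is the conservative choice: the osd is only considered failed once even the most recent reporter's observation is older than the grace period.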
WARNING: This slightly changes the semantics. Previously, the grace could
be adjusted in the [osd] section. Now, the [osd] option controls when the
failure messages are sent, and the [mon] option controls when the osd is
marked down. Sane users should set it once in [global].
Signed-off-by: Sage Weil <sage@inktank.com>
This is a no-op if the client was talking to us, but in the forwarded
request case will clean up the request state (and request message) on the
forwarding monitor. Otherwise, MOSDFailure messages (and probably others)
can accumulate on the non-leader mon indefinitely.
Signed-off-by: Sage Weil <sage@inktank.com>
- use structs to track allegedly failed nodes, and reports against them.
- use methods to handle report, and failure threshold logic.
- calculate failed_since based on OSD's reported failed_for duration
This will make it simpler to extend the logic when we add dynamic
grace periods.
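
The shape of this refactoring might look roughly like the following Python sketch. All class, method, and field names here are hypothetical stand-ins, not the actual C++ structs; only the conversion from failed_for to failed_since comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class FailureReport:
    """One reporter's claim against a target (hypothetical layout)."""
    reporter: str
    failed_since: float  # absolute time, seconds


@dataclass
class FailureInfo:
    """All outstanding reports against one allegedly failed node."""
    reports: dict = field(default_factory=dict)  # reporter -> FailureReport

    def add_report(self, reporter, now, failed_for):
        # The reporter sends a duration; convert it to an absolute time.
        self.reports[reporter] = FailureReport(reporter, now - failed_for)

    def cancel_report(self, reporter):
        self.reports.pop(reporter, None)

    def enough_reporters(self, min_reporters):
        # Threshold logic lives in a method so it can be extended later.
        return len(self.reports) >= min_reporters


info = FailureInfo()
info.add_report("osd.1", now=100.0, failed_for=10.0)
info.add_report("osd.2", now=101.0, failed_for=8.0)
print(info.enough_reporters(2))  # True
```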
Signed-off-by: Sage Weil <sage@inktank.com>
On each osd boot, determine whether the osd was laggy (wrongly marked down)
or newly booted. Either update the laggy probability and interval or
decay the values, as appropriate.
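
This message does not say how the values are updated or decayed. The Python sketch below assumes an exponentially weighted moving average with a hypothetical weight parameter, purely to illustrate the update-or-decay decision; the function and argument names are also assumptions.

```python
def on_osd_boot(laggy_probability, laggy_interval, was_laggy,
                down_duration=0.0, weight=0.3):
    """Return the new (laggy_probability, laggy_interval) after a boot.

    was_laggy: True if the osd had been wrongly marked down.
    down_duration: seconds it spent marked down (used only if was_laggy).
    weight: assumed smoothing factor in (0, 1]; not taken from the source.
    """
    keep = 1.0 - weight
    if was_laggy:
        # Move the probability toward 1 and the interval toward the
        # downtime that was just observed.
        laggy_probability = weight * 1.0 + keep * laggy_probability
        laggy_interval = weight * down_duration + keep * laggy_interval
    else:
        # Fresh boot: let the old evidence fade.
        laggy_probability = keep * laggy_probability
        laggy_interval = keep * laggy_interval
    return laggy_probability, laggy_interval


p, interval = on_osd_boot(0.0, 0.0, was_laggy=True,
                          down_duration=40.0, weight=0.5)
print(p, interval)  # 0.5 20.0
```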
Signed-off-by: Sage Weil <sage@inktank.com>
Track information about laggy probabilities for each OSD. That is, the
probability that if it is marked down it is because it is laggy, and
the expected interval it will take to recover if it is laggy.
We store this in the OSDMap because it is not yet convenient to keep it
elsewhere in the monitor. When the new mon infrastructure is in place,
a number of things can be moved out of the OSDMap 'extended' section
into other mon data structures.
Signed-off-by: Sage Weil <sage@inktank.com>
This blindly tries the Subdomain calling format if the ordinary method
fails. In particular, this works around buckets that present a
PermanentRedirect message.
See bug #3128.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Matthew Wodrich <matthew.wodrich@dreamhost.com>
Should fix bug #2761.
If we are already pushing soid, recovery_ops will only be decremented once for
all current pushes, so only increment recovery_ops if we are not currently
pushing it.
This bug causes us to leak a recovery op and get stuck in backfill.
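
The accounting rule can be modeled with a small Python sketch. The class and method names are hypothetical and the real bookkeeping lives in the OSD's C++ code; only the "increment once per object being pushed" rule comes from the description above.

```python
class RecoveryAccounting:
    """Toy model: one recovery op per object, however many pushes it has."""

    def __init__(self):
        self.recovery_ops = 0
        self.pushing = {}  # soid -> set of peers still being pushed to

    def start_push(self, soid, peer):
        if soid not in self.pushing:
            # Only the first push of an object takes a recovery op.
            self.recovery_ops += 1
            self.pushing[soid] = set()
        self.pushing[soid].add(peer)

    def finish_push(self, soid, peer):
        peers = self.pushing.get(soid)
        if peers is None:
            return
        peers.discard(peer)
        if not peers:
            # Decremented once, when all current pushes are done.
            del self.pushing[soid]
            self.recovery_ops -= 1


acct = RecoveryAccounting()
acct.start_push("obj1", "osd.1")
acct.start_push("obj1", "osd.2")  # already pushing obj1: no extra op
acct.finish_push("obj1", "osd.1")
acct.finish_push("obj1", "osd.2")
print(acct.recovery_ops)  # 0, nothing leaked
```

Incrementing on every start_push call instead would leave recovery_ops at 1 here, which is the leak this change removes.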
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reorder the snapdir logic and ctx->at_version adjustments so that they
happen before filling in the object_info_t, user_versions, and related
fields. Adjust at_version after appending the log entry, so that it
points to the next position/version we will write at, culminating in
the actual user event.
The user log entry contains the request id, which will be used
by replay ops to put themselves in the correct place in the
waiting_for_commit/ack maps. Thus, the repop needs to be tagged
with the same version as the log entry with the request id.
Thus, the request id bearing log entry should be the last in
the log entry vector.
This should fix #3072, wherein a replay which should wait on
the repop tagged as version '36 will instead wait on '35.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Instead of 'osd crush set NNN osd.NNN weight loc...', make the second
osd.NNN option optional, and allow either NNN or osd.NNN to specify the
osd id. This makes the usage much more sane, but maintains backward
compatibility.
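
A minimal Python sketch of the argument handling described above. The function names, the error handling, and the key=value treatment of the location arguments are assumptions for illustration, not the monitor's actual parser.

```python
def parse_osd_id(token):
    """Accept either 'NNN' or 'osd.NNN' and return the integer id."""
    digits = token[4:] if token.startswith("osd.") else token
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("invalid osd id: %r" % token)
    return int(digits)


def parse_crush_set_args(args):
    """Parse the tokens that follow 'osd crush set'.

    Accepts both the old form  NNN osd.NNN weight loc...
    and the shorter form       NNN|osd.NNN weight loc...
    """
    if len(args) < 2:
        raise ValueError("usage: <id> [<name>] <weight> [<loc> ...]")
    osd_id = parse_osd_id(args[0])
    rest = args[1:]
    if rest[0].startswith("osd."):
        # Old-style redundant name: check that it agrees, then skip it.
        if parse_osd_id(rest[0]) != osd_id:
            raise ValueError("osd id and name disagree")
        rest = rest[1:]
    if not rest:
        raise ValueError("missing weight")
    weight = float(rest[0])
    loc = dict(kv.split("=", 1) for kv in rest[1:])
    return osd_id, weight, loc


print(parse_crush_set_args(["12", "osd.12", "1.0", "host=a"]))
print(parse_crush_set_args(["osd.12", "1.0", "host=a"]))
```

Both calls yield the same result, which is the backward compatibility the change aims for.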
Signed-off-by: Sage Weil <sage@inktank.com>
Create an item in the tree with the given weight, or move it (without
touching the weight) if it is already present.
Closes: #3101
Signed-off-by: Sage Weil <sage@inktank.com>
Create an item if it doesn't exist, with the specified weight. If it is
already in the tree, move it, but do not adjust the weight.
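
The create-or-move behavior can be sketched in Python as follows. The dictionary-based tree is a hypothetical stand-in for the CRUSH map, and the function name and return values are chosen here for illustration only.

```python
def create_or_move(tree, item, weight, loc):
    """Create item at loc with weight, or move it and keep its weight.

    tree: dict mapping item name -> {"weight": float, "loc": dict}
    Returns "created", "moved", or "unchanged".
    """
    entry = tree.get(item)
    if entry is None:
        tree[item] = {"weight": weight, "loc": dict(loc)}
        return "created"
    if entry["loc"] != loc:
        # Existing item: relocate it, but leave the weight untouched.
        entry["loc"] = dict(loc)
        return "moved"
    return "unchanged"


tree = {}
print(create_or_move(tree, "osd.3", 1.0, {"host": "a"}))  # created
print(create_or_move(tree, "osd.3", 9.0, {"host": "b"}))  # moved
print(tree["osd.3"]["weight"])                            # 1.0
```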
Signed-off-by: Sage Weil <sage@inktank.com>