Instead of automatically marking unfound objects lost (once we've tried
every location we can think of), do it only when the administrator explicitly
says to. This avoids incorrectly marking objects lost when there are
peering issues, and also lets the administrator decide whether there
are offline OSDs that may be worth bringing back online.
Signed-off-by: Sage Weil <sage@newdream.net>
We may not want to do this automatically until we have more confidence in
the recovery code. Even then, possibly not. In particular, the OSDs may
believe they have contacted all possible homes for the data even though
there is some long-lost OSD that has the data on disk but is offline.
For now, we make the marking process explicit so that the administrator can
make the call.
Signed-off-by: Sage Weil <sage@newdream.net>
A few things:
- track Connection* instead of entity_inst_t for hb peers
- we can only send maps over the cluster_messenger
- if the peer is still alive, do that
- if the peer is not, send a final MOSDPing with the YOU_DIED flag set (sketched below)
If we forget the peer epoch when we see them go down, we won't share the
map later in update_heartbeat_peers() to tell them they're down.
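The intended flow, as a hedged sketch with stand-in types and stub helpers
(none of these names are the real OSD or messenger symbols):

  #include <iostream>

  struct Connection {};                       // stand-in for Ceph's Connection
  struct OSDMapRef {};                        // stand-in for a shared map ref
  constexpr unsigned PING_FLAG_YOU_DIED = 1;  // illustrative flag value

  struct HeartbeatPeer {
    int osd = -1;
    Connection* con = nullptr;  // keep the Connection*, not an entity_inst_t,
                                // so we always tear down the right session
  };

  // Stub helpers so the sketch is self-contained.
  bool peer_is_up(int /*osd*/) { return true; }
  void send_map_over_cluster_messenger(int osd, const OSDMapRef&) {
    std::cout << "sharing map with osd." << osd << " over cluster messenger\n";
  }
  void send_ping(Connection*, unsigned flags) {
    std::cout << "final ping, flags=" << flags << "\n";
  }

  void drop_heartbeat_peer(const HeartbeatPeer& p, const OSDMapRef& curmap) {
    if (peer_is_up(p.osd))
      send_map_over_cluster_messenger(p.osd, curmap);  // peer alive: share map
    else
      send_ping(p.con, PING_FLAG_YOU_DIED);            // peer dead: say so
  }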
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We try to keep track of which epochs our peers have so that we can be
semi-intelligent about which map incrementals we send preceding any
messages. Since this is useful from the heartbeat and cluster channels/
threads, protect the data with an inner lock and clean up the callers.
Be smarter about when we forget.
Make note of peer epoch when we receive a ping.
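A minimal sketch of the shape of this, assuming invented names (the real
bookkeeping lives in the OSD class): the per-peer epoch map sits behind its
own small mutex so the heartbeat and cluster dispatch paths can note, query,
and forget epochs without taking the big OSD lock.

  #include <map>
  #include <mutex>
  #include <algorithm>
  #include <cstdint>

  using epoch_t = uint32_t;

  class PeerEpochTracker {
    std::mutex lock;                       // inner lock; never held across I/O
    std::map<int, epoch_t> peer_epoch;     // osd id -> newest epoch we know it has

  public:
    // Remember that 'peer' has at least epoch 'e' (e.g. learned from a ping).
    epoch_t note_peer_epoch(int peer, epoch_t e) {
      std::lock_guard<std::mutex> l(lock);
      auto& cur = peer_epoch[peer];
      cur = std::max(cur, e);
      return cur;
    }

    // Forget what we knew, e.g. once the peer has been marked down and told so.
    void forget_peer_epoch(int peer, epoch_t as_of) {
      std::lock_guard<std::mutex> l(lock);
      auto it = peer_epoch.find(peer);
      if (it != peer_epoch.end() && it->second <= as_of)
        peer_epoch.erase(it);
    }

    // Which incrementals do we need to send before our next message?
    epoch_t get_peer_epoch(int peer) {
      std::lock_guard<std::mutex> l(lock);
      auto it = peer_epoch.find(peer);
      return it == peer_epoch.end() ? 0 : it->second;
    }
  };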
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Clean up the code to mirror the _to case.
Previously we would not mark down an old _from that is still a _to but with
a new address. Now we do.
Share a map while we're at it, just to be nice!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
If a peer remains a _to target but their address changes, we still want
to mark down the old connection.
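Roughly the check in question, as a sketch with invented stand-in types:

  #include <string>

  struct Connection {
    bool down = false;
    void mark_down() { down = true; }   // stand-in for the real teardown call
  };

  struct HBTarget {
    std::string addr;           // heartbeat address we were sending to
    Connection* con = nullptr;  // connection bound to that address
  };

  // If the osd stays a _to target but its address changed across maps,
  // tear down the old connection before switching to the new one.
  void refresh_to_target(HBTarget& tgt, const std::string& new_addr,
                         Connection* new_con) {
    if (tgt.con && tgt.addr != new_addr)
      tgt.con->mark_down();
    tgt.addr = new_addr;
    tgt.con = new_con;
  }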
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
This could conceivably screw up ordering, and priority doesn't matter
anyway when this is the first message we send to this peer.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Consider peer P.
- P goes down in, say, epoch 60, and back up in epoch 70
- P requests a heartbeat, as_of 70
- We update to map 50, and coincidentally add the same peer as a target
- We set the heartbeat_to[P] = 50 and start sending to the _old_ address
- P marks us down because we stop sending to the new addr
- We eventually get map 70, but it's too late!
Make sure we preserve any _to targets _and_ their epoch+inst.
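In code terms the rule is roughly this (a sketch with invented names; the
real update_heartbeat_peers() is more involved): a target established at a
newer epoch must never be clobbered by whatever an older map says.

  #include <map>
  #include <string>
  #include <cstdint>

  using epoch_t = uint32_t;

  struct HBTarget {
    epoch_t epoch;        // epoch the target was established for
    std::string addr;     // heartbeat address at that epoch
  };

  void update_heartbeat_to(std::map<int, HBTarget>& heartbeat_to,
                           const std::map<int, HBTarget>& wanted /* from our osdmap */) {
    for (const auto& [osd, tgt] : wanted) {
      auto it = heartbeat_to.find(osd);
      if (it != heartbeat_to.end() && it->second.epoch > tgt.epoch)
        continue;                 // peer asked as_of a newer epoch; keep theirs
      heartbeat_to[osd] = tgt;    // otherwise adopt what our map says
    }
  }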
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We can't blindly ask for everything since last_epoch_started because that
may mean we get some fragment of a backlog. Look at the peer's log
ranges and request the correct thing. Also, in fulfill_log, infer what
the primary should have asked for if they make a bad request.
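The shape of the fix, as a hedged sketch (invented names, simplified
eversion_t): compare our head against the peer's log bounds and only fall
back to a backlog when their plain log genuinely cannot bridge the gap.

  #include <cstdint>

  struct eversion_t {
    uint32_t epoch = 0;
    uint64_t version = 0;
  };

  inline bool operator<(const eversion_t& a, const eversion_t& b) {
    return a.epoch < b.epoch || (a.epoch == b.epoch && a.version < b.version);
  }

  struct peer_log_bounds_t {
    eversion_t log_tail;      // oldest entry still in the peer's log
    eversion_t last_update;   // newest entry in the peer's log
  };

  struct log_request_t {
    eversion_t since;         // "send me entries after this point"
    bool backlog = false;     // also need a backlog to bridge the gap
  };

  // Decide what to ask the chosen peer for, given our own last_update.
  log_request_t choose_log_request(const eversion_t& my_last_update,
                                   const peer_log_bounds_t& peer) {
    log_request_t req;
    req.since = my_last_update;
    // If our head predates everything in the peer's log, their plain log
    // cannot connect to ours and a backlog is required; otherwise the
    // overlapping portion after our head is enough.
    req.backlog = my_last_update < peer.log_tail;
    return req;
  }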
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
We weren't accounting for the case where we have
(foo,foo]+backlog
i.e., everything is backlog, and rbegin().version != log.head.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
If the peer log is empty, and we break out of the loop on the first pass,
then clearly last_update has not been adjusted.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
I'm hitting a case where the primary is compensating for a replica's
last_complete < log.tail by sending a log+backlog, but the replica
isn't smart enough to take advantage. In this case,
replica: log(781'26629,781'26631]
from primary: log(781'26629,781'26631]+backlog
result: log(781'26629,781'26631]
Doh!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
If the peer has a last_complete below their tail, we can get by with our
log (without backlog) if our tail is _before_ their last_complete, not
after. Otherwise, we need a backlog!
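Expressed as a condition, the corrected check looks roughly like this
(invented names, simplified eversion_t):

  struct eversion_t {
    unsigned epoch = 0;
    unsigned long version = 0;
  };

  inline bool operator<(const eversion_t& a, const eversion_t& b) {
    return a.epoch < b.epoch || (a.epoch == b.epoch && a.version < b.version);
  }

  // Does this peer need a backlog from us, or will our plain log do?
  bool peer_needs_backlog(const eversion_t& my_log_tail,
                          const eversion_t& peer_last_complete,
                          const eversion_t& peer_log_tail) {
    if (!(peer_last_complete < peer_log_tail))
      return false;   // peer's own log already reaches its last_complete
    // Peer's last_complete is below its tail: our log alone is enough only
    // if our tail reaches back to (or before) their last_complete.
    return peer_last_complete < my_log_tail;
  }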
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
This fixes a bug where we were excluding up (but not acting) nodes from
past intervals, which in turn was triggering a nasty choose_acting loop
(because we _do_ already include acting but !up from the current
interval).
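A much-simplified sketch of the idea (the real prior-set construction has
many more conditions; names are invented):

  #include <set>
  #include <vector>

  struct PastInterval {
    std::vector<int> up;       // who was up in the interval
    std::vector<int> acting;   // who was acting in the interval
  };

  // Build the prior set from past intervals.  The point of the fix: consider
  // the up set as well, not just the acting set, mirroring how the current
  // interval already includes acting-but-not-up members.
  std::set<int> build_prior(const std::vector<PastInterval>& intervals) {
    std::set<int> prior;
    for (const auto& i : intervals) {
      prior.insert(i.acting.begin(), i.acting.end());
      prior.insert(i.up.begin(), i.up.end());   // previously left out
    }
    return prior;
  }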
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Check them before entering the state machine so we can
safely enter the Crashed state on unexpected messages
from the current interval.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Without preferring an OSD with a backlog, PGs would get stuck in the
active state when acting != up and the backlog was on an OSD with the
same last_update but a lower number or log_tail.
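A sketch of the tie-break (invented names, scalar stand-ins for eversion_t):
with equal last_update, a candidate that has a backlog wins.

  #include <vector>

  struct Candidate {
    int osd;
    unsigned long last_update;   // simplified stand-in for eversion_t
    bool has_backlog;
    unsigned long log_tail;
  };

  int choose_newest_update(const std::vector<Candidate>& cands) {
    int best = -1;
    for (size_t i = 0; i < cands.size(); ++i) {
      if (best < 0) { best = (int)i; continue; }
      const Candidate& a = cands[i];
      const Candidate& b = cands[best];
      if (a.last_update > b.last_update ||
          (a.last_update == b.last_update &&
           ((a.has_backlog && !b.has_backlog) ||            // prefer a backlog
            (a.has_backlog == b.has_backlog && a.log_tail < b.log_tail))))
        best = (int)i;
    }
    return best < 0 ? -1 : cands[best].osd;
  }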
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
We already hold the lock from a few frames up the stack (ms_dispatch).
Reported-by: Simon Tian <aixt2006@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
newest_update osd should be stable when the primary changes, to
prevent cycles of acting set choices. For the same reason, we should
not treat the primary as a special case in choose_acting.
Also remove the magic -1 that was used to represent the current primary.
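A sketch of the stability property (invented names, scalar stand-in for
last_update): the choice depends only on the infos, never on which osd
happens to be primary, so re-running it after a primary change gives the
same answer.

  #include <map>

  struct Info { unsigned long last_update; };

  int find_newest_update_osd(const std::map<int, Info>& infos /* osd -> info */) {
    int best = -1;
    unsigned long best_lu = 0;
    for (const auto& [osd, info] : infos) {
      if (best < 0 || info.last_update > best_lu) {   // strictly newer wins;
        best = osd;                                   // ties keep the lowest osd id
        best_lu = info.last_update;
      }
    }
    return best;   // -1 if no infos; no special case for the current primary
  }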
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
If we request a log from one osd, and then another member of our prior
set comes up with a later last_update, we should not fail when we
receive the first log.
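One way to express the guard, as a sketch with invented names: a log reply
that no longer matches the peer we most recently asked, or that predates a
newer prior-set member, is simply stale and can be ignored rather than
tripping an assert.

  struct LogReply {
    int from;
    unsigned long last_update;   // simplified stand-in for eversion_t
  };

  bool should_use_log_reply(const LogReply& reply,
                            int currently_requested_from,
                            unsigned long best_last_update) {
    if (reply.from != currently_requested_from)
      return false;                       // we've since asked someone else
    if (reply.last_update < best_last_update)
      return false;                       // a newer prior-set member showed up
    return true;
  }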
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>