We need to know whether the client is lossy before we connect to the peer,
so that on connection failure we know whether to deliver a RESET event.
Lossy clients get one; lossless clients do not.
And in any case, we know ahead of time, so we may as well indicate as much
in the Policy.
Signed-off-by: Sage Weil <sage@inktank.com>
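The failure-path decision above can be modeled roughly as follows. This is a minimal illustrative sketch, not Ceph code; `Policy` and `on_connect_failure` are hypothetical names.

```python
# Illustrative model: only lossy peers get a RESET event on connect failure.
class Policy:
    def __init__(self, lossy):
        self.lossy = lossy

def on_connect_failure(policy, deliver_event):
    # A lossy session is torn down on failure, so the dispatcher must hear
    # about it; a lossless session will reconnect and replay silently.
    if policy.lossy:
        deliver_event("RESET")

events = []
on_connect_failure(Policy(lossy=True), events.append)
on_connect_failure(Policy(lossy=False), events.append)
# events == ["RESET"]
```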
The IncomingQueue can live beyond the Pipe. In particular, there is no
reason not to deliver messages we've received on this connection even
though the socket has errored out.
Separate incoming queue discard from outgoing, and only do the latter in
the reaper.
Signed-off-by: Sage Weil <sage@inktank.com>
Use this pointer only for debug output prefix; do not dereference, as we
may live beyond the original parent.
Signed-off-by: Sage Weil <sage@inktank.com>
We change a couple of key things here:
* If there is a matching connect_seq and the existing connection is in OPEN (or
STANDBY, which is the same thing plus a failure), we send a RETRY_SESSION and
ask the peer to bump their connect_seq. This handles the case where there was
a race: our end successfully opened, but the peer's racing attempt was
processed slowly.
* We always reply with connect_seq + 1. This handles the above case
more cleanly, and lets us use the same code path.
Also avoid duplicating the RETRY_SESSION path with a goto. Beautiful!
Signed-off-by: Sage Weil <sage@inktank.com>
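The accept-side decision described above can be sketched like this. It is a toy model under the commit's own description; the state names mirror the text, everything else (function name, return shape) is hypothetical.

```python
# Illustrative model of the accept-side connect race handling.
OPEN, STANDBY = "OPEN", "STANDBY"

def handle_connect(existing_state, existing_cseq, peer_cseq):
    """Return (reply tag, connect_seq to send back)."""
    if peer_cseq == existing_cseq and existing_state in (OPEN, STANDBY):
        # Race: our attempt already won; ask the peer to retry with a
        # bumped connect_seq instead of tearing anything down.
        return ("RETRY_SESSION", existing_cseq + 1)
    # Otherwise accept; always replying with connect_seq + 1 lets the race
    # case and the normal case share one code path.
    return ("ACCEPT", peer_cseq + 1)
```

For example, `handle_connect(OPEN, 3, 3)` yields a RETRY_SESSION asking the peer to come back with connect_seq 4.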
We solve two problems with this patch. The first is that the messenger
will now reuse an existing session's Connection with a new connection,
which means that we don't want to change session->connection when we
are validating an authorizer. Instead, set it only if it is not already
set, rather than replacing an existing Connection.
We also want to avoid a race where:
- mds recovers, replays Sessions with no con's
- multiple connection attempts for the same session race in the msgr
- both are authorized, but out of order
- Session->connection gets set to the losing attempt's Connection*
Instead, we take advantage of an accept event that is called only for the
winning attempts.
Signed-off-by: Sage Weil <sage@inktank.com>
fixup
Create a new event type when we successfully accept a connection. This is
distinct from the authorizer verification, which may happen for multiple
racing connection attempts. In contrast, this will only happen on those
that win the race(s). I don't think this is that important for stateless
servers (OSD, MON), but it is important for the MDS to ensure that it keeps
its Session con reference pointing to the most recently-successful
connection attempt.
Signed-off-by: Sage Weil <sage@inktank.com>
This used to be necessary because the pipe_lock was used when queueing
the pipe in the dispatch queue. Now that is handled by IncomingQueue's
own lock, so these can be removed.
By no longer dropping the lock, we eliminate a whole category of potential
hard-to-debug races. (Not that any were observed, but now we don't need to
worry about them.)
Signed-off-by: Sage Weil <sage@inktank.com>
The DispatchQueue class now completely owns message delivery. This is
cleaner and lets us drop the redundant destination_stopped flag from
msgr (DQ has its own stop flag).
Signed-off-by: Sage Weil <sage@inktank.com>
Looking through the git history, it is not clear exactly how these checks
came to be. They seem to have grown during the multiple-entity-per-rank
transition a few years back. I'm not fully convinced they are necessary,
but we will keep them regardless.
Push checks into DispatchQueue and look at the local stop flag to
determine whether these events should be queued. This moves us away from
the kludgey SimpleMessenger::destination_stopped flag (which will soon
be removed).
Also move the refcount futzing into the DispatchQueue methods. This makes
the callers much simpler.
Signed-off-by: Sage Weil <sage@inktank.com>
We don't need to worry about racing with shutdown here; the cleanup
procedure will stop the accepter thread before cleaning up all the
pipes.
Signed-off-by: Sage Weil <sage@inktank.com>
* create it via DispatchQueue
* keep pointer to parent DispatchQueue
* drop now-useless contextual arguments to most methods
Signed-off-by: Sage Weil <sage@inktank.com>
Move to an explicit Connection for messages sent to ourselves, instead of
using the one on the local_pipe (which we'll remove shortly).
Signed-off-by: Sage Weil <sage@inktank.com>
If a new pipe/socket is taking over an existing session, it should also
take over the Connection* associated with the existing session. Because
we cannot clear existing->connection_state, we just take another reference.
Clean up the comments a bit while we're here.
This affects MDS<->client sessions when reconnecting after a socket fault.
It probably also affects intra-cluster (osd/osd, mds/mds, mon/mon)
sessions as well, but I did not confirm that.
Backport: argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
The queue may have been previously stopped (by discard_queue()), and needs
to be restarted.
Fixes consistent failures from the mon_recovery.py integration tests.
Signed-off-by: Sage Weil <sage@inktank.com>
If the connect_seq matches, but our existing connection is in STANDBY, take
the incoming one. Otherwise, the other end will wait indefinitely for us
to connect, but we never will.
Alternatively, we could "win" the race and trigger a connection by sending
a keepalive (or similar), but that is more work; we may as well accept the
incoming connection we have now.
This removes STANDBY from the acceptable WAIT case states. It also keeps
responsibility squarely on the shoulders of the peer with something to
deliver.
Without this patch, a 3-osd vstart cluster with
'ms inject socket failures = 100' and rados bench write -b 4096 would start
generating slow request warnings after a few minutes due to the osds
failing to connect to each other. With the patch, I complete a 10 minute
run without problems.
Signed-off-by: Sage Weil <sage@inktank.com>
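The change above amounts to removing STANDBY from the states where we reply WAIT. A minimal model, with hypothetical names, might look like:

```python
# Illustrative model: with a matching connect_seq, a STANDBY existing
# connection means the peer is the one with something to deliver, so we
# take the incoming socket instead of replying WAIT.
def resolve_race(existing_state, we_win_tiebreak):
    if existing_state == "STANDBY":
        # Replying WAIT here would leave the peer waiting forever for a
        # connect we will never initiate.
        return "ACCEPT_INCOMING"
    return "WAIT" if we_win_tiebreak else "ACCEPT_INCOMING"
```

This keeps responsibility for (re)connecting on the peer that actually has messages to send.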
If we replace an existing pipe with a new one, move the incoming queue
of messages that have not yet been dispatched over to the new Pipe so that
they are not lost.
Alternatively, we could set in_seq = existing->in_seq - existing->in_qlen,
but that would make the other end resend those messages, which is a waste
of bandwidth.
The original bug is very easy to reproduce with 'ms inject socket failures'.
Signed-off-by: Sage Weil <sage@inktank.com>
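The two options weighed above can be sketched side by side. This is an illustrative model only; the class and function names are hypothetical, and only the in_seq arithmetic mirrors the commit text.

```python
# Illustrative sketch of moving the incoming queue vs. rewinding in_seq.
class Pipe:
    def __init__(self, in_seq, queued):
        self.in_seq = in_seq        # highest seq received on this pipe
        self.queue = list(queued)   # received but not yet dispatched

    @property
    def in_qlen(self):
        return len(self.queue)

def take_over(existing, replacement):
    # Chosen fix: move undispatched messages to the new pipe, so nothing
    # is lost and nothing has to be resent.
    replacement.queue.extend(existing.queue)
    replacement.in_seq = existing.in_seq
    existing.queue.clear()

def rewind_alternative(existing):
    # Rejected alternative: acknowledge only what was dispatched, which
    # makes the peer resend the queued messages (wasted bandwidth).
    return existing.in_seq - existing.in_qlen

old = Pipe(in_seq=10, queued=["m9", "m10"])
new = Pipe(in_seq=0, queued=[])
# rewind_alternative(old) == 8; take_over keeps both queued messages.
```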
This extricates the incoming queue and its funky relationship with
DispatchQueue from Pipe and moves it into IncomingQueue. There is now a
single IncomingQueue attached to each Pipe. DispatchQueue is now no
longer tied to Pipe.
This modularizes the code a bit better (though that is still a work in
progress) and (more importantly) will make it possible to move the
incoming messages from one pipe to another in accept().
Signed-off-by: Sage Weil <sage@inktank.com>
A while ago we inadvertently broke ms_handle_connect() callbacks because
of a check for m being non-zero in the dispatch_entry() thread. Adjust the
enums so that they get delivered again.
This fixes hangs when, for example, the ceph tool sends a command, gets a
connection reset, and doesn't get the connect callback to resend after
reconnecting to a new monitor.
Signed-off-by: Sage Weil <sage@inktank.com>
We may replace an existing pipe in the STANDBY state if the previous
attempt failed during accept() (see previous patches).
This might fix #1378.
Signed-off-by: Sage Weil <sage@inktank.com>
If we have a con with a closed pipe, drop the message. For lossless
sessions, the state will be STANDBY if we should reconnect. For lossy
sessions, we will end up with CLOSED and we *should* drop the message.
Signed-off-by: Sage Weil <sage@inktank.com>
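The send-side check described above is roughly the following; a minimal sketch with hypothetical names, keyed off the pipe states named in the commit text.

```python
# Illustrative model: what to do with an outgoing message given the state
# of the con's pipe.
def submit_message(pipe_state):
    if pipe_state == "CLOSED":
        return "drop"        # lossy session was reset; dropping is correct
    if pipe_state == "STANDBY":
        return "reconnect"   # lossless session: reopen the pipe and queue
    return "queue"           # pipe is healthy; queue as usual
```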
If we replace an existing pipe during accept() and then fail, move to
STANDBY so that our connection state (connect_seq, etc.) is preserved.
Otherwise, we will throw out that information and falsely trigger a
RESETSESSION on the next connection attempt.
Signed-off-by: Sage Weil <sage@inktank.com>
This could cause us to incorrectly encode new features into the monstore
that an old mon won't understand.
This is overly conservative; we probably need to persist the set of quorum
features that are supported and use those.
Signed-off-by: Sage Weil <sage@inktank.com>
We marked a request as complete in the callback; however, we might still
be inside S3_runall_request_context(), which means the request is not
really complete yet.
Possibly fixes bug #2652.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
In the following sequence:
1) create (a, 1)
2) setattr (a, 1)
3) link (a, 1), (b, 1)
4) remove (a, 1)
If we play 1-4 and then replay 1-4 again, we will end up removing
(b, 1)'s attributes, since nlink for (a, 1) the second time through
is 1. We fix this by marking spos on the object_map header for
(a, 1) when we remove (a, 1) but not the attributes.
Signed-off-by: Samuel Just <sam.just@inktank.com>
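The guard described above can be modeled crudely like this. This is only an illustrative sketch of the spos idempotency mechanism; the header layout and function names are hypothetical, not the FileStore code.

```python
# Illustrative model: record the spos (sequence position) on the shared
# object_map header when remove runs, so replaying the same remove is a
# no-op and attributes that now belong to the linked object survive.
def remove(header, spos):
    if header.get("spos", -1) >= spos:
        return                      # op already applied: replay is a no-op
    header["attrs"] = {}            # wipe the attributes (first play only)
    header["spos"] = spos           # remember how far we have applied

hdr = {"attrs": {"k": "v"}, "spos": -1}
remove(hdr, spos=4)                 # first play: attrs wiped, spos recorded
hdr["attrs"] = {"k2": "v2"}         # link target reuses the header and
                                    # accumulates its own attributes
remove(hdr, spos=4)                 # replay: guarded by spos, attrs kept
```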
Also adjust the recommends and depends, so that libcephfs1 and ceph-fuse
hang off of ceph-mds instead of ceph.
Signed-off-by: Sage Weil <sage@inktank.com>
- keyrings have new default locations that everyone should use.
- the user key setup is vastly simplified if you use the
'ceph auth get-or-create' command.
Signed-off-by: Sage Weil <sage@inktank.com>