Merge PR #23965 into master

* refs/pull/23965/head:
	doc/dev/msgr2: better formatting
	doc/dev/msgr2: clarify padding alignment
	doc/dev/msgr2: tweak message flow handshake
	doc/dev/msgr2: remove stream concept, streamline auth

Reviewed-by: Ricardo Dias <rdias@suse.com>
Sage Weil 2018-09-10 09:23:32 -05:00
commit 3a34c9ee38

@@ -10,19 +10,15 @@ Goals
This protocol revision has several goals relative to the original protocol: This protocol revision has several goals relative to the original protocol:
* *Multiplexing*. We will have multiple server entities (e.g.,
multiple OSDs and clients) coexisting in the same process. We would
like to share the transport connection (e.g., TCP socket) whenever
possible.
* *Signing*. We will allow for traffic to be signed (but not
necessarily encrypted).
* *Encryption*. We will incorporate encryption over the wire.
* *Flexible handshaking*. The original protocol did not have a * *Flexible handshaking*. The original protocol did not have a
sufficiently flexible protocol negotiation that allows for features sufficiently flexible protocol negotiation that allows for features
that were not required. that were not required.
* *Encryption*. We will incorporate encryption over the wire.
* *Performance*. We would like to provide for protocol features * *Performance*. We would like to provide for protocol features
(e.g., padding) that keep computation and memory copies out of the (e.g., padding) that keep computation and memory copies out of the
fast path where possible. fast path where possible.
* *Signing*. We will allow for traffic to be signed (but not
necessarily encrypted). This may not be implemented in the initial version.
Definitions Definitions
----------- -----------
@@ -33,36 +29,25 @@ Definitions
* *entity*: a ceph entity instantiation, e.g. 'osd.0'. each entity * *entity*: a ceph entity instantiation, e.g. 'osd.0'. each entity
has one or more unique entity_addr_t's by virtue of the 'nonce' has one or more unique entity_addr_t's by virtue of the 'nonce'
field, which is typically a pid or random value. field, which is typically a pid or random value.
* *stream*: an exchange, passed over a connection, between two unique
entities. in the future multiple entities may coexist within the
same process.
* *session*: a stateful session between two entities in which message * *session*: a stateful session between two entities in which message
exchange is ordered and lossless. A session might span multiple exchange is ordered and lossless. A session might span multiple
connections (and streams) if there is an interruption (TCP connection connections if there is an interruption (TCP connection disconnect).
disconnect).
* *frame*: a discrete message sent between the peers. Each frame * *frame*: a discrete message sent between the peers. Each frame
consists of a tag (type code), stream id, payload, and (if signing consists of a tag (type code), payload, and (if signing
or encryption is enabled) some other fields. See below for the or encryption is enabled) some other fields. See below for the
structure. structure.
* *stream id*: a 32-bit value that uniquely identifies a stream within * *tag*: a type code associated with a frame. The tag
a given connection. the stream id implicitly instantiated when the send
sends a frame using that id.
* *tag*: a single-byte type code associated with a frame. The tag
determines the structure of the payload. determines the structure of the payload.
Phases Phases
------ ------
A connection has two distinct phases: A connection has four distinct phases:
#. banner #. banner
#. frame exchange for one or more strams #. authentication frame exchange
#. message flow handshake frame exchange
A stream has three distinct phases: #. message frame exchange
#. authentication
#. message flow handshake
#. message exchange
Banner Banner
------ ------
@@ -89,81 +74,60 @@ can disconnect.
|<-----------+ | |<-----------+ |
| | | |
Frame format and Stream establishment Frame format
------------------------------------- ------------
All further data sent or received is contained by a frame. Each frame has All further data sent or received is contained by a frame. Each frame has
the form:: the form::
stream_id (le32)
frame_len (le32) frame_len (le32)
tag (TAG_* byte) tag (TAG_* le32)
payload payload
[payload padding -- only present after stream auth phase] [payload padding -- only present after stream auth phase]
[signature -- only present after stream auth phase] [signature -- only present after stream auth phase]
* stream_id is generated by the client.
* frame_len includes everything after the frame_len le32 up to the end of the * frame_len includes everything after the frame_len le32 up to the end of the
frame (all payloads, signatures, and padding). frame (all payloads, signatures, and padding).
* The payload format and length is determined by the tag. * The payload format and length is determined by the tag.
* The signature portion is only present in a given stream if the * The signature portion is only present if the authentication phase
authentication phase has completed (TAG_AUTH_DONE has been sent) and has completed (TAG_AUTH_DONE has been sent) and signatures are
signatures are enabled. enabled.
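The new-side frame layout above (frame_len as le32, tag as le32, then payload) can be illustrated with a minimal sketch. It assumes the pre-auth case, so there is no padding or signature. The function names and the tag value are hypothetical and are not taken from the Ceph source.

```python
import struct

# Hypothetical tag value for illustration only; the real TAG_* numbering
# is not given in this diff.
TAG_EXAMPLE = 1


def encode_frame(tag: int, payload: bytes) -> bytes:
    """Build a pre-auth frame: frame_len (le32), tag (le32), payload."""
    # frame_len counts everything after the frame_len field itself,
    # i.e. the 4-byte tag plus the payload.
    frame_len = 4 + len(payload)
    return struct.pack("<II", frame_len, tag) + payload


def decode_frame(buf: bytes):
    """Return (tag, payload, remaining_bytes), or None if buf is incomplete."""
    if len(buf) < 4:
        return None
    (frame_len,) = struct.unpack_from("<I", buf, 0)
    if frame_len < 4:
        raise ValueError("frame_len too small to hold a tag")
    end = 4 + frame_len
    if len(buf) < end:
        return None
    (tag,) = struct.unpack_from("<I", buf, 4)
    return tag, buf[8:end], buf[end:]
```

Returning the unconsumed bytes lets a caller feed a growing receive buffer back into `decode_frame` until it returns None.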
A new stream is created when the client sends a frame with the following tag
message:
* TAG_NEW_STREAM (client only): starts a new stream::
__u8 my_type (CEPH_ENTITY_TYPE_*)
.. ditaa:: +---------+ +--------+
| Client | | Server |
+---------+ +--------+
| send new stream |
|------------------>|
| |
Authentication Authentication
-------------- --------------
* TAG_AUTH_SET_METHOD (client only): set auth method for this connection:: * TAG_AUTH_REQUEST: client->server::
__le32 method; __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
__le32 len;
- The selected auth method determines the sig_size and block_size in any method specific payload
subsequent messages (TAG_AUTH_DONE and non-auth messages).
* TAG_AUTH_BAD_METHOD (server only): reject client-selected auth method:: * TAG_AUTH_BAD_METHOD (server only): reject client-selected auth method::
__le32 method __le32 method
__le32 num_methods __le32 num_methods
__le32 allowed_methods[num_methods] // CEPH_AUTH_{NONE, CEPHX} __le32 allowed_methods[num_methods] // CEPH_AUTH_{NONE, CEPHX, ...}
- Returns the unsupported/forbidden method along with the list of allowed - Returns the unsupported/forbidden method along with the list of allowed
authentication methods. authentication methods.
* TAG_AUTH_REQUEST: client->server:: * TAG_AUTH_BAD_AUTH: server->client::
__le32 error code (e.g., EPERM, EACCESS)
__le32 len; __le32 len;
method specific payload error string;
* TAG_AUTH_REPLY: server->client::
__le32 len;
method specific payload
* TAG_AUTH_BAD_AUTH: server->client:
- Sent when the authentication fails - Sent when the authentication fails
* TAG_AUTH_MORE: server->client or client->server::
* TAG_AUTH_DONE:: __le32 len;
method specific payload
* TAG_AUTH_DONE: (server->client)::
confounder (block_size bytes of random garbage) confounder (block_size bytes of random garbage)
__le64 flags __le64 flags
@@ -171,8 +135,7 @@ Authentication
FLAG_SIGNED 2 FLAG_SIGNED 2
signature signature
- The client first says AUTH_DONE, and the server replies to - The server is the one to decide authentication has completed.
acknowledge it.
Example of authentication phase interaction when the client uses an Example of authentication phase interaction when the client uses an
@@ -181,17 +144,15 @@ allowed authentication method:
.. ditaa:: +---------+ +--------+ .. ditaa:: +---------+ +--------+
| Client | | Server | | Client | | Server |
+---------+ +--------+ +---------+ +--------+
| set method |
|---------------->|
| auth request | | auth request |
|---------------->| |---------------->|
|<----------------| |<----------------|
| auth reply| | auth more|
| | | |
| auth done | |auth more |
|---------------->| |---------------->|
|<----------------| |<----------------|
| auth done ack | | auth done|
Example of authentication phase interaction when the client uses a forbidden Example of authentication phase interaction when the client uses a forbidden
@@ -200,45 +161,42 @@ authentication method as the first attempt:
.. ditaa:: +---------+ +--------+ .. ditaa:: +---------+ +--------+
| Client | | Server | | Client | | Server |
+---------+ +--------+ +---------+ +--------+
| set method |
|---------------->|
| +---|
| auth request| |
|-------------+-->|
| | |
|<------------+ |
| bad method |
| |
| set method |
|---------------->|
| auth request | | auth request |
|---------------->| |---------------->|
|<----------------| |<----------------|
| auth reply| | bad method |
| | | |
| auth done | | auth request |
|---------------->| |---------------->|
|<----------------| |<----------------|
| auth done ack | | auth more|
| |
| auth more |
|---------------->|
|<----------------|
| auth done|
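The two new-side exchanges above (auth request, optional bad method, auth more in both directions, then a server-sent auth done) can be modeled as a small client-side state machine. This is only a sketch under stated assumptions: the `Tag` enum, the `AuthClient` class, and the string method names are hypothetical, wire encoding is not modeled, and the method-specific payloads are left empty.

```python
from enum import Enum, auto


class Tag(Enum):
    # Symbolic stand-ins for the TAG_AUTH_* frames; wire values are not modeled.
    AUTH_REQUEST = auto()
    AUTH_BAD_METHOD = auto()
    AUTH_BAD_AUTH = auto()
    AUTH_MORE = auto()
    AUTH_DONE = auto()


class AuthClient:
    """Hypothetical client-side driver for the authentication phase."""

    def __init__(self, preferred_methods):
        self.preferred = list(preferred_methods)
        self.method = self.preferred[0]
        self.done = False

    def start(self):
        """First frame of the phase: request the most preferred method."""
        return (Tag.AUTH_REQUEST, self.method)

    def handle(self, tag, body=None):
        """Consume one server frame; return the next frame to send, or None."""
        if tag is Tag.AUTH_BAD_METHOD:
            # body is the server's allowed_methods list.
            usable = [m for m in self.preferred if m in body]
            if not usable:
                raise RuntimeError("no mutually acceptable auth method")
            self.method = usable[0]
            return (Tag.AUTH_REQUEST, self.method)
        if tag is Tag.AUTH_MORE:
            # A real client would compute a method-specific payload here.
            return (Tag.AUTH_MORE, b"")
        if tag is Tag.AUTH_DONE:
            # The server is the one to decide authentication has completed.
            self.done = True
            return None
        if tag is Tag.AUTH_BAD_AUTH:
            raise PermissionError(body)
        raise ValueError(f"unexpected tag during auth: {tag}")
```

Walking the second diagram through this sketch: `start()` yields an auth request, a bad method reply triggers a second auth request with an allowed method, each auth more is answered with auth more, and auth done ends the phase without a client acknowledgement.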
Message frame format Post-auth frame format
-------------------- ----------------------
The frame format is fixed (see above), but can take three different The frame format is fixed (see above), but can take three different
forms, depending on the AUTH_DONE flags: forms, depending on the AUTH_DONE flags:
* If neither FLAG_SIGNED or FLAG_ENCRYPTED is specified, things are simple:: * If neither FLAG_SIGNED or FLAG_ENCRYPTED is specified, things are simple::
stream_id
frame_len frame_len
tag tag
payload payload
payload_padding (out to auth block_size) payload_padding (out to auth block_size)
- The padding is some number of bytes < the auth block_size that
brings the total length of the payload + payload_padding to a
multiple of block_size. It does not include the frame_len or tag. Padding
content can be zeros or (better) random bytes.
* If FLAG_SIGNED has been specified:: * If FLAG_SIGNED has been specified::
stream_id
frame_len frame_len
tag tag
payload payload
@@ -252,10 +210,9 @@ forms, depending on the AUTH_DONE flags:
* If FLAG_ENCRYPTED has been specified:: * If FLAG_ENCRYPTED has been specified::
stream_id
frame_len frame_len
tag
{ {
payload_sig_length
payload payload
payload_padding (out to auth block_size) payload_padding (out to auth block_size)
} ^ stream cipher } ^ stream cipher
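The padding rule added for the unsigned form in this section (fewer than block_size bytes, bringing payload plus payload_padding to a multiple of block_size, with frame_len and tag excluded) can be sketched as follows. The helper names are hypothetical, and random padding is used because the text calls it the better choice.

```python
import os


def padding_len(payload_len: int, block_size: int) -> int:
    """Bytes needed to round payload_len up to a multiple of block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Python's % returns a non-negative result for a positive modulus,
    # so the value is always in the range [0, block_size).
    return -payload_len % block_size


def pad_payload(payload: bytes, block_size: int) -> bytes:
    """Append random padding so the result is a multiple of block_size."""
    return payload + os.urandom(padding_len(len(payload), block_size))
```

A payload that is already block-aligned gets zero bytes of padding, which is consistent with the "some number of bytes < the auth block_size" wording.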
@@ -275,21 +232,31 @@ an established session.
entity_addrvec_t addr(s) entity_addrvec_t addr(s)
__u8 my type (CEPH_ENTITY_TYPE_*) __u8 my type (CEPH_ENTITY_TYPE_*)
__le32 protocol version __le64 gid (numeric part of osd.0, client.123456, ...)
__le64 features supported (CEPH_FEATURE_* bitmask) __le64 features supported (CEPH_FEATURE_* bitmask)
__le64 features required (CEPH_FEATURE_* bitmask) __le64 features required (CEPH_FEATURE_* bitmask)
__le64 flags (CEPH_MSG_CONNECT_* bitmask) __le64 flags (CEPH_MSG_CONNECT_* bitmask)
__le64 cookie (a client identifier, assigned by the sender. unique on the sender.) __le64 cookie (a client identifier, assigned by the sender. unique on the sender.)
- client will send first, server will reply with same. - client will send first, server will reply with same. if this is a
new session, the client and server can proceed to the message exchange.
- type.gid (entity_name_t) is set here. this means we don't need it
in the header of every message. it also means that we can't send
messages "from" other entity_name_t's. the current
implementations set this at the top of _send_message etc so this
shouldn't break any existing functionality. implementation will
likely want to mask this against what the authenticated credential
allows.
- we've dropped the 'protocol_version' field from msgr1
- for lossy sessions, cookie is meaningless. for lossless sessions,
we assign a local value that identifies the local Connection
state. when we receive this from a peer, we make a note of their
cookie, so that on reconnect we can reattach (see below).
* TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT with too few features:: * TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT
with too few features::
__le64 features we require that peer didn't advertise __le64 features we require that the peer didn't advertise
* TAG_IDENT_BAD_PROTOCOL (server only): complain about an old protocol version::
__le32 protocol_version (our protocol version)
* TAG_RECONNECT (client only): reconnect to an established session:: * TAG_RECONNECT (client only): reconnect to an established session::
@@ -302,6 +269,9 @@ an established session.
__le64 msg_seq (last msg seq received) __le64 msg_seq (last msg seq received)
- once the client receives this, the client can proceed to message exchange.
- once the server sends this, the server can proceed to message exchange.
* TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq * TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq
* TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq * TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq
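For the TAG_IDENT_MISSING_FEATURES reply described earlier in this handshake section, the payload is the set of feature bits we require that the peer did not advertise. A minimal sketch of that bitmask check, assuming 64-bit feature masks as in the `__le64` fields; the function name and the example bit values are hypothetical.

```python
U64_MASK = 0xFFFFFFFFFFFFFFFF  # feature fields are __le64 bitmasks


def missing_features(required_by_us: int, supported_by_peer: int) -> int:
    """Bits we require that the peer did not advertise (0 means compatible)."""
    return required_by_us & ~supported_by_peer & U64_MASK
```

A server following this sketch would answer a TAG_IDENT with TAG_IDENT_MISSING_FEATURES only when the result is non-zero.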
@@ -315,17 +285,24 @@ an established session.
Message exchange Message exchange
---------------- ----------------
Once a session is stablished, we can exchange messages. Once a session is established, we can exchange messages.
* TAG_MSG: a message:: * TAG_MSG: a message::
ceph_msg_header2 ceph_msg_header2
front front
middle middle
data_pre_padding
data data
- The ceph_msg_header is modified in ceph_msg_header2 to include an - The ceph_msg_header2 is modified from ceph_msg_header:
ack_seq. This avoids the need for a TAG_ACK message most of the time. * include an ack_seq. This avoids the need for a TAG_ACK
message most of the time.
* remove the src field, which we now get from the message flow
handshake (TAG_IDENT).
* specifies the data_pre_padding length, which can be used to
adjust the alignment of the data payload. (NOTE: is this is
useful?)
* TAG_ACK: acknowledge receipt of message(s):: * TAG_ACK: acknowledge receipt of message(s)::
@@ -345,14 +322,12 @@ Once a session is stablished, we can exchange messages.
- Time stamp is from the TAG_KEEPALIVE2 we are responding to. - Time stamp is from the TAG_KEEPALIVE2 we are responding to.
* TAG_CLOSE: terminate a stream * TAG_CLOSE: terminate a connection
Indicates that a stream should be terminated. This is equivalent to Indicates that a connection should be terminated. This is equivalent
a hangup or reset (i.e., should trigger ms_handle_reset). It isn't to a hangup or reset (i.e., should trigger ms_handle_reset). It
strictly necessary or useful if there is only a single stream as we isn't strictly necessary or useful as we could just disconnect the
could just disconnect the TCP connection, although one could TCP connection.
certainly use it creatively (e.g., reset the stream state and retry
an authentication handshake).
Example of protocol interaction (WIP) Example of protocol interaction (WIP)
@@ -371,26 +346,20 @@ _____________________________________
| | | |
| send new stream | | send new stream |
|------------------>| |------------------>|
| set method |
|------------------>|
| +-----|
| auth request| |
|-------------+---->|
| | |
|<------------+ |
| bad method |
| |
| set method |
|------------------>|
| auth request | | auth request |
|------------------>| |------------------>|
|<------------------| |<------------------|
| auth reply | | bad method |
| | | |
| auth done | | auth request |
|------------------>| |------------------>|
|<------------------| |<------------------|
| auth done ack | | auth more |
| |
| auth more |
|------------------>|
|<------------------|
| auth done |
| | | |