diff --git a/doc/dev/msgr2.rst b/doc/dev/msgr2.rst
index bedb4e0fed5..83ec7c0b409 100644
--- a/doc/dev/msgr2.rst
+++ b/doc/dev/msgr2.rst
@@ -10,19 +10,15 @@ Goals
 This protocol revision has several goals relative to the original
 protocol:
 
-* *Multiplexing*. We will have multiple server entities (e.g.,
-  multiple OSDs and clients) coexisting in the same process. We would
-  like to share the transport connection (e.g., TCP socket) whenever
-  possible.
-* *Signing*. We will allow for traffic to be signed (but not
-  necessarily encrypted).
-* *Encryption*. We will incorporate encryption over the wire.
 * *Flexible handshaking*. The original protocol did not have a
   sufficiently flexible protocol negotiation that allows for features
   that were not required.
+* *Encryption*. We will incorporate encryption over the wire.
 * *Performance*. We would like to provide for protocol features
   (e.g., padding) that keep computation and memory copies out of the
   fast path where possible.
+* *Signing*. We will allow for traffic to be signed (but not
+  necessarily encrypted). This may not be implemented in the initial version.
 
 Definitions
 -----------
@@ -33,36 +29,25 @@ Definitions
 * *entity*: a ceph entity instantiation, e.g. 'osd.0'. each entity
   has one or more unique entity_addr_t's by virtue of the 'nonce'
   field, which is typically a pid or random value.
-* *stream*: an exchange, passed over a connection, between two unique
-  entities. in the future multiple entities may coexist within the
-  same process.
 * *session*: a stateful session between two entities in which message
   exchange is ordered and lossless. A session might span multiple
-  connections (and streams) if there is an interruption (TCP connection
-  disconnect).
+  connections if there is an interruption (TCP connection disconnect).
 * *frame*: a discrete message sent between the peers. Each frame
-  consists of a tag (type code), stream id, payload, and (if signing
+  consists of a tag (type code), payload, and (if signing
   or encryption is enabled) some other fields. See below for the
   structure.
-* *stream id*: a 32-bit value that uniquely identifies a stream within
-  a given connection. the stream id implicitly instantiated when the send
-  sends a frame using that id.
-* *tag*: a single-byte type code associated with a frame. The tag
+* *tag*: a type code associated with a frame. The tag
   determines the structure of the payload.
 
 Phases
 ------
 
-A connection has two distinct phases:
+A connection has four distinct phases:
 
 #. banner
-#. frame exchange for one or more strams
-
-A stream has three distinct phases:
-
-#. authentication
-#. message flow handshake
-#. message exchange
+#. authentication frame exchange
+#. message flow handshake frame exchange
+#. message frame exchange
 
 Banner
 ------
@@ -89,81 +74,60 @@ can disconnect.
                 |<-----------+    |
                 |                 |
 
-Frame format and Stream establishment
--------------------------------------
+Frame format
+------------
 
 All further data sent or received is contained by a frame. Each frame has
 the form::
 
-  stream_id (le32)
   frame_len (le32)
-  tag (TAG_* byte)
+  tag (TAG_* le32)
   payload
   [payload padding -- only present after stream auth phase]
   [signature -- only present after stream auth phase]
 
-* stream_id is generated by the client.
-
 * frame_len includes everything after the frame_len le32 up to the end of the
   frame (all payloads, signatures, and padding).
 
 * The payload format and length is determined by the tag.
 
-* The signature portion is only present in a given stream if the
-  authentication phase has completed (TAG_AUTH_DONE has been sent) and
-  signatures are enabled.
-
-A new stream is created when the client sends a frame with the following tag
-message:
-
-* TAG_NEW_STREAM (client only): starts a new stream::
-
-  __u8 my_type (CEPH_ENTITY_TYPE_*)
-
-
-.. ditaa:: +---------+          +--------+
-           | Client  |          | Server |
-           +---------+          +--------+
-                | send new stream   |
-                |------------------>|
-                |                   |
+* The signature portion is only present if the authentication phase
+  has completed (TAG_AUTH_DONE has been sent) and signatures are
+  enabled.
 
 
 Authentication
 --------------
 
-* TAG_AUTH_SET_METHOD (client only): set auth method for this connection::
+* TAG_AUTH_REQUEST: client->server::
 
-    __le32 method;
-
-  - The selected auth method determines the sig_size and block_size in any
-    subsequent messages (TAG_AUTH_DONE and non-auth messages).
+    __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
+    __le32 len;
+    method specific payload
 
 * TAG_AUTH_BAD_METHOD (server only): reject client-selected auth method::
 
     __le32 method
     __le32 num_methods
-    __le32 allowed_methods[num_methods] // CEPH_AUTH_{NONE, CEPHX}
+    __le32 allowed_methods[num_methods] // CEPH_AUTH_{NONE, CEPHX, ...}
 
   - Returns the unsupported/forbidden method along with the list of allowed
     authentication methods.
 
-* TAG_AUTH_REQUEST: client->server::
+* TAG_AUTH_BAD_AUTH: server->client::
 
+    __le32 error code (e.g., EPERM, EACCESS)
     __le32 len;
-    method specific payload
-
-* TAG_AUTH_REPLY: server->client::
-
-    __le32 len;
-    method specific payload
-
-* TAG_AUTH_BAD_AUTH: server->client:
+    error string;
 
   - Sent when the authentication fails
 
+* TAG_AUTH_MORE: server->client or client->server::
 
-* TAG_AUTH_DONE::
+    __le32 len;
+    method specific payload
+
+* TAG_AUTH_DONE: (server->client)::
 
     confounder (block_size bytes of random garbage)
     __le64 flags
@@ -171,8 +135,7 @@ Authentication
       FLAG_SIGNED 2
     signature
 
-  - The client first says AUTH_DONE, and the server replies to
-    acknowledge it.
+  - The server is the one to decide authentication has completed.
 
 
 Example of authentication phase interaction when the client uses an
@@ -181,17 +144,15 @@ allowed authentication method:
 .. ditaa:: +---------+        +--------+
            | Client  |        | Server |
            +---------+        +--------+
-                | set method      |
-                |---------------->|
                 | auth request    |
                 |---------------->|
                 |<----------------|
-                |       auth reply|
+                |        auth more|
                 |                 |
-                | auth done       |
+                |auth more        |
                 |---------------->|
                 |<----------------|
-                | auth done ack   |
+                |        auth done|
 
 
 Example of authentication phase interaction when the client uses a forbidden
@@ -200,45 +161,42 @@ authentication method as the first attempt:
 .. ditaa:: +---------+        +--------+
            | Client  |        | Server |
            +---------+        +--------+
-                | set method      |
-                |---------------->|
-                |             +---|
-                | auth request|   |
-                |-------------+-->|
-                |             |   |
-                |<------------+   |
-                | bad method      |
-                |                 |
-                | set method      |
-                |---------------->|
                 | auth request    |
                 |---------------->|
                 |<----------------|
-                |       auth reply|
+                | bad method      |
                 |                 |
-                | auth done       |
+                | auth request    |
                 |---------------->|
                 |<----------------|
-                | auth done ack   |
+                |        auth more|
+                |                 |
+                | auth more       |
+                |---------------->|
+                |<----------------|
+                |        auth done|
 
 
-Message frame format
---------------------
+Post-auth frame format
+----------------------
 
 The frame format is fixed (see above), but can take three different
 forms, depending on the AUTH_DONE flags:
 
 * If neither FLAG_SIGNED or FLAG_ENCRYPTED is specified, things are simple::
 
-    stream_id
     frame_len
     tag
     payload
     payload_padding (out to auth block_size)
 
+  - The padding is some number of bytes < the auth block_size that
+    brings the total length of the payload + payload_padding to a
+    multiple of block_size. It does not include the frame_len or tag. Padding
+    content can be zeros or (better) random bytes.
+
 * If FLAG_SIGNED has been specified::
 
-    stream_id
     frame_len
     tag
     payload
@@ -252,10 +210,9 @@ forms, depending on the AUTH_DONE flags:
 
 * If FLAG_ENCRYPTED has been specified::
 
-    stream_id
     frame_len
+    tag
     {
-      payload_sig_length
       payload
       payload_padding (out to auth block_size)
     } ^ stream cipher
@@ -275,21 +232,31 @@ an established session.
 
     entity_addrvec_t addr(s)
     __u8 my type (CEPH_ENTITY_TYPE_*)
-    __le32 protocol version
+    __le64 gid (numeric part of osd.0, client.123456, ...)
     __le64 features supported (CEPH_FEATURE_* bitmask)
     __le64 features required (CEPH_FEATURE_* bitmask)
     __le64 flags (CEPH_MSG_CONNECT_* bitmask)
     __le64 cookie (a client identifier, assigned by the sender. unique on the sender.)
 
-  - client will send first, server will reply with same.
+  - client will send first, server will reply with same. if this is a
+    new session, the client and server can proceed to the message exchange.
+  - type.gid (entity_name_t) is set here. this means we don't need it
+    in the header of every message. it also means that we can't send
+    messages "from" other entity_name_t's. the current
+    implementations set this at the top of _send_message etc so this
+    shouldn't break any existing functionality. implementation will
+    likely want to mask this against what the authenticated credential
+    allows.
+  - we've dropped the 'protocol_version' field from msgr1
+  - for lossy sessions, cookie is meaningless. for lossless sessions,
+    we assign a local value that identifies the local Connection
+    state. when we receive this from a peer, we make a note of their
+    cookie, so that on reconnect we can reattach (see below).
 
-* TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT with too few features::
+* TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT
+  with too few features::
 
-    __le64 features we require that peer didn't advertise
-
-* TAG_IDENT_BAD_PROTOCOL (server only): complain about an old protocol version::
-
-    __le32 protocol_version (our protocol version)
+    __le64 features we require that the peer didn't advertise
 
 * TAG_RECONNECT (client only): reconnect to an established session::
 
@@ -302,6 +269,9 @@ an established session.
 
     __le64 msg_seq (last msg seq received)
 
+  - once the client receives this, the client can proceed to message exchange.
+  - once the server sends this, the server can proceed to message exchange.
+
 * TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq
 
 * TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq
@@ -315,17 +285,24 @@ an established session.
 Message exchange
 ----------------
 
-Once a session is stablished, we can exchange messages.
+Once a session is established, we can exchange messages.
 
 * TAG_MSG: a message::
 
     ceph_msg_header2
     front
     middle
+    data_pre_padding
     data
 
-  - The ceph_msg_header is modified in ceph_msg_header2 to include an
-    ack_seq. This avoids the need for a TAG_ACK message most of the time.
+  - The ceph_msg_header2 is modified from ceph_msg_header:
+      * include an ack_seq. This avoids the need for a TAG_ACK
+        message most of the time.
+      * remove the src field, which we now get from the message flow
+        handshake (TAG_IDENT).
+      * specifies the data_pre_padding length, which can be used to
+        adjust the alignment of the data payload. (NOTE: is this is
+        useful?)
 
 * TAG_ACK: acknowledge receipt of message(s)::
 
@@ -345,14 +322,12 @@ Once a session is stablished, we can exchange messages.
 
   - Time stamp is from the TAG_KEEPALIVE2 we are responding to.
 
-* TAG_CLOSE: terminate a stream
+* TAG_CLOSE: terminate a connection
 
-  Indicates that a stream should be terminated. This is equivalent to
-  a hangup or reset (i.e., should trigger ms_handle_reset). It isn't
-  strictly necessary or useful if there is only a single stream as we
-  could just disconnect the TCP connection, although one could
-  certainly use it creatively (e.g., reset the stream state and retry
-  an authentication handshake).
+  Indicates that a connection should be terminated. This is equivalent
+  to a hangup or reset (i.e., should trigger ms_handle_reset). It
+  isn't strictly necessary or useful as we could just disconnect the
+  TCP connection.
 
 
 Example of protocol interaction (WIP)
@@ -371,26 +346,20 @@ _____________________________________
                 |                   |
                 | send new stream   |
                 |------------------>|
-                | set method        |
-                |------------------>|
-                |             +-----|
-                | auth request|     |
-                |-------------+---->|
-                |             |     |
-                |<------------+     |
-                | bad method        |
-                |                   |
-                | set method        |
-                |------------------>|
                 | auth request      |
                 |------------------>|
                 |<------------------|
-                | auth reply        |
+                | bad method        |
                 |                   |
-                | auth done         |
+                | auth request      |
                 |------------------>|
                 |<------------------|
-                | auth done ack     |
+                | auth more         |
+                |                   |
+                | auth more         |
+                |------------------>|
+                |<------------------|
+                | auth done         |
                 |                   |