Mirror of https://github.com/ceph/ceph (synced 2025-03-21 01:38:15 +00:00)
Merge PR #23965 into master

* refs/pull/23965/head:
  doc/dev/msgr2: better formatting
  doc/dev/msgr2: clarify padding alignment
  doc/dev/msgr2: tweak message flow handshake
  doc/dev/msgr2: remove stream concept, streamline auth

Reviewed-by: Ricardo Dias <rdias@suse.com>
commit 3a34c9ee38
@@ -10,19 +10,15 @@ Goals

This protocol revision has several goals relative to the original protocol:

* *Multiplexing*. We will have multiple server entities (e.g.,
  multiple OSDs and clients) coexisting in the same process. We would
  like to share the transport connection (e.g., TCP socket) whenever
  possible.
* *Signing*. We will allow for traffic to be signed (but not
  necessarily encrypted).
* *Encryption*. We will incorporate encryption over the wire.
* *Flexible handshaking*. The original protocol did not have a
  sufficiently flexible protocol negotiation that allows for features
  that were not required.
* *Encryption*. We will incorporate encryption over the wire.
* *Performance*. We would like to provide for protocol features
  (e.g., padding) that keep computation and memory copies out of the
  fast path where possible.
* *Signing*. We will allow for traffic to be signed (but not
  necessarily encrypted). This may not be implemented in the initial version.

Definitions
-----------
@@ -33,36 +29,25 @@ Definitions
* *entity*: a ceph entity instantiation, e.g. 'osd.0'. each entity
  has one or more unique entity_addr_t's by virtue of the 'nonce'
  field, which is typically a pid or random value.
* *stream*: an exchange, passed over a connection, between two unique
  entities. in the future multiple entities may coexist within the
  same process.
* *session*: a stateful session between two entities in which message
  exchange is ordered and lossless. A session might span multiple
  connections (and streams) if there is an interruption (TCP connection
  disconnect).
  connections if there is an interruption (TCP connection disconnect).
* *frame*: a discrete message sent between the peers. Each frame
  consists of a tag (type code), stream id, payload, and (if signing
  consists of a tag (type code), payload, and (if signing
  or encryption is enabled) some other fields. See below for the
  structure.
* *stream id*: a 32-bit value that uniquely identifies a stream within
  a given connection. the stream id implicitly instantiated when the send
  sends a frame using that id.
* *tag*: a single-byte type code associated with a frame. The tag
* *tag*: a type code associated with a frame. The tag
  determines the structure of the payload.

Phases
------

A connection has two distinct phases:
A connection has four distinct phases:

#. banner
#. frame exchange for one or more strams

A stream has three distinct phases:

#. authentication
#. message flow handshake
#. message exchange
#. authentication frame exchange
#. message flow handshake frame exchange
#. message frame exchange

Banner
------
@@ -89,81 +74,60 @@ can disconnect.
   |<-----------+ |
   | |

Frame format and Stream establishment
-------------------------------------
Frame format
------------

All further data sent or received is contained by a frame. Each frame has
the form::

    stream_id (le32)
    frame_len (le32)
    tag (TAG_* byte)
    tag (TAG_* le32)
    payload
    [payload padding -- only present after stream auth phase]
    [signature -- only present after stream auth phase]

* stream_id is generated by the client.

* frame_len includes everything after the frame_len le32 up to the end of the
  frame (all payloads, signatures, and padding).

* The payload format and length is determined by the tag.

* The signature portion is only present in a given stream if the
  authentication phase has completed (TAG_AUTH_DONE has been sent) and
  signatures are enabled.

A new stream is created when the client sends a frame with the following tag
message:

* TAG_NEW_STREAM (client only): starts a new stream::

    __u8 my_type (CEPH_ENTITY_TYPE_*)


.. ditaa:: +---------+ +--------+
   | Client | | Server |
   +---------+ +--------+
   | send new stream |
   |------------------>|
   | |
* The signature portion is only present if the authentication phase
  has completed (TAG_AUTH_DONE has been sent) and signatures are
  enabled.


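The revised frame layout from the hunk above (``frame_len`` as le32, ``tag`` as le32, then the payload, with no ``stream_id``) can be sketched in a few lines. This is only an illustration of the layout as described, not Ceph's implementation: the function names are hypothetical, and padding and signatures (which only appear after authentication) are left out.

```python
import struct


def encode_frame(tag: int, payload: bytes) -> bytes:
    """Pack frame_len (le32), tag (le32), payload.

    frame_len counts everything after the frame_len field itself,
    i.e. the 4-byte tag plus the payload.
    """
    body = struct.pack("<I", tag) + payload
    return struct.pack("<I", len(body)) + body


def decode_frame(buf: bytes):
    """Return (tag, payload, remaining), or None if no complete frame is buffered."""
    if len(buf) < 4:
        return None
    (frame_len,) = struct.unpack_from("<I", buf, 0)
    if frame_len < 4:
        raise ValueError("frame_len too small to hold a tag")
    end = 4 + frame_len
    if len(buf) < end:
        return None  # wait for more bytes from the socket
    (tag,) = struct.unpack_from("<I", buf, 4)
    return tag, buf[8:end], buf[end:]
```

Because ``frame_len`` covers the whole remainder of the frame, a receiver can skip frames whose tag it does not understand without parsing the payload.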
Authentication
--------------

* TAG_AUTH_SET_METHOD (client only): set auth method for this connection::
* TAG_AUTH_REQUEST: client->server::

    __le32 method;

  - The selected auth method determines the sig_size and block_size in any
    subsequent messages (TAG_AUTH_DONE and non-auth messages).
    __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
    __le32 len;
    method specific payload

* TAG_AUTH_BAD_METHOD (server only): reject client-selected auth method::

    __le32 method
    __le32 num_methods
    __le32 allowed_methods[num_methods] // CEPH_AUTH_{NONE, CEPHX}
    __le32 allowed_methods[num_methods] // CEPH_AUTH_{NONE, CEPHX, ...}

  - Returns the unsupported/forbidden method along with the list of allowed
    authentication methods.

* TAG_AUTH_REQUEST: client->server::
* TAG_AUTH_BAD_AUTH: server->client::

    __le32 error code (e.g., EPERM, EACCESS)
    __le32 len;
    method specific payload

* TAG_AUTH_REPLY: server->client::

    __le32 len;
    method specific payload

* TAG_AUTH_BAD_AUTH: server->client:
    error string;

  - Sent when the authentication fails

* TAG_AUTH_MORE: server->client or client->server::

* TAG_AUTH_DONE::
    __le32 len;
    method specific payload

* TAG_AUTH_DONE: (server->client)::

    confounder (block_size bytes of random garbage)
    __le64 flags
@@ -171,8 +135,7 @@ Authentication
      FLAG_SIGNED 2
    signature

  - The client first says AUTH_DONE, and the server replies to
    acknowledge it.
  - The server is the one to decide authentication has completed.


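As a rough illustration of the revised TAG_AUTH_REQUEST and TAG_AUTH_BAD_METHOD payloads listed above, the sketch below packs a request and parses a rejection so the client can retry with an allowed method. The helper names and the numeric values of the ``CEPH_AUTH_*`` constants are assumptions made for the example, and the method-specific payload is treated as opaque bytes.

```python
import struct

# Assumed numeric values, for illustration only.
CEPH_AUTH_NONE = 1
CEPH_AUTH_CEPHX = 2


def encode_auth_request(method: int, auth_payload: bytes) -> bytes:
    """TAG_AUTH_REQUEST payload: method (le32), len (le32), method-specific bytes."""
    return struct.pack("<II", method, len(auth_payload)) + auth_payload


def encode_auth_bad_method(method: int, allowed: list) -> bytes:
    """TAG_AUTH_BAD_METHOD payload: method, num_methods, allowed_methods[]."""
    head = struct.pack("<II", method, len(allowed))
    return head + struct.pack(f"<{len(allowed)}I", *allowed)


def decode_auth_bad_method(payload: bytes):
    """Return (rejected_method, allowed_methods)."""
    method, num_methods = struct.unpack_from("<II", payload, 0)
    allowed = list(struct.unpack_from(f"<{num_methods}I", payload, 8))
    return method, allowed


def pick_method(preferred: list, allowed: list):
    """Return the first locally preferred method the server allows, else None."""
    for method in preferred:
        if method in allowed:
            return method
    return None
```

A client that receives TAG_AUTH_BAD_METHOD could call ``pick_method`` with its own preference order and send a fresh TAG_AUTH_REQUEST, which matches the "forbidden method" exchange shown in the diagrams that follow.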
Example of authentication phase interaction when the client uses an
@@ -181,17 +144,15 @@ allowed authentication method:
.. ditaa:: +---------+ +--------+
   | Client | | Server |
   +---------+ +--------+
   | set method |
   |---------------->|
   | auth request |
   |---------------->|
   |<----------------|
   | auth reply|
   | auth more|
   | |
   | auth done |
   |auth more |
   |---------------->|
   |<----------------|
   | auth done ack |
   | auth done|


Example of authentication phase interaction when the client uses a forbidden
@@ -200,45 +161,42 @@ authentication method as the first attempt:
.. ditaa:: +---------+ +--------+
   | Client | | Server |
   +---------+ +--------+
   | set method |
   |---------------->|
   | +---|
   | auth request| |
   |-------------+-->|
   | | |
   |<------------+ |
   | bad method |
   | |
   | set method |
   |---------------->|
   | auth request |
   |---------------->|
   |<----------------|
   | auth reply|
   | bad method |
   | |
   | auth done |
   | auth request |
   |---------------->|
   |<----------------|
   | auth done ack |
   | auth more|
   | |
   | auth more |
   |---------------->|
   |<----------------|
   | auth done|


Message frame format
--------------------
Post-auth frame format
----------------------

The frame format is fixed (see above), but can take three different
forms, depending on the AUTH_DONE flags:

* If neither FLAG_SIGNED or FLAG_ENCRYPTED is specified, things are simple::

    stream_id
    frame_len
    tag
    payload
    payload_padding (out to auth block_size)

  - The padding is some number of bytes < the auth block_size that
    brings the total length of the payload + payload_padding to a
    multiple of block_size. It does not include the frame_len or tag. Padding
    content can be zeros or (better) random bytes.

* If FLAG_SIGNED has been specified::

    stream_id
    frame_len
    tag
    payload
@@ -252,10 +210,9 @@ forms, depending on the AUTH_DONE flags:

* If FLAG_ENCRYPTED has been specified::

    stream_id
    frame_len
    tag
    {
      payload_sig_length
      payload
      payload_padding (out to auth block_size)
    } ^ stream cipher
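The padding rule described for the post-auth frame forms (fewer than ``block_size`` bytes, bringing payload plus padding to a multiple of ``block_size``, not counting ``frame_len`` or ``tag``) reduces to one modular expression. A minimal sketch, with hypothetical function names and random bytes used for the padding content as the text suggests:

```python
import os


def padding_len(payload_len: int, block_size: int) -> int:
    """Bytes of padding needed so payload_len + padding is a multiple of block_size.

    The result is always in the range [0, block_size).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return (-payload_len) % block_size


def pad_payload(payload: bytes, block_size: int) -> bytes:
    """Append random padding bytes; frame_len and tag are not part of the count."""
    return payload + os.urandom(padding_len(len(payload), block_size))
```

For example, a 5-byte payload with a 16-byte block size gets 11 padding bytes, while a payload that is already block-aligned gets none.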
@@ -275,21 +232,31 @@ an established session.

    entity_addrvec_t addr(s)
    __u8 my type (CEPH_ENTITY_TYPE_*)
    __le32 protocol version
    __le64 gid (numeric part of osd.0, client.123456, ...)
    __le64 features supported (CEPH_FEATURE_* bitmask)
    __le64 features required (CEPH_FEATURE_* bitmask)
    __le64 flags (CEPH_MSG_CONNECT_* bitmask)
    __le64 cookie (a client identifier, assigned by the sender. unique on the sender.)

  - client will send first, server will reply with same.
  - client will send first, server will reply with same. if this is a
    new session, the client and server can proceed to the message exchange.
  - type.gid (entity_name_t) is set here. this means we don't need it
    in the header of every message. it also means that we can't send
    messages "from" other entity_name_t's. the current
    implementations set this at the top of _send_message etc so this
    shouldn't break any existing functionality. implementation will
    likely want to mask this against what the authenticated credential
    allows.
  - we've dropped the 'protocol_version' field from msgr1
  - for lossy sessions, cookie is meaningless. for lossless sessions,
    we assign a local value that identifies the local Connection
    state. when we receive this from a peer, we make a note of their
    cookie, so that on reconnect we can reattach (see below).

* TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT with too few features::
* TAG_IDENT_MISSING_FEATURES (server only): complain about a TAG_IDENT
  with too few features::

    __le64 features we require that peer didn't advertise

* TAG_IDENT_BAD_PROTOCOL (server only): complain about an old protocol version::

    __le32 protocol_version (our protocol version)
    __le64 features we require that the peer didn't advertise

* TAG_RECONNECT (client only): reconnect to an established session::

@@ -302,6 +269,9 @@ an established session.

    __le64 msg_seq (last msg seq received)

  - once the client receives this, the client can proceed to message exchange.
  - once the server sends this, the server can proceed to message exchange.

* TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq

* TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq
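The TAG_IDENT_MISSING_FEATURES payload in the hunks above carries the features the server requires that the peer did not advertise. One plausible way to compute that 64-bit mask is sketched below; the function names and the example bit values are hypothetical, not real ``CEPH_FEATURE_*`` definitions.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def missing_features(required: int, advertised: int) -> int:
    """Bits set in `required` that are absent from `advertised`, as a 64-bit mask."""
    return required & ~advertised & MASK64


def encode_ident_missing_features(required: int, advertised: int) -> bytes:
    """TAG_IDENT_MISSING_FEATURES payload: a single __le64 bitmask."""
    return struct.pack("<Q", missing_features(required, advertised))
```

A server would only send this frame when the mask is non-zero; a zero result means the peer's TAG_IDENT satisfied every required feature.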
@@ -315,17 +285,24 @@ an established session.
Message exchange
----------------

Once a session is stablished, we can exchange messages.
Once a session is established, we can exchange messages.

* TAG_MSG: a message::

    ceph_msg_header2
    front
    middle
    data_pre_padding
    data

  - The ceph_msg_header is modified in ceph_msg_header2 to include an
    ack_seq. This avoids the need for a TAG_ACK message most of the time.
  - The ceph_msg_header2 is modified from ceph_msg_header:
      * include an ack_seq. This avoids the need for a TAG_ACK
        message most of the time.
      * remove the src field, which we now get from the message flow
        handshake (TAG_IDENT).
      * specifies the data_pre_padding length, which can be used to
        adjust the alignment of the data payload. (NOTE: is this is
        useful?)

* TAG_ACK: acknowledge receipt of message(s)::

@@ -345,14 +322,12 @@ Once a session is stablished, we can exchange messages.

  - Time stamp is from the TAG_KEEPALIVE2 we are responding to.

* TAG_CLOSE: terminate a stream
* TAG_CLOSE: terminate a connection

  Indicates that a stream should be terminated. This is equivalent to
  a hangup or reset (i.e., should trigger ms_handle_reset). It isn't
  strictly necessary or useful if there is only a single stream as we
  could just disconnect the TCP connection, although one could
  certainly use it creatively (e.g., reset the stream state and retry
  an authentication handshake).
  Indicates that a connection should be terminated. This is equivalent
  to a hangup or reset (i.e., should trigger ms_handle_reset). It
  isn't strictly necessary or useful as we could just disconnect the
  TCP connection.


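The TAG_MSG layout above places ``data_pre_padding`` between ``middle`` and ``data`` so the data payload can be aligned. The text does not say what the alignment is measured against, so the sketch below assumes it is relative to the start of the TAG_MSG payload and that the header, front, and middle lengths are known up front; the function name and the 4096-byte alignment in the example are hypothetical.

```python
def data_pre_padding(header_len: int, front_len: int, middle_len: int,
                     alignment: int) -> int:
    """Padding bytes to insert before `data` so it starts on an `alignment` boundary.

    Offsets are measured from the start of the TAG_MSG payload (an assumption).
    """
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    offset = header_len + front_len + middle_len
    return (-offset) % alignment


# Example: a 64-byte header and 100-byte front, no middle, page-aligned data.
pre_pad = data_pre_padding(64, 100, 0, 4096)
```

A receiver that knows ``pre_pad`` from the header could read the data segment directly into an aligned buffer, which is the kind of copy avoidance the *Performance* goal mentions.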
Example of protocol interaction (WIP)
@@ -371,26 +346,20 @@ _____________________________________
   | |
   | send new stream |
   |------------------>|
   | set method |
   |------------------>|
   | +-----|
   | auth request| |
   |-------------+---->|
   | | |
   |<------------+ |
   | bad method |
   | |
   | set method |
   |------------------>|
   | auth request |
   |------------------>|
   |<------------------|
   | auth reply |
   | bad method |
   | |
   | auth done |
   | auth request |
   |------------------>|
   |<------------------|
   | auth done ack |
   | auth more |
   | |
   | auth more |
   |------------------>|
   |<------------------|
   | auth done |
   | |