

.. _msgr2-protocol:

msgr2 protocol
==============

This is a revision of the legacy Ceph on-wire protocol that was
implemented by the SimpleMessenger. It addresses performance and
security issues.

Goals
-----

This protocol revision has several goals relative to the original protocol:

* *Flexible handshaking*. The original protocol's negotiation was not
  flexible enough to accommodate optional features, i.e. features that
  are supported but not required.
* *Encryption*. We will incorporate encryption over the wire.
* *Performance*. We would like to provide for protocol features
  (e.g., padding) that keep computation and memory copies out of the
  fast path where possible.
* *Signing*. We will allow for traffic to be signed (but not
  necessarily encrypted). This may not be implemented in the initial version.

Definitions
-----------

* *client* (C): the party initiating a (TCP) connection
* *server* (S): the party accepting a (TCP) connection
* *connection*: an instance of a (TCP) connection between two processes.
* *entity*: a Ceph entity instantiation, e.g. 'osd.0'. Each entity
  has one or more unique entity_addr_t's by virtue of the 'nonce'
  field, which is typically a pid or random value.
* *session*: a stateful session between two entities in which message
  exchange is ordered and lossless. A session might span multiple
  connections if there is an interruption (TCP connection disconnect).
* *frame*: a discrete message sent between the peers. Each frame
  consists of a tag (type code), payload, and (if signing
  or encryption is enabled) some other fields. See below for the
  structure.
* *tag*: a type code associated with a frame. The tag
  determines the structure of the payload.

Phases
------

A connection has four distinct phases:

#. banner
#. authentication frame exchange
#. message flow handshake frame exchange
#. message frame exchange

Banner
------

Both the client and server, upon connecting, send a banner::

  "ceph %x %x\n", protocol_features_supported, protocol_features_required

The protocol features are a new, distinct namespace. Initially no
features are defined or required, so this will be "ceph 0 0\n".

If the remote party advertises required features we don't support, we
can disconnect.

.. ditaa::

   +---------+        +--------+
   | Client  |        | Server |
   +---------+        +--------+
        | send banner     |
        |----+       +----|
        |    |       |    |
        |    +-------+--->|
        | send banner|    |
        |<-----------+    |
        |                 |

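
The banner exchange can be sketched in a few lines of Python. This is
only an illustration of the ``"ceph %x %x\n"`` format and of the
required-features check described above; the function names are
hypothetical and are not taken from the Ceph source.

.. code-block:: python

   import re
   from typing import Tuple

   # Accept lower-case hex digits, as produced by the %x conversion.
   _BANNER_RE = re.compile(rb"ceph ([0-9a-f]+) ([0-9a-f]+)\n")


   def encode_banner(supported: int, required: int) -> bytes:
       """Format the banner as "ceph %x %x\\n"."""
       return b"ceph %x %x\n" % (supported, required)


   def decode_banner(raw: bytes) -> Tuple[int, int]:
       """Return (supported, required) parsed from a peer's banner."""
       match = _BANNER_RE.fullmatch(raw)
       if match is None:
           raise ValueError("malformed banner: %r" % (raw,))
       return int(match.group(1), 16), int(match.group(2), 16)


   def peer_is_acceptable(peer_required: int, our_supported: int) -> bool:
       """True if we support every feature bit the peer requires."""
       return peer_required & ~our_supported == 0

With no features defined, ``encode_banner(0, 0)`` yields
``b"ceph 0 0\n"``, and a peer whose required mask contains a bit we do
not support fails ``peer_is_acceptable``, at which point we can
disconnect.
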
Frame format
------------

All further data sent or received is contained by a frame. Each frame has
the form::

  frame_len (le32)
  tag (TAG_* le32)
  frame_header_checksum (le32)
  payload
  [payload padding -- only present after stream auth phase]
  [signature -- only present after stream auth phase]

* The frame_header_checksum is over just the frame_len and tag values (8 bytes).
* frame_len includes everything after the frame_len le32 up to the end of the
  frame (all payloads, signatures, and padding).
* The payload format and length are determined by the tag.
* The signature portion is only present if the authentication phase
  has completed (TAG_AUTH_DONE has been sent) and signatures are
  enabled.

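
As a rough illustration of the layout above, the following Python
sketch packs and unpacks a frame as it would look before the
authentication phase completes (no padding, no signature). The text
does not name the checksum algorithm, so ``zlib.crc32`` is used purely
as a placeholder; the helper names are hypothetical.

.. code-block:: python

   import struct
   import zlib

   # frame_len, tag, frame_header_checksum: three little-endian u32s.
   _HEADER = struct.Struct("<III")


   def header_checksum(frame_len: int, tag: int) -> int:
       """Checksum over the 8 bytes holding frame_len and tag.

       ASSUMPTION: zlib.crc32 stands in for the real algorithm, which
       this document does not specify.
       """
       return zlib.crc32(struct.pack("<II", frame_len, tag))


   def encode_frame(tag: int, payload: bytes) -> bytes:
       # frame_len counts everything after its own 4 bytes:
       # tag (4) + checksum (4) + payload.
       frame_len = 4 + 4 + len(payload)
       header = _HEADER.pack(frame_len, tag, header_checksum(frame_len, tag))
       return header + payload


   def decode_frame(buf: bytes):
       """Return (tag, payload, remaining bytes after this frame)."""
       frame_len, tag, checksum = _HEADER.unpack_from(buf, 0)
       if checksum != header_checksum(frame_len, tag):
           raise ValueError("frame header checksum mismatch")
       end = 4 + frame_len
       if len(buf) < end:
           raise ValueError("truncated frame")
       return tag, buf[_HEADER.size:end], buf[end:]

Because the checksum covers only frame_len and tag, a receiver can
validate the header before it has read (or allocated space for) the
payload.
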
Hello
-----

* TAG_HELLO: client->server and server->client::

    __u8 entity_type
    entity_addr_t peer_socket_address

  - We immediately share our entity type and the address of the peer (which can be useful
    for detecting our effective IP address, especially in the presence of NAT).

Authentication
--------------

* TAG_AUTH_REQUEST: client->server::

    __le32 method; // CEPH_AUTH_{NONE, CEPHX, ...}
    __le32 num_preferred_modes;
    list<__le32> mode // CEPH_CON_MODE_*
    method specific payload

* TAG_AUTH_BAD_METHOD server -> client: reject client-selected auth method::

    __le32 method
    __le32 negative error result code
    __le32 num_methods
    list<__le32> allowed_methods // CEPH_AUTH_{NONE, CEPHX, ...}
    __le32 num_modes
    list<__le32> allowed_modes // CEPH_CON_MODE_*

  - Returns the attempted auth method, an error code (-EOPNOTSUPP if
    the method is unsupported), and the lists of allowed authentication
    methods and connection modes.

* TAG_AUTH_REPLY_MORE: server->client::

    __le32 len;
    method specific payload

* TAG_AUTH_REQUEST_MORE: client->server::

    __le32 len;
    method specific payload

* TAG_AUTH_DONE: (server->client)::

    __le64 global_id
    __le32 connection mode // CEPH_CON_MODE_*
    method specific payload

  - The server is the one to decide authentication has completed and what
    the final connection mode will be.

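
To make the field order above concrete, here is a minimal Python
sketch that encodes a TAG_AUTH_REQUEST payload and decodes a
TAG_AUTH_BAD_METHOD payload. It assumes that every ``list<__le32>`` is
simply its elements laid out back to back after the preceding count,
and that the error code is a signed 32-bit value; both are
assumptions, and the names are hypothetical.

.. code-block:: python

   import struct
   from typing import List, NamedTuple


   def encode_auth_request(method: int, preferred_modes: List[int],
                           payload: bytes) -> bytes:
       """method, num_preferred_modes, modes..., method specific payload."""
       out = struct.pack("<II", method, len(preferred_modes))
       out += struct.pack("<%dI" % len(preferred_modes), *preferred_modes)
       return out + payload


   class BadMethod(NamedTuple):
       method: int
       result: int
       allowed_methods: List[int]
       allowed_modes: List[int]


   def _read_list(buf: bytes, offset: int):
       """Read a le32 count followed by that many le32 values."""
       (count,) = struct.unpack_from("<I", buf, offset)
       offset += 4
       values = list(struct.unpack_from("<%dI" % count, buf, offset))
       return values, offset + 4 * count


   def decode_bad_method(buf: bytes) -> BadMethod:
       # ASSUMPTION: the negative result code is read as a signed le32.
       method, result = struct.unpack_from("<Ii", buf, 0)
       methods, offset = _read_list(buf, 8)
       modes, _ = _read_list(buf, offset)
       return BadMethod(method, result, methods, modes)

The numeric values of the CEPH_AUTH_* and CEPH_CON_MODE_* constants are
deliberately left out of the sketch; callers pass whatever integers
their build defines.
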
Example of authentication phase interaction when the client uses an
allowed authentication method:

.. ditaa::

   +---------+        +--------+
   | Client  |        | Server |
   +---------+        +--------+
        | auth request    |
        |---------------->|
        |<----------------|
        |        auth more|
        |                 |
        |auth more        |
        |---------------->|
        |<----------------|
        |        auth done|

Example of authentication phase interaction when the client uses a forbidden
authentication method as the first attempt:

.. ditaa::

   +---------+        +--------+
   | Client  |        | Server |
   +---------+        +--------+
        | auth request    |
        |---------------->|
        |<----------------|
        | bad method      |
        |                 |
        | auth request    |
        |---------------->|
        |<----------------|
        |        auth more|
        |                 |
        | auth more       |
        |---------------->|
        |<----------------|
        |        auth done|

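
The retry in the second exchange can be pictured as the client picking
a new method from the list returned in TAG_AUTH_BAD_METHOD. The sketch
below is one plausible selection rule (first locally preferred method
that the server allows), not necessarily what any implementation does;
the function name and the integer method ids are hypothetical.

.. code-block:: python

   from typing import Iterable, Optional, Sequence


   def pick_retry_method(our_methods: Sequence[int], rejected: int,
                         allowed: Iterable[int]) -> Optional[int]:
       """Choose the method for the second auth request.

       our_methods is ordered by local preference.  Returns None when
       the client and server share no usable method, in which case the
       client can only give up on the connection.
       """
       allowed_set = set(allowed)
       for method in our_methods:
           if method != rejected and method in allowed_set:
               return method
       return None

For example, a client that prefers methods ``[2, 1]`` and has just had
method 2 rejected with ``allowed_methods = [1]`` would retry with
method 1.
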
Post-auth frame format
----------------------

The frame format is fixed (see above), but can take three different
forms, depending on the AUTH_DONE flags:

* If neither FLAG_SIGNED nor FLAG_ENCRYPTED is specified, things are simple::

    frame_len
    tag
    payload
    payload_padding (out to auth block_size)

  - The padding is some number of bytes < the auth block_size that
    brings the total length of the payload + payload_padding to a
    multiple of block_size. It does not include the frame_len or tag. Padding
    content can be zeros or (better) random bytes.

* If FLAG_SIGNED has been specified::

    frame_len
    tag
    payload
    payload_padding (out to auth block_size)
    signature (sig_size bytes)

  Here the padding just makes life easier for the signature. It can be
  random data, which adds an additional confounder. Note also that the
  signature input must include some state from the session key and the
  previous message.

* If FLAG_ENCRYPTED has been specified::

    frame_len
    tag
    {
      payload
      payload_padding (out to auth block_size)
    } ^ stream cipher

  Note that the padding ensures that the total frame is a multiple of
  the auth method's block_size so that the message can be sent out over
  the wire without waiting for the next frame in the stream.

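
The padding rule for the unsigned, unencrypted form (fewer than
block_size bytes, bringing payload + payload_padding to a multiple of
block_size) reduces to one modulo operation. The sketch below follows
that rule only; the helper names are hypothetical.

.. code-block:: python

   import os


   def padding_len(payload_len: int, block_size: int) -> int:
       """Bytes needed to round payload_len up to a block_size multiple.

       The result is always in the range [0, block_size).
       """
       return -payload_len % block_size


   def pad_payload(payload: bytes, block_size: int,
                   random_fill: bool = True) -> bytes:
       """Append padding; random bytes by default, zeros otherwise."""
       count = padding_len(len(payload), block_size)
       filler = os.urandom(count) if random_fill else bytes(count)
       return payload + filler

A payload that is already block-aligned gets no padding at all, which
is why the padding length is strictly less than block_size.
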
Message flow handshake
----------------------

In this phase the peers identify each other and (if desired) reconnect to
an established session.

* TAG_CLIENT_IDENT (client->server): identify ourselves::

    __le32 num_addrs
    entity_addrvec_t*num_addrs entity addrs
    entity_addr_t target entity addr
    __le64 gid (numeric part of osd.0, client.123456, ...)
    __le64 global_seq
    __le64 features supported (CEPH_FEATURE_* bitmask)
    __le64 features required (CEPH_FEATURE_* bitmask)
    __le64 flags (CEPH_MSG_CONNECT_* bitmask)
    __le64 cookie

  - The client sends first, and the server replies in kind. If this is a
    new session, the client and server can proceed to the message exchange.
  - The target addr is who the client is trying to connect *to*, so
    that the server side can close the connection if the client is
    talking to the wrong daemon.
  - type.gid (entity_name_t) is set here, by combining the type shared in the hello
    frame with the gid here. This means we don't need it
    in the header of every message. It also means that we can't send
    messages "from" other entity_name_t's. The current
    implementations set this at the top of _send_message etc so this
    shouldn't break any existing functionality. Implementations will
    likely want to mask this against what the authenticated credential
    allows.
  - cookie is the client cookie used to identify a session, and can be used
    to reconnect to an existing session.
  - We've dropped the 'protocol_version' field from msgr1.

* TAG_IDENT_MISSING_FEATURES (server->client): complain about a TAG_IDENT
  with too few features::

    __le64 features we require that the peer didn't advertise

* TAG_SERVER_IDENT (server->client): accept client ident and identify server::

    __le32 num_addrs
    entity_addrvec_t*num_addrs entity addrs
    __le64 gid (numeric part of osd.0, client.123456, ...)
    __le64 global_seq
    __le64 features supported (CEPH_FEATURE_* bitmask)
    __le64 features required (CEPH_FEATURE_* bitmask)
    __le64 flags (CEPH_MSG_CONNECT_* bitmask)
    __le64 cookie

  - The server cookie can be used by the client if it is later disconnected
    and wants to reconnect and resume the session.

* TAG_RECONNECT (client->server): reconnect to an established session::

    __le32 num_addrs
    entity_addr_t * num_addrs
    __le64 client_cookie
    __le64 server_cookie
    __le64 global_seq
    __le64 connect_seq
    __le64 msg_seq (the last msg seq received)

* TAG_RECONNECT_OK (server->client): acknowledge a reconnect attempt::

    __le64 msg_seq (last msg seq received)

  - once the client receives this, the client can proceed to message exchange.
  - once the server sends this, the server can proceed to message exchange.

* TAG_RECONNECT_RETRY_SESSION (server only): fail reconnect due to stale connect_seq

* TAG_RECONNECT_RETRY_GLOBAL (server only): fail reconnect due to stale global_seq

* TAG_RECONNECT_WAIT (server only): fail reconnect due to connect race.

  - Indicates that the server is already connecting to the client, and
    that direction should win the race. The client should wait for that
    connection to complete.

* TAG_RESET_SESSION (server only): ask client to reset session::

    __u8 full

  - full flag indicates whether peer should do a full reset, i.e., drop
    message queue.

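
The single field carried by TAG_IDENT_MISSING_FEATURES is a bitmask
difference: the bits the server requires that the client did not
advertise as supported. A minimal Python sketch, with hypothetical
function names:

.. code-block:: python

   import struct


   def missing_features(server_required: int, client_supported: int) -> int:
       """Bits the server requires that the client did not advertise."""
       return server_required & ~client_supported


   def encode_ident_missing_features(missing: int) -> bytes:
       """The payload is one __le64 bitmask."""
       return struct.pack("<Q", missing)

A result of zero means the client ident is acceptable as far as
features go, and the server can answer with TAG_SERVER_IDENT instead.
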
Example of failure scenarios:

* First client's client_ident message is lost, and then client reconnects.

.. ditaa::

            +---------+            +--------+
            | Client  |            | Server |
            +---------+            +--------+
                 |                     |
     c_cookie(a) | client_ident(a)     |
                 |-------------X       |
                 |                     |
                 | client_ident(a)     |
                 |-------------------->|
                 |<--------------------|
                 | server_ident(b)     | s_cookie(b)
                 |                     |
                 | session established |
                 |                     |

* Server's server_ident message is lost, and then client reconnects.

.. ditaa::

            +---------+            +--------+
            | Client  |            | Server |
            +---------+            +--------+
                 |                     |
     c_cookie(a) | client_ident(a)     |
                 |-------------------->|
                 |        X------------|
                 | server_ident(b)     | s_cookie(b)
                 |                     |
                 |                     |
                 | client_ident(a)     |
                 |-------------------->|
                 |<--------------------|
                 | server_ident(c)     | s_cookie(c)
                 |                     |
                 | session established |
                 |                     |

* Server's server_ident message is lost, and then server reconnects.

.. ditaa::

            +---------+            +--------+
            | Client  |            | Server |
            +---------+            +--------+
                 |                     |
     c_cookie(a) | client_ident(a)     |
                 |-------------------->|
                 |        X------------|
                 | server_ident(b)     | s_cookie(b)
                 |                     |
                 |                     |
                 | reconnect(a, b)     |
                 |<--------------------|
                 |-------------------->|
                 | reset_session(F)    |
                 |                     |
                 | client_ident(a)     | c_cookie(a)
                 |<--------------------|
                 |-------------------->|
     s_cookie(c) | server_ident(c)     |
                 |                     |

* Connection failure after session is established, and then client reconnects.

.. ditaa::

            +---------+            +--------+
            | Client  |            | Server |
            +---------+            +--------+
                 |                     |
     c_cookie(a) | session established | s_cookie(b)
                 |<------------------->|
                 |        X------------|
                 |                     |
                 | reconnect(a, b)     |
                 |-------------------->|
                 |<--------------------|
                 | reconnect_ok        |
                 |                     |

* Connection failure after session is established because server reset,
  and then client reconnects.

.. ditaa::

            +---------+            +--------+
            | Client  |            | Server |
            +---------+            +--------+
                 |                     |
     c_cookie(a) | session established | s_cookie(b)
                 |<------------------->|
                 |        X------------| reset
                 |                     |
                 | reconnect(a, b)     |
                 |-------------------->|
                 |<--------------------|
                 | reset_session(RC*)  |
                 |                     |
     c_cookie(c) | client_ident(c)     |
                 |-------------------->|
                 |<--------------------|
                 | server_ident(d)     | s_cookie(d)
                 |                     |

RC* means that the reset session full flag depends on the policy.resetcheck
of the connection.

* Connection failure after session is established because client reset,
  and then client reconnects.

.. ditaa::

            +---------+            +--------+
            | Client  |            | Server |
            +---------+            +--------+
                 |                     |
     c_cookie(a) | session established | s_cookie(b)
                 |<------------------->|
           reset |        X------------|
                 |                     |
     c_cookie(c) | client_ident(c)     |
                 |-------------------->|
                 |<--------------------| reset if policy.resetcheck
                 | server_ident(d)     | s_cookie(d)
                 |                     |

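
The reconnect scenarios above suggest a server-side decision along the
following lines. This Python sketch is speculative: the order of the
checks, the exact comparisons used to detect a "stale" sequence
number, and the idea of looking sessions up by server cookie are all
assumptions made for illustration, and every name in it is
hypothetical.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Dict, Optional, Tuple


   @dataclass
   class Session:
       client_cookie: int
       server_cookie: int
       global_seq: int
       connect_seq: int
       in_seq: int                   # last msg seq received from the client
       connecting_out: bool = False  # server is itself dialing the client


   def handle_reconnect(sessions: Dict[int, Session], client_cookie: int,
                        server_cookie: int, global_seq: int,
                        connect_seq: int,
                        resetcheck: bool = True) -> Tuple[str, Optional[int]]:
       """Pick the reply tag for a TAG_RECONNECT (sketch only)."""
       session = sessions.get(server_cookie)
       if session is None or session.client_cookie != client_cookie:
           # No such session, e.g. because the server was reset.  The
           # "full" flag follows policy.resetcheck, as in the RC* note.
           return ("TAG_RESET_SESSION", int(resetcheck))
       if global_seq < session.global_seq:
           return ("TAG_RECONNECT_RETRY_GLOBAL", None)
       if session.connecting_out:
           # Connect race: the server's own attempt should win.
           return ("TAG_RECONNECT_WAIT", None)
       if connect_seq < session.connect_seq:
           return ("TAG_RECONNECT_RETRY_SESSION", None)
       session.global_seq = global_seq
       session.connect_seq = connect_seq
       # Tell the client the last message seq we received from it.
       return ("TAG_RECONNECT_OK", session.in_seq)

The second element of the returned tuple is the frame's single payload
field where this document defines one (``full`` for
TAG_RESET_SESSION, ``msg_seq`` for TAG_RECONNECT_OK) and ``None``
otherwise.
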
Message exchange
----------------

Once a session is established, we can exchange messages.

* TAG_MSG: a message::

    ceph_msg_header2
    front
    middle
    data_pre_padding
    data

  - The ceph_msg_header2 is modified from ceph_msg_header:

    * include an ack_seq. This avoids the need for a TAG_ACK
      message most of the time.
    * remove the src field, which we now get from the message flow
      handshake (TAG_IDENT).
    * specifies the data_pre_padding length, which can be used to
      adjust the alignment of the data payload. (NOTE: is this
      useful?)

* TAG_ACK: acknowledge receipt of message(s)::

    __le64 seq

  - This is only used for stateful sessions.

* TAG_KEEPALIVE2: check for connection liveness::

    ceph_timespec stamp

  - Time stamp is local to sender.

* TAG_KEEPALIVE2_ACK: reply to a keepalive2::

    ceph_timespec stamp

  - Time stamp is from the TAG_KEEPALIVE2 we are responding to.

* TAG_CLOSE: terminate a connection

  Indicates that a connection should be terminated. This is equivalent
  to a hangup or reset (i.e., should trigger ms_handle_reset). It
  isn't strictly necessary or useful as we could just disconnect the
  TCP connection.

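
One way to picture how acknowledgements keep a session lossless is a
sender-side queue of unacknowledged messages. Entries are dropped when
the peer acknowledges them (through the ack_seq in a message header or
a TAG_ACK), and whatever remains is resent after a reconnect, using
the msg_seq reported in TAG_RECONNECT or TAG_RECONNECT_OK. The class
below is an illustrative sketch; its name, its methods, and the choice
to start sequence numbers at 1 are assumptions.

.. code-block:: python

   from collections import deque
   from typing import Deque, List, Tuple


   class OutQueue:
       """Sender-side record of messages the peer has not acked yet."""

       def __init__(self) -> None:
           self._next_seq = 1  # ASSUMPTION: sequence numbers start at 1
           self._unacked: Deque[Tuple[int, bytes]] = deque()

       def send(self, payload: bytes) -> int:
           """Assign the next seq and remember the message until acked."""
           seq = self._next_seq
           self._next_seq += 1
           self._unacked.append((seq, payload))
           return seq

       def handle_ack(self, ack_seq: int) -> None:
           """Drop every message with seq <= ack_seq."""
           while self._unacked and self._unacked[0][0] <= ack_seq:
               self._unacked.popleft()

       def to_replay(self, peer_msg_seq: int) -> List[Tuple[int, bytes]]:
           """Messages to resend after a reconnect.

           peer_msg_seq is the last seq the peer says it received.
           """
           self.handle_ack(peer_msg_seq)
           return list(self._unacked)

After sending three messages and learning on reconnect that the peer
received up to seq 1, ``to_replay(1)`` returns messages 2 and 3 in
order.
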
Example of protocol interaction (WIP)
_____________________________________

.. ditaa::

   +---------+          +--------+
   | Client  |          | Server |
   +---------+          +--------+
        | send banner       |
        |----+       +------|
        |    |       |      |
        |    +-------+----->|
        | send banner|      |
        |<-----------+      |
        |                   |
        | send new stream   |
        |------------------>|
        | auth request      |
        |------------------>|
        |<------------------|
        | bad method        |
        |                   |
        | auth request      |
        |------------------>|
        |<------------------|
        | auth more         |
        |                   |
        | auth more         |
        |------------------>|
        |<------------------|
        | auth done         |
        |                   |