

Serialization (encode/decode)
=============================

When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write to and read from
``bufferlist`` objects representing byte strings.

Terminology
-----------
It is best to think not in terms of daemons and clients but in terms of
encoders and decoders. An encoder serializes a structure into a bufferlist,
while a decoder does the opposite.

Encoders and decoders can be referred to collectively as dencoders.

Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client and the decoder by the OSD. However,
these roles can be reversed: when the response is handled, the OSD encodes
the ``MOSDOpReply`` and the RBD client decodes it.

Encoders and decoders operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.
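
To make the two roles concrete, here is a minimal, self-contained sketch of a
round trip. It does not use the Ceph headers: ``Bytes`` stands in for
``bufferlist``, and ``Sample``, ``put_u32``, ``get_u32``, ``encode_sample`` and
``decode_sample`` are hypothetical names. The byte layout (little-endian 32-bit
integers, length-prefixed strings) is an assumption of the sketch.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Stand-in for bufferlist in this sketch (not the Ceph type).
    using Bytes = std::vector<std::uint8_t>;

    // Hypothetical structure with two fields.
    struct Sample {
        std::uint32_t number = 0;
        std::string text;
    };

    // Append a 32-bit value in little-endian byte order.
    inline void put_u32(Bytes &out, std::uint32_t v) {
        for (int i = 0; i < 4; ++i)
            out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xff));
    }

    // Read a little-endian 32-bit value at `off` and advance `off` past it.
    inline std::uint32_t get_u32(const Bytes &in, std::size_t &off) {
        if (off > in.size() || in.size() - off < 4)
            throw std::out_of_range("short read");
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(in[off + i]) << (8 * i);
        off += 4;
        return v;
    }

    // Encoder: structure -> bytes.
    inline Bytes encode_sample(const Sample &s) {
        Bytes out;
        put_u32(out, s.number);
        put_u32(out, static_cast<std::uint32_t>(s.text.size()));
        out.insert(out.end(), s.text.begin(), s.text.end());
        return out;
    }

    // Decoder: bytes -> structure, reading fields in the order they were written.
    inline Sample decode_sample(const Bytes &in) {
        std::size_t off = 0;
        Sample s;
        s.number = get_u32(in, off);
        const std::uint32_t len = get_u32(in, off);
        if (in.size() - off < len)
            throw std::out_of_range("short read");
        const auto first = in.begin() + static_cast<std::ptrdiff_t>(off);
        s.text.assign(first, first + static_cast<std::ptrdiff_t>(len));
        off += len;
        assert(off <= in.size());  // the decoder never reads past its input
        return s;
    }

Whichever process calls ``encode_sample`` is the encoder and whichever calls
``decode_sample`` is the decoder, regardless of whether it is a daemon or a
client.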

Principles for format change
----------------------------
It is not unusual for the format of serialization to change. This
process requires careful attention both during development
and review.

The general rule is that a decoder must understand what has been encoded by an
encoder. Most difficulties arise in keeping old decoders compatible with new
encoders and new decoders compatible with old encoders. Unless otherwise
specified, one should assume that any mix of old and new is possible in a
cluster. There are two primary concerns:

1. **Upgrades.** Although there are recommendations about the order in which
   entity types (mons/OSDs/clients) are upgraded, that order is not mandatory
   and no assumption should be made about it.
2. **Huge variability of client versions.** It has always been the case that
   kernel upgrades (and thus kernel clients) are decoupled from Ceph upgrades.
   Containerization brings variability even to ``librbd`` -- user space
   libraries now live in the container itself.

There are a few rules limiting the degree of interoperability between
dencoders:

* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client scenarios,
* ``n-3..`` soft requirement for client scenarios. Ideally every client should
  be able to talk to any version of daemons.

As the underlying reasons are the same, the rules that dencoders
follow are nearly the same as the rules for deprecation of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.

Frameworks
----------
Currently, several kinds of dencoding helpers coexist:

* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.

Although details vary, the interoperability rules stay the same.

Adding a field to a structure
-----------------------------

You can see examples of this all over the Ceph code, but here's an
example:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;

        void encode(bufferlist &bl)
        {
            ENCODE_START(1, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(1, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            DECODE_FINISH(bl);
        }
    };

The ``ENCODE_START`` macro writes a header that specifies a *version* and
a *compat_version* (both initially 1).  The message version is incremented
whenever a change is made to the encoding.  The compat_version is incremented
only if the change will break existing decoders -- decoders are tolerant
of trailing bytes, so changes that add fields at the end of the structure
do not require incrementing compat_version.

The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle.  This is compared with the
compat_version encoded in the message, and if the message is too new then
an exception will be thrown.  Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.
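
The comparison described above can be sketched in isolation. This is an
illustration only, not the macro's actual body: ``check_compat``,
``can_decode`` and ``compat_examples`` are hypothetical names, and
``std::runtime_error`` is a stand-in for whatever exception the real code
throws.

.. code-block:: cpp

    #include <cassert>
    #include <cstdint>
    #include <stdexcept>

    // Hypothetical helper. `supported_v` plays the role of the argument given
    // to DECODE_START; `struct_compat` is the value read from the header.
    inline void check_compat(std::uint8_t supported_v,
                             std::uint8_t struct_compat) {
        if (supported_v < struct_compat)
            throw std::runtime_error("encoding is too new for this decoder");
    }

    // Convenience wrapper: true when decoding may proceed.
    inline bool can_decode(std::uint8_t supported_v,
                           std::uint8_t struct_compat) {
        try {
            check_compat(supported_v, struct_compat);
            return true;
        } catch (const std::runtime_error &) {
            return false;
        }
    }

    // Usage: the scenarios discussed in the text.
    inline void compat_examples() {
        assert(can_decode(1, 1));   // fields appended, compat_version kept at 1
        assert(can_decode(2, 1));   // newer decoder, older encoding
        assert(!can_decode(1, 2));  // breaking change raised compat_version
    }

Note that only compat_version takes part in this check. The version itself is
used later, inside ``decode``, to decide which fields are present.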

In practice, changes to encoding usually involve simply adding the desired fields
at the end of the ``encode`` and ``decode`` functions, and incrementing
the versions in ``ENCODE_START`` and ``DECODE_START``.  For example, here's how
to add a third field to ``AcmeClass``:

.. code-block:: cpp

    class AcmeClass
    {
        int member1;
        std::string member2;
        std::vector<std::string> member3;

        void encode(bufferlist &bl)
        {
            ENCODE_START(2, 1, bl);
            ::encode(member1, bl);
            ::encode(member2, bl);
            ::encode(member3, bl);
            ENCODE_FINISH(bl);
        }

        void decode(bufferlist::iterator &bl)
        {
            DECODE_START(2, bl);
            ::decode(member1, bl);
            ::decode(member2, bl);
            if (struct_v >= 2) {
                ::decode(member3, bl);
            }
            DECODE_FINISH(bl);
        }
    };

Note that the compat_version did not change because the encoded message
will still be decodable by versions of the code that only understand
version 1 -- they will just ignore the trailing bytes where we encode ``member3``.

In the ``decode`` function, decoding the new field is conditional: this is
because we might still be passed older-versioned messages that do not
have the field.  The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.
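
The effect of the ``struct_v`` guard can be shown with a small standalone
sketch. It leaves out the header handling and passes the version in as a
parameter; ``Bytes``, ``Acme``, ``get_u32`` and ``decode_fields`` are
hypothetical names, and the fields are reduced to 32-bit integers to keep the
example short.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Stand-in for bufferlist in this sketch (not the Ceph type).
    using Bytes = std::vector<std::uint8_t>;

    // Hypothetical structure; member3 was added in version 2 of the encoding.
    struct Acme {
        std::uint32_t member1 = 0;
        std::uint32_t member3 = 0;  // keeps this default for version-1 input
    };

    // Read a little-endian 32-bit value at `off` and advance `off` past it.
    inline std::uint32_t get_u32(const Bytes &in, std::size_t &off) {
        if (off > in.size() || in.size() - off < 4)
            throw std::out_of_range("short read");
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(in[off + i]) << (8 * i);
        off += 4;
        return v;
    }

    // Field-decoding part of a version-2 decoder. `struct_v` is the version
    // taken from the header, which DECODE_START would have provided.
    inline Acme decode_fields(std::uint8_t struct_v, const Bytes &in,
                              std::size_t &off) {
        Acme a;
        a.member1 = get_u32(in, off);
        if (struct_v >= 2)
            a.member3 = get_u32(in, off);  // only present in newer encodings
        assert(off <= in.size());
        return a;
    }

Given version-1 input the decoder reads four bytes and leaves ``member3`` at
its default. Given version-2 input it reads both fields.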

Into the weeds
--------------

The append-extensibility of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.

These macros implement the extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.

.. code-block:: cpp

        /**
         * start encoding block
         *
         * @param v current (code) version of the encoding
         * @param compat oldest code version that can decode it
         * @param bl bufferlist to encode to
         *
         */
        #define ENCODE_START(v, compat, bl)                             \
          __u8 struct_v = v;                                            \
          __u8 struct_compat = compat;                                  \
          ceph_le32 struct_len;                                         \
          auto filler = (bl).append_hole(sizeof(struct_v) +             \
            sizeof(struct_compat) + sizeof(struct_len));                \
          const auto starting_bl_len = (bl).length();                   \
          using ::ceph::encode;                                         \
          do {
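
The reserve-then-fill pattern behind this header can be sketched without the
Ceph types. In the sketch below, ``Bytes`` stands in for ``bufferlist``,
``begin_struct`` plays the role of the ``append_hole`` call, and
``finish_struct`` fills the hole once the payload size is known. All of these
names are hypothetical, and the little-endian length is an assumption based on
the ``ceph_le32`` type in the macro.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Stand-in for bufferlist in this sketch (not the Ceph type).
    using Bytes = std::vector<std::uint8_t>;

    // struct_v (1 byte) + struct_compat (1 byte) + struct_len (4 bytes).
    constexpr std::size_t kHeaderLen = 1 + 1 + 4;

    // Reserve room for the header and return its offset.
    inline std::size_t begin_struct(Bytes &out) {
        const std::size_t hole = out.size();
        out.resize(out.size() + kHeaderLen);
        return hole;
    }

    // Fill in the header once all fields have been appended after the hole.
    inline void finish_struct(Bytes &out, std::size_t hole,
                              std::uint8_t v, std::uint8_t compat) {
        assert(hole + kHeaderLen <= out.size());
        const std::uint32_t len =
            static_cast<std::uint32_t>(out.size() - (hole + kHeaderLen));
        out[hole] = v;
        out[hole + 1] = compat;
        for (int i = 0; i < 4; ++i)  // little-endian length
            out[hole + 2 + i] =
                static_cast<std::uint8_t>((len >> (8 * i)) & 0xff);
    }

Because the length is computed from the number of bytes actually appended, a
later revision that appends more fields gets a larger ``struct_len`` without
any change to the header code.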

The ``struct_len`` field allows the decoder to skip all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been consumed by the
user-provided ``decode`` methods.

.. code-block:: cpp

        #define DECODE_START(v, bl)                                     \
          unsigned struct_end = 0;                                      \
          __u32 struct_len;                                             \
          decode(struct_len, bl);                                       \
          ...                                                           \
          struct_end = bl.get_off() + struct_len;                       \
          }                                                             \
          do {


The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist iterator is critical because dencoders
tend to be nested; leaving it where the user-provided code stopped would work
only for the very last ``decode`` call in a nested structure.

.. code-block:: cpp

        #define DECODE_FINISH(bl)					\
          } while (false);						\
          if (struct_end) {						\
            ...                                                         \
            if (bl.get_off() < struct_end)				\
              bl += struct_end - bl.get_off();				\
          }


This cooperative mechanism allows later revisions of an encoder to emit
additional bytes (for example, by adding a new field at the end) without
worrying that the residue will crash older decoder revisions.
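
The whole mechanism can be exercised end to end in a standalone sketch. Here a
newer encoder writes two fields, and an older decoder that knows only the first
one reads two such structures back to back. ``Bytes``, ``put_u32``,
``get_u32``, ``encode_v2`` and ``decode_v1`` are hypothetical names, the header
layout follows the description above, and nothing here is the actual macro
code.

.. code-block:: cpp

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Stand-in for bufferlist in this sketch (not the Ceph type).
    using Bytes = std::vector<std::uint8_t>;

    // Append a 32-bit value in little-endian byte order.
    inline void put_u32(Bytes &out, std::uint32_t v) {
        for (int i = 0; i < 4; ++i)
            out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xff));
    }

    // Read a little-endian 32-bit value at `off` and advance `off` past it.
    inline std::uint32_t get_u32(const Bytes &in, std::size_t &off) {
        if (off > in.size() || in.size() - off < 4)
            throw std::out_of_range("short read");
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(in[off + i]) << (8 * i);
        off += 4;
        return v;
    }

    // Newer encoder: version 2, compat 1, field `a` followed by the new `b`.
    inline void encode_v2(Bytes &out, std::uint32_t a, std::uint32_t b) {
        out.push_back(2);  // struct_v
        out.push_back(1);  // struct_compat
        put_u32(out, 8);   // struct_len: two 32-bit fields
        put_u32(out, a);
        put_u32(out, b);
    }

    // Older decoder: understands version 1 only. It reads `a`, then jumps to
    // the end of the structure so that the next decode starts in the right
    // place.
    inline std::uint32_t decode_v1(const Bytes &in, std::size_t &off) {
        if (off > in.size() || in.size() - off < 2)
            throw std::out_of_range("short read");
        ++off;  // struct_v: not needed by a version-1 decoder
        const std::uint8_t struct_compat = in[off++];
        if (1 < struct_compat)
            throw std::runtime_error("encoding is too new for this decoder");
        const std::uint32_t struct_len = get_u32(in, off);
        if (struct_len > in.size() - off)
            throw std::out_of_range("structure extends past end of input");
        const std::size_t struct_end = off + struct_len;
        const std::uint32_t a = get_u32(in, off);
        if (off < struct_end)
            off = struct_end;  // discard the trailing bytes we do not know
        assert(off <= in.size());
        return a;
    }

Without the final jump to ``struct_end``, the second call to ``decode_v1``
would start in the middle of the first structure, at the bytes of ``b``, and
misinterpret them as a header.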