haproxy/doc/internals/buffer-rework.txt

2018-05-18 - Buffer rework

1. Summary

The situation with the current buffer structure is becoming problematic in
the newly introduced muxes and causes serious difficulties preventing muxes
from being used on both sides, unless requiring that all code is duplicated
to use buf->i on the Rx path and buf->o on the Tx path.


2. History

A very long time ago, buffers were used to receive data using the recv() call,
to parse them, forward them, and send them over the network using the send()
call. Buffers were split into (buffer,channel) when some layers started to be
introduced, and were reorganized a few times to ease content processing and
rewriting. The current definition of the buffer structure is the following :

    struct buffer {
        char *p;
        uint size;
        uint i;
        uint o;
        char data[0];
    };

    data                               p
      |                                |
      V                                V
      +-----------+--------------------+------------+-------------+
      |           |////////////////////|////////////|             |
      +-----------+--------------------+------------+-------------+
       <---------------------------------------------------------> size
                   <------------------> <---------->
                            o                i

Pointer (p) is initialized to (data) when the buffer is empty. Data are
received after (p+i), increasing (i) by the number of bytes read. Data are sent
from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent.
Forwarding data in the channel consists in advancing (p) by the number of bytes
to forward, increasing (o) and decreasing (i) by the same amount.

This representation is convenient for channel operations because most of them
require to parse input data between (p) and (p+i), and to have a simple way to
forward data. Additionally, it's always easy to know if some data are scheduled
for departure (o), or if the buffer has some room available (size-i-o).


3. Problems

When applets were introduced, the initial code that was made to write data into
the output part was modified to send it into the input part since we had to
rely on the stream code to forward these data via the channel. This explains
the flood of bi_* functions that were introduced to perform the same operations
as the initial bo_*, to write into an input buffer from an applet.

Health checks however continue to use output because checks do not use streams
nor channels. Thus the check buffers use buf->o for requests and buf->i for
responses.

The introduction of muxes has changed this again by requiring that most request
code was able to write to buf->i, pretending to be the equivalent of a socket
recv() call. New bi_* functions had to be created to write headers and chunks
from the HTTP/2 mux. Conversely, it was made necessary to parse HTTP traffic
from buf->o while all the original code was made to parse this from buf->i.

Furthermore, implementing an outgoing mux (eg: HTTP/2) will require to
duplicate a lot of the code to use buf->i instead of buf->o and conversely,
just because the mux will not be placed on the same side of the buffer. Not
only it complicates code maintenance but it also emphasizes the risk to use
the wrong function at any moment.

From a performance perspective, applets have to suffer a useless copy most of
the time, only due to API limitatoins : it is not possible to write directly to
an input buffer, one has to write to a chunk and then copy it into a buffer. A
compatible structure could allow to share the same data between the chunk and
the buffer without having to perform an extra copy.


4. Proposal

In checks and muxes, it is obvious that a single "side" of the buffer is used,
and it generally is the one associated with the I/O to be performed. Only the
channel requires the distinction between (i) and (o).

The proposal is to remove this distinction from the buffer and move ->o into
the channel.

A buffer will then become only a linear (possibly wrapping) storage area with
a beginning, and an end.

Given the experience gathered from past buffer API updates, we know that the
buffer's end is not as much important as its data length. This will give the
current representation :


    struct buffer {
        void *area;    // start of the storage area
        uint head;     // start offset of remaining data relative to area
        uint len;      // contents length after head
        uint size;     // size of the storage area (wrapping point)
    };

    area
      |
      V
      +-----------+---------------------------------+-------------+
      |           |/////////////////////////////////|             |
      +-----------+---------------------------------+-------------+
       <---------------------------------------------------------> size
       <---------> <------------------------------->
           head                 len

The channel will contain an "output" field corresponding to the current buf->o,
indicating how many bytes of the current buffer are actually scheduled for
being forwarded and must not be considered anymore. It means that a stream
parser will simply start to parse from (buf->area + buf->head + chn->output)
and stop at (buf->area + buf->head + buf->len).

For esnding data, the caller of cs_send() or whatever function will have to
pass the desired number of bytes to send, and one will not expect anymore that
all the buffer's contents have to be sent. In general the caller will have
access to chn->output if it needs to use this (typically from the stream
interface code at the moment).


5. First implementation step

The first step will consist in limiting the changes to the current buffers. The
buffer structure will still contain both a descriptor and the storage area. A
buffer will first be declared this way :

    struct buffer {
        uint head;     // start offset of remaining data relative to area
        uint len;      // contents length after head
        uint size;     // size of the storage area (wrapping point)
        void area[0];  // start of the storage area
    };

Thanks to this, no changes will have to be performed on memory management, and
buffers will continue to be allocated from a pool of size (sizeof(buffer) +
tune.bufsize).

The following translations will have to be performed on the code :
  - occurrences of (buf->i + buf->o) will have to be replaced with (buf->len)
  - bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del()
  - bo_ptr() -> b_head() ; bo_end() -> co_end()
  - b_adv() -> c_adv() ; b_rew() -> c_rew()
  - buf->o will have to be replaced with either chn->output or a function
    argument containing a copy of chn->output. These ones should cancel out
    at the end of the operation.
  - buf->i -> (b_len(buf) - chn->output)

Temporary difficulties :
  - compression makes use of both (i) and (o), by taking care of only touching
    (i) and never (o). The filters know how not to touch (o), and the internal
    compression API needs a small update so that this previous ->o value is
    passed as an argument that the filter will collect from the channel. If it
    is simpler (it probably isn't), a temporary "struct oldbuf" could be
    created to emulate the old behaviour and be fed/used by the filters code.

  - buffer_slow_realign() distinguishes input data from output data so that the
    output data is always placed at the end, leaving a clean contigous buffer
    once forwarded. Instead, a "split" argument will have to be added so that
    the caller may decide where to split the contents. Muxes will pass zero
    here while channels will pass chn->output.


6. Second implementation step

The second step will consist in making "struct buffer" only hold the descriptor
and not the data anymore. It will then look like this :

    struct buffer {
        void *area;    // start of the storage area
        uint head;     // start offset of remaining data relative to area
        uint len;      // contents length after head
        uint size;     // size of the storage area (wrapping point)
    };

Each buffer allocation will have to atomically allocate a struct buffer and an
area. Buffer copies will consist in exchanging the "struct buffer" contents
only.

The chunk API must then be updated so that some new versions of chunk_putblk(),
chunk_printf() etc can write to a storage area, and so that bi_putchk() and
bo_putchk() instead can swap the storage areas when possible.

At this point b_size() will be used to know where to release the allocated
area. The storage will simply consist in (start,len) which is perfectly suited
to have slabs. Just like chunks, b_size()==0 can be used to mention that no
free() must be done on the area. Doing so will make it much simpler to send
pre-formated messages (eg: error messages) into a buffer because such messages
will then be stored into a "struct ist" and sending such a message will be as
simple as doing :

      b->area = ist.str;
      b->len  = ist.len;
      b->head = 0;
      b->size = 0;

The chunk struct can then be removed and replaced everywhere with a struct
buffer. Only the functions will remain, though they will likely have to be
renamed. Maybe the buffer manipulation functions will have to be split between
those which support wrapping and those which don't (chunks don't support
wrapping).

The buf_empty structure will then disappear since a small 20-bytes structure
will be enough to represent an empty buffer.


7. Third implementation step

The third step will consist in placing a struct buffer into the struct channel.
This way no allocation is needed at all, and any storage can be used to deliver
contents. This allows to trivially upgrade a buffer on the fly by picking from
a different slab. It also allows to deliver error messages without ever having
to perform a buffer allocation. Doing so removes the need for the early buffer
allocation for the response in process_stream(), as it is only needed to have a
reliable place to send an error message to. This will ensure the buffer
allocator can be simplified and made more robust against the risk of deadlock
on memory shortage.


8. Caveats

The following points require extra care :
  - there will be some subtracts to figure the buffer "input" length (formerly
    buf->i). In the past it always used to be an unsigned value. Extreme care
    will have to be taken to always use an inline function to compute this so
    that it doesn't accidently become signed.

  - supporting buf->size==0 to point to a special string may require some extra
    checks to avoid causing an integer underflow when calculating (size-len) or
    (size-len-head) to figure the available room.

  - it is very likely that some further changes will be tempting to do in the
    channel to better integrate the buffer (which becomes very similar to the
    pipe), but we must not go as far as removing the visibility of the "struct
    buffer" because it will be used as entry point for many functions.

  - it is likely that a number of the chunk_*, bi_* and bo_* variants have very
    minor variations like return codes or error checking that can make their
    replacement very tricky. Each set of such functions must be studied in
    advance, and their users as well.