diff --git a/doc/internals/buffer-rework.txt b/doc/internals/buffer-rework.txt new file mode 100644 index 000000000..136377fd0 --- /dev/null +++ b/doc/internals/buffer-rework.txt @@ -0,0 +1,245 @@ +2018-05-18 - Buffer rework + +1. Summary + +The situation with the current buffer structure is becoming problematic in +the newly introduced muxes and causes serious difficulties preventing muxes +from being used on both sides, unless requiring that all code is duplicated +to use buf->i on the Rx path and buf->o on the Tx path. + + +2. History + +A very long time ago, buffers were used to receive data using the recv() call, +to parse them, forward them, and send them over the network using the send() +call. Buffers were split into (buffer,channel) when some layers started to be +introduced, and were reorganized a few times to ease content processing and +rewriting. The current definition of the buffer structure is the following : + + struct buffer { + char *p; + uint size; + uint i; + uint o; + char data[0]; + }; + + data p + | | + V V + +-----------+--------------------+------------+-------------+ + | |////////////////////|////////////| | + +-----------+--------------------+------------+-------------+ + <---------------------------------------------------------> size + <------------------> <----------> + o i + +Pointer (p) is initialized to (data) when the buffer is empty. Data are +received after (p+i), increasing (i) by the number of bytes read. Data are sent +from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent. +Forwarding data in the channel consists in advancing (p) by the number of bytes +to forward, increasing (o) and decreasing (i) by the same amount. + +This representation is convenient for channel operations because most of them +require to parse input data between (p) and (p+i), and to have a simple way to +forward data. Additionally, it's always easy to know if some data are scheduled +for departure (o), or if the buffer has some room available (size-i-o). + + +3. Problems + +When applets were introduced, the initial code that was made to write data into +the output part was modified to send it into the input part since we had to +rely on the stream code to forward these data via the channel. This explains +the flood of bi_* functions that were introduced to perform the same operations +as the initial bo_*, to write into an input buffer from an applet. + +Health checks however continue to use output because checks do not use streams +nor channels. Thus the check buffers use buf->o for requests and buf->i for +responses. + +The introduction of muxes has changed this again by requiring that most request +code was able to write to buf->i, pretending to be the equivalent of a socket +recv() call. New bi_* functions had to be created to write headers and chunks +from the HTTP/2 mux. Conversely, it was made necessary to parse HTTP traffic +from buf->o while all the original code was made to parse this from buf->i. + +Furthermore, implementing an outgoing mux (eg: HTTP/2) will require to +duplicate a lot of the code to use buf->i instead of buf->o and conversely, +just because the mux will not be placed on the same side of the buffer. Not +only it complicates code maintenance but it also emphasizes the risk to use +the wrong function at any moment. + +From a performance perspective, applets have to suffer a useless copy most of +the time, only due to API limitatoins : it is not possible to write directly to +an input buffer, one has to write to a chunk and then copy it into a buffer. A +compatible structure could allow to share the same data between the chunk and +the buffer without having to perform an extra copy. + + +4. Proposal + +In checks and muxes, it is obvious that a single "side" of the buffer is used, +and it generally is the one associated with the I/O to be performed. Only the +channel requires the distinction between (i) and (o). + +The proposal is to remove this distinction from the buffer and move ->o into +the channel. + +A buffer will then become only a linear (possibly wrapping) storage area with +a beginning, and an end. + +Given the experience gathered from past buffer API updates, we know that the +buffer's end is not as much important as its data length. This will give the +current representation : + + + struct buffer { + void *area; // start of the storage area + uint head; // start offset of remaining data relative to area + uint len; // contents length after head + uint size; // size of the storage area (wrapping point) + }; + + area + | + V + +-----------+---------------------------------+-------------+ + | |/////////////////////////////////| | + +-----------+---------------------------------+-------------+ + <---------------------------------------------------------> size + <---------> <-------------------------------> + head len + +The channel will contain an "output" field corresponding to the current buf->o, +indicating how many bytes of the current buffer are actually scheduled for +being forwarded and must not be considered anymore. It means that a stream +parser will simply start to parse from (buf->area + buf->head + chn->output) +and stop at (buf->area + buf->head + buf->len). + +For esnding data, the caller of cs_send() or whatever function will have to +pass the desired number of bytes to send, and one will not expect anymore that +all the buffer's contents have to be sent. In general the caller will have +access to chn->output if it needs to use this (typically from the stream +interface code at the moment). + + +5. First implementation step + +The first step will consist in limiting the changes to the current buffers. The +buffer structure will still contain both a descriptor and the storage area. A +buffer will first be declared this way : + + struct buffer { + uint head; // start offset of remaining data relative to area + uint len; // contents length after head + uint size; // size of the storage area (wrapping point) + void area[0]; // start of the storage area + }; + +Thanks to this, no changes will have to be performed on memory management, and +buffers will continue to be allocated from a pool of size (sizeof(buffer) + +tune.bufsize). + +The following translations will have to be performed on the code : + - occurrences of (buf->i + buf->o) will have to be replaced with (buf->len) + - bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del() + - bo_ptr() -> b_head() ; bo_end() -> co_end() + - b_adv() -> c_adv() ; b_rew() -> c_rew() + - buf->o will have to be replaced with either chn->output or a function + argument containing a copy of chn->output. These ones should cancel out + at the end of the operation. + - buf->i -> (b_len(buf) - chn->output) + +Temporary difficulties : + - compression makes use of both (i) and (o), by taking care of only touching + (i) and never (o). The filters know how not to touch (o), and the internal + compression API needs a small update so that this previous ->o value is + passed as an argument that the filter will collect from the channel. If it + is simpler (it probably isn't), a temporary "struct oldbuf" could be + created to emulate the old behaviour and be fed/used by the filters code. + + - buffer_slow_realign() distinguishes input data from output data so that the + output data is always placed at the end, leaving a clean contigous buffer + once forwarded. Instead, a "split" argument will have to be added so that + the caller may decide where to split the contents. Muxes will pass zero + here while channels will pass chn->output. + + +6. Second implementation step + +The second step will consist in making "struct buffer" only hold the descriptor +and not the data anymore. It will then look like this : + + struct buffer { + void *area; // start of the storage area + uint head; // start offset of remaining data relative to area + uint len; // contents length after head + uint size; // size of the storage area (wrapping point) + }; + +Each buffer allocation will have to atomically allocate a struct buffer and an +area. Buffer copies will consist in exchanging the "struct buffer" contents +only. + +The chunk API must then be updated so that some new versions of chunk_putblk(), +chunk_printf() etc can write to a storage area, and so that bi_putchk() and +bo_putchk() instead can swap the storage areas when possible. + +At this point b_size() will be used to know where to release the allocated +area. The storage will simply consist in (start,len) which is perfectly suited +to have slabs. Just like chunks, b_size()==0 can be used to mention that no +free() must be done on the area. Doing so will make it much simpler to send +pre-formated messages (eg: error messages) into a buffer because such messages +will then be stored into a "struct ist" and sending such a message will be as +simple as doing : + + b->area = ist.str; + b->len = ist.len; + b->head = 0; + b->size = 0; + +The chunk struct can then be removed and replaced everywhere with a struct +buffer. Only the functions will remain, though they will likely have to be +renamed. Maybe the buffer manipulation functions will have to be split between +those which support wrapping and those which don't (chunks don't support +wrapping). + +The buf_empty structure will then disappear since a small 20-bytes structure +will be enough to represent an empty buffer. + + +7. Third implementation step + +The third step will consist in placing a struct buffer into the struct channel. +This way no allocation is needed at all, and any storage can be used to deliver +contents. This allows to trivially upgrade a buffer on the fly by picking from +a different slab. It also allows to deliver error messages without ever having +to perform a buffer allocation. Doing so removes the need for the early buffer +allocation for the response in process_stream(), as it is only needed to have a +reliable place to send an error message to. This will ensure the buffer +allocator can be simplified and made more robust against the risk of deadlock +on memory shortage. + + +8. Caveats + +The following points require extra care : + - there will be some subtracts to figure the buffer "input" length (formerly + buf->i). In the past it always used to be an unsigned value. Extreme care + will have to be taken to always use an inline function to compute this so + that it doesn't accidently become signed. + + - supporting buf->size==0 to point to a special string may require some extra + checks to avoid causing an integer underflow when calculating (size-len) or + (size-len-head) to figure the available room. + + - it is very likely that some further changes will be tempting to do in the + channel to better integrate the buffer (which becomes very similar to the + pipe), but we must not go as far as removing the visibility of the "struct + buffer" because it will be used as entry point for many functions. + + - it is likely that a number of the chunk_*, bi_* and bo_* variants have very + minor variations like return codes or error checking that can make their + replacement very tricky. Each set of such functions must be studied in + advance, and their users as well.