2018-05-18 - Buffer rework

1. Summary

The situation with the current buffer structure is becoming problematic in
the newly introduced muxes and causes serious difficulties preventing muxes
from being used on both sides, unless all the code is duplicated to use
buf->i on the Rx path and buf->o on the Tx path.


2. History

A very long time ago, buffers were used to receive data using the recv() call,
to parse them, forward them, and send them over the network using the send()
call. Buffers were split into (buffer,channel) when some layers started to be
introduced, and were reorganized a few times to ease content processing and
rewriting. The current definition of the buffer structure is the following :

    struct buffer {
        char *p;
        uint size;
        uint i;
        uint o;
        char data[0];
    };

    data                             p
    |                                |
    V                                V
    +-----------+--------------------+------------+-------------+
    |           |////////////////////|////////////|             |
    +-----------+--------------------+------------+-------------+
     <---------------------------------------------------------> size
                 <------------------> <---------->
                          o                 i

Pointer (p) is initialized to (data) when the buffer is empty. Data are
received after (p+i), increasing (i) by the number of bytes read. Data are sent
from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent.
Forwarding data in the channel consists in advancing (p) by the number of bytes
to forward, increasing (o) and decreasing (i) by the same amount.

This representation is convenient for channel operations because most of them
require parsing input data between (p) and (p+i), and having a simple way to
forward data. Additionally, it's always easy to know if some data are scheduled
for departure (o), or if the buffer has some room available (size-i-o).
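
As a rough illustration of the three operations described above, here is a
minimal sketch of the old layout. It deliberately ignores wrapping and uses a
fixed-size area; the type and helper names (old_buffer, old_recv, old_fwd,
old_send, old_room) are made up for this note and are not HAProxy functions.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int uint;

/* Simplified copy of the old layout: fixed area, no wrapping support. */
struct old_buffer {
    char *p;        /* limit between output data (before) and input data (after) */
    uint size;      /* size of the storage area */
    uint i;         /* number of input bytes, located at p */
    uint o;         /* number of output bytes, located before p */
    char data[64];  /* storage area (hypothetical fixed size) */
};

static inline void old_init(struct old_buffer *b)
{
    b->p = b->data;
    b->size = sizeof(b->data);
    b->i = b->o = 0;
}

/* room available in the buffer, as computed in the text: size-i-o */
static inline uint old_room(const struct old_buffer *b)
{
    return b->size - b->i - b->o;
}

/* "receive" <len> bytes: they are appended at (p+i) and (i) grows */
static inline void old_recv(struct old_buffer *b, const char *src, uint len)
{
    /* no wrapping in this sketch: the block must fit before the end */
    assert(len <= (uint)(b->data + b->size - (b->p + b->i)));
    memcpy(b->p + b->i, src, len);
    b->i += len;
}

/* forward <len> input bytes: (p) advances, (o) grows, (i) shrinks */
static inline void old_fwd(struct old_buffer *b, uint len)
{
    assert(len <= b->i);
    b->p += len;
    b->o += len;
    b->i -= len;
}

/* "send" up to <len> bytes from (p-o) into <dst>; returns the count */
static inline uint old_send(struct old_buffer *b, char *dst, uint len)
{
    if (len > b->o)
        len = b->o;
    memcpy(dst, b->p - b->o, len);
    b->o -= len;
    return len;
}
```

Note how every helper has to know which side of (p) it works on, which is
what forces the bi_*/bo_* duplication discussed in the next section.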


3. Problems

When applets were introduced, the initial code that was made to write data into
the output part was modified to send it into the input part since we had to
rely on the stream code to forward these data via the channel. This explains
the flood of bi_* functions that were introduced to perform the same operations
as the initial bo_*, to write into an input buffer from an applet.

Health checks however continue to use output because checks do not use streams
nor channels. Thus the check buffers use buf->o for requests and buf->i for
responses.

The introduction of muxes has changed this again by requiring that most request
code be able to write to buf->i, pretending to be the equivalent of a socket
recv() call. New bi_* functions had to be created to write headers and chunks
from the HTTP/2 mux. Conversely, it became necessary to parse HTTP traffic
from buf->o while all the original code was made to parse it from buf->i.

Furthermore, implementing an outgoing mux (e.g. HTTP/2) will require
duplicating a lot of the code to use buf->i instead of buf->o and conversely,
just because the mux will not be placed on the same side of the buffer. Not
only does this complicate code maintenance, it also increases the risk of
using the wrong function at any moment.

From a performance perspective, applets have to suffer a useless copy most of
the time, only due to API limitations : it is not possible to write directly to
an input buffer, one has to write to a chunk and then copy it into a buffer. A
compatible structure could make it possible to share the same data between the
chunk and the buffer without having to perform an extra copy.


4. Proposal

In checks and muxes, it is obvious that a single "side" of the buffer is used,
and it generally is the one associated with the I/O to be performed. Only the
channel requires the distinction between (i) and (o).

The proposal is to remove this distinction from the buffer and move ->o into
the channel.

A buffer will then become only a linear (possibly wrapping) storage area with
a beginning, and an end.

Given the experience gathered from past buffer API updates, we know that the
buffer's end is not as important as its data length. This gives the following
representation :

    struct buffer {
        void *area;   // start of the storage area
        uint head;    // start offset of remaining data relative to area
        uint len;     // contents length after head
        uint size;    // size of the storage area (wrapping point)
    };

    area
    |
    V
    +-----------+---------------------------------+-------------+
    |           |/////////////////////////////////|             |
    +-----------+---------------------------------+-------------+
     <---------------------------------------------------------> size
     <---------> <------------------------------->
        head                    len

The channel will contain an "output" field corresponding to the current buf->o,
indicating how many bytes of the current buffer are actually scheduled for
being forwarded and must not be considered anymore. It means that a stream
parser will simply start to parse from (buf->area + buf->head + chn->output)
and stop at (buf->area + buf->head + buf->len).

For sending data, the caller of cs_send() or whatever function will have to
pass the desired number of bytes to send, and one will no longer expect that
all the buffer's contents have to be sent. In general the caller will have
access to chn->output if it needs to use this (typically from the stream
interface code at the moment).
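
The parsing range and the role of chn->output can be sketched as follows. This
is only an illustration under simplifying assumptions: wrapping is ignored, the
channel is reduced to the two fields discussed here, and the helper names
(chn_parse_start, chn_parse_stop, chn_forward, chn_sent) are hypothetical.

```c
#include <assert.h>

typedef unsigned int uint;

struct buffer {
    void *area;   /* start of the storage area */
    uint head;    /* start offset of remaining data relative to area */
    uint len;     /* contents length after head */
    uint size;    /* size of the storage area (wrapping point) */
};

/* Hypothetical, reduced channel: only what this section talks about. */
struct channel {
    struct buffer *buf;
    uint output;  /* bytes at the head of buf already scheduled for forwarding */
};

/* first byte a stream parser should look at (wrapping not handled) */
static inline char *chn_parse_start(const struct channel *chn)
{
    return (char *)chn->buf->area + chn->buf->head + chn->output;
}

/* first byte past the data a stream parser may look at */
static inline char *chn_parse_stop(const struct channel *chn)
{
    return (char *)chn->buf->area + chn->buf->head + chn->buf->len;
}

/* schedule <len> more bytes for forwarding: the buffer itself is untouched */
static inline void chn_forward(struct channel *chn, uint len)
{
    uint input = chn->buf->len - chn->output;

    if (len > input)
        len = input;
    chn->output += len;
}

/* account for <sent> bytes that the lower layer took from the buffer's head */
static inline void chn_sent(struct channel *chn, uint sent)
{
    assert(sent <= chn->output);
    chn->buf->head += sent;   /* wrapping ignored in this sketch */
    chn->buf->len  -= sent;
    chn->output    -= sent;
}
```

The sending function only needs the buffer and a byte count; whether that
count comes from chn->output or from somewhere else is the caller's business.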


5. First implementation step

The first step will consist in limiting the changes to the current buffers. The
buffer structure will still contain both a descriptor and the storage area. A
buffer will first be declared this way :

    struct buffer {
        uint head;      // start offset of remaining data relative to area
        uint len;       // contents length after head
        uint size;      // size of the storage area (wrapping point)
        char area[0];   // start of the storage area
    };

Thanks to this, no changes will have to be performed on memory management, and
buffers will continue to be allocated from a pool of size (sizeof(buffer) +
tune.bufsize).
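
A minimal sketch of such a single-block allocation is shown below. It uses
malloc() as a stand-in for the pool allocator and a C99 flexible array member
for the trailing area; buf_alloc() and its <bufsize> argument are hypothetical
names playing the role of the pool and of tune.bufsize.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int uint;

struct buffer {
    uint head;    /* start offset of remaining data relative to area */
    uint len;     /* contents length after head */
    uint size;    /* size of the storage area (wrapping point) */
    char area[];  /* storage area, placed right after the descriptor */
};

/* Allocates the descriptor and its area as one block of
 * sizeof(struct buffer) + bufsize bytes. Returns NULL on failure.
 */
static inline struct buffer *buf_alloc(uint bufsize)
{
    struct buffer *b;

    assert(bufsize > 0);
    b = malloc(sizeof(*b) + bufsize);
    if (!b)
        return NULL;
    b->head = 0;
    b->len  = 0;
    b->size = bufsize;
    return b;
}
```

Releasing the buffer remains a single free() of the block, which is why the
memory management does not need to change during this step.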

The following translations will have to be performed on the code :
  - occurrences of (buf->i + buf->o) will have to be replaced with (buf->len)
  - bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del()
  - bo_ptr() -> b_head() ; bo_end() -> co_end()
  - b_adv() -> c_adv() ; b_rew() -> c_rew()
  - buf->o will have to be replaced with either chn->output or a function
    argument containing a copy of chn->output. These ones should cancel out
    at the end of the operation.
  - buf->i -> (b_len(buf) - chn->output)

Temporary difficulties :
  - compression makes use of both (i) and (o), by taking care of only touching
    (i) and never (o). The filters know how not to touch (o), and the internal
    compression API needs a small update so that this previous ->o value is
    passed as an argument that the filter will collect from the channel. If it
    is simpler (it probably isn't), a temporary "struct oldbuf" could be
    created to emulate the old behaviour and be fed/used by the filters code.

  - buffer_slow_realign() distinguishes input data from output data so that the
    output data is always placed at the end, leaving a clean contiguous buffer
    once forwarded. Instead, a "split" argument will have to be added so that
    the caller may decide where to split the contents. Muxes will pass zero
    here while channels will pass chn->output.
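
To make the "split" idea concrete, here is one possible way to write such a
realign. It is a sketch, not the actual function: buf_realign() is a made-up
name, the caller is assumed to provide a swap area of at least b->len bytes,
and the byte-by-byte linearization favours readability over speed.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int uint;

struct buffer {
    void *area;   /* start of the storage area */
    uint head;    /* start offset of remaining data relative to area */
    uint len;     /* contents length after head */
    uint size;    /* size of the storage area (wrapping point) */
};

/* Realigns a possibly wrapping buffer so that the first <split> bytes end
 * exactly at the end of the area and the remaining ones start at the
 * beginning of the area. <swap> must be able to hold b->len bytes.
 */
static inline void buf_realign(struct buffer *b, char *swap, uint split)
{
    char *area = b->area;
    uint i;

    assert(b->size > 0 && b->len <= b->size && split <= b->len);

    /* linearize the contents into <swap>, following the wrapping point */
    for (i = 0; i < b->len; i++)
        swap[i] = area[(b->head + i) % b->size];

    /* the part after the split point goes to the beginning of the area */
    memcpy(area, swap + split, b->len - split);

    /* the part before the split point ends at the end of the area */
    memcpy(area + b->size - split, swap, split);

    b->head = split ? b->size - split : 0;
}
```

A mux would call buf_realign(b, swap, 0) and get fully contiguous data starting
at the area, while a channel would pass chn->output so that the remaining input
becomes contiguous once the output part has left.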


6. Second implementation step

The second step will consist in making "struct buffer" only hold the descriptor
and not the data anymore. It will then look like this :

    struct buffer {
        void *area;   // start of the storage area
        uint head;    // start offset of remaining data relative to area
        uint len;     // contents length after head
        uint size;    // size of the storage area (wrapping point)
    };

Each buffer allocation will have to atomically allocate a struct buffer and an
area. Buffer copies will consist in exchanging the "struct buffer" contents
only.
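
Both points can be sketched in a few lines. The names buf_new(), buf_free()
and buf_xchg() are hypothetical, malloc() stands in for whatever pool or slab
allocator ends up being used, and "atomically" is read here as "either both
the descriptor and the area are obtained, or nothing is".

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int uint;

struct buffer {
    void *area;   /* start of the storage area */
    uint head;    /* start offset of remaining data relative to area */
    uint len;     /* contents length after head */
    uint size;    /* size of the storage area (wrapping point) */
};

/* Allocates a descriptor and its area; returns NULL if either fails. */
static inline struct buffer *buf_new(uint size)
{
    struct buffer *b;

    assert(size > 0);
    b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->area = malloc(size);
    if (!b->area) {
        free(b);
        return NULL;
    }
    b->head = b->len = 0;
    b->size = size;
    return b;
}

static inline void buf_free(struct buffer *b)
{
    free(b->area);
    free(b);
}

/* "Copies" between two buffers by exchanging their descriptors only:
 * no payload byte is moved.
 */
static inline void buf_xchg(struct buffer *a, struct buffer *b)
{
    struct buffer tmp = *a;

    *a = *b;
    *b = tmp;
}
```

The exchange costs the same regardless of how much data the areas hold.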

The chunk API must then be updated so that some new versions of chunk_putblk(),
chunk_printf() etc can write to a storage area, and so that bi_putchk() and
bo_putchk() instead can swap the storage areas when possible.

At this point b_size() will be used to know where to release the allocated
area. The storage will simply consist in (start,len) which is perfectly suited
to have slabs. Just like chunks, b_size()==0 can be used to mention that no
free() must be done on the area. Doing so will make it much simpler to send
pre-formatted messages (e.g. error messages) into a buffer because such
messages will then be stored into a "struct ist" and sending such a message
will be as simple as doing :

    b->area = ist.str;
    b->len  = ist.len;
    b->head = 0;
    b->size = 0;
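
Put together with the size==0 convention, this could look like the sketch
below. The reduced "struct ist" only mirrors the two fields used in the
snippet above, and buf_set_static() / buf_release() are hypothetical helpers
with malloc()/free() standing in for the real allocator.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int uint;

struct buffer {
    void *area;   /* start of the storage area */
    uint head;    /* start offset of remaining data relative to area */
    uint len;     /* contents length after head */
    uint size;    /* size of the storage area, 0 = area not owned */
};

/* Reduced stand-in for "struct ist": a pointer and a length. The field
 * names follow the snippet above.
 */
struct ist {
    char *str;
    size_t len;
};

/* Makes <b> point to a pre-formatted message without copying it. */
static inline void buf_set_static(struct buffer *b, struct ist msg)
{
    assert(msg.len <= UINT_MAX);
    b->area = msg.str;
    b->len  = (uint)msg.len;
    b->head = 0;
    b->size = 0;   /* tells buf_release() not to free() the area */
}

/* Releases the area only when the buffer owns it (size != 0). */
static inline void buf_release(struct buffer *b)
{
    if (b->size)
        free(b->area);
    b->area = NULL;
    b->head = b->len = b->size = 0;
}
```

Code that computes the available room must be prepared to see size==0 here,
since len is then larger than size.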

The chunk struct can then be removed and replaced everywhere with a struct
buffer. Only the functions will remain, though they will likely have to be
renamed. Maybe the buffer manipulation functions will have to be split between
those which support wrapping and those which don't (chunks don't support
wrapping).

The buf_empty structure will then disappear since a small 20-byte structure
will be enough to represent an empty buffer.


7. Third implementation step

The third step will consist in placing a struct buffer into the struct channel.
This way no allocation is needed at all, and any storage can be used to deliver
contents. This makes it trivial to upgrade a buffer on the fly by picking from
a different slab. It also allows error messages to be delivered without ever
having to perform a buffer allocation. Doing so removes the need for the early
buffer allocation for the response in process_stream(), as it is only needed to
have a reliable place to send an error message to. This will ensure the buffer
allocator can be simplified and made more robust against the risk of deadlock
on memory shortage.


8. Caveats

The following points require extra care :
  - there will be some subtractions to figure the buffer "input" length
    (formerly buf->i). In the past it always used to be an unsigned value.
    Extreme care will have to be taken to always use an inline function to
    compute this so that it doesn't accidentally become signed.

  - supporting buf->size==0 to point to a special string may require some extra
    checks to avoid causing an integer underflow when calculating (size-len) or
    (size-len-head) to figure the available room.

  - it is very likely that some further changes will be tempting to do in the
    channel to better integrate the buffer (which becomes very similar to the
    pipe), but we must not go as far as removing the visibility of the "struct
    buffer" because it will be used as entry point for many functions.

  - it is likely that a number of the chunk_*, bi_* and bo_* variants have very
    minor variations like return codes or error checking that can make their
    replacement very tricky. Each set of such functions must be studied in
    advance, and their users as well.
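
The first two points can be illustrated with a pair of inline helpers. These
are only a sketch with made-up names (buf_input_len, buf_room); the important
part is that every subtraction is guarded so the unsigned result can never
wrap around.

```c
#include <assert.h>

typedef unsigned int uint;

struct buffer {
    void *area;   /* start of the storage area */
    uint head;    /* start offset of remaining data relative to area */
    uint len;     /* contents length after head */
    uint size;    /* size of the storage area, 0 = points to a static string */
};

/* Input length (formerly buf->i): what remains once <output> bytes, taken
 * from the channel, are excluded. Always unsigned, never underflows.
 */
static inline uint buf_input_len(const struct buffer *b, uint output)
{
    return (output < b->len) ? b->len - output : 0;
}

/* Available room. A buffer with size==0 pointing to a static string has
 * len > size, so a plain (size - len) would underflow; report no room.
 */
static inline uint buf_room(const struct buffer *b)
{
    /* an owned buffer never holds more than its size */
    assert(b->size == 0 || b->len <= b->size);
    return (b->len < b->size) ? b->size - b->len : 0;
}
```

Keeping these computations in one place also means the types involved are
decided once, instead of at each call site.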