2018-02-21 - Layering in haproxy 1.9
------------------------------------

2 main zones :
  - application : reads from conn_streams, writes to conn_streams, often uses
    streams

  - connection : receives data from the network, presented into buffers
    available via conn_streams, sends data to the network

The connection zone contains multiple layers which behave independently in each
direction. The Rx direction is activated upon callbacks from the lower layers.
The Tx direction is activated recursively from the upper layers. Between every
two layers there may be a buffer, in each direction. When a buffer is full
either in Tx or Rx direction, this direction is paused between the network
layer and the location where the congestion is encountered. Upon end of
congestion (cs_recv() from the upper layer, or sendto() at the lower layers), a
tasklet_wakeup() is performed on the blocked layer so that suspended operations
can be resumed. In this case, the Rx side restarts propagating data upwards
from the lowest blocked level, while the Tx side restarts propagating data
downwards from the highest blocked level. Proceeding like this ensures that
information known to the producer may always be used to tailor the buffer sizes
or to decide on a strategy to best aggregate data. Additionally, each time a
layer is crossed without transformation, it becomes possible to send without
copying.

The Rx side notifies the application of data readiness using a wakeup or a
callback. The Tx side notifies the application of room availability once data
have been moved resulting in the uppermost buffer having some free space.

When crossing a mux downwards, it is possible that the sender is not allowed to
access the buffer because it is not yet its turn. This is not a problem : the
data remain in the conn_stream's buffer (or the stream's one) and the transfer
will be restarted once the mux is ready to consume these data.


           cs_recv()     -------.          cs_send()
    ^          +--------> ||||||  -------------+       ^
    |          |         -------'              |       |     stream
  --|----------|-------------------------------|-------|-------------------
    |          |                               V       |     connection
  data       .---.                           |   |    room
 ready!      |---|                           |---| available!
             |---|                           |---|
             |---|                           |---|
             |   |                           '---'
               ^ +------------+-------+        |
               | |            ^       |       /
              /  V            |       V      /
             / recvfrom()     |   sendto()   |
-------------|----------------|--------------|---------------------------
             |                | poll!        V               kernel


The cs_recv() function should act on pointers to buffer pointers, so that the
callee may decide to pass its own buffer directly by simply swapping pointers.
Similarly for cs_send() it is desirable to let the callee steal the buffer by
swapping the pointers. This way it remains possible to implement zero-copy
forwarding.

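The pointer-swapping idea can be illustrated with a minimal sketch. Everything
in it is hypothetical : "struct xbuf" is a toy buffer (not haproxy's buffer
type) and xfer_or_swap() is an invented name. It only shows why the function
needs pointers to buffer pointers : when the destination is empty, the two
pointers are exchanged and no byte is copied, otherwise it falls back to a
regular copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal buffer, for illustration only. */
struct xbuf {
    char   area[64];
    size_t data;            /* number of bytes currently stored */
};

/* Moves pending data from *src to *dst. Both arguments are pointers to
 * buffer pointers : when *dst is empty the two pointers are swapped and
 * nothing is copied. Otherwise as much as fits is appended to *dst and
 * the remainder stays in *src. Returns the number of bytes that became
 * available in *dst.
 */
size_t xfer_or_swap(struct xbuf **dst, struct xbuf **src)
{
    struct xbuf *tmp;
    size_t room, len;

    assert(dst && src && *dst && *src && *dst != *src);

    if ((*dst)->data == 0) {
        /* zero-copy path : the caller now owns the producer's buffer
         * and the producer gets the caller's empty one in exchange.
         */
        tmp  = *dst;
        *dst = *src;
        *src = tmp;
        return (*dst)->data;
    }

    /* copy path : append what fits, keep the rest in the source */
    room = sizeof((*dst)->area) - (*dst)->data;
    len  = (*src)->data < room ? (*src)->data : room;
    memcpy((*dst)->area + (*dst)->data, (*src)->area, len);
    memmove((*src)->area, (*src)->area + len, (*src)->data - len);
    (*dst)->data += len;
    (*src)->data -= len;
    return len;
}
```

A caller holding an empty buffer thus gets the producer's buffer back through
its own pointer, and must not assume that the pointer it passed is unchanged
on return.
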
Some operation flags will be needed on cs_recv() :
  - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it
    will result in a data copy (i.e. the buffer is not empty), unless no more
    than XXX bytes have to be copied (e.g. copying 2 cache lines may be cheaper
    than waiting and playing with pointers)

  - RECV_AT_ONCE : only perform the operation if it leaves the source buffer
    empty at the end of the operation, so that no two buffers remain allocated
    at the end. It will most of the time result in either a small read or a
    zero-copy operation.

  - RECV_PEEK : retrieve a copy of pending data without removing these data
    from the source buffer. An alternate solution could consist in finding the
    pointer to the source buffer and accessing these data directly, except
    that it might be less interesting for the long term, thread-wise.

  - RECV_MIN : receive at least X bytes (or fewer in case of shutdown), or
    fail. This should help various protocol parsers which need to receive a
    complete frame before proceeding.

  - RECV_ENOUGH : no more data are expected after this read if it is of the
    requested size, thus there is no need to re-enable receiving on the lower
    layers.

  - RECV_ONE_SHOT : perform a single read without re-enabling reading on the
    lower layers, like we currently do when receiving an HTTP/1 request. It is
    like RECV_ENOUGH where any size is enough. The two could probably be
    merged (e.g. by having a MIN argument like RECV_MIN).


Some operation flags will be needed on cs_send() :
  - SEND_ZERO_COPY : refuse to merge the presented data with existing data and
    prefer to wait for current data to leave and try again, unless the consumer
    considers the amount of data acceptable for a copy.

  - SEND_AT_ONCE : only perform the operation if it leaves the source buffer
    empty at the end of the operation, so that no two buffers remain allocated
    at the end. It will most of the time result in either a small write or a
    zero-copy operation.


Both operations should return a composite status :
  - number of bytes transferred
  - status flags (shutr, shutw, reset, empty, full, ...)


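As an illustration only, the flags and the composite status could be modelled
as below. The RECV_* names come from the list above but their values are
arbitrary, and the XS_* status names, "struct xfer_status" and toy_recv() are
assumptions made for this sketch. The toy function works on byte counts
instead of real buffers and only implements RECV_AT_ONCE and RECV_MIN ; the
other flags are declared but ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Operation flags from the list above ; values are arbitrary. */
enum {
    RECV_ZERO_COPY = 0x01,
    RECV_AT_ONCE   = 0x02,
    RECV_PEEK      = 0x04,
    RECV_MIN       = 0x08,
    RECV_ENOUGH    = 0x10,
    RECV_ONE_SHOT  = 0x20,
    RECV_ALL_FLAGS = 0x3f,
};

/* Status flags ; these names are hypothetical. */
enum {
    XS_SHUTR = 0x01,
    XS_SHUTW = 0x02,
    XS_RESET = 0x04,
    XS_EMPTY = 0x08,   /* the source has nothing left */
    XS_FULL  = 0x10,   /* the destination has no room left */
};

/* Composite status : a byte count plus status flags. */
struct xfer_status {
    size_t   bytes;
    unsigned flags;
};

/* Toy receive : the source holds <avail> bytes, the destination has <room>
 * free bytes, <min> is only used with RECV_MIN, and <src_shut> is non-zero
 * when the source was shut down. Only counts are manipulated.
 */
struct xfer_status toy_recv(size_t avail, size_t room, size_t min,
                            int src_shut, unsigned flags)
{
    struct xfer_status st = { 0, 0 };
    size_t want = avail < room ? avail : room;

    assert((flags & ~(unsigned)RECV_ALL_FLAGS) == 0);

    /* RECV_AT_ONCE : all or nothing, the source must end up empty */
    if ((flags & RECV_AT_ONCE) && want < avail)
        want = 0;

    /* RECV_MIN : refuse short reads unless the source was shut down */
    if ((flags & RECV_MIN) && want < min && !src_shut)
        want = 0;

    st.bytes = want;
    if (avail - want == 0)
        st.flags |= XS_EMPTY;
    if (room - want == 0)
        st.flags |= XS_FULL;
    if (src_shut && avail - want == 0)
        st.flags |= XS_SHUTR;
    return st;
}
```

Returning both fields at once lets the caller tell "nothing was transferred
because the destination is full" apart from "nothing was transferred because a
flag refused the operation", which a plain byte count cannot express.

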
2018-07-23 - Update after merging rxbuf
---------------------------------------

It becomes visible that the mux will not always be welcome to decode incoming
data because it will sometimes imply extra memory copies and/or usage for no
benefit.

Ideally, when a stream is instantiated based on incoming data, these incoming
data should be passed and the upper layers called, but it should then be up to
these upper layers to peek more data in certain circumstances. Typically if the
pending connection data are larger than what is expected to be passed above, it
means some data may cause head-of-line blocking (HOL) to other streams, and
need to be pushed up through the layers to let other streams continue to work.
Similarly very large H2 data frames after header frames should probably not be
passed as they may require copies that could be avoided if passed later.
However if the decoded frame fits into the conn_stream's buffer, there is an
opportunity to use a single buffer for the conn_stream and the channel. The H2
demux could set a blocking flag indicating it's waiting for the upper stream to
take over demuxing. This flag would be purged once the upper stream starts
reading, or when extra data come and change the conditions.

Forcing structured headers and raw data to coexist within a single buffer is
quite challenging for many code parts. For example it's perfectly possible to
see a fragmented buffer containing a series of headers, then a small data chunk
that was received at the same time, then a few other headers added by request
processing, then another data block received afterwards, then possibly yet
another header added by option http-send-name-header, and yet another data
block. This causes some pain for compression which still needs to know where
compressed and uncompressed data start/stop. It also makes it very difficult
to account for the exact bytes to pass through the various layers.

One solution consists in thinking about buffers using 3 representations :

  - a structured message, which is used for the internal HTTP representation.
    This message may only be atomically processed. It has no clear byte count,
    it's a message.

  - a raw stream, consisting of sequences of bytes. That's typically what
    happens in data sequences or in tunnel.

  - a pipe, which contains data to be forwarded, and that haproxy cannot have
    access to.

Processing efficiency decreases as one moves up this list towards the more
complex representations, but the capabilities increase. The structured message
can contain anything including serialized data blocks to be processed or
forwarded. The raw stream contains data blocks to be processed or forwarded.
The pipe only contains data blocks to be forwarded. The latter ones are only an
optimization of the former ones.

Thus ideally a channel should have access to all such 3 storage areas at once,
depending on the use case :
  (1) a structured message,
  (2) a raw stream,
  (3) a pipe

Right now a channel only has (2) and (3) but after the native HTTP rework, it
will only have (1) and (3). Placing a raw stream exclusively in (1) comes with
some performance drawbacks which are not easily recovered, and with some quite
difficult management still involving the reserve to ensure that a data block
doesn't prevent headers from being appended. But during header processing, the
payload may be necessary so we cannot decide to drop this option.

A long-term approach would consist in ensuring that a single channel may have
access to all 3 representations at once, and in enumerating priority rules to
define how they interact together. That's exactly what is currently being done
with the pipe and the raw buffer. Doing so would also save the need for storing
payload in the structured message and void the requirement for the reserve. But
it would cost more memory to process POST data and server responses. Thus an
intermediary step consists in keeping this model in mind but not implementing
everything yet.

Short term proposal : a channel has access to a buffer and a pipe. A non-empty
buffer is either in structured message format OR raw stream format. Only the
channel knows. However a structured buffer MAY contain raw data in a properly
formatted way (using the envelope defined by the structured message format).

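The exclusivity rule of this proposal can be sketched as follows. All names
(enum buf_fmt, struct chn_buf, chn_buf_put(), chn_buf_del()) are invented for
the example and the buffer is reduced to a byte counter. An empty buffer adopts
the format of the first data written to it, and a non-empty one refuses the
other format, which leaves the producer with the choice of re-encoding raw data
into the message envelope or waiting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical format tag ; only the channel is supposed to know it. */
enum buf_fmt { FMT_NONE = 0, FMT_RAW, FMT_MSG };

struct chn_buf {
    enum buf_fmt fmt;   /* FMT_NONE when the buffer is empty */
    size_t       data;  /* amount of data stored, in bytes */
};

/* Tries to add <len> bytes of format <fmt> to the buffer. Returns 1 on
 * success, 0 if the buffer already holds data in the other format.
 */
int chn_buf_put(struct chn_buf *b, enum buf_fmt fmt, size_t len)
{
    assert(b && fmt != FMT_NONE && len > 0);

    if (b->data == 0)
        b->fmt = fmt;        /* an empty buffer adopts any format */
    else if (b->fmt != fmt)
        return 0;            /* a non-empty buffer has a single format */

    b->data += len;
    return 1;
}

/* Removes <len> bytes ; the format is forgotten once the buffer is empty. */
void chn_buf_del(struct chn_buf *b, size_t len)
{
    assert(b && len <= b->data);

    b->data -= len;
    if (b->data == 0)
        b->fmt = FMT_NONE;
}
```

In this model, raw data refused by a structured buffer would have to be
presented again as FMT_MSG once wrapped in the envelope, which is the "MAY
contain raw data" case above.
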
By default, when a demux writes to a CS rxbuf, it will try to use the lowest
possible level for what is being done (i.e. splice if possible, otherwise raw
stream, otherwise structured message). If the buffer already contains a
structured message, then this format is exclusive. From this point the MUX has
two options : either encode the incoming data to match the structured message
format, or refrain from receiving into the CS's rxbuf and wait until the upper
layer requests those data.

This opens a simplified option which could be suited even for the long term :
  - cs_recv() will take one or two flags to indicate if a buffer already
    contains a structured message or not ; the upper layer knows it.

  - cs_recv() will take two flags to indicate what the upper layer is willing
    to take :
      - structured message only
      - raw stream only
      - any of them

    From this point the mux can decide to either pass anything or refrain from
    doing so.

  - the demux stores the knowledge it has from the contents into some CS flags
    to indicate whether or not some structured messages are still available,
    and whether or not some raw data are still available. Thus the caller knows
    whether or not extra data are available.

  - when the demux works on its own, it refrains from passing structured data
    to a non-empty buffer, unless these data are causing trouble to other
    streams (HOL).

  - when a demux has to encapsulate raw data into a structured message, it will
    always have to respect a configured reserve so that extra header processing
    can be done on the structured message inside the buffer, regardless of the
    supposed available room. In addition, the upper layer may indicate using an
    extra recv() flag whether it wants the demux to defragment serialized data
    (for example by moving trailing headers apart) or if it's not necessary.
    This flag will be set by the stream interface if compression is required or
    if the http-buffer-request option is set for example. Using to_forward==0
    is probably a stronger indication that the reserve must be respected.

  - cs_recv() and cs_send(), when fed with a message, should not return byte
    counts but message counts (i.e. 0 or 1). This implies that a single call to
    either of these functions cannot mix raw data and structured messages at
    the same time.

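One possible reading of the rules above, as a decision function. The flag
names (RF_*), the enums and demux_decide() are all assumptions made for this
sketch, and the reserve, the HOL exception and the defragmentation flag are
deliberately left out. It only covers what the demux does with the next
pending item depending on what the upper layer accepts and on whether the
buffer already holds a structured message.

```c
#include <assert.h>

/* Hypothetical flags passed by the upper layer to the receive call. */
enum {
    RF_ACCEPT_MSG  = 0x01,  /* willing to take structured messages */
    RF_ACCEPT_RAW  = 0x02,  /* willing to take raw stream data */
    RF_BUF_HAS_MSG = 0x04,  /* the buffer already holds a message */
    RF_ALL         = 0x07,
};

/* What the demux has at the head of its pending data. */
enum pending { PEND_NONE = 0, PEND_MSG, PEND_RAW };

/* What the demux decides to do. */
enum deliver {
    DELIVER_NONE = 0,    /* refrain and wait for the upper layer */
    DELIVER_MSG,         /* pass one structured message */
    DELIVER_RAW,         /* pass raw bytes as-is */
    DELIVER_RAW_IN_MSG,  /* wrap raw bytes into the message envelope */
};

enum deliver demux_decide(unsigned rflags, enum pending head)
{
    assert((rflags & ~(unsigned)RF_ALL) == 0);

    if (head == PEND_NONE)
        return DELIVER_NONE;

    if (head == PEND_MSG)
        return (rflags & RF_ACCEPT_MSG) ? DELIVER_MSG : DELIVER_NONE;

    /* from here the head is raw data */
    if (rflags & RF_BUF_HAS_MSG) {
        /* the structured format is exclusive : encode or refrain */
        return (rflags & RF_ACCEPT_MSG) ? DELIVER_RAW_IN_MSG : DELIVER_NONE;
    }

    /* lowest possible level first : raw stream before message */
    if (rflags & RF_ACCEPT_RAW)
        return DELIVER_RAW;

    return (rflags & RF_ACCEPT_MSG) ? DELIVER_RAW_IN_MSG : DELIVER_NONE;
}
```

With such a function, DELIVER_MSG and DELIVER_RAW_IN_MSG would be reported to
the caller as a message count (0 or 1) and DELIVER_RAW as a byte count, so a
single call never mixes the two units.
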
At this point it looks like the conn_stream will have some encapsulation work
to do for the payload if it needs to be encapsulated into a message. This
further magnifies the importance of *not* decoding DATA frames into the CS's
rxbuf until really needed.

The CS will probably need to hold an indication of what is available at the
mux level, not only in the CS. E.g. we know that payload is still available.

Using these elements, it should be possible to ensure that full header frames
may be received without enforcing any reserve, that too large frames that do
not fit will be detected because they return 0 messages and indicate that such
a message is still pending, and that data availability is correctly detected
(later we may expect that the stream-interface allocates a larger or second
buffer to place the payload).

Regarding the ability for the channel to forward data, it looks like having a
new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in
optimizing the forwarding to make use of splicing when available. It is not yet
totally clear whether it will split into "cs_xfer_in(src_cs, pipe, count)"
followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it
still needs to be studied. The general idea seems to be that the receiver might
have to call the sender directly once they agree on how to transfer data (pipe
or buffer). If the transfer is incomplete, the cs_xfer() return value and/or
flags will indicate the current situation (src empty, dst full, etc) so that
the caller may register for notifications on the appropriate event and wait to
be called again to continue.

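Since cs_xfer() is only an idea at this stage, the following is nothing more
than a toy model of its reporting side. The endpoints are reduced to counters,
and "struct toy_cs", "struct xfer_res", the XFER_* flags and toy_cs_xfer() are
all hypothetical. It shows how an incomplete transfer could tell the caller
which event to wait for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical endpoint reduced to two counters. */
struct toy_cs {
    size_t pending;   /* bytes ready to be read from this side */
    size_t room;      /* bytes this side can still accept */
};

/* Reasons for an incomplete transfer ; names are assumptions. */
enum {
    XFER_SRC_EMPTY = 0x01,
    XFER_DST_FULL  = 0x02,
};

struct xfer_res {
    size_t   done;    /* bytes actually moved */
    unsigned flags;   /* only set when done < count */
};

/* Moves up to <count> bytes from <src> to <dst>. When fewer bytes than
 * requested were moved, the flags report why, so that the caller knows
 * which side it has to wait for before trying again.
 */
struct xfer_res toy_cs_xfer(struct toy_cs *src, struct toy_cs *dst,
                            size_t count)
{
    struct xfer_res res = { 0, 0 };
    size_t n = count;

    assert(src && dst && src != dst);

    if (n > src->pending)
        n = src->pending;
    if (n > dst->room)
        n = dst->room;

    src->pending -= n;
    dst->room    -= n;
    res.done      = n;

    if (res.done < count) {
        if (src->pending == 0)
            res.flags |= XFER_SRC_EMPTY;
        if (dst->room == 0)
            res.flags |= XFER_DST_FULL;
    }
    return res;
}
```

A caller seeing XFER_DST_FULL would subscribe for room on the destination,
while XFER_SRC_EMPTY would make it wait for data on the source. Whether the
bytes travel through a pipe or a buffer does not appear in this interface.
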
Short term implementation :
  1) add new CS flags to qualify what the buffer contains and what we expect
     to read into it;

  2) set these flags to pretend we have a structured message when receiving
     headers (after all, H1 is an atomic header as well) and see what it
     implies for the code; for H1 it's unclear whether it makes sense to try
     to set it without the H1 mux;

  3) use these flags to refrain from sending DATA frames after HEADERS frames
     in H2;

  4) flush the flags at the stream interface layer when performing a cs_send();

  5) use the flags to enforce receipt of data only when necessary.

We should be able to end up with sequential receipt in H2 modelling what is
needed for other protocols without interfering with the native H1 devs.


2018-08-17 - Considerations after killing cs_recv()
---------------------------------------------------

With the ongoing reorganisation of the I/O layers, it's visible that cs_recv()
will have to transfer data between the cs' rxbuf and the channel's buffer while
not being aware of the data format. Moreover, in case there's no data there, it
needs to recursively call the mux's rcv_buf() to trigger a decoding, while this
function is sometimes replaced with cs_recv(). All this shows that cs_recv() is
in fact needed while data are pushed upstream from the lower layers, and is not
suitable for the "pull" mode. Thus it was decided to remove this function and
put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be
replaced with cs_recv() since it is the only one knowing about the buffer's
format.

This opportunity simplified something : if the cs's rxbuf is only read by the
mux's rcv_buf() method, then it doesn't need to be located in the CS and is
better placed in the mux's representation of the stream. This has an important
impact for H2 as it offers more freedom to the mux to allocate/free/reallocate
this buffer, and it ensures the mux always has access to it.

Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1
mux has already uncovered the difficulty related to the channel shutting down
on output, with data stuck in the CS's txbuf. Since the CS is tightly coupled
to the stream and the stream can close immediately once its buffers are empty,
it required a way to support orphaned CS with pending data in their txbuf. This
is something that the H2 mux already has to deal with, by carefully leaving the
data in the channel's buffer. But due to the snd_buf() call being top-down, it
is always possible to push the stream's data via the mux's snd_buf() call
without requiring a CS txbuf anymore. Thus the txbuf (when needed) is only
implemented in the mux and attached to the mux's representation of the stream,
and doing so allows the channel to be released immediately once the data are
safe in the mux's buffer.

This is an important change which clarifies the roles and responsibilities of
each layer in the chain : when receiving data from a mux, it's the mux's
responsibility to make sure it can correctly decode the incoming data and to
buffer the possible excess of data it cannot pass to the requester. This means
that decoding an H2 frame, which is not retryable since it has an impact on the
HPACK decompression context, and which cannot be reordered for the same reason,
simply needs to be performed into the H2 stream's rxbuf which will then be
passed to the stream when this one calls h2_rcv_buf(), even if it reads one
byte at a time. Similarly when calling h2_snd_buf(), it's the mux's
responsibility to read as much as it needs to be able to restart later,
possibly by buffering some data into a local buffer. And it's only once all the
output data has been consumed by snd_buf() that the stream is free to
disappear.

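The split of responsibilities on the receive side can be pictured with a toy
model. Nothing here is the real H2 mux : "struct toy_h2s", toy_demux_frame()
and toy_rcv_buf() are invented names and the "decoding" is a plain copy. The
point is only that a frame is decoded exactly once into a buffer owned by the
mux's representation of the stream, after which the upper layer may pull the
result in pieces of any size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mux-side stream : the rxbuf lives here, not in the CS. */
struct toy_h2s {
    char     rxbuf[64];
    size_t   rx_len;    /* decoded bytes waiting for the upper stream */
    unsigned decoded;   /* number of frames decoded so far */
};

/* Decodes one frame payload into the stream's rxbuf. Decoding is done
 * once per frame and never replayed. Returns 1 on success, 0 if the rxbuf
 * lacks room, in which case the frame is left undecoded for now.
 */
int toy_demux_frame(struct toy_h2s *h2s, const char *payload, size_t len)
{
    assert(h2s && payload);

    if (len > sizeof(h2s->rxbuf) - h2s->rx_len)
        return 0;

    memcpy(h2s->rxbuf + h2s->rx_len, payload, len);
    h2s->rx_len += len;
    h2s->decoded++;
    return 1;
}

/* rcv_buf()-like call : hands at most <count> already decoded bytes to
 * the caller and keeps the excess buffered in the mux.
 */
size_t toy_rcv_buf(struct toy_h2s *h2s, char *out, size_t count)
{
    assert(h2s && out);

    if (count > h2s->rx_len)
        count = h2s->rx_len;

    memcpy(out, h2s->rxbuf, count);
    memmove(h2s->rxbuf, h2s->rxbuf + count, h2s->rx_len - count);
    h2s->rx_len -= count;
    return count;
}
```

Reading a 3-byte frame one byte at a time takes three toy_rcv_buf() calls but
still a single decoding pass, which is what keeps a non-retryable decoder such
as HPACK safe.
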
This model presents the nice benefit of being infinitely stackable and of
solving the last identified showstoppers to move towards a structured message
internal representation, as it will give full power to rcv_buf() and snd_buf()
to process what they need.

For now the conn_stream's flags indicating whether a shutdown has been seen in
any direction or if an end of stream was seen will remain in the conn_stream,
though it's likely that some of them will move to the mux's representation of
the stream after structured messages are implemented.