haproxy/doc/design-thoughts/http2.txt

278 lines
16 KiB
Plaintext

2014/10/23 - design thoughts for HTTP/2
- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because a
connection holds a compression context (headers table, etc...). We probably
need to have an h2_conn struct.
- multiple transactions will be handled in parallel for a given h2_conn. They
are called streams in HTTP/2 terminology.
- multiplexing : for a given client-side h2 connection, we can have multiple
server-side h2 connections. And for a server-side h2 connection, we can have
multiple client-side h2 connections. Streams circulate in N-to-N fashion.
- flow control : flow control will be applied between multiple streams. Special
care must be taken so that an H2 client cannot block some H2 servers by
sending requests spread over multiple servers to the point where one server
response is blocked and prevents other responses from the same server from
reaching their clients. H2 connection buffers must always be empty or nearly
empty. The per-stream flow control needs to be respected as well as the
connection's buffers. It is important to implement some fairness between all
the streams so that it's not always the same which gets the bandwidth when
the connection is congested.
- some clients can be H1 with an H2 server (is this really needed ?). Most of
the initial use case will be H2 clients to H1 servers. It is important to keep
in mind that H1 servers do not do flow control and that we don't want them to
block transfers (eg: post upload).
- internal tasks : some H2 clients will be internal tasks (eg: health checks).
Some H2 servers will be internal tasks (eg: stats, cache). The model must be
compatible with this use case.
- header indexing : headers are transported compressed, with a reference to a
static or a dynamic header, or a literal, possibly huffman-encoded. Indexing
is specific to the H2 connection. This means there is no way any binary data
can flow between both sides, headers will have to be decoded according to the
incoming connection's context and re-encoded according to the outgoing
connection's context, which can significantly differ. In order to avoid the
parsing trouble we currently face, headers will have to be clearly split
between name and value. It is worth noting that neither the incoming nor the
outgoing connections' contexts will be of any use while processing the
headers. At best we can have some shortcuts for well-known names that map
well to the static ones (eg: use the first static entry with same name), and
maybe have a few special cases for static name+value as well. Probably we can
classify headers in such categories :
- static name + value
- static name + other value
- dynamic name + other value
This will allow for better processing in some specific cases. Headers
supporting a single value (:method, :status, :path, ...) should probably
be stored in a single location with a direct access. That would allow us
to retrieve a method using hdr[METHOD]. All such indexing must be performed
while parsing. That also means that HTTP/1 will have to be converted to this
representation very early in the parser and possibly converted back to H/1
after processing.
Header names/values will have to be placed in a small memory area that will
inevitably get fragmented as headers are rewritten. An automatic packing
mechanism must be implemented so that when there's no more room, headers are
simply defragmented/packet to a new table and the old one is released. Just
like for the static chunks, we need to have a few such tables pre-allocated
and ready to be swapped at any moment. Repacking must not change any index
nor affect the way headers are compressed so that it can happen late after a
retry (send-name-header for example).
- header processing : can still happen on a (header, value) basis. Reqrep/
rsprep completely disappear and will have to be replaced with something else
to support renaming headers and rewriting url/path/...
- push_promise : servers can push dummy requests+responses. They advertise
the stream ID in the push_promise frame indicating the associated stream ID.
This means that it is possible to initiate a client-server stream from the
information coming from the server and make the data flow as if the client
had made it. It's likely that we'll have to support two types of server
connections: those which support push and those which do not. That way client
streams will be distributed to existing server connections based on their
capabilities. It's important to keep in mind that PUSH will not be rewritten
in responses.
- stream ID mapping : since the stream ID is per H2 connection, stream IDs will
have to be mapped. Thus a given stream is an entity with two IDs (one per
side). Or more precisely a stream has two end points, each one carrying an ID
when it ends on an HTTP2 connection. Also, for each stream ID we need to
quickly find the associated transaction in progress. Using a small quick
unique tree seems indicated considering the wide range of valid values.
- frame sizes : frame have to be remapped between both sides as multiplexed
connections won't always have the same characteristics. Thus some frames
might be spliced and others will be sliced.
- error processing : care must be taken to never break a connection unless it
is dead or corrupt at the protocol level. Stats counter must exist to observe
the causes. Timeouts are a great problem because silent connections might
die out of inactivity. Ping frames should probably be scheduled a few seconds
before the connection timeout so that an unused connection is verified before
being killed. Abnormal requests must be dealt with using RST_STREAM.
- ALPN : ALPN must be observed onthe client side, and transmitted to the server
side.
- proxy protocol : proxy protocol makes little to no sense in a multiplexed
protocol. A per-stream equivalent will surely be needed if implementations
do not quickly generalize the use of Forward.
- simplified protocol for local devices (eg: haproxy->varnish in clear and
without handshake, and possibly even with splicing if the connection's
settings are shared)
- logging : logging must report a number of extra information such as the
stream ID, and whether the transaction was initiated by the client or by the
server (which can be deduced from the stream ID's parity). In case of push,
the number of the associated stream must also be reported.
- memory usage : H2 increases memory usage by mandating use of 16384 bytes
frame size minimum. That means slightly more than 16kB of buffer in each
direction to process any frame. It will definitely have an impact on the
deployed maxconn setting in places using less than this (4..8kB are common).
Also, the header list is persistent per connection, so if we reach the same
size as the request, that's another 16kB in each direction, resulting in
about 48kB of memory where 8 were previously used. A more careful encoder
can work with a much smaller set even if that implies evicting entries
between multiple headers of the same message.
- HTTP/1.0 should very carefully be transported over H2. Since there's no way
to pass version information in the protocol, the server could use some
features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
...).
- host / :authority : ":authority" is the norm, and "host" will be absent when
H2 clients generate :authority. This probably means that a dummy Host header
will have to be produced internally from :authority and removed when passing
to H2 behind. This can cause some trouble when passing H2 requests to H1
proxies, because there's no way to know if the request should contain scheme
and authority in H1 or not based on the H2 request. Thus a "proxy" option
will have to be explicitly mentionned on HTTP/1 server lines. One of the
problem that it creates is that it's not longer possible to pass H/1 requests
to H/1 proxies without an explicit configuration. Maybe a table of the
various combinations is needed.
:scheme :authority host
HTTP/2 request present present absent
HTTP/1 server req absent absent present
HTTP/1 proxy req present present present
So in the end the issue is only with H/2 requests passed to H/1 proxies.
- ping frames : they don't indicate any stream ID so by definition they cannot
be forwarded to any server. The H2 connection should deal with them only.
There's a layering problem with H2. The framing layer has to be aware of the
upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
it over a framing layer to mux the streams, the frame type must be passed below
so that frames are properly arranged. Header encoding is connection-based and
all streams using the same connection will interact in the way their headers
are encoded. Thus the encoder *has* to be placed in the h2_conn entity, and
this entity has to know for each stream what its headers are.
Probably that we should remove *all* headers from transported data and move
them on the fly to a parallel structure that can be shared between H1 and H2
and consumed at the appropriate level. That means buffers only transport data.
Trailers have to be dealt with differently.
So if we consider an H1 request being forwarded between a client and a server,
it would look approximately like this :
- request header + body land into a stream's receive buffer
- headers are indexed and stripped out so that only the body and whatever
follows remain in the buffer
- both the header index and the buffer with the body stay attached to the
stream
- the sender can rebuild the whole headers. Since they're found in a table
supposed to be stable, it can rebuild them as many times as desired and
will always get the same result, so it's safe to build them into the trash
buffer for immediate sending, just as we do for the PROXY protocol.
- the upper protocol should probably provide a build_hdr() callback which
when called by the socket layer, builds this header block based on the
current stream's header list, ready to be sent.
- the socket layer has to know how many bytes from the headers are left to be
forwarded prior to processing the body.
- the socket layer needs to consume only the acceptable part of the body and
must not release the buffer if any data remains in it (eg: pipelining over
H1). This is already handled by channel->o and channel->to_forward.
- we could possibly have another optional callback to send a preamble before
data, that could be used to send chunk sizes in H1. The danger is that it
absolutely needs to be stable if it has to be retried. But it could
considerably simplify de-chunking.
When the request is sent to an H2 server, an H2 stream request must be made
to the server, we find an existing connection whose settings are compatible
with our needs (eg: tls/clear, push/no-push), and with a spare stream ID. If
none is found, a new connection must be established, unless maxconn is reached.
Servers must have a maxstream setting just like they have a maxconn. The same
queue may be used for that.
The "tcp-request content" ruleset must apply to the TCP layer. But with HTTP/2
that becomes impossible (and useless). We still need something like the
"tcp-request session" hook to apply just after the SSL handshake is done.
It is impossible to defragment the body on the fly in HTTP/2. Since multiple
messages are interleaved, we cannot wait for all of them and block the head of
line. Thus if body analysis is required, it will have to use the stream's
buffer, which necessarily implies a copy. That means that with each H2 end we
necessarily have at least one copy. Sometimes we might be able to "splice" some
bytes from one side to the other without copying into the stream buffer (same
rules as for TCP splicing).
In theory, only data should flow through the channel buffer, so each side's
connector is responsible for encoding data (H1: linear/chunks, H2: frames).
Maybe the same mechanism could be extrapolated to tunnels / TCP.
Since we'd use buffers only for data (and for receipt of headers), we need to
have dynamic buffer allocation.
Thus :
- Tx buffers do not exist. We allocate a buffer on the fly when we're ready to
send something that we need to build and that needs to be persistent in case
of partial send. H1 headers are built on the fly from the header table to a
temporary buffer that is immediately sent and whose amount of sent bytes is
the only information kept (like for PROXY protocol). H2 headers are more
complex since the encoding depends on what was successfully sent. Thus we
need to build them and put them into a temporary buffer that remains
persistent in case send() fails. It is possible to have a limited pool of
Tx buffers and refrain from sending if there is no more buffer available in
the pool. In that case we need a wake-up mechanism once a buffer is
available. Once the data are sent, the Tx buffer is then immediately recycled
in its pool. Note that no tx buffer being used (eg: for hdr or control) means
that we have to be able to serialize access to the connection and retry with
the same stream. It also means that a stream that times out while waiting for
the connector to read the second half of its request has to stay there, or at
least needs to be handled gracefully. However if the connector cannot read
the data to be sent, it means that the buffer is congested and the connection
is dead, so that probably means it can be killed.
- Rx buffers have to be pre-allocated just before calling recv(). A connection
will first try to pick a buffer and disable reception if it fails, then
subscribe to the list of tasks waiting for an Rx buffer.
- full Rx buffers might sometimes be moved around to the next buffer instead of
experiencing a copy. That means that channels and connectors must use the
same format of buffer, and that only the channel will have to see its
pointers adjusted.
- Tx of data should be made as much as possible without copying. That possibly
means by directly looking into the connection buffer on the other side if
the local Tx buffer does not exist and the stream buffer is not allocated, or
even performing a splice() call between the two sides. One of the problem in
doing this is that it requires proper ordering of the operations (eg: when
multiple readers are attached to a same buffer). If the splitting occurs upon
receipt, there's no problem. If we expect to retrieve data directly from the
original buffer, it's harder since it contains various things in an order
which does not even indicate what belongs to whom. Thus possibly the only
mechanism to implement is the buffer permutation which guarantees zero-copy
and only in the 100% safe case. Also it's atomic and does not cause HOL
blocking.
It makes sense to chose the frontend_accept() function right after the
handshake ended. It is then possible to check the ALPN, the SNI, the ciphers
and to accept to switch to the h2_conn_accept handler only if everything is OK.
The h2_conn_accept handler will have to deal with the connection setup,
initialization of the header table, exchange of the settings frames and
preparing whatever is needed to fire new streams upon receipt of unknown
stream IDs. Note: most of the time it will not be possible to splice() because
we need to know in advance the amount of bytes to write the header, and here it
will not be possible.
H2 health checks must be seen as regular transactions/streams. The check runs a
normal client which seeks an available stream from a server. The server then
finds one on an existing connection or initiates a new H2 connection. The H2
checks will have to be configurable for sharing streams or not. Another option
could be to specify how many requests can be made over existing connections
before insisting on getting a separate connection. Note that such separate
connections might end up stacking up once released. So probably that they need
to be recycled very quickly (eg: fix how many unused ones can exist max).