haproxy/doc/design-thoughts/http2.txt

2014/10/23 - design thoughts for HTTP/2

- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because a
  connection holds a compression context (headers table, etc...). We probably
  need to have an h2_conn struct.

- multiple transactions will be handled in parallel for a given h2_conn. They
  are called streams in HTTP/2 terminology.

- multiplexing : for a given client-side h2 connection, we can have multiple
  server-side h2 connections. And for a server-side h2 connection, we can have
  multiple client-side h2 connections. Streams circulate in N-to-N fashion.

- flow control : flow control will be applied between multiple streams. Special
  care must be taken so that an H2 client cannot block some H2 servers by
  sending requests spread over multiple servers to the point where one server
  response is blocked and prevents other responses from the same server from
  reaching their clients. H2 connection buffers must always be empty or nearly
  empty. The per-stream flow control needs to be respected as well as the
  connection's buffers. It is important to implement some fairness between all
  the streams so that it's not always the same stream which gets the bandwidth
  when the connection is congested.
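
  The fairness requirement above can be illustrated with a round-robin pass
  over the streams. This is only a sketch under stated assumptions: struct
  h2_strm, h2_sched_round() and the fixed per-turn quantum are hypothetical
  names and choices made for illustration, not an existing haproxy API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stream state used for this illustration only. */
struct h2_strm {
    size_t pending;   /* bytes queued for this stream */
    size_t window;    /* remaining per-stream flow-control credit */
};

/* Serves at most <quantum> bytes per stream per turn, starting at stream
 * <*next>, until the connection window <conn_win> is exhausted or no stream
 * can progress. Returns the number of bytes scheduled, and updates <*next>
 * so that the following call resumes after the last stream visited.
 */
size_t h2_sched_round(struct h2_strm *s, size_t n, size_t *next,
                      size_t conn_win, size_t quantum)
{
    size_t total = 0, idle = 0;
    size_t i = n ? *next % n : 0;

    assert(quantum > 0);
    while (n && conn_win && idle < n) {
        size_t len = s[i].pending;

        if (len > s[i].window) len = s[i].window;
        if (len > quantum)     len = quantum;
        if (len > conn_win)    len = conn_win;

        if (len) {
            s[i].pending -= len;
            s[i].window  -= len;
            conn_win     -= len;
            total        += len;
            idle = 0;
        } else {
            idle++;   /* blocked or empty stream: skip it */
        }
        i = (i + 1) % n;
    }
    *next = i;
    return total;
}
```

  Since <*next> is kept across calls, a congested connection which can only
  emit one quantum at a time serves each stream in turn instead of always
  serving the first one.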

- some clients can be H1 with an H2 server (is this really needed ?). Most of
  the initial use case will be H2 clients to H1 servers. It is important to keep
  in mind that H1 servers do not do flow control and that we don't want them to
  block transfers (eg: post upload).

- internal tasks : some H2 clients will be internal tasks (eg: health checks).
  Some H2 servers will be internal tasks (eg: stats, cache). The model must be
  compatible with this use case.

- header indexing : headers are transported compressed, with a reference to a
  static or a dynamic header, or a literal, possibly huffman-encoded. Indexing
  is specific to the H2 connection. This means there is no way any binary data
  can flow between both sides, headers will have to be decoded according to the
  incoming connection's context and re-encoded according to the outgoing
  connection's context, which can significantly differ. In order to avoid the
  parsing trouble we currently face, headers will have to be clearly split
  between name and value. It is worth noting that neither the incoming nor the
  outgoing connections' contexts will be of any use while processing the
  headers. At best we can have some shortcuts for well-known names that map
  well to the static ones (eg: use the first static entry with same name), and
  maybe have a few special cases for static name+value as well. Probably we can
  classify headers in such categories :

    - static name + value
    - static name + other value
    - dynamic name + other value

  This will allow for better processing in some specific cases. Headers
  supporting a single value (:method, :status, :path, ...) should probably
  be stored in a single location with a direct access. That would allow us
  to retrieve a method using hdr[METHOD]. All such indexing must be performed
  while parsing. That also means that HTTP/1 will have to be converted to this
  representation very early in the parser and possibly converted back to H/1
  after processing.

  Header names/values will have to be placed in a small memory area that will
  inevitably get fragmented as headers are rewritten. An automatic packing
  mechanism must be implemented so that when there's no more room, headers are
  simply defragmented/packed to a new table and the old one is released. Just
  like for the static chunks, we need to have a few such tables pre-allocated
  and ready to be swapped at any moment. Repacking must not change any index
  nor affect the way headers are compressed so that it can happen late after a
  retry (send-name-header for example).
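
  A minimal sketch of the two ideas above (direct access such as hdr[METHOD],
  and repacking into a pre-allocated spare area without changing any index)
  could look like the following. All names (struct hdr_tbl, hdr_set(),
  hdr_repack(), HDR_METHOD, ...) and the tiny area size are assumptions made
  for the example. Only values are stored here, names would be handled the
  same way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_AREA_SIZE 64   /* tiny on purpose, to force repacking */

/* Hypothetical fixed slots giving direct access, eg: ent[HDR_METHOD]. */
enum { HDR_METHOD, HDR_PATH, HDR_MAX = 8 };

struct hdr_ent { size_t ofs, len; };   /* location of a value in the area */

struct hdr_tbl {
    char  *area;                  /* current storage area */
    char  *spare;                 /* pre-allocated area used when repacking */
    size_t used;                  /* bytes consumed in <area>, holes included */
    struct hdr_ent ent[HDR_MAX];  /* index -> value, indexes never move */
};

/* Copies all live values contiguously into the spare area then swaps both
 * areas. Indexes are untouched, only offsets change.
 */
static void hdr_repack(struct hdr_tbl *t)
{
    size_t i, used = 0;
    char *tmp;

    for (i = 0; i < HDR_MAX; i++) {
        memcpy(t->spare + used, t->area + t->ent[i].ofs, t->ent[i].len);
        t->ent[i].ofs = used;
        used += t->ent[i].len;
    }
    tmp = t->area; t->area = t->spare; t->spare = tmp;
    t->used = used;
}

/* Sets the value at index <idx>; the previous value becomes a hole. <val>
 * must not point into the table's own area. Returns 0 on success, or -1 if
 * the value does not fit even after repacking (the entry is then empty).
 */
int hdr_set(struct hdr_tbl *t, size_t idx, const char *val, size_t len)
{
    assert(idx < HDR_MAX);
    t->ent[idx].len = 0;
    if (t->used + len > HDR_AREA_SIZE)
        hdr_repack(t);
    if (t->used + len > HDR_AREA_SIZE)
        return -1;
    memcpy(t->area + t->used, val, len);
    t->ent[idx].ofs = t->used;
    t->ent[idx].len = len;
    t->used += len;
    return 0;
}
```

  Rewriting the same header many times only creates holes until the area is
  full, at which point a single repack reclaims them all.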

- header processing : can still happen on a (header, value) basis. Reqrep/
  rsprep completely disappear and will have to be replaced with something else
  to support renaming headers and rewriting url/path/...

- push_promise : servers can push dummy requests+responses. They advertise
  the promised stream ID in the push_promise frame, which also indicates the
  associated stream ID.
  This means that it is possible to initiate a client-server stream from the
  information coming from the server and make the data flow as if the client
  had made it. It's likely that we'll have to support two types of server
  connections: those which support push and those which do not. That way client
  streams will be distributed to existing server connections based on their
  capabilities. It's important to keep in mind that PUSH will not be rewritten
  in responses.

- stream ID mapping : since the stream ID is per H2 connection, stream IDs will
  have to be mapped. Thus a given stream is an entity with two IDs (one per
  side). Or more precisely a stream has two end points, each one carrying an ID
  when it ends on an HTTP2 connection. Also, for each stream ID we need to
  quickly find the associated transaction in progress. Using a small quick
  unique tree seems indicated considering the wide range of valid values.
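
  As an illustration of the mapping, the sketch below gives a stream one ID
  per side and indexes it in one map per H2 connection. A sorted array with
  binary search stands in for the tree mentioned above, and every name in it
  (struct strm, struct sid_map, sid_insert(), sid_lookup()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_STREAMS 100

/* One stream, two end points. An ID is 0 when that side is not H2 (stream
 * ID 0 designates the connection itself in HTTP/2).
 */
struct strm {
    uint32_t front_id;
    uint32_t back_id;
};

/* Per-connection index of streams, kept sorted by stream ID. */
struct sid_map {
    uint32_t     id[MAX_STREAMS];
    struct strm *strm[MAX_STREAMS];
    size_t       count;
};

/* Returns the position of the first entry whose ID is >= <id>. */
static size_t sid_lower_bound(const struct sid_map *m, uint32_t id)
{
    size_t lo = 0, hi = m->count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (m->id[mid] < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Inserts <s> under the unique key <id>. Returns 0 on success, or -1 if the
 * map is full or the ID is already present.
 */
int sid_insert(struct sid_map *m, uint32_t id, struct strm *s)
{
    size_t pos = sid_lower_bound(m, id);
    size_t tail = m->count - pos;

    assert(id != 0);
    if (m->count == MAX_STREAMS || (tail && m->id[pos] == id))
        return -1;
    memmove(&m->id[pos + 1], &m->id[pos], tail * sizeof(m->id[0]));
    memmove(&m->strm[pos + 1], &m->strm[pos], tail * sizeof(m->strm[0]));
    m->id[pos] = id;
    m->strm[pos] = s;
    m->count++;
    return 0;
}

/* Returns the stream registered under <id>, or NULL if there is none. */
struct strm *sid_lookup(const struct sid_map *m, uint32_t id)
{
    size_t pos = sid_lower_bound(m, id);

    return (pos < m->count && m->id[pos] == id) ? m->strm[pos] : NULL;
}
```

  The same stream is registered in the client-side map under its front ID and
  in the server-side map under its back ID, so a frame received on either
  connection leads to the same transaction.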

- frame sizes : frames have to be remapped between both sides as multiplexed
  connections won't always have the same characteristics. Thus some frames
  might be spliced and others will be sliced.
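
  The arithmetic behind this remapping is simple. The sketch below takes the
  payload lengths of frames received on one side and computes the lengths of
  the frames to emit towards a peer accepting at most <max_out> bytes per
  frame. h2_reframe() is a hypothetical helper, and it assumes for simplicity
  that all the input is available at once, which a real implementation could
  not wait for.

```c
#include <assert.h>
#include <stddef.h>

/* Re-frames <n_in> incoming payload lengths into outgoing payloads of at most
 * <max_out> bytes. Up to <n_out> lengths are written to <out>. Returns the
 * number of outgoing frames, or (size_t)-1 if <out> is too small.
 */
size_t h2_reframe(const size_t *in, size_t n_in, size_t max_out,
                  size_t *out, size_t n_out)
{
    size_t total = 0, cnt = 0, i;

    assert(max_out > 0);
    for (i = 0; i < n_in; i++)
        total += in[i];          /* small frames get merged ("spliced") */

    while (total) {
        size_t len = total < max_out ? total : max_out;

        if (cnt == n_out)
            return (size_t)-1;
        out[cnt++] = len;        /* large frames get cut ("sliced") */
        total -= len;
    }
    return cnt;
}
```

  For example three incoming frames of 16384, 16384 and 100 bytes sent to a
  peer accepting 24576 bytes per frame result in two frames of 24576 and 8292
  bytes.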

- error processing : care must be taken to never break a connection unless it
  is dead or corrupt at the protocol level. Stats counters must exist to observe
  the causes. Timeouts are a great problem because silent connections might
  die out of inactivity. Ping frames should probably be scheduled a few seconds
  before the connection timeout so that an unused connection is verified before
  being killed. Abnormal requests must be dealt with using RST_STREAM.
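
  The ping scheduling described above could be decided as in the sketch below.
  The names (h2_idle_check(), enum h2_idle_action) and the use of plain 64-bit
  millisecond dates are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

enum h2_idle_action { H2_IDLE_WAIT, H2_IDLE_PING, H2_IDLE_KILL };

/* Decides what to do with a silent connection. <last> is the date of the last
 * activity, <timeout> the connection timeout and <margin> how long before the
 * timeout a ping must be sent, all in milliseconds. <ping_sent> is non-zero
 * once a ping was emitted since <last>. Any frame received from the peer,
 * including the ping response, is expected to refresh <last>.
 */
enum h2_idle_action h2_idle_check(uint64_t now, uint64_t last,
                                  uint64_t timeout, uint64_t margin,
                                  int ping_sent)
{
    assert(now >= last && margin < timeout);

    if (now - last >= timeout)
        return H2_IDLE_KILL;
    if (!ping_sent && now - last >= timeout - margin)
        return H2_IDLE_PING;
    return H2_IDLE_WAIT;
}
```

  With a 30s timeout and a 3s margin, a connection silent for 27s is pinged
  once, and is only killed if nothing came back by the time the timeout
  strikes.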

- ALPN : ALPN must be observed on the client side, and transmitted to the server
  side.

- proxy protocol : proxy protocol makes little to no sense in a multiplexed
  protocol. A per-stream equivalent will surely be needed if implementations
  do not quickly generalize the use of the Forwarded header.

- simplified protocol for local devices (eg: haproxy->varnish in clear and
  without handshake, and possibly even with splicing if the connection's
  settings are shared)

- logging : logging must report some extra information such as the
  stream ID, and whether the transaction was initiated by the client or by the
  server (which can be deduced from the stream ID's parity). In case of push,
  the number of the associated stream must also be reported.
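
  Deducing the initiator from the parity is a one-liner: in HTTP/2, streams
  opened by the client carry odd IDs and those opened by the server carry
  even IDs. The sketch below builds the extra log fields. The function names
  and the "sid=", "init=" and "assoc=" field names are made up for the
  example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Returns who opened stream <id>: odd IDs come from the client, even ones
 * from the server. ID 0 is the connection itself and is not a stream.
 */
const char *h2_initiator(uint32_t id)
{
    assert(id != 0);
    return (id & 1) ? "client" : "server";
}

/* Formats the stream-related log fields into <buf>. <assoc_id> is the stream
 * a pushed stream is associated with, or 0 when the stream was not pushed.
 * Returns the same value as snprintf().
 */
int h2_log_fields(char *buf, size_t size, uint32_t id, uint32_t assoc_id)
{
    if (assoc_id)
        return snprintf(buf, size, "sid=%u init=%s assoc=%u",
                        (unsigned)id, h2_initiator(id), (unsigned)assoc_id);
    return snprintf(buf, size, "sid=%u init=%s",
                    (unsigned)id, h2_initiator(id));
}
```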

- memory usage : H2 increases memory usage by mandating support for frames of
  at least 16384 bytes. That means slightly more than 16kB of buffer in each
  direction to process any frame. It will definitely have an impact on the
  deployed maxconn setting in places using less than this (4..8kB are common).
  Also, the header list is persistent per connection, so if we reach the same
  size as the request, that's another 16kB in each direction, resulting in
  about 48kB of memory where 8 were previously used. A more careful encoder
  can work with a much smaller set even if that implies evicting entries
  between multiple headers of the same message.

- HTTP/1.0 should very carefully be transported over H2. Since there's no way
  to pass version information in the protocol, the server could use some
  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
  ...).

- host / :authority : ":authority" is the norm, and "host" will be absent when
  H2 clients generate :authority. This probably means that a dummy Host header
  will have to be produced internally from :authority and removed when passing
  to H2 behind. This can cause some trouble when passing H2 requests to H1
  proxies, because there's no way to know if the request should contain scheme
  and authority in H1 or not based on the H2 request. Thus a "proxy" option
  will have to be explicitly mentioned on HTTP/1 server lines. One of the
  problems it creates is that it's no longer possible to pass H/1 requests to
  H/1 proxies without an explicit configuration. Maybe a table of the various
  combinations is needed.

                           :scheme   :authority   host
       HTTP/2 request      present   present      absent
       HTTP/1 server req   absent    absent       present
       HTTP/1 proxy req    present   present      present

  So in the end the issue is only with H/2 requests passed to H/1 proxies.
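
  To make the table concrete, here is a sketch of how the start of an H1
  request could be produced from an H2 request depending on the "proxy" option
  suggested above. struct h2_req and h1_build_start() are hypothetical, and
  <is_proxy> stands for that not-yet-existing server option.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical view of the pseudo-headers of an incoming H2 request. */
struct h2_req {
    const char *method;
    const char *scheme;
    const char *authority;
    const char *path;
};

/* Builds the H1 request line followed by the Host header into <buf>. When
 * <is_proxy> is set, scheme and authority are emitted in the request line
 * (H1 proxy request), otherwise only the path is (H1 server request). Host
 * is produced from :authority in both cases. Returns the same value as
 * snprintf().
 */
int h1_build_start(char *buf, size_t size, const struct h2_req *r,
                   int is_proxy)
{
    assert(r->method && r->scheme && r->authority && r->path);

    if (is_proxy)
        return snprintf(buf, size, "%s %s://%s%s HTTP/1.1\r\nHost: %s\r\n",
                        r->method, r->scheme, r->authority, r->path,
                        r->authority);
    return snprintf(buf, size, "%s %s HTTP/1.1\r\nHost: %s\r\n",
                    r->method, r->path, r->authority);
}
```

  The same H2 request thus gives "GET /a HTTP/1.1" for a server and
  "GET https://example.com/a HTTP/1.1" for a proxy, and nothing in the H2
  request itself tells which one is expected.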

- ping frames : they don't indicate any stream ID so by definition they cannot
  be forwarded to any server. The H2 connection should deal with them only.

There's a layering problem with H2. The framing layer has to be aware of the
upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
it over a framing layer to mux the streams, the frame type must be passed below
so that frames are properly arranged. Header encoding is connection-based and
all streams using the same connection will interact in the way their headers
are encoded. Thus the encoder *has* to be placed in the h2_conn entity, and
this entity has to know for each stream what its headers are.

We should probably remove *all* headers from transported data and move
them on the fly to a parallel structure that can be shared between H1 and H2
and consumed at the appropriate level. That means buffers only transport data.
Trailers have to be dealt with differently.

So if we consider an H1 request being forwarded between a client and a server,
it would look approximately like this :

  - request header + body land into a stream's receive buffer
  - headers are indexed and stripped out so that only the body and whatever
    follows remain in the buffer
  - both the header index and the buffer with the body stay attached to the
    stream
  - the sender can rebuild the whole headers. Since they're found in a table
    supposed to be stable, it can rebuild them as many times as desired and
    will always get the same result, so it's safe to build them into the trash
    buffer for immediate sending, just as we do for the PROXY protocol.
  - the upper protocol should probably provide a build_hdr() callback which
    when called by the socket layer, builds this header block based on the
    current stream's header list, ready to be sent.
  - the socket layer has to know how many bytes from the headers are left to be
    forwarded prior to processing the body.
  - the socket layer needs to consume only the acceptable part of the body and
    must not release the buffer if any data remains in it (eg: pipelining over
    H1). This is already handled by channel->o and channel->to_forward.
  - we could possibly have another optional callback to send a preamble before
    data, that could be used to send chunk sizes in H1. The danger is that it
    absolutely needs to be stable if it has to be retried. But it could
    considerably simplify de-chunking.
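
The stable rebuild described in the steps above can be sketched as follows:
the header block is rebuilt from the header list on every attempt, and the
number of bytes already sent is the only state kept between attempts.
build_hdr() follows the name suggested above, but its signature, struct hdr
and send_hdr() are assumptions, the start line is left out for brevity, and a
bounded copy into <out> stands in for a partial send().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct hdr { const char *name, *value; };

/* Rebuilds the H1 header block from the list <h> into <trash>. The result
 * only depends on the list, so it is identical on every call. Returns the
 * block's length, or 0 if it does not fit.
 */
size_t build_hdr(char *trash, size_t size, const struct hdr *h, size_t n)
{
    size_t len = 0, i;

    for (i = 0; i < n; i++) {
        int ret = snprintf(trash + len, size - len, "%s: %s\r\n",
                           h[i].name, h[i].value);

        if (ret < 0 || (size_t)ret >= size - len)
            return 0;
        len += ret;
    }
    if (len + 2 >= size)
        return 0;
    memcpy(trash + len, "\r\n", 3);   /* end of headers + trailing zero */
    return len + 2;
}

/* Rebuilds the block, skips the <*sent> bytes already sent, and copies at
 * most <room> of the remaining bytes into <out>. Updates <*sent> and returns
 * the number of header bytes still left to send.
 */
size_t send_hdr(const struct hdr *h, size_t n, size_t *sent,
                char *out, size_t room)
{
    char trash[1024];
    size_t len = build_hdr(trash, sizeof(trash), h, n);
    size_t todo;

    assert(*sent <= len);
    todo = len - *sent;
    if (todo > room)
        todo = room;
    memcpy(out, trash + *sent, todo);
    *sent += todo;
    return len - *sent;
}
```

Calling send_hdr() until it returns 0 emits exactly the bytes a single
build_hdr() call produces, whatever the size of each partial send.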

When the request is sent to an H2 server, an H2 stream request must be made
to the server: we look for an existing connection whose settings are
compatible with our needs (eg: tls/clear, push/no-push), and with a spare
stream ID. If none is found, a new connection must be established, unless
maxconn is reached.

Servers must have a maxstream setting just like they have a maxconn. The same
queue may be used for that.
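
A possible shape for this lookup is sketched below. struct h2_srv_conn, the
flag names and both functions are hypothetical. The sketch also assumes that
"compatible" means an exact match of the flags, and applies the stream limit
per connection, whereas the maxstream setting above is per server.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define H2_F_TLS  0x1   /* connection runs over TLS */
#define H2_F_PUSH 0x2   /* connection accepts pushed streams */

struct h2_srv_conn {
    unsigned flags;        /* H2_F_* */
    uint32_t next_id;      /* next stream ID we may open (odd) */
    unsigned streams;      /* streams currently in use */
    unsigned max_streams;  /* limit for this connection */
};

/* Returns the first connection matching <need> which still has a free stream
 * slot and a spare stream ID, or NULL. On NULL the caller is expected to open
 * a new connection unless maxconn is reached, or to queue the request.
 */
struct h2_srv_conn *h2_pick_conn(struct h2_srv_conn *c, size_t n,
                                 unsigned need)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (c[i].flags != need)
            continue;
        if (c[i].streams >= c[i].max_streams)
            continue;
        if (c[i].next_id > 0x7fffffff)   /* stream IDs are 31 bits wide */
            continue;
        return &c[i];
    }
    return NULL;
}

/* Reserves a stream on <c> and returns its ID. */
uint32_t h2_reserve_stream(struct h2_srv_conn *c)
{
    uint32_t id = c->next_id;

    assert(c->streams < c->max_streams && id <= 0x7fffffff);
    c->next_id += 2;   /* IDs opened by one side keep the same parity */
    c->streams++;
    return id;
}
```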

The "tcp-request content" ruleset must apply to the TCP layer. But with HTTP/2
that becomes impossible (and useless). We still need something like the
"tcp-request session" hook to apply just after the SSL handshake is done.

It is impossible to defragment the body on the fly in HTTP/2. Since multiple
messages are interleaved, we cannot wait for all of them and block the head of
line. Thus if body analysis is required, it will have to use the stream's
buffer, which necessarily implies a copy. That means that with each H2 end we
necessarily have at least one copy. Sometimes we might be able to "splice" some
bytes from one side to the other without copying into the stream buffer (same
rules as for TCP splicing).

In theory, only data should flow through the channel buffer, so each side's
connector is responsible for encoding data (H1: linear/chunks, H2: frames).
Maybe the same mechanism could be extrapolated to tunnels / TCP.

Since we'd use buffers only for data (and for receipt of headers), we need to
have dynamic buffer allocation.

Thus :
- Tx buffers do not exist. We allocate a buffer on the fly when we're ready to
  send something that we need to build and that needs to be persistent in case
  of partial send. H1 headers are built on the fly from the header table to a
  temporary buffer that is immediately sent and whose amount of sent bytes is
  the only information kept (like for PROXY protocol). H2 headers are more
  complex since the encoding depends on what was successfully sent. Thus we
  need to build them and put them into a temporary buffer that remains
  persistent in case send() fails. It is possible to have a limited pool of
  Tx buffers and refrain from sending if there is no more buffer available in
  the pool. In that case we need a wake-up mechanism once a buffer is
  available. Once the data are sent, the Tx buffer is then immediately recycled
  in its pool. Note that no tx buffer being used (eg: for hdr or control) means
  that we have to be able to serialize access to the connection and retry with
  the same stream. It also means that a stream that times out while waiting for
  the connector to read the second half of its request has to stay there, or at
  least needs to be handled gracefully. However if the connector cannot read
  the data to be sent, it means that the buffer is congested and the connection
  is dead, so that probably means it can be killed.

- Rx buffers have to be pre-allocated just before calling recv(). A connection
  will first try to pick a buffer and disable reception if it fails, then
  subscribe to the list of tasks waiting for an Rx buffer.

- full Rx buffers might sometimes be moved around to the next buffer instead of
  experiencing a copy. That means that channels and connectors must use the
  same format of buffer, and that only the channel will have to see its
  pointers adjusted.

- Tx of data should be made as much as possible without copying. That possibly
  means by directly looking into the connection buffer on the other side if
  the local Tx buffer does not exist and the stream buffer is not allocated, or
  even performing a splice() call between the two sides. One of the problems in
  doing this is that it requires proper ordering of the operations (eg: when
  multiple readers are attached to the same buffer). If the splitting occurs
  upon receipt, there's no problem. If we expect to retrieve data directly from
  the original buffer, it's harder since it contains various things in an order
  which does not even indicate what belongs to whom. Thus possibly the only
  mechanism to implement is the buffer permutation which guarantees zero-copy
  and only in the 100% safe case. Also it's atomic and does not cause HOL
  blocking.
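
The allocation rules and the buffer permutation above can be sketched with a
small pool. All names (struct buf, struct buf_pool, buf_get(), buf_put(),
buf_swap()) are hypothetical, and a simple counter stands in for the list of
tasks waiting for a buffer.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 2

struct buf { char *area; size_t len; };

struct buf_pool {
    char    *free[POOL_SIZE];  /* available areas */
    size_t   nfree;
    unsigned waiters;          /* stand-in for the list of waiting tasks */
};

/* Picks an area for <b>, to be called just before recv(). Returns 1 on
 * success. Returns 0 if the pool is empty: the caller must then disable
 * reception, and is counted among the waiters.
 */
int buf_get(struct buf_pool *p, struct buf *b)
{
    if (!p->nfree) {
        p->waiters++;
        return 0;
    }
    b->area = p->free[--p->nfree];
    b->len = 0;
    return 1;
}

/* Gives the area of <b> back to the pool once its data are sent. Returns 1
 * if a waiter has to be woken up, otherwise 0.
 */
int buf_put(struct buf_pool *p, struct buf *b)
{
    assert(b->area && p->nfree < POOL_SIZE);
    p->free[p->nfree++] = b->area;
    b->area = NULL;
    b->len = 0;
    if (p->waiters) {
        p->waiters--;
        return 1;
    }
    return 0;
}

/* Buffer permutation: hands a full Rx buffer over to the channel without any
 * copy. Both sides must use the same buffer format for this to work.
 */
void buf_swap(struct buf *rx, struct buf *chn)
{
    struct buf tmp = *rx;

    *rx = *chn;
    *chn = tmp;
}
```

The swap only exchanges two descriptors, which is why it is both atomic and
copy-free. It is only valid when the destination holds no data.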

It makes sense to choose the frontend_accept() function right after the
handshake has ended. It is then possible to check the ALPN, the SNI, the
ciphers and to accept to switch to the h2_conn_accept handler only if
everything is OK.
The h2_conn_accept handler will have to deal with the connection setup,
initialization of the header table, exchange of the settings frames and
preparing whatever is needed to fire new streams upon receipt of unknown
stream IDs. Note: most of the time it will not be possible to splice() because
we need to know in advance the amount of bytes to write the header, and here it
will not be possible.
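
The post-handshake decision could be reduced to something like the sketch
below. enum fe_mode and fe_select_mode() are hypothetical, <h2_allowed> stands
for the result of the other checks (SNI, ciphers), and the "h2" ALPN token is
the one registered by the final HTTP/2 specification, later than these notes.
Rejecting a connection which negotiated h2 but fails the other checks is an
assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum fe_mode { FE_MODE_REJECT, FE_MODE_H1, FE_MODE_H2 };

/* Chooses the handler once the handshake is complete. <alpn> is the protocol
 * name negotiated via ALPN (not zero-terminated), with <len> set to 0 when
 * none was negotiated.
 */
enum fe_mode fe_select_mode(const char *alpn, size_t len, int h2_allowed)
{
    assert(alpn || !len);

    if (len == 2 && memcmp(alpn, "h2", 2) == 0)
        return h2_allowed ? FE_MODE_H2 : FE_MODE_REJECT;
    return FE_MODE_H1;
}
```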

H2 health checks must be seen as regular transactions/streams. The check runs a
normal client which seeks an available stream from a server. The server then
finds one on an existing connection or initiates a new H2 connection. The H2
checks will have to be configurable for sharing streams or not. Another option
could be to specify how many requests can be made over existing connections
before insisting on getting a separate connection. Note that such separate
connections might end up stacking up once released. So they probably need to
be recycled very quickly (eg: fix how many unused ones can exist max).