2014/10/23 - design thoughts for HTTP/2

- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because a
  connection holds a compression context (headers table, etc...). We probably
  need to have an h2_conn struct.

- multiple transactions will be handled in parallel for a given h2_conn. They
  are called streams in HTTP/2 terminology.

- multiplexing : for a given client-side h2 connection, we can have multiple
  server-side h2 connections. And for a server-side h2 connection, we can have
  multiple client-side h2 connections. Streams circulate in N-to-N fashion.

- flow control : flow control will be applied between multiple streams. Special
  care must be taken so that an H2 client cannot block some H2 servers by
  sending requests spread over multiple servers to the point where one server
  response is blocked and prevents other responses from the same server from
  reaching their clients. H2 connection buffers must always be empty or nearly
  empty. The per-stream flow control needs to be respected as well as the
  connection's buffers. It is important to implement some fairness between all
  the streams so that it's not always the same one which gets the bandwidth
  when the connection is congested.

- some clients can be H1 with an H2 server (is this really needed ?). Most of
  the initial use cases will be H2 clients to H1 servers. It is important to
  keep in mind that H1 servers do not do flow control and that we don't want
  them to block transfers (eg: post upload).

- internal tasks : some H2 clients will be internal tasks (eg: health checks).
  Some H2 servers will be internal tasks (eg: stats, cache). The model must be
  compatible with this use case.

- header indexing : headers are transported compressed, with a reference to a
  static or a dynamic header, or a literal, possibly huffman-encoded. Indexing
  is specific to the H2 connection.
  This means there is no way any binary data can flow between both sides:
  headers will have to be decoded according to the incoming connection's
  context and re-encoded according to the outgoing connection's context, which
  can significantly differ. In order to avoid the parsing trouble we currently
  face, headers will have to be clearly split between name and value. It is
  worth noting that neither the incoming nor the outgoing connections' contexts
  will be of any use while processing the headers. At best we can have some
  shortcuts for well-known names that map well to the static ones (eg: use the
  first static entry with same name), and maybe have a few special cases for
  static name+value as well. Probably we can classify headers in such
  categories :

    - static name + value
    - static name + other value
    - dynamic name + other value

  This will allow for better processing in some specific cases. Headers
  supporting a single value (:method, :status, :path, ...) should probably be
  stored in a single location with a direct access. That would allow us to
  retrieve a method using hdr[METHOD]. All such indexing must be performed
  while parsing. That also means that HTTP/1 will have to be converted to this
  representation very early in the parser and possibly converted back to H/1
  after processing.

  Header names/values will have to be placed in a small memory area that will
  inevitably get fragmented as headers are rewritten. An automatic packing
  mechanism must be implemented so that when there's no more room, headers are
  simply defragmented/packed to a new table and the old one is released. Just
  like for the static chunks, we need to have a few such tables pre-allocated
  and ready to be swapped at any moment. Repacking must not change any index
  nor affect the way headers are compressed so that it can happen late after a
  retry (send-name-header for example).

- header processing : can still happen on a (header, value) basis.
  Reqrep/rsprep completely disappear and will have to be replaced with
  something else to support renaming headers and rewriting url/path/...

- push_promise : servers can push dummy requests+responses. They advertise the
  stream ID in the push_promise frame indicating the associated stream ID. This
  means that it is possible to initiate a client-server stream from the
  information coming from the server and make the data flow as if the client
  had made it. It's likely that we'll have to support two types of server
  connections: those which support push and those which do not. That way client
  streams will be distributed to existing server connections based on their
  capabilities. It's important to keep in mind that PUSH will not be rewritten
  in responses.

- stream ID mapping : since the stream ID is per H2 connection, stream IDs will
  have to be mapped. Thus a given stream is an entity with two IDs (one per
  side). Or more precisely a stream has two end points, each one carrying an ID
  when it ends on an HTTP2 connection. Also, for each stream ID we need to
  quickly find the associated transaction in progress. Using a small quick
  unique tree seems indicated considering the wide range of valid values.

- frame sizes : frames have to be remapped between both sides as multiplexed
  connections won't always have the same characteristics. Thus some frames
  might be spliced and others will be sliced.

- error processing : care must be taken to never break a connection unless it
  is dead or corrupt at the protocol level. Stats counters must exist to
  observe the causes. Timeouts are a great problem because silent connections
  might die out of inactivity. Ping frames should probably be scheduled a few
  seconds before the connection timeout so that an unused connection is
  verified before being killed. Abnormal requests must be dealt with using
  RST_STREAM.

- ALPN : ALPN must be observed on the client side, and transmitted to the
  server side.
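
  As an illustration of the stream ID mapping item above, here is a minimal
  sketch in C. Every name in it (h2_endpoint, h2_stream_init, h2_conn_attach,
  h2_conn_lookup, h2_peer_id) is hypothetical and not existing code. A plain
  unbalanced binary tree stands in for the "small quick unique tree", and ID 0
  is assumed to mean "this end is not on an H2 connection", which is safe
  because stream 0 is reserved for connection control.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct stream;

/* One end point of a stream. The ID only has a meaning when conn != NULL. */
struct h2_endpoint {
    struct h2_conn *conn;             /* H2 connection this end lives on */
    uint32_t id;                      /* stream ID on that connection */
    struct h2_endpoint *left, *right; /* links in the connection's ID tree */
    struct stream *strm;              /* stream owning this end point */
};

/* A stream has two end points: ep[0] faces the client, ep[1] the server. */
struct stream {
    struct h2_endpoint ep[2];
};

/* Hypothetical per-connection state: only the ID lookup tree is shown. */
struct h2_conn {
    struct h2_endpoint *by_id;        /* root of the unique-key tree */
};

/* Mark both ends as not attached to any H2 connection (e.g. H1 sides). */
static void h2_stream_init(struct stream *s)
{
    int side;

    for (side = 0; side < 2; side++) {
        s->ep[side].conn = NULL;
        s->ep[side].id = 0;
        s->ep[side].left = s->ep[side].right = NULL;
        s->ep[side].strm = s;
    }
}

/* Attach one side of <s> to <conn> under <id>.
 * Returns 0 on success, -1 if <id> is already in use on this connection.
 */
static int h2_conn_attach(struct h2_conn *conn, struct stream *s,
                          int side, uint32_t id)
{
    struct h2_endpoint **link = &conn->by_id;
    struct h2_endpoint *ep;

    assert(side == 0 || side == 1);
    while (*link) {
        if (id == (*link)->id)
            return -1;                /* keys are unique per connection */
        link = (id < (*link)->id) ? &(*link)->left : &(*link)->right;
    }
    ep = &s->ep[side];
    ep->conn = conn;
    ep->id = id;
    ep->left = ep->right = NULL;
    ep->strm = s;
    *link = ep;
    return 0;
}

/* Find the stream in progress for <id> on <conn>, or NULL if unknown. */
static struct stream *h2_conn_lookup(const struct h2_conn *conn, uint32_t id)
{
    const struct h2_endpoint *ep = conn->by_id;

    while (ep) {
        if (id == ep->id)
            return ep->strm;
        ep = (id < ep->id) ? ep->left : ep->right;
    }
    return NULL;
}

/* Return the ID used on the opposite side of <side>, or 0 if that side
 * does not end on an H2 connection.
 */
static uint32_t h2_peer_id(const struct stream *s, int side)
{
    const struct h2_endpoint *peer = &s->ep[!side];

    return peer->conn ? peer->id : 0;
}
```

  With this layout, a frame received for ID 5 on the client connection is
  resolved with h2_conn_lookup(), and h2_peer_id() gives the ID to use when
  forwarding it to the other side. A real implementation would want a balanced
  or radix tree, and removal of end points when a stream closes.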
- proxy protocol : proxy protocol makes little to no sense in a multiplexed
  protocol. A per-stream equivalent will surely be needed if implementations do
  not quickly generalize the use of the Forwarded header.

- simplified protocol for local devices (eg: haproxy->varnish in clear and
  without handshake, and possibly even with splicing if the connection's
  settings are shared)

- logging : logging must report some extra information such as the stream ID,
  and whether the transaction was initiated by the client or by the server
  (which can be deduced from the stream ID's parity). In case of push, the
  number of the associated stream must also be reported.

- memory usage : H2 increases memory usage by mandating support for a frame
  size of at least 16384 bytes. That means slightly more than 16kB of buffer in
  each direction to process any frame. It will definitely have an impact on the
  deployed maxconn setting in places using less than this (4..8kB are common).
  Also, the header list is persistent per connection, so if we reach the same
  size as the request, that's another 16kB in each direction, resulting in
  about 48kB of memory where 8 were previously used. A more careful encoder can
  work with a much smaller set even if that implies evicting entries between
  multiple headers of the same message.

- HTTP/1.0 should very carefully be transported over H2. Since there's no way
  to pass version information in the protocol, the server could use some
  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
  ...).

- host / :authority : ":authority" is the norm, and "host" will be absent when
  H2 clients generate :authority. This probably means that a dummy Host header
  will have to be produced internally from :authority and removed when passing
  to H2 behind. This can cause some trouble when passing H2 requests to H1
  proxies, because there's no way to know if the request should contain scheme
  and authority in H1 or not based on the H2 request.
  Thus a "proxy" option will have to be explicitly mentioned on HTTP/1 server
  lines. One of the problems that it creates is that it's no longer possible to
  pass H/1 requests to H/1 proxies without an explicit configuration. Maybe a
  table of the various combinations is needed.

                          :scheme    :authority    host
      HTTP/2 request      present    present       absent
      HTTP/1 server req   absent     absent        present
      HTTP/1 proxy req    present    present       present

  So in the end the issue is only with H/2 requests passed to H/1 proxies.

- ping frames : they don't indicate any stream ID so by definition they cannot
  be forwarded to any server. The H2 connection should deal with them only.

There's a layering problem with H2. The framing layer has to be aware of the
upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass it
over a framing layer to mux the streams, the frame type must be passed below so
that frames are properly arranged. Header encoding is connection-based and all
streams using the same connection will interact in the way their headers are
encoded. Thus the encoder *has* to be placed in the h2_conn entity, and this
entity has to know for each stream what its headers are.

We should probably remove *all* headers from transported data and move them on
the fly to a parallel structure that can be shared between H1 and H2 and
consumed at the appropriate level. That means buffers only transport data.
Trailers have to be dealt with differently.

So if we consider an H1 request being forwarded between a client and a server,
it would look approximately like this :

  - request header + body land into a stream's receive buffer
  - headers are indexed and stripped out so that only the body and whatever
    follows remain in the buffer
  - both the header index and the buffer with the body stay attached to the
    stream
  - the sender can rebuild the whole headers.
    Since they're found in a table supposed to be stable, it can rebuild them
    as many times as desired and will always get the same result, so it's safe
    to build them into the trash buffer for immediate sending, just as we do
    for the PROXY protocol.
  - the upper protocol should probably provide a build_hdr() callback which,
    when called by the socket layer, builds this header block based on the
    current stream's header list, ready to be sent.
  - the socket layer has to know how many bytes from the headers are left to be
    forwarded prior to processing the body.
  - the socket layer needs to consume only the acceptable part of the body and
    must not release the buffer if any data remains in it (eg: pipelining over
    H1). This is already handled by channel->o and channel->to_forward.
  - we could possibly have another optional callback to send a preamble before
    data, that could be used to send chunk sizes in H1. The danger is that it
    absolutely needs to be stable if it has to be retried. But it could
    considerably simplify de-chunking.

When the request is sent to an H2 server, an H2 stream request must be made to
the server: we find an existing connection whose settings are compatible with
our needs (eg: tls/clear, push/no-push), and with a spare stream ID. If none is
found, a new connection must be established, unless maxconn is reached.

Servers must have a maxstream setting just like they have a maxconn. The same
queue may be used for that.

The "tcp-request content" ruleset must apply to the TCP layer. But with HTTP/2
that becomes impossible (and useless). We still need something like the
"tcp-request session" hook to apply just after the SSL handshake is done.

It is impossible to defragment the body on the fly in HTTP/2. Since multiple
messages are interleaved, we cannot wait for all of them and block the head of
line. Thus if body analysis is required, it will have to use the stream's
buffer, which necessarily implies a copy.
That means that with each H2 end we necessarily have at least one copy.
Sometimes we might be able to "splice" some bytes from one side to the other
without copying into the stream buffer (same rules as for TCP splicing).

In theory, only data should flow through the channel buffer, so each side's
connector is responsible for encoding data (H1: linear/chunks, H2: frames).
Maybe the same mechanism could be extrapolated to tunnels / TCP.

Since we'd use buffers only for data (and for receipt of headers), we need to
have dynamic buffer allocation.

Thus :

- Tx buffers do not exist. We allocate a buffer on the fly when we're ready to
  send something that we need to build and that needs to be persistent in case
  of partial send. H1 headers are built on the fly from the header table to a
  temporary buffer that is immediately sent and whose amount of sent bytes is
  the only information kept (like for PROXY protocol). H2 headers are more
  complex since the encoding depends on what was successfully sent. Thus we
  need to build them and put them into a temporary buffer that remains
  persistent in case send() fails. It is possible to have a limited pool of Tx
  buffers and refrain from sending if there is no more buffer available in the
  pool. In that case we need a wake-up mechanism once a buffer is available.
  Once the data are sent, the Tx buffer is then immediately recycled in its
  pool. Note that no Tx buffer being used (eg: for hdr or control) means that
  we have to be able to serialize access to the connection and retry with the
  same stream. It also means that a stream that times out while waiting for the
  connector to read the second half of its request has to stay there, or at
  least needs to be handled gracefully. However if the connector cannot read
  the data to be sent, it means that the buffer is congested and the connection
  is dead, so that probably means it can be killed.

- Rx buffers have to be pre-allocated just before calling recv().
  A connection will first try to pick a buffer and disable reception if it
  fails, then subscribe to the list of tasks waiting for an Rx buffer.

- full Rx buffers might sometimes be moved around to the next buffer instead of
  experiencing a copy. That means that channels and connectors must use the
  same format of buffer, and that only the channel will have to see its
  pointers adjusted.

- Tx of data should be made as much as possible without copying. That possibly
  means by directly looking into the connection buffer on the other side if the
  local Tx buffer does not exist and the stream buffer is not allocated, or
  even performing a splice() call between the two sides. One of the problems in
  doing this is that it requires proper ordering of the operations (eg: when
  multiple readers are attached to a same buffer). If the splitting occurs upon
  receipt, there's no problem. If we expect to retrieve data directly from the
  original buffer, it's harder since it contains various things in an order
  which does not even indicate what belongs to whom. Thus possibly the only
  mechanism to implement is the buffer permutation which guarantees zero-copy
  and only in the 100% safe case. Also it's atomic and does not cause HOL
  blocking.

It makes sense to choose the frontend_accept() function right after the
handshake has ended. It is then possible to check the ALPN, the SNI, the
ciphers and to switch to the h2_conn_accept handler only if everything is OK.
The h2_conn_accept handler will have to deal with the connection setup,
initialization of the header table, exchange of the settings frames and
preparing whatever is needed to fire new streams upon receipt of unknown stream
IDs. Note: most of the time it will not be possible to splice() because we need
to know in advance the amount of bytes to write the header, and here it will
not be possible.

H2 health checks must be seen as regular transactions/streams. The check runs a
normal client which seeks an available stream from a server.
The server then finds one on an existing connection or initiates a new H2
connection. The H2 checks will have to be configurable for sharing streams or
not. Another option could be to specify how many requests can be made over
existing connections before insisting on getting a separate connection. Note
that such separate connections might end up stacking up once released. So they
probably need to be recycled very quickly (eg: fix how many unused ones can
exist max).
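
The way a stream (regular or health check) gets a server connection, as
described above, can be sketched as follows: reuse an existing connection whose
settings match and which still has a spare stream, otherwise open a new one
unless maxconn is reached, otherwise wait in the queue. This is only a sketch
under assumptions: all names (h2_srv_conn, h2_server, h2_srv_pick, the H2C_F_*
flags) are hypothetical, "compatible settings" is reduced to an exact match on
two capability bits, and the <share> and <max_reuse> arguments model the two
check options discussed above.

```c
#include <assert.h>
#include <stddef.h>

#define H2C_F_TLS   0x1   /* connection runs over TLS (else clear) */
#define H2C_F_PUSH  0x2   /* connection accepts server push */

/* Hypothetical state of one established connection to a server. */
struct h2_srv_conn {
    unsigned int flags;       /* H2C_F_* capabilities */
    unsigned int streams;     /* streams currently open on it */
    unsigned int reqs_total;  /* requests it has carried so far */
};

/* Hypothetical server: a connection array plus its two limits. */
struct h2_server {
    struct h2_srv_conn *conns;
    unsigned int nbconn;      /* established connections */
    unsigned int maxconn;     /* max connections to this server */
    unsigned int maxstream;   /* max concurrent streams per connection */
};

enum h2_pick {
    H2_PICK_REUSE,            /* *out points to a usable connection */
    H2_PICK_NEW,              /* caller must establish a new connection */
    H2_PICK_QUEUE,            /* limits reached, caller must wait */
};

/* Decide where a new stream needing <need_flags> should go.
 * <share>     : 0 forces a dedicated connection (non-shared checks).
 * <max_reuse> : skip connections which already carried that many
 *               requests; 0 means no limit.
 */
static enum h2_pick h2_srv_pick(const struct h2_server *srv,
                                unsigned int need_flags, int share,
                                unsigned int max_reuse,
                                const struct h2_srv_conn **out)
{
    unsigned int i;

    assert(srv != NULL && out != NULL);
    *out = NULL;

    for (i = 0; share && i < srv->nbconn; i++) {
        const struct h2_srv_conn *c = &srv->conns[i];

        if ((c->flags & (H2C_F_TLS | H2C_F_PUSH)) != need_flags)
            continue;         /* settings not compatible */
        if (c->streams >= srv->maxstream)
            continue;         /* no spare stream */
        if (max_reuse && c->reqs_total >= max_reuse)
            continue;         /* reused often enough */
        *out = c;
        return H2_PICK_REUSE;
    }
    return (srv->nbconn < srv->maxconn) ? H2_PICK_NEW : H2_PICK_QUEUE;
}
```

A check configured not to share streams would call this with <share> set to 0,
so it always gets H2_PICK_NEW or H2_PICK_QUEUE. The sketch takes the first
match and does not address recycling of the unused connections left behind.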