This state was only a delimiter between headers and body but it now
causes more harm than good because it requires someone to change it.
Since the H1 parser knows if we're in DATA or CHUNK_SIZE, simply let
it set the right next state so that h1m->state constantly matches
what is expected afterwards.
This will allow the parser to fill some extra fields like the method or
status without having to store them permanently in the HTTP message. At
this point however the parser cannot restart from an interrupted read.
This way we maintain the old mechanism stating that -2 means we block
on errors, -1 means we only capture them, and a positive value indicates
the position of the first error.
Currently the only user of struct h1m is the h2 mux when it has to parse
an H1 message coming from the channel. Unfortunately this is not enough
to efficiently parse HTTP/1 messages like those coming from the network
as we don't want to restart from scratch at every byte received.
This patch reintroduces the "next" offset into the H1 message so that any
H1 parser can use it to restart when called with a state that is not the
initial state.
This is the *parsing* state of an HTTP/1 message. Currently the h1_state
is composite as it's made both of parsing and control (100SENT, BODY,
DONE, TUNNEL, ENDING etc). The purpose here is to have a purely H1 state
that can be used by H1 parsers. For now it's equivalent to h1_state.
It's a bit painful to have to deal with HTTP semantics for each protocol
version (H1 and H2), and working on the version-agnostic code further
emphasizes the problem.
This patch creates http.h and http.c which are agnostic to the version
in use, and which borrow a few parts from proto_http and from h1. For
example the once thought h1-specific h1_char_classes array is in fact
dictated by RFC7231 and is used to parse HTTP headers. A few changes
were made to a few files which were including proto_http.h while they
only needed http.h.
Certain string definitions pre-dated the introduction of indirect
strings (ist) so some were used to simplify the definition of the known
HTTP methods. The current lookup code saves 2 kB of a heavily used table
and is faster than the previous table based lookup (typ. 14 ns vs 16
before).
The HTTP/1 code always has the reserve left available so the buffer is
never full there. But with HTTP/2 we have to deal with full buffers,
and it happens that the chunk size parser cannot tell the difference
between a full buffer and an empty one since it compares the start and
the stop pointer.
Let's change this to instead deal with the number of bytes left to process.
As a side effect, this code ends up being about 10% faster than the previous
one, even on HTTP/1.
The H1 parser used by the H2 gateway was a bit lax and could validate
non-numbers in the status code. Since it computes the code on the fly
it's problematic, as "30:" is read as status code 310. Let's properly
check that it's a number now. No backport needed.
This is needed in the H2->H1 gateway so that we know how long the trailers
block is in chunked encoding. It returns the number of bytes, or 0 if some
are missing, or -1 in case of parse error.
It was painful not to have the status code available, especially when
it was computed. Let's store it and ensure we don't claim content-length
anymore on 1xx, only 0 body bytes.
The HTTP/2->HTTP/1 gateway will need to process HTTP/1 responses. We
cannot sanely rely on the HTTP/1 txn to parse a response because :
1) responses generated by haproxy such as error messages, redirects,
stats or Lua are neither parsed nor indexed ; this could be
addressed over the long term but will take time.
2) the http txn is useless to parse the body : the states present there
are only meaningful to received bytes (ie next bytes to parse) and
not at all to sent bytes. Thus chunks cannot be followed at all.
Even when implementing this later, it's unsure whether it will be
possible when dealing with compression.
So using the HTTP txn is now out of the equation and the only remaining
solution is to call an HTTP/1 message parser. We already have one, it was
slightly modified to avoid keeping states by benefitting from the fact
that the response was produced by haproxy and this is entirely available.
It assumes the following rules are true, or that incuring an extra cost
to work around them is acceptable :
- the response buffer is read-write and supports modifications in place
- headers sent through / by haproxy are not folded. Folding is still
implemented by replacing CR/LF/tabs/spaces with spaces if encountered
- HTTP/0.9 responses are never sent by haproxy and have never been
supported at all
- haproxy will not send partial responses, the whole headers block will
be sent at once ; this means that we don't need to keep expensive
states and can afford to restart the parsing from the beginning when
facing a partial response ;
- response is contiguous (does not wrap). This was already the case
with the original parser and ensures we can safely dereference all
fields with (ptr,len)
The parser replaces all of the http_msg fields that were necessary with
local variables. The parser is not called on an http_msg but on a string
with a start and an end. The HTTP/1 states were reused for ease of use,
though the request-specific ones have not been implemented for now. The
error position and error state are supported and optional ; these ones
may be used later for bug hunting.
The parser issues the list of all the headers into a caller-allocated
array of struct ist.
The content-length/transfer-encoding header are checked and the relevant
info fed the h1 message state (flags + body_len).
The chunk crlf parser used to depend on the channel and on the HTTP
message, eventhough it's not really needed. Let's remove this dependency
so that it can be used within the H2 to H1 gateway.
As part of this small API change, it was renamed to h1_skip_chunk_crlf()
to mention that it doesn't depend on http_msg anymore.
The chunk parser used to depend on the channel and on the HTTP message
but it's not really needed as they're only used to retrieve the buffer
as well as to return the number of bytes parsed and the chunk size.
Here instead we pass the (few) relevant information in arguments so that
the function may be reused without a channel nor an HTTP message (ie
from the H2 to H1 gateway).
As part of this API change, it was renamed to h1_parse_chunk_size() to
mention that it doesn't depend on http_msg anymore.
Functions http_parse_chunk_size(), http_skip_chunk_crlf() and
http_forward_trailers() were moved to h1.h and h1.c respectively so
that they can be called from outside. The parts that were inline
remained inline as it's critical for performance (+41% perf
difference reported in an earlier test). For now the "http_" prefix
remains in their name since they still depend on the http_msg type.
Certain types and enums are very specific to the HTTP/1 parser, and we'll
need to share them with the HTTP/2 to HTTP/1 translation code. Let's move
them to h1.c/h1.h. Those with very few occurrences or only used locally
were renamed to explicitly mention the relevant HTTP version :
enum ht_state -> h1_state.
http_msg_state_str -> h1_msg_state_str
HTTP_FLG_* -> H1_FLG_*
http_char_classes -> h1_char_classes
Others like HTTP_IS_*, HTTP_MSG_* are left to be done later.