Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
For large farms where servers are regularly added or removed, picking
a random server from the pool can ensure faster load transitions than
when using round-robin and less traffic surges on the newly added
servers than when using leastconn.
This commit introduces "balance random". It internally uses a random as
the key to the consistent hashing mechanism, thus all features available
in consistent hashing such as weights and bounded load via hash-balance-
factor are usable. It is extremely convenient because one common concern
when using random is what happens when a server is hammered a bit too
much. Here that can trivially be avoided, like in the configuration below :
backend bk0
balance random
hash-balance-factor 110
server-template s 1-100 127.0.0.1:8000 check inter 1s
Note that while "balance random" internally relies on a hash algorithm,
it holds the same properties as round-robin and as such is compatible with
reusing an existing server connection with "option prefer-last-server".
Commit 522eea7 ("MINOR: ssl: Handle sending early data to server.") added
a dependency on SRV_SSL_O_EARLY_DATA which only exists when USE_OPENSSL
is defined (which is probably not the best solution) and breaks the build
when ssl is not enabled. Just add an ifdef USE_OPENSSL around the block
for now.
This adds a new keyword on the "server" line, "allow-0rtt", if set, we'll try
to send early data to the server, as long as the client sent early data, as
in case the server rejects the early data, we no longer have them, and can't
resend them, so the only option we have is to send back a 425, and we need
to be sure the client knows how to interpret it correctly.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
A lock for LB parameters has been added inside the proxy structure and atomic
operations have been used to update server variables releated to lb.
The only significant change is about lb_map. Because the servers status are
updated in the sync-point, we can call recalc_server_map function synchronously
in map_set_server_status_up/down function.
For now, we have a list of each type per thread. So there is no need to lock
them. This is the easiest solution for now, but not the best one because there
is no sharing between threads. An idle connection on a thread will not be able
be used by a stream on another thread. So it could be a good idea to rework this
patch later.
Now, each proxy contains a lock that must be used when necessary to protect
it. Moreover, all proxy's counters are now updated using atomic operations.
srv_queue([<backend>/]<server>) : integer
Returns an integer value corresponding to the number of connections currently
pending in the designated server's queue. If <backend> is omitted, then the
server is looked up in the current backend. It can sometimes be used together
with the "use-server" directive to force to use a known faster server when it
is not much loaded. See also the "srv_conn", "avg_queue" and "queue" sample
fetch methods.
The server state and weight was reworked to handle
"pending" values updated by checks/CLI/LUA/agent.
These values are commited to be propagated to the
LB stack.
In further dev related to multi-thread, the commit
will be handled into a sync point.
Pending values are named using the prefix 'next_'
Current values used by the LB stack are named 'cur_'
2 places were using an open-coded implementation of this function to count
available servers. Note that the avg_queue_size() fetch didn't check that
the proxy was in STOPPED state so it would possibly return a wrong server
count here but that wouldn't impact the returned value.
Signed-off-by: Nenad Merdanovic <nmerdan@haproxy.com>
This is like the nbsrv() sample fetch function except that it works as
a converter so it can count the number of available servers of a backend
name retrieved using a sample fetch or an environment variable.
Signed-off-by: Nenad Merdanovic <nmerdan@haproxy.com>
Keeping the address and the port in the same field causes a lot of problems,
specifically on the DNS part where we're forced to cheat on the family to be
able to keep the port. This causes some issues such as some families not being
resolvable anymore.
This patch first moves the service port to a new field "svc_port" so that the
port field is never used anymore in the "addr" field (struct sockaddr_storage).
All call places were adapted (there aren't that many).
when using "option prefer-last-server", we may not always stay on
the same backend if option balance told us otherwise.
For example, backend may change in the following cases:
balance hdr()
balance rdp-cookie
balance source
balance uri
balance url_param
[wt: backport this to 1.7 and 1.6]
According to nbsrv() documentation this fetcher should return "an
integer value corresponding to the number of usable servers".
In case backend is disabled none of servers is usable, so I believe
fetcher should return 0.
This patch should be backported to 1.7, 1.6, 1.5.
Now we exclusively use xprt_get(XPRT_RAW) instead of &raw_sock or
xprt_get(XPRT_SSL) for &ssl_sock. This removes a bunch of #ifdef and
include spread over a number of location including backend, cfgparse,
checks, cli, hlua, log, server and session.
These 2 patches add ability to fetch frontend/backend name in your
logic, so they can be used later to make routing decisions (fe_name) or
taking some actions based on backend which responded to request (be_name).
In our case we needed a fetcher to be able to extract information we
needed from frontend name.
When a backend doesn't use any known LB algorithm, backend_lb_algo_str()
returns NULL. It used to cause "nil" to be printed in the stats dump
since version 1.4 but causes 1.7 to try to parse this NULL to encode
it as a CSV string, causing a crash on "show stat" in this case.
The only situation where this can happen is when "transparent" or
"dispatch" are used in a proxy, in which case the LB algorithm is
BE_LB_ALGO_NONE. Thus now we explicitly report "none" when this
situation is detected, and we preventively report "unknown" if any
unknown algorithm is detected, which may happen if such an algo is
added in the future and the function is not updated.
This fix must be backported to 1.7 and may be backported as far as
1.4, though it has less impact there.
We'll have to use srv_set_admin_flag() to propagate some server flags
during the startup, and we don't want the resulting actions to cause
warnings, logs nor e-mail alerts to be generated since we're just applying
the config or a state file. So let's condition these notifications to the
fact that we're starting.
For active servers, this is the sum of the eweights of all active
servers before this one in the backend, and
[srv->cumulative_weight .. srv_cumulative_weight + srv_eweight) is a
space occupied by this server in the range [0 .. lbprm.tot_wact), and
likewise for backup servers with tot_wbck. This allows choosing a
server or a range of servers proportional to their weight, by simple
integer comparison.
Signed-off-by: Andrew Rodland <andrewr@vimeo.com>
The "sni" server directive does some bad stuff on many occasions because
it works on a sample of type string and limits len to size-1 by hand. The
problem is that size used to be zero on many occasions before the recent
changes to smp_dup() and that it effectively results in setting len to -1
and writing the zero byte *before* the string (and not terminating the
string).
This patch makes use of the recently introduced smp_make_safe() to address
this issue.
This fix must be backported to 1.6.
Since commit 6879ad3 ("MEDIUM: sample: fill the struct sample with the
session, proxy and stream pointers") merged in 1.6-dev2, the sample
contains the pointer to the stream and sample fetch functions as well
as converters use it heavily.
The problem is that earlier commit 87b0966 ("REORG/MAJOR: session:
rename the "session" entity to "stream"") had split the session and
stream resulting in the possibility for smp->strm to be NULL before
the stream was initialized. This is what happens in tcp-request
connection rulesets, as discovered by Baptiste.
The sample fetch functions must now check that smp->strm is valid
before using it. An alternative could consist in using a dummy stream
with nothing in it to avoid some checks but it would only result in
deferring them to the next step anyway, and making it harder to detect
that a stream is valid or the dummy one.
There is still an issue with variables which requires a complete
independant fix. They use strm->sess to find the session with strm
possibly NULL and passed as an argument. All call places indirectly
use smp->strm to build strm. So the problem is there but the API needs
to be changed to remove this duplicate argument that makes it much
harder to know what pointer to use.
This fix must be backported to 1.6, as well as the next one fixing
variables.
There is a bug in connect_server() : we use si_attach_conn() to offer
the current session's connection to the session we're stealing the
connection from. Unfortunately, si_attach_conn() uses the standard data
connection operations while here we need to use the idle connection
operations.
This results in a situation where when the server's idle timeout strikes,
the read0 is silently ignored, causes the response channel to be shut down
for reads, and the connection remains attached. Next attempt to send a
request when using this connection simply results in nothing being done
because we try to send over an already closed connection. Worse, if the
client aborts, then no timeout remains at all and the session waits
forever and remains assigned to the server.
A more-or-less easy way to reproduce this bug is to have two concurrent
streams each connecting to a different server with "http-reuse aggressive",
typically a cache farm using a URL hash :
stream1: GET /1 HTTP/1.1
stream2: GET /2 HTTP/1.1
stream1: GET /2 HTTP/1.1
wait for the server 1's connection to timeout
stream2: GET /1 HTTP/1.1
The connection hangs here, and "show sess all" shows a closed connection
with a SHUTR on the response channel.
The fix is very simple though not optimal. It consists in calling
si_idle_conn() again after attaching the connection. But in practise
it should not be done like this. The real issue is that there's no way
to cleanly attach a connection to a stream interface without changing
the connection's operations. So the API clearly needs to be revisited
to make such operations easier.
Many thanks to Yves Lafon from W3C for providing lots of useful dumps
and testing patches to help figure the root cause!
This fix must be backported to 1.6.
This was the first transparent proxy technology supported by haproxy
circa 2005 but it was obsoleted in 2007 by Tproxy 4.0 which removed a
lot of the earlier versions' shortcomings and was finally merged into
the kernel. Since nobody has been using cttproxy for many years now
and nobody has even just tried to compile the files, it's time to
remove it. The doc was updated as well.
The union name "data" is a little bit heavy while we read the source
code because we can read "data.data.sint". The rename from "data" to "u"
makes the read easiest like "data.u.sint".
This patch remove the struct information stored both in the struct
sample_data and in the striuct sample. Now, only thestruct sample_data
contains data, and the struct sample use the struct sample_data for storing
his own data.
This strategy is less extreme than "always", it only dispatches first
requests to validated reused connections, and moves a connection from
the idle list to the safe list once it has seen a second request, thus
proving that it could be reused.
The "safe" mode consists in picking existing connections only when
processing a request that's not the first one from a connection. This
ensures that in case where the server finally times out and closes, the
client can decide to replay idempotent requests.
Now instead of closing the existing connection attached to the
stream interface, we first check if the one we pick was attached to
another stream interface, in which case the connections are swapped
if possible (eg: if the current connection is not private). That way
the previous connection remains attached to an existing session and
significantly increases the chances of being reused.
In connect_server(), if we don't have a connection attached to the
stream-int, we first look into the server's idle_conns list and we
pick the first one there, we detach it from its owner if it had one.
If we used to have a connection, we close it.
This mechanism works well but doesn't scale : as servers increase,
the likeliness that the connection attached to the stream interface
doesn't match the server and gets closed increases.
This flag is set on an outgoing connection when this connection gets
some properties that must not be shared with other connections, such
as dynamic transparent source binding, SNI or a proxy protocol header,
or an authentication challenge from the server. This will be needed
later to implement connection reuse.
Since we now always call this function with the reuse parameter cleared,
let's simplify the function's logic as it cannot return the existing
connection anymore. The savings on this inline function are appreciable
(240 bytes) :
$ size haproxy.old haproxy.new
text data bss dec hex filename
1020383 40816 36928 1098127 10c18f haproxy.old
1020143 40816 36928 1097887 10c09f haproxy.new
connect_server() already does most of the check that is done again in
si_alloc_conn(), so let's simply reuse the existing connection instead
of calling the function again. It will also simplify the connection
reuse.
Indeed, for reuse to be set, it also requires srv_conn to be valid. In the
end, the only situation where we have to release the existing connection
and allocate a new one is when reuse == 0.
objt_server() is called multiple times at various places while some
places already make use of srv for this. Let's move the call at the
top of the function and use it all over the place.
This patch removes the 32 bits unsigned integer and the 32 bit signed
integer. It replaces these types by a unique type 64 bit signed.
This makes easy the usage of integer and clarify signed and unsigned use.
With the previous version, signed and unsigned are used ones in place of
others, and sometimes the converter loose the sign. For example, divisions
are processed with "unsigned", if one entry is negative, the result is
wrong.
Note that the integer pattern matching and dotted version pattern matching
are already working with signed 64 bits integer values.
There is one user-visible change : the "uint()" and "sint()" sample fetch
functions which used to return a constant integer have been replaced with
a new more natural, unified "int()" function. These functions were only
introduced in the latest 1.6-dev2 so there's no impact on regular
deployments.
The new "sni" server directive takes a sample fetch expression and
uses its return value as a hostname sent as the TLS SNI extension.
A typical use case consists in forwarding the front connection's SNI
value to the server in a bridged HTTPS forwarder :
sni ssl_fc_sni
This patch removes the structs "session", "stream" and "proxy" from
the sample-fetches and converters function prototypes.
This permits to remove some weight in the prototype call.
The get_server_ph_post() function assumes that the buffer is contiguous.
While this is true for all the header part, it is not necessarily true
for the end of data the fit in the reserve. In this case there's a risk
to read past the end of the buffer for a few hundred bytes, and possibly
to crash the process if what follows is not mapped.
The fix consists in truncating the analyzed length to the length of the
contiguous block that follows the headers.
A config workaround for this bug would be to disable balance url_param.
This fix must be backported to 1.5. It seems 1.4 did have the check.
Many such function need a session, and till now they used to dereference
the stream. Once we remove the stream from the embryonic session, this
will not be possible anymore.
So as of now, sample fetch functions will be called with this :
- sess = NULL, strm = NULL : never
- sess = valid, strm = NULL : tcp-req connection
- sess = valid, strm = valid, strm->txn = NULL : tcp-req content
- sess = valid, strm = valid, strm->txn = valid : http-req / http-res
All of them can now retrieve the HTTP transaction *if it exists* from
the stream and be sure to get NULL there when called with an embryonic
session.
The patch is a bit large because many locations were touched (all fetch
functions had to have their prototype adjusted). The opportunity was
taken to also uniformize the call names (the stream is now always "strm"
instead of "l4") and to fix indent where it was broken. This way when
we later introduce the session here there will be less confusion.
Now this one is dynamically allocated. It means that 280 bytes of memory
are saved per TCP stream, but more importantly that it will become
possible to remove the l7 pointer from fetches and converters since
it will be deduced from the stream and will support being null.
A lot of care was taken because it's easy to forget a test somewhere,
and the previous code used to always trust s->txn for being valid, but
all places seem to have been visited.
All HTTP fetch functions check the txn first so we shouldn't have any
issue there even when called from TCP. When branching from a TCP frontend
to an HTTP backend, the txn is properly allocated at the same time as the
hdr_idx.
When s->si[0].end was dereferenced as a connection or anything in
order to retrieve information about the originating session, we'll
now use sess->origin instead so that when we have to chain multiple
streams in HTTP/2, we'll keep accessing the same origin.
Just like for the listener, the frontend is session-wide so let's move
it to the session. There are a lot of places which were changed but the
changes are minimal in fact.
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session, it has buffers,
logs, etc.
In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.
The files stream.{c,h} were added and session.{c,h} removed.
The session will be reintroduced later and a few parts of the stream
will progressively be moved overthere. It will more or less contain
only what we need in an embryonic session.
Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.
Once all changes are completed, we should see approximately this :
L7 - http_txn
L6 - stream
L5 - session
L4 - connection | applet
There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.
Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed, it doesn't seem like
any of them will need to move to the session.
These 4 combinations are needlessly complicated since the session already
has direct access to the associated stream interfaces without having to
check an indirect pointer.
The purpose of these two macros will be to pass via the session to
find the relevant stream interfaces so that we don't need to store
the ->cons nor ->prod pointers anymore. Currently they're only defined
so that all references could be removed.
Note that many places need a second pass of clean up so that we don't
have any chn_prod(&s->req) anymore and only &s->si[0] instead, and
conversely for the 3 other cases.
The channels were pointers to outside structs and this is not needed
anymore since the buffers have moved, but this complicates operations.
Move them back into the session so that both channels and stream interfaces
are always allocated for a session. Some places (some early sample fetch
functions) used to validate that a channel was NULL prior to dereferencing
it. Now instead we check if chn->buf is NULL and we force it to remain NULL
until the channel is initialized.
When the destination IP is dynamically set, we can't use the "target"
to define the proto. This patch ensures that we always use the protocol
associated with the address family. The proto field was removed from
the server and check structs.
balance hdr(<name>) provides on option 'use_domain_only' to match only the
domain part in a header (designed for the Host header).
Olivier Fredj reported that the hashes were not the same for
'subdomain.domain.tld' and 'domain.tld'.
This is because the pointer was rewinded one step to far, resulting in a hash
calculated against wrong values :
- '.domai' for 'subdomain.domain.tld'
- ' domai' for 'domain.tld' (beginning with the space in the header line)
Another special case is when no dot can be found in the header : the hash will
be calculated against an empty string.
The patch addresses both cases : 'domain' will be used to compute the hash for
'subdomain.domain.tld', 'domain.tld' and 'domain' (using the whole header value
for the last case).
The fix must be backported to haproxy 1.5 and 1.4.
The path "MAJOR: namespace: add Linux network namespace support" doesn't
permit to use internal data producer like a "peers synchronisation"
system. The result is a segfault when the internal application starts.
This patch fix the commit b3e54fe387
It is introduced in 1.6dev version, it doesn't need to be backported.
This patch makes it possible to create binds and servers in separate
namespaces. This can be used to proxy between multiple completely independent
virtual networks (with possibly overlapping IP addresses) and a
non-namespace-aware proxy implementation that supports the proxy protocol (v2).
The setup is something like this:
net1 on VLAN 1 (namespace 1) -\
net2 on VLAN 2 (namespace 2) -- haproxy ==== proxy (namespace 0)
net3 on VLAN 3 (namespace 3) -/
The proxy is configured to make server connections through haproxy and sending
the expected source/target addresses to haproxy using the proxy protocol.
The network namespace setup on the haproxy node is something like this:
= 8< =
$ cat setup.sh
ip netns add 1
ip link add link eth1 type vlan id 1
ip link set eth1.1 netns 1
ip netns exec 1 ip addr add 192.168.91.2/24 dev eth1.1
ip netns exec 1 ip link set eth1.$id up
...
= 8< =
= 8< =
$ cat haproxy.cfg
frontend clients
bind 127.0.0.1:50022 namespace 1 transparent
default_backend scb
backend server
mode tcp
server server1 192.168.122.4:2222 namespace 2 send-proxy-v2
= 8< =
A bind line creates the listener in the specified namespace, and connections
originating from that listener also have their network namespace set to
that of the listener.
A server line either forces the connection to be made in a specified
namespace or may use the namespace from the client-side connection if that
was set.
For more documentation please read the documentation included in the patch
itself.
Signed-off-by: KOVACS Tamas <ktamas@balabit.com>
Signed-off-by: Sarkozi Laszlo <laszlo.sarkozi@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Commit 98634f0 ("MEDIUM: backend: Enhance hash-type directive with an
algorithm options") cleaned up the hashing code by using a centralized
function. A bug appeared in get_server_uh() which is the URI hashing
function. Prior to the patch, the function would stop hashing on the
question mark, or on the trailing slash of a maximum directory count.
Consecutive to the patch, this last character is included into the
hash computation. This means that :
GET /0
GET /0?
Are not hashed similarly. The following configuration reproduces it :
mode http
balance uri
server s1 0.0.0.0:1234 redir /s1
server s2 0.0.0.0:1234 redir /s2
Many thanks to Vedran Furac for reporting this issue. The fix must
be backported to 1.5.
When we were generating a hash, it was done using an unsigned long. When the hash was used
to select a backend, it was sent as an unsigned int. This made it difficult to predict which
backend would be selected.
This patch updates get_hash, and the hash methods to use an unsigned int, to remain consistent
throughout the codebase.
This fix should be backported to 1.5 and probably in part to 1.4.
Checks.c has become a total mess. A number of proxy or server maintenance
and queue management functions were put there probably because they were
used there, but that makes the code untouchable. And that's without saying
that their names does not always relate to what they really do!
So let's do a first pass by moving these ones :
- set_backend_down() => backend.c
- redistribute_pending() => queue.c:pendconn_redistribute()
- check_for_pending() => queue.c:pendconn_grab_from_px()
- shutdown_sessions => server.c:srv_shutdown_sessions()
- shutdown_backup_sessions => server.c:srv_shutdown_backup_sessions()
All of them were moved at once.
Servers used to have 3 flags to store a state, now they have 4 states
instead. This avoids lots of confusion for the 4 remaining undefined
states.
The encoding from the previous to the new states can be represented
this way :
SRV_STF_RUNNING
| SRV_STF_GOINGDOWN
| | SRV_STF_WARMINGUP
| | |
0 x x SRV_ST_STOPPED
1 0 0 SRV_ST_RUNNING
1 0 1 SRV_ST_STARTING
1 1 x SRV_ST_STOPPING
Note that the case where all bits were set used to exist and was randomly
dealt with. For example, the task was not stopped, the throttle value was
still updated and reported in the stats and in the http_server_state header.
It was the same if the server was stopped by the agent or for maintenance.
It's worth noting that the internal function names are still quite confusing.
Now we introduce srv->admin and srv->prev_admin which are bitfields
containing one bit per source of administrative status (maintenance only
for now). For the sake of backwards compatibility we implement a single
source (ADMF_FMAINT) but the code already checks any source (ADMF_MAINT)
where the STF_MAINTAIN bit was previously checked. This will later allow
us to add ADMF_IMAINT for maintenance mode inherited from tracked servers.
Along doing these changes, it appeared that some places will need to be
revisited when implementing the inherited bit, this concerns all those
modifying the ADMF_FMAINT bit (enable/disable actions on the CLI or stats
page), and the checks to report "via" on the stats page. But currently
the code is harmless.
Till now, the server's state and flags were all saved as a single bit
field. It causes some difficulties because we'd like to have an enum
for the state and separate flags.
This commit starts by splitting them in two distinct fields. The first
one is srv->state (with its counter-part srv->prev_state) which are now
enums, but which still contain bits (SRV_STF_*).
The flags now lie in their own field (srv->flags).
The function srv_is_usable() was updated to use the enum as input, since
it already used to deal only with the state.
Note that currently, the maintenance mode is still in the state for
simplicity, but it must move as well.
We used to call srv_is_usable() with either the current state and weights
or the previous ones. This causes trouble for future changes, so let's first
split it in two variants :
- srv_is_usable(srv) considers the current status
- srv_was_usable(srv) considers the previous status
We used to have is_addr() in place to validate sometimes the existence
of an address, sometimes a valid IPv4 or IPv6 address. Replace them
carefully so that is_inet_addr() is used wherever we can only use an
IPv4/IPv6 address.
The RDP cookie extractor compares the 32-bit address from the request
to the address of each server in the farm without first checking that
the server's address is IPv4. This is a leftover from the IPv4 to IPv6
conversion. It's harmless as it's unlikely that IPv4 and IPv6 servers
will be mixed in an RDP farm, but better fix it.
This patch does not need to be backported.
This commit modifies the PROXY protocol V2 specification to support headers
longer than 255 bytes allowing for optional extensions. It implements the
PROXY protocol V2 which is a binary representation of V1. This will make
parsing more efficient for clients who will know in advance exactly how
many bytes to read. Also, it defines and implements some optional PROXY
protocol V2 extensions to send information about downstream SSL/TLS
connections. Support for PROXY protocol V1 remains unchanged.
Finn Arne Gangstad suggested that we should have the ability to break
keep-alive when the target server has reached its maxconn and that a
number of connections are present in the queue. After some discussion
around his proposed patch, the following solution was suggested : have
a per-proxy setting to fix a limit to the number of queued connections
on a server after which we break keep-alive. This ensures that even in
high latency networks where keep-alive is beneficial, we try to find a
different server.
This patch is partially based on his original proposal and implements
this configurable threshold.
All the code inherited from version 1.1 still holds a lot ot sessions
called "t" because in 1.1 they were tasks. This naming is very annoying
and sometimes even confusing, for example in code involving tables.
Let's get rid of this once for all and before 1.5-final.
Nothing changed beyond just carefully renaming these variables.
Cyril Bont reported that the "lastsess" field of a stats-only backend
was never updated. In fact the same is true for any applet and anything
not a server. Also, lastsess was not updated for a server reusing its
connection for a new request.
Since the goal of this field is to report recent activity, it's better
to ensure that all accesses are reported. The call has been moved to
the code validating the session establishment instead, since everything
passes there.
http_body_rewind() returns the number of bytes to rewind before buf->p to
find the message's body. It relies on http_hdr_rewind() to find the beginning
and adds msg->eoh + msg->eol which are always safe.
http_data_rewind() does the same to get the beginning of the data, which
differs from above when a chunk is present. It uses the function above and
adds msg->sol.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_uri_rewind() returns the number of bytes to rewind before buf->p to
find the URI. It relies on http_hdr_rewind() to find the beginning and
is just here to simplify operations.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_hdr_rewind() returns the number of bytes to rewind before buf->p to
find the beginning of headers. At the moment it's not exact as it still
relies on buf->o, assuming that no other data from a past message were
pending there, but it's what was done till there.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_body_bytes() returns the number of bytes of the current message body
present in the buffer. It is compatible with being called before and after
the headers are forwarded.
This is done to centralize further ->sov changes.
We used to have msg->sov updated for every chunk that was parsed. The issue
is that we want to be able to rewind after chunks were parsed in case we need
to redispatch a request and perform a new hash on the request or insert a
different server header name.
Currently, msg->sov and msg->next make parallel progress. We reached a point
where they're always equal because msg->next is initialized from msg->sov,
and is subtracted msg->sov's value each time msg->sov bytes are forwarded.
So we can now ensure that msg->sov can always be replaced by msg->next for
every state after HTTP_MSG_BODY where it is used as a position counter.
This allows us to keep msg->sov untouched whatever the number of chunks that
are parsed, as is needed to extract data from POST request (eg: url_param).
However, we still need to know the starting position of the data relative to
the body, which differs by the chunk size length. We use msg->sol for this
since it's now always zero and unused in the body.
So with this patch, we have the following situation :
- msg->sov = msg->eoh + msg->eol = size of the headers including last CRLF
- msg->sol = length of the chunk size if any. So msg->sov + msg->sol = DATA.
- msg->next corresponds to the byte being inspected based on the current
state and is always >= msg->sov before starting to forward anything.
Since sov and next are updated in case of header rewriting, a rewind will
fix them both when needed. Of course, ->sol has no reason for changing in
such conditions, so it's fine to keep it relative to msg->sov.
In theory, even if a redispatch has to be performed, a transformation
occurring on the request would still work because the data moved would
still appear at the same place relative to bug->p.
This is the continuation of previous patch. Now that full buffers are
not rejected anymore, let's wait for at least the advertised chunk or
body length to be present or the buffer to be full. When either
condition is met, the message processing can go forward.
Thus we don't need to use url_param_post_limit anymore, which was passed
in the configuration as an optionnal <max_wait> parameter after the
"check_post" value. This setting was necessary when the feature was
implemented because there was no support for parsing message bodies.
The argument is now silently ignored if set in the configuration.
Finn Arne Gangstad reported that commit 6b726adb35 ("MEDIUM: http: do
not report connection errors for second and further requests") breaks
support for serving static files by abusing the errorfile 503 statement.
Indeed, a second request over a connection sent to any server or backend
returning 503 would silently be dropped.
The proper solution consists in adding a flag on the session indicating
that the server connection was reused, and to only avoid the error code
in this case.
Since 1.5-dev20, we have a working server-side keep-alive and an option
"prefer-last-server" to indicate that we explicitly want to reuse the
same server as the last one. Unfortunately this breaks the redispatch
feature because assign_server() insists on reusing the same server as
the first one attempted even if the connection failed to establish.
A simple solution consists in only considering the last connection if
it was connected. Otherwise there is no reason for being interested in
reusing the same server.
Summary:
Track and report last session time on the stats page for each server
in every backend, as well as the backend.
This attempts to address the requirement in the ROADMAP
- add a last activity date for each server (req/resp) that will be
displayed in the stats. It will be useful with soft stop.
The stats page reports this as time elapsed since last session. This
change does not adequately address the requirement for long running
session (websocket, RDP... etc).
In HTTP keep-alive mode, if we receive a 401, we still have a chance
of being able to send the visitor again to the same server over the
same connection. This is required by some broken protocols such as
NTLM, and anyway whenever there is an opportunity for sending the
challenge to the proper place, it's better to do it (at least it
helps with debugging).
If we reuse a server-side connection, we must not reinitialize its context nor
try to enable send_proxy. At the moment HTTP keep-alive over SSL fails on the
first attempt because the SSL context was cleared, so it only worked after a
retry.
When the load balancing algorithm in use is not deterministic, and a previous
request was sent to a server to which haproxy still holds a connection, it is
sometimes desirable that subsequent requests on a same session go to the same
server as much as possible. Note that this is different from persistence, as
we only indicate a preference which haproxy tries to apply without any form
of warranty. The real use is for keep-alive connections sent to servers. When
this option is used, haproxy will try to reuse the same connection that is
attached to the server instead of rebalancing to another server, causing a
close of the connection. This can make sense for static file servers. It does
not make much sense to use this in combination with hashing algorithms.
This commit allows an existing server-side connection to be reused if
it matches the same target. Basic controls are performed ; right now
we do not allow to reuse a connection when dynamic source binding is
in use or when the destination address or port is dynamic (eg: proxy
mode). Later we'll have to also disable connection sharing when PROXY
protocol is being used or when non-idempotent requests are processed.
When allocating a new connection, only the caller knows whether it's
acceptable to reuse the previous one or not. Let's pass this information
to si_alloc_conn() which will do the cleanup if the connection is not
acceptable.
Having the check state partially stored in the server doesn't help.
Some functions such as srv_getinter() rely on the server being checked
to decide what check frequency to use, instead of relying on the check
being configured. So let's get rid of SRV_CHECKED and SRV_AGENT_CHECKED
and only use the check's states instead.
Till now the send_proxy_ofs field remained in the stream interface,
but since the dynamic allocation of the connection, it makes a lot
of sense to move that into the connection instead of the stream
interface, since it will not be statically allocated for each
session.
Also, it turns out that moving it to the connection fils an alignment
hole on 64 bit architectures so it does not consume more memory, and
removing it from the stream interface was an opportunity to correctly
reorder fields and reduce the stream interface's size from 160 to 144
bytes (-10%). This is 32 bytes saved per session.
The outgoing connection is now allocated dynamically upon the first attempt
to touch the connection's source or destination address. If this allocation
fails, we fail on SN_ERR_RESOURCE.
As we didn't use si->conn anymore, it was removed. The endpoints are released
upon session_free(), on the error path, and upon a new transaction. That way
we are able to carry the existing server's address across retries.
The stream interfaces are not initialized anymore before session_complete(),
so we could even think about allocating them dynamically as well, though
that would not provide much savings.
The session initialization now makes use of conn_new()/conn_free(). This
slightly simplifies the code and makes it more logical. The connection
initialization code is now shorter by about 120 bytes because it's done
at once, allowing the compiler to remove all redundant initializations.
The si_attach_applet() function now takes care of first detaching the
existing endpoint, and it is called from stream_int_register_handler(),
so we can safely remove the calls to si_release_endpoint() in the
application code around this call.
A call to si_detach() was made upon stream_int_unregister_handler() to
ensure we always free the allocated connection if one was allocated in
parallel to setting an applet (eg: detect HTTP proxy while proceeding
with stats maybe).
si_prepare_conn() is not appropriate in our case as it both initializes and
attaches the connection to the stream interface. Due to the asymmetry between
accept() and connect(), it causes some fields such as the control and transport
layers to be reinitialized.
Now that we can separately initialize these fields using conn_prepare(), let's
break this function to only attach the connection to the stream interface.
Also, by analogy, si_prepare_none() was renamed si_detach(), and
si_prepare_applet() was renamed si_attach_applet().
The connection will only remain there as a pre-allocated entity whose
goal is to be placed in ->end when establishing an outgoing connection.
All connection initialization can be made on this connection, but all
information retrieved should be applied to the end point only.
This change is huge because there were many users of si->conn. Now the
only users are those who initialize the new connection. The difficulty
appears in a few places such as backend.c, proto_http.c, peers.c where
si->conn is used to hold the connection's target address before assigning
the connection to the stream interface. This is why we have to keep
si->conn for now. A future improvement might consist in dynamically
allocating the connection when it is needed.
This function makes no sense anymore and will cause trouble to convert
the remains of connection/applet to end points. Let's replace it now
with its contents.
A very old bug resulting from some code refactoring causes
assign_server_address() to refrain from retrieving the destination
address from the client-side connection when transparent mode is
enabled and we're connecting to a server which has address 0.0.0.0.
The impact is low since such configurations are unlikely to ever
be encountered. The fix should be backported to older branches.
This function was designed for haproxy while testing other functions
in the past. Initially it was not planned to be used given the not
very interesting numbers it showed on real URL data : it is not as
smooth as the other ones. But later tests showed that the other ones
are extremely sensible to the server count and the type of input data,
especially DJB2 which must not be used on numeric input. So in fact
this function is still a generally average performer and it can make
sense to merge it in the end, as it can provide an alternative to
sdbm+avalanche or djb2+avalanche for consistent hashing or when hashing
on numeric data such as a source IP address or a visitor identifier in
a URL parameter.
Summary:
Avalanche is supported not as a native hashing choice, but a modifier
on the hashing function. Note that this means that possible configs
written after 1.5-dev4 using "hash-type avalanche" will get an informative
error instead. But as discussed on the mailing list it seems nobody ever
used it anyway, so let's fix it before the final 1.5 release.
The default values were selected for backward compatibility with previous
releases, as discussed on the mailing list, which means that the consistent
hashing will still apply the avalanche hash by default when no explicit
algorithm is specified.
Examples
(default) hash-type map-based
Map based hashing using sdbm without avalanche
(default) hash-type consistent
Consistent hashing using sdbm with avalanche
Additional Examples:
(a) hash-type map-based sdbm
Same as default for map-based above
(b) hash-type map-based sdbm avalanche
Map based hashing using sdbm with avalanche
(c) hash-type map-based djb2
Map based hashing using djb2 without avalanche
(d) hash-type map-based djb2 avalanche
Map based hashing using djb2 with avalanche
(e) hash-type consistent sdbm avalanche
Same as default for consistent above
(f) hash-type consistent sdbm
Consistent hashing using sdbm without avalanche
(g) hash-type consistent djb2
Consistent hashing using djb2 without avalanche
(h) hash-type consistent djb2 avalanche
Consistent hashing using djb2 with avalanche
Summary:
In testing at tumblr, we found that using djb2 hashing instead of the
default sdbm hashing resulted is better workload distribution to our backends.
This commit implements a change, that allows the user to specify the hash
function they want to use. It does not limit itself to consistent hashing
scenarios.
The supported hash functions are sdbm (default), and djb2.
For a discussion of the feature and analysis, see mailing list thread
"Consistent hashing alternative to sdbm" :
http://marc.info/?l=haproxy&m=138213693909219
Note: This change does NOT make changes to new features, for instance,
applying an avalance hashing always being performed before applying
consistent hashing.
This function is also called directly from backend.c, so let's stop
building fake args to call it as a sample fetch, and have a lower
layer more generic function instead.
We're having a lot of duplicate code just because of minor variants between
fetch functions that could be dealt with if the functions had the pointer to
the original keyword, so let's pass it as the last argument. An earlier
version used to pass a pointer to the sample_fetch element, but this is not
the best solution for two reasons :
- fetch functions will solely rely on the keyword string
- some other smp_fetch_* users do not have the pointer to the original
keyword and were forced to pass NULL.
So finally we're passing a pointer to the keyword as a const char *, which
perfectly fits the original purpose.
Benoit Dolez reported a failure to start haproxy 1.5-dev19. The
process would immediately report an internal error with missing
fetches from some crap instead of ACL names.
The cause is that some versions of gcc seem to trim static structs
containing a variable array when moving them to BSS, and only keep
the fixed size, which is just a list head for all ACL and sample
fetch keywords. This was confirmed at least with gcc 3.4.6. And we
can't move these structs to const because they contain a list element
which is needed to link all of them together during the parsing.
The bug indeed appeared with 1.5-dev19 because it's the first one
to have some empty ACL keyword lists.
One solution is to impose -fno-zero-initialized-in-bss to everyone
but this is not really nice. Another solution consists in ensuring
the struct is never empty so that it does not move there. The easy
solution consists in having a non-null list head since it's not yet
initialized.
A new "ILH" list head type was thus created for this purpose : create
an Initialized List Head so that gcc cannot move the struct to BSS.
This fixes the issue for this version of gcc and does not create any
burden for the declarations.
This patch does not change the logic of the code, it only changes the
way OS-specific defines are tested.
At the moment the transparent proxy code heavily depends on Linux-specific
defines. This first patch introduces a new define "CONFIG_HAP_TRANSPARENT"
which is set every time the defines used by transparent proxy are present.
This also means that with an up-to-date libc, it should not be necessary
anymore to force CONFIG_HAP_LINUX_TPROXY during the build, as the flags
will automatically be detected.
The CTTPROXY flags still remain separate because this older API doesn't
work the same way.
A new line has been added in the version output for haproxy -vv to indicate
what transparent proxy support is available.
Now that ACLs solely rely on sample fetch functions, make them use the
same arg mask. All inconsistencies have been fixed separately prior to
this patch, so this patch almost only adds a new pointer indirection
and removes all references to ARG*() in the definitions.
The parsing is still performed by the ACL code though.
ACL fetch functions used to directly reference a fetch function. Now
that all ACL fetches have their sample fetches equivalent, we can make
ACLs reference a sample fetch keyword instead.
In order to simplify the code, a sample keyword name may be NULL if it
is the same as the ACL's, which is the most common case.
A minor change appeared, http_auth always expects one argument though
the ACL allowed it to be missing and reported as such afterwards, so
fix the ACL to match this. This is not really a bug.
The following sample fetch functions were only usable by ACLs but are now
usable by sample fetches too :
avg_queue, be_conn, be_id, be_sess_rate, connslots, nbsrv,
queue, srv_conn, srv_id, srv_is_up, srv_sess_rate
The fetch functions have been renamed "smp_fetch_*".
The file acl.c is a real mess, it both contains functions to parse and
process ACLs, and some sample extraction functions which act on buffers.
Some other payload analysers were arbitrarily dispatched to proto_tcp.c.
So now we're moving all payload-based fetches and ACLs to payload.c
which is capable of extracting data from buffers and rely on everything
that is protocol-independant. That way we can safely inflate this file
and only use the other ones when some fetches are really specific (eg:
HTTP, SSL, ...).
As a result of this cleanup, the following new sample fetches became
available even if they're not really useful :
always_false, always_true, rep_ssl_hello_type, rdp_cookie_cnt,
req_len, req_ssl_hello_type, req_ssl_sni, req_ssl_ver, wait_end
The function 'acl_fetch_nothing' was wrong and never used anywhere so it
was removed.
The "rdp_cookie" sample fetch used to have a mandatory argument while it
was optional in ACLs, which are supposed to iterate over RDP cookies. So
we're making it optional as a fetch too, and it will return the first one.
This is just like previous commit, but for the backend this time. All this
code did not need to remain duplicated. These are 500 more bytes shaved off.
Both servers and proxies share a common set of parameters for outgoing
connections, and since they're not stored in a similar structure, a lot
of code is duplicated in the connection setup, which is one sensible
area.
Let's first define a common struct for these settings and make use of it.
Next patches will de-duplicate code.
This change also fixes a build breakage that happens when USE_LINUX_TPROXY
is not set but USE_CTTPROXY is set, which seem to be very unlikely
considering that the issue was introduced almost 2 years ago an never
reported.
Considering there is no option yet for maxconnrate for servers, I wrote
an ACL to check a backend server session rate which we use to send to an
"overflow" backend to prevent latency responses to our clients (very
sensitive latency requirements).
Instead of storing a couple of (int, ptr) in the struct connection
and the struct session, we use a different method : we only store a
pointer to an integer which is stored inside the target object and
which contains a unique type identifier. That way, the pointer allows
us to retrieve the object type (by dereferencing it) and the object's
address (by computing the displacement in the target structure). The
NULL pointer always corresponds to OBJ_TYPE_NONE.
This reduces the size of the connection and session structs. It also
simplifies target assignment and compare.
In order to improve the generated code, we try to put the obj_type
element at the beginning of all the structs (listener, server, proxy,
si_applet), so that the original and target pointers are always equal.
A lot of code was touched by massive replaces, but the changes are not
that important.
We will need to be able to switch server connections on a session and
to keep idle connections. In order to achieve this, the preliminary
requirement is that the connections can survive the session and be
detached from them.
Right now they're still allocated at exactly the same place, so when
there is a session, there are always 2 connections. We could soon
improve on this by allocating the outgoing connection only during a
connect().
This current patch touches a lot of code and intentionally does not
change any functionnality. Performance tests show no regression (even
a very minor improvement). The doc has not yet been updated.
With this commit, we now separate the channel from the buffer. This will
allow us to replace buffers on the fly without touching the channel. Since
nobody is supposed to keep a reference to a buffer anymore, doing so is not
a problem and will also permit some copy-less data manipulation.
Interestingly, these changes have shown a 2% performance increase on some
workloads, probably due to a better cache placement of data.
While working on the changes required to make the health checks use the
new connections, it started to become obvious that some naming was not
logical at all in the connections. Specifically, it is not logical to
call the "data layer" the layer which is in charge for all the handshake
and which does not yet provide a data layer once established until a
session has allocated all the required buffers.
In fact, it's more a transport layer, which makes much more sense. The
transport layer offers a medium on which data can transit, and it offers
the functions to move these data when the upper layer requests this. And
it is the upper layer which iterates over the transport layer's functions
to move data which should be called the data layer.
The use case where it's obvious is with embryonic sessions : an incoming
SSL connection is accepted. Only the connection is allocated, not the
buffers nor stream interface, etc... The connection handles the SSL
handshake by itself. Once this handshake is complete, we can't use the
data functions because the buffers and stream interface are not there
yet. Hence we have to first call a specific function to complete the
session initialization, after which we'll be able to use the data
functions. This clearly proves that SSL here is only a transport layer
and that the stream interface constitutes the data layer.
A similar change will be performed to rename app_cb => data, but the
two could not be in the same commit for obvious reasons.
Alex Markham reported and diagnosed a bug appearing on 1.5-dev11,
causing a crash on x86_64 when header hashing is used. The cause is
a missing (int) cast causing a negative offset to appear positive
and the resulting pointer to go out of bounds.
The crash is not possible anymore since 1.5-dev12 because a second
bug caused the negative sign to disappear so the pointer is always
within range but always wrong, so balance hdr() never works anymore.
This fix restores the correct behaviour and ensures the sign is
correct.
The last uses of the stream interfaces were in tcp_connect_server() and
could easily and more appropriately be moved to its callers, si_connect()
and connect_server(), making a lot more sense.
Now the function should theorically be usable for health checks.
It also appears more obvious that the file is split into two distinct
parts :
- the protocol layer used at the connection level
- the tcp analysers executing tcp-* rules and their samples/acls.
We need to have the source and destination addresses in the connection.
They were lying in the stream interface so let's move them. The flags
SI_FL_FROM_SET and SI_FL_TO_SET have been moved as well.
It's worth noting that tcp_connect_server() almost does not use the
stream interface anymore except for a few flags.
It has been identified that once we detach the connection from the SI,
it will probably be needed to keep a copy of the server-side addresses
in the SI just for logging purposes. This has not been implemented right
now though.
Some parts of the sock_ops structure were only used by the stream
interface and have been moved into si_ops. Some of them were callbacks
to the stream interface from the connection and have been moved into
app_cp as they're the application seen from the connection (later,
health-checks will need to use them). The rest has moved to data_ops.
Normally at this point the connection could live without knowing about
stream interfaces at all.
The "raw_sock" prefix will be more convenient for naming functions as
it will be prefixed with the data layer and suffixed with the data
direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.
At the moment, the struct is still embedded into the struct channel, but
all the functions have been updated to use struct buffer only when possible,
otherwise struct channel. Some functions would likely need to be splitted
between a buffer-layer primitive and a channel-layer function.
Later the buffer should become a pointer in the struct buffer, but doing so
requires a few changes to the buffer allocation calls.
This is a massive rename. We'll then split channel and buffer.
This change needs a lot of cleanups. At many locations, the parameter
or variable is still called "buf" which will become ambiguous. Also,
the "struct channel" is still defined in buffers.h.
This patch brings a new "whole" parameter to "balance uri" which makes
the hash work over the whole uri, not just the part before the query
string. Len and depth parameter are still honnored.
The reason for this new feature is explained below.
I have 3 backend servers, each accepting different form of HTTP queries:
http://backend1.server.tld/service1.php?q=...
http://backend1.server.tld/service2.php?q=...
http://backend2.server.tld/index.php?query=...&subquery=...
http://backend3.server.tld/image/49b8c0d9ff
Each backend server returns a different response based on either:
- the URI path (the left part of the URI before the question mark)
- the query string (the right part of the URI after the question mark)
- or the combination of both
I wanted to set up a common caching cluster (using 6 Squid servers, each
configured as reverse proxy for those 3 backends) and have HAProxy balance
the queries among the Squid servers based on URL. I also wanted to achieve
hight cache hit ration on each Squid server and send the same queries to
the same Squid servers. Initially I was considering using the 'balance uri'
algorithm, but that would not work as in case of backend2 all queries would
go to only one Squid server. The 'balance url_param' would not work either
as it would send the backend3 queries to only one Squid server.
So I thought the simplest solution would be to use 'balance uri', but to
calculate the hash based on the whole URI (URI path + query string),
instead of just the URI path.
We start to move everything needed to manage a connection to a special
entity "struct connection". We have the data layer operations and the
control operations there. We'll also have more info in the future such
as file descriptors and applet contexts, so that in the end it becomes
detachable from the stream interface, which will allow connections to
be reused between sessions.
For now on, we start with minimal changes.
Since the recent buffer reorg, msg->som is redundant with buf->p but still
appears at a number of places. This tiny patch allows to confirm that som
follows two states :
- 0 from the moment the message starts to be parsed
- relative offset to ->p for start of chunk when parsing chunks
During this second state, ->sol is never used, so we should probably merge
the two.
This is a left-over from the buffer changes. Msg->sol is always null at the
end of the parsing, so we must not use it anymore to read headers or find
the beginning of a message. As a side effect, the dump of the request in
debug mode is working again because it was relying on msg->sol not being
null.
Maybe it will even be mergeable with another of the message pointers.
The recent split between the buffers and HTTP messages in 1.5-dev9 caused
a major trouble : in the past, we used to keep a pointer to HTTP data in the
buffer struct itself, which was the cause of most of the pain we had to deal
with buffers.
Now the two are split but we lost the information about the beginning of
the HTTP message once it's being forwarded. While it seems normal, it happens
that several parts of the code currently rely on this ability to inspect a
buffer containing old contents :
- balance uri
- balance url_param
- balance url_param check_post
- balance hdr()
- balance rdp-cookie()
- http-send-name-header
All these happen after the data are scheduled for being forwarded, which
also causes a server to be selected. So for a long time we've been relying
on supposedly sent data that we still had a pointer to.
Now that we don't have such a pointer anymore, we only have one possibility :
when we need to inspect such data, we have to rewind the buffer so that ->p
points to where it previously was. We're lucky, no data can leave the buffer
before it's being connecting outside, and since no inspection can begin until
it's empty, we know that the skipped data are exactly ->o. So we rewind the
buffer by ->o to get headers and advance it back by the same amount.
Proceeding this way is particularly important when dealing with chunked-
encoded requests, because the ->som and ->sov fields may be reused by the
chunk parser before the connection attempt is made, so we cannot rely on
them.
Also, we need to be able to come back after retries and redispatches, which
might change the size of the request if http-send-name-header is set. All of
this is accounted for by the output queue so in the end it does not look like
a bad solution.
No backport is needed.
Instead of hard-coding sock_raw in connect_server(), we set this socket
operation at config parsing time. Right now, only servers and peers have
it. Proxies are still hard-coded as sock_raw. This will be needed for
future work on SSL which requires a different socket layer.
Commit e164e7a removed get_src/get_dst setting in the stream interfaces but
forgot to set it in proto_tcp. Get the feature back because we need it for
logging, transparent mode, ACLs etc... We now rely on the stream interface
direction to know what syscall to use.
One benefit of doing it this way is that we don't use getsockopt() anymore
on outgoing stream interfaces nor on UNIX sockets.
We'll soon have an SSL socket layer, and in order to ease the difference
between the two, we use the name "sock_raw" to designate the one which
directly talks to the sockets without any conversion.
Patterns were using a bitmask to indicate if request or response was desired
in fetch functions and keywords. ACLs were using a bitmask in fetch keywords
and a single bit in fetch functions. ACLs were also using an ACL_PARTIAL bit
in fetch functions indicating that a non-final fetch was performed, which was
an abuse of the existing direction flag.
The change now consists in using :
- a capabilities field for fetch keywords => SMP_CAP_REQ/RES to indicate
if a keyword supports requests, responses, both, etc...
- an option field for fetch functions to indicate what the caller expects
(request/response, final/non-final)
The ACL_PARTIAL bit was reversed to get SMP_OPT_FINAL as it's more explicit
to know we're working on a final buffer than on a non-final one.
ACL_DIR_* were removed, as well as PATTERN_FETCH_*. L4 fetches were improved
to support being called on responses too since they're still available.
The <dir> field of all fetch functions was changed to <opt> which is now
unsigned.
The patch is large but mostly made of cosmetic changes to accomodate this, as
almost no logic change happened.
Having the args everywhere will make it easier to share fetch functions
between patterns and ACLs. The only place where we could have needed
the expr was in the http_prefetch function which can do well without.
Previously, both pattern, backend and persist_rdp_cookie would build fake
ACL expressions to fetch an RDP cookie by calling acl_fetch_rdp_cookie().
Now we switch roles. The RDP cookie fetch function is provided as a sample
fetch function that all others rely on, including ACL. The code is exactly
the same, only the args handling moved from expr->args to args. The code
was moved to proto_tcp.c, but probably that a dedicated file would be more
suited to content handling.
These ones were either unused or improperly used. Some integers were marked
read-only, which does not make much sense. Buffers are not read-only, they're
"constant" in that they must be kept intact after any possible change.
This one is not needed anymore as we can return the data and its type in the
sample provided by the caller. ACLs now always return the proper type. BOOL
is already returned when the result is expected to be processed as a boolean.
temp_pattern has been unexported now.
The new sample types are necessary for the acl-pattern convergence.
These types are boolean and signed int. Some types were renamed for
less ambiguity (ip->ipv4, integer->uint).
A large number of ACLs make use of frontend, backend or table names in their
arguments, and fall back to the current proxy when no argument is passed. If
the expected capability is not available, the ACL silently fails at runtime.
Now we make all those names mandatory in the parser and we rely on
acl_find_targets() to replace the missing names with the holding proxy,
then to perform the appropriate tests, and to reject errors at parsing
time.
It is possible that some faulty configurations will get rejected from now
on, while they used to silently fail till now. This is the reason why this
change is marked as MAJOR.
Proxy names are now resolved when the config is parsed and not at runtime.
This means that errors will be caught for real instead of having an ACL
silently never match. Another benefit is that the fetch will be much faster
since the lookup will not have to be performed anymore, eg for all ACLs
based on explicitly named stick-tables.
However some buggy configurations which used to silently fail in the past
will now refuse to load, hence the MAJOR tag.
The types and minimal number of ACL keyword arguments are now stored in
their declaration. This will allow many more fantasies if some ACL use
several arguments or types.
Doing so required to rework all ACL keyword declarations to add two
parameters. So this was a good opportunity for a general cleanup and
to sort all entries in alphabetical order.
We still have two pending issues :
- parse_acl_expr() checks for errors but has no way to report them to
the user ;
- the types of some arguments are still not resolved and kept as strings
(eg: ARGT_FE/BE/TAB) for compatibility reasons, which must be resolved
in acl_find_targets()
The ACL parser now uses the argument parser to build a typed argument list.
Right now arguments are all strings and only one argument is supported since
this is what ACLs currently support.
msg->sol is now a relative pointer just like all other ones. There is no
more absolute references to the buffer outside the struct buffer itself.
Next two cleanups should include removing buffer references to functions
which already have an msg, and removal of wrapping detection in request
and response parsing which cannot wrap by definition.
These offsets were relative to the buffer itself. Now they're relative to
the buffer's origin (buf->p) which normally corresponds to the start of
current message.
This saves a big dependency between the HTTP message struct and the buffers.
It appeared during this change that ->col is not used anymore (it will have
to be removed). Next step is to turn ->eol and ->sol from absolute to relative.
We don't have buf->l anymore. We have buf->i for pending data and
the total length is retrieved by adding buf->o. Some computation
already become simpler.
Despite extreme care, bugs are not excluded.
It's worth noting that msg->err_pos as set by HTTP request/response
analysers becomes relative to pending data and not to the beginning
of the buffer. This has not been completed yet so differences might
occur when outgoing data are left in the buffer.
These callbacks are used to retrieve the source and destination address
of a socket. The address flags are not hold on the stream interface and
not on the session anymore. The addresses are collected when needed.
This still needs to be improved to store the IP and port separately so
that it is not needed to perform a getsockname() when only the IP address
is desired for outgoing traffic.
The hash of IPv6 addresses was not properly aligned and resulted in the
last quarter of the address not being hashed. In practice, this is rarely
detected since MAC addresses are used in the second half. But this becomes
very visible with IPv6-mapped IPv4 addresses such as ::FFFF:1.2.3.4 where
the IPv4 part is never hashed.
This bug has been there forever, since introduction of "balance source" in
v1.2.11. The fix must then be backported to all stable versions.
Thanks to Alex Markham for reporting this issue to the list !
%Bi return the backend source IP
%Bp return the backend source port
Add a function pointer in logformat_type to do additional configuration
during the log-format variable parsing.
The principle behind this load balancing algorithm was first imagined
and modeled by Steen Larsen then iteratively refined through several
work sessions until it would totally address its original goal.
The purpose of this algorithm is to always use the smallest number of
servers so that extra servers can be powered off during non-intensive
hours. Additional tools may be used to do that work, possibly by
locally monitoring the servers' activity.
The first server with available connection slots receives the connection.
The servers are choosen from the lowest numeric identifier to the highest
(see server parameter "id"), which defaults to the server's position in
the farm. Once a server reaches its maxconn value, the next server is used.
It does not make sense to use this algorithm without setting maxconn. Note
that it can however make sense to use minconn so that servers are not used
at full load before starting new servers, and so that introduction of new
servers requires a progressively increasing load (the number of servers
would more or less follow the square root of the load until maxconn is
reached). This algorithm ignores the server weight, and is more beneficial
to long sessions such as RDP or IMAP than HTTP, though it can be useful
there too.
The new function does not return IP addresses but header values instead,
so that the caller is free to make what it want of them. The conversion
is not quite clean yet, as the previous test which considered that address
0.0.0.0 meant "no address" is still used. A different IP parsing function
should be used to take this into account.
Now strings and data blocks are stored in the temp_pattern's chunk
and matched against this one.
The rdp_cookie currently makes extensive use of acl_fetch_rdp_cookie()
and will be a good candidate for the initial rework so that ACLs use
the patterns framework and not the other way around.
All ACL fetches which return integer value now store the result into
the temporary pattern struct. All ACL matches which rely on integer
also get their value there.
Note: the pattern data types are not set right now.
This is 1.5-specific. It causes issues with transparent source binding involving
hdr_ip. We must not try to bind() to a foreign address when the family is not set,
and we must set the family when an address is set.
Stream interfaces used to distinguish between client and server addresses
because they were previously of different types (sockaddr_storage for the
client, sockaddr_in for the server). This is not the case anymore, and this
distinction is confusing at best and has caused a number of regressions to
be introduced in the process of converting everything to full-ipv6. We can
now remove this and have a much cleaner code.
Nick Chalk reported that a connection to a server which has no port specified
used twice the port number. The reason is that the port number was taken from
the wrong part of the address, the client's destination address was used as the
base port instead of the server's configured address.
Thanks to Nick for his helpful diagnostic.
A similar issue as the previous one causes port mapping to fail in some
combinations of client and server address families. Using the macros fixes
the issue.
Adding health checks has become a real pain, with cross-references to all
checks everywhere because they're all a single bit. Since they're all
exclusive, let's change this to have a check number only. We reserve 4
bits allowing up to 16 checks (15+tcp), only 7 of which are currently
used. The code has shrunk by almost 1kB and we saved a few option bits.
The "dispatch" option has been moved to px->options, making a few tests
a bit cleaner.
Since we now have the copy of the target in the session, use it instead
of relying on the SI for it. The SI drops the target upon unregister()
so applets such as stats were logged as "NOSRV".
This option enables use of the PROXY protocol with the server, which
allows haproxy to transport original client's address across multiple
architecture layers.
It's very annoying that frontend and backend stats are merged because we
don't know what we're observing. For instance, if a "listen" instance
makes use of a distinct backend, it's impossible to know what the bytes_out
means.
Some points take care of not updating counters twice if the backend points
to the frontend, indicating a "listen" instance. The thing becomes more
complex when we try to add support for server side keep-alive, because we
have to maintain a pointer to the backend used for last request, and to
update its stats. But we can't perform such comparisons anymore because
the counters will not match anymore.
So in order to get rid of this situation, let's have both frontend AND
backend stats in the "struct proxy". We simply update the relevant ones
during activity. Some of them are only accounted for in the backend,
while others are just for frontend. Maybe we can improve a bit on that
later, but the essential part is that those counters now reflect what
they really mean.
This patch turns internal server addresses to sockaddr_storage to
store IPv6 addresses, and makes the connect() function use it. This
code already works but some caveats with getaddrinfo/gethostbyname
still need to be sorted out while the changes had to be merged at
this stage of internal architecture changes. So for now the config
parser will not emit an IPv6 address yet so that user experience
remains unchanged.
This change should have absolutely zero user-visible effect, otherwise
it's a bug introduced during the merge, that should be reported ASAP.
This one has been removed and is now totally superseded by ->target.
To get the server, one must use target_srv(&s->target) instead of
s->srv now.
The function ensures that non-server targets still return NULL.
s->prev_srv is used by assign_server() only, but all code paths leading
to it now take s->prev_srv from the existing s->srv. So assign_server()
can do that copy into its own stack.
If at one point a different srv is needed, we still have a copy of the
last server on which we failed a connection attempt in s->target.
When dealing with HTTP keep-alive, we'll have to know if we can reuse
an existing connection. For that, we'll have to check if the current
connection was made on the exact same target (referenced in the stream
interface).
Thus, we need to first assign the next target to the session, then
copy it to the stream interface upon connect(). Later we'll check for
equivalence between those two operations.
Till now we used the fact that the dispatch address was not null to use
the dispatch mode. This is very unconvenient, so let's have a dedicated
option.
When doing a connect() on a stream interface, some information is needed
from the server and from the backend. In some situations, we don't have
a server and only a backend (eg: peers). In other cases, we know we have
an applet and we don't want to connect to anything, but we'd still like
to have the info about the applet being used.
For this, we now store a pointer to the "target" into the stream interface.
The target describes what's on the other side before trying to connect. It
can be a server, a proxy or an applet for now. Later we'll probably have
descriptors for multiple-stage chains so that the final information may
still be found.
This will help removing many specific cases in the code. It already made
it possible to remove the "srv" and "be" parameters to tcpv4_connect_server().
Bryan Talbot reported that POST requests with a query string were not
correctly processed if the hash parameter was the first one, because
the delimiter that was looked for to trigger the parsing was '&' instead
of '?'.
Also, while checking the code, it became apparent that it was enough for
a query string to be present in the request for POST parameters to be
ignored, even if the url_param was in the body and not in the URL.
The code has then been fixed like this :
1) look for URL param. If found, return it.
2) if no URL param was found and method is POST, then look it up into
the body
The code now seems to pass all request combinations.
This patch must be backported to 1.4 since 1.4 is equally broken right now.
Till now, the forwarding code was making use of the hdr_content_len member
to hold the size of the last chunk parsed. As such, it was reset after being
scheduled for forwarding. The issue is that this entry was reset before the
data could be viewed by backend.c in order to parse a POST body, so the
"balance url_param check_post" did not work anymore.
In order to fix this, we need two things :
- the chunk size (reset upon every forward)
- the total body size (not reset)
hdr_content_len was thus replaced by the former (hence the size of the patch)
as it makes more sense to have it stored that way than the way around.
This patch should be backported to 1.4 with care, considering that it affects
the forwarding code.
When the number of servers is a multiple of the size of the input set,
map-based hash can be inefficient. This typically happens with 64
servers when doing URI hashing. The "avalanche" hash-type applies an
avalanche hash before performing a map lookup in order to smooth the
distribution. The result is slightly less smooth than the map for small
numbers of servers, but still better than the consistent hashing.
Till now when a server was configured with address 0.0.0.0, the
connection was forwarded to this address which generally is intercepted
by the system as a local address, so this was completely useless.
One sometimes useful feature for outgoing transparent proxies is to
be able to forward the connection to the same address the client
requested. This patch fixes the meaning of 0.0.0.0 precisely to
ensure that the connection will be forwarded to the initial client's
destination address.
It's not normal to initialize the server-side stream interface from the
accept() function, because it may change later. Thus, we introduce a new
stream_sock_prepare_interface() function which is called just before the
connect() and which sets all of the stream_interface's callbacks to the
default ones used for real sockets. The ->connect function is also set
at the same instant so that we can easily add new server-side protocols
soon.
The 'client.c' file now only contained frontend-specific functions,
so it has naturally be renamed 'frontend.c'. Same for client.h. This
has also been an opportunity to remove some cross references from
files that should not have depended on it.
In the end, this file should contain a protocol-agnostic accept()
code, which would initialize a session, task, etc... based on an
accept() from a lower layer. Right now there are still references
to TCP.
Some ACLs in the client ought to belong to proto_tcp, or protocols.
This file should only contain frontend-specific information and will
be renamed that way in next commit.
Some functions which act on generic buffer contents without being
tcp-specific were historically in proto_tcp.c. This concerns ACLs
and RDP cookies. Those have been moved away to more appropriate
locations. Ideally we should create some new files for each layer6
protocol parser. Let's do that later.
This ACL was missing in complex setups where the status of a remote site
has to be considered in switching decisions. Until there, using a server's
status in an ACL required to have a dedicated backend, which is a bit heavy
when multiple servers have to be monitored.
Using get_ip_from_hdr2() we can look for occurrence #X or #-X and
extract the IP it contains. This is typically designed for use with
the X-Forwarded-For header.
Using "usesrc hdr_ip(name,occ)", it becomes possible to use the IP address
found in <name>, and possibly specify occurrence number <occ>, as the
source to connect to a server. This is possible both in a server and in
a backend's source statement. This is typically used to use the source
IP previously set by a upstream proxy.
The transparent proxy address selection was set in the TCP connect function
which is not the most appropriate place since this function has limited
access to the amount of parameters which could produce a source address.
Instead, now we determine the source address in backend.c:connect_server(),
right after calling assign_server_address() and we assign this address in
the session and pass it to the TCP connect function. This cannot be performed
in assign_server_address() itself because in some cases (transparent mode,
dispatch mode or http_proxy mode), we assign the address somewhere else.
This change will open the ability to bind to addresses extracted from many
other criteria (eg: from a header).
Isidore Li reported an occasional segfault when using URL hashing, and
kindly provided backtraces and core files to help debugging.
The problem was triggered by reset connections before the URL was sent,
and was due to the same bug which was fixed by commit e45997661b
(connections were attempted in case of connection abort). While that
bug was already fixed, it appeared that the same segfault could be
triggered when URL hashing is configured in an HTTP backend when the
frontend runs in TCP mode and no URL was seen. It is totally abnormal
to try to hash a null URL, as well as to process any kind of L7 hashing
when a full request was not seen.
This additional fix now ensures that layer7 hashing is not performed on
incomplete requests.
This is used to force access to down servers for some requests. This
is useful when validating that a change on a server correctly works
before enabling the server again.
Some message pointers were not usable once the message reached the
HTTP_MSG_DONE state. This is the case for ->som which points to the
body because it is needed to parse chunks. There is one case where
we need the beginning of the message : server redirect. We have to
call http_get_path() after the request has been parsed. So we rely
on ->sol without counting on ->som. In order to achieve this, we're
making ->rq.{u,v} relative to the beginning of the message instead
of the buffer. That simplifies the code and makes it cleaner.
Preliminary tests show this is OK.
Supported informations, available via "tr/td title":
- cap: capabilities (proxy)
- mode: one of tcp, http or health (proxy)
- id: SNMP ID (proxy, socket, server)
- IP (socket, server)
- cookie (backend, server)
The rq.u field is relative to buf->data, not to msg->sol. We have
to subtract msg->som everywhere this error was made. Maybe it will
be simpler to have a pointer to the buffer in the message and find
appropriate data there.
When parsing body for URL parameters, we must not consider that
data are available from buf->data but from buf->data + msg->som.
This is not a problem right now but may become with keep-alive.
Now that the HTTP analyser will already have parsed the beginning
of the request body, we don't have to check for transfer-encoding
anymore since we have the current chunk size in hdr_content_len.
These ACLs are used to check the number of active connections on the
frontend, backend or in a backend's queue. The avg_queue returns the
average number of queued connections per server, and for this, divides
the total number of queued connections by the number of alive servers.
The dst_conn ACL has been slightly changed to more reflect its name and
original usage, which is to return the number of connections on the
destination address/port (the socket) and not the whole frontend.
Consistent hashing provides some interesting advantages over common
hashing. It avoids full redistribution in case of a server failure,
or when expanding the farm. This has a cost however, the hashing is
far from being perfect, as we associate a server to a request by
searching the server with the closest key in a tree. Since servers
appear multiple times based on their weights, it is recommended to
use weights larger than approximately 10-20 in order to smoothen
the distribution a bit.
In some cases, playing with weights will be the only solution to
make a server appear more often and increase chances of being picked,
so stats are very important with consistent hashing.
In order to indicate the type of hashing, use :
hash-type map-based (default, old one)
hash-type consistent (new one)
Consistent hashing can make sense in a cache farm, in order not
to redistribute everyone when a cache changes state. It could also
probably be used for long sessions such as terminal sessions, though
that has not be attempted yet.
More details on this method of hashing here :
http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
There are a few remaining max values that need to move to counters.
Also, the counters are more often used than some config information,
so get them closer to the other useful struct members for better cache
efficiency.
The "static-rr" is just the old round-robin algorithm. It is still
in use when a hash algorithm is used and the data to hash is not
present, but it was impossible to configure it explicitly. This one
is cheaper in terms of CPU and supports unlimited numbers of servers,
so it makes sense to be able to use it.
LB algo macros were composed of the LB algo by itself without any indication
of the method to use to look up a server (the lb function itself). This
method was implied by the LB algo, which was not very convenient to add
more algorithms. Now we have several fields in the LB macros, some to
describe what to look for in the requests, some to describe how to transform
that (kind of algo) and some to describe what lookup function to use.
The next patch will make it possible to factor out some code for all algos
which rely on a map.
We need to remove hash map accesses out of backend.c if we want to
later support new hash methods. This patch separates the hash computation
method from the server lookup. It leaves the lookup function to lb_map.c
and calls it with the result of the hash.
It was becoming painful to have all the LB algos in backend.c.
Let's move them to their own files. A few hashing functions still
need be broken in two parts, one for the contents and one for the
map position.
There is no reason to inline functions which are used to grab a server
depending on an LB algo. They are large and used at several places.
Uninlining them saves 400 bytes of code.
Please consider the following patches. They are required to
compile haproxy-1.4-dev2 on FreeBSD.
Summary:
1) include <sys/types.h> before <netinet/tcp.h>
2) Use IPPROTO_TCP instead of SOL_TCP
(they are both defined as 6, TCP protocol number)
The connection establishment was completely handled by backend.c which
normally just handles LB algos. Since it's purely TCP, it must move to
proto_tcp.c. Also, instead of calling it directly, we now call it via
the stream interface, which will later help us unify session handling.
Andrew Azarov reported that haproxy-1.4-dev1 does not build
under FreeBSD 7.2 because SOL_TCP is not defined. So add a
check for its definition before using it. This only impacts
network optimisations anyway.
This Linux-specific option was never really used in production and
has since been superseded by new splicing options brought by recent
Linux kernels.
It caused several particular cases in the code because the kernel
would take care of the session without haproxy being able to do
anything on it, which became hard to handle in the new architecture.
Let's simply get rid of it now that there is a replacement available.
This patch adds support for hashing RDP cookies in order to
use them as a load-balancing key. The new "rdp-cookie(name)"
load-balancing metric has to be used for this. It is still
mandatory to wait for an RDP cookie in the frontend, otherwise
it will randomly work.
The splice code did not consider compatibility between both ends
of the connection. Now we set different capabilities on each
stream interface, depending on what the protocol can splice to/from.
Right now, only TCP is supported. Thanks to this, we're now able to
automatically detect when splice() is not implemented and automatically
disable it on one end instead of reporting errors to the upper layer.
This new option enables combining of request buffer data with
the initial ACK of an outgoing TCP connection. Doing so saves
one packet per connection which is quite noticeable on workloads
mostly consisting in small objects. The option is not enabled by
default.
Setting TCP_CORK on a socket before sending the last segment enables
automatic merging of this segment with the FIN from the shutdown()
call. Playing with TCP_CORK is not easy though as we have to track
the status of the TCP_NODELAY flag since both are mutually exclusive.
Doing so saves one more packet per session and offers about 5% more
performance.
There is no reason not to do it, so there is no associated option.
Some users are already hitting the 64k source port limit when
connecting to servers. The system usually maintains a list of
unused source ports, regardless of the source IP they're bound
to. So in order to go beyond the 64k concurrent connections, we
have to manage the source ip:port lists ourselves.
The solution consists in assigning a source port range to each
server and use a free port in that range when connecting to that
server, either for a proxied connection or for a health check.
The port must then be put back into the server's range when the
connection is closed.
This mechanism is used only when a port range is specified on
a server. It makes it possible to reach 64k connections per
server, possibly all from the same IP address. Right now it
should be more than enough even for huge deployments.
There is a patch made by me that allow for balancing on any http header
field.
[WT:
made minor changes:
- turned 'balance header name' into 'balance hdr(name)' to match more
closely the ACL syntax for easier future convergence
- renamed the proxy structure fields header_* => hh_*
- made it possible to use the domain name reduction to any header, not
only "host" since it makes sense to do it with other ones.
Otherwise patch looks good.
/WT]
The connect timeout was not properly detected due to the fact that
it was not correctly initialized. It must be set as the stream interface
timeout, not the buffer's write timeout.
These new ACLs match frontend session rate and backend session rate.
Examples are provided in the doc to explain how to use that in order
to limit abuse of service.
With this change, all frontends, backends, and servers maintain a session
counter and a timer to compute a session rate over the last second. This
value will be very useful because it varies instantly and can be used to
check thresholds. This value is also reported in the stats in a new "rate"
column.
Specifying "interface <name>" after the "source" statement allows
one to bind to a specific interface for proxy<->server traffic.
This makes it possible to use multiple links to reach multiple
servers, and to force traffic to pass via an interface different
from the one the system would have chosen based on the routing
table.
Setting "nosplice" in the global section will disable the use of TCP
splicing (both tcpsplice and linux 2.6 splice). The same will be
achieved using the "-dS" parameter on the command line.
In the buffers, the read limit used to leave some place for header
rewriting was set by a pointer to the end of the buffer. Not only
this required subtracts at every place in the code, but this will
also soon not be usable anymore when we want to support keepalive.
Let's replace this with a length limit, comparable to the buffer's
length. This has also sightly reduced the code size.
"option transparent" was set and checked on frontends only while it
is purely a backend thing as it replaces the "balance" mode. For this
reason, it did only work in "listen" sections. This change will then
not affect the rare users of this option.
I'm in the process of setting up one haproxy instance now, and I find
the following acl option useful. I'm not too sure why this option has
not been available before, but I find this useful for my own usage, so
I'm submitting this patch in the hope that it will be useful as well.
The basic idea is to be able to measure the available connection slots
still available (connection, + queue) - anything beyond that can be
redirected to a different backend. 'connslots' = number of available
server connection slots, + number of available server queue slots. In
the case where we encounter srv maxconn = 0, or srv maxqueue = 0 (in
which case we dont need to care about connslots) the value you get is
-1. Note also that this code does not take care of dynamic connections
at this point in time.
The reason why I'm using this new acl (as opposed to 'nbsrv') is that
'nbsrv' only measures servers that are actually *down*. Whereas this
other acl is more fine-grained, and looks into the number of conn
slots available as well.
The new function looks like the previous one except that it operates
at the stream interface level and assumes an already closed SI.
Also remove some old unused occurrences of srv_close_with_err().
It is quite hard to track when the current session has already been counted
or discounted from the server's total number of established sessions. For
this reason, we introduce a new session flag, SN_CURR_SESS, which indicates
if the current session is one of those reported by the server or not. It
simplifies session accounting and makes it far more robust. It also makes
it possible to perform a last-minute cleanup during session_free().
Right now, with this fix and a few more buffer transitions fixes, no session
were found to remain after a test.
The connection setup code has been refactored in order to
make it run only on low level (stream interface). Several
complicated functions have been removed from backend.c,
and we now have sess_update_stream_int() to manage
an assigned connection, sess_prepare_conn_req() to assign a
server to a connection request, perform_http_redirect() to
redirect instead of connecting to server, and return_srv_error()
to return connection error status messages.
The stream_interface status changes are checked before adjusting
buffer flags, so that the buffers can be informed about this lower
level update.
A new connection is initiated by changing si->state from SI_ST_INI
to SI_ST_REQ.
The code seems to work but is awfully dirty. Some functions need
to be moved, and the layering is not yet quite clear.
A lot of dead old code has simply been removed.
It was not practical to have QUEUE and TAR timers in buffers, as they caused
triggering of the timeout flags. Move them to the stream interface where they
belong.
As of now, a stream socket does not directly wake up the task
but it does contact the stream interface which itself knows the
task. This allows us to perform a few cleanups upon errors and
shutdowns, which reduces the number of calls to data_update()
from 8 per session to 2 per session, and make all the functions
called in the process_session() loop completely swappable.
Some improvements are required. We need to provide a shutw()
function on stream interfaces so that one side which closes
its read part on an empty buffer can propagate the close to
the remote side.
srv_state has been removed from HTTP state machines, and states
have been split in either TCP states or analyzers. For instance,
the TARPIT state has just become a simple analyzer.
New flags have been added to the struct buffer to compensate this.
The high-level stream processors sometimes need to force a disconnection
without touching a file-descriptor (eg: report an error). But if
they touched BF_SHUTW or BF_SHUTR, the file descriptor would not
be closed. Thus, the two SHUT?_NOW flags have been added so that
an application can request a forced close which the stream interface
will be forced to obey.
During this change, a new BF_HIJACK flag was added. It will
be used for data generation, eg during a stats dump. It
prevents the producer on a buffer from sending data into it.
BF_SHUTR_NOW /* the producer must shut down for reads ASAP */
BF_SHUTW_NOW /* the consumer must shut down for writes ASAP */
BF_HIJACK /* the producer is temporarily replaced */
BF_SHUTW_NOW has precedence over BF_HIJACK. BF_HIJACK has
precedence over BF_MAY_FORWARD (so that it does not need it).
New functions buffer_shutr_now(), buffer_shutw_now(), buffer_abort()
are provided to manipulate BF_SHUT* flags.
A new type "stream_interface" has been added to describe both
sides of a buffer. A stream interface has states and error
reporting. The session now has two stream interfaces (one per
side). Each buffer has stream_interface pointers to both
consumer and producer sides.
The server-side file descriptor has moved to its stream interface,
so that even the buffer has access to it.
process_srv() has been split into three parts :
- tcp_get_connection() obtains a connection to the server
- tcp_connection_failed() tests if a previously attempted
connection has succeeded or not.
- process_srv_data() only manages the data phase, and in
this sense should be roughly equivalent to process_cli.
Little code has been removed, and a lot of old code has been
left in comments for now.
In backend.c, we had an EV_FD_SET() called before fd_insert().
This is wrong because fd_insert updates maxfd which might be
used by some of the pollers during EV_FD_SET(), although this
is not currently the case.
It's a shame not to use buffer->wex for connection timeouts since by
definition it cannot be used till the connection is not established.
Using it instead of ->cex also makes the buffer processing more
symmetric.
The SV_STANALYZE state was installed on the server side but was really
meant to be processed with the rest of the request on the client side.
It suffered from several issues, mostly related to the way timeouts were
handled while waiting for data.
All known issues related to timeouts during a request - and specifically
a request involving body processing - have been raised and fixed. At this
point, the code is a bit dirty but works fine, so next steps might be
cleanups with an ability to come back to the current state in case of
trouble.
If an HTTP/0.9-like POST request is sent to haproxy while
configured with url_param + check_post, it will crash. The
reason is that the total buffer length was computed based
on req->total (which equals the number of bytes read) and
not req->l (number of bytes in the buffer), thus leading
to wrong size calculations when calling memchr().
The affected code does not look like it could have been
exploited to run arbitrary code, only reads were performed
at wrong locations.
All currently known ACL verbs have been assigned a type which makes
it possible to detect inconsistencies, such as response values used
in request rules.
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.
Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.
The dequeuing logic was completely wrong. First, a task was assigned
to all servers to process the queue, but this task was never scheduled
and was only woken up on session free. Second, there was no reservation
of server entries when a task was assigned a server. This means that
as long as the task was not connected to the server, its presence was
not accounted for. This was causing trouble when detecting whether or
not a server had reached maxconn. Third, during a redispatch, a session
could lose its place at the server's and get blocked because another
session at the same moment would have stolen the entry. Fourth, the
redispatch option did not work when maxqueue was reached for a server,
and it was not possible to do so without indefinitely hanging a session.
The root cause of all those problems was the lack of pre-reservation of
connections at the server's, and the lack of tracking of servers during
a redispatch. Everything relied on combinations of flags which could
appear similarly in quite distinct situations.
This patch is a major rework but there was no other solution, as the
internal logic was deeply flawed. The resulting code is cleaner, more
understandable, uses less magics and is overall more robust.
As an added bonus, "option redispatch" now works when maxqueue has
been reached on a server.
This patch adds two optional arguments "len" and "depth" to
"balance uri". They are used to limit the length in characters
of the analysis, as well as the number of directory components
it applies to.
This patch extends the "url_param" load balancing method by introducing
the "check_post" option. Using this option enables analysis of the beginning
of POST requests to search for the specified URL parameter.
The patch also fixes a few minor typos in comments that were discovered
during code review.
The new "leastconn" LB algorithm selects the server which has the
least established or pending connections. The weights are considered,
so that a server with a weight of 20 will get twice as many connections
as the server with a weight of 10.
The algorithm respects the minconn/maxconn settings, as well as the
slowstart since it is a dynamic algorithm. It also correctly supports
backup servers (one and all).
It is generally suited for protocols with long sessions (such as remote
terminals and databases), as it will ensure that upon restart, a server
with no connection will take all new ones until its load is balanced
with others.
A test configuration has been added in order to ease regression testing.
Commit 3168223a7b broke option
"allbackups" in roundrobin mode due to an erroneous structure
member replacement in backend.c. The PR_O_USE_ALL_BK flag was
not tested in the right member anymore.
This bug uncoverred another one, by which all backup servers would
be used whatever the option's value, if all of them had been seen
as simultaneously failed at one moment.
This patch fixes the two stupid errors. Correctness has been tested
using the test-fwrr.cfg config example.
When haproxy decides that session needs to be redispatched it chose a server,
but there is no guarantee for it to be a different one. So, it often
happens that selected server is exactly the same that it was previously, so
a client ends up with a 503 error anyway, especially when one sever has
much bigger weight than others.
Changes from the previous version:
- drop stupid and unnecessary SN_DIRECT changes
- assign_server(): use srvtoavoid to keep the old server and clear s->srv
so SRV_STATUS_NOSRV guarantees that t->srv == NULL (again)
and get_server_rr_with_conns has chances to work (previously
we were passing a NULL here)
- srv_redispatch_connect(): remove t->srv->cum_sess and t->srv->failed_conns
incrementing as t->srv was guaranteed to be NULL
- add avoididx to get_server_rr_with_conns. I hope I correctly understand this code.
- fix http_flush_cookie_flags() and move it to assign_server_and_queue()
directly. The code here was supposed to set CK_DOWN and clear CK_VALID,
but: (TX_CK_VALID | TX_CK_DOWN) == TX_CK_VALID == TX_CK_MASK so:
if ((txn->flags & TX_CK_MASK) == TX_CK_VALID)
txn->flags ^= (TX_CK_VALID | TX_CK_DOWN);
was really a:
if ((txn->flags & TX_CK_MASK) == TX_CK_VALID)
txn->flags &= TX_CK_VALID
Now haproxy logs "--DI" after redispatching connection.
- defer srv->redispatches++ and s->be->redispatches++ so there
are called only if a conenction was redispatched, not only
supposed to.
- don't increment lbconn if redispatcher selected the same sarver
- don't count unsuccessfully redispatched connections as redispatched
connections
- don't count redispatched connections as errors, so:
- the number of connections effectively served by a server is:
srv->cum_sess - srv->failed_conns - srv->retries - srv->redispatches
and
SUM(servers->failed_conns) == be->failed_conns
- requires the "Don't increment server connections too much + fix retries" patch
- needs little more testing and probably some discussion so reverting to the RFC state
Tests #1:
retries 4
redispatch
i) 1 server(s): b (wght=1, down)
b) sessions=5, lbtot=1, err_conn=1, retr=4, redis=0
-> request failed
ii) server(s): b (wght=1, down), u (wght=1, down)
b) sessions=4, lbtot=1, err_conn=0, retr=3, redis=1
u) sessions=1, lbtot=1, err_conn=1, retr=0, redis=0
-> request FAILED
iii) 2 server(s): b (wght=1, down), u (wght=1, up)
b) sessions=4, lbtot=1, err_conn=0, retr=3, redis=1
u) sessions=1, lbtot=1, err_conn=0, retr=0, redis=0
-> request OK
iv) 2 server(s): b (wght=100, down), u (wght=1, up)
b) sessions=4, lbtot=1, err_conn=0, retr=3, redis=1
u) sessions=1, lbtot=1, err_conn=0, retr=0, redis=0
-> request OK
v) 1 server(s): b (down for first 4 SYNS)
b) sessions=5, lbtot=1, err_conn=0, retr=4, redis=0
-> request OK
Tests #2:
retries 4
i) 1 server(s): b (down)
b) sessions=5, lbtot=1, err_conn=1, retr=4, redis=0
-> request FAILED
Commit 98937b8757 while fixing
one bug introduced another one. With "retries 4" and
"option redispatch" haproxy tries to connect 4 times to
one server server and 1 time to a second one. However
logs showed 5 connections to the first server (the
last one was counted twice) and 2 to the second.
This patch also fixes srv->retries and be->retries increments.
Now I get: 3 retries and 1 error in a first server (4 cum_sess)
and 1 error in a second server (1 cum_sess) with:
retries 4
option redispatch
and: 4 retries and 1 error (5 cum_sess) with:
retries 4
So, the number of connections effectively served by a server is:
srv->cum_sess - srv->failed_conns - srv->retries
GCC4 is stupid (unbelievable news!).
When some code uses __builtin_expect(x != 0, 1), it really performs
the check of x != 0 then tests that the result is not zero! This is
a double check when only one was expected. Some performance drops
of 10% in the HTTP parser code have been observed due to this bug.
GCC 3.4 is fine though.
A solution consists in expecting that the tested value is 1. In
this case, it emits the correct code, but it's still not optimal
it seems. Finally the best solution is to ignore likely() and to
pray for the compiler to emit correct code. However, we still have
to fix unlikely() to remove the test there too, and to fix all
code which passed pointers overthere to pass integers instead.
Now when a server has "redir <prefix>" on its config line, any HEAD or GET
request addressing it will lead to a 302 with Location set to "<prefix>"
immediately followed by the relative URI of the incoming request. This makes
it very easy to send redirect to browsers to check remote static servers, as
well as to provide redirection for remote sites when the local one is down.
There's no point trying to check original dest addr with only one
method when doing transparent proxy as in full transparent mode,
the real destination address is required. Let's copy the one from
the frontend.
The source address selection for health checks did not consider
the new transparent proxy method. Rely on the same unified function
as the other connect() calls.
This patch also fixes a bug by which the proxy's source address was
ignored if cttproxy was used.
Balabit's TPROXY version 4 which replaces CTTPROXY provides a similar
API to the previous proxy, but relies on IP_FREEBIND instead of
IP_TRANSPARENT. Let's add it.
Using some Linux kernel patches which add the IP_TRANSPARENT
SOL_IP option , it is possible to bind to a non-local address
on without having resort to any sort of NAT, thus causing no
performance degradation.
This is by far faster and cleaner than the previous CTTPROXY
method. The code has been slightly changed in order to remain
compatible with CTTPROXY as a fallback for the new method when
it does not work.
It is not needed anymore to specify the outgoing source address
for connect, it can remain 0.0.0.0.
This patch extends a little previously added functionality to also
count retries and redispatches for servers. Now it is possible to know
which server causes redispatches as it is not always the same that takes
most retries.
While working with the code I found that redistribute_pending() does not increment
srv->redispatches && be->redispatches. I don't know how to test it but
I think the fix is correct. If not I can withdraw it.
I also extended logs to show how many retries were done and if redispatching
was necessary ('+'). I'm using an additional session flag SN_REDISP to match
redispatched connections. I had to rearrange all defines in session.h to make
more room for it.
The documentation about logs was also fixed a little (sorry, english only),
as current version uses totally different format. BTW: examples are still
outdated, maybe next time...
Finally, I changed %d -> %u for retries/redispatches as those variables
are declared as unsigned.
It was abnormal to see more connect errors than connect attempts.
This was caused by the fact that the server's connection count was
not incremented for failed connect() attempts.
Now the per-server connections are correctly incremented for each
connect() attempt. This includes the retries too. The number of
connections effectively served by a server will then be :
srv->cum_sess - srv->errors - srv->warnings
One user reported that an indicator was missing in the statistics:
the number of times each server was selected by load balancing. It
is in fact the total number of sessions assigned to a server by the
load balancing algorithm. It should directly reflect the weight for
"fair" algorithms such as round-robin, since it will not account for
persistant connections.
It should help a lot tuning each server's weight depending on the
load it receives.
Now the connect timeout, tarpit timeout and queue timeout are
distinct. In order to retain compatibility with older versions,
if either queue or tarpit is left unset both in the proxy and
in the default proxy, then it is inherited from the connect
timeout as before.
Now we can compute the max place depending on the number of servers,
maximum weight and weight scale. The formula has been stored as a
comment so that it's easy to choose between smooth weight ramp up
and high number of servers. The default scale has been set to 16,
which permits 4000 servers with a granularity of 6% in the worst
case (weight=1).
The new "nbsrv" ACL verb matches the number of active servers in a backend.
By default, it applies to the backend where it is declared, but optionally
it can receive the name of another backend as an argument in parenthesis.
It counts the number of enabled active servers first, then the number of
enabled backup servers.
It's not always obvious for the callers of set_server_status_{up,down}
whether the new state really is up or down. Some flags as well as the
effective weight have to be considered. Let's ensure that those functions
perform the necessary check themselves so that if the state transition
cannot be performed, at least everything is updated as required.
When an HTTP server returns "404 not found", it indicates that at least
part of it is still running. For this reason, it can be convenient for
application administrators to be able to consider code 404 as valid,
but for a server which does not want to participate to load balancing
anymore. This is useful to seamlessly exclude a server from a farm
without acting on the load balancer. For instance, let's consider that
haproxy checks for the "/alive" file. To enable load balancing on a
server, the admin would simply do :
# touch /var/www/alive
And to disable the server, he would simply do :
# rm /var/www/alive
Another immediate gain from doing this is that it is now possible to
send NOTICE messages instead of ALERT messages when a server is first
disable, then goes down. This provides a graceful shutdown method.
To enable this behaviour, specify "http-check disable-on-404" in the
backend.
Hello,
You will find attached an updated release of previously submitted patch.
It polish some part and extend ACL engine to match IP and PORT parsed in
HTTP request. (and take care of comments made by Willy ! ;))
Best regards,
Alexandre
The number of possible options for a proxy has already reached
32, which is the current limit due to the fact that they are
each represented as a bit in a 32-bit word.
It's possible to move the load balancing algorithms to another
place. It will also save some space for future algorithms.
This round robin algorithm was written from trees, so that we
do not have to recompute any table when changing server weights.
This solution allows on-the-fly weight adjustments with immediate
effect on the load distribution.
There is still a limitation due to 32-bit computations, to about
2000 servers at full scale (weight 255), or more servers with
lower weights. Basically, sum(srv.weight)*4096 must be below 2^31.
Test configurations and an example program used to develop the
tree will be added next.
Many changes have been brought to the weights computations and
variables in order to accomodate for the possiblity of a server to
be running but disabled from load balancing due to a null weight.
Under some circumstances, it will be useful to be able to have
a server's effective weight bigger than the user weight, and this
is particularly true for dynamic weight-based algorithms. In order
to support this, we add a "wdiv" member to the lbprm structure
which will always be used to divide the weights before reporting
them.
Since the introduction of server weights, all load balancing algorithms
relied on a pre-computed map. Incidently, quite a bunch of map-specific
parameters were used at random places in order to get the number of
servers or their total weight. It was not architecturally acceptable
that optimizations for the map computation had impact on external parts.
For instance, during this cleanup it was found that a backend weight was
seen as 1 when only the first backup server is used, whatever its weight.
This cleanup consists in differentiating between LB-generic parameters,
such as total weights, number of servers, etc... and map-specific ones.
The struct proxy has been enhanced in order to make it easier to later
support other algorithms. The recount_servers() function now also
updates generic values such as total weights so that it's not needed
anymore to call recalc_server_map() when weights are needed. This
permitted to simplify some code which does not need to know about map
internals anymore.
Some applications do not have a strict persistence requirement, yet
it is still desirable for performance considerations, due to local
caches on the servers. For some reasons, there are some applications
which cannot rely on cookies, and for which the last resort is to use
a parameter passed in the URL.
The new 'url_param' balance method is there to solve this issue. It
accepts a parameter name which is looked up from the URL and which
is then hashed to select a server. If the parameter is not found,
then the round robin algorithm is used in order to provide a normal
load balancing across the servers for the first requests. It would
have been possible to use a source IP hash instead, but since such
applications are generally buried behind multiple levels of
reverse-proxies, it would not provide a good balance.
The doc has been updated, and two regression testing configurations
have been added.
A new function "backend_parse_balance" has been created in backend.c,
which is dedicated to the parsing of the "balance" keyword. It will
provide easier methods for adding new algorithms.
In preparation for newer balance algorithms, group the
sparse PR_O_BALANCE_* values into layer4 and layer7-based
algorithms. This will ease addition of newer algorithms.
This patch adds the "maxqueue" parameter to the server. This allows new
sessions to be immediately rebalanced when the server's queue is filled.
It's useful when session stickiness is just a performance boost (even a
huge one) but not a requirement.
This should only be used if session affinity isn't a hard functional
requirement but provides performance boost by keeping server-local
caches hot and compact).
Absence of 'maxqueue' option means unlimited queue. When queue gets filled
up to 'maxqueue' client session is moved from server-local queue to a global
one.
Hello,
This patch implements new statistics for SLA calculation by adding new
field 'Dwntime' with total down time since restart (both HTTP/CSV) and
extending status field (HTTP) or inserting a new one (CSV) with time
showing how long each server/backend is in a current state. Additionaly,
down transations are also calculated and displayed for backends, so it is
possible to know how many times selected backend was down, generating "No
server is available to handle this request." error.
New information are presentetd in two different ways:
- for HTTP: a "human redable form", one of "100000d 23h", "23h 59m" or
"59m 59s"
- for CSV: seconds
I believe that seconds resolution is enough.
As there are more columns in the status page I decided to shrink some
names to make more space:
- Weight -> Wght
- Check -> Chk
- Down -> Dwn
Making described changes I also made some improvements and fixed some
small bugs:
- don't increment s->health above 's->rise + s->fall - 1'. Previously it
was incremented an then (re)set to 's->rise + s->fall - 1'.
- do not set server down if it is down already
- do not set server up if it is up already
- fix colspan in multiple places (mostly introduced by my previous patch)
- add missing "status" header to CSV
- fix order of retries/redispatches in server (CSV)
- s/Tthen/Then/
- s/server/backend/ in DATA_ST_PX_BE (dumpstats.c)
Changes from previous version:
- deal with negative time intervales
- don't relay on s->state (SRV_RUNNING)
- little reworked human_time + compacted format (no spaces). If needed it
can be used in the future for other purposes by optionally making "cnt"
as an argument
- leave set_server_down mostly unchanged
- only little reworked "process_chk: 9"
- additional fields in CSV are appended to the rigth
- fix "SEC" macro
- named arguments (human_time, be_downtime, srv_downtime)
Hope it is OK. If there are only cosmetic changes needed please fill free
to correct it, however if there are some bigger changes required I would
like to discuss it first or at last to know what exactly was changed
especially since I already put this patch into my production server. :)
Thank you,
Best regards,
Krzysztof Oledzki