DOC: internals: add documentation about the master worker

Add documentation about the history of the master-worker mode, how it
was implemented in its first version, and how it currently works. This
is a global view of the architecture, not an exhaustive explanation of
every mechanism.
doc/internals/mworker.md
# Master Worker
2024-06-12
## History
### haproxy-systemd-wrapper
Back in 2013, distributions were discussing the adoption of systemd as the
default init; this was controversial, but Fedora and Arch Linux were already
using it.
At that time HAProxy still had a multi-process model, and its daemon mode was
incompatible with the way systemd supervises services.
Systemd is compatible with traditional forking services, but HAProxy is
different. To work correctly, systemd needs a main PID: the PID of the process
that systemd will supervise.
With `nbproc 1` that could work, since systemd is able to guess the main PID,
and even to read a PID file. But HAProxy does something uncommon on reload that
systemd does not support: the reload is in fact a new haproxy process, which
asks the old one to leave. This means the main PID is supposed to change, but
systemd does not support this; it just sees the previous process leaving,
considers that the service broke, and kills every other process, including the
new haproxy.
With `nbproc > 1` it is worse: systemd is confused by all the processes,
because they are independent, so there is no real main process to supervise.
The systemd-wrapper appeared in HAProxy 1.5. It is a separate binary which
starts haproxy, so systemd can use the wrapper as the main PID, and the wrapper
never changes PID. Upon a reload, which is triggered by a SIGUSR2 signal, the
wrapper launches a `haproxy -sf`. This was non-intrusive work and a first step
toward deploying in systemd environments. Later contributions added support for
upgrading the wrapper binary upon a reload.
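The principle can be sketched in a few lines of C. This is a minimal
illustration of the wrapper idea, not the actual haproxy-systemd-wrapper code:
stay in the foreground as systemd's main PID, and on SIGUSR2 re-spawn haproxy
with `-sf`.

```c
/* Minimal sketch of the systemd-wrapper principle (not the real code):
 * the wrapper stays in the foreground as the main PID for systemd, and
 * on SIGUSR2 re-spawns "haproxy -Ds -sf <old pid>" to reload under it. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t reload_requested;

static void on_sigusr2(int sig)
{
	(void)sig;
	reload_requested = 1;
}

static pid_t spawn_haproxy(const char *oldpid)
{
	pid_t pid = fork();

	if (pid == 0) { /* child: becomes the intermediate "-Ds" process */
		if (oldpid)
			execlp("haproxy", "haproxy", "-Ds", "-sf", oldpid, (char *)NULL);
		else
			execlp("haproxy", "haproxy", "-Ds", (char *)NULL);
		exit(1); /* only reached when the exec fails */
	}
	return pid;
}

int main(void)
{
	char oldpid[16];
	pid_t child;

	signal(SIGUSR2, on_sigusr2);
	child = spawn_haproxy(NULL);

	for (;;) {
		pause(); /* systemd supervises *this* PID, which never changes */
		if (reload_requested) {
			reload_requested = 0;
			snprintf(oldpid, sizeof(oldpid), "%d", (int)child);
			child = spawn_haproxy(oldpid); /* old instance soft-stops */
		}
	}
}
```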
However the wrapper suffered from several problems:
- It needed an intermediate haproxy process: this is basically a daemon mode,
but instead of the first process leaving to daemonize, it is kept in the
foreground to waitpid() on all the workers. This means you need the wrapper +
the `-Ds` process + the haproxy workers, and each reload starts a new `-Ds`
process.
- It was difficult to integrate new features, since it wasn't in haproxy itself.
- There were multiple issues with handling failures during a reload.
### mworker V1
HAProxy 1.8 got rid of the wrapper, which was replaced by the master-worker
mode. This first version was basically a reintegration of the wrapper features
within HAProxy. HAProxy is launched with the -W flag, reads the configuration,
and then forks. In mworker mode, the master is usually launched as a root
process; the chroot and setuid operations are done in the workers.
Like the wrapper, the master handles the SIGUSR2 signal to reload; it is also
able to forward the SIGUSR1 signal to the workers, to ask for a soft stop.
The reload uses the same logic as the standard `-sf` method, but instead of
starting a new process, it exec()s with -sf in the same PID, which means that
haproxy can upgrade its binary during the reload.
Once the SIGUSR2 signal is received, the master blocks signals and unregisters
the signal handlers so that no signal can halt the reload: a SIGUSR2 received
after the exec, before the handler is registered again, could kill the master.
When doing the exec() upon a reload, a new argv array is constructed by copying
the current argv and appending `-sf` followed by the PIDs of the children list,
as well as the oldpids list.
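A sketch of that argv reconstruction follows. It is a hypothetical
simplification: the real code in haproxy handles many more options as well as
the oldpids list.

```c
/* Sketch: before the re-exec of a reload, copy the original argv and
 * append "-sf" plus the PIDs of the current children, then execv() in
 * place so the PID stays the same. Hypothetical simplification. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void reexec_with_sf(int old_argc, char **old_argv,
                           const pid_t *children, int nb_children)
{
	/* old args + "-sf" + one slot per PID + terminating NULL */
	char **argv = calloc(old_argc + nb_children + 2, sizeof(*argv));
	int i, next = 0;

	for (i = 0; i < old_argc; i++)
		argv[next++] = old_argv[i];
	argv[next++] = "-sf";

	for (i = 0; i < nb_children; i++) {
		char *pid = malloc(16);

		snprintf(pid, 16, "%d", (int)children[i]);
		argv[next++] = pid;
	}

	execv(old_argv[0], argv); /* same PID, possibly a new binary */
	perror("execv");          /* only reached when the exec fails */
}
```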
Once the workers are started, the master first deinits the poller and closes
the FDs that are not needed anymore (inherited FDs need to be kept, however);
then, instead of the haproxy polling loop, the master runs a wait() loop which
waits for its workers to leave or for a signal.
When reloading haproxy, a non-working configuration could exit the master,
which would end up killing all the previous workers. This is a complex
situation to handle, since the configuration parsing code was not written to
keep a process alive upon a failure. To handle this problem, an atexit()
callback was used: upon a configuration loading failure, haproxy reexec()s
itself without loading any configuration and without trying to fork new
workers. This is called the master-worker "wait mode".
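A minimal sketch of that safety net, assuming an illustrative
`HAPROXY_MWORKER_WAIT_ONLY`-style environment flag (the real implementation
differs in its details):

```c
/* Sketch of the "wait mode" safety net: if the process exits while
 * (re)loading the configuration, an atexit() callback re-execs the
 * master with an environment flag telling it to skip configuration
 * loading and just supervise the existing workers. Names illustrative. */
#include <stdlib.h>
#include <unistd.h>

extern char **environ;
static char *saved_binary;
static char **saved_argv;
static int loading_failed;

static void reexec_in_wait_mode(void)
{
	if (!loading_failed)
		return;
	setenv("HAPROXY_MWORKER_WAIT_ONLY", "1", 1);
	execve(saved_binary, saved_argv, environ); /* same PID, no config parsing */
}

int main(int argc, char **argv)
{
	saved_binary = argv[0];
	saved_argv = argv;
	atexit(reexec_in_wait_mode);

	if (getenv("HAPROXY_MWORKER_WAIT_ONLY")) {
		/* wait mode: skip parsing, only wait for existing workers */
	} else {
		/* parse the configuration; on a fatal error, set
		 * loading_failed and exit(), which triggers the callback */
	}
	return 0;
}
```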
The master-worker mode also comes with a feature which automates the seamless
reload (-x): it selects a stats socket from the configuration and adds it as
the -x parameter for the next reload, so the FDs of the binds can be retrieved
automatically.
The master supervises the workers: when a current worker (not a previous one
from before the reload) exits without having been asked to, the master emits an
"exit-on-failure" error, kills every worker with a SIGTERM, and exits with the
same error code as the failed worker. This behavior can be changed by using the
"no exit-on-failure" option in the global section.
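Put together, the supervision side of the master can be pictured like this (a
minimal sketch of the idea, not the actual mworker_wait() code):

```c
/* Sketch of the V1 supervision loop: instead of the polling loop, the
 * master sits in waitpid() and reacts to worker exits. If a *current*
 * worker dies without having been asked to, kill the others and exit
 * with the same code ("exit-on-failure"). Minimal illustration only. */
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAX_WORKERS 8

static pid_t current_workers[MAX_WORKERS]; /* workers of this generation */

static int is_current_worker(pid_t pid)
{
	for (int i = 0; i < MAX_WORKERS; i++)
		if (current_workers[i] == pid)
			return 1;
	return 0;
}

static void mworker_wait_sketch(void)
{
	int status;
	pid_t pid;

	while ((pid = waitpid(-1, &status, 0)) > 0) {
		if (!is_current_worker(pid))
			continue; /* a pre-reload worker finally left */

		/* exit-on-failure: a current worker died unexpectedly,
		 * terminate the remaining ones and propagate its code */
		for (int i = 0; i < MAX_WORKERS; i++)
			if (current_workers[i] && current_workers[i] != pid)
				kill(current_workers[i], SIGTERM);
		exit(WIFEXITED(status) ? WEXITSTATUS(status) : EXIT_FAILURE);
	}
}
```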
While the master supervises the workers using the wait() function, the workers
also supervise the master. To achieve this, there is a pipe between the master
and the workers. The FD of the worker side of the pipe is inserted in the
poller so it can watch for a close. When the pipe is closed, it means the
master left, which is not supposed to happen (it could have crashed), so all
the workers leave. To survive the reloads of the master, the FDs are saved in
environment variables (HAPROXY_MWORKER_PIPE_{RD,WR}).
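On the worker side, this boils down to watching the read end of the pipe, as in
the minimal sketch below (in haproxy the FD sits in the regular poller rather
than a blocking poll()):

```c
/* Sketch of the worker-side master supervision in V1: watch the read
 * end of the master pipe; EOF means the master is gone, so the worker
 * leaves too. Minimal illustration, not the real haproxy code. */
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

static void watch_master_pipe(void)
{
	/* FD inherited across exec and published by the master */
	const char *rd = getenv("HAPROXY_MWORKER_PIPE_RD");
	int fd = rd ? atoi(rd) : -1;
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	char c;

	poll(&pfd, 1, -1); /* wakes up when the pipe closes */
	if (read(fd, &c, 1) == 0) {
		/* EOF: the write side vanished with the master */
		exit(EXIT_FAILURE);
	}
}
```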
The master-worker mode can be activated by using either "-W" on the command
line or "master-worker" in the global section of the configuration, but "-W"
is preferred.
The pidfile is usable in master-worker mode; instead of containing the PIDs of
all the workers, it only contains the PID of the master.
A systemd mode (-Ws) can also be used; it behaves the same way as -W, but
keeps the master in the foreground and sends status messages to systemd using
the sd_notify API.
### mworker V2
HAProxy 1.9 went a little further with the master-worker: instead of using the
mworker_wait() function from V1, it uses the haproxy polling loop, so signals
are handled directly by the polling loop, removing the specific code.
Instead of using one pipe per haproxy instance, V2 uses a socketpair per
worker, and the polling loop allows real network communication over these
socketpairs. The master needs to keep one FD per worker, so they can be reused
after a reload. The master keeps a linked list of processes, mworker_proc,
containing the socketpair FDs, the PID, the relative PID, etc. This list is
serialized into the HAPROXY_PROCESSES environment variable, to be deserialized
upon a reload so the FDs can be reinserted in the poller.
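The principle of that serialization can be sketched like this; the
`fd|pid|reloads` encoding below is made up for the illustration, the real
HAPROXY_PROCESSES format differs and carries more fields:

```c
/* Sketch: serialize the process list into an environment variable so
 * the FDs/PIDs survive the exec() of a reload. The encoding here is
 * made up; the real HAPROXY_PROCESSES format is different. */
#include <stdio.h>
#include <stdlib.h>

struct mworker_proc_sketch {
	int ipc_fd;   /* master side of the socketpair, kept across exec */
	int pid;
	int reloads;  /* how many reloads this worker went through */
	struct mworker_proc_sketch *next;
};

static void export_processes(struct mworker_proc_sketch *list)
{
	char buf[4096] = "";
	size_t len = 0;

	for (; list && len < sizeof(buf) - 1; list = list->next)
		len += snprintf(buf + len, sizeof(buf) - len, "%s%d|%d|%d",
		                len ? ";" : "", list->ipc_fd, list->pid,
		                list->reloads);
	setenv("HAPROXY_PROCESSES", buf, 1); /* read back after the exec */
}
```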
Since these FDs are in the poller, there is a special flag on the listeners,
LI_O_WORKER, which specifies that such an FD mustn't be used in the worker;
these FDs are unbound once in the worker.
Meanwhile, thread support was implemented in haproxy. Since mworker shares
more code than before by using the polling loop, the nbthread configuration
variable is not used when instantiating the master, and the master always
remains single-threaded.
The HAPROXY_PROCESSES structures allow storing a lot more things: the number
of reloads each worker went through, the PID, etc.
The socketpairs allow bi-directional communication: each socketpair is
connected to a stats applet on the worker side, so the master can access a
stats socket for each worker.
The master implements a CLI proxy, an analyzer able to parse CLI input, split
it into individual CLI commands, and redirect them to the right worker. This
works like HTTP pipelining, with commands being sent and responded to one
after another. This proxy can be accessed through the master CLI, which is
only bound using the -S option of the haproxy command. Special prefixes using
the @ syntax select the right worker: for example, `@1 show info` sends
`show info` to the first worker.
The master CLI implements its own command set, like `show proc`, which shows
the content of the HAPROXY_PROCESSES structure.
A `reload` command was implemented so a reload can be requested from the
master CLI without using the SIGUSR2 signal.
### more features in mworker V2
HAProxy 2.0 implements a new configuration section called `program`. This
section handles the start and stop of executables alongside the master-worker;
one could launch the Data Plane API from haproxy, for example. The programs
are shown in the `show proc` command and are added to the HAPROXY_PROCESSES
structure. The 'start-on-reload' option configures the behavior of a program
during an haproxy reload: either start a new instance of the program or keep
the previous one.
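A hypothetical `program` section could look like this (the binary, its path
and its arguments are illustrative):

```
program api
    command /usr/bin/dataplaneapi --host 127.0.0.1 --port 5555
    no option start-on-reload
```

With `no option start-on-reload`, the master keeps the previous instance of
the program across reloads instead of starting a new one.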
A `mworker-max-reloads` keyword was added to the global section; it limits the
number of reloads a worker can endure (for example, `mworker-max-reloads 30`),
which helps limiting the number of lingering worker processes. Once a worker
reaches this value, a reload sends it a SIGTERM instead of a SIGUSR1, so any
stuck worker is killed.
The version and the start time were added to HAPROXY_PROCESSES so they can be
displayed by `show proc`.
HAProxy 2.1 added user/group settings to the program section, so programs can
change their uid/gid after the fork.
HAProxy 2.5 added a reexec of haproxy in wait mode after a successful load,
instead of doing it only after a configuration failure. This is useful to
clear the memory of the master, because loading the configuration in the
master can take a lot of RAM, and there is no simple way to free everything
and decrease the memory footprint of the process.
In HAProxy 2.6, the seamless reload with the master-worker changed: instead of
using a stats socket declared in the configuration, it uses the internal
socketpair of the previous worker. The change is actually simple: instead of
passing `-x /path/to/previous/socket`, it passes `-x sockpair@FD` using the FD
number found in HAPROXY_PROCESSES. With this change, the stats socket in the
configuration is less useful and everything can be done from the master CLI.
With 2.7, the reload mechanism of the master CLI evolved. In previous versions
this mechanism was asynchronous: once the `reload` command was received, the
master would reload, the active master CLI connection was closed, and there was
no way to return a status as a response to the `reload` command. To achieve a
synchronous reload, a dedicated socketpair is used: one side is attached to a
master CLI applet, and the other side waits to receive a socket. When the
master CLI receives the `reload` command, it takes the FD of the active master
CLI session, sends it over the socketpair, and then does an exec. The FD is
then held in the kernel during the reload, because the poller is disabled.
Once haproxy has reloaded and the poller is active again, the FD of the master
CLI connection is received, so HAProxy can reply with a success or failure
status for the reload. When built with USE_SHM_OPEN=1, a shm is used to keep
the warnings and errors emitted while loading the configuration in a shared
buffer, so they can survive the reexec in wait mode and then be dumped as a
response to the `reload` command after the status.
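The FD-passing part relies on the classic SCM_RIGHTS mechanism of unix
sockets, which can be sketched as follows (a minimal illustration of the
technique, not the haproxy code itself):

```c
/* Sketch: pass the master CLI connection FD across the reload exec()
 * using a socketpair and SCM_RIGHTS ancillary data. The FD stays
 * queued in the kernel until the re-exec'd master recvmsg()s it once
 * its poller is up again. Minimal illustration only. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sockpair_fd, int fd_to_pass)
{
	char dummy = 'F';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;          /* pass a file descriptor */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	return sendmsg(sockpair_fd, &msg, 0);
}
```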
In 2.9, the master CLI command `hard-reload` was implemented. It works the
same way as the `reload` command, but instead of exec()ing with -sf for a soft
stop, it uses -st to achieve a hard stop of the previous worker.
Version 3.0 got rid of the libsystemd dependency for sd_notify() after the
xz/OpenSSH events; the function is now implemented directly in haproxy, in
src/systemd.c.
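This was made easier by the fact that the sd_notify protocol is tiny: a
datagram such as "READY=1" sent to the unix socket named by the NOTIFY_SOCKET
environment variable. A minimal sketch of the protocol (not the actual
src/systemd.c code):

```c
/* Sketch of the sd_notify protocol: send a datagram such as "READY=1"
 * to the unix socket found in $NOTIFY_SOCKET. Minimal illustration of
 * the protocol, not the actual src/systemd.c implementation. */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int notify_sketch(const char *state)
{
	const char *path = getenv("NOTIFY_SOCKET");
	struct sockaddr_un sun = { .sun_family = AF_UNIX };
	socklen_t len;
	int fd, ret;

	if (!path || strlen(path) >= sizeof(sun.sun_path))
		return -1;
	strcpy(sun.sun_path, path);
	if (sun.sun_path[0] == '@')           /* abstract socket namespace */
		sun.sun_path[0] = '\0';
	len = offsetof(struct sockaddr_un, sun_path) + strlen(path);

	fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
	if (fd < 0)
		return -1;
	ret = sendto(fd, state, strlen(state), 0,
	             (struct sockaddr *)&sun, len);
	close(fd);
	return ret < 0 ? -1 : 0;
}

/* usage: notify_sketch("READY=1") once the workers are started,
 * notify_sketch("RELOADING=1") before a reload. */
```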