When a regex had been succesfully compiled by the JIT pass, it is better
to use the related match, thanksfully having same signature, for better
performance.
Signed-off-by: David Carlier <devnexen@gmail.com>
NetBSD apparently uses macros for tolower/toupper and complains about
the use of char for array subscripts. Let's properly cast all of them
to unsigned char where they are used.
This is needed to fix issue #729.
Most of the files dealing with error reports have to include log.h in order
to access ha_alert(), ha_warning() etc. But while these functions don't
depend on anything, log.h depends on a lot of stuff because it deals with
log-formats and samples. As a result it's impossible not to embark long
dependencies when using ha_warning() or qfprintf().
This patch moves these low-level functions to errors.h, which already
defines the error codes used at the same places. About half of the users
of log.h could be adjusted, sometimes revealing other issues such as
missing tools.h. Interestingly the total preprocessed size shrunk by
4%.
This one was not easy because it was embarking many includes with it,
which other files would automatically find. At least global.h, arg.h
and tools.h were identified. 93 total locations were identified, 8
additional includes had to be added.
In the rare files where it was possible to finalize the sorting of
includes by adjusting only one or two extra lines, it was done. But
all files would need to be rechecked and cleaned up now.
It was the last set of files in types/ and proto/ and these directories
must not be reused anymore.
The current state of the logging is a real mess. The main problem is
that almost all files include log.h just in order to have access to
the alert/warning functions like ha_alert() etc, and don't care about
logs. But log.h also deals with real logging as well as log-format and
depends on stream.h and various other things. As such it forces a few
heavy files like stream.h to be loaded early and to hide missing
dependencies depending where it's loaded. Among the missing ones is
syslog.h which was often automatically included resulting in no less
than 3 users missing it.
Among 76 users, only 5 could be removed, and probably 70 don't need the
full set of dependencies.
A good approach would consist in splitting that file in 3 parts:
- one for error output ("errors" ?).
- one for log_format processing
- and one for actual logging.
global.h was one of the messiest files, it has accumulated tons of
implicit dependencies and declares many globals that make almost all
other file include it. It managed to silence a dependency loop between
server.h and proxy.h by being well placed to pre-define the required
structs, forcing struct proxy and struct server to be forward-declared
in a significant number of files.
It was split in to, one which is the global struct definition and the
few macros and flags, and the rest containing the functions prototypes.
The UNIX_MAX_PATH definition was moved to compat.h.
And also rename standard.c to tools.c. The original split between
tools.h and standard.h dates from version 1.3-dev and was mostly an
accident. This patch moves the files back to what they were expected
to be, and takes care of not changing anything else. However this
time tools.h was split between functions and types, because it contains
a small number of commonly used macros and structures (e.g. name_desc)
which in turn cause the massive list of includes of tools.h to conflict
with the callers.
They remain the ugliest files of the whole project and definitely need
to be cleaned and split apart. A few types are defined there only for
functions provided there, and some parts are even OS-specific and should
move somewhere else, such as the symbol resolution code.
Regex are essentially included for myregex_t but it turns out that
several of the C files didn't include it directly, relying on the
one included by their own .h. This has been cleanly addressed so
that only the type is included by H files which need it, and adding
the missing includes for the other ones.
All files that were including one of the following include files have
been updated to only include haproxy/api.h or haproxy/api-t.h once instead:
- common/config.h
- common/compat.h
- common/compiler.h
- common/defaults.h
- common/initcall.h
- common/tools.h
The choice is simple: if the file only requires type definitions, it includes
api-t.h, otherwise it includes the full api.h.
In addition, in these files, explicit includes for inttypes.h and limits.h
were dropped since these are now covered by api.h and api-t.h.
No other change was performed, given that this patch is large and
affects 201 files. At least one (tools.h) was already freestanding and
didn't get the new one added.
The support for reqrep and friends was removed in 2.1 but the
chain_regex() function and the "action" field in the regex struct
was still there. This patch removes them.
One point worth mentioning though. There is a check_replace_string()
function whose purpose was to validate the replacement strings passed
to reqrep. It should also be used for other replacement regex, but is
never called. Callers of exp_replace() should be checked and a call to
this function should be added to detect the error early.
Now we atomically allocate the my_regex struct within function
regex_comp() and compile the regex or free both in case of failure. The
pointer to the allocated my_regex struct is returned directly. The
my_regex* argument to regex_comp() is removed.
Function regex_free() was modified so that it systematically frees the
my_regex entry. The function does nothing when called with a NULL as
argument (like free()). It will avoid existing risk of not properly
freeing the initialized area.
Other structures are also updated in order to be compatible (the ones
related to Lua and action rules).
Most register_build_opts() calls use static strings. These ones were
replaced with a trivial REGISTER_BUILD_OPTS() statement adding the string
and its call to the STG_REGISTER section. A dedicated section could be
made for this if needed, but there are very few such calls for this to
be worth it. The calls made with computed strings however, like those
which retrieve OpenSSL's version or zlib's version, were moved to a
dedicated function to guarantee they are called late in the process.
For example, the SSL call probably requires that SSL_library_init()
has been called first.
this adds a support of the newest pcre2 library,
more secure than its older sibling in a cost of a
more complex API.
It works pretty similarly to pcre's part to keep
the overall change smooth, except :
- we define the string class supported at compile time.
- after matching the ovec data is properly sized, althought
we do not take advantage of it here.
- the lack of jit support is treated less 'dramatically'
as pcre2_jit_compile in this case is 'no-op'.
Instead of repeating the type of the LHS argument (sizeof(struct ...))
in calls to malloc/calloc, we directly use the pointer
name (sizeof(*...)). The following Coccinelle patch was used:
@@
type T;
T *x;
@@
x = malloc(
- sizeof(T)
+ sizeof(*x)
)
@@
type T;
T *x;
@@
x = calloc(1,
- sizeof(T)
+ sizeof(*x)
)
When the LHS is not just a variable name, no change is made. Moreover,
the following patch was used to ensure that "1" is consistently used as
a first argument of calloc, not the last one:
@@
@@
calloc(
+ 1,
...
- ,1
)
This function (and its sister regex_exec_match2()) abstract the regex
execution but make it impossible to pass flags to the regex engine.
Currently we don't use them but we'll need to support REG_NOTBOL soon
(to indicate that we're not at the beginning of a line). So let's add
support for this flag and update the API accordingly.
pcre_study() has been around long before JIT has been added. It also seems to
affect the performance in some cases (positive).
Below I've attached some test restults. The test is based on
http://sljit.sourceforge.net/regex_perf.html (see bottom). It has been modified
to just test pcre_study vs. no pcre_study. Note: This test does not try to
match specific header it's instead run over a larger text with more and less
complex patterns to make the differences more clear.
% ./runtest
'mark.txt' loaded. (Length: 19665221 bytes)
-----------------
Regex: 'Twain'
[pcre-nostudy] time: 14 ms (2388 matches)
[pcre-study] time: 21 ms (2388 matches)
-----------------
Regex: '^Twain'
[pcre-nostudy] time: 109 ms (100 matches)
[pcre-study] time: 109 ms (100 matches)
-----------------
Regex: 'Twain$'
[pcre-nostudy] time: 14 ms (127 matches)
[pcre-study] time: 16 ms (127 matches)
-----------------
Regex: 'Huck[a-zA-Z]+|Finn[a-zA-Z]+'
[pcre-nostudy] time: 695 ms (83 matches)
[pcre-study] time: 26 ms (83 matches)
-----------------
Regex: 'a[^x]{20}b'
[pcre-nostudy] time: 90 ms (12495 matches)
[pcre-study] time: 91 ms (12495 matches)
-----------------
Regex: 'Tom|Sawyer|Huckleberry|Finn'
[pcre-nostudy] time: 1236 ms (3015 matches)
[pcre-study] time: 34 ms (3015 matches)
-----------------
Regex: '.{0,3}(Tom|Sawyer|Huckleberry|Finn)'
[pcre-nostudy] time: 5696 ms (3015 matches)
[pcre-study] time: 5655 ms (3015 matches)
-----------------
Regex: '[a-zA-Z]+ing'
[pcre-nostudy] time: 1290 ms (95863 matches)
[pcre-study] time: 1167 ms (95863 matches)
-----------------
Regex: '^[a-zA-Z]{0,4}ing[^a-zA-Z]'
[pcre-nostudy] time: 136 ms (4507 matches)
[pcre-study] time: 134 ms (4507 matches)
-----------------
Regex: '[a-zA-Z]+ing$'
[pcre-nostudy] time: 1334 ms (5360 matches)
[pcre-study] time: 1214 ms (5360 matches)
-----------------
Regex: '^[a-zA-Z ]{5,}$'
[pcre-nostudy] time: 198 ms (26236 matches)
[pcre-study] time: 197 ms (26236 matches)
-----------------
Regex: '^.{16,20}$'
[pcre-nostudy] time: 173 ms (4902 matches)
[pcre-study] time: 175 ms (4902 matches)
-----------------
Regex: '([a-f](.[d-m].){0,2}[h-n]){2}'
[pcre-nostudy] time: 1242 ms (68621 matches)
[pcre-study] time: 690 ms (68621 matches)
-----------------
Regex: '([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]'
[pcre-nostudy] time: 1215 ms (675 matches)
[pcre-study] time: 952 ms (675 matches)
-----------------
Regex: '"[^"]{0,30}[?!\.]"'
[pcre-nostudy] time: 27 ms (5972 matches)
[pcre-study] time: 28 ms (5972 matches)
-----------------
Regex: 'Tom.{10,25}river|river.{10,25}Tom'
[pcre-nostudy] time: 705 ms (2 matches)
[pcre-study] time: 68 ms (2 matches)
In some cases it's more or less the same but when it's faster than by a huge margin.
It always depends on the pattern, the string(s) to match against etc.
Signed-off-by: Christian Ruppert <c.ruppert@babiel.com>
pcre_study() may return NULL even though it succeeded. In this case error is
NULL otherwise error is not NULL. Also see man 3 pcre_study.
Previously a ACL pattern of e.g. ".*" would cause error because pcre_study did
not found anything to speed up matching and returned regex->extra = NULL and
error = NULL which in this case was a false-positive. That happend only when
PCRE_JIT was enabled for HAProxy but libpcre has been built without JIT.
Signed-off-by: Christian Ruppert <c.ruppert@babiel.com>
[wt: this needs to be backported to 1.5 as well]
The pcreposix layer (in the pcre projetc) execute strlen to find
thlength of the string. When we are using the function "regex_exex*2",
the length is used to add a final \0, when pcreposix is executed a
strlen is executed to compute the length.
If we are using a native PCRE api, the length is provided as an
argument, and these operations disappear.
This is useful because PCRE regex are more used than POSIC regex.
This patch remove all references of standard regex in haproxy. The last
remaining references are only in the regex.[ch] files.
In the file src/checks.c, the original function uses a "pmatch" array.
In fact this array is unused. This patch remove it.
This patchs rename the "regex_exec" to "regex_exec2". It add a new
"regex_exec", "regex_exec_match" and "regex_exec_match2" function. This
function can match regex and return array containing matching parts.
Otherwise, this function use the compiled method (JIT or PCRE or POSIX).
JIT require a subject with length. PCREPOSIX and native POSIX regex
require a null terminted subject. The regex_exec* function are splited
in two version. The first version take a null terminated string, but it
execute strlen() on the subject if it is compiled with JIT. The second
version (terminated by "2") take the subject and the length. This
version adds a null character in the subject if it is compiled with
PCREPOSIX or native POSIX functions.
The documentation of posix regex and pcreposix says that the function
returns 0 if the string matche otherwise it returns REG_NOMATCH. The
REG_NOMATCH macro take the value 1 with posix regex and the value 17
with the pcreposix. The documentaion of the native pcre API (used with
JIT) returns a negative number if no match, otherwise, it returns 0 or a
positive number.
This patch fix also the return codes of the regex_exec* functions. Now,
these function returns true if the string match, otherwise it returns
false.
Dmitry Sivachenko reported that "uint" doesn't build on FreeBSD 10.
On Linux it's defined in sys/types.h and indicated as "old". Just
get rid of the very few occurrences.
Currently exp_replace() (which is used in reqrep/reqirep) is
vulnerable to a buffer overrun. I have been able to reproduce it using
the attached configuration file and issuing the following command:
wget -O - -S -q http://localhost:8000/`perl -e 'print "a"x4000'`/cookie.php
Str was being checked only in in while (str) and it was possible to
read past that when more than one character was being accessed in the
loop.
WT:
Note that this bug is only marked MEDIUM because configurations
capable of triggering this bug are very unlikely to exist at all due
to the fact that most rewrites consist in static string additions
that largely fit into the reserved area (8kB by default).
This fix should also be backported to 1.4 and possibly even 1.3 since
it seems to have been present since 1.1 or so.
Config:
-------
global
maxconn 500
stats socket /tmp/haproxy.sock mode 600
defaults
timeout client 1000
timeout connect 5000
timeout server 5000
retries 1
option redispatch
listen stats
bind :8080
mode http
stats enable
stats uri /stats
stats show-legends
listen tcp_1
bind :8000
mode http
maxconn 400
balance roundrobin
reqrep ^([^\ :]*)\ /(.*)/(.*)\.php(.*) \1\ /\3.php?arg=\2\2\2\2\2\2\2\2\2\2\2\2\2\4
server srv1 127.0.0.1:9000 check port 9000 inter 1000 fall 1
server srv2 127.0.0.1:9001 check port 9001 inter 1000 fall 1
The pointer <regstr> is only used to compare and identify the original
regex string with the patterns. Now the patterns have a reference map
containing this original string. It is useless to store this value two
times.
The current file "regex.h" define an abstraction for the regex. It
provides the same struct name and the same "regexec" function for the
3 regex types supported: standard libc, basic pcre and jit pcre.
The regex compilation function is not provided by this file. If the
developper wants to use regex, he must write regex compilation code
containing "#define *JIT*".
This patch provides a unique regex compilation function according to
the compilation options.
In addition, the "regex.h" file checks the presence of the "#define
PCRE_CONFIG_JIT" when "USE_PCRE_JIT" is enabled. If this flag is not
present, the pcre lib doesn't support JIT and "#error" is emitted.
As suggested by Markus Elfring, a few "const char *" have replaced
some "char *" declarations where a function is not expected to
modify a value. It does not change the code but it helps detecting
coding errors.
The files are now stored under :
- include/haproxy for the generic includes
- include/types.h for the structures needed within prototypes
- include/proto.h for function prototypes and inline functions
- src/*.c for the C files
Most include files are now covered by LGPL. A last move still needs
to be done to put inline functions under GPL and not LGPL.
Version has been set to 1.3.0 in the code but some control still
needs to be done before releasing.