glibc-2.11 on x86_64 provides a machine-specific memchr() which is faster
than the generic C implementation by around 40%, so let's make it possible
to use it instead of the hand-coded version.
This version implements both 32 and 64 bit versions at once, it
avoids the need to have two separate output files. It also improves
efficiency on i386 platforms by adding a little bit of assembly where
gcc isn't efficient.
By adding a "landing area" at the end of the buffer, it becomes safe to
parse more bytes at once. On 32-bit this makes fgets run about 4% faster
but it does not save anything on 64-bit.
A bug in the algorithm used to find an LF in multiple bytes at once
made byte 0x80 trigger detection of byte 0x00, thus 0x8A matches byte
0x0A. In practice, this issue never happens since byte 0x8A won't be
displayed in logs (or it will be encoded). This could still possibly
happen in mixed logs.
A new idea came up to detect the presence of a null byte in a word.
It saves several operations compared to the previous one, and eliminates
the jumps (about 6 instructions which can run 2-by-2 in parallel).
This sole optimisation improved the line count speed by about 30%.