the design for TLS in dynamic-linked programs is mostly complete too,
but I have not yet implemented it. cost is nonzero but still low for
programs which do not use TLS and/or do not use threads (a few hundred
bytes of new code, plus dependency on memcpy). i believe it can be
made smaller at some point by merging __init_tls and __init_security
into __libc_start_main and avoiding duplicate auxv-parsing code.
at the same time, I've also slightly changed the logic pthread_create
uses to allocate guard pages to ensure that guard pages are not
counted towards commit charge.
for some reason this option is undocumented. not sure when it was
added, so I'm using a configure test. gcc was already setting the mark
correctly for C files, but assembler source files would need ugly
.note boilerplate in every single file to achieve this without the
option to the assembler.
blame whoever thought it would be a good idea to make the stack
executable by default rather than doing it the other way around...
based on proposed patches by Daniel Cegiełka, with minor changes:
- use a weak symbol for optreset so it doesn't clash with namespace
- also reset optpos (position in multi-option arg like -lR)
- also make getopt_long support reset
this function was overly complicated and not even obviously correct.
avoid using openat/linkat just like in shm_open, and instead expand
pathname using code shared with shm_open. remove bogus (and dangerous,
with priorities) use of spinlocks.
this commit also heavily streamlines the code and ensures there are no
failure cases that can happen after a new semaphore has been created
in the filesystem, since that case is unreportable.
this feature will be in the next version of POSIX, and can be used
internally immediately. there are many internal uses of fopen where
close-on-exec is needed to fix bugs.
also update syslog to use SOCK_CLOEXEC rather than separate fcntl
step, to make it safe in multithreaded programs that run external
programs.
emulation is not atomic; it could be made atomic by holding a lock on
forking during the operation, but this seems like overkill. my goal is
not to achieve perfect behavior on old kernels (which have plenty of
other imperfect behavior already) but to avoid catastrophic breakage
in (1) syslog, which would give no output on old kernels with the
change to use SOCK_CLOEXEC, and (2) programs built on a new kernel
where configure scripts detected a working SOCK_CLOEXEC, which later
get run on older kernels (they may otherwise fail to work completely).
based on initial work by rdp, with heavy modifications. some features
including threads are untested because qemu app-level emulation seems
to be broken and I do not have a proper system image for testing.
when strchr fails, and important piece of information already
computed, the string length, is thrown away. have strchrnul (with
namespace protection) be the underlying function so this information
can be kept, and let strchr be a wrapper for it. this also allows
strcspn to be considerably faster in the case where the match set has
a single element that's not matched.
testing with gcc 4.6.3 on x86, -Os, the old version does a duplicate
null byte check after the first loop. this is purely the compiler
being stupid, but the old code was also stupid and unintuitive in how
it expressed the check.
austin group interpretation for defect #529
(http://austingroupbugs.net/view.php?id=529) tightens the
requirements on close such that, if it returns with EINTR, the file
descriptor must not be closed. the linux kernel developers vehemently
disagree with this, and will not change it. we catch and remap EINTR
to EINPROGRESS, which the standard allows close() to return when the
operation was not finished but the file descriptor has been closed.
new behavior can be summarized as:
inputs that parse completely as a decimal number are treated as one,
and rejected only if the result is out of 16-bit range.
inputs that do not parse as a decimal number (where strtoul leaves
anything left over in the input) are searched in /etc/services.
this is useful when the underlying gcc is already a wrapper, which is
the case at least on some uclibc-based system images. it's also useful
for running an older/newer/nondefault version of gcc.
it was determined in discussion that these kind of limits are not
sufficient to protect single-threaded servers against denial of
service attacks from maliciously large round counts. the time scales
simply vary too much; many users will want login passwords with rounds
counts on a scale that gives decisecond latency, while highly loaded
webservers will need millisecond latency or shorter.
still some limit is left in place; the idea is not to protect against
attacks, but to avoid the runtime of a single call to crypt being, for
all practical purposes, infinite, so that configuration errors can be
caught and fixed without bringing down whole systems. these limits are
very high, on the order of minute-long runtimes for modest systems.
if same register is used for input/output, the compiler must be told.
otherwise is generates random junk code that clobbers the result. in
pure syscall-wrapper functions, nothing went wrong, but in more
complex functions where register allocation is non-trivial, things
broke badly.
with this patch, the malloc in libc.so built with -Os is nearly the
same speed as the one built with -O3. thus it solves the performance
regression that resulted from removing the forced -O3 when building
libc.so; now libc.so can be both small and fast.
I originally added -O3 for shared libraries to counteract very bad
behavior by GCC when building PIC code: it insists on reloading the
GOT register in static functions that need it, even if the address of
the function is never leaked from the translation unit and all local
callers of the function have already loaded the GOT register. this
measurably degrades performance in a few key areas like malloc. the
inlining done at -O3 avoids the issue, but that's really not a good
reason for overriding the user's choice of optimization level.
vfork is implemented as the fork syscall (with no atfork handlers run)
on archs where it is not available, so this change does not introduce
any change in behavior or regression for such archs.
I'm not 100% sure that Linux's O_PATH meets the POSIX requirements for
O_SEARCH, but it seems very close if not perfect. and old kernels
ignore it, so O_SEARCH will still work as desired as long as the
caller has read permissions to the directory.
by using the "ir" constraint (immediate or register) and the carefully
constructed instruction addu $2,$0,%2 which can take either an
immediate or a register for %2, the new inline asm admits maximal
optimization with no register spillage to the stack when the compiler
successfully performs constant propagration, but still works by
allocating a register when the syscall number cannot be recognized as
a constant. in the case of syscalls with 0-3 arguments it barely
matters, but for 4-argument syscalls, using an immediate for the
syscall number avoids creating a stack frame for the syscall wrapper
function.
all past and current kernel versions have done so, but there seems to
be no reason it's necessary and the sentiment from everyone I've asked
has been that we should not rely on it. instead, use r7 (an argument
register) which will necessarily be preserved upon syscall restart.
however this only works for 0-3 argument syscalls, and we have to
resort to the function call for 4-argument syscalls.
for the sake of simplicity, I've only used rep movsb rather than
breaking up the copy for using rep movsd/q. on all modern cpus, this
seems to be fine, but if there are performance problems, there might
be a need to go back and add support for rep movsd/q.
before restrict was added, memove called memcpy for forward copies and
used a byte-at-a-time loop for reverse copies. this was changed to
avoid invoking UB now that memcpy has an undefined copying order,
making memmove considerably slower.
performance is still rather bad, so I'll be adding asm soon.