presently, all archs/ABIs have struct stat matching the kernel
stat[64] type, except mips/mipsn32/mips64 which do conversion hacks in
syscall_arch.h to work around bugs in the kernel type. this patch
completely decouples them and adds a translation step to the success
path of fstatat. at present, this is just gratuitous copying, but it
opens up multiple possibilities for future support for 64-bit time_t
on 32-bit archs and for cleaned-up/unified ABIs.
for clarity, the mips hacks are not yet removed in this commit, so the
mips kstat structs still correspond to the output of the hacks in
their syscall_arch.h files, not the raw kernel type. a subsequent
commit will fix this.
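a minimal sketch of the kind of translation step added on the fstatat
success path, assuming a hypothetical struct kstat that mirrors the
kernel layout (the kstat field names here are illustrative, not the
actual musl ones):

  #include <sys/stat.h>

  /* hypothetical kernel-layout struct; the real one is arch-specific */
  struct kstat {
      dev_t st_dev; ino_t st_ino; mode_t st_mode; nlink_t st_nlink;
      uid_t st_uid; gid_t st_gid; dev_t st_rdev; off_t st_size;
      blksize_t st_blksize; blkcnt_t st_blocks;
      long st_atime_sec, st_atime_nsec;
      long st_mtime_sec, st_mtime_nsec;
      long st_ctime_sec, st_ctime_nsec;
  };

  /* copy field by field so the public struct stat layout is free to
   * diverge from the kernel type (e.g. 64-bit time_t on 32-bit archs) */
  static void __stat_from_kstat(struct stat *st, const struct kstat *kst)
  {
      *st = (struct stat){
          .st_dev = kst->st_dev, .st_ino = kst->st_ino,
          .st_mode = kst->st_mode, .st_nlink = kst->st_nlink,
          .st_uid = kst->st_uid, .st_gid = kst->st_gid,
          .st_rdev = kst->st_rdev, .st_size = kst->st_size,
          .st_blksize = kst->st_blksize, .st_blocks = kst->st_blocks,
          .st_atim = { kst->st_atime_sec, kst->st_atime_nsec },
          .st_mtim = { kst->st_mtime_sec, kst->st_mtime_nsec },
          .st_ctim = { kst->st_ctime_sec, kst->st_ctime_nsec },
      };
  }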
syscall numbers are now synced up across targets (starting from 403 the
numbers are the same on all targets other than an arch-specific offset).
IPC syscalls sem*, shm*, msg* got added where they were missing (except
for semop: only semtimedop got added). the new semctl, shmctl, msgctl
imply IPC_64, see
linux commit 0d6040d4681735dfc47565de288525de405a5c99
arch: add split IPC system calls where needed
new 64-bit time_t syscall variants got added on 32-bit targets, see
linux commit 48166e6ea47d23984f0b481ca199250e1ce0730a
y2038: add 64-bit time_t syscalls to all 32-bit architectures
new async io syscalls got added, see
linux commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e
Add io_uring IO interface
linux commit edafccee56ff31678a091ddb7219aba9b28bc3cb
io_uring: add support for pre-mapped user IO buffers
a new syscall got added that uses the fd of /proc/<pid> as a stable
handle for processes: it allows sending signals without pid reuse
issues and is
intended to eventually replace rt_sigqueueinfo, kill, tgkill and
rt_tgsigqueueinfo, see
linux commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad
signal: add pidfd_send_signal() syscall
on some targets (arm, m68k, s390x, sh) some previously missing syscall
numbers got added as well.
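a hedged usage sketch of the new pidfd_send_signal path described
above; the pid and the lack of siginfo are illustrative, and the raw
syscall is used since no libc wrapper exists at this point:

  #include <fcntl.h>
  #include <signal.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int send_term(void)
  {
      /* the fd of /proc/<pid> acts as a stable handle for the process */
      int pidfd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
      if (pidfd < 0) return -1;
      /* NULL siginfo behaves like kill(); flags must be 0 for now */
      int r = syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, (void *)0, 0);
      close(pidfd);
      return r;
  }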
we have to avoid using ebx unconditionally in asm constraints for
i386, because gcc 3 and 4 and possibly other simplistic compilers
(pcc?) implement PIC via making ebx a fixed-use register, and disallow
its use for anything else. rather than hard-coding knowledge of which
compilers work (at least gcc 5+ and clang), perform a configure test;
this should give us the good codegen on any new compilers we don't yet
know about.
swapping ebx and edx is kept for 1- and 2-arg syscalls because it
avoids having any spills/stack-frame at all in small functions. for
6-arg, if ebx is directly usable, the complex shuffling introduced in
commit c8798ef974 can be avoided, and
ebp can be loaded the same way ebx is in 5-arg syscalls for compilers
that don't support direct use of ebx.
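a minimal sketch of the 1- and 2-arg approach described above, shown
for the 1-arg case with int $128 standing in for whatever syscall
instruction is selected: the argument is passed to the asm in edx and
swapped into ebx around the trap, so ebx never appears in the
constraints.

  static inline long __syscall1_sketch(long n, long a1)
  {
      unsigned long ret;
      __asm__ __volatile__ (
          "xchg %%ebx,%%edx ; int $128 ; xchg %%ebx,%%edx"
          : "=a"(ret) : "a"(n), "d"(a1) : "memory");
      return ret;
  }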
commit 22e5bbd0de inlined the i386
syscall mechanism, but wrongly assumed memory operands to the 5- and
6-argument syscall asm would be esp-based. however, nothing in the
constraints prevented them from being ebx- or ebp-based, and in those
cases, ebx and ebp could be clobbered before use of the memory operand
was complete. in the 6-argument case, this prevented restoration of
the original register values before the end of the asm block, breaking
the asm contract since ebx and ebp are not marked as clobbered. (they
can't be, because lots of compilers don't accept these registers in
constraints or clobbers if PIC or frame pointer is enabled).
doing this right is complicated by the fact that, after a single push,
no operands which might be memory operands are usable. if they are
esp-based, the value of esp has changed, rendering them invalid.
introduce some new dances to load the registers. for the 5-arg case,
push the operand that may be a memory operand first, and after that,
it doesn't matter if the operand is invalid, since we'll just use the
newly pushed value. for the 6-arg case, we need to put both operands
in memory to begin with, as the old non-inline code prior to commit
22e5bbd0de did, so that there's
only one potentially memory-based operand to the asm. this can then be
saved with a single push, and after that the values can be read off
into the registers they're needed in.
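roughly what the 5-arg dance looks like for compilers that can't
allocate ebx (a sketch, not the verbatim musl code): the
possibly-memory operand a1 is pushed before esp changes, so it no
longer matters whether its constraint was esp-based.

  static inline long __syscall5_sketch(long n, long a1, long a2,
                                       long a3, long a4, long a5)
  {
      unsigned long ret;
      __asm__ __volatile__ (
          "pushl %2 ; push %%ebx ; mov 4(%%esp),%%ebx ; "
          "int $128 ; pop %%ebx ; add $4,%%esp"
          : "=a"(ret)
          : "a"(n), "g"(a1), "c"(a2), "d"(a3), "S"(a4), "D"(a5)
          : "memory");
      return ret;
  }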
there's some size overhead, but still a lot less execution overhead
than the old out-of-line code. doing it better depends on a modern
compiler that lets you use ebx and ebp in asm constraints without
restriction. the failure modes on compilers where this doesn't work
are inconsistent and dangerous (on at least some gcc versions 4.x and
earlier, wrong codegen!), so this is a delicate matter. it can be
addressed later if needed.
this is the first part of a series of patches intended to make
__syscall fully self-contained in the object file produced using
syscall.h, which will make it possible for crt1 code to perform
syscalls.
the (confusingly named) i386 __vsyscall mechanism, which this commit
removes, was introduced before the presence of a valid thread pointer
was mandatory; back then the thread pointer was set up lazily only if
threads were used. the intent was to be able to perform syscalls using
the kernel's fast entry point in the VDSO, which can use the sysenter
(Intel) or syscall (AMD) instruction instead of int $128, but without
inlining an access to the __syscall global at the point of each
syscall, which would incur a significant size cost from PIC setup
everywhere. the mechanism also shuffled registers/calling convention
around to avoid spills of call-saved registers, and to avoid
allocating ebx or ebp via asm constraints, since there are plenty of
broken-but-supported compiler versions which are incapable of
allocating ebx with -fPIC or ebp with -fno-omit-frame-pointer.
the new mechanism preserves the properties of avoiding spills and
avoiding allocation of ebx/ebp in constraints, but does it inline,
using some fairly simple register shuffling, and uses a field of the
thread structure rather than global data for the vdso-provided syscall
code address.
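concretely, the inline syscall instruction sequence can then be a
single indirect call through that thread-structure field, along the
lines of the following; the segment register and offset shown are
illustrative, not a claim about the actual layout:

  #define SYSCALL_INSNS "call *%%gs:16"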
for now, the external __syscall function is refactored not to use the
old __vsyscall so it can be kept, but the intent is to remove it too.
this will allow the compiler to cache and reuse the result, meaning we
no longer have to take care not to load it more than once for the sake
of archs where the load may be expensive.
depends on commit 1c84c99913 for
correctness, since otherwise the compiler could hoist loads during
stage 3 of dynamic linking before the initial thread-pointer setup.
parts of sys/ptrace.h are target specific; use bits/ptrace.h to add
the target-specific macro definitions.
these macros are kept in the generic sys/ptrace.h even though some
targets don't support them:
PTRACE_GETREGS
PTRACE_SETREGS
PTRACE_GETFPREGS
PTRACE_SETFPREGS
PTRACE_GETFPXREGS
PTRACE_SETFPXREGS
so no macro definition got removed in this patch on any target. only
s390x has a numerically conflicting macro definition (PTRACE_SINGLEBLOCK).
the PT_ aliases follow glibc headers; otherwise the definitions come
from the linux uapi headers, except for ones that are skipped in glibc
and have no real kernel support (s390x PTRACE_*_AREA), need special
type definitions (mips PTRACE_*_WATCH_*), or are only relevant for
linux 2.4 compatibility (PTRACE_OLDSETOPTIONS).
Update atomic.h to provide a_ctz_l in all cases (atomic_arch.h should
now only provide a_ctz_32 and/or a_ctz_64).
The generic version of a_ctz_32 now takes advantage of a_clz_32 if
available and the generic a_ctz_64 now makes use of a_ctz_32.
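a sketch of the described construction, assuming an arch-provided
a_clz_32; the identity used is ctz(x) = 31 - clz(x & -x):

  #include <stdint.h>

  /* generic a_ctz_32 on top of a_clz_32; undefined for x==0 */
  static inline int a_ctz_32(uint32_t x)
  {
      return 31 - a_clz_32(x & -x);
  }

  /* generic a_ctz_64 built from a_ctz_32 */
  static inline int a_ctz_64(uint64_t x)
  {
      uint32_t lo = x;
      if (lo) return a_ctz_32(lo);
      return 32 + a_ctz_32(x >> 32);
  }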
PAGESIZE is actually the version defined in POSIX base, with PAGE_SIZE
being in the XSI option. use PAGESIZE as the underlying definition to
facilitate making exposure of PAGE_SIZE conditional.
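the intended layering, sketched; the numeric value and the exact guard
are illustrative only:

  #define PAGESIZE 4096        /* arch-provided; value is just an example */
  #if defined(_XOPEN_SOURCE)   /* illustrative guard for the XSI-only name */
  #define PAGE_SIZE PAGESIZE
  #endif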
most of the found naming differences don't matter to musl, because
internally it unifies the syscall names that vary across targets,
but for external code the names should match the kernel uapi.
aarch64:
__NR_fstatat is called __NR_newfstatat in linux.
__NR_or1k_atomic got mistakenly copied from or1k.
arm:
__NR_arm_sync_file_range is an alias for __NR_sync_file_range2
__NR_fadvise64_64 is called __NR_arm_fadvise64_64 in linux;
the old non-arm name is kept too, it should not cause issues.
(powerpc has a similar nonstandard fadvise and it uses the
normal name.)
i386:
__NR_madvise1 was removed from linux in commit
303395ac3bf3e2cb488435537d416bc840438fcb 2011-11-11
microblaze:
__NR_fadvise, __NR_fstatat, __NR_pread, __NR_pwrite
had different names in linux.
mips:
__NR_fadvise, __NR_fstatat, __NR_pread, __NR_pwrite, __NR_select
had different names in linux.
mipsn32:
__NR_fstatat is called __NR_newfstatat in linux.
or1k:
__NR__llseek is called __NR_llseek in linux.
the old name is kept too because that's the name musl uses
internally.
powerpc:
__NR_{get,set}res{gid,uid}32 were never present in powerpc linux.
__NR_timerfd was briefly defined in linux but then got renamed.
a_clz_64 counts the leading zero bits of a 64-bit int; the result is
undefined on zero input. (this has nothing to do with atomics, but it
is added to atomic.h so target-specific helper functions are together.)
there is a logarithmic generic implementation and another in terms of
a 32-bit a_clz_32 on targets where that's available.
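a sketch of what the logarithmic generic version can look like (not
the verbatim musl code; undefined for zero input, as stated above):

  #include <stdint.h>

  static inline int a_clz_64(uint64_t x)
  {
      int r = 0;
      if (!(x >> 32)) { r += 32; x <<= 32; }
      if (!(x >> 48)) { r += 16; x <<= 16; }
      if (!(x >> 56)) { r += 8;  x <<= 8;  }
      if (!(x >> 60)) { r += 4;  x <<= 4;  }
      if (!(x >> 62)) { r += 2;  x <<= 2;  }
      if (!(x >> 63)) { r += 1; }
      return r;
  }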
when _GNU_SOURCE is defined, which is always the case when compiling
c++ with gcc, these macros for the indices in gregset_t are
exposed and likely to clash with applications. by using enum constants
rather than macros defined with integer literals, we can make the
clash slightly less likely to break software. the macros are still
defined in case anything checks for them with #ifdef, but they're
defined to expand to themselves so that non-file-scope (e.g.
namespaced) identifiers by the same names still work.
for the sake of avoiding mistakes, the changes were generated with sed
via the command:
sed -i -e 's/#define *\(REG_[A-Z_0-9]\{1,\}\) *\([0-9]\{1,\}\)'\
'/enum { \1 = \2 };\n#define \1 \1/' \
arch/i386/bits/signal.h arch/x86_64/bits/signal.h arch/x32/bits/signal.h
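one illustrative line of the resulting headers (the particular index
value is incidental):

  enum { REG_EIP = 14 };
  #define REG_EIP REG_EIP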
this has been slated for removal for a long time. there is
fundamentally no way to implement stdarg without compiler assistance;
any attempt to do so has serious undefined behavior; its working
depends not just (as a common misconception goes) on ABI, but also on
assumptions about compiler code generation internal to a translation
unit, which is not subject to external ABI constraints.
placing the opening brace on the same line as the struct keyword/tag
is the style I prefer and seems to be the prevailing practice in more
recent additions.
these changes were generated by the command:
find include/ arch/*/bits -name '*.h' \
-exec sed -i '/^struct [^;{]*$/{N;s/\n/ /;}' {} +
and subsequently checked by hand to ensure that the regex did not pick
up any false positives.
the syscalls take an additional flags argument; they were added in commit
f17d8b35452cab31a70d224964cd583fb2845449, and a RWF_HIPRI priority hint
flag was added to linux/fs.h in 97be7ebe53915af504fb491fb99f064c7cf3cb09.
the syscall is not allocated for microblaze and sh yet.
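a hedged usage sketch, assuming these numbers are for the
preadv2/pwritev2 entry points defined by the referenced commit and
that a wrapper with the usual (fd, iov, iovcnt, offset, flags)
signature is available:

  #define _GNU_SOURCE
  #include <sys/uio.h>

  ssize_t read_hipri(int fd, void *buf, size_t len)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      /* RWF_HIPRI asks for high-priority (polled, where supported) I/O */
      return preadv2(fd, &iov, 1, 0, RWF_HIPRI);
  }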
commits e24984efd5 and
16b55298dc inadvertently disabled the
a_spin implementations for i386, x86_64, and x32 by defining a macro
named a_pause instead of a_spin. this should not have caused any
functional regression, but it inhibited cpu relaxation while spinning
for locks.
bug reported by George Kulakowski.
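for reference, the x86 cpu-relaxation hook amounts to a single pause
instruction; a sketch:

  #define a_spin a_spin
  static inline void a_spin(void)
  {
      __asm__ __volatile__ ("pause" : : : "memory");
  }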
it was introduced for offloading copying between regular files
in linux commit 29732938a6289a15e907da234d6692a2ead71855
(microblaze and sh do not yet have the syscall number.)
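a hedged usage sketch of the copy_file_range operation the referenced
kernel commit introduces, via the raw syscall (no libc wrapper is
assumed); NULL offsets mean both files' own positions are used and
updated:

  #define _GNU_SOURCE
  #include <sys/syscall.h>
  #include <unistd.h>

  /* offload copying of up to len bytes between two regular files */
  ssize_t copy_range(int in_fd, int out_fd, size_t len)
  {
      return syscall(SYS_copy_file_range, in_fd, (void *)0,
                     out_fd, (void *)0, len, 0u);
  }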
currently five targets use the same mman.h constants and the rest
share most constants too, so move them to sys/mman.h before the
bits/mman.h include where the differences can be corrected by
redefinition of the macros.
this fixes two minor bugs: POSIX_MADV_DONTNEED was wrong on most
targets (it should be the same as MADV_DONTNEED), and sh defined
the x86-only MAP_32BIT mmap flag.
all bits headers that were identical for a number of 'clean' archs are
moved to the new arch/generic tree. in addition, a few headers that
differed only cosmetically from the new generic version are removed.
additional deduplication may be possible in mman.h and in several
headers (limits.h, posix.h, stdint.h) that mostly depend on whether
the arch is 32- or 64-bit, but they are left alone for now because
greater gains are likely possible with more invasive changes to header
logic, which is beyond the scope of this commit.
they lock faulted pages into memory (useful when a small part of a
large mapped file needs efficient access), new in linux v4.4, commit
b0f205c2a3082dd9081f9a94e50658c5fa906ff1
MLOCK_* is not in the POSIX reserved namespace for sys/mman.h
this is mlock with a flags argument, new in linux commit
a8ca5d0ecbdde5cc3d7accacbd69968b0c98764e
as usual, microblaze and sh don't have an allocated syscall number yet.
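a hedged usage sketch combining the two additions above, assuming the
new call is mlock2 as in the referenced kernel commit (raw syscall,
since no wrapper exists at this point):

  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* lock a mapping's pages only as they are faulted in */
  int lock_on_fault(void *addr, size_t len)
  {
      return syscall(SYS_mlock2, addr, len, MLOCK_ONFAULT);
  }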
new in linux v4.3, added for aarch64, arm, i386, mips, or1k, powerpc,
x32 and x86_64.
membarrier is a system wide memory barrier; it moves most of the
synchronization cost to one side. new in kernel commit
5b25b13ab08f616efd566347d809b4ece54570d1
userfaultfd is useful for qemu and is new in kernel commit
8d2afd96c20316d112e04d935d9e09150e988397
switch_endian is powerpc only for switching endianness, new in commit
529d235a0e190ded1d21ccc80a73e625ebcad09b
new in linux v4.3, commit 9dea5dc921b5f4045a18c63eb92e84dc274d17eb.
direct calls instead of socketcall allow better seccomp filtering.
musl continues to use socketcalls internally on i386. (older kernels
would need a fallback mechanism if the direct calls were used.)
only use SYS_socketcall if SYSCALL_USE_SOCKETCALL is defined
internally, otherwise use direct syscalls.
this commit does not change the current behaviour, it is
preparation for adding direct syscall numbers for i386.
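a sketch of the internal dispatch this sets up (the macro shapes are
illustrative, not the exact musl definitions): the per-call name is
pasted onto either SYS_ for a direct syscall or __SC_ for a socketcall
multiplex code.

  #ifdef SYSCALL_USE_SOCKETCALL
  #define __socketcall(nm, a, b, c, d, e, f) \
      syscall(SYS_socketcall, __SC_##nm, \
              ((long [6]){ (long)(a), (long)(b), (long)(c), \
                           (long)(d), (long)(e), (long)(f) }))
  #else
  #define __socketcall(nm, a, b, c, d, e, f) \
      syscall(SYS_##nm, a, b, c, d, e, f)
  #endif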
this commit mostly makes consistent things like spacing, function
ordering in atomic_arch.h, argument names, use of volatile, etc. the
fake 64-bit and/or atomics are also removed because the shared
atomic.h does a better job of implementing them; it avoids making two
atomic memory accesses when only one 32-bit half needs to be touched.
no major overhaul is needed or possible because x86 actually has
native versions of all the usual atomic operations, rather than using
ll/sc or needing cas loops.
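for reference, the core x86 primitive everything else builds on is a
plain lock cmpxchg; a sketch close to the i386/x86_64 version:

  static inline int a_cas(volatile int *p, int t, int s)
  {
      /* compare *p with t; if equal, store s; return the old value */
      __asm__ __volatile__ (
          "lock ; cmpxchg %3, %1"
          : "=a"(t), "=m"(*p) : "a"(t), "r"(s) : "memory");
      return t;
  }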
rather than having each arch provide its own atomic.h, there is a new
shared atomic.h in src/internal which pulls arch-specific definitions
from arch/$(ARCH)/atomic_arch.h. the latter can be extremely minimal,
defining only a_cas or new ll/sc type primitives which the shared
atomic.h will use to construct everything else.
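for example, given only an arch-provided a_cas, the shared header can
synthesize the remaining operations; a sketch of one such construction:

  #ifndef a_swap
  #define a_swap a_swap
  static inline int a_swap(volatile int *p, int v)
  {
      /* retry the cas until we observe and replace a stable old value */
      int old;
      do old = *p;
      while (a_cas(p, old, v) != old);
      return old;
  }
  #endif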
this commit avoids making heavy changes to the individual archs'
atomic implementations. definitions which are identical or
near-identical to what the new shared atomic.h would produce have been
removed, but otherwise the changes made are just hooking up the
arch-specific files to the new infrastructure. major changes to take
advantage of the new system will come in subsequent commits.
at least gcc 4.7 claims c++11 support but does not accept the alignas
keyword, causing breakage when stddef.h is included in c++11 mode.
instead, prefer using __attribute__((__aligned__)) on any compiler
with GNU extensions, and only use the alignas keyword as a fallback
for other C++ compilers.
C code should not be affected by this patch.
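a sketch of the resulting preference order for an alignment-sensitive
definition such as max_align_t; the member types and conditions are
illustrative, and the plain-C branch is shown only for completeness
since it is unchanged by the patch:

  #if defined(__GNUC__)
  typedef struct { long long __ll; long double __ld; }
      __attribute__((__aligned__)) max_align_t;
  #elif defined(__cplusplus)
  typedef struct { alignas(long long) long long __ll;
      alignas(long double) long double __ld; } max_align_t;
  #else
  typedef struct { _Alignas(long long) long long __ll;
      _Alignas(long double) long double __ld; } max_align_t;
  #endif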
using the actual mcontext_t definition rather than an overlaid pointer
array both improves correctness/readability and eliminates some ugly
hacks for archs with 64-bit registers but 32-bit program counter.
also fix UB due to comparison of pointers not in a common array
object.
previously, the call into stage 2 was made by looking up the symbol
name "__dls2" (which was chosen short to be easy to look up) from the
dynamic symbol table. this was no problem for the dynamic linker,
since it always exports all its symbols. in the case of the static pie
entry point, however, the dynamic symbol table does not contain the
necessary symbol unless -rdynamic/-E was used when linking. this
linking requirement is a major obstacle both to practical use of
static-pie as a nommu binary format (since it greatly enlarges the
file) and to upstream toolchain support for static-pie (adding -E to
default linking specs is not reasonable).
this patch replaces the runtime symbolic lookup with a link-time
lookup via an inline asm fragment, which reloc.h is responsible for
providing. in this initial commit, the asm is provided only for i386,
and the old lookup code is left in place as a fallback for archs that
have not yet transitioned.
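the i386 fragment can compute the address of the (hidden) stage-2
symbol pc-relatively; a sketch of the idea (the macro interface and
the exact asm are illustrative, not the verbatim reloc.h contents):

  /* conceptual usage: GETFUNCSYM(&stage2_ptr, __dls2); */
  #define GETFUNCSYM(fp, sym) __asm__ ( \
      ".hidden " #sym "\n" \
      "    call 1f\n" \
      "1:  addl $" #sym "-1b,(%%esp)\n" \
      "    pop %0\n" \
      : "=r"(*(fp)) : : "memory" )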
modifying crt_arch.h to pass the stage-2 function pointer as an
argument was considered as an alternative, but such an approach would
not be compatible with fdpic, where it's impossible to compute
function pointers without already having performed relocations. it was
also deemed desirable to keep crt_arch.h as simple/minimal as
possible.
in principle, archs with pc-relative or got-relative addressing of
static variables could instead load the stage-2 function pointer from
a static volatile object. that does not work for fdpic, and is not
safe against reordering on mips-like archs that use got slots even for
static functions, but it's valid on i386 and many others, and could
provide a reasonable default implementation in the future.
this error was only found by reading the code, but it seems to have
been causing gcc to produce wrong code in malloc: the same register
was used for the output and the high word of the input. in principle
this could have caused an infinite loop searching for an available
bin, but in practice most x86 models seem to implement the "undefined"
result of the bsf instruction as "unchanged".
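the fix amounts to marking the output early-clobber so it cannot share
a register with the high-word input; a sketch of the corrected i386
routine:

  #include <stdint.h>

  static inline int a_ctz_64(uint64_t x)
  {
      int r;
      /* "=&r" (early clobber) keeps r out of the input registers */
      __asm__ ( "bsf %1,%0 ; jnz 1f ; bsf %2,%0 ; add $32,%0\n1:"
          : "=&r"(r) : "r"((unsigned)x), "r"((unsigned)(x>>32)) );
      return r;
  }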
despite being strongly ordered, the x86 memory model does not preclude
reordering of loads across earlier stores. while a plain store
suffices as a release barrier, we actually need a full barrier, since
users of a_store subsequently load a waiter count to determine whether
to issue a futex wait, and using a stale count will result in soft
(fail-to-wake) deadlocks. these deadlocks were observed in malloc and
possible with stdio locks and other libc-internal locking.
on i386, an atomic operation on the caller's stack is used as the
barrier rather than performing the store itself using xchg; this
avoids the need to read the cache line on which the store is being
performed. mfence is used on x86_64 where it's always available, and
could be used on i386 with the appropriate cpu model checks if it's
shown to perform better.
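a sketch close to the i386 variant described: the store itself stays a
plain mov, and the full barrier comes from a locked read-modify-write
on the caller's own stack.

  static inline void a_store(volatile int *p, int x)
  {
      __asm__ __volatile__ (
          "mov %1, %0 ; lock ; orl $0,(%%esp)"
          : "=m"(*p) : "r"(x) : "memory");
  }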
conceptually, and on other archs, these functions take a pointer to
int, but in the i386, x86_64, and x32 versions of atomic.h, they took
a pointer to void instead.
this overhaul further reduces the amount of arch-specific code needed
by the dynamic linker and removes a number of assumptions, including:
- that symbolic function references inside libc are bound at link time
via the linker option -Bsymbolic-functions.
- that libc functions used by the dynamic linker do not require
access to data symbols.
- that static/internal function calls and data accesses can be made
without performing any relocations, or that arch-specific startup
code handled any such relocations needed.
removing these assumptions paves the way for allowing libc.so itself
to be built with stack protector (among other things), and is achieved
by a three-stage bootstrap process:
1. relative relocations are processed with a flat function.
2. symbolic relocations are processed with no external calls/data.
3. main program and dependency libs are processed with a
fully-functional libc/ldso.
reduction in arch-specific code is achieved through the following:
- crt_arch.h, used for generating crt1.o, now provides the entry point
for the dynamic linker too.
- asm is no longer responsible for skipping the beginning of argv[]
when ldso is invoked as a command.
- the functionality previously provided by __reloc_self for heavily
GOT-dependent RISC archs is now the arch-agnostic stage-1.
- arch-specific relocation type codes are mapped directly as macros
rather than via an inline translation function/switch statement.
while O_PATH is the same for all presently supported archs, it differs
at least on sparc, and conceptually it's no less arch-specific than the
other O_* macros. O_SEARCH and O_EXEC are still defined in terms of
O_PATH in the main fcntl.h.
the previous values (2k min and 8k default) were too small for some
archs. aarch64 reserves 4k in the signal context for future extensions
and requires about 4.5k total, and powerpc reportedly uses over 2k.
the new minimums are chosen to fit the saved context and also allow a
minimal signal handler to run.
since the default (SIGSTKSZ) has always been 6k larger than the
minimum, it is also increased to maintain the 6k usable by the signal
handler. this happens to be able to store one pathname buffer and
should be sufficient for calling any function in libc that doesn't
involve conversion between floating point and decimal representations.
x86 (both 32-bit and 64-bit variants) may also need a larger minimum
(around 2.5k) in the future to support avx-512, but the values on
these archs are left alone for now pending further analysis.
the value for PTHREAD_STACK_MIN is not increased to match MINSIGSTKSZ
at this time. this is so as not to preclude applications from using
extremely small thread stacks when they know they will not be handling
signals. unfortunately cancellation and multi-threaded set*id() use
signals as an implementation detail and therefore require a stack
large enough for a signal context, so applications which use extremely
small thread stacks may still need to avoid using these features.
these macros have the same distinct definition on blackfin, frv, m68k,
mips, sparc and xtensa kernels. POLLMSG and POLLRDHUP additionally
differ on sparc.
the memory model we use internally for atomics permits plain loads of
values which may be subject to concurrent modification without
requiring that a special load function be used. since a compiler is
free to make transformations that alter the number of loads or the way
in which loads are performed, the compiler is theoretically free to
break this usage. the most obvious concern is with atomic cas
constructs: something of the form tmp=*p;a_cas(p,tmp,f(tmp)); could be
transformed to a_cas(p,*p,f(*p)); where the latter is intended to show
multiple loads of *p whose resulting values might fail to be equal;
this would break the atomicity of the whole operation. but even more
fundamental breakage is possible.
with the changes being made now, objects that may be modified by
atomics are modeled as volatile, and the atomic operations performed
on them by other threads are modeled as asynchronous stores by
hardware which happens to be acting on the request of another thread.
such modeling of course does not itself address memory synchronization
between cores/cpus, but that aspect was already handled. this all
seems less than ideal, but it's the best we can do without mandating a
C11 compiler and using the C11 model for atomics.
in the case of pthread_once_t, the ABI type of the underlying object
is not volatile-qualified. so we are assuming that accessing the
object through a volatile-qualified lvalue via casts yields volatile
access semantics. the language of the C standard is somewhat unclear
on this matter, but this is an assumption the linux kernel also makes,
and seems to be the correct interpretation of the standard.
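a minimal sketch of the access pattern this describes for
pthread_once_t, whose ABI type remains a plain int:

  #include <pthread.h>

  /* reads and writes of the control word go through a volatile-qualified
   * lvalue, even though pthread_once_t itself is not volatile-qualified */
  static int once_state(pthread_once_t *control)
  {
      return *(volatile int *)control;
  }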
this syscall allows fexecve to be implemented without /proc; it is new
in linux v3.19, added in commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
(sh and microblaze do not have allocated syscall numbers yet)
added a x32 fix as well: the io_setup and io_submit syscalls are no
longer common with x86_64, so use the x32 specific numbers.
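a hedged sketch of the /proc-free fexecve this enables: the syscall
added by the referenced kernel commit is execveat, and AT_EMPTY_PATH
makes an empty pathname mean the file behind the fd itself.

  #include <fcntl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int fexecve_sketch(int fd, char *const argv[], char *const envp[])
  {
      /* empty pathname + AT_EMPTY_PATH executes the file behind fd */
      syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
      return -1; /* reached only if the syscall failed; errno is set */
  }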
the definitions are generic for all kernel archs. exposure of these
macros now only occurs under the same feature test as for the function
accepting them, which is believed to be more correct.
these syscalls are new in linux v3.18, bpf is present on all
supported archs except sh; kexec_file_load has only been allocated for
x86_64 and x32 so far.
bpf was added in linux commit 99c55f7d47c0dc6fc64729f37bf435abf43f4c60
kexec_file_load syscall number was allocated in commit
f0895685c7fd8c938c91a9d8a6f7c11f22df58d2
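a hedged usage sketch of the bpf entry point; the map parameters are
arbitrary, and linux/bpf.h provides union bpf_attr:

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* create a small array map just to exercise the new syscall */
  int make_array_map(void)
  {
      union bpf_attr attr;
      memset(&attr, 0, sizeof attr);
      attr.map_type = BPF_MAP_TYPE_ARRAY;
      attr.key_size = 4;          /* array maps require 4-byte keys */
      attr.value_size = 8;
      attr.max_entries = 16;
      return syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof attr);
  }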