Rich Felker [Fri, 5 Oct 2012 01:08:53 +0000 (21:08 -0400)]
remove freeing of dynamic linker data when dlopen/dlsym are not used
this was an optimization to save/recover a minimal amount of extra
memory for use by malloc, that's becoming increasingly costly to keep
around. freeing this data:
1. breaks debugging with gdb (it can't find library symbols)
2. breaks thread-local storage in shared libraries
it would be possible to disable freeing when TLS is used, but in
addition to the above breakages, tracking whether dlopen/dlsym is used
adds a cost to every symbol lookup, possibly making program startup
slower for large programs. combined with the complexity, it's not
worth it. we already save/recover plenty of memory in the dynamic
linker with reclaim_gaps.
Rich Felker [Fri, 5 Oct 2012 01:01:56 +0000 (21:01 -0400)]
beginnings of full TLS support in shared libraries
this code will not work yet because the necessary relocations are not
supported, and cannot be supported without some internal changes to
how relocation processing works (coming soon).
Rich Felker [Thu, 4 Oct 2012 20:35:46 +0000 (16:35 -0400)]
TLS (GNU/C11 thread-local storage) support for static-linked programs
the design for TLS in dynamic-linked programs is mostly complete too,
but I have not yet implemented it. cost is nonzero but still low for
programs which do not use TLS and/or do not use threads (a few hundred
bytes of new code, plus dependency on memcpy). i believe it can be
made smaller at some point by merging __init_tls and __init_security
into __libc_start_main and avoiding duplicate auxv-parsing code.
at the same time, I've also slightly changed the logic pthread_create
uses to allocate guard pages to ensure that guard pages are not
counted towards commit charge.
Rich Felker [Wed, 3 Oct 2012 15:49:58 +0000 (11:49 -0400)]
tell the assembler to mark all files as not requiring executable stack
for some reason this option is undocumented. not sure when it was
added, so I'm using a configure test. gcc was already setting the mark
correctly for C files, but assembler source files would need ugly
.note boilerplate in every single file to achieve this without the
option to the assembler.
blame whoever thought it would be a good idea to make the stack
executable by default rather than doing it the other way around...
Rich Felker [Mon, 1 Oct 2012 00:00:38 +0000 (20:00 -0400)]
add getopt reset support
based on proposed patches by Daniel Cegiełka, with minor changes:
- use a weak symbol for optreset so it doesn't clash with namespace
- also reset optpos (position in multi-option arg like -lR)
- also make getopt_long support reset
this function was overly complicated and not even obviously correct.
avoid using openat/linkat just like in shm_open, and instead expand
pathname using code shared with shm_open. remove bogus (and dangerous,
with priorities) use of spinlocks.
this commit also heavily streamlines the code and ensures there are no
failure cases that can happen after a new semaphore has been created
in the filesystem, since that case is unreportable.
add 'e' modifier (close-on-exec) to fopen and fdopen
this feature will be in the next version of POSIX, and can be used
internally immediately. there are many internal uses of fopen where
close-on-exec is needed to fix bugs.
emulate SOCK_CLOEXEC and SOCK_NONBLOCK for old (pre-2.6.27) kernels
also update syslog to use SOCK_CLOEXEC rather than separate fcntl
step, to make it safe in multithreaded programs that run external
programs.
emulation is not atomic; it could be made atomic by holding a lock on
forking during the operation, but this seems like overkill. my goal is
not to achieve perfect behavior on old kernels (which have plenty of
other imperfect behavior already) but to avoid catastrophic breakage
in (1) syslog, which would give no output on old kernels with the
change to use SOCK_CLOEXEC, and (2) programs built on a new kernel
where configure scripts detected a working SOCK_CLOEXEC, which later
get run on older kernels (they may otherwise fail to work completely).
based on initial work by rdp, with heavy modifications. some features
including threads are untested because qemu app-level emulation seems
to be broken and I do not have a proper system image for testing.
optimize strchrnul/strcspn not to scan string twice on no-match
when strchr fails, and important piece of information already
computed, the string length, is thrown away. have strchrnul (with
namespace protection) be the underlying function so this information
can be kept, and let strchr be a wrapper for it. this also allows
strcspn to be considerably faster in the case where the match set has
a single element that's not matched.
slightly cleaner strlen, also seems to compile to better code
testing with gcc 4.6.3 on x86, -Os, the old version does a duplicate
null byte check after the first loop. this is purely the compiler
being stupid, but the old code was also stupid and unintuitive in how
it expressed the check.
austin group interpretation for defect #529
(http://austingroupbugs.net/view.php?id=529) tightens the
requirements on close such that, if it returns with EINTR, the file
descriptor must not be closed. the linux kernel developers vehemently
disagree with this, and will not change it. we catch and remap EINTR
to EINPROGRESS, which the standard allows close() to return when the
operation was not finished but the file descriptor has been closed.
new behavior can be summarized as:
inputs that parse completely as a decimal number are treated as one,
and rejected only if the result is out of 16-bit range.
inputs that do not parse as a decimal number (where strtoul leaves
anything left over in the input) are searched in /etc/services.
offer REALGCC variable to configure musl-gcc wrapper at runtime
this is useful when the underlying gcc is already a wrapper, which is
the case at least on some uclibc-based system images. it's also useful
for running an older/newer/nondefault version of gcc.
it was determined in discussion that these kind of limits are not
sufficient to protect single-threaded servers against denial of
service attacks from maliciously large round counts. the time scales
simply vary too much; many users will want login passwords with rounds
counts on a scale that gives decisecond latency, while highly loaded
webservers will need millisecond latency or shorter.
still some limit is left in place; the idea is not to protect against
attacks, but to avoid the runtime of a single call to crypt being, for
all practical purposes, infinite, so that configuration errors can be
caught and fixed without bringing down whole systems. these limits are
very high, on the order of minute-long runtimes for modest systems.
if same register is used for input/output, the compiler must be told.
otherwise is generates random junk code that clobbers the result. in
pure syscall-wrapper functions, nothing went wrong, but in more
complex functions where register allocation is non-trivial, things
broke badly.
workaround gcc got-register-reload performance problems in malloc
with this patch, the malloc in libc.so built with -Os is nearly the
same speed as the one built with -O3. thus it solves the performance
regression that resulted from removing the forced -O3 when building
libc.so; now libc.so can be both small and fast.
I originally added -O3 for shared libraries to counteract very bad
behavior by GCC when building PIC code: it insists on reloading the
GOT register in static functions that need it, even if the address of
the function is never leaked from the translation unit and all local
callers of the function have already loaded the GOT register. this
measurably degrades performance in a few key areas like malloc. the
inlining done at -O3 avoids the issue, but that's really not a good
reason for overriding the user's choice of optimization level.
vfork is implemented as the fork syscall (with no atfork handlers run)
on archs where it is not available, so this change does not introduce
any change in behavior or regression for such archs.
I'm not 100% sure that Linux's O_PATH meets the POSIX requirements for
O_SEARCH, but it seems very close if not perfect. and old kernels
ignore it, so O_SEARCH will still work as desired as long as the
caller has read permissions to the directory.
improve mips syscall asm constraints to use immediates, if possible
by using the "ir" constraint (immediate or register) and the carefully
constructed instruction addu $2,$0,%2 which can take either an
immediate or a register for %2, the new inline asm admits maximal
optimization with no register spillage to the stack when the compiler
successfully performs constant propagration, but still works by
allocating a register when the syscall number cannot be recognized as
a constant. in the case of syscalls with 0-3 arguments it barely
matters, but for 4-argument syscalls, using an immediate for the
syscall number avoids creating a stack frame for the syscall wrapper
function.
eliminate assumption that mips syscall restart preserves r25
all past and current kernel versions have done so, but there seems to
be no reason it's necessary and the sentiment from everyone I've asked
has been that we should not rely on it. instead, use r7 (an argument
register) which will necessarily be preserved upon syscall restart.
however this only works for 0-3 argument syscalls, and we have to
resort to the function call for 4-argument syscalls.
for the sake of simplicity, I've only used rep movsb rather than
breaking up the copy for using rep movsd/q. on all modern cpus, this
seems to be fine, but if there are performance problems, there might
be a need to go back and add support for rep movsd/q.
before restrict was added, memove called memcpy for forward copies and
used a byte-at-a-time loop for reverse copies. this was changed to
avoid invoking UB now that memcpy has an undefined copying order,
making memmove considerably slower.
performance is still rather bad, so I'll be adding asm soon.
add LIBCC (compiler runtime) logic and override to configure
this should both fix the issue with ARM needing -lgcc_eh (although
that's really a bug in the libgcc build process that's causing
considerable bloat, which should be fixed) and make it easier to build
musl using clang/llvm in place of gcc. unfortunately I don't know a
good way to detect and support pcc's -lpcc since it's not in pcc's
default library search path...
no syscalls actually use that many arguments; the issue is that some
syscalls with 64-bit arguments have them ordered badly so that
breaking them into aligned 32-bit half-arguments wastes slots with
padding, and a 7th slot is needed for the last argument.
this code was using $10 to save the syscall number, but $10 is not
necessarily preserved by the kernel across syscalls. only mattered for
syscalls that got interrupted by a signal and restarted. as far as i
can tell, $25 is preserved by the kernel across syscalls.
something is wrong with the logic for the argument layout, resulting
in compile errors on mips due to too many args to syscall... further
information on how it's supposed to work will be needed before it can
be reactivated.
now public syscall.h only exposes __NR_* and SYS_* constants and the
variadic syscall function. no macros or inline functions, no
__syscall_ret or other internal details, no 16-/32-bit legacy syscall
renaming, etc. this logic has all been moved to src/internal/syscall.h
with the arch-specific parts in arch/$(ARCH)/syscall_arch.h, and the
amount of arch-specific stuff has been reduced to a minimum.
changes still need to be reviewed/double-checked. minimal testing on
i386 and mips has already been performed.
this is equivalent to posix_fallocate except that it has an extra
mode/flags argument to control its behavior, and stores the error in
errno rather than returning an error code.
default features: make musl usable without feature test macros
the old behavior of exposing nothing except plain ISO C can be
obtained by defining __STRICT_ANSI__ or using a compiler option (such
as -std=c99) that predefines it. the new default featureset is POSIX
with XSI plus _BSD_SOURCE. any explicit feature test macros will
inhibit the default.
installation docs have also been updated to reflect this change.
clang does not presently support the "v" constraint we want to use to
get the result from $3, and trying to use register...__asm__("$3") to
do the same invokes serious compiler bugs. so for now, i'm working
around the issue with an extra temp register and putting $3 in the
clobber list instead of using it as output. when the bugs in clang are
fixed, this issue should be revisited to generate smaller/faster code
like what gcc gets.
previously, it was pretty much random which one of these trees a given
function appeared in. they have now been organized into:
src/linux: non-POSIX linux syscalls (possibly shard with other nixen)
src/legacy: various obsolete/legacy functions, mostly wrappers
src/misc: still mostly uncategorized; some misc POSIX, some nonstd
src/crypt: crypt hash functions
so far, this is the only actual use of loff_t i've found. some
software, including glib, assumes loff_t must exist if splice exists;
this is a reasonable assumption since the official prototype for
splice uses loff_t, as it always works with 64-bit offsets regardless
of the selected libc off_t size. i'm using #define for now rather than
a typedef to make it easy to define in other headers if necessary
(like the LFS64 ugliness), but it may be necessary to add it to
alltypes.h eventually if other functions end up needing it.
further use of _Noreturn, for non-plain-C functions
note that POSIX does not specify these functions as _Noreturn, because
POSIX is aligned with C99, not the new C11 standard. when POSIX is
eventually updated to C11, it will almost surely give these functions
the _Noreturn attribute. for now, the actual _Noreturn keyword is not
used anyway when compiling with a c99 compiler, which is what POSIX
requires; the GCC __attribute__ is used instead if it's available,
however.
in a few places, I've added infinite for loops at the end of _Noreturn
functions to silence compiler warnings. presumably
__buildin_unreachable could achieve the same thing, but it would only
work on newer GCCs and would not be portable. the loops should have
near-zero code size cost anyway.
like the previous _Noreturn commit, this one is based on patches
contributed by philomath.
use restrict everywhere it's required by c99 and/or posix 2008
to deal with the fact that the public headers may be used with pre-c99
compilers, __restrict is used in place of restrict, and defined
appropriately for any supported compiler. we also avoid the form
[restrict] since older versions of gcc rejected it due to a bug in the
original c99 standard, and instead use the form *restrict.
remove dependency of wmemmove on wmemcpy direction
unlike the memmove commit, this one should be fine to leave in place.
wmemmove is not performance-critical, and even if it were, it's
already copying whole 32-bit words at a time instead of bytes.
this commit introduces a performance regression in many uses of
memmove, which will need to be addressed before the next release. i'm
making it as a temporary measure so that the restrict patch can be
committed without invoking undefined behavior when memmove calls
memcpy with overlapping regions.
avoid "inline" in public headers for strict c89 compatibility
while musl itself requires a c99 compiler, some applications insist on
being compiled with c89 compilers, and use of "inline" in the headers
was breaking them. much of this had been avoided already by just
skipping the inline keyword in pre-c99 compilers or modes, but this
new unified solution is cleaner and may/should result in better code
generation in the default gcc configuration.