Rich Felker [Sun, 8 May 2011 03:23:58 +0000 (23:23 -0400)]
overhaul implementation-internal signal protections
the new approach relies on the fact that the only ways to create
sigset_t objects without invoking UB are to use the sig*set()
functions, or from the masks returned by sigprocmask, sigaction, etc.
or in the ucontext_t argument to a signal handler. thus, as long as
sigfillset and sigaddset avoid adding the "protected" signals, there
is no way the application will ever obtain a sigset_t including these
bits, and thus no need to add the overhead of checking/clearing them
when sigprocmask or sigaction is called.
note that the old code actually *failed* to remove the bits from
sa_mask when sigaction was called.
the new implementations are also significantly smaller, simpler, and
faster due to ignoring the useless "GNU HURD signals" 65-1024, which
are not used and, if there's any sanity in the world, never will be
used.
Rich Felker [Sat, 7 May 2011 01:45:48 +0000 (21:45 -0400)]
reduce some ridiculously large spin counts
these should be tweaked according to testing. offhand i know 1000 is
too low and 5000 is likely to be sufficiently high. consider trying to
add futexes to file locking, too...
Rich Felker [Sat, 7 May 2011 00:00:59 +0000 (20:00 -0400)]
completely new barrier implementation, addressing major correctness issues
the previous implementation had at least 2 problems:
1. the case where additional threads reached the barrier before the
first wave was finished leaving the barrier was untested and seemed
not to be working.
2. threads leaving the barrier continued to access memory within the
barrier object after other threads had successfully returned from
pthread_barrier_wait. this could lead to memory corruption or crashes
if the barrier object had automatic storage in one of the waiting
threads and went out of scope before all threads finished returning,
or if one thread unmapped the memory in which the barrier object
lived.
the new implementation avoids both problems by making the barrier
state essentially local to the first thread which enters the barrier
wait, and forces that thread to be the last to return.
Rich Felker [Mon, 2 May 2011 13:18:03 +0000 (09:18 -0400)]
fix fclose return status logic, again
the previous fix was incorrect, as it would prevent f->close(f) from
being called if fflush(f) failed. i believe this was the original
motivation for using | rather than ||. so now let's just use a second
statement to constrain the order of function calls, and to back to
using |.
Rich Felker [Mon, 2 May 2011 02:16:04 +0000 (22:16 -0400)]
workaround for preprocessor bug in pcc
with this patch, musl compiles and mostly works with pcc 1.0.0. a few
tests are still failing and i'm uncertain whether they are due to
portability problems in musl, or bugs in pcc, but i suspect the
latter.
use compiler builtins for variadic macros when available
this slightly cuts down on the degree musl "fights with" gcc, but more
importantly, it fixes a critical bug when gcc inlines a variadic
function and optimizes out the variadic arguments due to noticing that
they were "not used" (by __builtin_va_arg).
we leave the old code in place if __GNUC__ >= 3 is false; it seems
like it might be necessary at least for tinycc support and perhaps if
anyone ever gets around to fixing gcc 2.95.3 enough to make it work..
replace heap sort with smoothsort implementation by Valentin Ochs
Smoothsort is an adaptive variant of heapsort. This version was
written by Valentin Ochs (apo) specifically for inclusion in musl. I
worked with him to get it working in O(1) memory usage even with giant
array element widths, and to optimize it heavily for size and speed.
It's still roughly 4 times as large as the old heap sort
implementation, but roughly 20 times faster given an almost-sorted
array of 1M elements (20 being the base-2 log of 1M), i.e. it really
does reduce O(n log n) to O(n) in the mostly-sorted case. It's still
somewhat slower than glibc's Introsort for random input, but now
considerably faster than glibc when the input is already sorted, or
mostly sorted.
strictly speaking this and a few other ops should be factored into
asm.h or the file should just be renamed to asm.h, but whatever. clean
it up someday.
1. failed match of literal chars from the format string would always
return matching failure rather than input failure at eof, leading to
infinite loops in some programs.
2. unread of eof would wrongly adjust the character counts reported by
%n, yielding an off-by-one error.
fix minor bugs due to incorrect threaded-predicate semantics
some functions that should have been testing whether pthread_self()
had been called and initialized the thread pointer were instead
testing whether pthread_create() had been called and actually made the
program "threaded". while it's unlikely any mismatch would occur in
real-world problems, this could have introduced subtle bugs. now, we
store the address of the main thread's thread descriptor in the libc
structure and use its presence as a flag that the thread register is
initialized. note that after fork, the calling thread (not necessarily
the original main thread) is the new main thread.
the linux documentation for dup2 says it can fail with EBUSY due to a
race condition with open and dup in the kernel. shield applications
(and the rest of libc) from this nonsense by looping until it succeeds
recheck cancellation disabled flag after syscall returns EINTR
we already checked before making the syscall, but it's possible that a
signal handler interrupted the blocking syscall and disabled
cancellation, and that this is the cause of EINTR. in this case, the
old behavior was testably wrong.
like all other syscalls, close should return to the caller if and only
if it successfully performed its action. it is necessary that the
application be able to determine whether the close succeeded.
clean and simple, but fails when the caller does not have permissions
to open the file for reading or when /proc is not available. i may
replace this with a full implementation later, possibly leaving this
version as an optimization to use when it works.
debloat: use __syscall instead of syscall where possible
don't waste time (and significant code size due to function call
overhead!) setting errno when the result of a syscall does not matter
or when it can't fail.
x86_64 was just plain wrong in the cancel-flag-already-set path, and
crashing.
the more subtle error was not clearing the saved stack pointer before
returning to c code. this could result in the signal handler
misidentifying c code as the pre-syscall part of the asm, and acting
on cancellation at the wrong time, and thus resource leak race
conditions.
also, now __cancel (in the c code) is responsible for clearing the
saved sp in the already-cancelled branch. this means we have to use
call rather than jmp to ensure the stack pointer in the c will never
match what the asm saved.
the goal is to be able to use pthread_setcancelstate internally in
the implementation, whenever a function might want to use functions
which are cancellation points but avoid becoming a cancellation point
itself. i could have just used a separate internal function for
temporarily inhibiting cancellation, but the solution in this commit
is better because (1) it's one less implementation-specific detail in
functions that need to use it, and (2) application code can also get
the same benefit.
previously, pthread_setcancelstate dependend on pthread_self, which
would pull in unwanted thread setup overhead for non-threaded
programs. now, it temporarily stores the state in the global libc
struct if threads have not been initialized, and later moves it if
needed. this way we can instead use __pthread_self, which has no
dependencies and assumes that the thread register is already valid.
this patch improves the correctness, simplicity, and size of
cancellation-related code. modulo any small errors, it should now be
completely conformant, safe, and resource-leak free.
the notion of entering and exiting cancellation-point context has been
completely eliminated and replaced with alternative syscall assembly
code for cancellable syscalls. the assembly is responsible for setting
up execution context information (stack pointer and address of the
syscall instruction) which the cancellation signal handler can use to
determine whether the interrupted code was in a cancellable state.
these changes eliminate race conditions in the previous generation of
cancellation handling code (whereby a cancellation request received
just prior to the syscall would not be processed, leaving the syscall
to block, potentially indefinitely), and remedy an issue where
non-cancellable syscalls made from signal handlers became cancellable
if the signal handler interrupted a cancellation point.
x86_64 asm is untested and may need a second try to get it right.
setting errno here is completely valid, but some programs, notably
busybox printf, assume that errno will not be set during output and
treat this as an error condition. in any case, skipping it slightly
reduces code size and saves time.
use a separate signal from SIGCANCEL for SIGEV_THREAD timers
otherwise we cannot support an application's desire to use
asynchronous cancellation within the callback function. this change
also slightly debloats pthread_create.c.
we take advantage of the fact that unless self->cancelpt is 1,
cancellation cannot happen. so just increment it by 2 to temporarily
block cancellation. this drops pthread_create.o well under 1k.
with datagram sockets, depending on fprintf not to flush the output
early was very fragile; the new version simply uses a small fixed-size
buffer. it could be updated to dynamic-allocate large buffers if
needed, but i can't envision any admin being happy about finding
64kb-long lines in their syslog...
it should be noted that flock does not mix well with standard fcntl
locking, but nonetheless some applications will attempt to use flock
instead of fcntl if both exist. options to configure or small patches
may be needed. debian maintainers have plenty of experience with this
unfortunate situation...
after fork, we have a new process and the pid is equal to the tid of
the new main thread. there is no need to make two separate syscalls to
obtain the same number.
we can do this without violating the namespace now that they are
macros/inline functions rather than extern functions. the motivation
is that gcc was generating giant, slow, horrible code for the old
functions, and now generates a single byte-swapping instruction.