Rich Felker [Mon, 6 Feb 2012 19:39:09 +0000 (14:39 -0500)]
add support for init/finit (constructors and destructors)
this is mainly in hopes of supporting c++ (not yet possible for other
reasons) but will also help applications/libraries which use (and more
often, abuse) the gcc __attribute__((__constructor__)) feature in "C"
code.
x86_64 and arm versions of the new startup asm are untested and may
have minor problems.
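
for illustration, a minimal sketch of the kind of code this enables, assuming a gcc-compatible compiler. the names demo_init, demo_fini and demo_ctor_calls are hypothetical and not part of musl.

```c
/* counts how many times the constructor has run */
static int ctor_calls;

/* runs before main() when the startup code processes init entries */
__attribute__((__constructor__))
static void demo_init(void)
{
    ctor_calls++;
}

/* runs at normal process exit, after main() returns */
__attribute__((__destructor__))
static void demo_fini(void)
{
    ctor_calls = 0;
}

/* lets the rest of the program observe that the constructor ran */
int demo_ctor_calls(void)
{
    return ctor_calls;
}
```

by the time main() is entered, demo_ctor_calls() should report 1 if constructors are supported.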
Rich Felker [Fri, 3 Feb 2012 08:16:07 +0000 (03:16 -0500)]
include dummied-out dlopen and dlsym functions for static binaries
these don't work (or do anything at all) but at least make it possible
to static link programs that insist on "having" dynamic loading
support...as long as they don't actually need to use it.
adding real support for dlopen/dlsym with static linking is going to
be significantly more difficult...
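
a rough sketch of what such dummied-out functions could look like. the stub_ prefix and the choice to return NULL are assumptions for illustration; the real symbols would be dlopen and dlsym, and their exact behavior is not spelled out here.

```c
#include <stddef.h>

/* hypothetical stand-in for dlopen: never loads anything */
void *stub_dlopen(const char *file, int mode)
{
    (void)file;
    (void)mode;
    return NULL;   /* assumed failure indication */
}

/* hypothetical stand-in for dlsym: never finds a symbol */
void *stub_dlsym(void *handle, const char *name)
{
    (void)handle;
    (void)name;
    return NULL;
}
```

a program that merely references these symbols links fine; one that depends on a non-NULL result cannot work.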
Rich Felker [Thu, 2 Feb 2012 05:11:29 +0000 (00:11 -0500)]
make stdio open, read, and write operations cancellation points
it should be noted that only the actual underlying buffer flush and
fill operations are cancellable, not reads from or writes to the
buffer. this behavior is compatible with POSIX, which makes all
cancellation points in stdio optional, and it achieves the goal of
allowing cancellation of a thread that's "stuck" on IO (due to a
non-responsive socket/pipe peer, slow/stuck hardware, etc.) without
imposing any measurable performance cost.
Rich Felker [Tue, 24 Jan 2012 05:22:27 +0000 (00:22 -0500)]
make gcc wrapper support -shared correctly
it was previously attempting to link start files as part of shared
objects. this is definitely wrong and depending on the platform and
linker could range from just adding extraneous junk to introducing
textrels to making linking fail entirely.
Rich Felker [Sun, 22 Jan 2012 22:19:37 +0000 (17:19 -0500)]
fix cancellation failure in single-threaded programs
even a single-threaded program can be cancellable, e.g. if it has called
pthread_cancel(pthread_self()). the correct predicate to check is not
whether multiple threads have been invoked, but whether pthread_self
has been invoked.
Rich Felker [Fri, 20 Jan 2012 16:14:27 +0000 (11:14 -0500)]
fix dynamic linker not to depend on DYNAMIC ptr in 0th entry of GOT
this fixes an issue using gold instead of gnu ld for linking. it also
should eliminate the need for the startup code to even load/pass the
got address to the dynamic linker.
based on patch submitted by sh4rm4 with minor cosmetic changes.
Rich Felker [Thu, 19 Jan 2012 04:28:48 +0000 (23:28 -0500)]
alias basename to glibc name for it, to meet abi goals
note that regardless of the name used, basename is always conformant.
it never takes on the bogus gnu behavior, unlike glibc where basename
is nonconformant when declared manually without including libgen.h.
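
a small sketch of the conformant behavior, using the POSIX basename from libgen.h. posix_base is a hypothetical helper; it copies the path first because POSIX allows basename to modify its argument.

```c
#include <libgen.h>
#include <string.h>

/* returns the final path component of path, using buf as scratch space */
const char *posix_base(const char *path, char *buf, size_t len)
{
    strncpy(buf, path, len - 1);
    buf[len - 1] = '\0';
    return basename(buf);
}
```

with the conformant version, a trailing slash is ignored, so "/usr/lib/" yields "lib". the gnu variant returns an empty string for the same input.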
Rich Felker [Thu, 17 Nov 2011 04:59:28 +0000 (23:59 -0500)]
fix issue with excessive mremap syscalls on realloc
CHUNK_SIZE macro was defined incorrectly and was shaving off at least one
significant bit in the size of mmapped chunks, resulting in the test
for oldlen==newlen always failing and incurring a syscall. fortunately
i don't think this issue caused any other observable behavior; the
definition worked correctly for all non-mmapped chunks where its
correctness matters more, since their lengths are always multiples of
the alignment.
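
the effect can be sketched as follows. this is not the allocator's actual code: DEMO_ALIGN, DEMO_INUSE and both helper names are hypothetical, and the values are chosen only to show how masking with the alignment loses a significant bit when the stored length is not a multiple of it.

```c
#include <stddef.h>

#define DEMO_ALIGN ((size_t)32)  /* hypothetical allocation alignment */
#define DEMO_INUSE ((size_t)1)   /* hypothetical flag kept in the low bit */

/* over-aggressive mask: clears every bit below the alignment */
size_t demo_size_overmasked(size_t stored)
{
    return stored & -DEMO_ALIGN;
}

/* mask that removes only the flag bit and keeps the full length */
size_t demo_size_flag_only(size_t stored)
{
    return stored & ~DEMO_INUSE;
}
```

for a stored length of 4112 the first helper yields 4096, so a comparison against the true length never matches; for lengths that are multiples of the alignment both helpers agree.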
Rich Felker [Sat, 15 Oct 2011 04:28:49 +0000 (00:28 -0400)]
don't define wchar_t on c++
it's a keyword in c++ (wtf). i'm not sure this is the cleanest
solution; it might be better to avoid ever defining __NEED_wchar_t on
c++. but in any case, this works for now.
Rich Felker [Sat, 15 Oct 2011 03:31:04 +0000 (23:31 -0400)]
add dummy __cxa_finalize
musl's dynamic linker does not support unloading dsos, so there's
nothing for this function to do. adding the symbol in case anything
depends on its presence.
Rich Felker [Mon, 10 Oct 2011 02:51:03 +0000 (22:51 -0400)]
fix F_GETOWN return value handling
the fcntl syscall can return a negative value when the command is
F_GETOWN, and this is not an error code but an actual value. thus we
must special-case it and avoid calling __syscall_ret to set errno.
this fix is better than the glibc fix (using F_GETOWN_EX) which only
works on newer kernels and is more complex.
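
a sketch of the special case, under stated assumptions: DEMO_GETOWN and DEMO_GETFL are hypothetical stand-ins for the real command values, and demo_fcntl_ret is a hypothetical helper that models only the return-value handling, not the syscall itself.

```c
#include <errno.h>

enum { DEMO_GETFL = 1, DEMO_GETOWN = 2 };  /* hypothetical command values */

/* converts a raw kernel return value into the value fcntl hands back */
long demo_fcntl_ret(int cmd, long raw)
{
    /* for the get-owner command a negative result is a real value
     * (a process group), so it must bypass errno conversion */
    if (cmd == DEMO_GETOWN)
        return raw;

    /* ordinary commands: small negative values are error codes */
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```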
Rich Felker [Mon, 3 Oct 2011 04:19:05 +0000 (00:19 -0400)]
simplify robust mutex unlock code path
right now it's questionable whether this change is an improvement or
not, but if we later want to support priority inheritance mutexes, it
will be important to have the code paths unified like this to avoid
major code duplication.
Rich Felker [Mon, 3 Oct 2011 04:09:08 +0000 (00:09 -0400)]
use count=0 instead of 1 for recursive mutex with only one lock reference
this simplifies the code paths slightly, but perhaps what's nicer is
that it makes recursive mutexes fully reentrant, i.e. locking and
unlocking from a signal handler works even if the interrupted code was
in the middle of locking or unlocking.
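
the counting convention alone can be sketched like this. it is a non-atomic, single-threaded toy with hypothetical names (struct demo_rmutex, demo_lock, demo_unlock); it shows only that the first lock reference leaves count at 0, and does not attempt to model the atomics or the signal-handler reentrancy.

```c
/* owner 0 means unlocked; count holds references beyond the first */
struct demo_rmutex {
    int owner;
    int count;
};

int demo_lock(struct demo_rmutex *m, int self)
{
    if (m->owner == self) {   /* recursive acquisition */
        m->count++;
        return 0;
    }
    if (m->owner != 0)        /* held by another thread; a real lock would wait */
        return -1;
    m->owner = self;          /* first reference: count stays 0 */
    return 0;
}

int demo_unlock(struct demo_rmutex *m, int self)
{
    if (m->owner != self)
        return -1;
    if (m->count) {           /* drop one recursive reference */
        m->count--;
        return 0;
    }
    m->owner = 0;             /* last reference released */
    return 0;
}
```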
Rich Felker [Sat, 1 Oct 2011 13:11:35 +0000 (09:11 -0400)]
fix failure-to-wake in rwlock unlock
a reader unlocking the lock need only wake one waiter (necessarily a
writer), but a writer unlocking the lock must wake all waiters
(necessarily readers). if it only wakes one, the remainder can remain
blocked indefinitely, or at least until the first reader unlocks (in
which case the whole lock becomes serialized and behaves as a mutex
rather than a read lock).
there is no need to send a wake when the lock count does not hit zero,
but when it does, all waiters must be woken (since all with the same
sign are eligible to obtain the lock).
eliminate the sequence number field and instead use the counter as the futex
because of the way the lock is held, sequence numbers are
completely useless, and this frees up a field in the barrier structure
to be used as a waiter count for the count futex, which lets us avoid
some syscalls in the best case.
as of now, self-synchronized destruction and unmapping should be fully
safe. before any thread can return from the barrier, all threads in
the barrier have obtained the vm lock, and each holds a shared lock on
the barrier. the barrier memory is not inspected after the shared lock
count reaches 0, nor after the vm lock is released.
it was assuming the result of the condition it was supposed to be
checking for, i.e. that the thread ptr had already been initialized by
pthread_mutex_lock. use the slower call to be safe.
improve/debloat mutex unlock error checking in pthread_cond_wait
we're not required to check this except for error-checking mutexes,
but it doesn't hurt. the new test is actually simpler/lighter, and it
also eliminates the need to later check that pthread_mutex_unlock
succeeds.
when used with error-checking mutexes, pthread_cond_wait is required
to fail with EPERM if the mutex is not locked by the caller.
previously we relied on pthread_mutex_unlock to generate the error,
but this is not valid, since in the case of such invalid usage the
internal state of the cond variable has already been potentially
corrupted (due to access outside the control of the mutex). thus, we
have to check first.
process-shared barrier support, based on discussion with bdonlan
this implementation is rather heavy-weight, but it's the first
solution i've found that's actually correct. all waiters actually wait
twice at the barrier so that they can synchronize exit, and they hold
a "vm lock" that prevents changes to virtual memory mappings (and
blocks pthread_barrier_destroy) until all waiters are finished
inspecting the barrier.
thus, it is safe for any thread to destroy and/or unmap the barrier's
memory as soon as pthread_barrier_wait returns, without further
synchronization.
fix ctype macros to cast argument to (unsigned) first
issue reported by nsz, but it's actually not just pedantic. the
functions can take input of any arithmetic type, including floating
point, and the behavior needs to be as if the conversion implicit in
the function call took place.
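
a sketch of why the order of the cast matters. both macro names are hypothetical, and the "late cast" form is only an assumption about what a problematic definition could look like.

```c
/* cast the argument first, then subtract: matches the conversion a
 * real function call would apply to the argument */
#define DEMO_ISDIGIT(a) (((unsigned)(a) - '0') < 10)

/* cast after subtracting: a floating point argument just below '0'
 * gives a small negative difference that truncates to 0 */
#define DEMO_ISDIGIT_LATE(a) ((unsigned)((a) - '0') < 10)
```

with an argument of 47.5, the first macro converts to 47 (the character '/') and correctly reports a non-digit, while the late-cast form computes -0.5, truncates it to 0 and reports a digit.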
another cond var fix: requeue count race condition
lock out new waiters during the broadcast. otherwise the wait count
added to the mutex might be lower than the actual number of waiters
moved, and wakeups may be lost.
this issue could also be solved by temporarily setting the mutex
waiter count higher than any possible real count, then relying on the
kernel to tell us how many waiters were requeued, and updating the
counts afterwards. however the logic is more complex, and i don't
really trust the kernel. the solution here is also nice in that it
replaces some atomic cas loops with simple non-atomic ops under lock.
due to moving waiters from the cond var to the mutex in bcast, these
waiters upon wakeup would steal slots in the count from newer waiters
that had not yet been signaled, preventing the signal function from
taking any action.
to solve the problem, we simply use two separate waiter counts, so
that the original "total" waiters count is undisturbed by broadcast
and still available for signal.
the changes to syscall_ret are mostly no-ops in the generated code,
just cleanup of type issues and removal of some implementation-defined
behavior. the one exception is the change in the comparison value,
which is fixed so that 0xf...f000 (which in principle could be a valid
return value for mmap, although probably never in reality) is not
treated as an error return.
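
the comparison can be sketched as below. demo_syscall_ret is a hypothetical name, and the body is an illustration of the boundary described above rather than a copy of the real function.

```c
#include <errno.h>

/* maps a raw syscall result to the libc convention of -1 plus errno */
long demo_syscall_ret(unsigned long r)
{
    /* only values strictly above 0xf...f000 (that is, -4095..-1 when
     * viewed as signed) are error codes; 0xf...f000 itself is a value */
    if (r > -4096UL) {
        errno = (int)-r;
        return -1;
    }
    return (long)r;
}
```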
testing revealed that the old implementation, while correct, was
giving way too many spurious wakeups due to races changing the value
of the condition futex. in a test program with 5 threads receiving
broadcast signals, the number of returns from pthread_cond_wait was
roughly 3 times what it should have been (2 spurious wakeups for every
legitimate wakeup). moreover, the magnitude of this effect seems to
grow with the number of threads.
the old implementation may also have had some nasty race conditions
with reuse of the cond var with a new mutex.
the new implementation is based on incrementing a sequence number with
each signal event. this sequence number has nothing to do with the
number of threads intended to be woken; it's only used to provide a
value for the futex wait to avoid deadlock. in theory there is a
danger of race conditions due to the value wrapping around after 2^32
signals. it would be nice to eliminate that, if there's a way.
testing showed no spurious wakeups (though they are of course
possible) with the new implementation, as well as slightly improved
performance.
using swap has a race condition: the waiters must be added to the
mutex waiter count *before* they are taken off the cond var waiter
count, or wake events can be lost.
new futex-requeue-based pthread_cond_broadcast implementation
this avoids the "stampede effect" where pthread_cond_broadcast would
result in all waiters waking up simultaneously, only to immediately
contend for the mutex and go back to sleep.
previously, a waiter could miss the 1->0 transition of block if
another thread set block to 1 again after the signal function set
block to 0. we now use the caller's thread id as a unique token to
store in block, which no other thread will ever write there. this
ensures that if block still contains the tid, no signal has occurred.
spurious wakeups will of course occur whenever there is a spurious
return from the futex wait and another thread has begun waiting on the
cond var. this should be a rare occurrence except perhaps in the
presence of interrupting signal handlers.
signal/bcast operations have been improved by noting that they need
not avoid inspecting the cond var's memory after changing the futex
value. because the standard allows spurious wakeups, there is no way
for an application to distinguish between a spurious wakeup just
before another thread called signal/bcast, and the deliberate wakeup
resulting from the signal/bcast call. thus the woken thread must
assume that the signalling thread may still be waiting to act on the
cond var, and therefore it cannot destroy/unmap the cond var.
FD_ISSET must return an int. this is the easiest way.
casting to int would not be correct because high bits could be lost.
mapping the high bits down onto low bits would be costlier in the
common case where the result is just used in a conditional. changing
the type of the bit array elements to int would permute the order of
the bit array on 64-bit big endian systems, so that's not an option
either.
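
one way to get an int without losing high bits is double negation, sketched here on a single 64-bit word. the helper names are hypothetical, the fixed-width word is an assumption made so the example behaves the same on all targets, and whether the actual macro uses this exact form is not stated above.

```c
#include <stdint.h>

/* collapses any nonzero bit test to exactly 1, so no bits are lost */
int demo_isset_bang(uint64_t word, unsigned bit)
{
    return !!(word & ((uint64_t)1 << bit));
}

/* models a narrowing cast: bits at position 32 and above vanish */
int demo_isset_trunc(uint64_t word, unsigned bit)
{
    return (int)(uint32_t)(word & ((uint64_t)1 << bit));
}
```

for a word with only bit 40 set, the first helper returns 1 and the second returns 0.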
actually this is just to avoid gcc being stupid and refusing to inline
the function version, even when the size cost is essentially identical
whether it's inlined or not.