granicus.if.org Git

Fix stack overflow in vn_rdwr() due to memory reclaim

Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similiar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Correctly handle rwsem_is_locked() behavior

A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5.  Details can be found here:

https://bugzilla.redhat.com/show_bug.cgi?id=526092

The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked().  The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.

This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present.  The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers.  Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.

Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros.  We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.

Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner().  This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.

The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit.  However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.

The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization.  Since it is no longer used, I removed it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Correctly detect atomic64_cmpxchg support

The RHEL5 2.6.18-194.7.1.el5 kernel added atomic64_cmpxchg to
asm-x86_64/atomic.h.  That macro is defined in terms of cmpxchg which
is provided by asm/system.h. However, asm/system.h is not #included by
atomic.h in this kernel nor by the autoconf test for atomic64_cmpxchg, so
the test failed with "implicit declaration of function 'cmpxchg'". This
leads the build system to erroneously conclude that the kernel does not
define atomic64_cmpxchg and enable the built-in definition.  This in
turn produces a '"atomic64_cmpxchg" redefined' build warning which is fatal
when building with --enable-debug.  This commit fixes this by including
asm/system.h in the autoconf test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix taskq code to not drop tasks when TQ_SLEEP is used.

When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.

However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.

One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Strfree() should call kfree() not kmem_free()

Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.

A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.

Add uninstall Makefile targets

Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.

Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.

Add Debian and Slackware style packaging via alien

The long term fix for Debian and Slackware style packaging is
to add native support for building these packages.  Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.

As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages.  The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:

  make rpm: Create .rpm packages
  make deb: Create .deb packages
  make tgz: Create .tgz packages
  make pkg: Create the right package type for your distribution

The solution comes with lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:

  1) Will not have the correct dependency information.
  2) Will not not include the kernel version in the release.
  3) Will not handle all differences between distributions.

But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.

Ensure kmem_alloc() and vmem_alloc() never fail

The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"

Fix two minor compiler warnings

In cmd/splat.c there was a comparison between an __u32 and an int. To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.

In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.

Remove deadcode caused by removal of format1 arg

Commit 55abb0929e4fbe326a9737650a167a1a988ad86b removed the never
used format1 argument of spl_debug_msg(). That in turn resulted
in some deadcode which should be removed since it's now useless.

Fix max_ncpus definition.

It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.

However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.

Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs than can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Display DEBUG keyword during module load when --enable-debug is used.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix buggy kmem_{v}asprintf() functions

When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix bcopy() to allow memory area overlap

Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix compilation error due to undefined ACCESS_ONCE macro.

When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Prefix all SPL debug macros with 'S'

To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.

Split <sys/debug.h> header

To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC.  It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.

Proposed fix for oops on SIGINT in splat atomic:64-bit test.

The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility. If splat terminates before the threads, accesses to that memory
location by the other threads become invalid. Splat synchronizes with
the threads with the call:

wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));

Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:

wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));

Fix -Werror=format-security compiler option

Noticed under Ubuntu kernel builds we should be passing a
format specifier and the string, not just the string.

Linux 2.6.35 compat: filp_fsync() dropped 'stuct dentry *'

The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.

Proposed fix for low memory ZFS deadlocks

Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory.  The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation.  Thus we end
up deadlocking.

A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS.  This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().

The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP.  This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced.  The caller still should be able to reclaim memory from
the various slab caches.  We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages.  This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.

We shall see.

Add __divdi3(), remove __udivdi3() kernel dependency

Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The update implementation now passed all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for out purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.

Update config.guess to recognize additional distros

The following distros were added: redhat, fedora, debian,
ubuntu, sles, slackware, and gentoo.

Allow config/build to work with autoconf-2.65

As of autoconf-2.65 the AC_LANG_SOURCE source macro no longer
includes the confdef.h results when expanded. To handle this
simply explicitly include confdef.h in conftest.c. This will
cause two copies to of confdef.h to be added to the test for
earlier autoconf versions but this is not harmful.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Require gawk the usermode helper fails with awk

For some reason when awk invoked by the usermode helper the command
always fails.  Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distro I tested
with all had gawk installed instead of awk.  Anyway, the simplest
thing to do here is to just make gawk mandatory.  I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.

Add configure check for user_path_dir()

I didn't notice at the time but user_path_dir() was not introduced
at the same time as set_fs_pwd() change. I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25. This means
builds against 2.6.25-2.6.26 kernels were broken.

To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.

Use $target_cpu instead of `arch`

We should not be using arch for a few reasons. First off it might
not be installed on their system, and secondly they may be trying
to cross-compile.

Check sourcelink is set before passing to readlink

When no source was found in any of the expected paths treat
this as fatal and provide the user with a hint as to what
they should do.

Implementation of a regression test for TQ_FRONT.

Use 3 threads and 8 tasks.  Dispatch the final 3 tasks with TQ_FRONT.
The first three tasks keep the worker threads busy while we stuff the
queues.  Use msleep() to force a known execution order, assuming
TQ_FRONT is properly honored.  Verify that the expected completion
order occurs.

The splat_taskq_test5_order() function may be useful in more than
one test.  This commit generalizes it by renaming the function to
splat_taskq_test_order() and adding a name argument instead of
assuming SPLAT_TASKQ_TEST5_NAME as the test name.

The documentation for splat taskq regression test #5 swaps the two required
completion orders in the diagram.  This commit corrects the error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Initialize the /dev/splatctl device buffer

On open() and initialize the buffer with the SPL version string. The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Implementation of the TQ_FRONT flag.

Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove AC_DEFINE for DEBUG/NDEBUG

Whoops, I momentarilly forgot I had explicitly set these as CC
options so dependent packages which need to include spl_config.h
would not end up having these defined which can result in
accidentally hanging debug enabled at best, or a build failure
at worst.

Only make compiler warnings fatal with --enable-debug

While in theory I like the idea of compiler warnings always being
fatal.  In practice this causes problems when small harmless errors
cause build failures for end users.  To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure.  This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.

Linux-2.6.33 compat, O_DSYNC flag added

Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag. As of linux-2.6.33 this behavior was
properly split in to O_SYNC and O_DSYNC respectively.

Linux-2.6.33 compat, .ctl_name removed from struct ctl_table

As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed.  The upstream code has been updated
to depend entirely on the the procname member.  To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels.  Older kernels are
supported by having it expand to .ctl_name = X just as before.

Linux-2.6.33 compat, check <generated/utsrelease.h> for UTS_RELEASE

It seems the upstream community moved the definition of UTS_RELEASE
yet again as of linux-2.6.33. Update the build system to check in
all three possible locations where your kernel version may be defined.

$kernelbuild/include/linux/version.h
$kernelbuild/include/linux/utsrelease.h
$kernelbuild/include/generated/utsrelease.h

Add basic README

A simple README with a short summary of the project and a link
directing people to the online documentation.

Treat mutex->owner as volatile

When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus.  This can result in incorrect behavior from
mutex_owned() and mutex_owner().  This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.

Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.

Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit().  When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().

Finally, improve the mutex regression tests.  For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread.  This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.

As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().

Fix subtle race in threads test case

The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited. This basic thread implementation
was correct, this was simply a flaw in the test case.

Accept but ignore TASKQ_DC_BATCH and TQ_FRONT

For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored. This is harmless for
the moment but it does need to be implemented at some point.

Add kmem_vasprintf function

We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().

Revert "Support TQ_FRONT flag used by taskq_dispatch()"

This reverts commit eb12b3782c94113d2d40d2da22265dc4111a672b.

Update warnings in kmem debug code

This fix was long overdue.  Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call.  However,
probably due to lack of time at the moment that informatin never
made it in to the error message.  This patch fixes that and trys
to standardize the kmem debug messages as well.

Add missing header util/sscanf.h

Include kstat.h from kmem.h

It turns out Solaris incidentally includes kstat.h from kmem.h. As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h. To make like easier for everyone I do the same.

Support TQ_FRONT flag used by taskq_dispatch()

Allow taskq_dispatch() to insert work items at the head of the
queue instead of just the tail by passing the TQ_FRONT flag.

Minor cleanup and Solaris API additions.

Minor formatting cleanups.

API additions:
* {U}INT8_{MIN,MAX}, {U}INT16_{MIN,MAX} macros.
* id_t typedef
* ddi_get_lbolt(), ddi_get_lbolt64() functions.

Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.

This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup().  They are all implemented as a thin layer which just calls
their Linux counterparts.  As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels.  If the kernel does
not provide it then spl-generic implements it.

Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.

Add xuio_* structures and typedefs.

Add the basic xuio structure and typedefs for Solaris style zero copy.
There's a decent chance this will not be the way I handle this on Linux
but providing the basic types simplifies things for now.

Stub out additional missing headers

Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)

Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes. This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris. Minor updates were
required to all the call sites where it was included of course.

Update META to version 0.5.0

Stack overflow on 64-bit modulus operations on 32-bit architectures.

Running 'zpool create' on a 32-bit machine with an SPL compiled with
gcc 4.4.4 led to a stack overlow. This turned out to be due to some
sort of 'optimization' by gcc:

uint64_t __umoddi3(uint64_t dividend, uint64_t divisor)
{
return dividend - divisor * (dividend / divisor);
}

This code was supposed to be using __udivdi3 to implement /, but gcc
instead implemented it via __umoddi3 itself.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Minor 32-bit fix cast to hrtime_t before the mutliply.

It's important to cast to hrtime_t before doing the multiply because
the ts.tv_sec type is only 32-bits and we need to promote it to 64-bits.

Refresh autogen.sh products with automake 1.11.1.

Re-Prep for 0.4.9 tag with a few more fixes and updated ChangeLog

Minor spec file cleanup for RHEL6 package dependency.

Simplify rwlock implementation.

Remove RW_COUNT() from the rwlock implementation.  The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock.  While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes.  In
short it hasn't been worth the trouble.

With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions.  As for rw_upgrade() it remains only implemented for
the generic rwsem implemenation.  It remains to be determined if its
worth the effort of adding a custom implementation for each arch.

Use KM_NODEBUG macro in preference to __GFP_NOWARN.

Disable spl_debug_panic_on_bug by default.

While I may prefer to have the system panic on an SBUG and to get
crash dump for analysis. I suspect most peoples systems are not
configured from crash dump and the best thing to so is to simply
halt the thread and print an error to the console. This way they
have a good chance of actually saving the stack trace and debug log.

Adjust 'large' object sizes in kmem:slab_large test.

64K objects are large for a kmem based slab (2M slabs)
1M objects are large for a vmem cased slab (32M slabs)

Remove kmem_set_warning() interface replace with __GFP_NOWARN flag.

Remove the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN. This matches
the lower level linux flag of similar by slightly different function.
The idea is you can then explicitly set this flag on requests where
you know your breaking the max 8k rule but you need/want to do it
anyway.

This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise. Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and theres no easy way to handle that.

Finally, I've added a stack_dump() call to the warning when it is
trigger to make to clear exactly where the allocation is taking place.

Set default debug log patch to /tmp/spl-log.

Using /tmp/ is a preferable default, it can always be overriden
using the module option on a case-by-case basis.

Additionally standardize some log messages based on the same
default log level used by the kernel.

Minor spec file cleanup for srpm case.

Ensure kdevpkg is defined is srpm case before using it to define
the devel_requires macro. Interestingly this is not an issue for
rpm-4.7.1-4 but it is for rpm-4.4.2.3-18.

Prep for 0.4.9 tag, updated META and ChangeLog

Public Release Prep

Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.

Add 3 missing typedefs.

Add processorid_t, pc_t, index_t.

Add console_*printf() functions.

Add support for the missing console_vprintf() and console_printf()
functions.

Use do_posix_clock_monotonic_gettime() as described by comment.

While this does incur slightly more overhead we should be using
do_posix_clock_monotonic_gettime() for gethrtime() as described
by the existing comment.

Add cv_wait_interruptible() function.

This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux.  The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it.  This makes it impossible to woken by a signal
such as SIGTERM.  The cv_wait_interruptible() function was added
to handle this case.  It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up.  This means you do need to be careful and check issig() after
waking.

Dump log from current process when required

When dumping a debug log first check that it is safe to create
a new thread and block waiting for it. If we are in an atomic
context or irqs and disabled it is not safe to sleep and we
must write out of the debug log from the current process.

Assume TQ_SLEEP when not explicitly specified.

Handle the FAPPEND option in vn_rdwr().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update vn_set_pwd() to allow user|kernal address for filename

During module init spl_setup()->The vn_set_pwd("/") was failing
with -EFAULT because user_path_dir() and __user_walk() both
expect 'filename' to be a user space address and it's not in
this case. To handle this the data segment size is increased
to to ensure strncpy_from_user() does not fail with -EFAULT.

Additionally, I've added a printk() warning to catch this and
log it to the console if it ever reoccurs. I thought everything
was working properly here because there consequences of this
failing are subtle and usually non-critical.

Disable rw_tryupgrade() for newer kernels

For kernels using the CONFIG_RWSEM_GENERIC_SPINLOCK implementation
nothing has changed.  But if your kernel is building with arch
specific rwsems rw_tryupgrade() has been disabled until it can
be implemented correctly.  In particular, the x86 implementation
now leverages atomic primatives for serialization rather than
spinlocks.  So to get this working again it will need to be
implemented as a cmpxchg for x86 and likely something similiar
for other arches we are interested in.  For now it's safest
to simply disable it.

Add support for 'make -s' silent builds

The cleanest way to do this is to set AM_LIBTOOLFLAGS = --silent. However,
AM_LIBTOOLFLAGS is not honored by automake-1.9.6-2.1 which is what I have
been using. To cleanly handle this I am updating to automake-1.11-3 which
is why it looks like there is a lot of churn in the Makefiles.

Allow spl_config.h to be included by dependant packages (updated)

We need dependent packages to be able to include spl_config.h to
build properly.  This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed.  This solution
works as long as the spl_config.h is included before your projects
config.h.  That turns out to be easier said than done.  In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.

To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command.  This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file.  This eliminates
the problem entirely and makes header safe for inclusion.

Also in this change I have removed the few places in the code where
spl_config.h is included.  It is now added to the gcc compile line
to ensure the config results are always available.

Finally, I have also disabled the verbose kernel builds.  If you
want them back you can always build with 'make V=1'.  Since things
are working now they don't need to be on by default.

Reduce max kmem based slab size

Allowing MAX_ORDER-1 sized allocations for kmem based slabs have
been observed to result in deadlocks. To help prvent this limit
max kmem based slab size to MAX_ORDER-3. Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.

Prep for 0.4.8 tag, updated META and ChangeLog

Ignore unsigned module build products

Along with the addition of signed kernel modules in newer kernel
we have a few new build products we need to ignore. LKLM has the
whole thread for those interested: http://lkml.org/lkml/2007/2/14/164

Fix definitions for the unknown distro/installation

If the distro/installation really is unsupported (i.e. unknown) we should
not make it look like a known distribution (i.e. RHEL) complete with
dependencies on other RPMs and trying to find kenrel source in the RH
standard location.

Additionally add 'k' prefix for kernel requires for consistency.

When no kernel source has been pointed to, first attempt to use
/lib/modules/$(uname -r)/source. This will likely fail when building
under a mock (http://fedoraproject.org/wiki/Projects/Mock) chroot
environment since `uname -r` will report the running kernel which
likely is not the kernel in your chroot. To cleanly handle this
we fallback to using the first kernel in your chroot.

The kernel-devel package which contains all the kernel headers and
a few build products such as Module.symver{s} is all the is required.
Full source is not needed.

Remove Module.markers and Module.symver{s} in clean target

Split 'modules' and 'clean' Makefile targets to allow us to
cleanly remove the Module.* build products with a 'make clean'.

Linux 2.6.32 compat, proc_handler() API change

As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype. It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.

I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions. It's not pretty but API compat changes rarely are.

sun-misc-gitignore

Add .gitignore files.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>

sun-fix-whitespace

Whitespace fixes.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>

sun-fix-panic-str

Fix panic() string, which was being used as a format string, instead of an already-formatted string.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>

Added splat taskq task ordering test case.

This test case verifies the correct behavior of taskq_wait_id().
In particular it ensure the the following two cases are handled
properly:

1) Task ids larger than the waited for task id can run and
complete as long as there is an available worker thread.
2) All task ids lower than the waited one must complete before
unblocking even if the waited task id itself has completed.

Optimize lowest outstanding taskqid calculation in taskq_lowest_id()

In the initial version of taskq_lowest_id() the entire pending and
work list was locked under the tq->tq_lock to determine the lowest
outstanding taskqid.  At the time this done because I was rushed
and wanted to make sure it was right... fast was secondary.  Well now
fast is important too so I carefully thought through the pending
and work list management and convinced myself it is safe and correct
to simply check the first entry.  I added a large comment to the source
to explain this.  But basically as long as we are careful to ensure the
pending and work list stay sorted this is safe and fast.

The motivation for this chance was that I was observing as much as
10% of the total CPU time go to waiting on the tq->tq_lock when the
pending list was long.  This resolves that problems and frees up
that CPU time for something useful.

Strip __GFP_ZERO from kmalloc it is not available for older kernels.

This is needed to avoid a BUG_ON() on RHEL5.4 kernel 2.6.18-164.6.1,
since __GFP_ZERO is not a valid flag for kmalloc().

Fix kmem:slab_overcommit regression test locking

This regression test could crash in splat_kmem_cache_test_reclaim()
due to a race between the slab relclaim and the normal exiting of
the thread.  Specifically, the kct structure could be free'd by
the thread performing the allocations while the reclaim function
was also working on that's threads kct structure.  The simplest
fix is to extend the kcp->kcp_lock over the reclaim to prevent
the kct from being freed.  A better fix would be to ref count
these structures, but since is just a regression this locking
change is enough.  Surprisingly this was only observed commonly
under RHEL5.4 but all platform could have hit this.

Check for changed gaurd macro in 2.6.28+ for rwsem implementation.

As part of the 2.6.28 cleanup which moved all the linux/include/asm/
headers in to linux/arch, the guard headers for many header files
changed. The i386 rwsem implementation keys off this header to
ensure the internal members of the rwsem structure are interpreted
correctly. This change checks for the new guard macro in addition
to the only one, the implementation of the rwsem has not changed
for i386 so this is safe and correct.

Add skc_flags and full header to /proc/spl/kmem/slab.

Splat vnode tests must return negative error codes.

I must have been in a hurry when I wrote the vnode regression tests
because the error code handling is not correct.  The Solaris vnode
API returns positive errno's, these need to be converted to negative
errno's for Linux before being passed back to user space.  Otherwise
the test hardness with report the failure but errno will not be set
with the correct error code.

Additionally tests 3, 4, 6, and 7 may fail in the test file already
exists.  To avoid false positives a user mode helper has added to
remove the test files in /tmp/ before running the actual test.

Atomic64 compatibility for 32-bit systems without kernel support.

This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing.  This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.

Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations.  Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total.  To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.

The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available.  On 32-bit archs this may
not be true, and actually that's just fine.  In that case the kernel will
will never be able to allocate more the 32-bits worth anyway.  So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.

Correctly handle division on 32-bit RHEL5 systems by returning dividend.

When using x86 specific rwsem correctly intepret rwsem->count.

Only run the kmem overcommit test on 64-bit systems.

Add missing atomic64 compat helpers for 32-bit systems.

The use of these functions was added with the recent atomic work
and not tested on 32-bit systems. Add the missing compat functions:
atomic64_inc, atomic64_dec, atomic64_add_return, atomic64_sub_return,
atomic64_inc_return, atomic64_dec_return.

Type long expected explicitly cast for 32-bit systems.