granicus.if.org Git

Create a new thread during recursive taskq dispatch if necessary

When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472

Restructure uio to accommodate bio_vec

Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via
iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use
UIO_BVEC in segflg to determine which "vec".

Also, to be consistent to newer kernel, we make iovec and bio_vec immutable,
and make uio act as an iterator with the new uio_skip field indicating number
of bytes to skip in the first segment.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3511
Issue zfsonlinux/zfs#3640
Closes #468

Linux 4.2 compat: vfs_rename()

Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error.  Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface.  As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.

  vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3653

Include other sources of freeable memory in the freemem calculation

Prevents ARC collapse when non-ZFS filesystems, the block layer or other
memory consumers use a lot of reclaimable memory in the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#3680
Closes #471

Remove needfree, desfree, lotsfree #defines

This patch reverts 77ab5dd. This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places. Those places have subsequently been updated to use
the Linux equivalent Linux functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637

Invert minclsyspri and maxclsyspri

On Linux the meaning of a processes priority is inverted with respect
to illumos.  High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.

Note this change also reverts some of the priorities changes in prior
commit 62aa81a.  The rational is as follows:

spl_kmem_cache    - High priority may result in blocked memory allocs
spl_system_taskq  - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607

Remove skc_ref from alloc/free paths

As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy. They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.

Removing this code is desirable because there is some evidence that
contention on this atomic negative impacts performance on large-scale
NUMA systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463

Add defclsyspri macro

Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority.  Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority.  This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.

All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri.  The vast majority of callers were
part of the test suite which won't have an external impact.  The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri.  This makes it more likely the process will be scheduled
which may help performance.

To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added.  When disabled (0) all newly created kernel
threads will use the default kernel thread priority.  When enabled (1)
the specified taskq priority will be used.  By default this value is
enabled (1).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Default to --disable-debug-kmem

The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issues #463

Support parallel build trees (VPATH builds)

Build products from an out of tree build should be written
relative to the build directory.  Sources should be referred
to by their locations in the source directory.

This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile.  This enables the following:

  $ mkdir build
  $ cd build
  $ ../configure
  $ make -s

This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.

  Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
  Makefile.am:00: but option 'subdir-objects' is disabled

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082

Add memory compatibility wrappers

The function vmem_qcache_reap() and global variables 'needfree',
'desfree', and 'lotsfree' are all used in the upstream. While
these variables have no meaning under Linux they're being defined
as 0's to avoid needing to make additional changes to the ARC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Set TASKQ_DYNAMIC for kmem and system taskqs

Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs
to reduce the number of idle threads on the system. Additional
threads will be created on demand up to the previous maximum
thread counts. This should have minimal, if any, impact on
performance.

This makes the system taskq consistent with illumos which is
always created as a dynamic taskq with up to 64 threads.

The task limits for the kmem_cache have been increased to avoid
any unnessisary throttling and to keep a larger reserve of
task_t structures on the free list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458

Add TASKQ_DYNAMIC feature

Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics.  Initially only a single worker thread will be created
to service tasks dispatched to the queue.  As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'.  When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.

Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.

* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned.  Default 4.

* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system.  This is provided
for the purposes of debugging and troubleshooting.  Default 1
(enabled).

This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458

Add IMPLY() and EQUIV() macros

Added for upstream compatibility, they are of the form:

* IMPLY(a, b) - if (a) then (b)
* EQUIV(a, b) - if (a) then (b) *AND* if (b) then (a)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Rename cv_wait_interruptible() to cv_wait_sig()

Commit f752b46e added the cv_wait_interruptible() function to allow
condition variables to be woken by signals. This function and its
timed wait counterpart should have been named cv_wait_sig() to match
the illumos interface which provides the same functionality.

This patch renames the symbol but leaves a #define compatibility
wrapper in place until the ZFS code can be moved to the correct
name.

This patch also makes a small number of cosmetic changes to make
the condvar source and header cstyle clean.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #456

Retire rwsem_is_locked() compat

Stock Linux 2.6.32 and earlier kernels contained a broken version of
rwsem_is_locked() which could return an incorrect value. Because of
this compatibility code was added to detect the broken implementation
and replace it with our own if needed.

The fix for this issue was merged in to the mainline Linux kernel as
of 2.6.33 and the major enterprise distributions based on 2.6.32 have
all backported the fix. Therefore there is no longer a need to carry
this code and it can be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #454

Make taskq_wait() block until the queue is empty

Under Illumos taskq_wait() returns when there are no more tasks
in the queue.  This behavior differs from ZoL and FreeBSD where
taskq_wait() returns when all the tasks in the queue at the
beginning of the taskq_wait() call are complete.  New tasks
added whilst taskq_wait() is running will be ignored.

This difference in semantics makes it possible that new subtle
issues could be introduced when porting changes from Illumos.
To avoid that possibility the taskq_wait() function is being
updated such that it blocks until the queue in empty.

The previous behavior remains available through the
taskq_wait_outstanding() interface.  Note that this function
was previously called taskq_wait_all() but has been renamed
to avoid confusion.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #455

Add boot_ncpus macro

For compatibility define boot_ncpus as num_online_cpus().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix cstyle issues in spl-tsd.c

This patch only addresses the issues identified by the style checker
in spl-tsd.c. It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make tsd_set(key, NULL) remove the tsd entry for current thread

To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd
entry for the current thread. This is alright since tsd_get() returns NULL
when the entry doesn't exist.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #443

Implement areleasef()

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #449

vn_getf/vn_releasef should not accept negative file descriptors

C type coercion rules require that negative numbers be converted into
positive numbers via wraparound such that a negative -1 becomes a
positive 1. This causes vn_getf to return a file handle when it should
return NULL whenever a positive file descriptor existed with the same
value. We should check for a negative file descriptor and return NULL
instead.

This was caught by ClusterHQ's unit testing.

Reference:
http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #450

Tag spl-0.6.4

META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Clear PF_FSTRANS over vfs_sync()

When layered on XFS the following warning will be emitted under CentOS7
when entering vfs_fsync() with PF_FSTRANS already set. This is not an
issue for other stock Linux file systems and the warning was removed
for newer kernels. However, to avoid triggering this error PF_FSTRANS
is cleared and then reset in vn_fsync().

WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0

Call Trace:
[<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs]
[<ffffffff8114b833>] __writepage+0x13/0x50
[<ffffffff8114c341>] write_cache_pages+0x251/0x4d0
[<ffffffff8114c60d>] generic_writepages+0x4d/0x80
[<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs]
[<ffffffff8114d68e>] do_writepages+0x1e/0x40
[<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80
[<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70
[<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs]
[<ffffffff811df54b>] vfs_fsync+0x2b/0x40
[<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl]
[<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs]
[<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs]
[<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs]
[<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs]
[<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs]
[<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs]
[<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs]
[<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0
[<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0
[<ffffffff815f3219>] system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Don't allow shrinking a PF_FSTRANS context

Avoid deadlocks when entering the shrinker from a PF_FSTRANS context.

This patch also reverts commit d0d5dd7 which added MUTEX_FSTRANS. Its
use has been deprecated within ZFS as it was an ineffective mechanism
to eliminate deadlocks. Among other things, it introduced the need for
strict ordering of mutex locking and unlocking in order that the
PF_FSTRANS flag wouldn't set incorrectly.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #446

Add crgetzoneid() stub

Illumos 3897 introduces a dependency on crgetzoneid(). Stub it out until
such time as zones are implemented.

References:
https://www.illumos.org/issues/3897
https://github.com/illumos/illumos-gate/commit/fb7001f

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #444

Add RHEL style kmod packages

Provide a Redhat specific spl-kmod.spec file which uses the old style
kmods (not kmods2) packaging. By using the provided kmodtool script
packages can be built which support weak modules. This allows for the
kernel to be updated without having to rebuild the SPL kernel modules.

Packages for RHEL/Centos/SL/TOSS which use this spec file can by built
as follows:

$ ./configure --with-spec=redhat
$ make rpms

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove rpm/fedora directory

Originally it was thought that custom spec files might be required
for Fedora. Happily that has turns out not to be the case. Since
this directory just contains symlinks to the generic spec files it
can be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix warning about AM_INIT_AUTOMAKE arguments

As of automake 1.14.2, currently shipped with Ubuntu 14.04, automake
warns about AM_INIT_AUTOMAKE having more than one argument:

configure.ac:41: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated. For more info, see:
configure.ac:41: http://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation

This commit fixes the warnings by following above link's advice, so
AM_INIT gets called with the package's name and version. As both are
defined in the META file we're parsing it with `grep`, `cut` and `tr`.

NOTE: autoconf < 1.14 not supporting m4_esyscmd_s so m4_esyscmd was
used and modified `tr` to truncate newlines, too.

Signed-off-by: Hajo M<C3><B6>ller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #438

Set HAVE_FS_STRUCT_SPINLOCK correctly when CONFIG_FRAME_WARN==1024

If kernel lock debugging is enabled, the fs_struct structure exceeds the
typical 1024 byte limit of CONFIG_FRAME_WARN and isn't enabled when it
otherwise should be.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #440

Add mutex_enter_nested() which maps to mutex_lock_nested()

Also add support for the "name" parameter in mutex_init(). The name
allows for better diagnostics, namely in /proc/lock_stats when
lock debugging is enabled. Nested mutexes are necessary to support
CONFIG_PROVE_LOCKING. ZoL can use mutex_enter_nested()'s "class" argument
to to convey the locking hierarchy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #439

Reduce splat_taskq_test2_impl() stack frame size

Slightly increasing the size of a kmutex_t has caused us to exceed
the stack frame warning size in splat_taskq_test2_impl().  To address
this the tq_args have been moved to the heap.

  cc1: warnings being treated as errors
  spl-0.6.3/module/splat/splat-taskq.c:358:
  error: the frame size of 1040 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Add MUTEX_FSTRANS mutex type

There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have an future change introduce a
   new one.

Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit 9099312 - Merge branch 'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Retire MUTEX_OWNER checks

To minimize the size of a kmutex_t a MUTEX_OWNER check was added.
It allowed the kmutex_t wrapper to leverage the mutex owner which was
already stored in the mutex for certain kernel configurations.

The upside to this was that it reduced the size of the kmutex_t wrapper
structure by the size of a task_struct pointer (4/8 bytes).  The
downside was that two mutex implementations needed to be maintained.
Depending on your exact kernel configuration the correct one would
be selected.

Over the years this solution worked but it could be fragile since it
depending heavily on assumed kernel mutex implementation details.  For
example the SPL_AC_MUTEX_OWNER_TASK_STRUCT configure check needed to
be added when the kernel changed how the owner was stored.  It also
made the code more complicated than it needed to be.

Therefore, in the name of simplicity and portability this optimization
is being retired.  It will slightly increase the memory requirements
for a kmutex_t but only very slightly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Fix cstyle issue in mutex.h

This patch only addresses the issues identified by the style checker
in mutex.h. It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Retire spl_module_init()/spl_module_fini()

In the original implementation of the SPL wrappers were provided
for module initialization and cleanup. This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#2985

Fix spl_hostid module parameter

Currently, spl_hostid module parameter doesn't do anything, because it will
always be overwritten when calling into hostid_read().
Instead, we should only call into hostid_read() when spl_hostid is not zero,
just as the comment describes.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #427

Optimize vmem_alloc() retry path

For performance reasons the reworked kmem code maps vmem_alloc() to
kmalloc_node() for allocations less than spa_kmem_alloc_max.  This
allows for more concurrency in the system and less contention of
the virtual address space.  Generally, this is a good thing.

However, in the case when the kmalloc_node() fails it makes little
sense to retry it using kmalloc_node() again.  It will likely fail
in exactly the same way.  A smarter strategy is to abandon this
optimization and retry using spl_vmalloc() which is very likely
to succeed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #428

Fix GFP_KERNEL allocations flags

The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat()
functions should all use the kmem_flags_convert() function to generate
the GFP_* flags. This ensures that they can be safely called in any
context and the correct flags will be used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426

Merge branch 'kmem-rework'

The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms. This simplifies the process of porting changes to
Linux from other platforms. This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.

The key reason this is now possible is due to the addition of
Linux features such as the thread-specific PF_FSTRANS bit which
was introduced for XFS.

This patch stack also performs some refactoring and cleanup
designed to make the code more maintainable and understandable.
Finally, in the context of making and testing these changes
several bugs were identified and resolved resulting in a
more robust implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #414

Use __get_free_pages() for emergency objects

The __get_free_pages() function must be used in place of kmalloc()
to ensure the __GFP_COMP is strictly honored. This is due to
kmalloc() being layered on the generic Linux slab caches. It
wasn't until recently that all caches were created using __GFP_COMP.
This means that it is possible for a kmalloc() which passed the
__GFP_COMP flag to be returned a non-compound allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix kmem cache deadlock logic

The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation.  This is
done because large slabs must be allocated using vmalloc().  It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored.  This can result in a deadlock.

Therefore, a deadlock detection strategy was implemented to deal with
this case.  When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab.  Then all callers
attempting to allocate from the cache fall back to using kmalloc()
which does honor all passed flags.

This logic was correct but an optimization in the code allowed for a
deadlock.  Because only slabs backed by vmalloc() can deadlock in the
way described above.  An optimization was made to only invoke this
deadlock detection code for vmalloc() backed caches.  This had the
advantage of making it easy to distinguish these objects when they
were freed.

But this isn't strictly safe.  If all the spl_kmem_cache threads end
up deadlocked than we can't grow any of the other caches either.  This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.

The fix here is to remove the optimization which limits this fall back
allocation stratagy to vmalloc() backed caches.  Doing this means we
may need to take the cache lock in spl_kmem_cache_free() call path.
But this small cost can be mitigated by ignoring objects with virtual
addresses.

For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4, and made tunable.  This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked.  However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread doesn't block allocations from
other caches until the timeout is reached.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Refine slab cache sizing

This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator.  Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space.  However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
the we be as frugal as possible.  A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Reduce kmem cache deadlock threshold

Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10. The reduced value is still several orders of
magnitude large enough to avoid being triggered incorrectly. By
reducing it we allow the system to resolve the issue more quickly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update spl-module-parameters(5) man page

The spl-module-parameters(5) was not kept up to date. Refresh
the man page so that it lists all the possible module options,
describes what the do, and justify why the default values are
set they way the are.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make slab reclaim more aggressive

Many people have noticed that the kmem cache implementation is slow
to release its memory. This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty. Unused
objects which are cached in the magazines will still prevent a slab
from being freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Enforce architecture-specific barriers around clear_bit()

The comment above the Linux 3.16 kernel's clear_bit() states:

/**
* clear_bit - Clears a bit in memory
* @nr: Bit to clear
* @addr: Address to start counting from
*
* clear_bit() is atomic and may not be reordered. However, it does
* not contain a memory barrier, so if it is used for locking purposes,
* you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
* in order to ensure changes are visible on other processors.
*/

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add hooks for disabling direct reclaim

The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Refactor generic memory allocation interfaces

This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic. This eliminates the potential for
   surprises that were previously possible where kmem flags were
   mapped to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to the newly-added spl_kmem_alloc_max
   parameter.  This ensures that small allocations will not contend
   on a single global lock, large allocations can still be handled,
   and potentially limited virtual address space will not be squandered.
   This behavior is entirely different than under Illumos due to
   different memory management strategies employed by the respective
   kernels.  However, this functionally provides the semantics required.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified in to
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocations functions that can be safely used in the same way
   they are used on Illumos.   Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux specific changes
   which were previously required to avoid deadlocking the system.

This change will be largely transparent to existing callers but there
are a few caveats:

1. Because the headers were refactored and extraneous includes removed
   callers may find they need to explicitly add additional #includes.
   In particular, kmem_cache.h must now be explicitly includes to
   access the SPL's kmem cache implementation.  This behavior is
   different from Illumos but it was done to avoid always masking
   the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers such as ZFS which did not will not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flags has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() functions has been removed.  For callers which
   need to distinguish between a physical and virtual address use
   is_vmalloc_addr().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix kmem cstyle issues

Address all cstyle issues in the kmem, vmem, and kmem_cache source
and headers. This will done to make it easier to review subsequent
changes which will rework the kmem/vmem implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Refactor existing code

This change introduces no functional changes to the memory management
interfaces. It only restructures the existing codes by separating the
kmem, vmem, and kmem cache implementations in the separate source and
header files.

Splitting this functionality in to separate files required the addition
of spl_vmem_{init,fini}() and spl_kmem_cache_{initi,fini}() functions.

Additionally, several minor changes to the #include's were required to
accommodate the removal of extraneous header from kmem.h.

But again, while large this patch introduces no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Revert "Add PF_NOFS debugging flag"

This reverts commit eb0f407a2b9089113ef6f2402ebd887511315b43 in
preperation for updating the kmem/vmem infrastructure to use the
PF_FSTRANS flag.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Use current_kernel_time() in the time compatibility wrappers

Since the Linux kernel's utimens family of functions uses
current_kernel_time(), we need to do the same in the context of ZFS
or else there can be discrepencies in timestamps (they go backward)
if userland code does:

fd = creat(FNAME, 0600);
(void) futimens(fd, NULL);

The getnstimeofday() function generally returns a slightly lower time
value.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3006

Fix debug object on stack warning

When running the SPLAT tests on a kernel with CONFIG_DEBUG_OBJECTS=y
enabled the following warning is generated.

  ODEBUG: object is on stack, but not annotated
  WARNING: at lib/debugobjects.c:300 __debug_object_init+0x221/0x480()

This is caused by the test cases placing a debug object on the stack
rather than the heap.  This isn't harmful since they are small objects
but to make CONFIG_DEBUG_OBJECTS=y happy the objects have been relocated
to the heap.  This impacted taskq tests 1, 3, and 7.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #424

mutex: force serialization on mutex_exit() to fix races

It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue any many like it are linked off the zfsonlinux/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

zfsonlinux/zfs#401
zfsonlinux/zfs#2523
zfsonlinux/zfs#2679
zfsonlinux/zfs#2684
zfsonlinux/zfs#2704
zfsonlinux/zfs#2708
zfsonlinux/zfs#2517
zfsonlinux/zfs#2827
zfsonlinux/zfs#2850
zfsonlinux/zfs#2891
zfsonlinux/zfs#2897
zfsonlinux/zfs#2247
zfsonlinux/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421

Remove compat includes from sys/types.h

Don't include the compatibility code in linux/*_compat.h in the public
header sys/types.h. This causes problems when an external code base
includes the ZFS headers and has its own conflicting compatibility code.
Lustre, in particular, defined SHRINK_STOP for compatibility with
pre-3.12 kernels in a way that conflicted with the SPL's definition.
Because Lustre ZFS OSD includes ZFS headers it fails to build due to a
'"SHRINK_STOP" redefined' compiler warning. To avoid such conflicts
only include the compat headers from .c files or private headers.

Also, for consistency, include sys/*.h before linux/*.h then sort by
header name.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #411

Retire legacy debugging infrastructure

When the SPL was originally written Linux tracepoints were still
in their infancy.  Therefore, an entire debugging subsystem was
added to facilite tracing which served us well for many years.

Now that Linux tracepoints have matured they provide all the
functionality of the previous tracing subsystem.  Rather than
maintain parallel functionality it makes sense to fully adopt
tracepoints.  Therefore, this patch retires the legacy debugging
infrastructure.

See zfsonlinux/zfs@bc9f413 for the tracepoint changes.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #408

Lower minimum objects/slab threshold

As long as we can fit a minimum of one object/slab there's no reason
to prevent the creation of the cache. This effectively pushes the
maximum object size up to 32MB. The splat cache tests were extended
accordingly to verify this functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add config/compile to config/.gitignore

This file may be added by automake and therefore should be added
to config/.gitignore. For the full list of possible auxiliary
programs see the full automake documentation.

http://www.gnu.org/software/automake/manual/automake.html#Auxiliary-Programs

Signed-off-by: Marcel Wysocki <maci.stgn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix modules installation directory

When building zfs modules with kernel, compiled from deb.src, the
packaging process ends up installing the modules in the wrong place.

Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2822

Kernel header installation should respect --prefix

This is the upstream component of work that enables preliminary support
for building Gentoo's ZFS packaging on other Linux systems via Gentoo
Prefix.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #384

kmem_cache: Call constructor/destructor on each alloc/free

This has a few benefits. First, it fixes a regression that "Rework
generic memory allocation interfaces" appears to have triggered in
splat's slab_reap and slab_age tests. Second, it makes porting code from
Illumos to ZFSOnLinux easier. Third, it has the side effect of making
reclaim from slab caches that specify reclaim functions an order of
magnitude faster. The splat slab_reap test usually took 30 to 40
seconds. With this change, it takes 3 to 4.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #369

Linux 3.12 compat: shrinker semantics

The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects. It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics
and updates the splat shrinker test case appropriately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #403

Merge branch 'cleanup'

Over the years the SPL code bases has accumulated compatibly code
to allow it to build against a wide range of Linux kernels. In
general this is desirable because it makes the code flexible.
However, once support for these old kernels is no longer needed
and is no longer being actively tested it should be removed. This
helps keep the code simple and understandable.

The spl-0.6.x releases have supported kernels all the way back to
2.6.26. This patch stack moves that cut off up to 2.6.32 and newer
kernels. This ensures we still support all the major enterprise
distributions which are largely locked in to 2.6.32 based kernels.
And at the same time we can shed a large amount of compatibility
code which simplifies maintenance and new development.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #395

Remove vfs_fsync() wrapper

The vfs_fsync() function has been available since Linux 2.6.29.
There is no longer a need to maintain this compatibility code.
However, the HAVE_2ARGS_VFS_FSYNC check was left in place
since that change occured after 2.6.32.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove kern_path() wrapper

The kern_path() function has been available since Linux 2.6.28.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove kvasprintf() wrapper

The kvasprintf() function has been available since Linux 2.6.22.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove proc_handler() wrapper

As of Linux 2.6.32 the proc handlers where updated to expect only
five arguments. Therefore there is no longer a need to maintain
this compatibility code and this infrastructure can be simplified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update put_task_struct() comments

Update the comments to correctly reflect when this interface was
added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove credential configure checks.

The groups_search() function was never exported by a mainline kernel
therefore we drop this compatibility code and always provide our own
implementation.

Additionally, the cred_t structure has been available since 2.6.29
so there is no longer a need to maintain compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add vfs_unlink() and vfs_rename() comments

Just for consistency with the other autoconf checks a small comment
block was added before these checks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove set_fs_pwd() configure check

This function has never been exported by any mainline and was only
briefly available under RHEL5. Therefore this check is being removed
and the code update to always use the wrapper function.

The next step will be to eliminate all this code. If ZFS were updated
not to assume that it's pwd was / there would be no need for this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove user_path_dir() wrapper

The user_path_dir() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove kallsyms_lookup_name() wrapper

After the removable of get_vmalloc_info(), the unused global memory
variables, and the optional dcache/icache shrinkers there is no
longer a need for the kallsyms compatibility code. This allows
us to eliminate another brittle area of the code by removing the
kernel upcall this functionality depended on for older kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove shrink_{i,d}node_cache() wrappers

This is optional functionality which may or may not be useful to
ZFS when using older kernels. It is never a hard requirement.
Therefore this functionality is being removed from the SPL and
a simpler slimmed down version will be added to ZFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove global memory variables

Platforms such as Illumos and FreeBSD have historically provided
global variables which summerize the memory state of a system.
Linux on the otherhand doesn't expose any of this information
to kernel modules and uses entirely different mechanisms for
memory management.

In order to simplify the original ZFS port to Linux these global
variables were emulated by the SPL for the benefit of ZFS. As ZoL
has matured over the years it has moved steadily away from these
interfaces and now no longer depends on them at all.

Therefore, this patch completely removes the global variables
availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
and swapfs_reserve. This greatly simplifies the memory management
code and eliminates a common area of confusion.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove get_vmalloc_info() wrapper

The get_vmalloc_info() function was used to back the vmem_size()
function. This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.

However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove on_each_cpu() wrapper

The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove mutex_lock_nested() wrapper

The mutex_lock_nested() function has been available since Linux 2.6.18.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove i_mutex() configure check

The inode structure has used i_mutex as its internal locking
primitive since 2.6.16. The compatibility code to check for
the previous semaphore primitive has been removed. However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove kmalloc_node() compatibility code

The kmalloc_node() function has been available since Linux 2.6.12.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove linux/uaccess.h header check

The uaccess header has been available in the same location since
Linux 2.6.18. There is no longer a need to maintain this
compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove uintptr_t typedef

The uintptr_t typedef has been available since Linux 2.6.24.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove atomic64_xchg() wrappers

The atomic64_xchg() and atomic64_cmpxchg() functions have been
available since Linux 2.6.24. There is no longer a need to
maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Simplify the time compatibility wrappers

Many of the time functions had grown overly complex in order to
handle kernel compatibility issues. However, as of Linux 2.6.26
all the required functionality is available. This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Map highbit64() to fls64()

The fls64() function has been available since Linux 2.6.16 and
it should be used to implemented highbit64(). This allows us
to provide an optimized implementation and simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove CTL_UNNUMBERED sysctl interface

Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19. There is no longer any reason to maintain this
compatibility code. There also issue any reason to keep around
the CTL_NAME macro and helpers so they have been retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove register_sysctl() compatibility code

The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove utsname() wrapper

There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly. This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove sysctl_vfs_cache_pressure assumption

The generic SPL cache shrinkers make the assumption that the
caches only contain VFS cache data and therefore should be scaled
based on vfs_cache_pressure. This is not strictly true and it
should not be assumed.

Removing this tuning should not have any impact on the stock
behavior because vfs_cache_pressure=100 by default. This means
that no scaling will take place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove adaptive mutex implementation

Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove patches directory

There is no longer a need to carry these stale patches in the
SPL source tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update code to use misc_register()/misc_deregister()

When the SPL was originally written it was designed to use the
device_create() and device_destroy() functions.  Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.

As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions.  This interface
for registering character devices has remained stable, is simple,
and provides everything we need.

Therefore the code has been reworked to use this interface.  The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update SPLAT to use kmutex_t for portability

For consistency throughout the code update the SPLAT infrastructure
to use the wrapped mutex interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make license compatibility checks consistent

Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Install header during post-build rather than post-install.

New versions of dkms clean up the build directory after installing.

It appears that this was always intended, but had rm -rf "/path/to/build/*"
(note the quotes), which prevented it from working.

Also, the build step is already installing stuff into the directory where
these files go, so installing our stuff there as part of build rather than
install makes sense.

Signed-off-by: Tom Prince <tom.prince@clusterhq.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #399

Fix bug in SPLAT taskq:front

While running SPLAT on a kernel with CONFIG_DEBUG_ATOMIC_SLEEP
enabled the taskq:front was flagged as a test which might sleep
which in an unsafe context. Specifically, the splat_vprint()
function which internally takes a mutex was being called under
a spin lock. Moving the log function outside the spin lock
cleanly solves this issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 3.16 compat: smp_mb__after_clear_bit()

The smp_mb__{before,after}_clear_bit functions have been renamed
smp_mb__{before,after}_atomic.  Rather than adding a compatibility
function to handle this the code has been updated to use smp_wmb().

This has the advantage of being a stable functionally equivalent
interface.  On many architectures smp_mb__after_clear_bit() expands
to smp_wmb().  Others might be able to do something slightly more
efficient but this will be safe and correct on all of them.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #386

Avoid PAGESIZE redefinition

Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.

Signed-off-by: stf <s@ctrlc.hu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #382

Cleanup vn_rename() and vn_remove()

zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #370