granicus.if.org Git

Add rw_tryupgrade()

This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms.  It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held.  On other platforms the lock is never released during
the upgrade process.  This is necessary under Linux because the kernel
does not provide an upgrade function.

There are currently no callers in the ZFS code where this change in
behavior is a problem.  In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes #534

Remove RPM package restriction

ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64
architectures. Given this the artificial architecture restriction in
the packaging has been removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Changes to support zfs encryption

Unused modlinkage struct removed and ntohll functions added.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #533

random_get_pseudo_bytes() need not provide cryptographic strength entropy

Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.

Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.

We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.

2. Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.

3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.

4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.

5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.

6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`). In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372

Allow kicking a taskq to spawn more threads

This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529

Ensure spl/ only occurs once in core-y

Update copy-builtin so it may be run multiple times against
the kernel source tree. This change makes sed more discriminating
to ensure spl/ only occurs once in core-y.

Signed-off-by: Chip Parker <aparker@enthought.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #526

Remove RLIM64_INFINITY assert in vn_rdwr()

Previous commit be29e6a updated kobj_read_file() so it no longer
unconditionally passes RLIM64_INFINITY. The vn_rdwr() function
needs to be updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #513

kobj_read_file: Return -1 on vn_rdwr() error

I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.

Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #513

Create spl-kmod-debuginfo rpm with redhat spec file

Correct the redhat specfile so that working debuginfo rpms are created
for the kernel modules. The generic specfile already does the right
thing.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#4224

Use tsd to store tq for taskq_member

To prevent taskq_member holding tq_lock and doing linear search, thus causing
contention. We store the taskq pointer to which the thread belongs in tsd.
This way taskq_member will not need to touch tq_lock, and tsd has per slot
spinlock. So the contention should be reduced greatly.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #500
Closes #504
Closes #505

Linux 4.5 compat: pfn_t typedef

The pfn_t typedef was inherited from Illumos but never directly
used by any SPL consumers. This didn't cause any issues until
the Linux 4.5 kernel introduced a typedef of the same name.
See torvalds/linux/commit/34c0fd54, this patch removes the
unused Illumos version to prevent a conflict.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #524

Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark

In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
because earlier kernel with it doesn't turn off __GFP_FS. However, for newer
kernel, we would prefer PF_MEMALLOC_NOIO because it would work for allocation
in kernel which we cannot control otherwise. So in this patch, we turn on both
PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #523

Don't hold mutex until release cv in cv_wait

If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106

Add spl_kmem_cache_kmem_threads man page entry

The spl_kmem_cache_kmem_threads module option was accidentally
omitted from the documentation. Add it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #512

_ILP32 is always defined on SPARC

Signed-off-by: Alex McWhirter <alexmcwhirter@triadic.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520

Fix do_div() types in condvar:timeout

The do_div() macro expects unsigned types and this is detected in
powerpc implementation of do_div().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516

Use spl_fstrans_mark instead of memalloc_noio_save

For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #515
Issue zfsonlinux/zfs#4111

Provide kstat for taskqs

This patch provides 2 new kstats to display task queues:

/proc/spl/taskqs-all - Display all task queues
/proc/spl/taskqs - Display only "active" task queues

A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.

If the task queue has running threads, displays each thread function's
address (symbolically, if possibly) and its argument.

If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and arguemnt.

If the task queue has any waiters, displays each waiting task's pid.

Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491

Skip GPL-only symbols test when cross-compiling

Signed-off-by: Kamil Domański <kamil@domanski.co>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#507
Closes zfsonlinux/zfs#4075

Revert "Skip GPL-only symbols test when cross-compiling"

This reverts commit 61bbbd9a775a5517af513e5014edbdd73a32f7e4 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507

Fix cstyle issues in spl-taskq.c and taskq.h

This patch only addresses the issues identified by the style checker.
It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Don't use tq->tq_lock_flags

The flags argument in spin_lock_irqsave is modified out side of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses local variable for the flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #506

Subclass tq_lock to eliminate a lockdep warning

When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking.  This is
a false positive.

One such call chain is as follows, when a taskq needs more threads:
taskq_dispatch->taskq_thread_spawn->taskq_dispatch

The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads.  The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock.  Without subclassing, lockdep believes these
could potentially be the same lock and complains.  A similar case occurs
when taskq_dispatch() then calls task_alloc().

This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:

subclass              taskq
TQ_LOCK_DYNAMIC       dynamic_taskq
TQ_LOCK_GENERAL       any other

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480

Fix lockdep warning in spl_inode_{lock,unlock}

spl_inode_{lock,unlock} are triggering possible recursive locking
warnings from lockdep.  The warning is a false positive.

The lock is used to protect a parent directory during delete/add
operations, used in zfs when writing/removing the cache file.  The inode
lock is taken on both the parent inode and the file inode.

VFS provides an enum to subclass the lock.  This patch changes the
spin_lock call to _nested version and uses the provided enum.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480

Add new lock types MUTEX_NOLOCKDEP, and RW_NOLOCKDEP

When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.

When lockdep detects these conditions, it disables further lock analysis
for all locks. This causes /proc/lock_stats not to reflect full
information about lock contention, even in locks without dependency
issues.

This commit creates a new type of mutex, MUTEX_NOLOCKDEP. This mutex
type causes subsequent attempts to take or release those locks to be
wrapped in lockdep_off() and lockdep_on().

This commit also creates an RW_NOLOCKDEP type analagous to
MUTEX_NOLOCKDEP.

MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to
that repo, for userspace builds.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480

Skip GPL-only symbols test when cross-compiling

This test depends on being able to execute the resulting binary
which will be impossible when cross-compiling. Instead make a
worst case assumption which allows the build to continue as
recommended by the autoconf manual.

https://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/Runtime.html

Signed-off-by: Kamil Domański <kamil@domanski.co>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Closes zfsonlinux/spl#507
Closes zfsonlinux/zfs#4075

Fix build issue on some configured kernels

The SPL fails to build with some "Configured" kernels (ex. openSUSE
xen Kernel) this change should make same binaries with C compiler
optimization.

Signed-off-by: zgock <zgock@nuc.base.zgock-lab.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #510

Either _ILP32 or _LP64 must be defined

For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Issue zfsonlinux/zfs#4048

Revert "Make taskq_member() use ->journal_info"

This reverts commit a430c11f0b1ef16ca5edf3059e4082709277376c. Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500

Make taskq_member() use ->journal_info

The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame. This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.

This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #500

Fix race between getf() and areleasef()

If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.

We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #492

Prevent rm modules.* when make install

This was originally in e80cd06b8e0428f3ca2c62e4cb0e4ec54fda1d5c, but somehow
was changed and not working anymore. And it will cause the following error:

modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #501

Add a script to display SPL slab cache statistics

Useful when looking for the info on ZFS/SPL related memory consumption.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #460

Additional dkio support for TRIM/Discard

Replace DKIOCTRIM with DKIOCFREE and add additional support required
for Nextenta's TRIM support.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469

spl-kmem-cache: include linux/prefetch.h for prefetchw()

This is needed for architectures that do not have a builtin prefetchw()

Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #502

Fix --enable-linux-builtin

Adding VPATH support, commit 37d7cd9, required that a `src`
and `obj` line be added to the top of the Makefiles. They
must be removed from the Makefiles when builtin.

The code which adds the `spl/` directory to the top level
Makefile was failing due to the addition of the `certs/` path.
The search pattern has been adjusted to be more tolerant.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #481
Issue #498

Limit maximum object size in kmem tests

Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests. Large values can result in out of memory errors
for systems with less the 512M of memory. Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove superfluous `newline` character

Remove superfluous `newline` character from spl_kmem_cache_magazine_size
module parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #499

sysmacros: Make P2ROUNDUP not trigger int overflow

The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.

Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.

Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.

Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2505
Closes #488

Fix taskq dynamic spawning

Currently taskq_dispatch() will spawn new task with a condition that the caller
is also a member of the taskq. However, under this condition, it will still
cause deadlock where a task on tq1 is waiting another thread, who is trying to
dispatch a task on tq1. So this patch removes the check.

For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy

This will easily deadlock before this patch.

Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #496

Don't call kmem_cache_shrink from shrinker

Linux slab will automatically free empty slab when number of partial slab is
over min_partial, so we don't need to explicitly shrink it. In fact, calling
kmem_cache_shrink from shrinker will cause heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes #487

Fix CPU hotplug

Allocate a kmem cache magazine for every possible CPU which might
be added to the system. This ensures that when one of these CPUs
is enabled it can be safely used immediately.

For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint. In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482

Use tab indent in rwlock.h

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473

rwsem use kernel provided owner when possible

If CONFIG_RWSEM_SPIN_ON_OWNER is defined, rw_semaphore will have an owner
field, so we don't need our own.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473

Don't take spin lock on rwlock owner

The spin lock around rw_owner is completely unnecessary. The reason is that it
is only modified in the down_write context. If you race against another thread
modifying it, that means that you aren't holding the rwlock, so taking the
spin lock don't eliminate the race.

Also, we only check rw_owner in RW_WRITE_HELD because spl_rwsem_is_locked
is unnecessary and might need to take spin lock.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473

Fix spl-dkms uninstall/update

Modern versions of dkms cleanup the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 4cdcdbf which verifies the existence of the
spl.release build product would never be true.

This patch resolves the issue by updating the conditional to check
in the explicitly installed spl_config.h file for the version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #478

Fix PAX Patch/Grsec SLAB_USERCOPY panic

Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS are enabled. When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.

Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.

Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796

Tag spl-0.6.5

META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Disable direct reclaim in taskq worker threads on Linux 3.9+

Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes #474

Linux 4.2 compat: misc_deregister()

The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Create a new thread during recursive taskq dispatch if necessary

When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472

Revert "Create a new thread during recursive taskq dispatch if necessary"

This reverts commit 076821e due to a locking issue uncovered in
subsequent testing. An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock. The patch will need to be reworked.

VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472

Create a new thread during recursive taskq dispatch if necessary

When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472

Restructure uio to accommodate bio_vec

Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via
iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use
UIO_BVEC in segflg to determine which "vec".

Also, to be consistent to newer kernel, we make iovec and bio_vec immutable,
and make uio act as an iterator with the new uio_skip field indicating number
of bytes to skip in the first segment.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3511
Issue zfsonlinux/zfs#3640
Closes #468

Linux 4.2 compat: vfs_rename()

Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error.  Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface.  As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.

  vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3653

Include other sources of freeable memory in the freemem calculation

Prevents ARC collapse when non-ZFS filesystems, the block layer or other
memory consumers use a lot of reclaimable memory in the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#3680
Closes #471

Remove needfree, desfree, lotsfree #defines

This patch reverts 77ab5dd. This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places. Those places have subsequently been updated to use
the Linux equivalent Linux functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637

Invert minclsyspri and maxclsyspri

On Linux the meaning of a processes priority is inverted with respect
to illumos.  High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.

Note this change also reverts some of the priorities changes in prior
commit 62aa81a.  The rational is as follows:

spl_kmem_cache    - High priority may result in blocked memory allocs
spl_system_taskq  - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607

Remove skc_ref from alloc/free paths

As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy. They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.

Removing this code is desirable because there is some evidence that
contention on this atomic negative impacts performance on large-scale
NUMA systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463

Add defclsyspri macro

Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority.  Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority.  This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.

All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri.  The vast majority of callers were
part of the test suite which won't have an external impact.  The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri.  This makes it more likely the process will be scheduled
which may help performance.

To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added.  When disabled (0) all newly created kernel
threads will use the default kernel thread priority.  When enabled (1)
the specified taskq priority will be used.  By default this value is
enabled (1).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Default to --disable-debug-kmem

The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issues #463

Support parallel build trees (VPATH builds)

Build products from an out of tree build should be written
relative to the build directory.  Sources should be referred
to by their locations in the source directory.

This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile.  This enables the following:

  $ mkdir build
  $ cd build
  $ ../configure
  $ make -s

This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.

  Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
  Makefile.am:00: but option 'subdir-objects' is disabled

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082

Add memory compatibility wrappers

The function vmem_qcache_reap() and global variables 'needfree',
'desfree', and 'lotsfree' are all used in the upstream. While
these variables have no meaning under Linux they're being defined
as 0's to avoid needing to make additional changes to the ARC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Set TASKQ_DYNAMIC for kmem and system taskqs

Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs
to reduce the number of idle threads on the system. Additional
threads will be created on demand up to the previous maximum
thread counts. This should have minimal, if any, impact on
performance.

This makes the system taskq consistent with illumos which is
always created as a dynamic taskq with up to 64 threads.

The task limits for the kmem_cache have been increased to avoid
any unnessisary throttling and to keep a larger reserve of
task_t structures on the free list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458

Add TASKQ_DYNAMIC feature

Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics.  Initially only a single worker thread will be created
to service tasks dispatched to the queue.  As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'.  When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.

Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.

* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned.  Default 4.

* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system.  This is provided
for the purposes of debugging and troubleshooting.  Default 1
(enabled).

This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458

Add IMPLY() and EQUIV() macros

Added for upstream compatibility, they are of the form:

* IMPLY(a, b) - if (a) then (b)
* EQUIV(a, b) - if (a) then (b) *AND* if (b) then (a)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Rename cv_wait_interruptible() to cv_wait_sig()

Commit f752b46e added the cv_wait_interruptible() function to allow
condition variables to be woken by signals. This function and its
timed wait counterpart should have been named cv_wait_sig() to match
the illumos interface which provides the same functionality.

This patch renames the symbol but leaves a #define compatibility
wrapper in place until the ZFS code can be moved to the correct
name.

This patch also makes a small number of cosmetic changes to make
the condvar source and header cstyle clean.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #456

Retire rwsem_is_locked() compat

Stock Linux 2.6.32 and earlier kernels contained a broken version of
rwsem_is_locked() which could return an incorrect value. Because of
this compatibility code was added to detect the broken implementation
and replace it with our own if needed.

The fix for this issue was merged in to the mainline Linux kernel as
of 2.6.33 and the major enterprise distributions based on 2.6.32 have
all backported the fix. Therefore there is no longer a need to carry
this code and it can be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #454

Make taskq_wait() block until the queue is empty

Under Illumos taskq_wait() returns when there are no more tasks
in the queue.  This behavior differs from ZoL and FreeBSD where
taskq_wait() returns when all the tasks in the queue at the
beginning of the taskq_wait() call are complete.  New tasks
added whilst taskq_wait() is running will be ignored.

This difference in semantics makes it possible that new subtle
issues could be introduced when porting changes from Illumos.
To avoid that possibility the taskq_wait() function is being
updated such that it blocks until the queue in empty.

The previous behavior remains available through the
taskq_wait_outstanding() interface.  Note that this function
was previously called taskq_wait_all() but has been renamed
to avoid confusion.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #455

Add boot_ncpus macro

For compatibility define boot_ncpus as num_online_cpus().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix cstyle issues in spl-tsd.c

This patch only addresses the issues identified by the style checker
in spl-tsd.c. It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make tsd_set(key, NULL) remove the tsd entry for current thread

To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd
entry for the current thread. This is alright since tsd_get() returns NULL
when the entry doesn't exist.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #443

Implement areleasef()

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #449

vn_getf/vn_releasef should not accept negative file descriptors

C type coercion rules require that negative numbers be converted into
positive numbers via wraparound such that a negative -1 becomes a
positive 1. This causes vn_getf to return a file handle when it should
return NULL whenever a positive file descriptor existed with the same
value. We should check for a negative file descriptor and return NULL
instead.

This was caught by ClusterHQ's unit testing.

Reference:
http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #450

Tag spl-0.6.4

META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Clear PF_FSTRANS over vfs_sync()

When layered on XFS the following warning will be emitted under CentOS7
when entering vfs_fsync() with PF_FSTRANS already set. This is not an
issue for other stock Linux file systems and the warning was removed
for newer kernels. However, to avoid triggering this error PF_FSTRANS
is cleared and then reset in vn_fsync().

WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0

Call Trace:
[<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs]
[<ffffffff8114b833>] __writepage+0x13/0x50
[<ffffffff8114c341>] write_cache_pages+0x251/0x4d0
[<ffffffff8114c60d>] generic_writepages+0x4d/0x80
[<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs]
[<ffffffff8114d68e>] do_writepages+0x1e/0x40
[<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80
[<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70
[<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs]
[<ffffffff811df54b>] vfs_fsync+0x2b/0x40
[<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl]
[<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs]
[<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs]
[<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs]
[<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs]
[<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs]
[<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs]
[<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs]
[<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0
[<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0
[<ffffffff815f3219>] system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Don't allow shrinking a PF_FSTRANS context

Avoid deadlocks when entering the shrinker from a PF_FSTRANS context.

This patch also reverts commit d0d5dd7 which added MUTEX_FSTRANS. Its
use has been deprecated within ZFS as it was an ineffective mechanism
to eliminate deadlocks. Among other things, it introduced the need for
strict ordering of mutex locking and unlocking in order that the
PF_FSTRANS flag wouldn't set incorrectly.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #446

Add crgetzoneid() stub

Illumos 3897 introduces a dependency on crgetzoneid(). Stub it out until
such time as zones are implemented.

References:
https://www.illumos.org/issues/3897
https://github.com/illumos/illumos-gate/commit/fb7001f

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #444

Add RHEL style kmod packages

Provide a Redhat specific spl-kmod.spec file which uses the old style
kmods (not kmods2) packaging. By using the provided kmodtool script
packages can be built which support weak modules. This allows for the
kernel to be updated without having to rebuild the SPL kernel modules.

Packages for RHEL/Centos/SL/TOSS which use this spec file can by built
as follows:

$ ./configure --with-spec=redhat
$ make rpms

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove rpm/fedora directory

Originally it was thought that custom spec files might be required
for Fedora. Happily that has turns out not to be the case. Since
this directory just contains symlinks to the generic spec files it
can be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix warning about AM_INIT_AUTOMAKE arguments

As of automake 1.14.2, currently shipped with Ubuntu 14.04, automake
warns about AM_INIT_AUTOMAKE having more than one argument:

configure.ac:41: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated. For more info, see:
configure.ac:41: http://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation

This commit fixes the warnings by following above link's advice, so
AM_INIT gets called with the package's name and version. As both are
defined in the META file we're parsing it with `grep`, `cut` and `tr`.

NOTE: autoconf < 1.14 not supporting m4_esyscmd_s so m4_esyscmd was
used and modified `tr` to truncate newlines, too.

Signed-off-by: Hajo M<C3><B6>ller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #438

Set HAVE_FS_STRUCT_SPINLOCK correctly when CONFIG_FRAME_WARN==1024

If kernel lock debugging is enabled, the fs_struct structure exceeds the
typical 1024 byte limit of CONFIG_FRAME_WARN and isn't enabled when it
otherwise should be.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #440

Add mutex_enter_nested() which maps to mutex_lock_nested()

Also add support for the "name" parameter in mutex_init(). The name
allows for better diagnostics, namely in /proc/lock_stats when
lock debugging is enabled. Nested mutexes are necessary to support
CONFIG_PROVE_LOCKING. ZoL can use mutex_enter_nested()'s "class" argument
to to convey the locking hierarchy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #439

Reduce splat_taskq_test2_impl() stack frame size

Slightly increasing the size of a kmutex_t has caused us to exceed
the stack frame warning size in splat_taskq_test2_impl().  To address
this the tq_args have been moved to the heap.

  cc1: warnings being treated as errors
  spl-0.6.3/module/splat/splat-taskq.c:358:
  error: the frame size of 1040 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Add MUTEX_FSTRANS mutex type

There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have an future change introduce a
   new one.

Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit 9099312 - Merge branch 'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Retire MUTEX_OWNER checks

To minimize the size of a kmutex_t a MUTEX_OWNER check was added.
It allowed the kmutex_t wrapper to leverage the mutex owner which was
already stored in the mutex for certain kernel configurations.

The upside to this was that it reduced the size of the kmutex_t wrapper
structure by the size of a task_struct pointer (4/8 bytes).  The
downside was that two mutex implementations needed to be maintained.
Depending on your exact kernel configuration the correct one would
be selected.

Over the years this solution worked but it could be fragile since it
depending heavily on assumed kernel mutex implementation details.  For
example the SPL_AC_MUTEX_OWNER_TASK_STRUCT configure check needed to
be added when the kernel changed how the owner was stored.  It also
made the code more complicated than it needed to be.

Therefore, in the name of simplicity and portability this optimization
is being retired.  It will slightly increase the memory requirements
for a kmutex_t but only very slightly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Fix cstyle issue in mutex.h

This patch only addresses the issues identified by the style checker
in mutex.h. It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435

Retire spl_module_init()/spl_module_fini()

In the original implementation of the SPL wrappers were provided
for module initialization and cleanup. This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#2985

Fix spl_hostid module parameter

Currently, spl_hostid module parameter doesn't do anything, because it will
always be overwritten when calling into hostid_read().
Instead, we should only call into hostid_read() when spl_hostid is not zero,
just as the comment describes.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #427

Optimize vmem_alloc() retry path

For performance reasons the reworked kmem code maps vmem_alloc() to
kmalloc_node() for allocations less than spa_kmem_alloc_max.  This
allows for more concurrency in the system and less contention of
the virtual address space.  Generally, this is a good thing.

However, in the case when the kmalloc_node() fails it makes little
sense to retry it using kmalloc_node() again.  It will likely fail
in exactly the same way.  A smarter strategy is to abandon this
optimization and retry using spl_vmalloc() which is very likely
to succeed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #428

Fix GFP_KERNEL allocations flags

The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat()
functions should all use the kmem_flags_convert() function to generate
the GFP_* flags. This ensures that they can be safely called in any
context and the correct flags will be used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426

Merge branch 'kmem-rework'

The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms. This simplifies the process of porting changes to
Linux from other platforms. This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.

The key reason this is now possible is due to the addition of
Linux features such as the thread-specific PF_FSTRANS bit which
was introduced for XFS.

This patch stack also performs some refactoring and cleanup
designed to make the code more maintainable and understandable.
Finally, in the context of making and testing these changes
several bugs were identified and resolved resulting in a
more robust implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #414

Use __get_free_pages() for emergency objects

The __get_free_pages() function must be used in place of kmalloc()
to ensure the __GFP_COMP is strictly honored. This is due to
kmalloc() being layered on the generic Linux slab caches. It
wasn't until recently that all caches were created using __GFP_COMP.
This means that it is possible for a kmalloc() which passed the
__GFP_COMP flag to be returned a non-compound allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix kmem cache deadlock logic

The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation.  This is
done because large slabs must be allocated using vmalloc().  It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored.  This can result in a deadlock.

Therefore, a deadlock detection strategy was implemented to deal with
this case.  When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab.  Then all callers
attempting to allocate from the cache fall back to using kmalloc()
which does honor all passed flags.

This logic was correct but an optimization in the code allowed for a
deadlock.  Because only slabs backed by vmalloc() can deadlock in the
way described above.  An optimization was made to only invoke this
deadlock detection code for vmalloc() backed caches.  This had the
advantage of making it easy to distinguish these objects when they
were freed.

But this isn't strictly safe.  If all the spl_kmem_cache threads end
up deadlocked than we can't grow any of the other caches either.  This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.

The fix here is to remove the optimization which limits this fall back
allocation stratagy to vmalloc() backed caches.  Doing this means we
may need to take the cache lock in spl_kmem_cache_free() call path.
But this small cost can be mitigated by ignoring objects with virtual
addresses.

For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4, and made tunable.  This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked.  However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread doesn't block allocations from
other caches until the timeout is reached.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Refine slab cache sizing

This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator.  Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space.  However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
the we be as frugal as possible.  A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Reduce kmem cache deadlock threshold

Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10. The reduced value is still several orders of
magnitude large enough to avoid being triggered incorrectly. By
reducing it we allow the system to resolve the issue more quickly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update spl-module-parameters(5) man page

The spl-module-parameters(5) was not kept up to date. Refresh
the man page so that it lists all the possible module options,
describes what the do, and justify why the default values are
set they way the are.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make slab reclaim more aggressive

Many people have noticed that the kmem cache implementation is slow
to release its memory. This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty. Unused
objects which are cached in the magazines will still prevent a slab
from being freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Enforce architecture-specific barriers around clear_bit()

The comment above the Linux 3.16 kernel's clear_bit() states:

/**
* clear_bit - Clears a bit in memory
* @nr: Bit to clear
* @addr: Address to start counting from
*
* clear_bit() is atomic and may not be reordered. However, it does
* not contain a memory barrier, so if it is used for locking purposes,
* you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
* in order to ensure changes are visible on other processors.
*/

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add hooks for disabling direct reclaim

The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>