Ubuntu [Fri, 28 Oct 2016 20:56:38 +0000 (20:56 +0000)]
Fix vmem_size()
Add a minimal implementation of vmem_size() which accounts for the
virtual memory usage of the SPL's kmem cache. This functionality
is only useful on 32-bit systems with a small virtual address space.
The following assumptions are made:
1) The major SPL consumer of virtual memory is the kmem cache.
2) Memory allocated with vmem_alloc() is short lived and can be ignored.
3) Allow a 4MB floor as a generous pad given normal consumption.
4) The spl_kmem_cache_sem only contends with cache create/destroy.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Chunwei Chen [Wed, 19 Oct 2016 00:30:41 +0000 (17:30 -0700)]
Linux 4.9 compat: group_info changes
In Linux 4.9, torvalds/linux@81243ea, group_info changed from 2d array via
->blocks to 1d array via ->gid. We change the spl cred functions accordingly.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #581
Chunwei Chen [Tue, 18 Oct 2016 22:52:30 +0000 (15:52 -0700)]
Fix crgetgroups out-of-bound and misc cred fix
init_groups has 0 nblocks, therefore calling the current crgetgroups with
init_groups would result in out-of-bound access. We fix this by returning NULL
when nblocks is 0.
Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return
blocks[0].
Also, remove all get_group_info. The cred already holds reference on the
group_info, and cred is not mutable. So there's no reason to hold extra
reference, if we hold cred.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556
tuxoko [Sat, 8 Oct 2016 03:59:46 +0000 (20:59 -0700)]
Fix out-of-bound in per_cpu in spl_random_init
When iterating per_cpu values, we need to use for_each_possible_cpu. While
NR_CPUS indicates the number of CPU supported by the kernel, it might not
initialize all of them if the kernel decides it's not possible to use them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #578
tuxoko [Sat, 8 Oct 2016 03:53:58 +0000 (20:53 -0700)]
Linux 4.8 compat: Fix RW_READ_HELD
Linux 4.8, starting from torvalds/linux@19c5d690e, will set owner to 1 when
read held instead of leave it NULL. So we change the condition to
`rw_owner(rwp) <= 1` in RW_READ_HELD.
Due to changes in the task_struct the following warning is occurs
when initializing the global p0. Since this structure only exists
for it's address to be taken initialize it in a manor which isn't
sensitive to internal changes to the structure.
module/spl/spl-generic.c:58:1: error: missing braces around
initializer [-Werror=missing-braces]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #576
Brian Behlendorf [Wed, 21 Sep 2016 20:44:32 +0000 (13:44 -0700)]
Fix automatically generated release number
When building from the head of a branch a release number is
automatically generated with `git describe` using the last tag
on that branch as the base. For this to work the last tag on the
branch needs to be predictable given the current META file.
This logic was accidentally broken when an -rcX tag was added to
the branch. Update it to search for a VERSION or VERSION-RELEASE
tag.
Reviewed-by: Chris Siebenmann <cks.git01@cs.toronto.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#5105
Closes #572
Brian Behlendorf [Sat, 17 Sep 2016 00:10:36 +0000 (17:10 -0700)]
Increase spl_kmem_alloc_warn limit
In order to support ABD with large blocks the spl_kmem_alloc_warn
limit needs to be increased to 64K.
A 16M block requires that pointers be stored for 4096 4K-pages
on an x86_64 system. Each of these pointers is 8 bytes requiring
an allocation of 8*4096=32,768 bytes. The addition of a small
header to this structure pushes the allocation over the default
32K warning threshold.
In addition, fix a small bug where MAX was used instead of MIN
when setting the default. This ensures a reasonable limit is
still set on systems with page sizes larger then 4K.
Reviewed-by: David Quigley <david.quigley@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #571
GeLiXin [Fri, 19 Aug 2016 06:50:21 +0000 (14:50 +0800)]
Fix: handle NULL case in spl_kmem_free_track()
When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all
the buffers alloced by kmem_alloc() and kmem_zalloc(). If a NULL
pointer which indicates no track info in SPL is passed to
spl_kmem_free_track, we just ignore it.
Tim Chase [Mon, 1 Aug 2016 13:19:19 +0000 (08:19 -0500)]
Fix HAVE_MUTEX_OWNER test for kernels prior to 4.6
Recent 4.X kernels prior to 4.6 require #include of spinlock.h in
order to get the definition of __ARCH_SPIN_LOCK_UNLOCKED which is
used by DEFINE_MUTEX().
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #566
Nikolay Borisov [Fri, 29 Jul 2016 15:48:33 +0000 (18:48 +0300)]
Add handling for kernel 4.7's CONFIG_TRIM_UNUSED_KSYMS
Kernel 4.7 added the option to trim the unused exported symbols. In
my testing this showed to be problematic since the PDE_DATA function
was considered unused and as such was trimmed. This in turn caused the
respective test during spl's configure stage to falsely detect that
PDE_DATA is not defined, which in turn caused build failures later.
Handle this situation by adding detection whether CONFIG_TRIM_UNUSED_KSYMS
is enabled and refuse to build against a kernel which has it enabled
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565
Brian Behlendorf [Tue, 26 Jul 2016 23:37:46 +0000 (23:37 +0000)]
Linux 4.8 compat: rw_semaphore atomic_long_t count
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type. A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro. See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #563
Tom Caputi [Thu, 14 Jul 2016 19:51:24 +0000 (15:51 -0400)]
Added highbit() and lowbit() macros
Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #562
Jinshan Xiong [Thu, 19 May 2016 17:59:40 +0000 (10:59 -0700)]
Improve spl slab cache alloc
The policy is to try to allocate with KM_NOSLEEP, which will lead to
memory allocation with GFP_ATOMIC, and if it fails, it will launch
an taskq to expand slab space.
This way it should be able to get better NUMA memory locality and
reduce the overhead of context switch.
Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #551
Chunwei Chen [Wed, 25 May 2016 23:35:42 +0000 (16:35 -0700)]
Implement a proper rw_tryupgrade
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.
This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes #554
Chunwei Chen [Mon, 23 May 2016 21:12:22 +0000 (14:12 -0700)]
Fix taskq_wait_outstanding re-evaluate tq_next_id
wait_event is a macro, so the current implementation will cause re-
evaluation of tq_next_id every time it wakes up. This would cause
taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Chunwei Chen [Sat, 21 May 2016 01:04:03 +0000 (18:04 -0700)]
Fix race between taskq_destroy and dynamic spawning thread
While taskq_destroy would wait for dynamic_taskq to finish its tasks, but it
does not implies the thread being spawned is up and running. This will cause
taskq to be freed before the thread can exit.
We fix this by using tq_nspawn to indicate how many threads are being spawned
before they are inserted to the thread list. And have taskq_destroy to wait
for it to drop to zero.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Closes #550
Chunwei Chen [Fri, 20 May 2016 23:35:52 +0000 (16:35 -0700)]
Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires
In 39cd90e, I mistakenly disabled the ability of using absolute expire time in
cv_timedwait_hires. I don't quite sure why I did that, so let's restore it.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Tim Chase [Tue, 26 Apr 2016 11:33:52 +0000 (06:33 -0500)]
Clear PF_FSTRANS over spl_filp_fallocate()
The problem described in 2a5d574 also applies to XFS's file or inode
fallocate method. Both paths may trigger writeback and expose this
issue, see the full stack below.
When layered on XFS a warning will be emitted under CentOS7 when entering
either the file or inode fallocate method with PF_FSTRANS already set.
To avoid triggering this error PF_FSTRANS is cleared and then reset
in vn_space().
WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes zfsonlinux/zfs#4529
Tim Chase [Sun, 24 Apr 2016 23:29:03 +0000 (18:29 -0500)]
Use vmem_free() in dfl_free() and add dfl_alloc()
This change was lost, somehow, in e5f9a9a. Since the arrays can be
rather large, they need to be allocated with vmem_zalloc() via dfl_alloc()
and freed with vmem_free() via dfl_free().
The new dfl_alloc() function should be used to allocate object of type
dkioc_free_list_t in order that they're allocated from vmem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes #543
To reduce mutex footprint, we detect the existence of owner in kernel mutex,
and rely on it if it exists.
Note that before Linux 3.0, mutex owner is of type thread_info. Also note
that, in Linux 3.18, the condition for owner is changed from
CONFIG_DEBUG_MUTEXES || CONFIG_SMP to
CONFIG_DEBUG_MUTEXES || CONFIG_MUTEX_SPIN_ON_OWNER
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #540
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #537
Tim Chase [Mon, 8 Feb 2016 19:20:05 +0000 (13:20 -0600)]
Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq
When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another
thread to be spawned. This will cause TQ_NOQUEUE to behave similarly
as it does with non-dynamic taskqs.
Add support for TQ_NOQUEUE to taskq_dispatch_ent().
Signed-off-by: Tim Chase <tim@onlight.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #530
This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms. It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held. On other platforms the lock is never released during
the upgrade process. This is necessary under Linux because the kernel
does not provide an upgrade function.
There are currently no callers in the ZFS code where this change in
behavior is a problem. In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes #534
Brian Behlendorf [Thu, 10 Mar 2016 17:10:29 +0000 (09:10 -0800)]
Remove RPM package restriction
ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64
architectures. Given this the artificial architecture restriction in
the packaging has been removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Yao [Fri, 11 Jul 2014 22:36:28 +0000 (18:36 -0400)]
random_get_pseudo_bytes() need not provide cryptographic strength entropy
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.
The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.
Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:
http://xorshift.di.unimi.it/#speed
The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.
Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.
We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:
1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.
2. Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.
3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.
4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.
5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.
6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:
https://en.wikipedia.org/wiki/Seqlock
The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`). In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372
Chunwei Chen [Thu, 28 Jan 2016 00:55:14 +0000 (16:55 -0800)]
Allow kicking a taskq to spawn more threads
This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529
Chip Parker [Tue, 26 Jan 2016 01:13:50 +0000 (19:13 -0600)]
Ensure spl/ only occurs once in core-y
Update copy-builtin so it may be run multiple times against
the kernel source tree. This change makes sed more discriminating
to ensure spl/ only occurs once in core-y.
Signed-off-by: Chip Parker <aparker@enthought.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #526
Brian Behlendorf [Sat, 23 Jan 2016 19:13:08 +0000 (11:13 -0800)]
Remove RLIM64_INFINITY assert in vn_rdwr()
Previous commit be29e6a updated kobj_read_file() so it no longer
unconditionally passes RLIM64_INFINITY. The vn_rdwr() function
needs to be updated accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #513
Richard Yao [Tue, 15 Dec 2015 16:48:19 +0000 (11:48 -0500)]
kobj_read_file: Return -1 on vn_rdwr() error
I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.
Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #513
Chunwei Chen [Wed, 2 Dec 2015 22:52:46 +0000 (14:52 -0800)]
Use tsd to store tq for taskq_member
To prevent taskq_member holding tq_lock and doing linear search, thus causing
contention. We store the taskq pointer to which the thread belongs in tsd.
This way taskq_member will not need to touch tq_lock, and tsd has per slot
spinlock. So the contention should be reduced greatly.
Brian Behlendorf [Tue, 19 Jan 2016 16:59:47 +0000 (08:59 -0800)]
Linux 4.5 compat: pfn_t typedef
The pfn_t typedef was inherited from Illumos but never directly
used by any SPL consumers. This didn't cause any issues until
the Linux 4.5 kernel introduced a typedef of the same name.
See torvalds/linux/commit/34c0fd54, this patch removes the
unused Illumos version to prevent a conflict.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #524
Chunwei Chen [Mon, 18 Jan 2016 22:41:45 +0000 (14:41 -0800)]
Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark
In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
because earlier kernel with it doesn't turn off __GFP_FS. However, for newer
kernel, we would prefer PF_MEMALLOC_NOIO because it would work for allocation
in kernel which we cannot control otherwise. So in this patch, we turn on both
PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #523
Chunwei Chen [Thu, 7 Jan 2016 03:05:24 +0000 (19:05 -0800)]
Don't hold mutex until release cv in cv_wait
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.
We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.
This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.
Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106
Chunwei Chen [Fri, 18 Dec 2015 02:31:58 +0000 (18:31 -0800)]
Use spl_fstrans_mark instead of memalloc_noio_save
For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.
Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.
This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #515
Issue zfsonlinux/zfs#4111
Tim Chase [Mon, 19 Oct 2015 12:47:52 +0000 (07:47 -0500)]
Provide kstat for taskqs
This patch provides 2 new kstats to display task queues:
/proc/spl/taskqs-all - Display all task queues
/proc/spl/taskqs - Display only "active" task queues
A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.
If the task queue has running threads, displays each thread function's
address (symbolically, if possibly) and its argument.
If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and arguemnt.
If the task queue has any waiters, displays each waiting task's pid.
Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491
Brian Behlendorf [Sat, 12 Dec 2015 00:57:05 +0000 (16:57 -0800)]
Revert "Skip GPL-only symbols test when cross-compiling"
This reverts commit 61bbbd9a775a5517af513e5014edbdd73a32f7e4 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507
Chunwei Chen [Thu, 3 Dec 2015 23:06:03 +0000 (15:06 -0800)]
Don't use tq->tq_lock_flags
The flags argument in spin_lock_irqsave is modified out side of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses local variable for the flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #506
Olaf Faaland [Tue, 13 Oct 2015 23:56:51 +0000 (16:56 -0700)]
Subclass tq_lock to eliminate a lockdep warning
When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking. This is
a false positive.
One such call chain is as follows, when a taskq needs more threads:
taskq_dispatch->taskq_thread_spawn->taskq_dispatch
The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads. The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock. Without subclassing, lockdep believes these
could potentially be the same lock and complains. A similar case occurs
when taskq_dispatch() then calls task_alloc().
This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:
subclass taskq
TQ_LOCK_DYNAMIC dynamic_taskq
TQ_LOCK_GENERAL any other
Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
Olaf Faaland [Wed, 14 Oct 2015 06:08:44 +0000 (23:08 -0700)]
Fix lockdep warning in spl_inode_{lock,unlock}
spl_inode_{lock,unlock} are triggering possible recursive locking
warnings from lockdep. The warning is a false positive.
The lock is used to protect a parent directory during delete/add
operations, used in zfs when writing/removing the cache file. The inode
lock is taken on both the parent inode and the file inode.
VFS provides an enum to subclass the lock. This patch changes the
spin_lock call to _nested version and uses the provided enum.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
Olaf Faaland [Tue, 6 Oct 2015 21:01:46 +0000 (14:01 -0700)]
Add new lock types MUTEX_NOLOCKDEP, and RW_NOLOCKDEP
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.
When lockdep detects these conditions, it disables further lock analysis
for all locks. This causes /proc/lock_stats not to reflect full
information about lock contention, even in locks without dependency
issues.
This commit creates a new type of mutex, MUTEX_NOLOCKDEP. This mutex
type causes subsequent attempts to take or release those locks to be
wrapped in lockdep_off() and lockdep_on().
This commit also creates an RW_NOLOCKDEP type analagous to
MUTEX_NOLOCKDEP.
MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to
that repo, for userspace builds.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
Kamil Domański [Thu, 10 Dec 2015 10:14:08 +0000 (11:14 +0100)]
Skip GPL-only symbols test when cross-compiling
This test depends on being able to execute the resulting binary
which will be impossible when cross-compiling. Instead make a
worst case assumption which allows the build to continue as
recommended by the autoconf manual.
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: tuxoko <tuxoko@gmail.com>
Issue zfsonlinux/zfs#4048
Richard Yao [Thu, 3 Dec 2015 19:15:16 +0000 (14:15 -0500)]
Make taskq_member() use ->journal_info
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame. This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.
This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: tuxoko <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #500
Richard Yao [Wed, 4 Nov 2015 21:41:13 +0000 (16:41 -0500)]
Fix race between getf() and areleasef()
If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.
We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #492
Adding VPATH support, commit 37d7cd9, required that a `src`
and `obj` line be added to the top of the Makefiles. They
must be removed from the Makefiles when builtin.
The code which adds the `spl/` directory to the top level
Makefile was failing due to the addition of the `certs/` path.
The search pattern has been adjusted to be more tolerant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #481
Issue #498
Brian Behlendorf [Mon, 16 Nov 2015 22:45:42 +0000 (14:45 -0800)]
Limit maximum object size in kmem tests
Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests. Large values can result in out of memory errors
for systems with less the 512M of memory. Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Jason Zaman [Sat, 24 Oct 2015 06:15:58 +0000 (14:15 +0800)]
sysmacros: Make P2ROUNDUP not trigger int overflow
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
Signed-off-by: Jason Zaman <jason@perfinion.com> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2505
Closes #488
tuxoko [Fri, 6 Nov 2015 23:00:55 +0000 (15:00 -0800)]
Fix taskq dynamic spawning
Currently taskq_dispatch() will spawn new task with a condition that the caller
is also a member of the taskq. However, under this condition, it will still
cause deadlock where a task on tq1 is waiting another thread, who is trying to
dispatch a task on tq1. So this patch removes the check.
For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy
This will easily deadlock before this patch.
Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #496
Chunwei Chen [Sat, 24 Oct 2015 00:17:57 +0000 (17:17 -0700)]
Don't call kmem_cache_shrink from shrinker
Linux slab will automatically free empty slab when number of partial slab is
over min_partial, so we don't need to explicitly shrink it. In fact, calling
kmem_cache_shrink from shrinker will cause heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936)
Brian Behlendorf [Mon, 12 Oct 2015 19:31:05 +0000 (12:31 -0700)]
Fix CPU hotplug
Allocate a kmem cache magazine for every possible CPU which might
be added to the system. This ensures that when one of these CPUs
is enabled it can be safely used immediately.
For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint. In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482
The spin lock around rw_owner is completely unnecessary. The reason is that it
is only modified in the down_write context. If you race against another thread
modifying it, that means that you aren't holding the rwlock, so taking the
spin lock don't eliminate the race.
Also, we only check rw_owner in RW_WRITE_HELD because spl_rwsem_is_locked
is unnecessary and might need to take spin lock.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
Brian Behlendorf [Wed, 30 Sep 2015 16:26:21 +0000 (09:26 -0700)]
Fix spl-dkms uninstall/update
Modern versions of dkms cleanup the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 4cdcdbf which verifies the existence of the
spl.release build product would never be true.
This patch resolves the issue by updating the conditional to check
in the explicitly installed spl_config.h file for the version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #478
Brian Behlendorf [Mon, 28 Sep 2015 16:08:11 +0000 (09:08 -0700)]
Fix PAX Patch/Grsec SLAB_USERCOPY panic
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS are enabled. When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.
Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.
Richard Yao [Mon, 7 Sep 2015 16:35:21 +0000 (12:35 -0400)]
Disable direct reclaim in taskq worker threads on Linux 3.9+
Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes #474
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tim Chase [Thu, 27 Aug 2015 16:13:20 +0000 (11:13 -0500)]
Create a new thread during recursive taskq dispatch if necessary
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472
Revert "Create a new thread during recursive taskq dispatch if necessary"
This reverts commit 076821e due to a locking issue uncovered in
subsequent testing. An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock. The patch will need to be reworked.
VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472
Tim Chase [Thu, 27 Aug 2015 16:13:20 +0000 (11:13 -0500)]
Create a new thread during recursive taskq dispatch if necessary
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472
Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via
iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use
UIO_BVEC in segflg to determine which "vec".
Also, to be consistent to newer kernel, we make iovec and bio_vec immutable,
and make uio act as an iterator with the new uio_skip field indicating number
of bytes to skip in the first segment.
Brian Behlendorf [Wed, 19 Aug 2015 21:48:21 +0000 (14:48 -0700)]
Linux 4.2 compat: vfs_rename()
Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error. Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface. As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.
Brian Behlendorf [Mon, 27 Jul 2015 22:05:47 +0000 (15:05 -0700)]
Remove needfree, desfree, lotsfree #defines
This patch reverts 77ab5dd. This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places. Those places have subsequently been updated to use
the Linux equivalent Linux functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637
Brian Behlendorf [Fri, 24 Jul 2015 17:32:55 +0000 (10:32 -0700)]
Invert minclsyspri and maxclsyspri
On Linux the meaning of a processes priority is inverted with respect
to illumos. High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
Note this change also reverts some of the priorities changes in prior
commit 62aa81a. The rational is as follows:
spl_kmem_cache - High priority may result in blocked memory allocs
spl_system_taskq - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607
Brian Behlendorf [Thu, 23 Jul 2015 20:45:31 +0000 (13:45 -0700)]
Remove skc_ref from alloc/free paths
As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy. They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.
Removing this code is desirable because there is some evidence that
contention on this atomic negative impacts performance on large-scale
NUMA systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
Brian Behlendorf [Thu, 23 Jul 2015 18:21:08 +0000 (11:21 -0700)]
Add defclsyspri macro
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority. Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority. This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.
All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri. The vast majority of callers were
part of the test suite which won't have an external impact. The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri. This makes it more likely the process will be scheduled
which may help performance.
To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added. When disabled (0) all newly created kernel
threads will use the default kernel thread priority. When enabled (1)
the specified taskq priority will be used. By default this value is
enabled (1).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Mon, 20 Jul 2015 19:18:56 +0000 (12:18 -0700)]
Default to --disable-debug-kmem
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082
Brian Behlendorf [Mon, 29 Jun 2015 16:25:29 +0000 (09:25 -0700)]
Add memory compatibility wrappers
The function vmem_qcache_reap() and global variables 'needfree',
'desfree', and 'lotsfree' are all used in the upstream. While
these variables have no meaning under Linux they're being defined
as 0's to avoid needing to make additional changes to the ARC code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Wed, 24 Jun 2015 16:53:47 +0000 (09:53 -0700)]
Set TASKQ_DYNAMIC for kmem and system taskqs
Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs
to reduce the number of idle threads on the system. Additional
threads will be created on demand up to the previous maximum
thread counts. This should have minimal, if any, impact on
performance.
This makes the system taskq consistent with illumos which is
always created as a dynamic taskq with up to 64 threads.
The task limits for the kmem_cache have been increased to avoid
any unnessisary throttling and to keep a larger reserve of
task_t structures on the free list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458