7 years ago Remove misguided HAVE_MUTEX_OWNER check
Oleg Drokin [Wed, 2 Aug 2017 18:45:16 +0000 (14:45 -0400)]
Remove misguided HAVE_MUTEX_OWNER check

It is just plain unsafe to peek inside the in-kernel
mutex structure and make assumptions about what the kernel
does with internal fields like owner.

The kernel is all too happy to stop doing the expected things,
like tracking the lock owner, once you load a tainting module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) ==
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #632
Closes #633

7 years ago Fix aarch64 build
Brian Behlendorf [Sat, 29 Jul 2017 20:24:39 +0000 (13:24 -0700)]
Fix aarch64 build

Add aarch64 to the list of architectures that do not sanitize the
LDFLAGS from the environment.  See e0aacd9b for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #635

7 years ago Tag spl-0.7.0 (tag: spl-0.7.0)
Brian Behlendorf [Wed, 26 Jul 2017 17:08:57 +0000 (10:08 -0700)]
Tag spl-0.7.0

META file and changelog updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

7 years ago Module parameter to enable spl_panic() to panic the kernel
Oleg Drokin [Wed, 26 Jul 2017 06:03:12 +0000 (02:03 -0400)]
Module parameter to enable spl_panic() to panic the kernel

In unattended operations it's often more useful to have a node
panic and reboot when it encounters problems rather than
sit there indefinitely waiting for somebody to discover it.

This implements an spl_panic_crash module parameter; set it
to nonzero to cause spl_panic() to call panic().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #634

7 years ago Avoid WARN() from procfs on kstat collision
LOLi [Mon, 24 Jul 2017 17:52:53 +0000 (19:52 +0200)]
Avoid WARN() from procfs on kstat collision

When we load a ZFS pool whose spa_name equals the name of an existing
kstat, we would have to create a duplicate entry, which procfs doesn't like.

For instance a ZFS pool named "zil" would have its kstat "txgs"
(module "zfs/zil") installed under "/proc/spl/kstat/zfs/zil":
unfortunately we already have a kstat named "zil" (module "zfs")
installed in the same procfs location.

Avoid this issue by skipping the duplicate entry creation in procfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #628

7 years ago Linux 4.13 compat: wait queues
Brian Behlendorf [Mon, 24 Jul 2017 02:32:14 +0000 (19:32 -0700)]
Linux 4.13 compat: wait queues

Commit torvalds/linux@ac6424b9
- Renamed struct wait_queue -> struct wait_queue_entry.

Commit torvalds/linux@2055da97
- Renamed wait_queue_head::task_list -> wait_queue_head::head
- Renamed wait_queue_entry::task_list -> wait_queue_entry::entry

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #629

7 years ago Tag 0.7.0-rc5 (tag: spl-0.7.0-rc5)
Brian Behlendorf [Thu, 13 Jul 2017 19:07:59 +0000 (12:07 -0700)]
Tag 0.7.0-rc5

Fifth release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

7 years ago Don't cache the system hostid
Brian Behlendorf [Mon, 10 Jul 2017 19:24:52 +0000 (15:24 -0400)]
Don't cache the system hostid

Historically the SPL cached the system hostid the first time it
was accessed.  This was done to speed up subsequent accesses.
But in practice the system hostid is rarely accessed and it's
inconvenient that it doesn't promptly detect /etc/hostid
configuration changes.  Therefore, zone_get_hostid() has been
updated to always refresh the system hostid reported.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #626

7 years ago Add ASSERT3B/VERIFY3B/USEC2NSEC/NSEC2USEC macros
Prakash Surya [Mon, 10 Jul 2017 19:44:23 +0000 (12:44 -0700)]
Add ASSERT3B/VERIFY3B/USEC2NSEC/NSEC2USEC macros

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #627

7 years ago Fix RWSEM_SPINLOCK_IS_RAW check failed
Chunwei Chen [Mon, 19 Jun 2017 18:02:20 +0000 (11:02 -0700)]
Fix RWSEM_SPINLOCK_IS_RAW check failed

Initialize dummy_lock to fix the build error in gcc 7.1.1 with:
  error: ‘dummy_lock’ is used uninitialized in this function

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #622

7 years ago config: allow --with-linux without --with-linux-obj
Chunwei Chen [Wed, 24 May 2017 22:42:34 +0000 (15:42 -0700)]
config: allow --with-linux without --with-linux-obj

Don't use `uname -r` to determine kernel build directory when the user
specified kernel source with --with-linux. Otherwise, the user is forced
to use --with-linux-obj even if they are the same directory, which is
very counterintuitive.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>

7 years ago Improve gitignore
Chunwei Chen [Wed, 24 May 2017 22:23:37 +0000 (15:23 -0700)]
Improve gitignore

Exclude Makefile.in in module/ and fix the gitignore in cmd/
Also, ignore *.patch and *.orig files

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>

7 years ago Fix cv_timedwait timeout
Brian Behlendorf [Thu, 25 May 2017 17:01:44 +0000 (10:01 -0700)]
Fix cv_timedwait timeout

Perform the check for an already-passed expiration time before updating
cvp->cv_mutex with the provided mutex.  This check only depends
on local state.  Doing it first ensures that cvp->cv_mutex will not
be updated in the timeout case or if it's ever called with an
expire_time <= now.
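
The ordering described above can be sketched in userspace C. This is only a
simplified model, not the SPL implementation: model_cv_t, model_mutex_t and
model_cv_timedwait() are hypothetical stand-ins, and the current time is
passed in as a parameter so the example stays deterministic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SPL types; not the real definitions. */
typedef struct model_mutex { int unused; } model_mutex_t;
typedef struct model_cv { model_mutex_t *cv_mutex; } model_cv_t;

/*
 * Returns -1 if expire_time has already passed, otherwise 1.
 * The expiry check depends only on the arguments, so it runs before
 * cvp->cv_mutex is touched.
 */
static int
model_cv_timedwait(model_cv_t *cvp, model_mutex_t *mp,
    int64_t expire_time, int64_t now)
{
	if (expire_time <= now)
		return (-1);	/* timed out: cvp->cv_mutex is left unchanged */

	cvp->cv_mutex = mp;	/* recorded only when a wait will really happen */
	/* ... the real code would sleep on the wait queue here ... */
	return (1);
}
```

With this ordering, a call whose deadline has already passed never modifies
the condition variable's state.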

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #616

7 years ago Linux 4.12 compat: PF_FSTRANS was removed
Chunwei Chen [Tue, 9 May 2017 17:36:54 +0000 (10:36 -0700)]
Linux 4.12 compat: PF_FSTRANS was removed

Change SPL_FSTRANS to optionally contain PF_FSTRANS. Also, add
__spl_pf_fstrans_check for the checks specifically for PF_FSTRANS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #614

7 years ago Tag 0.7.0-rc4 (tag: spl-0.7.0-rc4)
Brian Behlendorf [Fri, 5 May 2017 16:23:03 +0000 (09:23 -0700)]
Tag 0.7.0-rc4

Fourth release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

7 years ago glibc 2.25 compat: remove assert(X=Y)
Olaf Faaland [Mon, 3 Apr 2017 20:33:49 +0000 (13:33 -0700)]
glibc 2.25 compat: remove assert(X=Y)

The assert() related definitions in glibc 2.25 were altered to warn
about assert(X=Y) when -Wparentheses is used.  See
https://abi-laboratory.pro/tracker/changelog/glibc/2.25/log.html

lib/list.c used this construct to set the value of a magic field which
is defined only when debugging.

Replaced the assert()s with #ifndef/#endifs.
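
A minimal sketch of the pattern, assuming the debug-only field is guarded by
NDEBUG (the usual assert() switch). The names model_list_t, model_list_create()
and MODEL_LIST_MAGIC are hypothetical and are not taken from lib/list.c.

```c
#include <stdint.h>

#define MODEL_LIST_MAGIC 0x3fb5c6a9u	/* hypothetical magic value */

typedef struct model_list {
	int size;
#ifndef NDEBUG
	uint32_t magic;		/* exists only in debug builds */
#endif
} model_list_t;

static void
model_list_create(model_list_t *l)
{
	l->size = 0;
	/*
	 * Old style: assert(l->magic = MODEL_LIST_MAGIC);
	 * The assignment was a side effect hidden inside assert(), which
	 * glibc 2.25 flags under -Wparentheses.
	 */
#ifndef NDEBUG
	l->magic = MODEL_LIST_MAGIC;
#endif
}
```

The explicit preprocessor guard keeps the behaviour (the field is set only in
debug builds) while making the assignment visible as a statement.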

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #610

7 years ago Linux 4.11 compat: remove stub for __put_task_struct
Olaf Faaland [Mon, 13 Mar 2017 17:37:10 +0000 (10:37 -0700)]
Linux 4.11 compat: remove stub for __put_task_struct

Before kernel 2.6.29 credentials were embedded in task_structs, and zfs had
cases where one thread would need to refer to the credential of another thread,
forcing it to take a hold on the foreign thread's task_struct to ensure it was
not freed.

Since 2.6.29, the credential has been moved out of the task_struct into a
cred_t.

In addition, the mainline kernel originally did not export __put_task_struct()
but the RHEL5 kernel did, according to zfsonlinux/spl@e811949a570.  As of
2.6.39 the mainline kernel exports it.

There is no longer zfs code that takes or releases holds on a task_struct, and
so there is no longer any reference to __put_task_struct().

This affects the linux 4.11 kernel because the prototype for
__put_task_struct() is in a new include file (linux/sched/task.h) and so the
config check failed to detect the exported symbol.

Remove the unnecessary stub and corresponding config check.  This works on
kernels back to the oldest one currently supported, 2.6.32 as shipped with
CentOS/RHEL.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608

7 years ago Linux 4.11 compat: add linux/sched/signal.h
Olaf Faaland [Tue, 7 Mar 2017 23:33:50 +0000 (15:33 -0800)]
Linux 4.11 compat: add linux/sched/signal.h

In Linux 4.11, torvalds/linux@2a1f062, signal handling related functions
were moved from sched.h into sched/signal.h.

Add configure checks to detect this and include the new file where
needed.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608

7 years ago Linux 4.11 compat: vfs_getattr() takes 4 args
Olaf Faaland [Tue, 7 Mar 2017 21:18:53 +0000 (13:18 -0800)]
Linux 4.11 compat: vfs_getattr() takes 4 args

There are changes to vfs_getattr() in torvalds/linux@a528d35.  The new
interface is:

int vfs_getattr(const struct path *path, struct kstat *stat,
               u32 request_mask, unsigned int query_flags)

The request_mask argument indicates which field(s) the caller intends to
use.  Fields the caller does not specify via request_mask may be set in
the returned struct anyway, but their values may be approximate.

The query_flags argument indicates whether the filesystem must update
the attributes from the backing store.

This patch uses the query_flags value that results in vfs_getattr() behaving
the same as it did with the 2-argument version which the kernel provided
before Linux 4.11.

Members blksize and blocks are now always the same size regardless of
arch.  They match the size of the equivalent members in vnode_t.

The configure checks are modified to ensure that the appropriate
vfs_getattr() interface is used.

A more complete fix, removing the ZFS dependency on vfs_getattr()
entirely, is deferred as it is a much larger project.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608

7 years ago Fix powerpc build
Brian Behlendorf [Mon, 6 Mar 2017 17:16:22 +0000 (09:16 -0800)]
Fix powerpc build

Unlike other architectures, which sanitize the LDFLAGS from the
environment in arch/<arch>/Makefile, the powerpc Makefile
allows LDFLAGS to be passed through, resulting in the following
build failure.

  /usr/bin/ld: unrecognized option '-Wl,-z,relro'

LDFLAGS is set in /usr/lib/rpm/redhat/macros by default.  Clear
the environment variable when building kmods for powerpc.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #607

7 years ago Linux 4.11 compat: set_task_state() removed
Olaf Faaland [Thu, 23 Feb 2017 17:52:08 +0000 (09:52 -0800)]
Linux 4.11 compat: set_task_state() removed

Replace uses of set_task_state(current, STATE) with
set_current_state(STATE).

In Linux 4.11, torvalds/linux@642fa44, set_task_state() is removed.

All spl uses are of the form set_task_state(current, STATE).
set_current_state(STATE) is equivalent and has been available since
Linux 2.2.26.

Furthermore, set_current_state(STATE) is already used in about 15
locations within spl.  This change should have no impact other than
removing an unnecessary dependency.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #603

7 years ago Use kernel slab for vn_cache and vn_file_cache
Chunwei Chen [Tue, 31 Jan 2017 21:44:01 +0000 (13:44 -0800)]
Use kernel slab for vn_cache and vn_file_cache

Resolve a false positive in the kmemleak checker by shifting to the
kernel slab.  It shows up because vn_file_cache is using KMC_KMEM
which is directly allocated using __get_free_pages, which is not
automatically tracked by kmemleak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #599

7 years ago Add a PAGESHIFT definition
David Quigley [Tue, 31 Jan 2017 18:36:18 +0000 (11:36 -0700)]
Add a PAGESHIFT definition

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #598

8 years ago Tag 0.7.0-rc3 (tag: spl-0.7.0-rc3)
Brian Behlendorf [Fri, 20 Jan 2017 18:16:32 +0000 (10:16 -0800)]
Tag 0.7.0-rc3

Third release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

8 years ago Reimplement rt_mutex_owner to fix build with DEBUG & PREEMPT_RT_FULL
clefru [Thu, 19 Jan 2017 22:41:38 +0000 (23:41 +0100)]
Reimplement rt_mutex_owner to fix build with DEBUG & PREEMPT_RT_FULL

rt_mutex_owner is internal to kernel/locking/rtmutex_common.h and
inaccessible for SPL via the public kernel headers. The way of
accessing the owner has been stable since at least 3.13 ([1], [2]),
which is masking the lowest bit in the owner pointer in rt_mutex. We
do the same.
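
The masking can be illustrated with a small userspace sketch. The struct and
function names are hypothetical models, not the kernel's definitions; the
only detail taken from the text above is that the lowest bit of the owner
pointer is a flag and must be cleared to recover the owner.

```c
#include <stdint.h>

#define MODEL_OWNER_FLAG_BIT	((uintptr_t)1)	/* lowest bit carries a flag */

/* Hypothetical model of a lock whose owner pointer carries a flag bit. */
struct model_rt_mutex {
	void *owner;
};

/* Recover the owner by masking off the lowest bit of the stored pointer. */
static void *
model_rt_mutex_owner(const struct model_rt_mutex *lock)
{
	return ((void *)((uintptr_t)lock->owner & ~MODEL_OWNER_FLAG_BIT));
}
```

This relies on the owner being at least 2-byte aligned, so that the lowest
bit of a valid pointer is always zero.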

[1] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=3.13#L99
[2] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=4.9#L78

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes #593

8 years ago Remove identical if statements in module/spl/spl-vnode.c
George Melikov [Thu, 19 Jan 2017 22:32:45 +0000 (01:32 +0300)]
Remove identical if statements in module/spl/spl-vnode.c

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #594

8 years ago Add support for recent kmem_cache_create_usercopy
Kevin Tanguy [Tue, 17 Jan 2017 20:05:14 +0000 (21:05 +0100)]
Add support for recent kmem_cache_create_usercopy

The SLAB_USERCOPY flag was used to tell PAX
not to kill copies from kernel to userland.

With a recent grsecurity patchset and
CONFIG_GRKERNSEC_HIDESYM, which enables
CONFIG_PAX_USERCOPY, zfs would panic.

Handle the newer API while keeping the old one functional.

Tested-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: spendergrsec <spender@grsecurity.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kevin Tanguy <kevin.tanguy@ovh.net>
Closes #595

8 years ago Update struct member initializers to C89
RageLtMan [Fri, 13 Jan 2017 22:12:42 +0000 (17:12 -0500)]
Update struct member initializers to C89

When building SPL within the kernel tree, C99 initializers cause
build failures and need to be converted to C89 as kernel CFLAGS
specify -std=gnu89.

This fix was provided by @behlendorf in #595 discussion notes and
manually implemented in the current master revision.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Closes #597

8 years ago Add support for rw semaphore under PREEMPT_RT_FULL
Clemens Fruhwirth [Sat, 17 Dec 2016 16:09:57 +0000 (17:09 +0100)]
Add support for rw semaphore under PREEMPT_RT_FULL

The main complication from the RT patch set is that the RW semaphore
locks change such that read locks on an rwsem can be taken only by
a single thread.  All other threads are locked out. This single
thread can take a read lock multiple times though. The underlying
implementation changes to a mutex with an additional read_depth
count.

The implementation can be best understood by inspecting the RT
patch.  rwsem_rt.h and rt.c give the best insight into how RT
rwsem works. My implementation for rwsem_tryupgrade is basically
an inversion of rt_downgrade_write found in rt.c. Please see the
comments in the code.

Unfortunately, I have to drop SPLAT rwlock test4 completely as this
test tries to take multiple locks from different threads, which RT
rwsems do not support.  Otherwise SPLAT, zconfig.sh, zpios-sanity.sh
and zfs-tests.sh pass on my Debian-testing VM with the kernel
linux-image-4.8.0-1-rt-amd64.

Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes zfsonlinux/zfs#5491
Closes #589
Closes #308

8 years ago Remove stale comment from rw_tryupgrade()
Clemens Fruhwirth [Sat, 17 Dec 2016 16:10:25 +0000 (17:10 +0100)]
Remove stale comment from rw_tryupgrade()

Commit f58040c0fc8bc6490fcc75db7fc3e709dfc3c656 should have removed
this comment which is no longer relevant.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Issue #589

8 years ago Refactor some splat macro to function
Chunwei Chen [Thu, 15 Dec 2016 02:24:47 +0000 (18:24 -0800)]
Refactor some splat macro to function

Refactor the code by making splat_test_{init,fini}, splat_subsystem_{init,fini}
into functions. They have no reason to be macros, and inlining every call
would be too bloated.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>

8 years ago Fix splat memleak
Chunwei Chen [Thu, 15 Dec 2016 19:12:50 +0000 (11:12 -0800)]
Fix splat memleak

SPLAT_TEST_FINI didn't call kfree, causing a memory leak.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>

8 years ago Add system_delay_taskq for long delay
Chunwei Chen [Thu, 8 Dec 2016 21:00:20 +0000 (13:00 -0800)]
Add system_delay_taskq for long delay

Add a dedicated system_delay_taskq for long-delay tasks like spa_deadman and
zpl_posix_acl_free. This allows us to use system_taskq by dispatching
multiple tasks and then calling taskq_wait_outstanding.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #588

8 years ago Limit number of tasks shown in taskq proc
Chunwei Chen [Thu, 1 Dec 2016 18:06:27 +0000 (10:06 -0800)]
Limit number of tasks shown in taskq proc

This prevents holding tq_lock for too long.

Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq
would easily cause a lockup. While that bug has been fixed, it's probably
still a good idea to do this in case task lists grow too large.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #586

8 years ago Add TASKQID_INVALID and TASKQID_INITIAL macros
Ubuntu [Fri, 28 Oct 2016 21:23:30 +0000 (21:23 +0000)]
Add TASKQID_INVALID and TASKQID_INITIAL macros

Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the
taskq implementation and test cases to use them.  This is solely
for the purposes of readability and introduces no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

8 years ago Fix vmem_size()
Ubuntu [Fri, 28 Oct 2016 20:56:38 +0000 (20:56 +0000)]
Fix vmem_size()

Add a minimal implementation of vmem_size() which accounts for the
virtual memory usage of the SPL's kmem cache.  This functionality
is only useful on 32-bit systems with a small virtual address space.

The following assumptions are made:

  1) The major SPL consumer of virtual memory is the kmem cache.
  2) Memory allocated with vmem_alloc() is short lived and can be ignored.
  3) Allow a 4MB floor as a generous pad given normal consumption.
  4) The spl_kmem_cache_sem only contends with cache create/destroy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

8 years ago Tag 0.7.0-rc2 (tag: spl-0.7.0-rc2)
Brian Behlendorf [Tue, 25 Oct 2016 20:13:49 +0000 (13:13 -0700)]
Tag 0.7.0-rc2

Second release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

8 years ago Linux 4.9 compat: group_info changes
Chunwei Chen [Wed, 19 Oct 2016 00:30:41 +0000 (17:30 -0700)]
Linux 4.9 compat: group_info changes

In Linux 4.9, torvalds/linux@81243ea, group_info changed from a 2D array via
->blocks to a 1D array via ->gid. We change the spl cred functions accordingly.
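
To make the layout change concrete, here is a hedged userspace sketch of the
two indexing schemes. MODEL_PER_BLOCK, model_gid_t and both accessor
functions are hypothetical; the real kernel block size and types differ.

```c
#include <stddef.h>

#define MODEL_PER_BLOCK 4	/* hypothetical block size, for illustration */

typedef unsigned int model_gid_t;

/* Old layout: the i-th group lives in a 2D array of fixed-size blocks. */
static model_gid_t
model_group_at_2d(model_gid_t *const *blocks, size_t i)
{
	return (blocks[i / MODEL_PER_BLOCK][i % MODEL_PER_BLOCK]);
}

/* New layout: the i-th group is a plain index into one flat array. */
static model_gid_t
model_group_at_1d(const model_gid_t *gid, size_t i)
{
	return (gid[i]);
}
```

Both accessors return the same group for the same index; only the storage
layout differs, which is why the cred helpers need a per-kernel variant.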

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #581

8 years ago Fix splat-cred.c cred usage
Chunwei Chen [Wed, 19 Oct 2016 00:29:26 +0000 (17:29 -0700)]
Fix splat-cred.c cred usage

There is no need to crhold current_cred(). Fix a possible leak in
splat_cred_test2.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556

8 years ago Fix crgetgroups out-of-bound and misc cred fix
Chunwei Chen [Tue, 18 Oct 2016 22:52:30 +0000 (15:52 -0700)]
Fix crgetgroups out-of-bound and misc cred fix

init_groups has 0 nblocks, so calling the current crgetgroups with
init_groups would result in an out-of-bounds access. We fix this by returning
NULL when nblocks is 0.

Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return
blocks[0].

Also, remove all get_group_info calls. The cred already holds a reference on
the group_info, and the cred is not mutable, so there's no reason to hold an
extra reference while we hold the cred.
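
The two guards described above can be sketched as follows. This is a
simplified userspace model: struct model_group_info, MODEL_PER_BLOCK and the
model_* functions are hypothetical and only mirror the shape of the fix.

```c
#include <stddef.h>

#define MODEL_PER_BLOCK 4	/* hypothetical stand-in for NGROUPS_PER_BLOCK */

typedef unsigned int model_gid_t;

struct model_group_info {
	int ngroups;
	int nblocks;
	model_gid_t **blocks;
};

/* Return the first block, or NULL when there are no blocks at all. */
static model_gid_t *
model_crgetgroups(const struct model_group_info *gi)
{
	if (gi->nblocks == 0)
		return (NULL);	/* avoids reading blocks[0] out of bounds */
	return (gi->blocks[0]);
}

/* Never report more groups than the single returned block can hold. */
static int
model_crgetngroups(const struct model_group_info *gi)
{
	return (gi->ngroups < MODEL_PER_BLOCK ? gi->ngroups : MODEL_PER_BLOCK);
}
```

A caller that pairs the two functions can then iterate safely, because the
reported count never exceeds what the returned pointer covers.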

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556

8 years ago Fix out-of-bound in per_cpu in spl_random_init
tuxoko [Sat, 8 Oct 2016 03:59:46 +0000 (20:59 -0700)]
Fix out-of-bound in per_cpu in spl_random_init

When iterating per_cpu values, we need to use for_each_possible_cpu. While
NR_CPUS indicates the number of CPUs supported by the kernel, the kernel might
not initialize all of them if it decides they cannot be used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #578

8 years ago Linux 4.8 compat: Fix RW_READ_HELD
tuxoko [Sat, 8 Oct 2016 03:53:58 +0000 (20:53 -0700)]
Linux 4.8 compat: Fix RW_READ_HELD

Linux 4.8, starting from torvalds/linux@19c5d690e, sets owner to 1 when
read held instead of leaving it NULL. So we change the condition to
`rw_owner(rwp) <= 1` in RW_READ_HELD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes zfsonlinux/zfs#5233
Closes #577

8 years ago Fix p0 initializer
Brian Behlendorf [Wed, 5 Oct 2016 00:26:36 +0000 (17:26 -0700)]
Fix p0 initializer

Due to changes in the task_struct the following warning occurs
when initializing the global p0.  Since this structure only exists
for its address to be taken, initialize it in a manner which isn't
sensitive to internal changes to the structure.

  module/spl/spl-generic.c:58:1: error: missing braces around
  initializer [-Werror=missing-braces]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #576

8 years ago Fix aarch64 type warning
Brian Behlendorf [Sun, 2 Oct 2016 01:33:01 +0000 (18:33 -0700)]
Fix aarch64 type warning

Explicitly cast type in splat-rwlock.c test case to silence
the following warning.

  warning: format ‘%ld’ expects argument of type ‘long int’,
  but argument N has type ‘int’

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #574

8 years ago Fix automatically generated release number
Brian Behlendorf [Wed, 21 Sep 2016 20:44:32 +0000 (13:44 -0700)]
Fix automatically generated release number

When building from the head of a branch a release number is
automatically generated with `git describe` using the last tag
on that branch as the base.  For this to work the last tag on the
branch needs to be predictable given the current META file.

This logic was accidentally broken when an -rcX tag was added to
the branch.  Update it to search for a VERSION or VERSION-RELEASE
tag.

Reviewed-by: Chris Siebenmann <cks.git01@cs.toronto.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#5105
Closes #572

8 years ago Increase spl_kmem_alloc_warn limit
Brian Behlendorf [Sat, 17 Sep 2016 00:10:36 +0000 (17:10 -0700)]
Increase spl_kmem_alloc_warn limit

In order to support ABD with large blocks the spl_kmem_alloc_warn
limit needs to be increased to 64K.

A 16M block requires that pointers be stored for 4096 4K-pages
on an x86_64 system.  Each of these pointers is 8 bytes requiring
an allocation of 8*4096=32,768 bytes.  The addition of a small
header to this structure pushes the allocation over the default
32K warning threshold.
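
The arithmetic above can be checked with a short sketch. The function name
and the header_bytes parameter are hypothetical; the text does not state the
actual header size, so it is left as an input.

```c
#include <stddef.h>

/*
 * Bytes needed to store one pointer per page of a block, plus a header.
 * header_bytes is a hypothetical parameter for the "small header".
 */
static size_t
model_ptr_array_bytes(size_t block_bytes, size_t page_bytes,
    size_t ptr_bytes, size_t header_bytes)
{
	size_t npages = block_bytes / page_bytes;

	return (npages * ptr_bytes + header_bytes);
}
```

For a 16M block with 4K pages and 8-byte pointers, the pointer array alone is
4096 * 8 = 32,768 bytes, so any non-empty header exceeds a 32K threshold while
staying well under 64K.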

In addition, fix a small bug where MAX was used instead of MIN
when setting the default.  This ensures a reasonable limit is
still set on systems with page sizes larger than 4K.

Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #571

8 years ago Fix spl check.sh script
legend-hua [Thu, 15 Sep 2016 00:17:00 +0000 (08:17 +0800)]
Fix spl check.sh script

Update splat_cmd to reference the correct location of the splat utility.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Liu Hua <liu.hua130@zte.com.cn>
Closes #570

8 years ago Cleanup in cred.h
tuxoko [Wed, 14 Sep 2016 23:59:31 +0000 (16:59 -0700)]
Cleanup in cred.h

Remove the code that doesn't make any sense.

Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #569

8 years ago Tag 0.7.0-rc1 (tag: spl-0.7.0-rc1)
Brian Behlendorf [Wed, 7 Sep 2016 17:33:21 +0000 (10:33 -0700)]
Tag 0.7.0-rc1

First release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

8 years ago Fix: handle NULL case in spl_kmem_free_track()
GeLiXin [Fri, 19 Aug 2016 06:50:21 +0000 (14:50 +0800)]
Fix: handle NULL case in spl_kmem_free_track()

When DEBUG_KMEM_TRACKING is enabled in SPL, we keep track of all
the buffers allocated by kmem_alloc() and kmem_zalloc().  If a NULL
pointer, which has no tracking info in SPL, is passed to
spl_kmem_free_track(), we just ignore it.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4967
Closes #567

8 years ago Fix HAVE_MUTEX_OWNER test for kernels prior to 4.6
Tim Chase [Mon, 1 Aug 2016 13:19:19 +0000 (08:19 -0500)]
Fix HAVE_MUTEX_OWNER test for kernels prior to 4.6

Recent 4.X kernels prior to 4.6 require #include of spinlock.h in
order to get the definition of __ARCH_SPIN_LOCK_UNLOCKED which is
used by DEFINE_MUTEX().

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #566

8 years ago Add handling for kernel 4.7's CONFIG_TRIM_UNUSED_KSYMS
Nikolay Borisov [Fri, 29 Jul 2016 15:48:33 +0000 (18:48 +0300)]
Add handling for kernel 4.7's CONFIG_TRIM_UNUSED_KSYMS

Kernel 4.7 added the option to trim unused exported symbols. In
my testing this proved problematic, since the PDE_DATA function
was considered unused and as such was trimmed. This in turn caused the
respective test during spl's configure stage to falsely detect that
PDE_DATA is not defined, which in turn caused build failures later.

Handle this situation by detecting whether CONFIG_TRIM_UNUSED_KSYMS
is enabled and refusing to build against a kernel which has it enabled.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565

8 years ago Add gitignore entry for spl-*.o.d files
Nikolay Borisov [Fri, 29 Jul 2016 15:48:04 +0000 (18:48 +0300)]
Add gitignore entry for spl-*.o.d files

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #565

8 years ago Linux 4.8 compat: rw_semaphore atomic_long_t count
Brian Behlendorf [Tue, 26 Jul 2016 23:37:46 +0000 (23:37 +0000)]
Linux 4.8 compat: rw_semaphore atomic_long_t count

For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type.  A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro.  See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #563

8 years ago Added highbit() and lowbit() macros
Tom Caputi [Thu, 14 Jul 2016 19:51:24 +0000 (15:51 -0400)]
Added highbit() and lowbit() macros

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #562

8 years ago Add _ALIGNMENT_REQUIRED to isa_defs.h for checksums
Tony Hutter [Wed, 15 Jun 2016 00:36:39 +0000 (17:36 -0700)]
Add _ALIGNMENT_REQUIRED to isa_defs.h for checksums

_ALIGNMENT_REQUIRED needs to be #defined in isa_defs.h in order to
port the Illumos checksum code to ZoL:

4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #561

8 years ago Improve spl slab cache alloc
Jinshan Xiong [Thu, 19 May 2016 17:59:40 +0000 (10:59 -0700)]
Improve spl slab cache alloc

The policy is to try to allocate with KM_NOSLEEP, which will lead to
memory allocation with GFP_ATOMIC, and if that fails, to launch
a taskq to expand slab space.

This way it should be able to get better NUMA memory locality and
reduce the overhead of context switches.
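
A rough userspace model of that policy, under stated assumptions: struct
model_cache, the atomic_ok switch and the grow_requests counter are all
hypothetical, with the counter standing in for dispatching a taskq item. The
real allocator's retry and wait logic is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache model: atomic_ok simulates GFP_ATOMIC succeeding. */
struct model_cache {
	bool atomic_ok;
	int grow_requests;	/* stand-in for dispatched slab-grow tasks */
};

/* Non-sleeping attempt: either succeeds immediately or returns NULL. */
static void *
model_alloc_nosleep(const struct model_cache *c, size_t size)
{
	return (c->atomic_ok ? malloc(size) : NULL);
}

/* Fast path first; request an asynchronous grow only on failure. */
static void *
model_cache_alloc(struct model_cache *c, size_t size)
{
	void *obj = model_alloc_nosleep(c, size);

	if (obj == NULL)
		c->grow_requests++;	/* real code would dispatch to a taskq */
	return (obj);
}
```

The point of the ordering is that the common case is served in the calling
context, and the hand-off to another thread happens only when the fast path
fails.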

Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #551

8 years ago Fix use-after-free in splat_taskq_test7
Chunwei Chen [Sat, 28 May 2016 00:28:12 +0000 (17:28 -0700)]
Fix use-after-free in splat_taskq_test7

This splat_vprint is using tq_arg->name after tq_arg is freed.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #557

8 years ago Implement a proper rw_tryupgrade
Chunwei Chen [Wed, 25 May 2016 23:35:42 +0000 (16:35 -0700)]
Implement a proper rw_tryupgrade

The current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and
then does rw_enter(RW_READER) if that fails. This violates the assumption that
rw_tryupgrade is atomic and could cause extra contention or even lock
inversion.

This patch implements a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.
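
The cmpxchg approach for the normal rwsem case can be sketched with C11
atomics. The count encodings MODEL_SINGLE_READER and MODEL_SINGLE_WRITER are
hypothetical placeholders; the kernel's real bias values differ by version
and architecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encodings of the count field, for illustration only. */
#define MODEL_SINGLE_READER	1L
#define MODEL_SINGLE_WRITER	(-1L)

struct model_rwsem {
	atomic_long count;
};

/*
 * Upgrade succeeds only if the count is exactly "one reader" at the
 * instant of the exchange; otherwise nothing is changed.
 */
static bool
model_rwsem_tryupgrade(struct model_rwsem *sem)
{
	long expected = MODEL_SINGLE_READER;

	return (atomic_compare_exchange_strong(&sem->count, &expected,
	    MODEL_SINGLE_WRITER));
}
```

Because the check and the update are a single atomic step, the lock is never
dropped in between, which is what the exit-then-tryenter sequence could not
guarantee.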

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes #554

8 years ago Add isa_defs for MIPS
YunQiang Su [Sat, 28 May 2016 11:30:36 +0000 (19:30 +0800)]
Add isa_defs for MIPS

GCC for MIPS defines _LP64 only when building for 64-bit,
and does not define _ILP32 when building for 32-bit.

Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #558

8 years ago Fix taskq_wait_outstanding re-evaluate tq_next_id
Chunwei Chen [Mon, 23 May 2016 21:12:22 +0000 (14:12 -0700)]
Fix taskq_wait_outstanding re-evaluate tq_next_id

wait_event is a macro, so the current implementation will cause
re-evaluation of tq_next_id every time it wakes up. This would cause
taskq_wait_outstanding(tq, 0) to be equivalent to taskq_wait(tq).
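
The pitfall can be reproduced with a small deterministic model. Everything
here is hypothetical (the struct, MODEL_WAIT_EVENT, and the step function
that completes one task and possibly dispatches another); it only shows how a
macro condition is re-read on every wakeup unless the target is snapshotted.

```c
/* Hypothetical queue model: ids below tq_lowest_id are complete. */
struct model_taskq {
	long tq_next_id;	/* id the next dispatched task will get */
	long tq_lowest_id;	/* lowest id still outstanding */
	int arrivals;		/* tasks that will be dispatched while waiting */
};

/* One "wakeup": a task completes and, possibly, a new one is dispatched. */
static void
model_step(struct model_taskq *tq)
{
	tq->tq_lowest_id++;
	if (tq->arrivals > 0) {
		tq->tq_next_id++;
		tq->arrivals--;
	}
}

/* Like wait_event(): cond is textually re-evaluated on every iteration. */
#define MODEL_WAIT_EVENT(tq, cond, steps) \
	do { while (!(cond)) { model_step(tq); (steps)++; } } while (0)

/* Buggy: tq_next_id is re-read each time, so the target keeps moving. */
static int
model_wait_buggy(struct model_taskq *tq)
{
	int steps = 0;

	MODEL_WAIT_EVENT(tq, tq->tq_lowest_id > tq->tq_next_id - 1, steps);
	return (steps);
}

/* Fixed: snapshot the target id once, before entering the macro. */
static int
model_wait_fixed(struct model_taskq *tq)
{
	int steps = 0;
	long id = tq->tq_next_id - 1;

	MODEL_WAIT_EVENT(tq, tq->tq_lowest_id > id, steps);
	return (steps);
}
```

In this model the fixed variant stops once the tasks outstanding at call time
are done, while the buggy variant also waits for every task dispatched in the
meantime, which matches the behaviour of a full wait.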

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553

8 years ago Fix race between taskq_destroy and dynamic spawning thread
Chunwei Chen [Sat, 21 May 2016 01:04:03 +0000 (18:04 -0700)]
Fix race between taskq_destroy and dynamic spawning thread

Although taskq_destroy waits for dynamic_taskq to finish its tasks, that does
not imply the thread being spawned is up and running. This can cause the
taskq to be freed before the thread can exit.

We fix this by using tq_nspawn to indicate how many threads are being spawned
before they are inserted into the thread list, and by having taskq_destroy
wait for it to drop to zero.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Closes #550

8 years agoRestore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires
Chunwei Chen [Fri, 20 May 2016 23:35:52 +0000 (16:35 -0700)]
Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires

In 39cd90e, I mistakenly disabled the ability to use an absolute expire time
in cv_timedwait_hires. I am not quite sure why I did that, so let's restore it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553

8 years agoLinux 4.7 compat: inode_lock() and friends
Chunwei Chen [Wed, 18 May 2016 18:28:46 +0000 (11:28 -0700)]
Linux 4.7 compat: inode_lock() and friends

Linux 4.7 changes i_mutex to i_rwsem, and we should use inode_lock and
inode_lock_shared to take exclusive and shared locks respectively.

We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
kernels you'll always take an exclusive lock.

We also add all the other inode_lock friends. Nested users should now
explicitly call spl_inode_lock_nested with the correct subclass.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4665
Closes #549

8 years agoAdd cv_timedwait_sig_hires to allow interruptible sleep
Chunwei Chen [Wed, 11 May 2016 23:51:29 +0000 (16:51 -0700)]
Add cv_timedwait_sig_hires to allow interruptible sleep

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #548

8 years agoAdd a macro to convert seconds to nanoseconds and vice-versa
David Quigley [Thu, 5 May 2016 23:10:46 +0000 (19:10 -0400)]
Add a macro to convert seconds to nanoseconds and vice-versa

Required infrastructure for zfsonlinux/zfs#4600.
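
As a rough sketch of what such conversion macros look like, the SKETCH_*
names below are hypothetical and are not taken from the SPL headers. The
cast to a 64-bit type before multiplying is the detail that matters, since a
32-bit intermediate would overflow after about two seconds.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the SPL definitions. */
#define	SKETCH_NANOSEC		1000000000LL

/* Widen first so the multiplication is done in 64 bits. */
#define	SKETCH_SEC2NSEC(s)	((int64_t)(s) * SKETCH_NANOSEC)

/* Integer division: any fractional second is truncated. */
#define	SKETCH_NSEC2SEC(n)	((int64_t)(n) / SKETCH_NANOSEC)
```

For example, SKETCH_SEC2NSEC(3) yields 3000000000 and
SKETCH_NSEC2SEC(2500000000LL) yields 2.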

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #546

8 years agoClear PF_FSTRANS over spl_filp_fallocate()
Tim Chase [Tue, 26 Apr 2016 11:33:52 +0000 (06:33 -0500)]
Clear PF_FSTRANS over spl_filp_fallocate()

The problem described in 2a5d574 also applies to XFS's file or inode
fallocate method.  Both paths may trigger writeback and expose this
issue, see the full stack below.

When layered on XFS a warning will be emitted under CentOS7 when entering
either the file or inode fallocate method with PF_FSTRANS already set.
To avoid triggering this error PF_FSTRANS is cleared and then reset
in vn_space().

WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0

Call Trace:
 [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0
 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
 [<ffffffff81173ed7>] __writepage+0x17/0x40
 [<ffffffff81176f81>] write_cache_pages+0x251/0x530
 [<ffffffff811772b1>] generic_writepages+0x51/0x80
 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs]
 [<ffffffff81177300>] do_writepages+0x20/0x30
 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100
 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0
 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs]
 [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs]
 [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl]
 [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs]
 [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs]
 [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs]
 [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl]
 [<ffffffff810c1777>] kthread+0xd7/0xf0
 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes zfsonlinux/zfs#4529

8 years agoUse vmem_free() in dfl_free() and add dfl_alloc()
Tim Chase [Sun, 24 Apr 2016 23:29:03 +0000 (18:29 -0500)]
Use vmem_free() in dfl_free() and add dfl_alloc()

This change was lost, somehow, in e5f9a9a.  Since the arrays can be
rather large, they need to be allocated with vmem_zalloc() via dfl_alloc()
and freed with vmem_free() via dfl_free().

The new dfl_alloc() function should be used to allocate objects of type
dkioc_free_list_t so that they're allocated from vmem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes #543

8 years agoUse kernel provided mutex owner
Chunwei Chen [Tue, 12 Apr 2016 19:05:14 +0000 (12:05 -0700)]
Use kernel provided mutex owner

To reduce the mutex footprint, we detect the existence of owner in the kernel
mutex, and rely on it if it exists.

Note that before Linux 3.0, mutex owner is of type thread_info. Also note
that, in Linux 3.18, the condition for owner is changed from
CONFIG_DEBUG_MUTEXES || CONFIG_SMP to
CONFIG_DEBUG_MUTEXES || CONFIG_MUTEX_SPIN_ON_OWNER.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #540

8 years agoAdd support for s390[x].
Dimitri John Ledkov [Wed, 16 Mar 2016 21:32:08 +0000 (21:32 +0000)]
Add support for s390[x].

Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #537

8 years agoAllow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq
Tim Chase [Mon, 8 Feb 2016 19:20:05 +0000 (13:20 -0600)]
Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq

When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another
thread to be spawned.  This will cause TQ_NOQUEUE to behave similarly
to how it does with non-dynamic taskqs.

Add support for TQ_NOQUEUE to taskq_dispatch_ent().

Signed-off-by: Tim Chase <tim@onlight.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #530

8 years agoAdd rw_tryupgrade()
Brian Behlendorf [Wed, 9 Mar 2016 22:20:48 +0000 (14:20 -0800)]
Add rw_tryupgrade()

This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms.  It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held.  On other platforms the lock is never released during
the upgrade process.  This is necessary under Linux because the kernel
does not provide an upgrade function.

There are currently no callers in the ZFS code where this change in
behavior is a problem.  In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes #534

8 years agoRemove RPM package restriction
Brian Behlendorf [Thu, 10 Mar 2016 17:10:29 +0000 (09:10 -0800)]
Remove RPM package restriction

ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64
architectures.  Given this, the artificial architecture restriction in
the packaging has been removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
8 years agoChanges to support zfs encryption
Tom Caputi [Thu, 18 Feb 2016 23:24:29 +0000 (18:24 -0500)]
Changes to support zfs encryption

Unused modlinkage struct removed and ntohll functions added.
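
For readers unfamiliar with ntohll, a minimal user-space sketch of a 64-bit
network-to-host conversion follows. The name sketch_ntohll and the byte-wise
approach are assumptions for illustration; the functions actually added by
this commit may be implemented differently.

```c
#include <stdint.h>
#include <string.h>

/*
 * Interpret the 8 bytes of `net` as a big-endian (network order) value.
 * Working byte by byte gives the same result on little- and big-endian
 * hosts, without needing to know the host byte order.
 */
static uint64_t
sketch_ntohll(uint64_t net)
{
	unsigned char b[8];
	uint64_t host = 0;
	int i;

	memcpy(b, &net, sizeof (b));
	for (i = 0; i < 8; i++)
		host = (host << 8) | b[i];
	return (host);
}
```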

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #533

8 years agorandom_get_pseudo_bytes() need not provide cryptographic strength entropy
Richard Yao [Fri, 11 Jul 2014 22:36:28 +0000 (18:36 -0400)]
random_get_pseudo_bytes() need not provide cryptographic strength entropy

Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.

Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.

We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.

2.  Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.

3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about 40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.

4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.

5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.

6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e. `NR_CPUS == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPUS == 1`).  In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.
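
A minimal user-space sketch of the kind of generator described above is
shown here. It uses the xorshift128+ step with the shift constants 23, 18
and 5, which is one published parameter set and is assumed for illustration;
the sketch_* names are hypothetical, the per-CPU storage is reduced to a
caller-supplied two-word state, and the seeding and jump logic are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * One xorshift128+ step over a 128-bit state held in s[0] and s[1].
 * The state must not be all zero. Only shifts, XORs and one addition are
 * used, so no locks, barriers or atomics are needed when each CPU owns
 * its own state.
 */
static uint64_t
sketch_xorshift128plus(uint64_t s[2])
{
	uint64_t s1 = s[0];
	const uint64_t s0 = s[1];

	s[0] = s0;
	s1 ^= s1 << 23;
	s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
	return (s[1] + s0);
}

/* Fill buf with len pseudo-random bytes, 8 bytes per generator step. */
static void
sketch_random_bytes(uint64_t s[2], uint8_t *buf, size_t len)
{
	while (len > 0) {
		uint64_t r = sketch_xorshift128plus(s);
		size_t n = len < sizeof (r) ? len : sizeof (r);

		memcpy(buf, &r, n);
		buf += n;
		len -= n;
	}
}
```

Each state is two 64-bit words, i.e. 16 bytes, which is where the 64KB
figure for 4096 CPUs comes from (4096 x 16 = 65536 bytes).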

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372

8 years agoAllow kicking a taskq to spawn more threads
Chunwei Chen [Thu, 28 Jan 2016 00:55:14 +0000 (16:55 -0800)]
Allow kicking a taskq to spawn more threads

This patch adds a module parameter, spl_taskq_kick. Writing a non-zero value
to it scans all the taskqs; if a taskq contains a task that has been pending
for more than 5 seconds, it is forced to spawn a new thread. This is meant as
an emergency recovery from deadlock, not a general solution.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529

9 years agoEnsure spl/ only occurs once in core-y
Chip Parker [Tue, 26 Jan 2016 01:13:50 +0000 (19:13 -0600)]
Ensure spl/ only occurs once in core-y

Update copy-builtin so it may be run multiple times against
the kernel source tree.  This change makes sed more discriminating
to ensure spl/ only occurs once in core-y.

Signed-off-by: Chip Parker <aparker@enthought.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #526

9 years agoRemove RLIM64_INFINITY assert in vn_rdwr()
Brian Behlendorf [Sat, 23 Jan 2016 19:13:08 +0000 (11:13 -0800)]
Remove RLIM64_INFINITY assert in vn_rdwr()

Previous commit be29e6a updated kobj_read_file() so it no longer
unconditionally passes RLIM64_INFINITY.  The vn_rdwr() function
needs to be updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #513

9 years agokobj_read_file: Return -1 on vn_rdwr() error
Richard Yao [Tue, 15 Dec 2015 16:48:19 +0000 (11:48 -0500)]
kobj_read_file: Return -1 on vn_rdwr() error

I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.

Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #513

9 years agoCreate spl-kmod-debuginfo rpm with redhat spec file
Olaf Faaland [Tue, 19 Jan 2016 18:15:24 +0000 (10:15 -0800)]
Create spl-kmod-debuginfo rpm with redhat spec file

Correct the redhat specfile so that working debuginfo rpms are created
for the kernel modules.  The generic specfile already does the right
thing.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#4224

9 years agoUse tsd to store tq for taskq_member
Chunwei Chen [Wed, 2 Dec 2015 22:52:46 +0000 (14:52 -0800)]
Use tsd to store tq for taskq_member

To prevent taskq_member from holding tq_lock and doing a linear search, which
causes contention, we store the taskq pointer to which the thread belongs in
tsd. This way taskq_member does not need to touch tq_lock, and tsd has a
per-slot spinlock, so contention should be reduced greatly.
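
The lookup-without-a-shared-lock idea can be sketched in user space with C11
thread-local storage. This swaps in _Thread_local for the SPL tsd facility
purely for illustration, and all sketch_* names are hypothetical.

```c
typedef struct sketch_taskq {
	const char *name;
} sketch_taskq_t;

/* Each thread remembers the taskq it serves; NULL for other threads. */
static _Thread_local sketch_taskq_t *sketch_current_tq;

/* Called once by a worker thread when it starts serving tq. */
static void
sketch_taskq_thread_enter(sketch_taskq_t *tq)
{
	sketch_current_tq = tq;
}

/*
 * Membership test for the calling thread: one pointer comparison, with
 * no taskq-wide lock and no walk over the taskq's thread list.
 */
static int
sketch_taskq_member(const sketch_taskq_t *tq)
{
	return (sketch_current_tq == tq);
}
```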

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #500
Closes #504
Closes #505

9 years agoLinux 4.5 compat: pfn_t typedef
Brian Behlendorf [Tue, 19 Jan 2016 16:59:47 +0000 (08:59 -0800)]
Linux 4.5 compat: pfn_t typedef

The pfn_t typedef was inherited from Illumos but never directly
used by any SPL consumers.  This didn't cause any issues until
the Linux 4.5 kernel introduced a typedef of the same name.
See torvalds/linux/commit/34c0fd54.  This patch removes the
unused Illumos version to prevent a conflict.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #524

9 years agoTurn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark
Chunwei Chen [Mon, 18 Jan 2016 22:41:45 +0000 (14:41 -0800)]
Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark

In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
because earlier kernels that provide it do not turn off __GFP_FS. However, on
newer kernels we would prefer PF_MEMALLOC_NOIO because it also covers
allocations inside the kernel which we cannot otherwise control. So in this
patch, we turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.
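
A mark/unmark pair of this kind usually saves the previous flag state in a
cookie so that nested sections restore correctly. The user-space sketch
below illustrates that pattern only: the flag values, the global standing in
for the task's flags word, and the sketch_* names are all assumptions, not
the SPL implementation.

```c
/* Hypothetical flag bits; the real PF_* values live in the kernel. */
#define	SKETCH_PF_FSTRANS	0x1u
#define	SKETCH_PF_MEMALLOC_NOIO	0x2u
#define	SKETCH_PF_BOTH	(SKETCH_PF_FSTRANS | SKETCH_PF_MEMALLOC_NOIO)

typedef unsigned int sketch_cookie_t;

/* Stand-in for the flags word of the current task. */
static unsigned int sketch_task_flags;

/* Turn both flags on and return which of them were already set. */
static sketch_cookie_t
sketch_fstrans_mark(void)
{
	sketch_cookie_t saved = sketch_task_flags & SKETCH_PF_BOTH;

	sketch_task_flags |= SKETCH_PF_BOTH;
	return (saved);
}

/* Restore exactly the state that was saved by the matching mark. */
static void
sketch_fstrans_unmark(sketch_cookie_t saved)
{
	sketch_task_flags &= ~SKETCH_PF_BOTH;
	sketch_task_flags |= saved;
}
```

Because the cookie records the prior state, an inner unmark leaves the flags
set for an enclosing marked section.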

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #523

9 years agoDon't hold mutex until release cv in cv_wait
Chunwei Chen [Thu, 7 Jan 2016 03:05:24 +0000 (19:05 -0800)]
Don't hold mutex until release cv in cv_wait

If a thread is holding the mutex when doing cv_destroy, it might end up
waiting for a thread in cv_wait. The waiter would wake up trying to acquire
the same mutex and cause a deadlock.

We solve this by moving the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This creates a race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. As a result, the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads have exited before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106

9 years agoAdd spl_kmem_cache_kmem_threads man page entry
Brian Behlendorf [Mon, 14 Dec 2015 17:59:28 +0000 (09:59 -0800)]
Add spl_kmem_cache_kmem_threads man page entry

The spl_kmem_cache_kmem_threads module option was accidentally
omitted from the documentation.  Add it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #512

9 years ago_ILP32 is always defined on SPARC
Alex McWhirter [Fri, 8 Jan 2016 08:55:57 +0000 (03:55 -0500)]
_ILP32 is always defined on SPARC

Signed-off-by: Alex McWhirter <alexmcwhirter@triadic.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520

9 years agoFix do_div() types in condvar:timeout
Brian Behlendorf [Tue, 22 Dec 2015 17:26:10 +0000 (09:26 -0800)]
Fix do_div() types in condvar:timeout

The do_div() macro expects unsigned types and this is detected in the
powerpc implementation of do_div().
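
To make the type expectation concrete, here is a user-space stand-in with
the same contract as the kernel macro: an unsigned 64-bit dividend that is
replaced by the quotient, an unsigned 32-bit divisor, and the remainder as
the result. It is a function taking a pointer rather than a macro, and the
name sketch_do_div is hypothetical.

```c
#include <stdint.h>

/*
 * Divide *n by base in place and return the remainder. Both operands are
 * unsigned, mirroring the types the kernel's do_div() expects.
 */
static uint32_t
sketch_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return (rem);
}
```

Passing a signed variable would need an explicit conversion at the call
site, which makes the sign assumption visible instead of silent.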

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516

9 years agoUse spl_fstrans_mark instead of memalloc_noio_save
Chunwei Chen [Fri, 18 Dec 2015 02:31:58 +0000 (18:31 -0800)]
Use spl_fstrans_mark instead of memalloc_noio_save

For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #515
Issue zfsonlinux/zfs#4111

9 years agoProvide kstat for taskqs
Tim Chase [Mon, 19 Oct 2015 12:47:52 +0000 (07:47 -0500)]
Provide kstat for taskqs

This patch provides 2 new kstats to display task queues:

  /proc/spl/taskqs-all - Display all task queues
  /proc/spl/taskqs - Display only "active" task queues

A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.

If the task queue has running threads, displays each thread function's
address (symbolically, if possible) and its argument.

If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and argument.

If the task queue has any waiters, displays each waiting task's pid.

Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491

9 years agoSkip GPL-only symbols test when cross-compiling
Kamil Domanski [Sat, 12 Dec 2015 12:35:49 +0000 (13:35 +0100)]
Skip GPL-only symbols test when cross-compiling

Signed-off-by: Kamil Domański <kamil@domanski.co>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#507
Closes zfsonlinux/zfs#4075

9 years agoRevert "Skip GPL-only symbols test when cross-compiling"
Brian Behlendorf [Sat, 12 Dec 2015 00:57:05 +0000 (16:57 -0800)]
Revert "Skip GPL-only symbols test when cross-compiling"

This reverts commit 61bbbd9a775a5517af513e5014edbdd73a32f7e4 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507

9 years agoFix cstyle issues in spl-taskq.c and taskq.h
Brian Behlendorf [Sat, 12 Dec 2015 00:15:50 +0000 (16:15 -0800)]
Fix cstyle issues in spl-taskq.c and taskq.h

This patch only addresses the issues identified by the style checker.
It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
9 years agoDon't use tq->tq_lock_flags
Chunwei Chen [Thu, 3 Dec 2015 23:06:03 +0000 (15:06 -0800)]
Don't use tq->tq_lock_flags

The flags argument of spin_lock_irqsave is modified outside of the spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for it. This
patch removes it and uses a local variable for the flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #506

9 years agoSubclass tq_lock to eliminate a lockdep warning
Olaf Faaland [Tue, 13 Oct 2015 23:56:51 +0000 (16:56 -0700)]
Subclass tq_lock to eliminate a lockdep warning

When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking.  This is
a false positive.

One such call chain is as follows, when a taskq needs more threads:
taskq_dispatch->taskq_thread_spawn->taskq_dispatch

The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads.  The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock.  Without subclassing, lockdep believes these
could potentially be the same lock and complains.  A similar case occurs
when taskq_dispatch() then calls task_alloc().

This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:

subclass              taskq
TQ_LOCK_DYNAMIC       dynamic_taskq
TQ_LOCK_GENERAL       any other

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480

9 years agoFix lockdep warning in spl_inode_{lock,unlock}
Olaf Faaland [Wed, 14 Oct 2015 06:08:44 +0000 (23:08 -0700)]
Fix lockdep warning in spl_inode_{lock,unlock}

spl_inode_{lock,unlock} are triggering possible recursive locking
warnings from lockdep.  The warning is a false positive.

The lock is used to protect a parent directory during delete/add
operations, used in zfs when writing/removing the cache file.  The inode
lock is taken on both the parent inode and the file inode.

VFS provides an enum to subclass the lock.  This patch changes the
spin_lock call to the _nested version and uses the provided enum.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480

9 years agoAdd new lock types MUTEX_NOLOCKDEP, and RW_NOLOCKDEP
Olaf Faaland [Tue, 6 Oct 2015 21:01:46 +0000 (14:01 -0700)]
Add new lock types MUTEX_NOLOCKDEP, and RW_NOLOCKDEP

When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.

When lockdep detects these conditions, it disables further lock analysis
for all locks.  This causes /proc/lock_stats not to reflect full
information about lock contention, even in locks without dependency
issues.

This commit creates a new type of mutex, MUTEX_NOLOCKDEP.  This mutex
type causes subsequent attempts to take or release those locks to be
wrapped in lockdep_off() and lockdep_on().

This commit also creates an RW_NOLOCKDEP type analogous to
MUTEX_NOLOCKDEP.

MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to
that repo, for userspace builds.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480

9 years agoSkip GPL-only symbols test when cross-compiling
Kamil Domański [Thu, 10 Dec 2015 10:14:08 +0000 (11:14 +0100)]
Skip GPL-only symbols test when cross-compiling

This test depends on being able to execute the resulting binary
which will be impossible when cross-compiling.  Instead make a
worst case assumption which allows the build to continue as
recommended by the autoconf manual.

https://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/Runtime.html

Signed-off-by: Kamil Domański <kamil@domanski.co>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Closes zfsonlinux/spl#507
Closes zfsonlinux/zfs#4075

9 years agoFix build issue on some configured kernels
zgock [Thu, 10 Dec 2015 10:20:33 +0000 (19:20 +0900)]
Fix build issue on some configured kernels

The SPL fails to build with some "Configured" kernels (e.g. the openSUSE
Xen kernel).  This change should produce the same binaries with C compiler
optimization.

Signed-off-by: zgock <zgock@nuc.base.zgock-lab.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #510

9 years agoEither _ILP32 or _LP64 must be defined
Brian Behlendorf [Wed, 9 Dec 2015 22:46:59 +0000 (14:46 -0800)]
Either _ILP32 or _LP64 must be defined

For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 nor _LP64 would be defined.  Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case that neither is defined.
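
The "define one, then fail loudly" pattern can be sketched as follows. This
is not the isa_defs.h logic: the real header keys off architecture macros,
whereas this illustration assumes ULONG_MAX is a good enough proxy for the
data model, and sketch_long_bits is a hypothetical helper.

```c
#include <limits.h>

/* If the compiler predefined neither macro, derive one from long's width. */
#if !defined(_ILP32) && !defined(_LP64)
#if ULONG_MAX == 0xffffffffUL
#define	_ILP32
#else
#define	_LP64
#endif
#endif

/* Refuse to build with an ambiguous or missing data model. */
#if defined(_ILP32) && defined(_LP64)
#error "Both _ILP32 and _LP64 are defined"
#endif
#if !defined(_ILP32) && !defined(_LP64)
#error "Neither _ILP32 nor _LP64 is defined"
#endif

/* Report the width of long implied by whichever macro is in effect. */
static int
sketch_long_bits(void)
{
#if defined(_LP64)
	return (64);
#else
	return (32);
#endif
}
```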

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Issue zfsonlinux/zfs#4048

9 years agoRevert "Make taskq_member() use ->journal_info"
Brian Behlendorf [Wed, 9 Dec 2015 01:04:31 +0000 (17:04 -0800)]
Revert "Make taskq_member() use ->journal_info"

This reverts commit a430c11f0b1ef16ca5edf3059e4082709277376c.  Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500