granicus.if.org Git - zfs/log
9 years ago Fix removal of SA in sa_modify_attrs()
Tim Chase [Sun, 19 Oct 2014 03:50:01 +0000 (22:50 -0500)]
Fix removal of SA in sa_modify_attrs()

The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR().  The write index, "j", is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch removes the increment for SA_REMOVE operations.
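
The two-index copy described above can be sketched in plain C. This is
not the sa_modify_attrs() code itself; copy_without() and its integer
"attributes" are hypothetical stand-ins used only to show why the write
index must stay put when an entry is removed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src[0..n) into dst, dropping every element equal to `remove`.
 * `i` is the read index and `j` is the write index.  `j` advances only
 * when an element is actually written, so dst never contains a gap.
 * Returns the number of elements written.
 */
size_t
copy_without(const int *src, size_t n, int remove, int *dst)
{
    size_t i, j = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        if (src[i] == remove)
            continue;       /* removal: do NOT advance j */
        dst[j++] = src[i];
    }
    return (j);
}
```

Incrementing j in the removal branch would leave an uninitialized slot
in dst, which is the kind of corruption the commit message describes.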

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3028

9 years ago Use kmem_vasprintf() in log_internal()
Richard Yao [Sat, 11 Oct 2014 15:01:37 +0000 (11:01 -0400)]
Use kmem_vasprintf() in log_internal()

An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_vasprintf(). It is not clear that switching to
kmem_vasprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_vasprintf() is a cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.
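
For readers unfamiliar with this family of helpers, here is a userspace
sketch of what a "format into a freshly allocated string" function does.
It is not the SPL implementation of the kmem_vasprintf() named in the
commit title; sketch_vasprintf() and sketch_asprintf() are hypothetical
names, and malloc() stands in for a sleeping kernel allocation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a newly allocated buffer; the caller frees the result. */
char *
sketch_vasprintf(const char *fmt, va_list ap)
{
    va_list copy;
    char *buf;
    int len;

    assert(fmt != NULL);

    /* First pass: measure the output without writing anything. */
    va_copy(copy, ap);
    len = vsnprintf(NULL, 0, fmt, copy);
    va_end(copy);
    if (len < 0)
        return (NULL);

    buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return (NULL);

    /* Second pass: the buffer is exactly large enough. */
    (void) vsnprintf(buf, (size_t)len + 1, fmt, ap);
    return (buf);
}

/* Variadic convenience wrapper around sketch_vasprintf(). */
char *
sketch_asprintf(const char *fmt, ...)
{
    va_list ap;
    char *buf;

    va_start(ap, fmt);
    buf = sketch_vasprintf(fmt, ap);
    va_end(ap);
    return (buf);
}
```

A caller that already holds a va_list, as a logging function typically
does, would use the va_list variant directly instead of sizing and
filling a buffer by hand.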

It also brings this function almost back in sync with Illumos.  This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2791
Issue #2781

9 years ago Linux 3.12 compat: split shrinker has s_shrink
Tim Chase [Thu, 18 Dec 2014 16:08:47 +0000 (10:08 -0600)]
Linux 3.12 compat: split shrinker has s_shrink

The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.

This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2975

9 years ago Merge branch 'kmem-rework'
Brian Behlendorf [Fri, 16 Jan 2015 22:42:46 +0000 (14:42 -0800)]
Merge branch 'kmem-rework'

The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms.  This simplifies the process of porting changes to
Linux from other platforms.  This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.  The following key Linux specific
changes have been reverted.

* KM_PUSHPAGE changed back to KM_SLEEP.  All contexts where
  it is unsafe to perform IO have been marked with PF_FSTRANS.
  This context specific mechanism is now used exclusively
  and the KM_PUSHPAGE mechanism has been retired.

* The KM_NODEBUG flag has been retired.  Allocations larger
  than 32K should use vmem_alloc()/vmem_free().  Depending
  on the size of the allocation either kmalloc() or vmalloc()
  will be used internally, but no warning will be printed.

* Pre-allocated vdev IO buffers and the dedicated SA spill
  block cache have been retired.  It is now safe and reliable
  to allocate buffers of the needed size without fear of
  deadlocking.  This reduces our memory footprint and paves
  the way for larger block sizes.

Depends on zfsonlinux/spl#414.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2918

9 years ago Revert "SA spill block cache"
Brian Behlendorf [Tue, 16 Dec 2014 19:44:24 +0000 (11:44 -0800)]
Revert "SA spill block cache"

The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations.  Instead a small dedicated
cache of preallocated SA buffers was kept.

This solution was viable while the maximum block size was limited
to 128K.  But with the planned increase of the maximum block size
to 16M, callers need to migrate to zio_buf_alloc().  However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.

Alternatively, if the callers know these buffers will never be large
or will be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.

This change has the additional benefit of bringing the code back
in line with the upstream Illumos source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Revert "Pre-allocate vdev I/O buffers"
Brian Behlendorf [Sat, 13 Dec 2014 00:40:21 +0000 (16:40 -0800)]
Revert "Pre-allocate vdev I/O buffers"

Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform them with KM_NOSLEEP
so that they either succeed or fail quickly.  Either case is acceptable
here because we can safely abort the aggregation.
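
The "succeed or fail quickly" idea can be sketched in userspace as
follows. Nothing here is vdev code: sketch_try_aggregate(), the
allocator callback type and sketch_alloc_always_fails() are hypothetical,
and the callback stands in for a non-sleeping allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical non-blocking allocator: returns NULL instead of waiting. */
typedef void *(*sketch_alloc_nosleep_fn)(size_t);

/* Test double that models memory pressure. */
void *
sketch_alloc_always_fails(size_t size)
{
    (void) size;
    return (NULL);
}

/*
 * Try to get one buffer large enough to aggregate `count` I/Os of
 * `each` bytes.  Returns the buffer on success, or NULL to tell the
 * caller to skip aggregation and issue the I/Os individually.
 */
void *
sketch_try_aggregate(size_t count, size_t each, sketch_alloc_nosleep_fn alloc)
{
    assert(alloc != NULL);
    if (count < 2 || each == 0 || count > (size_t)-1 / each)
        return (NULL);      /* nothing to aggregate, or size overflow */
    return (alloc(count * each));
}
```

Because a NULL result only costs the aggregation, the caller never has
to block waiting for memory.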

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Add kmem_cache.h include to default context
Brian Behlendorf [Tue, 9 Dec 2014 00:03:50 +0000 (19:03 -0500)]
Add kmem_cache.h include to default context

As part of the spl kmem/vmem refactoring the kmem_cache_* functions
were split into their own kmem_cache.h header.  This was done in
part so that kmem_* consumers would not be forced to include the
kmem_cache_* functions which mask several Linux SLAB/SLUB functions.

Because of this we now must explicitly include kmem_cache.h in
zfs_context.h.  However, consumers such as Lustre which need access
to the KM_FLAGS but not the kmem_cache_* functions can now safely
just include kmem.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Change KM_PUSHPAGE -> KM_SLEEP
Brian Behlendorf [Fri, 21 Nov 2014 00:09:39 +0000 (19:09 -0500)]
Change KM_PUSHPAGE -> KM_SLEEP

By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip into reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Retire KM_NODEBUG
Brian Behlendorf [Wed, 3 Dec 2014 19:56:32 +0000 (14:56 -0500)]
Retire KM_NODEBUG

Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.
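
As a rough illustration of the rule of thumb above, the choice between
the two interfaces can be expressed as a size check. The 32k figure is
taken from this message; sketch_pick_allocator() and the enum are
hypothetical and do not exist in the SPL.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed threshold, taken from the "32k" figure in the text above. */
#define SKETCH_WARN_THRESHOLD   ((size_t)32 * 1024)

typedef enum {
    SKETCH_USE_KMEM,    /* small enough for a kmem_alloc()-style call */
    SKETCH_USE_VMEM     /* large: prefer a vmem_alloc()-style call */
} sketch_alloc_t;

/* Pick an interface for a request of `size` bytes. */
sketch_alloc_t
sketch_pick_allocator(size_t size)
{
    assert(size > 0);
    if (size > SKETCH_WARN_THRESHOLD)
        return (SKETCH_USE_VMEM);
    return (SKETCH_USE_KMEM);
}
```

Callers whose sizes never exceed the threshold fall on the first
branch, which matches the callers that only had the flag removed.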

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Use is_vmalloc_addr() in vdev_disk.c
Richard Yao [Mon, 3 Nov 2014 14:42:44 +0000 (09:42 -0500)]
Use is_vmalloc_addr() in vdev_disk.c

The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to clean up the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Mark IO pipeline with PF_FSTRANS
Brian Behlendorf [Sun, 13 Jul 2014 18:35:19 +0000 (14:35 -0400)]
Mark IO pipeline with PF_FSTRANS

In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim.  This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction.  For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.
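
The usual way to use a per-task flag like this is a save, set, restore
pattern so that nested sections behave correctly. The sketch below is a
userspace model only: the flag word, the bit value and the sketch_*
function names are all hypothetical and are not the kernel or SPL
interfaces.

```c
#include <assert.h>

#define SKETCH_PF_FSTRANS   0x1u    /* stand-in bit, not the kernel's value */

/* Stand-in for the per-task flags word. */
static unsigned int sketch_task_flags;

/* Set the flag and return its previous state so nesting works. */
unsigned int
sketch_fstrans_mark(void)
{
    unsigned int saved = sketch_task_flags & SKETCH_PF_FSTRANS;

    sketch_task_flags |= SKETCH_PF_FSTRANS;
    return (saved);
}

/* Restore the flag to the state captured by the matching mark call. */
void
sketch_fstrans_unmark(unsigned int saved)
{
    assert((saved & ~SKETCH_PF_FSTRANS) == 0);
    sketch_task_flags = (sketch_task_flags & ~SKETCH_PF_FSTRANS) | saved;
}

/* Nonzero while the current context is marked. */
int
sketch_in_fstrans(void)
{
    return ((sketch_task_flags & SKETCH_PF_FSTRANS) != 0);
}
```

An allocator can then consult the flag and avoid starting pageout while
it is set, rather than relying on every caller to pass a special
allocation flag.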

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

9 years ago Fix zfs_putpage() lock inversion (again)
Brian Behlendorf [Wed, 7 Jan 2015 00:54:57 +0000 (16:54 -0800)]
Fix zfs_putpage() lock inversion (again)

This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range().  Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read().  The page must be unlocked before taking
the range lock.  This patch corrects that issue.

In addition, because the locking rules here are subtle, a block comment
has been added clearly explaining why the ordering here is critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976

9 years ago Document zfs_flags module parameter
Ned Bass [Tue, 23 Dec 2014 00:54:43 +0000 (16:54 -0800)]
Document zfs_flags module parameter

Add a table describing the debugging flags that can be set in the zfs_flags
module parameter.  Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.

zfs_flags (int)
            Set  additional debugging flags. The following flags may be
            bitwise-or'd together.

            +-------------------------------------------------------+
            |Value   Symbolic Name                                  |
            |        Description                                    |
            +-------------------------------------------------------+
            |    1   ZFS_DEBUG_DPRINTF                              |
            |        Enable dprintf entries in the debug log.       |
            +-------------------------------------------------------+
            |    2   ZFS_DEBUG_DBUF_VERIFY *                        |
            |        Enable extra dbuf verifications.               |
            +-------------------------------------------------------+
            |    4   ZFS_DEBUG_DNODE_VERIFY *                       |
            |        Enable extra dnode verifications.              |
            +-------------------------------------------------------+
            |    8   ZFS_DEBUG_SNAPNAMES                            |
            |        Enable snapshot name verification.             |
            +-------------------------------------------------------+
            |   16   ZFS_DEBUG_MODIFY                               |
            |        Check for illegally modified ARC buffers.      |
            +-------------------------------------------------------+
            |   32   ZFS_DEBUG_SPA                                  |
            |        Enable spa_dbgmsg entries in the debug log.    |
            +-------------------------------------------------------+
            |   64   ZFS_DEBUG_ZIO_FREE                             |
            |        Enable verification of block frees.            |
            +-------------------------------------------------------+
            |  128   ZFS_DEBUG_HISTOGRAM_VERIFY                     |
            |        Enable extra spacemap histogram verifications. |
            +-------------------------------------------------------+
            * Requires debug build.

            Default value: 0.
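
A short C sketch of how the values in the table combine. The names and
numbers are copied from the table above; sketch_flag_is_set() is a
hypothetical helper added for illustration.

```c
#include <assert.h>

/* Values and names copied from the table above. */
enum {
    ZFS_DEBUG_DPRINTF           = 1,
    ZFS_DEBUG_DBUF_VERIFY       = 2,
    ZFS_DEBUG_DNODE_VERIFY      = 4,
    ZFS_DEBUG_SNAPNAMES         = 8,
    ZFS_DEBUG_MODIFY            = 16,
    ZFS_DEBUG_SPA               = 32,
    ZFS_DEBUG_ZIO_FREE          = 64,
    ZFS_DEBUG_HISTOGRAM_VERIFY  = 128
};

/* Return nonzero if `flag` (a single bit) is set in `zfs_flags`. */
int
sketch_flag_is_set(unsigned int zfs_flags, unsigned int flag)
{
    assert(flag != 0 && (flag & (flag - 1)) == 0);  /* exactly one bit */
    return ((zfs_flags & flag) != 0);
}
```

For example, enabling both dprintf and spa_dbgmsg entries corresponds
to ZFS_DEBUG_DPRINTF | ZFS_DEBUG_SPA, which is the value 33.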

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2988

9 years ago Don't use AC_LANG_SOURCE for conftest.h source
Ned Bass [Mon, 15 Dec 2014 21:53:00 +0000 (13:53 -0800)]
Don't use AC_LANG_SOURCE for conftest.h source

Using AC_LANG_SOURCE with some versions of autoconf is problematic if
the given source is to be written to a header file. Such versions assume
the contents are to be written to conftest.c and generate shell code to
that effect. The contents of the test program to detect support for
Linux tracepoints were consequently malformed (containing the source for
conftest.h) so the build system incorrectly disabled tracepoints
support. Fix this in ZFS_LINUX_TRY_COMPILE_HEADER by passing the header
source directly to ZFS_LINUX_COMPILE_IFELSE.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953

9 years ago Remove duplicate typedefs from trace.h
Ned Bass [Sat, 13 Dec 2014 02:07:39 +0000 (18:07 -0800)]
Remove duplicate typedefs from trace.h

Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.

Removal of the typedefs means we can't build tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included from zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.

These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953

10 years ago Fix zfs_putpage() lock inversion
Brian Behlendorf [Fri, 19 Dec 2014 20:57:54 +0000 (12:57 -0800)]
Fix zfs_putpage() lock inversion

There exists a lock inversion involving the zfs range lock and the
individual page writeback bits which can result in a deadlock.  To
prevent this we must always manipulate the writeback bit while
holding the range lock.  The exact deadlock is as follows:

------ Process A ------        ------ Process B ------
zpl_writepages                 zpl_fallocate
write_cache_pages              zpl_fallocate_common
zpl_putpage                    zfs_space
zfs_putpage (set bit)          zfs_freesp
zfs_range_lock (wait on lock)  zfs_free_range (take lock)
[has not yet initiated I/O,    truncate_inode_pages_range
the bit will not be cleared]   wait_on_page_writeback (wait on bit)
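
The rule stated above (only touch the writeback bit while holding the
range lock) can be modelled in a few lines of single-threaded C. This is
a toy model of the ordering rule, not the zfs_putpage() code; the struct
and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_page_state {
    int range_locked;   /* stand-in for holding the zfs range lock */
    int writeback;      /* stand-in for the page writeback bit */
};

/* Set the writeback bit; refuse (-1) unless the range lock is held. */
int
sketch_set_writeback(struct sketch_page_state *ps)
{
    assert(ps != NULL);
    if (!ps->range_locked)
        return (-1);    /* would recreate the Process A ordering */
    ps->writeback = 1;
    return (0);
}
```

In the diagram, Process A sets the bit first and then waits for the
lock, which is exactly the sequence this check rejects.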

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976

10 years ago vdev_id: use mawk-compatible regular expression
Ned Bass [Wed, 17 Dec 2014 19:01:42 +0000 (11:01 -0800)]
vdev_id: use mawk-compatible regular expression

Slot mapping in vdev_id doesn't work on systems using mawk as the 'awk'
alternative. A regular expression in map_slot() contains an unquoted
empty string following the alternation (|) operator, which results in a
"missing operand" error with mawk. The solution is to rearrange the
expression so the alternation has two operands.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/pkg-zfs#136
Closes zfsonlinux/zfs#2965

10 years ago Fix cstyle issue from c66989b
Brian Behlendorf [Fri, 19 Dec 2014 19:57:52 +0000 (11:57 -0800)]
Fix cstyle issue from c66989b

Commit c66989b accidentally introduced a cstyle issue which went
unnoticed.  This tiny patch corrects that oversight.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

10 years ago Correct error returns to unify cross-pool operation error handling
Boris Protopopov [Wed, 19 Nov 2014 17:08:08 +0000 (12:08 -0500)]
Correct error returns to unify cross-pool operation error handling

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2911

10 years ago Fix typo in %post scriptlet lines
Andy Bakun [Mon, 15 Dec 2014 04:23:25 +0000 (20:23 -0800)]
Fix typo in %post scriptlet lines

Missing space made the %post directive be part of the package
%description and not have a %post scriptlet defined.

Signed-off-by: Andy Bakun <github@thwartedefforts.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2961

10 years ago zpool upgrade return errors to stderr instead of stdout
Jacek Fefliński [Wed, 10 Dec 2014 12:24:14 +0000 (13:24 +0100)]
zpool upgrade return errors to stderr instead of stdout

Signed-off-by: Jacek Feflinski <feflik@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2955

10 years ago Improve systemd script to not leave stale sharetab
Dan Swartzendruber [Sun, 7 Dec 2014 17:23:00 +0000 (12:23 -0500)]
Improve systemd script to not leave stale sharetab

The systemd script zfs-share.service does 'zfs share -a' to share
any required datasets.  Unfortunately, /etc/dfs/sharetab is stale
from the previous boot.  Delete it before we share.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2883

10 years ago Fix snapshots with dirty inodes
Brian Behlendorf [Mon, 20 Oct 2014 21:37:47 +0000 (14:37 -0700)]
Fix snapshots with dirty inodes

Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty an inode.  Doing so
will result in a write which will correctly cause a kernel panic
because these filesystems are (and must be) immutable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2812

10 years ago Fix systemd config for zfs-share.service
Dan Swartzendruber [Thu, 13 Nov 2014 19:49:51 +0000 (14:49 -0500)]
Fix systemd config for zfs-share.service

The zfs-share.service rule needs to be modified to ensure that it
does not execute before zfs-mount.service.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ralf Ertzinger <ralf@skytale.net>
Closes #2893

10 years ago bio_alloc() with __GFP_WAIT never returns NULL
Isaac Huang [Tue, 21 Oct 2014 18:20:10 +0000 (12:20 -0600)]
bio_alloc() with __GFP_WAIT never returns NULL

Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL.  However, we want to keep
the error handling in case this behavior changes in the future.
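
A userspace sketch of the pattern: keep the error branch but tell the
compiler it is not expected to run. The sketch_unlikely() macro is a
local stand-in built on the GCC/Clang __builtin_expect() builtin, and
sketch_alloc_checked() with malloc() is a hypothetical substitute for
the bio allocation.

```c
#include <assert.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define sketch_unlikely(x)  __builtin_expect(!!(x), 0)
#else
#define sketch_unlikely(x)  (x)     /* no hint available: plain test */
#endif

/* Allocate `size` bytes; returns 0 on success, -1 if allocation failed. */
int
sketch_alloc_checked(size_t size, void **out)
{
    assert(out != NULL);
    *out = malloc(size);
    if (sketch_unlikely(*out == NULL))
        return (-1);    /* kept in case the allocator can ever fail */
    return (0);
}
```

The hint only affects code layout and branch prediction; the result of
the condition, and therefore the program's behavior, is unchanged.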

Plus fix a small style issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2703

10 years ago Explicitly include SPL compat headers
Ned Bass [Fri, 14 Nov 2014 18:21:53 +0000 (10:21 -0800)]
Explicitly include SPL compat headers

Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages.  Include a
few compatibility headers explicitly to cope with that change.  Also,
sort some linux-specific inclusions alphabetically.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2898

10 years ago Fix improper null-byte termination handling
Ned Bass [Fri, 7 Nov 2014 02:18:32 +0000 (18:18 -0800)]
Fix improper null-byte termination handling

Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.

- The snprintf() function always produces a null-byte terminated string
  for non-negative return values, so it is not necessary to write out a
  null-byte as a separate step.

- Also, it is unsafe to use the return value of snprintf() as an offset
  for placing a null-byte, because if the output was truncated the return
  value is the number of bytes that _would_ have been written had enough
  space been available. Therefore the return value may index beyond the
  array boundaries.

- Finally, snprintf() accounts for the null-byte when limiting its output
  size, so there is no need to pass it a size parameter that is one less
  than the buffer size.
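
The three points above can be combined into one small sketch of safe
snprintf() use. sketch_format_name() is a hypothetical helper, not a
function from the ZFS sources.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "<pool>/<dataset>" into buf.  Returns 0 on success and -1 if
 * snprintf() failed or the output was truncated.  In the truncated
 * case buf still holds a null-terminated prefix of the full string.
 */
int
sketch_format_name(char *buf, size_t len, const char *pool, const char *ds)
{
    int n;

    assert(buf != NULL && len > 0);

    /* Pass the full buffer size; snprintf() reserves the last byte. */
    n = snprintf(buf, len, "%s/%s", pool, ds);

    /* n is the length that WOULD have been written: compare, never index. */
    if (n < 0 || (size_t)n >= len)
        return (-1);
    return (0);
}
```

With an 8-byte buffer and the inputs "tank" and "home", the call
reports truncation and leaves "tank/ho" in the buffer, still
terminated, without any manual placement of a null byte.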

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2875

10 years ago Prevent ZFS leaking pool free space
smh [Thu, 16 Oct 2014 02:23:27 +0000 (02:23 +0000)]
Prevent ZFS leaking pool free space

When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition, if the machine was rebooted with the pool in this state, the
pool would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.

References:
  https://github.com/freebsd/freebsd/commit/48431b7
  http://svnweb.freebsd.org/base?view=revision&revision=273158
  https://reviews.csiden.org/r/132/

Porting notes:

This issue was filed as illumos 5347 and a more comprehensive fix is
under review.  Once that change is finalized it will be integrated; in
the meantime the FreeBSD fix has been merged to prevent the issue.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2896

10 years ago Undirty freed spill blocks.
Tim Chase [Tue, 11 Nov 2014 05:26:33 +0000 (23:26 -0600)]
Undirty freed spill blocks.

If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

setfattr -n user.test -v very_very_very..._long_value  <file>
setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884

10 years ago Merge branch 'b_tracepoints'
Brian Behlendorf [Mon, 17 Nov 2014 19:14:24 +0000 (11:14 -0800)]
Merge branch 'b_tracepoints'

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2874

10 years ago Swap DTRACE_PROBE* with Linux tracepoints
Prakash Surya [Fri, 13 Jun 2014 17:54:48 +0000 (10:54 -0700)]
Swap DTRACE_PROBE* with Linux tracepoints

This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.

The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:

    * A number of external tools can make use of our tracepoints
      "automatically" (e.g. perf, systemtap)
    * Tracepoints are designed to be extremely cheap when disabled
    * It's one of the "accepted" ways to export this kind of
      information; many other kernel subsystems use tracepoints too.

Unfortunately, though, there are a few caveats as well:

    * Linux tracepoints appear to only be available to GPL licensed
      modules due to the way certain kernel functions are exported.
      Thus, to actually make use of the tracepoints introduced by this
      patch, one might have to patch and re-compile the kernel;
      exporting the necessary functions to non-GPL modules.

    * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
      tracepoints are not available for unsigned kernel modules
      (tracepoints will get disabled due to the module's 'F' taint).
      Thus, one either has to sign the zfs kernel module prior to
      loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
      newer.

Assuming the above two requirements are satisfied, let's look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):

    # list all zfs tracepoints available

    $ ls /sys/kernel/debug/tracing/events/zfs
    enable              filter              zfs_arc__delete
    zfs_arc__evict      zfs_arc__hit        zfs_arc__miss
    zfs_l2arc__evict    zfs_l2arc__hit      zfs_l2arc__iodone
    zfs_l2arc__miss     zfs_l2arc__read     zfs_l2arc__write
    zfs_new_state__mfu  zfs_new_state__mru

    # enable all zfs tracepoints, clear the tracepoint ring buffer

    $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
    $ echo 0 > /sys/kernel/debug/tracing/trace

    # import zpool called 'tank', inspect tracepoint data (each line was
    # truncated, they're too long for a commit message otherwise)

    $ zpool import tank
    $ cat /sys/kernel/debug/tracing/trace | head -n35
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 1219/1219   #P:8
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
            lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
          z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
          z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
          z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
          z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
          z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
          z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
          z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...

To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:

    lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
        hdr {
            dva 0x1:0x40082
            birth 15491
            cksum0 0x163edbff3a
            flags 0x640
            datacnt 1
            type 1
            size 2048
            spa 3133524293419867460
            state_type 0
            access 0
            mru_hits 0
            mru_ghost_hits 0
            mfu_hits 0
            mfu_ghost_hits 0
            l2_hits 0
            refcount 1
        } bp {
            dva0 0x1:0x40082
            dva1 0x1:0x3000e5
            dva2 0x1:0x5a006e
            cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
            lsize 2048
        } zb {
            objset 0
            object 0
            level -1
            blkid 0
        }

For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.

For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:

    * http://lwn.net/Articles/379903/
    * http://lwn.net/Articles/381064/
    * http://lwn.net/Articles/383362/

I should also note the more "boring" aspects of this patch:

    * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
      support a sixth parameter. This parameter is used to populate the
      contents of the new conftest.h file. If no sixth parameter is
      provided, conftest.h will be empty.

    * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
      This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
      except it has support for a fifth option that is then passed as
      the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.

These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).

    * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
      is to determine if we can safely enable the Linux tracepoint
      functionality. We need to selectively disable the tracepoint code
      due to the kernel exporting certain functions as GPL only. Without
      this check, the build process will fail at link time.

In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoints as
well.
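
To illustrate how an error macro can double as a trace hook, here is a
userspace sketch. It does not reproduce the macro in 'sdt.h';
SKETCH_SET_ERROR(), sketch_trace_error() and the event counter are
hypothetical, with the counter standing in for a tracepoint firing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static int sketch_error_events;     /* counts how often the hook fired */

/* Hypothetical hook standing in for a tracepoint; passes err through. */
static int
sketch_trace_error(const char *file, int line, int err)
{
    assert(file != NULL && line > 0);
    sketch_error_events++;
    return (err);
}

/* Records where the error was raised, then evaluates to the error. */
#define SKETCH_SET_ERROR(err) \
    sketch_trace_error(__FILE__, __LINE__, (err))

/* Example caller: rejects negative keys. */
int
sketch_lookup(int key)
{
    if (key < 0)
        return (SKETCH_SET_ERROR(EINVAL));
    return (0);
}
```

Because the macro evaluates to the error code, existing call sites of
the form "return (SET_ERROR(...))" keep working while every error
origin becomes observable.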

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

10 years ago cstyle: allow right paren on its own line
Ned Bass [Thu, 6 Nov 2014 21:34:17 +0000 (13:34 -0800)]
cstyle: allow right paren on its own line

Make the style checker script accept right parentheses on their own
lines. This is motivated by the Linux tracepoints macro
DECLARE_EVENT_CLASS.

The code within TP_fast_assign() (a parameter of DECLARE_EVENT_CLASS)
is normal C assignments terminated by semicolons.  But the style
checker forbids us from following a semicolon with a non-blank and
from preceding a right parenthesis with white space.  Therefore the
closing parenthesis must go on the next line, yet the style checker
forbids us from indenting it for readability.  Relaxing the
no-non-blank-after-semicolon rule would open the door to too many bad
style practices. So instead we relax the
no-white-space-before-right-paren rule if the parenthesis is on its
own line.  The relaxation is overridden with the -p option so we still
have a way to catch misuse of this style.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 years agoFix dprintf format specifiers
Ned Bass [Thu, 23 Oct 2014 23:59:27 +0000 (16:59 -0700)]
Fix dprintf format specifiers

Fix a few dprintf format specifiers that disagreed with their argument
types.  These came to light as compiler errors when converting dprintf
to use the Linux trace buffer.  Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.
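
As an illustration of the class of mismatch fixed here, the following
is a minimal user-space sketch, not the actual ZFS code.  The helper
name format_txg() is hypothetical; it passes a 64-bit value to a
printf-style function with a matching specifier and an explicit cast:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit value such as a txg number.
 * "%llu" expects an unsigned long long, so the uint64_t argument is
 * cast explicitly.  Passing it to "%d" or "%lu" would disagree with
 * the argument type on some platforms.
 */
static int
format_txg(char *buf, size_t len, uint64_t txg)
{
	return (snprintf(buf, len, "txg=%llu", (unsigned long long)txg));
}
```

Compilers that check format strings (for example gcc with -Wformat)
can flag such mismatches at build time.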

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 years agoMove a few internal ARC structures to arc_impl.h
Ned Bass [Wed, 22 Oct 2014 00:59:33 +0000 (17:59 -0700)]
Move a few internal ARC structures to arc_impl.h

Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 years agoFix small spelling mistake
Randall Mason [Fri, 7 Nov 2014 10:34:11 +0000 (12:34 +0200)]
Fix small spelling mistake

recieve becomes receive

Signed-off-by: Randall Mason <ClashTheBunny@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2877

10 years agoIllumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO
Prakash Surya [Mon, 6 Oct 2014 14:32:36 +0000 (16:32 +0200)]
Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO

5213 panic in metaslab_init due to space_map_open returning ENXIO
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com

References:
  https://www.illumos.org/issues/5213
  https://reviews.csiden.org/r/110

Porting notes:

For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2745

10 years agoPrint header properly when terminal resizes
Isaac Huang [Wed, 29 Oct 2014 03:35:10 +0000 (21:35 -0600)]
Print header properly when terminal resizes

Added a handler for SIGWINCH, so that one header
is printed per screen even when the terminal resizes.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2847

10 years agoFix inaccurate field descriptions
Isaac Huang [Thu, 30 Oct 2014 19:29:58 +0000 (13:29 -0600)]
Fix inaccurate field descriptions

The field descriptions from arcstat.py -v for the demand accesses
are inaccurate. They all begin with "Demand Data" yet the fields
actually covered both demand data and demand meta-data accesses.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2842

10 years agoReduce buf/dbuf mutex contention
Chris Wedgwood [Thu, 23 Oct 2014 23:00:41 +0000 (16:00 -0700)]
Reduce buf/dbuf mutex contention

Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.

This increase in hash table size adds approximately 0.5M to
our fixed memory footprint.  This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized.  In the meantime, this small
change significantly reduces contention for certain workloads.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #1291

10 years agoExport symbols for ZIL interface
Alex Zhuravlev [Thu, 13 Nov 2014 18:09:05 +0000 (10:09 -0800)]
Export symbols for ZIL interface

These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL.  In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.

Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2892

10 years agoImprove zvol symlink handling.
Dan Swartzendruber [Wed, 29 Oct 2014 01:29:53 +0000 (21:29 -0400)]
Improve zvol symlink handling.

Change the zvol helper program to replace any embedded spaces
in the pool or dataset names with '+' to ensure we have valid
symlinks.

The '+' character was chosen because it is not a valid character
for a dataset name but it is allowed by udev.  This ensures that
all dataset names with an embedded space will be translated to
a unique /dev/zvol/ symlink.
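
The substitution described above can be sketched as follows.  This is
an illustration only; the function name zvol_name_encode() is
hypothetical and is not taken from the helper program:

```c
#include <string.h>

/*
 * Hypothetical helper: rewrite a pool/dataset name in place so that
 * it can be used as a /dev/zvol/ symlink name.  Every embedded space
 * is replaced with '+'; all other characters are left untouched.
 */
static void
zvol_name_encode(char *name)
{
	char *p = name;

	while ((p = strchr(p, ' ')) != NULL)
		*p++ = '+';
}
```

For example, "tank/my vol 1" becomes "tank/my+vol+1".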

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Darik Horn <dajhorn@vanadac.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2834

10 years agoAdd config/compile to config/.gitignore
Marcel Wysocki [Wed, 29 Oct 2014 11:11:43 +0000 (12:11 +0100)]
Add config/compile to config/.gitignore

This file may be added by automake and therefore should be added
to config/.gitignore.  For the full list of possible auxiliary
programs see the full automake documentation.

http://www.gnu.org/software/automake/manual/automake.html#Auxiliary-Programs

Signed-off-by: Marcel Wysocki <maci.stgn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2848

10 years agoFix modules installation directory
Alexander Pyhalov [Wed, 22 Oct 2014 18:02:45 +0000 (22:02 +0400)]
Fix modules installation directory

When building the zfs modules against a kernel compiled from deb.src,
the packaging process ends up installing the modules in the wrong place.

Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2822

10 years agoMake systemd-modules-load.service file directory configurable
Richard Yao [Thu, 18 Sep 2014 13:47:16 +0000 (09:47 -0400)]
Make systemd-modules-load.service file directory configurable

Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. We could handle this by disabling systemd support on
prefix because systemd does not check these paths, but the Gentoo
Council decided that small files such as these should be installed.
That means disabling systemd support on prefix is not an acceptable
workaround. As a consequence, we need some way to control the directory
into which these files are installed.

Making this configurable increases our compliance with the
freedesktop.org specification, which allows these files to be installed
into /etc/modules-load.d:

http://www.freedesktop.org/software/systemd/man/modules-load.d.html

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641

10 years agoMake directory into which mount.zfs is installed configurable
Richard Yao [Fri, 29 Aug 2014 18:16:41 +0000 (14:16 -0400)]
Make directory into which mount.zfs is installed configurable

Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. I could script a workaround inside the ebuild, but it
seemed to make more sense to make this more configurable.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641

10 years agoSearch /usr/local/src for SPL Object Directory
Richard Yao [Fri, 29 Aug 2014 17:09:52 +0000 (13:09 -0400)]
Search /usr/local/src for SPL Object Directory

Since we changed the default location for the kernel headers to respect
--prefix in the SPL, we must search that location to prevent user builds
from breaking.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641

10 years agoKernel header installation should respect --prefix
Richard Yao [Fri, 29 Aug 2014 15:53:09 +0000 (11:53 -0400)]
Kernel header installation should respect --prefix

This is the upstream component of work that enables preliminary support
for building Gentoo's ZFS packaging on other Linux systems via Gentoo
Prefix.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641

10 years agoLinux 3.12 compat: shrinker semantics
Tim Chase [Thu, 2 Oct 2014 12:21:08 +0000 (07:21 -0500)]
Linux 3.12 compat: shrinker semantics

The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects.  It also requires @scan_objects to return the number
of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics.
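
The difference in return value semantics can be modelled in user space
as follows.  This is a sketch only: the toy_* names are hypothetical,
and the real callbacks take kernel structures rather than a bare
counter.

```c
/* Hypothetical cache: only the number of freeable objects is modelled. */
struct toy_cache {
	unsigned long nr_objects;
};

/* Free up to nr_to_scan objects and report how many were freed. */
static unsigned long
toy_free(struct toy_cache *c, unsigned long nr_to_scan)
{
	unsigned long nr = nr_to_scan;

	if (nr > c->nr_objects)
		nr = c->nr_objects;
	c->nr_objects -= nr;
	return (nr);
}

/* Old style @shrink: returns the number of freeable objects remaining. */
static unsigned long
toy_shrink_old(struct toy_cache *c, unsigned long nr_to_scan)
{
	(void) toy_free(c, nr_to_scan);
	return (c->nr_objects);
}

/* New style @count_objects: returns the number of freeable objects. */
static unsigned long
toy_count_objects(const struct toy_cache *c)
{
	return (c->nr_objects);
}

/* New style @scan_objects: returns the number of objects freed. */
static unsigned long
toy_scan_objects(struct toy_cache *c, unsigned long nr_to_scan)
{
	return (toy_free(c, nr_to_scan));
}
```

A callback written for the old convention and registered unchanged as
@scan_objects would therefore report the wrong quantity.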

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2837

10 years agoIllumos 5164-5165 - space map fixes
Matthew Ahrens [Sat, 13 Sep 2014 13:40:05 +0000 (15:40 +0200)]
Illumos 5164-5165 - space map fixes

5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
     spacemap_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5164
  https://www.illumos.org/issues/5165
  https://github.com/illumos/illumos-gate/commit/b1be289

Porting Notes:

The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.

The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz.  The upstream commit incorrectly
used space_map_blksize.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697

10 years agoIllumos 4958 zdb trips assert on pools with ashift >= 0xe
Alex Reece [Mon, 22 Sep 2014 23:42:03 +0000 (01:42 +0200)]
Illumos 4958 zdb trips assert on pools with ashift >= 0xe

4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4958
  https://github.com/illumos/illumos-gate/commit/2a104a5

Porting notes:

Keep the ZIO_FLAG_FASTWRITE define.  This is for a feature present
in Linux but not yet in *BSD.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697

10 years agoFix zdb segfault
Brian Behlendorf [Thu, 23 Oct 2014 22:26:49 +0000 (15:26 -0700)]
Fix zdb segfault

On 32-bit systems setting 'zfs_arc_max = 256M' in zdb results in the
following segmentation fault.  Rather than reverting 0ec0724, which
introduced this flaw, the code is now only used for 64-bit builds.

Segmentation fault (core dumped)
ztest: '/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 139
child exited with code 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
10 years agoIllumos 5169-5171 - zdb fixes
Matthew Ahrens [Tue, 16 Sep 2014 20:24:48 +0000 (22:24 +0200)]
Illumos 5169-5171 - zdb fixes

5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5169
  https://www.illumos.org/issues/5170
  https://www.illumos.org/issues/5171
  https://github.com/illumos/illumos-gate/commit/06be980

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2707

10 years agoIllumos 5178 - zdb -vvvvv on old-format pool fails in dump_deadlist()
Matthew Ahrens [Wed, 17 Sep 2014 07:14:39 +0000 (09:14 +0200)]
Illumos 5178 - zdb -vvvvv on old-format pool fails in dump_deadlist()

5178 zdb -vvvvv on old-format pool fails in dump_deadlist()
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5178
  https://github.com/illumos/illumos-gate/commit/90c76c6

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2713

10 years agoFix zpool create -t ENOENT bug.
ilovezfs [Sat, 4 Oct 2014 05:20:43 +0000 (22:20 -0700)]
Fix zpool create -t ENOENT bug.

In userland we need to switch over to the temporary name once the
pool has been created, otherwise the root dataset won't mount
and the error "cannot open 'the_real_name': dataset does not exist"
is printed.

Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2760

10 years agoHandle block pointers with a corrupt logical size
Brian Behlendorf [Wed, 10 Sep 2014 18:59:03 +0000 (11:59 -0700)]
Handle block pointers with a corrupt logical size

The general strategy used by ZFS to verify that blocks are valid is
to checksum everything.  This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block.  If a block's checksum is valid then its contents are
trusted by the higher layers.

This system works exceptionally well as long as bad data is never
written with a valid checksum.  If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in a kernel panic.

One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.

To prevent this from happening the arc_read() function has been
updated to detect this specific case.  If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.
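
The kind of check described above might look like the following
sketch.  It is not the arc_read() code itself: the toy_* names, the
128K limit and the error value are all assumptions made for the
example.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative limit only; the real maximum is defined by ZFS. */
#define	TOY_MAXBLOCKSIZE	(128ULL * 1024)

/* Hypothetical stand-in for the error reported on a bad checksum. */
#define	TOY_ECKSUM		EIO

/*
 * Reject a logical size larger than the maximum block size before
 * any buffer is allocated, and report it like a checksum error.
 */
static int
toy_check_lsize(uint64_t lsize)
{
	if (lsize > TOY_MAXBLOCKSIZE)
		return (TOY_ECKSUM);
	return (0);
}
```

Failing early this way turns what would be an oversized allocation
into an ordinary I/O error that the caller already knows how to handle.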

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2678

10 years agoRemove checks for mandatory locks
Ned Bass [Thu, 16 Oct 2014 20:52:56 +0000 (13:52 -0700)]
Remove checks for mandatory locks

The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp().  Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose.  Rather than emulating those interfaces we remove the
redundant checks.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804

10 years agoIllumos 5162 - zfs recv should use loaned arc buffer to avoid copy
Matthew Ahrens [Sat, 13 Sep 2014 14:02:18 +0000 (16:02 +0200)]
Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy

5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5162
  https://github.com/illumos/illumos-gate/commit/8a90470

Porting notes:
  Fix spelling error 's/arena/area/' in dmu.c.
  In restore_write() declare bonus and abuf at the top of the function.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2696

10 years agoIllumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness
Matthew Ahrens [Fri, 12 Sep 2014 03:45:50 +0000 (05:45 +0200)]
Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness

When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!

Steps to reproduce:

  * zpool create test c1t1d0
  * zfs create test/fs
  * zfs snapshot test/fs@a
  * zfs clone test/fs@a test/clone
  * zfs destroy -d test/fs@a
  * zfs clone test/fs@a test/clone2
  * zfs snapshot test/clone2@a
  * zfs hold hld test/clone2@a
  * zfs release hld test/clone2@a
  * zfs list -r -t all test

  <test/clone2@a has been destroyed>

We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.

5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5150
  https://github.com/illumos/illumos-gate/commit/42fcb65

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2690

10 years agoIllumos 3693 - restore_object uses at least two transactions to restore an object
Matthew Ahrens [Fri, 12 Sep 2014 03:28:35 +0000 (05:28 +0200)]
Illumos 3693 - restore_object uses at least two transactions to restore an object

Restore_object should not use two transactions to restore an object:
  * one transaction is used for dmu_object_claim
  * another transaction is used to set compression, checksum and most
    importantly bonus data
  * furthermore dmu_object_reclaim internally uses multiple transactions
  * dmu_free_long_range frees chunks in separate transactions
  * dnode_reallocate is executed in a distinct transaction

The fact that dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups.  Also, if
the first transaction group is successfully written to permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.

3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon

References:
  https://www.illumos.org/issues/3693
  https://github.com/illumos/illumos-gate/commit/e77d42e

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2689

10 years agoDon't perform ACL-to-mode translation on empty ACL
Tim Chase [Tue, 7 Oct 2014 13:01:01 +0000 (08:01 -0500)]
Don't perform ACL-to-mode translation on empty ACL

In zfs_acl_chown_setattr(), the zfs_mode_compute() function is used to
create a traditional mode value based on an ACL.  If no ACL exists, this
processing shouldn't be done.  Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
but also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:

# create a file with typical mode of 664
echo test > test
chown anyuser test
ls -l test

and the mode will show up as all zeroes.  Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1264

10 years agoIllumos 4924 - LZ4 Compression for metadata
Daniil Lunev [Sat, 18 Oct 2014 15:58:11 +0000 (11:58 -0400)]
Illumos 4924 - LZ4 Compression for metadata

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://github.com/illumos/illumos-gate/commit/b8289d2
  https://www.illumos.org/issues/3756

Porting notes:

The static function zfs_prop_activate_feature() was removed because
this change removes the only caller.  The function was not removed
from Illumos but instead left as dead code.  However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1540

10 years agoSuppress AIO kmem warnings
Brian Behlendorf [Thu, 9 Oct 2014 00:10:45 +0000 (17:10 -0700)]
Suppress AIO kmem warnings

The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc()
to allocate enough memory to hold the vectorized IO.  While this
allocation will be small it's been observed in practice to sometimes
slightly exceed the 8K warning threshold by a few kilobytes.
Therefore, the KM_NODEBUG flag has been added to suppress the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2774

10 years agoLet `zpool import` ignore a missing hostid record.
Darik Horn [Mon, 13 Oct 2014 03:57:49 +0000 (22:57 -0500)]
Let `zpool import` ignore a missing hostid record.

Change the zpool program to skip its hostid mismatch check in the
same way that libzfs already does.

Invoked imports fail if the ZPOOL_CONFIG_HOSTID nvpair is missing in
the /etc/zfs/zpool.cache file, which can happen as of the /etc/hostid
deprecation in commit zfsonlinux/spl@acf0ade362cb8b26d67770114ee6fa17816e6b65.

Signed-off-by: Darik Horn <dajhorn@vanadac.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2794

10 years agoHandle NULL mirror child vdev
Brian Behlendorf [Sat, 11 Oct 2014 01:12:47 +0000 (18:12 -0700)]
Handle NULL mirror child vdev

When selecting a mirror child it's possible that the map allocated by
vdev_mirror_map_alloc() contains a NULL for the child vdev.  In
this case the child should be skipped and the read issued to
another member of the mirror.
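
A minimal sketch of skipping NULL children during selection is shown
below.  The toy_* structures and the round-robin starting point are
assumptions for illustration and do not reflect the real mirror map
layout or selection policy.

```c
#include <stddef.h>

struct toy_vdev {
	int id;
};

struct toy_child {
	struct toy_vdev *vd;	/* may be NULL */
};

/*
 * Starting at index 'start' and wrapping around, return the index of
 * the first child with a non-NULL vdev, or -1 if there is none.
 */
static int
toy_mirror_child_select(const struct toy_child *mc, int children, int start)
{
	for (int i = 0; i < children; i++) {
		int c = (start + i) % children;

		if (mc[c].vd == NULL)
			continue;	/* skip, try the next member */
		return (c);
	}
	return (-1);
}
```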

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1744

10 years agoUpdate utsname support
Brian Behlendorf [Wed, 1 Oct 2014 19:02:12 +0000 (15:02 -0400)]
Update utsname support

Modify the code to use the utsname() kernel function rather than
a global variable.  This results in cleaner, more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2).  This means that it
will behave consistently in both contexts.

This also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions.  And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.

Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.
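
One way the user space emulation could be written is sketched below,
assuming a wrapper around uname(2).  The name toy_utsname() is
hypothetical and is used here only to avoid implying this is the
actual implementation.

```c
#include <stddef.h>
#include <sys/utsname.h>

/*
 * Hypothetical user space stand-in for the kernel's utsname():
 * fill a static buffer via uname(2) and return a pointer to it,
 * or NULL if the call fails.
 */
static struct utsname *
toy_utsname(void)
{
	static struct utsname uts;

	if (uname(&uts) != 0)
		return (NULL);
	return (&uts);
}
```

Callers can then read fields such as nodename or release through the
returned pointer in the same way in either context.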

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757

10 years agoRemove shrink_dcache_memory() and shrink_icache_memory()
Brian Behlendorf [Fri, 3 Oct 2014 20:00:53 +0000 (13:00 -0700)]
Remove shrink_dcache_memory() and shrink_icache_memory()

This functionality is optional and until Linux 3.0, which
provided per-filesystem shrinkers, there was never a reasonable
interface.  Therefore, this functionality is being dropped
for earlier kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757

10 years agoUpdate code to use misc_register()/misc_deregister()
Brian Behlendorf [Tue, 30 Sep 2014 23:24:04 +0000 (19:24 -0400)]
Update code to use misc_register()/misc_deregister()

When ZPIOS was originally written it was designed to use the
device_create() and device_destroy() functions.  Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.

As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions.  This interface
for registering character devices has remained stable, is simple,
and provides everything we need.

Therefore the code has been reworked to use this interface.  The
higher level ZFS code has always depended on these same interfaces
so this is also a step towards minimizing our kernel dependencies.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757

10 years agoMake license compatibility checks consistent
Brian Behlendorf [Fri, 3 Oct 2014 17:58:47 +0000 (10:58 -0700)]
Make license compatibility checks consistent

Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757

10 years agoztest: print backtrace on SIGSEGV and SIGABRT
Ned Bass [Sat, 11 Oct 2014 01:05:54 +0000 (18:05 -0700)]
ztest: print backtrace on SIGSEGV and SIGABRT

Add signal handlers to print a backtrace if we crash or assert.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2788

10 years agoFix source_tree variable in dkms build
Brian Behlendorf [Mon, 13 Oct 2014 17:35:01 +0000 (10:35 -0700)]
Fix source_tree variable in dkms build

The source_tree variable in the previous commit had an extra $.
Remove it so that source_tree is expanded properly.  An identical
fix has been applied in the original patch to the stable branch.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2776

10 years agoPoint dkms build at installed source tree, rather than build directory.
Tom Prince [Thu, 9 Oct 2014 17:24:03 +0000 (14:24 -0300)]
Point dkms build at installed source tree, rather than build directory.

Signed-off-by: Tom Prince <tom.prince@clusterhq.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2776

10 years agoInstall header during post-build rather than post-install.
Tom Prince [Thu, 9 Oct 2014 17:22:59 +0000 (14:22 -0300)]
Install header during post-build rather than post-install.

New versions of dkms clean up the build directory after installing.

It appears that this was always intended, but had rm -rf "/path/to/build/*"
(note the quotes), which prevented it from working.

Also, the build step is already installing stuff into the directory where
these files go, so installing our stuff there as part of build rather than
install makes sense.

Signed-off-by: Tom Prince <tom.prince@clusterhq.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2776

10 years agoAdd a stern warning about dedup
Turbo Fredriksson [Thu, 2 Oct 2014 14:09:51 +0000 (16:09 +0200)]
Add a stern warning about dedup

Users intending to use dedup should be clearly advised about
its memory requirements and the risks involved.

Thanks to Sachiru for comments and suggestions.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2754

10 years agoImprove VERIFY() error in dmu_write()
Brian Behlendorf [Fri, 3 Oct 2014 23:24:34 +0000 (16:24 -0700)]
Improve VERIFY() error in dmu_write()

This is a debug patch designed to ensure an error code is logged
to the console when this VERIFY() is hit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #1440

10 years agoFix CPU_SEQID use in preemptible context
Brian Behlendorf [Tue, 7 Oct 2014 20:20:49 +0000 (13:20 -0700)]
Fix CPU_SEQID use in preemptible context

Commit e022864 introduced a regression for kernels which are built
with CONFIG_DEBUG_PREEMPT.  The use of CPU_SEQID in a preemptible
context causes zio_nowait() to trigger the BUG.  Since CPU_SEQID
is simply being used as a random index the usage here is safe. To
resolve the issue, preemption is disabled while calling CPU_SEQID.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2769

10 years agoAdd an example for 'zfs bookmark' to the Example section.
Turbo Fredriksson [Wed, 1 Oct 2014 14:24:54 +0000 (16:24 +0200)]
Add an example for 'zfs bookmark' to the Example section.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2762

10 years agoIllumos 5176 - lock contention on godfather zio
Matthew Ahrens [Wed, 17 Sep 2014 06:59:43 +0000 (08:59 +0200)]
Illumos 5176 - lock contention on godfather zio

5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5176
  https://github.com/illumos/illumos-gate/commit/6f834bc

Porting notes:

Under Linux max_ncpus is defined as num_possible_cpus().  This is
the largest number of cpu ids which might be available during the
lifetime of the system boot.  This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2711

10 years agoAmend Dracut module to export ZFS root on shutdown
Lukas Wunner [Mon, 6 Oct 2014 11:08:33 +0000 (13:08 +0200)]
Amend Dracut module to export ZFS root on shutdown

Make use of Dracut's ability to restore the initramfs on shutdown and
pivot to it, allowing for a clean unmount and export of the ZFS root.
No need to force-import on every reboot anymore.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2195
Issue #2476
Issue #2498
Issue #2556
Issue #2563
Issue #2575
Issue #2600
Issue #2755
Issue #2766

10 years agoCleanup struct zed_conf vars in zed_conf_destroy
Chris Dunlap [Wed, 1 Oct 2014 21:56:52 +0000 (14:56 -0700)]
Cleanup struct zed_conf vars in zed_conf_destroy

Reset struct zed_conf file descriptors to -1 after close(),
and pointers to NULL after free().
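
A minimal sketch of the cleanup pattern described above.  The struct and
function names here are hypothetical stand-ins, not the actual struct
zed_conf layout:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for a few resources a config struct might own. */
struct demo_conf {
    int pid_fd;      /* open descriptor, or -1 when unset */
    char *pid_file;  /* heap-allocated path, or NULL when unset */
};

/*
 * Release everything owned by the struct and leave it in a state
 * where calling this function a second time is harmless.
 */
static void
demo_conf_destroy(struct demo_conf *c)
{
    assert(c != NULL);
    if (c->pid_fd >= 0) {
        (void) close(c->pid_fd);
        c->pid_fd = -1;          /* guards against a later double close() */
    }
    free(c->pid_file);           /* free(NULL) is a no-op */
    c->pid_file = NULL;          /* guards against a later double free() */
}
```

Resetting the fields turns an accidental second destroy, or a stale read
of the struct, into a no-op rather than a double close() or double free().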

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2756

10 years agoObtain advisory lock on ZED PID file
Chris Dunlap [Wed, 1 Oct 2014 21:56:07 +0000 (14:56 -0700)]
Obtain advisory lock on ZED PID file

ZED uses an advisory lock on its state file to protect against
multiple instances running concurrently.  However, work is planned
to move this state information into the kernel, and ZED will still
need to protect against starting multiple instances.

This commit adds an advisory lock on the PID file to protect against
starting multiple instances.  A lock failure can be overridden with
the "-f" (force) command-line option.  The advisory lock on the state
file is being retained for as long as the state information is stored
in the state file.
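
The single-instance check can be sketched as follows.  This is an
illustration only: it uses flock(2), and the commit message does not say
which locking primitive ZED uses.  The function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Open (creating if needed) the PID file and try to take an exclusive,
 * non-blocking advisory lock on it.  Returns the locked descriptor, or
 * -1 if the file cannot be opened or another holder owns the lock.
 * The lock is dropped when the descriptor is closed or the process exits.
 */
static int
pidfile_lock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return (-1);
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        int saved = errno;       /* EWOULDBLOCK when already locked */
        (void) close(fd);
        errno = saved;
        return (-1);
    }
    return (fd);
}
```

A daemon would keep the returned descriptor open for its whole lifetime.
A force option such as "-f" could be modeled by treating a -1 return as
a warning instead of a fatal error.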

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2756

10 years agozfs send -p send properties only for snapshots that are actually sent
Andriy Gapon [Thu, 15 May 2014 08:42:19 +0000 (11:42 +0300)]
zfs send -p send properties only for snapshots that are actually sent

... as opposed to sending properties of all snapshots of the relevant
filesystem.  The previous behavior results in properties being set on
all snapshots on the receiving side, which is quite slow.

Behavior of zfs send -R is not changed.

References:
  http://thread.gmane.org/gmane.comp.file-systems.openzfs.devel/346

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2729
Issue #2210

10 years agoFreeBSD PR kern/172259: Fixes zfs receive errors
smh [Thu, 13 Dec 2012 22:03:07 +0000 (22:03 +0000)]
FreeBSD PR kern/172259: Fixes zfs receive errors

FreeBSD PR kern/172259: Fixes zfs receive errors caused by snapshot
replication being processed in a random order instead of creation
order.

Eliminates needless filesystem renames caused by removed parent
snapshots, which subsequently cause many more errors.
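
Processing snapshots in creation order rather than in arbitrary order
can be sketched with a sort.  The struct is hypothetical, and using the
creation txg as the sort key is an assumption; the message above does
not name the key the fix uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one snapshot in a replication stream. */
struct demo_snap {
    const char *name;
    uint64_t createtxg;   /* assumed ordering key */
};

/* qsort comparator: ascending by creation txg. */
static int
demo_snap_cmp(const void *a, const void *b)
{
    uint64_t x = ((const struct demo_snap *)a)->createtxg;
    uint64_t y = ((const struct demo_snap *)b)->createtxg;

    return ((x > y) - (x < y));
}

/* Reorder the array so the oldest snapshot comes first. */
static void
demo_sort_snaps(struct demo_snap *snaps, size_t n)
{
    assert(snaps != NULL || n == 0);
    if (n > 1)
        qsort(snaps, n, sizeof (snaps[0]), demo_snap_cmp);
}
```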

PR: kern/172259
Submitted by: Steven Hartland
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks

References:
  https://github.com/freebsd/freebsd/commit/4995789

Porting notes:

Minor whitespace fixes were made to conform with style requirements:

lib/libzfs/libzfs_sendrecv.c: 2269: indent by spaces instead of tabs
lib/libzfs/libzfs_sendrecv.c: 2270: indent by spaces instead of tabs

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2729

10 years agoImplement -t option to zpool create for temporary pool names
Richard Yao [Fri, 20 Jun 2014 23:00:11 +0000 (19:00 -0400)]
Implement -t option to zpool create for temporary pool names

Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desirable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.

26b42f3f9d03f85cc7966dc2fe4dfe9216601b0e introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool create -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.

This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417

10 years agozpool import -t should not update cachefile
Richard Yao [Mon, 23 Jun 2014 18:26:47 +0000 (14:26 -0400)]
zpool import -t should not update cachefile

zpool import's -t parameter is intended for use with -R when operating
on pools that belong to other systems. Like -R, pools imported in this
way should not update the cachefile unless explicitly requested. The
initial implementation allowed the cachefile to be updated when -R was
not used. This went uncaught during testing because -R had implicitly
disabled use of the cachefile.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417

10 years agoAdd add_prop_list_default helper
Richard Yao [Mon, 23 Jun 2014 18:12:53 +0000 (14:12 -0400)]
Add add_prop_list_default helper

Adding to a property list only if there is no existing value is done
twice: once by zpool create -R and again by zpool import -R. Now that
zpool create -t and zpool import -t also need it, let's refactor it into
a helper function to make the code more readable.
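
The "add only if no value exists" behavior can be sketched as below.
The flat array and all names are hypothetical simplifications for
illustration and are not the property-list types the zpool command uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_PROPS 8

/* Hypothetical flat property list. */
struct demo_props {
    int count;
    char name[DEMO_MAX_PROPS][32];
    char value[DEMO_MAX_PROPS][64];
};

/*
 * Add name=value only when name has no value yet.
 * Returns 1 if added, 0 if an existing value was kept, -1 if full.
 */
static int
demo_add_prop_default(struct demo_props *pl, const char *name,
    const char *value)
{
    int i;

    assert(pl != NULL && name != NULL && value != NULL);
    for (i = 0; i < pl->count; i++) {
        if (strcmp(pl->name[i], name) == 0)
            return (0);      /* caller-supplied value wins */
    }
    if (pl->count == DEMO_MAX_PROPS)
        return (-1);
    (void) snprintf(pl->name[pl->count], sizeof (pl->name[0]),
        "%s", name);
    (void) snprintf(pl->value[pl->count], sizeof (pl->value[0]),
        "%s", value);
    pl->count++;
    return (1);
}
```

With a helper of this shape, each option handler supplies its default
and any value the user already set explicitly is left alone.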

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417

10 years agoMake user stack limit configurable
Brian Behlendorf [Thu, 25 Sep 2014 22:15:45 +0000 (15:15 -0700)]
Make user stack limit configurable

To aid in detecting and debugging stack overflow issues make the
user space stack limit configurable via a new ZFS_STACK_SIZE
environment variable.  The value assigned to ZFS_STACK_SIZE will
be used as the default stack size in bytes.

Because this is mainly useful as a debugging aid in conjunction
with ztest, the stack limit is disabled by default.  See the ztest(1)
man page for additional details on using the ZFS_STACK_SIZE
environment variable.
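
A rough sketch of reading such a value.  Only the variable name
ZFS_STACK_SIZE and the unit (bytes) come from the text above; the
function names and the "0 means no limit configured" convention are
assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal byte count.  Returns 0 when the string is NULL, does
 * not start with a digit, has trailing junk, or does not fit in size_t.
 */
static size_t
parse_stack_size(const char *s)
{
    char *end;
    unsigned long long v;

    if (s == NULL || *s < '0' || *s > '9')
        return (0);
    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > SIZE_MAX)
        return (0);
    return ((size_t)v);
}

/* Stack size requested through the environment, or 0 if none. */
static size_t
stack_size_from_env(void)
{
    return (parse_stack_size(getenv("ZFS_STACK_SIZE")));
}
```

A caller could pass a non-zero result to pthread_attr_setstacksize()
before creating threads, and keep the platform default when the result
is 0.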

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2743
Issue #2293

10 years agoPerform whole-page page truncation for hole-punching under a range lock
Tim Chase [Fri, 26 Sep 2014 04:40:41 +0000 (23:40 -0500)]
Perform whole-page page truncation for hole-punching under a range lock

As an attempt to perform the page truncation more optimally, the
hole-punching support added in 223df0161fad50f53a8fa5ffeea8cc4f8137d522
performed the operation in two steps: first, sub-page "stubs"
were zeroed under the range lock in zfs_free_range() using the new
zfs_zero_partial_page() function and then the whole pages were truncated
within zfs_freesp().  This left a window of opportunity during which
the full pages could be touched.

This patch closes the window by moving the whole-page truncation into
zfs_free_range() under the range lock.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2733

10 years agoRefer to ZED's scripts as ZEDLETs
Chris Dunlap [Fri, 19 Sep 2014 18:10:28 +0000 (11:10 -0700)]
Refer to ZED's scripts as ZEDLETs

The executables invoked by the ZED in response to a given zevent
have been generically referred to as "scripts".  By convention,
these scripts have aimed to be /bin/sh compatible for reasons of
portability and comprehensibility.  However, the ZED only requires
they be executable and (ideally) capable of reading environment
variables.  As such, these scripts are now referred to as ZEDLETs
(ZFS Event Daemon Linkage for Executable Tasks).

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2735

10 years agoReplace zed's use of malloc with calloc
Chris Dunlap [Mon, 22 Sep 2014 20:22:48 +0000 (13:22 -0700)]
Replace zed's use of malloc with calloc

When zed allocates memory via malloc(), it typically follows that
with a memset().  However, calloc() implementations can often perform
optimizations when zeroing memory:

https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc

This commit replaces zed's use of malloc() with calloc().
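
The before and after shapes of this change look roughly like the
following.  The struct and function names are hypothetical, not taken
from zed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type used only for this illustration. */
struct demo_event {
    int id;
    char *name;
};

/* Before: allocate, then clear in a second step. */
static struct demo_event *
event_alloc_old(size_t n)
{
    struct demo_event *ev;

    assert(n > 0);           /* this sketch expects at least one element */
    ev = malloc(n * sizeof (*ev));
    if (ev != NULL)
        memset(ev, 0, n * sizeof (*ev));
    return (ev);
}

/* After: a single call that returns zero-filled memory. */
static struct demo_event *
event_alloc_new(size_t n)
{
    assert(n > 0);
    return (calloc(n, sizeof (struct demo_event)));
}
```

Both versions return memory whose bytes are all zero, so callers see no
behavioral difference.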

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2736

10 years agoFix zed io-spare.sh dash incompatibility
Chris Dunlap [Thu, 11 Sep 2014 22:41:35 +0000 (15:41 -0700)]
Fix zed io-spare.sh dash incompatibility

The zed's io-spare.sh script defines a vdev_status() function to query
the 'zpool status' output for obtaining the status of a specified vdev.
This function contains a small awk script that uses a parameter
expansion (${parameter/pattern/string}) supported in bash but not
in dash.  Under dash, this fails with a "Bad substitution" error.

This commit replaces the awk script with a (hopefully more portable)
sed script that has been tested under both bash and dash.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2536

10 years agoIllumos 5138 - add tunable for maximum number of blocks freed in one txg
Max Grossman [Sun, 7 Sep 2014 15:06:08 +0000 (17:06 +0200)]
Illumos 5138 - add tunable for maximum number of blocks freed in one txg

Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5138
  https://github.com/illumos/illumos-gate/commit/af3465d

Porting notes:

Because support for exposing a uint64_t parameter wasn't added
until v3.17-rc1 the zfs_free_max_blocks variable has been declared
as an unsigned long.  This is already far larger than required and
it allows us to avoid additional autoconf compatibility code.

The default value has been set to 100,000 on Linux instead of
ULONG_MAX which is used on Illumos.  This was done to limit the
number of outstanding IOs in the system when snapshots are destroyed.
This helps ensure individual TXG sync times are kept reasonable and
memory isn't wasted managing a huge backlog of outstanding IOs.
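
The effect of the cap can be sketched as simple arithmetic.  This is not
the code from the free path; the names are hypothetical and only the
default of 100,000 comes from the notes above.

```c
#include <assert.h>

/* Hypothetical tunable mirroring the Linux default named above. */
static unsigned long demo_free_max_blocks = 100000;

/*
 * Given the number of blocks still waiting to be freed, return how
 * many may be freed in the current txg.  The rest wait for a later txg.
 */
static unsigned long
blocks_to_free_this_txg(unsigned long pending)
{
    if (pending < demo_free_max_blocks)
        return (pending);
    return (demo_free_max_blocks);
}

/* Number of txgs needed to drain a backlog under the cap. */
static unsigned long
txgs_to_drain(unsigned long pending)
{
    assert(demo_free_max_blocks > 0);
    return (pending / demo_free_max_blocks +
        (pending % demo_free_max_blocks != 0));
}
```

With the default, destroying a snapshot that frees 250,000 blocks is
spread over three txgs instead of being issued all at once.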

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2675
Closes #2581

10 years agoIllumos 4753 - increase number of outstanding async writes when sync task is waiting
Alex Reece [Fri, 18 Jul 2014 15:08:31 +0000 (07:08 -0800)]
Illumos 4753 - increase number of outstanding async writes when sync task is waiting

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/4753
    https://github.com/illumos/illumos-gate/commit/73527f4

Comments by Matt Ahrens from the issue tracker:
    When a sync task is waiting for a txg to complete, we should hurry
    it along by increasing the number of outstanding async writes
    (i.e. make vdev_queue_max_async_writes() return a larger number).
    Initially we might just have a tunable for "minimum async writes
    while a synctask is waiting" and set it to 3.
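
The proposal quoted above can be sketched as a floor on the computed
limit.  This illustrates the issue-tracker comment only and is not
claimed to match the committed change; both names are hypothetical.

```c
#include <assert.h>

/* Hypothetical tunable: floor applied while a sync task is waiting. */
static unsigned int demo_min_async_writes_waiting = 3;

/*
 * "computed" is the limit the normal heuristics produced.  While a
 * sync task waits for the txg, never return less than the floor.
 */
static unsigned int
demo_max_async_writes(unsigned int computed, int synctask_waiting)
{
    assert(computed > 0);
    if (synctask_waiting && computed < demo_min_async_writes_waiting)
        return (demo_min_async_writes_waiting);
    return (computed);
}
```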

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2716

10 years agoIllumos 5116 - zpool history -i goes into infinite loop
Matthew Ahrens [Wed, 17 Sep 2014 15:41:51 +0000 (17:41 +0200)]
Illumos 5116 - zpool history -i goes into infinite loop

5116 zpool history -i goes into infinite loop
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Boris Protopopov <boris.protopopov@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5116
  https://github.com/illumos/illumos-gate/commit/3339867

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2715

10 years agoIllumos 5135 - zpool_find_import_cached() can use fnvlist_*
Matthew Ahrens [Fri, 12 Sep 2014 16:26:53 +0000 (18:26 +0200)]
Illumos 5135 - zpool_find_import_cached() can use fnvlist_*

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5135
  https://github.com/illumos/illumos-gate/commit/b18d6b0

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2693

10 years agoIllumos 5139 - SEEK_HOLE failed to report a hole at end of file
Matthew Ahrens [Wed, 17 Sep 2014 15:25:10 +0000 (17:25 +0200)]
Illumos 5139 - SEEK_HOLE failed to report a hole at end of file

5139 SEEK_HOLE failed to report a hole at end of file
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
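
For context, lseek(2) with SEEK_HOLE is documented to treat the end of
every file as an implicit hole.  The sketch below shows that expected
behavior from user space on Linux; it does not reproduce the bug itself,
and the helper names are hypothetical.

```c
#define _GNU_SOURCE   /* exposes SEEK_HOLE in glibc; must precede includes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_HOLE
#define SEEK_HOLE 4   /* Linux value, used only if the header hides it */
#endif

/*
 * Return the offset of the first hole at or after "from", or -1 on
 * error.  Note that lseek() also moves the file offset there.
 */
static off_t
first_hole(int fd, off_t from)
{
    assert(fd >= 0);
    return (lseek(fd, from, SEEK_HOLE));
}

/* Write len bytes of data to a fresh temp file; report its first hole. */
static off_t
demo_dense_file_hole(size_t len)
{
    char path[] = "/tmp/seekhole-XXXXXX";
    char buf[512];
    off_t hole;
    size_t done = 0;
    int fd = mkstemp(path);

    if (fd < 0)
        return (-1);
    (void) unlink(path);     /* file goes away once fd is closed */
    memset(buf, 'x', sizeof (buf));
    while (done < len) {
        size_t n = len - done < sizeof (buf) ? len - done : sizeof (buf);

        if (write(fd, buf, n) != (ssize_t)n) {
            (void) close(fd);
            return (-1);
        }
        done += n;
    }
    hole = first_hole(fd, 0);
    (void) close(fd);
    return (hole);
}
```

For a file with no holes in its data, the first hole reported should be
at the file size.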

References:
  https://www.illumos.org/issues/5139
  https://github.com/illumos/illumos-gate/commit/0fbc0cd

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2714

10 years agolib/libzpool/kernel.c: Assert no owners in rw_destroy()
Richard Yao [Wed, 30 Apr 2014 00:47:47 +0000 (20:47 -0400)]
lib/libzpool/kernel.c: Assert no owners in rw_destroy()

This is intended to cause ztest to fail when rw_destroy() is called on a
rwlock that has owners.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330

10 years agoFix function call with uninitialized value in vdev_inuse
Richard Yao [Wed, 23 Apr 2014 03:18:17 +0000 (23:18 -0400)]
Fix function call with uninitialized value in vdev_inuse

LLVM's static analyzer reported that we could pass an uninitialized
pool_guid to spa_by_guid() in vdev_inuse(). Upon review, the report is
correct. An attempt to repurpose a spare or L2ARC drive from an exported
pool will cause the pool_guid passed to spa_by_guid() to be uninitialized
information from the stack. This will cause non-deterministic behavior.
Since there is no reason why we cannot repurpose such disks, we modify
vdev_inuse() to avoid calling spa_by_guid() when they are detected.
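
The shape of the fix can be sketched as "initialize, then skip the
lookup when there is nothing to look up".  Everything below is a
hypothetical simplification; it is not the vdev_inuse() code, and the
label layout and lookup are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified view of a device label. */
struct demo_label {
    int has_pool_guid;    /* 0 when the label carries no pool guid */
    uint64_t pool_guid;   /* only meaningful when has_pool_guid != 0 */
};

/* Hypothetical lookup: pretend only pool 42 is currently imported. */
static int
demo_pool_is_imported(uint64_t guid)
{
    return (guid == 42);
}

/* Returns 1 if the device belongs to an imported pool, else 0. */
static int
demo_vdev_inuse(const struct demo_label *l)
{
    uint64_t guid = 0;    /* never left as stack garbage */

    assert(l != NULL);
    if (!l->has_pool_guid)
        return (0);       /* no guid to check: skip the lookup */
    guid = l->pool_guid;
    return (demo_pool_is_imported(guid));
}
```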

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330

10 years agoProperly NULL terminate string in zfs_strcmp_pathname
Richard Yao [Wed, 23 Apr 2014 00:25:39 +0000 (20:25 -0400)]
Properly NULL terminate string in zfs_strcmp_pathname

The utility cppcheck caught this.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330

10 years agoIllumos 5147 - zpool list -v should show individual disk capacity
George Wilson [Fri, 12 Sep 2014 03:07:20 +0000 (05:07 +0200)]
Illumos 5147 - zpool list -v should show individual disk capacity

The 'zpool list -v' command displays lots of info but excludes the
capacity of each disk. This should be added.

5147 zpool list -v should show individual disk capacity
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5147
  https://github.com/illumos/illumos-gate/commit/7a09f97

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2688

10 years agoIllumos 5161 - add tunable for number of metaslabs per vdev
Matthew Ahrens [Sat, 13 Sep 2014 14:13:00 +0000 (16:13 +0200)]
Illumos 5161 - add tunable for number of metaslabs per vdev

5161 add tunable for number of metaslabs per vdev
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5161
  https://github.com/illumos/illumos-gate/commit/bf3e216

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2698