]> granicus.if.org Git - zfs/log
zfs
6 years agoOpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K...
Matt Ahrens [Mon, 24 Jul 2017 18:07:39 +0000 (11:07 -0700)]
OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices

Authored by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c
Closes #7445

6 years agoRevert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier"
megari [Mon, 16 Apr 2018 19:44:40 +0000 (22:44 +0300)]
Revert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier"

This reverts commit cbb893321545c2c9052787b556c9375fcb103979.

The original change in OpenZFS 9036 did remove duplicate 'const'
specifiers, but the ZoL port had already done what *should* have been
done in OpenZFS 9036, which is to make the pointers themselves const.
The port of the change to ZoL ended up doing an unnecessary removal
of the constness of the pointers. Undo that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Closes #7444

6 years agoOpenZFS 7638 - Refactor spa_load_impl into several functions
Pavel Zakharov [Tue, 23 Feb 2016 16:49:30 +0000 (11:49 -0500)]
OpenZFS 7638 - Refactor spa_load_impl into several functions

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7638
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fd3785ff6
Closes #7437

6 years agoZTS: zpool_create_002 clean up leftover filedisk
bunder2015 [Sun, 15 Apr 2018 22:17:44 +0000 (18:17 -0400)]
ZTS: zpool_create_002 clean up leftover filedisk

zpool_create_002_pos did not clean up filedisk files left over from
running the test.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7435
Closes #7439

6 years agoAvoid Linux hung task message in ZTHR
Tim Chase [Sun, 15 Apr 2018 22:12:28 +0000 (17:12 -0500)]
Avoid Linux hung task message in ZTHR

Use an interruptible to avoid Linux hung task message in
ZTHR and to prevent inflating the load average.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7440
Closes #7441

6 years agoOpenZFS 9213 - zfs: sytem typo
Toomas Soome [Sat, 10 Mar 2018 23:01:46 +0000 (15:01 -0800)]
OpenZFS 9213 - zfs: sytem typo

Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
* The additional instances of this typo addressed in the OpenZFS
  patch were already resolved.

OpenZFS-issue: https://illumos.org/issues/9213
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edc8ef7d92
Closes #7436

6 years agoOpenZFS 9036 - zfs: duplicate 'const' declaration specifier
Toomas Soome [Mon, 5 Feb 2018 12:56:04 +0000 (14:56 +0200)]
OpenZFS 9036 - zfs: duplicate 'const' declaration specifier

Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9036
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f02c28e434
Closes #6900

6 years agoOpenZFS 9084 - spa_*_ashift must ignore spare devices
Prakash Surya [Fri, 16 Feb 2018 16:17:06 +0000 (08:17 -0800)]
OpenZFS 9084 - spa_*_ashift must ignore spare devices

It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40ebddef74,
        ebddef688992dc0ebe10a00fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/9084
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/18acba7
Closes #6900

6 years agoOpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()
Serapheim Dimitropoulos [Mon, 9 Jan 2017 20:54:51 +0000 (12:54 -0800)]
OpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9080
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bdfded42e6
Closes #6900

6 years agoOpenZFS 9079 - race condition in starting and ending condensing thread for indirect...
Serapheim Dimitropoulos [Wed, 15 Mar 2017 23:41:52 +0000 (16:41 -0700)]
OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs

The timeline of the race condition is the following:

[1] Thread A is about to finish condesing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertions fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900

6 years agoOptimize possible split block search space
Brian Behlendorf [Thu, 29 Mar 2018 21:50:40 +0000 (14:50 -0700)]
Optimize possible split block search space

Remove duplicate segment copies to minimize the possible search
space for reconstruction.  Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.

Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6900

6 years agoOpenZFS 9290 - device removal reduces redundancy of mirrors
Matthew Ahrens [Tue, 13 Feb 2018 19:37:56 +0000 (11:37 -0800)]
OpenZFS 9290 - device removal reduces redundancy of mirrors

Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900

6 years agoOpenZFS 7614, 9064 - zfs device evacuation/removal
Matthew Ahrens [Thu, 22 Sep 2016 16:30:13 +0000 (09:30 -0700)]
OpenZFS 7614, 9064 - zfs device evacuation/removal

OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900

6 years agoWait for resilver after online
Tim Chase [Sun, 8 Apr 2018 04:10:22 +0000 (23:10 -0500)]
Wait for resilver after online

This test performs a rapid offline/online cycle of each of several
mirror vdevs.  It can run so quickly that there isn't sufficient pool
redundancy to perform an offline.  The solution is to wait until the
pool is resilvered following the online operation.

Also, add a pool sync before the offline operation to help reduce
spurious errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #6900

6 years agoEliminate trailing spaces in DISKS
Tim Chase [Wed, 3 Jan 2018 21:45:35 +0000 (15:45 -0600)]
Eliminate trailing spaces in DISKS

The zfs-tests.sh driver script could add spaces to the end of $DISKS
which defeates shell-based parsing with constructs such as ${DISKS##* }.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #6900

6 years agoAllow mounting datasets more than once
Seth Forshee [Thu, 12 Apr 2018 19:24:38 +0000 (15:24 -0400)]
Allow mounting datasets more than once

Currently mounting an already mounted zfs dataset results in an
error, whereas it is typically allowed with other filesystems.
This causes some bad interactions with mount namespaces. Take
this sequence for example:

- Create a dataset
- Create a snapshot of the dataset
- Create a clone of the snapshot
- Create a new mount namespace
- Rename the original dataset

The rename results in unmounting and remounting the clone in the
original mount namespace, however the remount fails because the
dataset is still mounted in the new mount namespace. (Note that
this means the mount in the new mount namespace is never being
unmounted, so perhaps the unmount/remount of the clone isn't
actually necessary.)

The problem here is a result of the way mounting is implemented
in the kernel module. Since it is not mounting block devices it
uses mount_nodev() instead of the usual mount_bdev(). However,
mount_nodev() is written for filesystems for which each mount is
a new instance (i.e. a new super block), and zfs should be able
to detect when a mount request can be satisfied using an existing
super block.

Change zpl_mount() to call sget() directly with it's own test
callback. Passing the objset_t object as the fs data allows
checking if a superblock already exists for the dataset, and in
that case we just need to return a new reference for the sb's
root dentry.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Closes #5796
Closes #7207

6 years agoZTS: clean up leftover ibackup_trunc files
bunder2015 [Fri, 13 Apr 2018 17:35:55 +0000 (13:35 -0400)]
ZTS: clean up leftover ibackup_trunc files

zfs_receive_raw_incremental did not clean up ibackup_trunc.* files
left over from running the test.

Also changing the path of the ibackup files so they can be placed
in the correct directories when /var/tmp is not the temporary
directory.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7430

6 years agoLinux compat 4.16: blk_queue_flag_{set,clear}
Brian Behlendorf [Fri, 13 Apr 2018 02:46:14 +0000 (19:46 -0700)]
Linux compat 4.16: blk_queue_flag_{set,clear}

The HAVE_BLK_QUEUE_WRITE_CACHE_GPL_ONLY case was overlooked in
the original 10f88c5c commit because blk_queue_write_cache()
was available for the in-kernel builds.

Update the blk_queue_flag_{set,clear} wrappers to call the locked
versions to avoid confusion.  This is safe for all existing callers.

The blk_queue_set_write_cache() function has been updated to use
these wrappers.  This means setting/clearing both QUEUE_FLAG_WC
and QUEUE_FLAG_FUA is no longer atomic but this only done early
in zvol_alloc() prior to any requests so there is no issue.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7428
Closes #7431

6 years agoAdd 'zpool split' coverage to the ZFS Test Suite
LOLi [Thu, 12 Apr 2018 17:57:24 +0000 (19:57 +0200)]
Add 'zpool split' coverage to the ZFS Test Suite

This change adds five new tests to the ZTS:

 * zpool_split_cliargs: verify command line options and arguments
 * zpool_split_devices: verify zpool split accepts a device list
 * zpool_split_encryption: verify zpool can split encrypted pools
 * zpool_split_props: verify zpool split can set property values
 * zpool_split_vdevs: verify vdev layout when splitting the pool

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7409

6 years agoFix calloc(3) arguments order
Tomohiro Kusumi [Thu, 12 Apr 2018 17:50:39 +0000 (02:50 +0900)]
Fix calloc(3) arguments order

calloc(3) takes `nelem` (or `nmemb` in glibc) first, and then size of
elements.  No difference expected for having these in reverse order,
however should follow the standard.

http://pubs.opengroup.org/onlinepubs/009695399/functions/calloc.html

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7405

6 years agoFix zfs_arc_max minimum tuning
beren12 [Thu, 12 Apr 2018 17:47:32 +0000 (13:47 -0400)]
Fix zfs_arc_max minimum tuning

When setting `zfs_arc_max` its minimum value is allowed
to be 64 MiB.  There was an off-by-1 error which can matter
on tiny systems.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Zubrzycki <github@mid-earth.net>
Closes #7417

6 years agoOpenZFS 9286 - want refreservation=auto
Mike Gerdts [Wed, 11 Apr 2018 16:14:45 +0000 (10:14 -0600)]
OpenZFS 9286 - want refreservation=auto

Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Don Brady <don.brady@delphix.com>
Porting Notes:
* Adopted destroy_dataset in ZTS test cleanup
* Use ksh shebang instead of bash for new tests

OpenZFS-issue: https://www.illumos.org/issues/9286
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/723d0c85
Closes #7387

6 years agoFix zpool set feature@<feature>=disabled
LOLi [Wed, 11 Apr 2018 21:45:58 +0000 (23:45 +0200)]
Fix zpool set feature@<feature>=disabled

Commit e4010f2 accidentally allows zpool to set pool features to
"disabled"; this should only be allowed at pool creation. This commit
adds additional checks and test coverage to 'zpool set'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7402

6 years agozfs(8): fix dedup omission during mdoc overhaul
DeHackEd [Wed, 11 Apr 2018 00:19:01 +0000 (20:19 -0400)]
zfs(8): fix dedup omission during mdoc overhaul

The property description has been updated with new algorithms as well.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes #7377

6 years agoFix race in dnode_check_slots_free()
Tom Caputi [Tue, 10 Apr 2018 18:15:05 +0000 (14:15 -0400)]
Fix race in dnode_check_slots_free()

Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7147
Closes #7388

6 years agoLinux compat 4.16: blk_queue_flag_{set,clear}
Giuseppe Di Natale [Tue, 10 Apr 2018 17:32:14 +0000 (10:32 -0700)]
Linux compat 4.16: blk_queue_flag_{set,clear}

queue_flag_{set,clear}_unlocked are now private interfaces in
the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14).
Use blk_queue_flag_{set,clear} interfaces which were introduced as
of https://github.com/torvalds/linux/commit/8814ce8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7410

6 years agoCorrect swapped keylocation error messages
Tom Caputi [Tue, 10 Apr 2018 04:11:17 +0000 (00:11 -0400)]
Correct swapped keylocation error messages

This patch corrects a small issue where two error messages
in the code that checks for invalid keylocations were
swapped.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7418

6 years agoRevert "Handle zap_add() failures in mixed ... "
Tony Hutter [Mon, 9 Apr 2018 21:24:46 +0000 (14:24 -0700)]
Revert "Handle zap_add() failures in mixed ... "

This reverts commit cc63068e95ee725cce03b1b7ce50179825a6cda5.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401
Cloes #7416

6 years agoFix 'zfs send/recv' hang with 16M blocks
Brian Behlendorf [Mon, 9 Apr 2018 02:41:15 +0000 (19:41 -0700)]
Fix 'zfs send/recv' hang with 16M blocks

When using 16MB blocks the send/recv queue's aren't quite big
enough.  This change leaves the default 16M queue size which a
good value for most pools.  But it additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7365
Closes #7404

6 years agoClean up (k)shlib and cfg file shebangs
Giuseppe Di Natale [Mon, 9 Apr 2018 02:37:22 +0000 (19:37 -0700)]
Clean up (k)shlib and cfg file shebangs

Most kshlib files are imported by other scripts
and do not have a shebang at the top of their files.
Make all kshlib follow this convention.

Remove shebangs from cfg files as well.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Close #7406

6 years agoFix "file is executable, but no shebang" warnings
Tony Hutter [Fri, 6 Apr 2018 23:34:21 +0000 (16:34 -0700)]
Fix "file is executable, but no shebang" warnings

Fedora 28's RPM build checks warn when executable files don't have a
shebang line.  These warnings are caused when we (incorrectly)
include data & config files in the_SCRIPTS automake lines. Files in
_SCRIPTS are marked executable by automake. This patch fixes the
issue by including non-executable scripts in a _DATA line instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7359
Closes #7395

6 years agoExclude python scripts from RPM shebang check
Tony Hutter [Fri, 6 Apr 2018 23:32:58 +0000 (16:32 -0700)]
Exclude python scripts from RPM shebang check

The newest Fedora packaging rules print warnings for scripts using the
/usr/bin/python shebang:

    *** WARNING: mangling shebang in /usr/bin/arc_summary.py from
    #!/usr/bin/python to #!/usr/bin/python2. This will become an ERROR,
    fix it manually!

Fedora wants all cross compatible scripts to pick python3.  Since we
don't want our users to have to pick a specific version of python, we
exclude our scripts from the RPM build check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7360
Closes #7399

6 years agosystemd mount generator and tracking ZEDLET
Antonio Russo [Fri, 6 Apr 2018 21:11:09 +0000 (17:11 -0400)]
systemd mount generator and tracking ZEDLET

zfs-mount-generator implements the "systemd generator" protocol,
producing systemd.mount units from the cached outputs of zfs list,
during early boot, integrating with systemd.

Each pool has an indpendent cache of the command

  zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool

which is kept synchronized by the ZEDLET

  history_event-zfs-list-cacher.sh

Datasets not in the cache will be loaded later in the boot process by
zfs-mount.service, including pools without a cache.

Among other things, this allows for complex mount hierarchies.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7329

6 years agoFixes for SNPRINTF_BLKPTR with encrypted BP's
Matthew Ahrens [Fri, 6 Apr 2018 20:30:26 +0000 (13:30 -0700)]
Fixes for SNPRINTF_BLKPTR with encrypted BP's

mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.

Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7390

6 years agoFix divide-by-zero in mmp_delay_update()
Olaf Faaland [Fri, 6 Apr 2018 20:29:11 +0000 (13:29 -0700)]
Fix divide-by-zero in mmp_delay_update()

vdev_count_leaves() in the denominator may return 0, caught by Coverity.
Introduced by

533ea04 Update mmp_delay on sync or skipped, failed write

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7391

6 years agoMake encrypted "zfs mount -a" failures consistent
Tom Caputi [Fri, 6 Apr 2018 20:28:15 +0000 (16:28 -0400)]
Make encrypted "zfs mount -a" failures consistent

Currently, "zfs mount -a" will print a warning and fail to mount
any encrypted datasets that do not have a key loaded. This patch
makes the behavior of this failure consistent with other failure
modes ("zfs mount -a" will silently continue, explict "zfs mount"
will print a message and return an error code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7382

6 years agoUpdate mmp_delay on sync or skipped, failed write
Olaf Faaland [Wed, 4 Apr 2018 23:38:44 +0000 (16:38 -0700)]
Update mmp_delay on sync or skipped, failed write

When an MMP write is skipped, or fails, and time since
mts->mmp_last_write is already greater than mts->mmp_delay, increase
mts->mmp_delay.  The original code only updated mts->mmp_delay when a
write succeeded, but this results in the write(s) after delays and
failed write(s) reporting an ub_mmp_delay which is too low.

Update mmp_last_write and mmp_delay if a txg sync was successful.  At
least one uberblock was written, thus extending the time we can be sure
the pool will not be imported by another host.

Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) /
vdev_count_leaves()) so that a period of frequent successful MMP writes,
e.g. due to frequent txg syncs, does not result in an import activity
check so short it is not reliable based on mmp thread writes alone.

Remove unnecessary local variable, start.  We do not use the start time
of the loop iteration.

Add a debug message in spa_activity_check() to allow verification of the
import_delay value and to prove the activity check occurred.

Alter the tests that import pools and attempt to detect an activity
check.  Calculate the expected duration of spa_activity_check() based on
module parameters at the time the import is performed, rather than a
fixed time set in mmp.cfg.  The fixed time may be wrong.  Also, use the
default zfs_multihost_interval value so the activity check is longer and
easier to recognize.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7330

6 years agoFedora 28: Fix misc bounds check compiler warnings
Tony Hutter [Wed, 4 Apr 2018 17:16:47 +0000 (10:16 -0700)]
Fedora 28: Fix misc bounds check compiler warnings

Fix a bunch of (mostly) sprintf/snprintf truncation compiler
warnings that show up on Fedora 28 (GCC 8.0.1).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7361
Closes #7368

6 years agoFix spa reference leak in zfs_ioc_pool_scan
LOLi [Wed, 4 Apr 2018 00:31:30 +0000 (02:31 +0200)]
Fix spa reference leak in zfs_ioc_pool_scan

zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a
valid pool_scrub_cmd_t: this could happen if the userland binaries
and ZFS kernel module differ in version and would prevent the pool from
being exported.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7380

6 years agoFix add_nested_replacing_spare test case
LOLi [Wed, 4 Apr 2018 00:30:14 +0000 (02:30 +0200)]
Fix add_nested_replacing_spare test case

Use 'zpool reopen' instead of 'zpool scrub' to kick in the spare device:
this is required to avoid spurious failures caused by a race condition
in events processing by the ZFS Event Daemon:

P1 (zpool scrub)                            P2 (zed)
---
zfs_ioc_pool_scan()
 -> dsl_scan()
  -> vdev_reopen()
   -> vdev_set_state(VDEV_STATE_CANT_OPEN)
                                            zfs_ioc_vdev_attach()
                                             -> spa_vdev_attach()
                                              -> dsl_resilver_restart()
  -> dsl_sync_task()
   -> dsl_scan_setup_check()
   <- dsl_scan_setup_check(): EBUSY

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7247
Closes #7342

6 years agoRemove ASSERT() in l2arc_apply_transforms()
Tim Chase [Sat, 31 Mar 2018 22:14:21 +0000 (17:14 -0500)]
Remove ASSERT() in l2arc_apply_transforms()

The ASSERT was erroneously copied from the next section of code.
The buffer's size should be expanded from "psize" to "asize"
if necessary.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7375

6 years agoDecryption error handling improvements
Tom Caputi [Sat, 31 Mar 2018 18:12:51 +0000 (14:12 -0400)]
Decryption error handling improvements

Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.

In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7372

6 years agoEncrypted dnode blocks should be prefetched raw
Tom Caputi [Sat, 31 Mar 2018 18:11:48 +0000 (14:11 -0400)]
Encrypted dnode blocks should be prefetched raw

Encrypted dnode blocks are always initially read as raw data and
converted to decrypted data when an encrypted bonus buffer is
needed. This allows the DMU to be used for things like fetching
the DMU master node without requiring keys to be loaded. However,
dbuf_issue_final_prefetch() does not currently read the data as
raw. The end result of this is that prefetched dnode blocks are
read twice from disk: once decrypted and then again as raw data.
This patch corrects the issue by adding the flag when appropriate.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7362

6 years agoFix hung z_zvol tasks during 'zfs receive'
LOLi [Fri, 30 Mar 2018 19:10:01 +0000 (21:10 +0200)]
Fix hung z_zvol tasks during 'zfs receive'

During a receive operation zvol_create_minors_impl() can wait
needlessly for the prefetch thread because both share the same tasks
queue.  This results in hung tasks:

<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3>      Tainted: P           O  3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first z_zvol:5541 (zvol_task_cb) is waiting for the long running
traverse_prefetch_thread:260

root@linux:~# cat /proc/spl/taskq
taskq                       act  nthr  spwn  maxt   pri  mina
spl_system_taskq/0            1     2     0    64   100     1
active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
wait: 5541
spl_delay_taskq/0             0     1     0     4   100     1
delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1                      1     1     0     1   120     1
active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
pend: zvol_task_cb [zfs](0xffff88001fde6800)

This change adds a dedicated, per-pool, prefetch taskq to prevent the
traverse code from monopolizing the global (and limited) system_taskq by
inappropriately scheduling long running tasks on it.

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6330
Closes #6890
Closes #7343

6 years agozfs-functions: skip lines where mntpnt is 'none'
Georgy Yakovlev [Fri, 30 Mar 2018 19:05:24 +0000 (11:05 -0800)]
zfs-functions: skip lines where mntpnt is 'none'

This fixes zfs-mount initscript trying to mount swap volumes
as filesystems or anything that has 'none' as a mountpoint
in /etc/fstab.  Additionally, fixes trying to mount swap volumes
as a filesystem on RHEL.  RHEL defines mountpoint for swap
as `swap`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Georgy Yakovlev <ya@sysdump.net>
Closes #7346
Closes #7347

6 years agoOpenZFS 9164 - assert: newds == os->os_dsl_dataset
Andriy Gapon [Wed, 21 Feb 2018 12:55:55 +0000 (14:55 +0200)]
OpenZFS 9164 - assert: newds == os->os_dsl_dataset

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
  to successfully run in under 5 minutes.

OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a
Closes #6112
Closes #7336

6 years agoAdd support for nvme based devids
Don Brady [Fri, 30 Mar 2018 00:43:25 +0000 (18:43 -0600)]
Add support for nvme based devids

Adds a devid for nvme devices. This is very similar to how the
other 'bus' (scsi|sata|usb) devids are generated. The devid
resides in a name/value pair in the leaf vdevs in a zpool config.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7356

6 years agoResolve QAT issues with incompressible data
Tom Caputi [Fri, 30 Mar 2018 00:40:34 +0000 (20:40 -0400)]
Resolve QAT issues with incompressible data

Currently, when ZFS wants to accelerate compression with QAT, it
passes a destination buffer of the same size as the source buffer.
Unfortunately, if the data is incompressible, QAT can actually
"compress" the data to be larger than the source buffer. When this
happens, the QAT driver will return a FAILED error code and print
warnings to dmesg. This patch fixes these issues by providing the
QAT driver with an additional buffer to work with so that even
completely incompressible source data will not cause an overflow.

This patch also resolves an error handling issue where
incompressible data attempts compression twice: once by QAT and
once in software. To fix this issue, a new (and fake) error code
CPA_STATUS_INOMPRESSIBLE has been added so that the calling code
can correctly account for the difference between a hardware
failure and data that simply cannot be compressed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7338

6 years agoFix ASSERT in dsl_scan_fini() and cleanup comments
Tom Caputi [Thu, 29 Mar 2018 01:30:44 +0000 (21:30 -0400)]
Fix ASSERT in dsl_scan_fini() and cleanup comments

This patch fixes an issue where dsl_scan_prefetch_cb() might
add more prefetch I/Os to the prefetch queue after prefetching
has been completed. This was happening because that code was
checking scn->scn_suspending instead of scn->scn_prefetch_stop.
This occasionally triggered an ASSERT during ztest runs in
dsl_scan_fini() when the code attempted to destroy an AVL tree
that still had entires in it. This patch also includes a number
of spelling corrections and comment cleanups throughout
dsl_scan.c

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7353

6 years agochmod -x on etc/init.d/zfs-*.in automake files
Tony Hutter [Wed, 28 Mar 2018 19:08:21 +0000 (12:08 -0700)]
chmod -x on etc/init.d/zfs-*.in automake files

Clear executable bit on zfs-import.in, zfs-mount.in,
zfs-share.in, and zfs-zed.in.  These are automake files and
should not be marked executable.  This fixes a RPM build error
on Fedora 28.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7355
Closes #7327

6 years agoFix mmap / libaio deadlock
Brian Behlendorf [Wed, 28 Mar 2018 17:19:22 +0000 (10:19 -0700)]
Fix mmap / libaio deadlock

Calling uiomove() in mappedread() under the page lock can result
in a deadlock if the user space page needs to be faulted in.

Resolve the issue by dropping the page lock before the uiomove().
The inode range lock protects against concurrent updates via
zfs_read() and zfs_write().

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7335
Closes #7339

6 years agoRemove libattr requirement
DeHackEd [Tue, 27 Mar 2018 23:51:33 +0000 (19:51 -0400)]
Remove libattr requirement

RHEL/CentOS 6 supports sys/xattr.h eliminating the need for
libattr-devel as a dependency.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #7344
Closes #7351

6 years agoOpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong...
Allan Jude [Tue, 20 Feb 2018 05:31:14 +0000 (00:31 -0500)]
OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value

Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9321
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18
Closes #7333

6 years agoFedora 28: Fix "Macro %_dracutdir has empty body"
Tony Hutter [Sun, 25 Mar 2018 22:00:47 +0000 (15:00 -0700)]
Fedora 28: Fix "Macro %_dracutdir has empty body"

If you run ./configure --with-config=srpm, it will not trigger
the user m4 scripts to populate the dracut and udev directories.
This causes a build error on Fedora 28.  Make the dracut and
udev lines conditional to get around this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7326
Closes #7328

6 years agoDon't count embedded bps in read stats
Tom Caputi [Sat, 24 Mar 2018 04:35:19 +0000 (00:35 -0400)]
Don't count embedded bps in read stats

Currently, ZFS tracks statistics about calls to arc_read()
via the /proc/spl/kstat/zfs/<pool>/reads file for debugging.
Unfortunately, this file currently counts embedded bps as
disk reads since they are technically processed by the ZIO
layer. This pollutes the log since the ARC will never cache
embedded bps. This patch  corrects this issue by preventing
the logging of embedded bp reads.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7334

6 years agoOpenZFS 9193 - bootcfg -C doesn't work
Paul Dagnelie [Thu, 23 Mar 2017 22:28:22 +0000 (15:28 -0700)]
OpenZFS 9193 - bootcfg -C doesn't work

When given an empty string as a rootds value, bootcfg -C fails with
the error message 'could not set nextboot: '' is an invalid name'.
This should be allowed because it represents clearing the nextboot
configuration.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9193
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/504645d227
Closes #7230

6 years agoClarify zpool actions for an intent log device
Peter Ashford [Thu, 22 Mar 2018 22:12:08 +0000 (15:12 -0700)]
Clarify zpool actions for an intent log device

Updated the "Intent Log" section of the "zpool" manual page to
properly reflect the actions that may be performed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Peter Ashford <ashford@accs.com>
Closes #6938
Closes #7318

6 years agomodprobe zfs during dracut mount
kpande [Thu, 22 Mar 2018 17:14:29 +0000 (13:14 -0400)]
modprobe zfs during dracut mount

Resolves importing root pool during boot in dracut.  This case was
inadvertently broken with the module autoloading change in #7287.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #7322

6 years agoenable zfs_dbgmsg() by default, without dprintf()
Matthew Ahrens [Wed, 21 Mar 2018 22:37:32 +0000 (15:37 -0700)]
enable zfs_dbgmsg() by default, without dprintf()

zfs_dbgmsg() should record a message by default.  As a general
principal, these messages shouldn't be too verbose.  Furthermore, the
amount of memory used is limited to 4MB (by default).

dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags.  This flag is not set by default
(even on debug builds).  These messages are extremely verbose, and
sometimes nontrivial to compute.

SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags.  This flag is not set by default (even on debug builds).

This brings our behavior in line with illumos.  Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7314

6 years agoFix spelling errors in comments
Tom Caputi [Wed, 21 Mar 2018 15:42:13 +0000 (11:42 -0400)]
Fix spelling errors in comments

This patch simply corrects some spelling / grammar errors in
the QAT and encryption code comments. No functional changes

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7319

6 years agoAdd support for nvme disk detection
timor [Wed, 21 Mar 2018 15:35:20 +0000 (16:35 +0100)]
Add support for nvme disk detection

This treats /dev/nvme.. devices the same way as /dev/sd... devices.  The
motivation behind this is that whole disk detection did not work on nvme
SSDs without that, because it DKC_UNKNOWN was returned for such devices.

Perhaps there should be a separate DKC_ type for this, but I don't know
enough about the code to know the implications of that.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: timor <timor.dd@googlemail.com>
Closes #7304

6 years agoAdd comments for portable dnode / objset flags
Tom Caputi [Tue, 20 Mar 2018 18:55:21 +0000 (14:55 -0400)]
Add comments for portable dnode / objset flags

This patch adds some comments describing the purpose of "portable"
dnode and objset flags so that it is clear when new flags should
be added to the repective flag masks. This patch includes no
functional changes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7313

6 years agoAdd JSON output support to channel programs
Alek P [Mon, 19 Mar 2018 19:40:58 +0000 (15:40 -0400)]
Add JSON output support to channel programs

The changes piggyback JSON output support on top of channel programs
(#6558).  This way the JSON output support is targeted to scripting
use cases and is easily maintainable since it really only touches
one function (zfs_do_channel_program()).

This patch ports Joyent's JSON nvlist library from illumos to enable
easy JSON printing of channel program output nvlist.  To keep the
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting
the program.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7281

6 years agoFix deadlock in IO pipeline
Brian Behlendorf [Fri, 16 Mar 2018 23:46:06 +0000 (16:46 -0700)]
Fix deadlock in IO pipeline

In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock.  This can result in a deadlock
as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed.  This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.

---  THREAD 1 ---
arc_read()
  zio_nowait()
    zio_vdev_io_start()
      vdev_queue_io() <--- mutex_enter(vq->vq_lock)
        vdev_queue_io_to_issue()
          vdev_queue_aggregate()
            zio_execute()
              zio_vdev_io_assess()
                zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
  zio_change_priority() <- mutex_enter(zio->zio_lock)
    vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7307

6 years agoAllow QAT to handle non page-aligned buffers
Tom Caputi [Fri, 16 Mar 2018 23:02:49 +0000 (19:02 -0400)]
Allow QAT to handle non page-aligned buffers

This patch adds some handling to the QAT acceleration functions
that allows them to handle buffers that are not aligned with the
page cache. At the moment this never happens since callers only
happen to work with page-aligned buffers, but this code should
prevent headaches if this isn't always true in the future. This
patch also adds some cleanups to align the QAT compression code
with the encryption and checksumming code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7305

6 years agoReport pool suspended due to MMP
Olaf Faaland [Thu, 15 Mar 2018 17:56:55 +0000 (10:56 -0700)]
Report pool suspended due to MMP

When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.

Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.

When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.

In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.

In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7296

6 years agoSHA256 QAT acceleration
Tom Caputi [Thu, 15 Mar 2018 17:53:58 +0000 (13:53 -0400)]
SHA256 QAT acceleration

This patch enables acceleration of SHA256 checksums using Intel
Quick Assist Technology. This patch also fixes up and refactors
some of the code from QAT encryption to make the behavior
consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7295

6 years agoOpenZFS 9076 - Adjust perf test concurrency settings
Stephen Blinick [Sun, 11 Feb 2018 23:11:59 +0000 (16:11 -0700)]
OpenZFS 9076 - Adjust perf test concurrency settings

ZFS Performance test concurrency should be lowered for better latency

Work by Stephen Blinick.

Nightly performance runs typically consist of two levels of concurrency;
and both are fairly high.

Since the IO runs are to a ZFS filesystem, within a zpool, which is
based on some variable number of vdev's, the amount of IO driven to each
device is variable. Additionally, different device types (HDD vs SSD,
etc) can generally handle a different amount of concurrent IO before
saturating.

Nevertheless, in practice, it appears that most tests are well past the
concurrency saturation point and therefore both perform with the same
throughput, the maximum of the device. Because the queuedepth to the
device(s) is so high however, the latency is much higher than the best
possible at that throughput, and increases linearly with the increase in
concurrency.

This means that changes in code that impact latency during normal
operation (before saturation) may not be apparent when a large component
of the measured latency is from the IO sitting in a queue to be
serviced. Therefore, changing the concurrency settings is recommended

Authored by: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Ported-by: John Wren Kennedy <jwk404@gmail.com>
OpenZFS-issue: https://www.illumos.org/issues/9076
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/562
Upstream bug: DLPX-45477
Closes #7302

6 years agoreceive_spill does not byte swap spill contents
Paul Zuchowski [Thu, 15 Mar 2018 17:29:51 +0000 (13:29 -0400)]
receive_spill does not byte swap spill contents

In zfs receive, the function receive_spill should account
for spill block endian conversion as a defensive measure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7300

6 years agoOpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression
Brian Behlendorf [Tue, 13 Mar 2018 17:52:48 +0000 (10:52 -0700)]
OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression

With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress
indirect blocks, under a workload of random cached reads. To reduce this
decompression cost, we would like to increase the size of the dbuf cache so
that more indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the dbuf
cache as well.)

In real world workloads, this won't help as dramatically as the example above,
but we think it's still worth it because the risk of decreasing performance is
low. The potential negative performance impact is that we will be slightly
reducing the size of the ARC (by ~3%).

Porting Notes:
* Added modules options to zfs-module-parameters.5 man page.
* Preserved scaling based on target ARC size rather than max ARC size.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9188
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564
Upstream bug: DLPX-46942
Closes #7273

6 years agoAdd kernel module auto-loading
Brian Behlendorf [Tue, 13 Mar 2018 17:45:55 +0000 (10:45 -0700)]
Add kernel module auto-loading

Historically a dynamic misc minor number was registered for the
/dev/zfs device in order to prevent minor number collisions.  This
was fine but it prevented us from being able to use the kernel
module auto-loaded which requires a known reserved value.

Resolve this issue by adding a configure test to find an available
misc minor number which can then be used in MODULE_ALIAS_MISCDEV at
build time.  By adding this alias the zfs kmod is added to the list
of known static-nodes and the systemd-tmpfiles-setup-dev service
will create a /dev/zfs character device at boot time.

This in turn allows us to update the 90-zfs.rules file to make it
aware this is a static node.  The upshot of this is that whenever
a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be
automatic loaded.  This even works for unprivileged users so there
is no longer a need to manually load the modules at boot time.

As an additional bonus the zed now no longer needs to start after
the zfs-import.service since it will trigger the module load.

In the unlikely event the minor number we selected conflicts with
another out of tree unregistered minor number the code falls back
to dynamically allocating it.  In this case the modules again
must be manually loaded.

Note that due to the change in the method of registering the minor
number the zimport.sh test case may incorrectly fail when the
static node for the installed packages is created instead of the
dynamic one.  This issue will only transiently impact zimport.sh
for this single commit when we transition and are mixing and
matching methods.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7287

6 years agoAdd zfs_scan_ignore_errors tunable
Tim Chase [Tue, 13 Mar 2018 17:43:14 +0000 (12:43 -0500)]
Add zfs_scan_ignore_errors tunable

When it's set, a DTL range will be cleared even if its scan/scrub had
errors.  This allows to work around resilver/scrub upon import when the
pool has errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7293

6 years agoDestroy makes full snap list before destroying
Paul Zuchowski [Mon, 12 Mar 2018 22:24:08 +0000 (18:24 -0400)]
Destroy makes full snap list before destroying

Change zfs destroy logic so destroying begins before
the entire list of snapshots is built.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7271

6 years agoFix race in trace point in zrl_add_impl
Chunwei Chen [Mon, 12 Mar 2018 18:27:02 +0000 (11:27 -0700)]
Fix race in trace point in zrl_add_impl

We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if
zrl->zr_owner got assigned a longer string between when __string()
calculate the strlen, and when __assign_str() does strcpy. The copy will
overflow the buffer.

==
For example:

Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"

Thread A                                 Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
  DTRACE_PROBE2() {
    __string() {
      strlen(zrl->zr_caller) -> 3
      allocate buf[4]
    }

                                        zrl->zr_owner = B
        zrl->zr_caller = "abcd"

    __assign_str() {
      strcpy(buf, zrl->zr_caller) <- buffer overflow
==

Dereferencing zrl->zr_owner->pid may also be problematic, in that the
zrl->zr_owner got changed to other task, and that task exits, freeing
the task_struct. This should be very unlikely, as the other task need to
zrl_remove and exit between the dereferencing zr->zr_owner and
zr->zr_owner->pid. Nevertheless, we'll deal with it as well.

To fix the zrl->zr_caller issue, instead of copy the string content, we
just copy the pointer, this is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7291

6 years agoFix MMP write frequency for large pools
Brian Behlendorf [Mon, 12 Mar 2018 18:26:05 +0000 (11:26 -0700)]
Fix MMP write frequency for large pools

When a single pool contains more vdevs than the CONFIG_HZ for
for the kernel the mmp thread will not delay properly.  Switch
to using cv_timedwait_sig_hires() to handle higher resolution
delays.

This issue was reported on Arch Linux where HZ defaults to only
100 and this could be fairly easily reproduced with a reasonably
large pool.  Most distribution kernels set CONFIG_HZ=250 or
CONFIG_HZ=1000 and thus are unlikely to be impacted.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7205
Closes #7289

6 years agoHold SCL_VDEV when counting leaves
Olaf Faaland [Thu, 8 Mar 2018 23:39:07 +0000 (15:39 -0800)]
Hold SCL_VDEV when counting leaves

A config lock should be held while vdev_count_leaves() walks the tree,
otherwise the pointers reference may become invalid during the walk.

SCL_VDEV is a minimal lock provided for such uses cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286

6 years agoHandle zio_resume and mmp => off
Olaf Faaland [Thu, 8 Mar 2018 23:21:54 +0000 (15:21 -0800)]
Handle zio_resume and mmp => off

When multihost is disabled on a pool, and the pool is resumed via zpool
clear, within a single cycle of the mmp thread's loop (e.g.  while it's
in the cv_timedwait call), both mmp_last_write and mmp_delay should be
updated.

The original code mistakenly treated the two cases as if they could not
occur at the same time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286

6 years agoDocument allowed pool names
Tomohiro Kusumi [Thu, 22 Feb 2018 17:31:34 +0000 (02:31 +0900)]
Document allowed pool names

PR #7208 was a patch to allow non-reserved pool names which begin with
mirror, raidz, spare (but do not equal), however we'd rather document
it in the man page for compatibility with other OpenZFS implementations,
to avoid pool names that may not work on non-Linux platforms.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7216

6 years agoFix zfs-kmod builds when using rpm >= 4.14
LOLi [Fri, 9 Mar 2018 21:52:37 +0000 (22:52 +0100)]
Fix zfs-kmod builds when using rpm >= 4.14

With rpm-software-management/rpm@5e94633 a package version containing
invalid characters (most commonly a double '-') causes the kmod package
generation to terminate with an error.  This change takes advantage of
the newly introduced rpm macro "_wrong_version_format_terminate_build"
to allow kmod packages to be built.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7284

6 years agoChange functions which return literals to return `const char*`
Tomohiro Kusumi [Fri, 9 Mar 2018 21:47:32 +0000 (06:47 +0900)]
Change functions which return literals to return `const char*`

get_format_prompt_string() and zpool_state_to_name() return
a string literal which is read-only, thus they should return
`const char*`.

zpool_get_prop_string() returns a non-const string after
successful nv-lookup, and returns a string literal otherwise.
Since this function is designed to be used for read-only purpose,
the return type should also be `const char*`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7285

6 years agoQAT support for AES-GCM
Tom Caputi [Fri, 9 Mar 2018 21:37:15 +0000 (16:37 -0500)]
QAT support for AES-GCM

This patch adds support for acceleration of AES-GCM encryption
with Intel Quick Assist Technology.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7282

6 years agozdb and inuse tests don't pass with real disks
Paul Zuchowski [Thu, 8 Mar 2018 01:03:33 +0000 (20:03 -0500)]
zdb and inuse tests don't pass with real disks

Due to zpool create auto-partioning in Linux (i.e. sdb1),
certain utilities need to use the parition (sdb1) while
others use the whole disk name (sdb).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #6939
Closes #7261

6 years agoTake user namespaces into account in policy checks
Wolfgang Bumiller [Wed, 7 Mar 2018 23:40:42 +0000 (00:40 +0100)]
Take user namespaces into account in policy checks

Change file related checks to use user namespaces and make
sure involved uids/gids are mappable in the current
namespace.

Note that checks without file ownership information will
still not take user namespaces into account, as some of
these should be handled via 'zfs allow' (otherwise root in a
user namespace could issue commands such as `zpool export`).

This also adds an initial user namespace regression test
for the setgid bit loss, with a user_ns_exec helper usable
in further tests.

Additionally, configure checks for the required user
namespace related features are added for:
  * ns_capable
  * kuid/kgid_has_mapping()
  * user_ns in cred_t

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Closes #6800
Closes #7270

6 years agoZTS: fix send-c_stream_size_estimate
Brian Behlendorf [Wed, 7 Mar 2018 17:55:54 +0000 (09:55 -0800)]
ZTS: fix send-c_stream_size_estimate

The test could fail when attempting to write to a newly created
volume which was missing its device node.  Resolve the issue by
calling block_device_wait() which blocks until udev creates the
needed  entry.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7276
Closes #7277

6 years agoFix dbufstats_001_pos
Giuseppe Di Natale [Wed, 7 Mar 2018 17:53:04 +0000 (09:53 -0800)]
Fix dbufstats_001_pos

Implement a new helper within_tolerance to test if a value
is within range of a target.

Because the dbufstats and dbufs kstat file are being read
at slightly different times, it is possible for stats to be
slightly off. Use within_tolerance to determine if the value
is "close enough" to the target.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7239
Closes #7266

6 years agoAllow to limit zed's syslog chattiness
Tony Hutter [Tue, 6 Mar 2018 23:41:52 +0000 (15:41 -0800)]
Allow to limit zed's syslog chattiness

Some usage patterns like send/recv of replication streams can
produce a large number of events. In such a case, the current
all-syslog.sh zedlet will hold up to its name, and flood the
logs with mostly redundant information. Two mitigate this
situation, this changeset introduces to new variables
ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE
to zed.rc that give more control over which event classes end
up in the syslog.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes #6886
Closes #7260

6 years agoRecord skipped MMP writes in multihost_history
Olaf Faaland [Tue, 27 Feb 2018 01:32:49 +0000 (20:32 -0500)]
Record skipped MMP writes in multihost_history

Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212

6 years agoDetect long config lock acquisition in mmp
Olaf Faaland [Wed, 21 Feb 2018 01:33:51 +0000 (17:33 -0800)]
Detect long config lock acquisition in mmp

If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212

6 years agoIntroduce a destroy_dataset helper
Giuseppe Di Natale [Tue, 6 Mar 2018 22:54:57 +0000 (14:54 -0800)]
Introduce a destroy_dataset helper

Datasets can be busy when calling zfs destroy. Introduce
a helper function to destroy datasets and use it to destroy
datasets in zfs_allow_004_pos, zfs_promote_008_pos, and
zfs_destroy_002_pos.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7224
Closes #7246
Closes #7249
Closes #7267

6 years agoMisc fixes and cleanup for project quota
Nasf-Fan [Mon, 5 Mar 2018 20:56:27 +0000 (04:56 +0800)]
Misc fixes and cleanup for project quota

1) The Coverity Scan reports some issues for the project
   quota patch, including:

1.1) zfs_prop_get_userquota() directly uses the const quota
   type value as the condition check by wrong.

1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid
   to be overwritten by dnode::dn->dn_oldprojid.

2) This patch fixes related issues. It also enhances the logic
   for zfs_project_item_alloc() to avoid buffer overflow.

3) Skip project quota ability check if does not change project
   quota related things (id or flag). Otherwise, it will cause
   chattr (for other non project quota flags) operation failed
   if project quota disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Closes #7251
Closes #7265

6 years agoLinux 4.16 compat: get_disk_and_module()
Giuseppe Di Natale [Mon, 5 Mar 2018 20:44:35 +0000 (12:44 -0800)]
Linux 4.16 compat: get_disk_and_module()

As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk()
is now get_disk_and_module(). Add a configure check to determine
if we need to use get_disk_and_module().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7264

6 years agoChange checksum & IO delay ratelimit values
Tony Hutter [Mon, 5 Mar 2018 01:34:51 +0000 (17:34 -0800)]
Change checksum & IO delay ratelimit values

Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.

Also, convert the checksum and IO delay thresholds to module params for
easy testing.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7252

6 years agoIncrement zil_itx_needcopy_bytes properly
chrisrd [Fri, 2 Mar 2018 18:01:53 +0000 (05:01 +1100)]
Increment zil_itx_needcopy_bytes properly

In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw)
into the log write block (lwb) and send it off using zil_lwb_add_txg().
If we also have WR_NEED_COPY, we additionally copy the lwr's data into
the lwb to be sent off.  If the lwr + data doesn't fit into the lwb, we
send the lrw and as much data as will fit (dnow bytes), then go back
and do the same with the remaining data.

Each time through this loop we're sending dnow data bytes. I.e.
zil_itx_needcopy_bytes should be incremented by dnow.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6988
Closes #7176

6 years agoZTS: fix spurious failures in mv_files
chrisrd [Fri, 2 Mar 2018 17:57:29 +0000 (04:57 +1100)]
ZTS: fix spurious failures in mv_files

The test could fail because of a race condition between the files being
generated in the background and attempting to move the files. Wait for
all file generation to complete before trying to move the files around.

Also, clean up the waiting: the 'wait' command without arguments waits
for all child pids.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7220
Closes #7242
Closes #7258

6 years agoAdd ZFS perf test for dbuf cache
John Wren Kennedy [Wed, 28 Feb 2018 18:38:37 +0000 (10:38 -0800)]
Add ZFS perf test for dbuf cache

This change adds a test for sequential reads out of the dbuf cache.
It's essentially a copy of sequential_reads_cached, using a smaller
data set. The sequential read tests are renamed to differentiate them.

Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes #7225

6 years agoFix some typos
John Eismeier [Wed, 28 Feb 2018 16:57:10 +0000 (11:57 -0500)]
Fix some typos

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: John Eismeier <john.eismeier@gmail.com>
Closes #7237

6 years agoFix zpool(8) list example to match actual format
Tomohiro Kusumi [Wed, 28 Feb 2018 16:54:53 +0000 (01:54 +0900)]
Fix zpool(8) list example to match actual format

a05dfd00 (Illumos 5147) has swapped FRAG and EXPANDSZ,
so it's natural to modify these examples.

 # zpool list | head -1
 NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
                              ^^^^^^^^^^^^^^^

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7244

6 years agoAdd Python 3 rewrite of arc_summary.py
Scot W. Stevenson [Wed, 28 Feb 2018 16:52:34 +0000 (17:52 +0100)]
Add Python 3 rewrite of arc_summary.py

Add new script arc_summary3.py as a complete rewrite of the
arc_summary.py tool (see issue #6873)

Add new options:

        -g/--graph    - Display crude graphic representation
                        of ARC status and quit
        -r/--raw      - Print all available information as
                        minimally formatted list (for grep)
        -s/--section  - Print a single section. This
                        replaces -p/--page, which is kept for
                        backwards use but marked as
                        depreciated

Add new sections with information on ZIL and SPL. Notify user
if sections L2ARC and VDEV are skipped instead of failing
silently. Add warning that -p/--page option is depreciated.

Developed for Python 3.5.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6873
Closes #6892

6 years agoAdd SMART self-test results to zpool status -c
Tony Hutter [Tue, 27 Feb 2018 17:31:27 +0000 (09:31 -0800)]
Add SMART self-test results to zpool status -c

Add in SMART self-test results to zpool status|iostat -c.  This
works for both SAS and SATA drives.

Also, add plumbing to allow the 'smart' script to take smartctl
output from a directory of output text files instead of running
it against the vdevs.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7178

6 years agoRaw DRR_OBJECT records must write raw data
Tom Caputi [Tue, 27 Feb 2018 17:04:05 +0000 (12:04 -0500)]
Raw DRR_OBJECT records must write raw data

b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7215
Closes #7236