Tom Caputi [Thu, 18 Oct 2018 08:13:07 +0000 (04:13 -0400)]
Fix issue with scanning dedup blocks as scan ends
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
Tom Caputi [Tue, 16 Oct 2018 21:00:55 +0000 (17:00 -0400)]
Fix lock inversion in txg_sync_thread()
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
Tom Caputi [Mon, 15 Oct 2018 19:14:22 +0000 (15:14 -0400)]
Fix dbgmsg printing in ztest and zdb
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).
In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.
This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
Tom Caputi [Wed, 10 Oct 2018 20:48:33 +0000 (16:48 -0400)]
Fix random ztest_deadman_thread failures
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
George Melikov [Wed, 24 Oct 2018 03:06:40 +0000 (06:06 +0300)]
Allow use of pool GUID as root pool
It's helpful if there are pools with same names,
but you need to use only one of them.
Main case is twin servers, meanwhile some software
requires the same name of pools (e.g. Proxmox).
Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Igor ‘guardian’ Lidin of Moscow, Russia
Closes #8052
Brian Behlendorf [Wed, 24 Oct 2018 02:53:14 +0000 (19:53 -0700)]
ZTS: Update project quota tests
e2fsprogs v1.44.1, which provides lsattr, added a new attribute
for ext3 called "verity". It is reported after the project quota
flag as a 'V' character in the `lsattr` output.
Update projectid_001_pos.ksh and projecttree_001_pos.ksh to use
a pattern which will match the expected output in both cases.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8043
Matthew Ahrens [Mon, 22 Oct 2018 19:23:30 +0000 (12:23 -0700)]
Make gitrev more reliable
In some build methods, the gitrev is unnecessarily set to "unknown".
We can improve this by changing the gitrev to use
`git describe --always --long --dirty`.
This gets the revision even when no tag matches (--always). It prints
the hash even when it exactly matches a tag (--long). And if there are
uncommitted changes, it appends "-dirty", rather than failing (--dirty).
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Thode <prometheanfire@gentoo.org> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8034
Tom Caputi [Fri, 19 Oct 2018 04:06:18 +0000 (00:06 -0400)]
Defer new resilvers until the current one ends
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.
This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: @mmaybee Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7732
Matthew Thode [Wed, 17 Oct 2018 19:06:05 +0000 (14:06 -0500)]
Allow copy-builtin to work with modified sources
`scripts/make_gitrev.sh` had 'set -e' so if any command failed it would
fail and cause copy-builtin to fail (copy-builtin also has `set -e`.
This commit also simplifies scripts/make_gitrev.sh to always write a
file by using a cleanup function. It also simplifies other areas of
the script as well (making it much shorter).
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #8022
Closes #8025
LOLi [Wed, 17 Oct 2018 18:21:07 +0000 (20:21 +0200)]
zpool: allow sharing of spare device among pools
ZFS allows, by default, sharing of spare devices among different pools;
this commit simply restores this functionality for disk devices and
adds an additional tests case to the ZFS Test Suite to prevent future
regression.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7999
Matthew Ahrens [Wed, 17 Oct 2018 17:31:38 +0000 (10:31 -0700)]
Linux does not HAVE_SMB_SHARE
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8032
Matthew Ahrens [Wed, 17 Oct 2018 17:30:08 +0000 (10:30 -0700)]
Linux does not HAVE_DNLC
Since Linux does not have the Directory Name Lookup Cache, we don't need
the code to manage it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8031
bunder2015 [Wed, 17 Oct 2018 17:25:38 +0000 (13:25 -0400)]
Advise users to retain issue/PR templates
Occasionally we get issues and PRs from users who delete the
templates. Advise users that their issues and PRs may be closed if
they do not fill out the templates as we really need this information.
Also updating PR template to drop unneeded approval toggle as we are
now using issue labels for status tracking.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #8029
Paul Dagnelie [Tue, 16 Oct 2018 18:15:04 +0000 (11:15 -0700)]
Add types to featureflags in zfs
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.
This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accomodate
the new capabilities, and adds a few new functions to increase
encapsulation.
This functionality will be used by the Redacted Send/Receive feature.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7981
ilbsmart [Tue, 16 Oct 2018 18:11:24 +0000 (02:11 +0800)]
deadlock between mm_sem and tx assign in zfs_write() and page fault
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
`mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
thread #2, so it stuck and can't complete, then txg "n" will
not complete.
So thread #1 and thread #2 are deadlocked.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes #7939
Brian Behlendorf [Mon, 15 Oct 2018 23:59:28 +0000 (16:59 -0700)]
Add zts-report.py to python shebang exclusion
Include zts-report.py is the __brp_mangle_shebangs_exclude_from
to resolve build failures in Fedora 28 and newer.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8020
Issue #7360
We're leaking the dd_clones objects in dsl_dir_destroy_sync. This bug
appears to have been around forever. Thankfully the amount of space
typically involved is tiny.
In addition this adds a mechanism in ZDB to find objects in the MOS
which are leaked (not referenced anywhere).
Porting notes:
* Added dd_crypto_obj to ZDB MOS object leak tracking
Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://illumos.org/issues/9847
Closes #7979
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing. While this is the most common cause, it's not the only
cause. Attemping to access a damaged ZAP will result in other
errors.
The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import. Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully. Getting an ECKSUM error from
these ZAPs after the pool in import a far less likely, therefore
the behavior for call paths was not modified.
Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7921
Tony Hutter [Fri, 12 Oct 2018 18:13:34 +0000 (11:13 -0700)]
Define timestruc_t for Lustre compatibility
Lustre 2.8 (and possibly other versions) are still using timestruc_t,
which was removed in spl-0.7.10 in favor of inode_timespec_t. Add
in a backwards compatibility #define for timestruc_t so that Lustre
builds.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8014
Matt Ahrens [Mon, 1 Oct 2018 22:13:12 +0000 (15:13 -0700)]
OpenZFS 9689 - zfs range lock code should not be zpl-specific
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t. However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.
We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.
Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
check to prevent needless warnings when integrating changes
which has not been merged to illumos.
* Reverted free_list range lock changes which were originally
needed to defer the cv_destroy() which was called immediately
after cv_broadcast(). With d2733258 this should be safe but
if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
to this change in order to make it easier to apply OpenZFS 9689.
- d88895a0, which removed the dummy znode from zvol_state
- e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function. Preserved the
rangelock free list. The cv_destroy() function will block waiting
for all processes in cv_wait() to be scheduled and drop their
reference. This is done to ensure it's safe to free the condition
variable. However, blocking while holding the rl->rl_lock mutex
can result in a deadlock on Linux. A free list is introduced to
defer the cv_destroy() and kmem_free() until after the mutex is
released.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes #7980
Alek P [Thu, 11 Oct 2018 04:13:13 +0000 (00:13 -0400)]
Fix changelist mounted-dataset iteration
Commit 0c6d093 caused a regression in the inherit codepath.
The fix is to restrict the changelist iteration on mountpoints and
add proper handling for 'legacy' mountpoints
Garrett Fields [Wed, 10 Oct 2018 15:46:22 +0000 (11:46 -0400)]
Check scheduler for "noop" before setting "noop"
Originally code only checked for presence of "/sys/block/$i/queue/
scheduler". "sh: write error: Invalid argument" was produced when
trying to set "noop" on certain devices (eg. virtio) when it isn't
a listed option. This modification continues to check for the presence
of "/sys/block/$i/queue/scheduler" and also checks that it contains
"noop" as an option before setting "noop".
Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Garrett Fields <ghfields@gmail.com>
Closes #8004
Paul Dagnelie [Tue, 9 Oct 2018 21:05:13 +0000 (14:05 -0700)]
Refactor dmu_recv into its own file
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7982
Update arc_release to use arc_buf_size(). This hunk was accidentally
dropped when porting compressed send/recv, 2aa34383b.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8000
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred. Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t;
This issue may occur when using encryption with refcount debugging
enabled. An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder. When unsharing the buffer the
p_abd should be transferred.
This issue does not impact production builds because refcount holders
are not tracked.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7219
Closes #8000
Matthew Ahrens [Tue, 9 Oct 2018 04:57:02 +0000 (21:57 -0700)]
Create /proc/sys/kernel/spl/gitrev with git hash
The existing mechanisms for determining what code is running in the
kernel do not always correctly report the git hash. The versions
reported there do not reflect changes made since `configure` was run
(i.e. incremental builds do not update the version) and they are
misleading if git tags are not set up properly. This applies to
`modinfo zfs`, `dmesg`, and `/sys/module/zfs/version`.
There are complicated requirements on how the existing version is
generated. Therefore we are leaving that alone, and adding a new
mechanism to record and retrieve the git hash:
`cat /proc/sys/kernel/spl/gitrev`
The gitrev is re-generated at compile time, when running `make`
(including for incremental builds). The value is the output of `git
describe` (or "unknown" if not in a git repo or there are uncommitted
changes).
We're also removing /proc/sys/kernel/spl/version, which was never very
useful.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7931
Closes #7965
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
Matthew Ahrens [Thu, 4 Oct 2018 20:11:45 +0000 (13:11 -0700)]
Warn if checking programs are not installed
`make checkstyle` silently skips checks if the required programs are not
installed (e.g. shellcheck, mandoc). Therefore developers may not
realize that they are not getting the full suite of code checks. This
also applies to more specific targets like `make shellcheck`.
We should print a warning message when a check is skipped due to missing
tools.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7984
Matthew Ahrens [Thu, 4 Oct 2018 20:10:10 +0000 (13:10 -0700)]
Add codecheck make target
We'd like to have tooling that verifies code style, while ignoring the
commit message. For example, code does not need to be signed off in
order to be tested. Current workarounds are to run `git checkstyle` and
ignore the commit message errors, or to run `make cstyle shellcheck
flake8 mancheck testscheck`, and make sure that list stays updated.
Solution is to add a new make target, `codecheck` which does all the
code checks. `checkstyle` is now simply `codecheck` + `commitcheck`.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7985
Paul Dagnelie [Thu, 4 Oct 2018 03:16:45 +0000 (20:16 -0700)]
Fix ASSERT macros to not over-expand
The code reuse in the definitions of the ASSERT and VERIFY macros result
in expansion of their arguments before they are stringified, which
produces ugly and undesirable output.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7884
This change adds a new test case to the zfs-test suite to verify that
when 'zfs destroy' is used on a shared dataset, the dataset will be
unshared after the destroy operation completes.
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #7941
When using "zfs destroy" on a dataset that is using "sharenfs=on" and
has been automatically exported (by libzfs), the dataset will not be
automatically unexported as it should be. This workflow appears to have
been broken by this commit: 3fd3e56cfd543d7d7a1bf502bfc0db6e24139668
In that change, the "zfs_unmount" function was modified to use the
"mnt.mnt_special" field when determining the mount point that is being
unmounted, rather than "mnt.mnt_mountp".
As a result, when "mntpt" is passed into "zfs_unshare_proto", it's value
is now the dataset name rather than the mountpoint. Thus, when this
value is used with the "is_shared" function (via "zfs_unshare_proto") it
will not find a match (since that function assumes it'll be passed the
mountpoint) and incorrectly reports that the dataset is not shared.
This can be easily reproduced with the following commands:
Tom Caputi [Wed, 3 Oct 2018 16:47:11 +0000 (12:47 -0400)]
Refcounted DSL Crypto Key Mappings
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os can
have traditionally been allowed to "leak" into the next txg after
the dataset is disowned.
In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned ater all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7949
Jerry Jelinek [Thu, 30 Aug 2018 13:13:30 +0000 (13:13 +0000)]
OpenZFS 9700 - ZFS resilvered mirror does not balance reads
Authored by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Matthew Ahrens <mahrens@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/82f63c3c
Closes #7973
Alek P [Tue, 2 Oct 2018 19:30:58 +0000 (15:30 -0400)]
changelist should be able to iter on mounts
Modified changelist_gather()ing for the mountpoint property.
Now instead of iterating on all dataset descendants, we read
/proc/self/mounts and iterate on the mounted descendant datasets only.
Switched changelist implementation from a uu_list_* to uu_avl_* in
order to reduce changlist code-path's worst case time complexity.
Reviewed by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7967
Mitigate the likelihood of the newly created volumes being busy
when the 'zfs destroy -r' is issued by waiting for udev to settle.
Since this is not a iron clad fix I've added the test case to
the known list of possible failures and referenced issue #7961.
Finally, in the case this test does fail fix the cleanup logic
so subsequent tests won't incorrectly fail.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7961
Closes #7962
Tim Schumacher [Mon, 1 Oct 2018 17:42:05 +0000 (19:42 +0200)]
Prefix all refcount functions with zfs_
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.
To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7963
Matthew Ahrens [Mon, 1 Oct 2018 17:40:11 +0000 (10:40 -0700)]
Remove duplicate macro in dsl_dir.h
The DD_FIELD_LAST_REMAP_TXG macro was added twice (with the same value).
This change removes one of them.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7968
Due to a flaw in 4589f3ae the number of unique combinations
could be calculated incorrectly. This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.
This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.
The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest. It is implemented such it will never
make a known recoverable block unrecoverable.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900
Closes #7934
John Gallagher [Wed, 26 Sep 2018 18:08:12 +0000 (11:08 -0700)]
Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):
* We don't account for the fact that seq_file sometimes visits a node
multiple times, which results in missing messages when read through
procfs.
* We don't keep separate state for each reader of a file, so concurrent
readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
the list between read syscalls, so reading from these files in procfs
can cause the system to crash.
This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.
Reviewed by: Don Brady <don.brady@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
Gregor Kopka [Wed, 26 Sep 2018 18:02:26 +0000 (20:02 +0200)]
Fix flake 8 style warnings
Ran zts-report.py and test-runner.py from ./tests/test-runner/bin/
through the 2to3 (https://docs.python.org/2/library/2to3.html).
Checked the result, fixed:
- 'maxint' -> 'maxsize' that 2to3 missed.
- 'cmp=' parameter for a 'sorted()' with a 'key=' version.
- try/except wrapping of configparser import as there are still
python 2.7 systems that lack a compatibility shim
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7925
Closes #7952
Tim Schumacher [Wed, 26 Sep 2018 17:29:26 +0000 (19:29 +0200)]
Linux 4.19-rc3+ compat: Remove refcount_t compat
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.
Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.
Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7885
Closes #7932
Brian Behlendorf [Wed, 26 Sep 2018 16:50:58 +0000 (09:50 -0700)]
Fix small sysfs leak
When zfs_kobj_init() is called with an attr_cnt of 0 only the
kobj->zko_default_attrs is allocated. It subsequently won't
get freed in zfs_kobj_release since the free is wrapped in
a kobj->zko_attr_count != 0 conditional.
Split the block in zfs_kobj_release() to make sure the
kobj->zko_default_attrs are freed in this case.
Additionally, fix a minor spelling mistake and typo in
zfs_kobj_init() which could also cause a leak but in practice
is almost certain not to fail.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7957
Gregor Kopka [Tue, 25 Sep 2018 23:29:16 +0000 (01:29 +0200)]
Zpool iostat: remove latency/queue scaling
Bandwidth and iops are average per second while *_wait are averages
per request for latency or, for queue depths, an instantaneous
measurement at the end of an interval (according to man zpool).
When calculating the first two it makes sense to do
x/interval_duration (x being the increase in total bytes or number of
requests over the duration of the interval, interval_duration in
seconds) to 'scale' from amount/interval_duration to amount/second.
But applying the same math for the latter (*_wait latencies/queue) is
wrong as there is no interval_duration component in the values (these
are time/requests to get to average_time/request or already an
absulute number).
This bug leads to the only correct continuous *_wait figures for both
latencies and queue depths from 'zpool iostat -l/q' being with
duration=1 as then the wrong math cancels itself (x/1 is a nop).
This removes temporal scaling from latency and queue depth figures.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7945
Closes #7694
Brian Behlendorf [Tue, 25 Sep 2018 00:11:25 +0000 (17:11 -0700)]
Fix statfs(2) for 32-bit user space
When handling a 32-bit statfs() system call the returned fields,
although 64-bit in the kernel, must be limited to 32-bits or an
EOVERFLOW error will be returned.
This is less of an issue for block counts since the default
reported block size in 128KiB. But since it is possible to
set a smaller block size, these values will be scaled as
needed to fit in a 32-bit unsigned long.
Unlike most other filesystems the total possible file counts
are more likely to overflow because they are calculated based
on the available free space in the pool. In order to prevent
this the reported value must be capped at 2^32-1. This is
only for statfs(2) reporting, there are no changes to the
internal ZFS limits.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7927
Closes #7122
Closes #7937
This change simplify the test case removing part of the logic which was
introducing a race condition and thus causing spurious failures: we use
attempt_during_removal() from removal.kshlib instead which has been
observed to be more stable.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7894
Closes #7913
Gregor Kopka [Mon, 24 Sep 2018 17:12:59 +0000 (19:12 +0200)]
Fix flake 8 style warnings
Ran zts-report.py and test-runner.py from ./tests/test-runner/bin/
through the 2to3 (https://docs.python.org/2/library/2to3.html).
Checked the result, fixed:
- 'maxint' -> 'maxsize' that 2to3 missed.
- 'cmp=' parameter for a 'sorted()' with a 'key=' version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: John Wren Kennedy <jwk404@gmail.com> Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7925
Closes #7929
Since "\001" (ASCII Start Of Header) is not printable this results in
weird characters displayed when inspecting the debug log. This commit
simply removes this superfluous prefix passed to zfs_dbgmsg().
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7936
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: DHE <git@dehacked.net>
Closes #7938
This change adds limits to the possible spa_slop_shift values set via
the sysfs interface. Accepted values are from a minimum of 1 to a
maximum of 31 (inclusive): these limits are based on the following
values observed on a 128PB file-vdev test pool:
Richard Yao [Tue, 18 Sep 2018 19:03:47 +0000 (15:03 -0400)]
Add NEWS file
I received a request for a NEWS file. That needs to be handled by Tony
and Brian, but for now, we can at least provide one that provides a link
to github so that users who expect NEWS files from their packages will
know where to go for release information.
Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #7918
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: DHE <git@dehacked.net>
Closes #7920
Gregor Kopka [Tue, 18 Sep 2018 15:55:33 +0000 (17:55 +0200)]
Man page fixes - zpool/zfs optional parameters
The man pages for zpool and zfs (get command)
listed the pool/dataset parameter as required,
but these are optional. Fixed that.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7916
Brian Behlendorf [Tue, 18 Sep 2018 00:28:18 +0000 (17:28 -0700)]
Clarify 'zpool remove' restrictions
Update zpool(8) to clarify what type of vdevs may be safely
removed and that the existence of any top-level raidz device
which is part of the primary pool will prevent device removal.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7880
Closes #7893
This change improve the handling of invalid filesystem properties when
specified at pool creation: this is useful when 'zpool create -n'
(dry run) is executed to detect invalid fs-level options (-O) before
the actual command is run.
Brian Behlendorf [Thu, 13 Sep 2018 20:35:09 +0000 (13:35 -0700)]
Add removal_resume_export to zts-report.py
Add the removal_resume_export test case to the possible failure
section of the zts-report.py and reference the Github issue. In
the CI environment this test has proven to be unreliable due to
the way it detects the removal thread. This is a flaw in the test
and not device removal so update the result summary accordingly.
Additionally, increase the allowed timeout in an effort to reduce
the observed rate of false positves.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7895
Issue #7894
Don Brady [Fri, 7 Sep 2018 04:44:52 +0000 (22:44 -0600)]
Fix in-kernel sysfs entries
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module). When not built as a module add
the sysfs entries under /sys/fs/zfs/.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7868
Closes #7872
When no permission set is defined for a dataset the create time
permissions are incorrectly shown as if they were a permission set.
This change simply correct how allow permissions are displayed.
This commit also fixes a small manpage formatting issue and adds the
"zfs_allow_003_pos" test case to the ZFS Test Suite.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7519
Closes #7860
Don Brady [Thu, 6 Sep 2018 01:33:36 +0000 (19:33 -0600)]
Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Gregor Kopka <gregor@kopka.net> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
As a regular kernel function, kern_path() returns errors as negative
errnos, such as -ELOOP. zfsctl_snapdir_vget() must convert these into
the positive errnos used throughout the ZFS code when it returns them
to other ZFS functions so that the ZFS code properly sees them as
errors.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Siebenmann <cks.git01@cs.toronto.edu>
Closes #7764
Closes #7864
Revert "Update zfs_admin_snapshot default value (disabled)"
This reverts commit a6214a0ae9e78d3cac0e495e2fcf7af0858a872f.
Disabling zfs_admin_snapshot by default results in multiple ZTS
tests failing which depend on this functionality. Revert this
change until the relevant test cases can be updated.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7838
Relax allocation throttling for ditto blocks. Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment. Slightly less strict policy allows both
improve data security and surprisingly write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.
Sponsored by: iXsystems, Inc.
Authored by: mav <mav@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9751
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/8253837ac3
Closes #7857
mav [Fri, 17 Aug 2018 15:00:41 +0000 (15:00 +0000)]
OpenZFS 9738 - Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.
Authored by: mav <mav@FreeBSD.org>
Reviewed by: Paul Dagnelie <pcd@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9738
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/63e7138
Closes #7858
Don Brady [Sun, 2 Sep 2018 19:14:01 +0000 (15:14 -0400)]
Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).
Programmatically document, via a zfs_ioc_key_t, the valid arguments
for the ioc commands that rely on nvpair input arguments (i.e. non
legacy commands from libzfs_core). Automatically verify the expected
pairs before dispatching a command.
This initial phase focuses on the non-legacy ioctls. A follow-on
change can address the legacy ioctl input from the zfs_cmd_t.
The zfs_ioc_key_t for zfs_keys_channel_program looks like:
Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE,
EBADF, and E2BIG).
ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7780
Don Brady [Sun, 2 Sep 2018 19:09:53 +0000 (15:09 -0400)]
Add zfs module feature and property info to sysfs
This extends our sysfs '/sys/module/zfs' entry to include feature
and property attributes. The primary consumer of this information
is user processes, like the zfs CLI, that need to know what the
current loaded ZFS module supports. The libzfs binary will consult
this information when instantiating the zfs and zpool property
tables and the pool features table.
This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
features.runtime
features.pool
properties.dataset
properties.pool
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7706
Brian Behlendorf [Fri, 31 Aug 2018 22:30:44 +0000 (15:30 -0700)]
ZTS: Fix EBUSY volume destroy failures
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup functions
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy. This was done not only for
volumes but also for file systems for consistency.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7854
Brian Behlendorf [Fri, 31 Aug 2018 21:20:34 +0000 (14:20 -0700)]
Allow ECKSUM in vdev_checkpoint_sm_object()
The checkpoint space map object may not be accessible from the
vdev's ZAP when it has been damaged. This may be the case when
performing an extreme rewind when importing the pool.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7853
Matthew Ahrens [Fri, 31 Aug 2018 17:16:54 +0000 (13:16 -0400)]
clean up __dbuf_hold_impl
We can simplify the dbuf_hold code by allocating dbuf_hold_arg_t's on
demand, rather than allocating a big array of them up front. While this
can occasionally increase the number of allocations, typically only one
allocation is needed since the indirect block is already cached.
The performance test suite gets the same results with this change.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7841
Richard Elling [Thu, 30 Aug 2018 21:43:37 +0000 (14:43 -0700)]
ZTS: Fix DEV_DSKDIR trim from disk
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #7848
Brian Behlendorf [Thu, 30 Aug 2018 20:38:09 +0000 (13:38 -0700)]
ZTS: Fix zfs_create_013_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7847
Tom Caputi [Wed, 29 Aug 2018 18:33:33 +0000 (14:33 -0400)]
OpenZFS 9403 - assertion failed in arc_buf_destroy()
Assertion failed in arc_buf_destroy() when concurrently reading
block with checksum error.
Porting notes:
* The ability to zinject decompression errors has been added, but
this only works at the zio_decompress() level, where we have all
of the info we need to match against the user's zinject options.
* The decompress_fault test has been added to test the new zinject
functionality
* We attempted to set zio_decompress_fail_fraction to (1 << 18) in
ztest for further test coverage. Although this did uncover a few
low priority issues, this unfortuantely also causes ztest to
ASSERT in many locations where the code is working correctly since
it is designed to fail on IO errors. Developers can manually set
this variable with the '-o' option to find and debug issues.
Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Tom Caputi <tcaputi@datto.com>
OpenZFS-issue: https://illumos.org/issues/9403
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9
Closes #7822
Tom Caputi [Mon, 20 Aug 2018 20:42:17 +0000 (16:42 -0400)]
Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.
While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
Brian Behlendorf [Mon, 27 Aug 2018 17:04:21 +0000 (10:04 -0700)]
Direct IO support
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g. readahead) while preventing contention for system
memory between the database and kernel caches.
On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.
Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.
1. Minimize cache effects of the I/O.
By design the ARC is already scan-resistant which helps mitigate
the need for special O_DIRECT handling. Data which is only
accessed once will be the first to be evicted from the cache.
This behavior is in consistent with Illumos and FreeBSD.
Future performance work may wish to investigate the benefits of
immediately evicting data from the cache which has been read or
written with the O_DIRECT flag. Functionally this behavior is
very similar to applying the 'primarycache=metadata' property
per open file.
2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.
No additional alignment or length restrictions are imposed.
3. O_DIRECT _MAY_ perform unbuffered IO operations directly
between user memory and block device.
No unbuffered IO operations are currently supported. In order
to support features such as transparent compression, encryption,
and checksumming a copy must be made to transform the data.
4. O_DIRECT _MAY_ imply O_DSYNC (XFS).
O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide
O_DSYNC to request synchronous semantics.
5. O_DIRECT _MAY_ disable file locking that serializes IO
operations. Applications should avoid mixing O_DIRECT
and normal IO or mmap(2) IO to the same file. This is
particularly true for overlapping regions.
All I/O in ZFS is locked for correctness and this locking is not
disabled by O_DIRECT. However, concurrently mixing O_DIRECT,
mmap(2), and normal I/O on the same file is not recommended.
This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations. Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224
Closes #7823
Brian Behlendorf [Sun, 26 Aug 2018 19:59:33 +0000 (12:59 -0700)]
Remove %changelog from spec file
Remove the %changelog section from the spec files since it does
not get updated in the master branch. Not only does this mean
the information is stale, but it can result in 'make deb' failing
to build packages, issue #7825. This section should be updated
for tagged releases.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7825
Closes #7827
Fix a bunch of truncation compiler warnings that show up
on Fedora 28 (GCC 8.0.1).
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7368
Closes #7826
Closes #7830