Nasf-Fan [Tue, 13 Feb 2018 22:54:54 +0000 (06:54 +0800)]
Project Quota on ZFS
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.
Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').
By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.
Support the following commands and functionalities:
zfs set projectquota@project
zfs set projectobjquota@project
zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project
This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>
For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by Ned Bass <bass6@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes #6290
LOLi [Mon, 12 Feb 2018 20:28:59 +0000 (21:28 +0100)]
'zfs receive' fails with "dataset is busy"
Receiving an incremental stream after an interrupted "zfs receive -s"
fails with the message "dataset is busy": this is because we still have
the hidden clone ../%recv from the resumable receive.
Improve the error message suggesting the existence of a partially
complete resumable stream from "zfs receive -s" which can be either
aborted ("zfs receive -A") or resumed ("zfs send -t").
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7129
Closes #7154
LOLi [Mon, 12 Feb 2018 19:40:00 +0000 (20:40 +0100)]
contrib/initramfs: add missing conf.d/zfs
When upgrading from the distribution-provided zfs-initramfs package on
root-on-zfs Ubuntu and Debian the system may fail to boot: this change
adds the missing initramfs configuration file.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7158
8520 lzc_rollback_to should support rolling back to origin
7198 libzfs should gracefully handle EINVAL from lzc_rollback
lzc_rollback_to() should support rolling back to a clone's origin.
The current checks in zfs_ioc_rollback() would not allow that
because the origin snapshot belongs to a different filesystem.
The overly restrictive check was in introduced in 7600, but it
was not a regression as none of the existing tools provided a
way to rollback to the origin.
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8520
OpenZFS-issue: https://www.illumos.org/issues/7198
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/78a5a1a25a
Closes #7150
sanjeevbagewadi [Fri, 9 Feb 2018 18:15:53 +0000 (23:45 +0530)]
Handle zap_add() failures in mixed case mode
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceed the
capacity of the leaf node of a Fatzap. This results in a ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and rollback the transactions.
Chunwei Chen [Fri, 2 Feb 2018 00:36:40 +0000 (16:36 -0800)]
Fix zdb -ed on objset for exported pool
zdb -ed on objset for exported pool would failed with:
failed to own dataset 'qq/fs0': No such file or directory
The reason is that zdb pass objset name to spa_import, it uses that
name to create a spa. Later, when dmu_objset_own tries to lookup the spa
using real pool name, it can't find one.
We fix this by make sure we pass pool name rather than objset name to
spa_import.
Chunwei Chen [Fri, 2 Feb 2018 00:19:36 +0000 (16:19 -0800)]
Fix zdb -R decompression
There are some issues in the zdb -R decompression implementation.
The first is that ZLE can easily decompress non-ZLE streams. So we add
ZDB_NO_ZLE env to make zdb skip ZLE.
The second is the random bytes appended to pabd, pbuf2 stuff. This serve
no purpose at all, those bytes shouldn't be read during decompression
anyway. Instead, we randomize lbuf2, so that we can make sure
decompression fill exactly to lsize by bcmp lbuf and lbuf2.
The last one is the condition to detect fail is wrong.
Chunwei Chen [Tue, 30 Jan 2018 21:39:11 +0000 (13:39 -0800)]
Fix zdb -c traverse stop on damaged objset root
If a corruption happens to be on a root block of an objset, zdb -c will
not correctly report the error, and it will not traverse the datasets
that come after. This is because traverse_visitbp, which does the
callback and reset error for TRAVERSE_HARD, is skipped when traversing
zil is failed in traverse_impl.
Here's example of what 'zdb -eLcc' command looks like on a pool with
damaged objset root:
Related to commit 4859fe796, when directly using the kernel's
refcount functions in kernel compatibility code do not map
refcount_t to zfs_refcount_t. This leads to a type mismatch.
Longer term we should consider renaming refcount_t to
zfs_refcount_t in the zfs code base.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
A new interface was added to manipulate the version field of an
inode. Add a inode_set_iversion() wrapper for older kernels and
use the new interface when available.
The i_version field was dropped from the trace point due to the
switch to an atomic64_t i_version type.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com>
We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.
=== What is implemented?
This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
the context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
=== How do we handle zfs.sync functions in open context?
When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...
if we run it in open context, we will get back the following error:
Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
[C]: in function 'destroy'
[string "channel program"]:3: in main chunk
=== What about testing?
We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com>
Every time we want to unmount a snapshot (happens during snapshot
deletion or renaming) we unnecessarily iterate through all the
mountpoints in the VFS layer (see zfs_get_vfs).
The current patch completely gets rid of that code and changes
the approach while keeping the behavior of that code path the
same. Specifically, it puts a hold on the dataset/snapshot and
gets its vfs resource reference directly, instead of linearly
searching for it. If that reference exists we attempt to amount
it.
With the above change, it became obvious that the nvlist
manipulations that we do (add_boolean and add_nvlist) take a
significant amount of time ensuring uniqueness of every new
element even though they don't have too. Thus, we updated the
patch so those nvlists are not trying to enforce the uniqueness
of their elements.
A more complete analysis of the problem solved by this patch
can be found below:
https://sdimitro.github.io/post/snap-unmount-perf/
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com>
ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this entails extra
logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.
Brad Lewis [Thu, 8 Feb 2018 16:20:33 +0000 (09:20 -0700)]
OpenZFS 8592 - ZFS channel programs - rollback
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com>
ZFS channel programs should be able to perform a rollback.
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com>
zfs.exists() in channel programs doesn't return any result, and should
have a man page entry. This patch corrects zfs.exists so that it
returns a value indicating if the dataset exists or not. It also adds
documentation about it in the man page.
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Don Brady <don.brady@delphix.com> Ported-by: John Kennedy <john.kennedy@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533
Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
LOLi [Wed, 7 Feb 2018 20:43:24 +0000 (21:43 +0100)]
Verify ZFS Test Suite scripts executability
This change adds a make target 'testscheck' which verifies all ksh test
scripts, part of the ZFS Test Suite, have their executable bit set:
this should avoid adding new test scripts that cannot be executed by
the test-runner.py which would result in the following warning message:
Warning: Test removed from TestGroup because it failed verification.
This also verifies both *.cfg and *.kshlib files used in the ZFS Test
Suite are not executable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7131
Richard Elling [Wed, 7 Feb 2018 19:54:20 +0000 (11:54 -0800)]
Remove deprecated zfs_arc_p_aggressive_disable
zfs_arc_p_aggressive_disable is no more. This PR removes docs
and module parameters for zfs_arc_p_aggressive_disable.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Elling <Richard.Elling@RichardElling.com>
Closes #7135
The generated zfs-load-key.sh file should have been added to
the .gitignore file as part of commit 7da8f8d8. And the
generated file should not be included in the repo.
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7134
DeHackEd [Wed, 7 Feb 2018 00:35:16 +0000 (19:35 -0500)]
Fix `zpool status` overflow on fast scrubs
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: DHE <git@dehacked.net>
Closes #7133
The distribution provided architecture specific RPM macro files
for x86_64 and other architectures on Debian/Ubuntu specify the
wrong default libdir install location. When building deb packages
override _lib with the correct location.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7083
Closes #7101
Tom Caputi [Tue, 6 Feb 2018 00:57:53 +0000 (19:57 -0500)]
Adjust ARC prefetch tunables to match docs
Currently, the ARC exposes 2 tunables (zfs_arc_min_prefetch_ms
and zfs_arc_min_prescient_prefetch_ms) which are documented
to be specified in milliseconds. However, the code actually
uses the values as though they were in seconds. This patch
adjusts the code to match the names and documentation of the
tunables.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7126
In order to reliably detect deadlocks in the create and import
path ztest should set the failure mode property. This ensures
that the pool is always using the correct failmode behavior.
Removed insidious use of local variable in MAXFAULTS macro.
Converted VERIFY() to VERIFY0() where appropriate.
Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7111
The keystore.sk_dk_lock should not be held while performing I/O.
Drop the lock when reading from disk and update the code so
they the first successful caller adds the key.
Improve error handling in spa_keystore_create_mapping_impl().
Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: RageLtMan <rageltman@sempervictus> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7112
Closes #7115
LOLi [Fri, 2 Feb 2018 21:50:42 +0000 (22:50 +0100)]
Fix systemd_ RPM macros usage on Debian-based distributions
Debian-based distributions do not seem to provide RPM macros for
dealing with systemd pre- and post- (un)install actions: this results
in errors when installing or upgrading .deb packages because the
resulting control scripts contain the following unresolved macros:
Tom Caputi [Thu, 1 Feb 2018 20:37:24 +0000 (15:37 -0500)]
Change os->os_next_write_raw to work per txg
Currently, os_next_write_raw is a single boolean used for determining
whether or not the next call to dmu_objset_sync() should write out
the objset_phys_t as a raw buffer. Since the boolean is not associated
with a txg, the work simply happens during the next txg, which is not
necessarily the correct one. In the current implementation this issue
was misdiagnosed, resulting in a small hack in dmu_objset_sync() which
seemed to resolve the problem.
This patch changes os_next_write_raw to be an array of booleans, one
for each txg in TXG_OFF and removes the hack.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
Tom Caputi [Fri, 19 Jan 2018 09:19:47 +0000 (04:19 -0500)]
Raw sends must be able to decrease nlevels
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.
This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.
This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6821
Closes #6864
Tom Caputi [Mon, 4 Dec 2017 19:10:31 +0000 (14:10 -0500)]
Fix recovery import (-F) with encrypted pool
When performing zil_claim() at pool import time, it is
important that encrypted datasets set os_next_write_raw
before writing to the zil_header_t. This prevents the code
from attempting to re-authenticate the objset_phys_t when
it writes it out, which is unnecessary because the
zil_header_t is not protected by either objset MAC and
impossible since the keys aren't loaded yet. Unfortunately,
one of the code paths did not set this flag, which causes
failed ASSERTs during 'zpool import -F'. This patch corrects
this issue.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
Closes #6916
Tom Caputi [Wed, 8 Nov 2017 19:12:59 +0000 (14:12 -0500)]
Encryption Stability and On-Disk Format Fixes
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.
Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.
This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.
This patch also contains minor bug fixes and cleanups.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6845
Closes #6864
Closes #7052
Tom Caputi [Wed, 31 Jan 2018 23:17:56 +0000 (18:17 -0500)]
Change movaps to movups in AES-NI code
Currently, the ICP contains accelerated assembly code to be
used specifically on CPUs with AES-NI enabled. This code
makes heavy use of the movaps instruction which assumes that
it will be provided aes keys that are 16 byte aligned. This
assumption seems to hold on Illumos, but on Linux some kernel
options such as 'slub_debug=P' will violate it. This patch
changes all instances of this instruction to movups which is
the same except that it can handle unaligned memory.
This patch also adds a few flags which were accidentally never
given to the assembly compiler, resulting in objtool warnings.
Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Nathaniel R. Lewis <linux.robotdude@gmail.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7065
Closes #7108
Brian Behlendorf [Wed, 31 Jan 2018 17:33:33 +0000 (09:33 -0800)]
Fix txg_sync_thread hang in scan_exec_io()
When scn->scn_maxinflight_bytes has not been initialized it's
possible to hang on the condition variable in scan_exec_io().
This issue was uncovered by ztest and is only possible when
deduplication is enabled through the following call path.
Resolve the issue by always initializing scn_maxinflight_bytes
to a reasonable minimum value. This value will be recalculated
in dsl_scan_sync() to pick up changes to zfs_scan_vdev_limit
and the addition/removal of vdevs.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7098
LOLi [Tue, 30 Jan 2018 23:54:33 +0000 (00:54 +0100)]
Fix 'zfs receive -o' when used with '-e|-d'
When used in conjunction with one of '-e' or '-d' zfs receive options
none of the properties requested to be set (-o) are actually applied:
this is caused by a wrong assumption made about the toplevel dataset
in zfs_receive_one().
Fix this by correctly detecting the toplevel dataset.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7088
bunder2015 [Mon, 29 Jan 2018 23:10:32 +0000 (18:10 -0500)]
Fix zpool-features(5) large_block inconsistency
Large_blocks feature activation was not consistent with man page,
which erroneously stated that the feature was active when the
recordsize was increased past the stock 128KB. It actually
becomes active when data is written to the dataset.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6275
Closes #7093
LOLi [Mon, 29 Jan 2018 23:05:03 +0000 (00:05 +0100)]
Fix style issues in man pages and commands help
* Remove 'zfs snap' from zfs help message (OpenZFS sync)
* Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot'
* Enforce 80 columns limit in help messages
* Remove zfs_disable_dup_eviction from zfs-module-parameters(5)
* Expose zfs_scan_max_ext_gap as a kernel module parameter.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7087
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.
Correct format of dbuf kstat file.
Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.
Introduce field filtering in the dbufstat python script.
Introduce a no header option to the dbufstat python script.
Introduce a test case to test basic mru->mfu list movement
in the ARC.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6906
Prakash Surya [Mon, 8 Jan 2018 21:45:53 +0000 (13:45 -0800)]
OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue
PROBLEM
=======
When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible
for either `ERESTART` or `EIO` to be returned.
If `ERESTART` is returned, this will cause an assertion to fail directly
in `zil_lwb_write_issue`, where the code assumes the return value is
`EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the
SPA is suspended when `dmu_tx_assign` is called, and most often occurs
when running `zloop`.
If `EIO` is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, `zil_commit_waiter_timeout` contains the
following logic:
In this case, if `dmu_tx_assign` returned `EIO` from within
`zil_lwb_write_issue`, the `lwb` variable passed in will not be issued
to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and
this assertion will fail. `zil_commit_waiter_timeout` assumes that after
it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where `dmu_tx_assign` returns `EIO`.
SOLUTION
========
This change modifies the `dmu_tx_assign` function such that `txg_how` is
a bitmask, rather than of the `txg_how_t` enum type. Now, the previous
`TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with
specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics.
Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was
automatically invoked. This was not ideal when using `TXG_WAITED` within
`zil_lwb_write_issued`, leading the problem described above. Rather, we
want to achieve the semantics of `TXG_WAIT`, while also preventing the
`tx` from being penalized via the dirty delay throttling.
With this change, `zil_lwb_write_issued` can acheive the semtantics that
it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to
`dmu_tx_assign`.
Further, consumers of `dmu_tx_assign` wishing to achieve the old
`TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`.
Authored by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE`
Chunwei Chen [Fri, 26 Jan 2018 18:49:46 +0000 (10:49 -0800)]
zpool import -d to specify device path
When we know which devices have the pool we are looking for, sometime
it's better if we can directly pass those device paths to zpool import
instead of letting it to search through all unrelated stuff, which might
take a lot of time if you have hundreds of disks.
This patch allows option -d <dev_path> to zpool import. You can have
multiple pairs of -d <dev_path>, and zpool import will only search
through those devices. For example:
zpool import -d /dev/sda -d /dev/sdb
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7077
Brian Behlendorf [Fri, 26 Jan 2018 17:55:16 +0000 (09:55 -0800)]
Update README.initramfs.markdown
Fix several typos and grammar.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Arno van Wyk <avw1987@users.noreply.github.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7080
Brian Behlendorf [Mon, 22 Jan 2018 20:48:39 +0000 (12:48 -0800)]
Extend zloop.sh for automated testing
In order to debug issues encountered by ztest during automated
testing it's important that as much debugging information as
possible by dumped at the time of the failure. The following
changes extend the zloop.sh script in order to make it easier
to integrate with buildbot.
* Add the `-m <maximum cores>` option to zloop.sh to place a
limit of the number of core dumps generated. By default, the
existing behavior is maintained and no limit is set.
* Add the `-l` option to create a 'ztest.core.N' symlink in the
current directory to the core directory. This functionality
is provided primarily for buildbot which expects log files to
have well known names.
* Rename 'ztest.ddt' to 'ztest.zdb' and extend it to dump
additional basic information on failure for latter analysis.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
The zdb(8) command may not terminate in the case where the pool
gets suspended and there is a caller in zio_wait() blocking on
an outstanding read I/O that will never complete. This can in
turn cause ztest(1) to block indefinitely despite the deadman.
Resolve the issue by setting the default failure mode for zdb(8)
to panic. In user space we always want the command to terminate
when forward progress is no longer possible.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
Brian Behlendorf [Mon, 18 Dec 2017 22:06:07 +0000 (17:06 -0500)]
Extend deadman logic
The intent of this patch is extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems. The proposed changes include:
* Added a new `zfs_deadman_failmode` module option which is
used to dynamically control the behavior of the deadman. It's
loosely modeled after, but independant from, the pool failmode
property. It can be set to wait, continue, or panic.
* wait - Wait for the "hung" I/O (default)
* continue - Attempt to recover from a "hung" I/O
* panic - Panic the system
* Added a new `zfs_deadman_ziotime_ms` module option which is
analogous to `zfs_deadman_synctime_ms` except instead of
applying to a pool TXG sync it applies to zio_wait(). A
default value of 300s is used to define a "hung" zio.
* The ztest deadman thread has been re-enabled by default,
aligned with the upstream OpenZFS code, and then extended
to terminate the process when it takes significantly longer
to complete than expected.
* The -G option was added to ztest to print the internal debug
log when a fatal error is encountered. This same option was
previously added to zdb in commit fa603f82. Update zloop.sh
to unconditionally pass -G to obtain additional debugging.
* The FM_EREPORT_ZFS_DELAY event which was previously posted
when the deadman detect a "hung" pool has been replaced by
a new dedicated FM_EREPORT_ZFS_DEADMAN event.
* The proposed recovery logic attempts to restart a "hung"
zio by calling zio_interrupt() on any outstanding leaf zios.
We may want to further restrict this to zios in either the
ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
Calling zio_interrupt() is expected to only be useful for
cases when an IO has been submitted to the physical device
but for some reasonable the completion callback hasn't been
called by the lower layers. This shouldn't be possible but
has been observed and may be caused by kernel/driver bugs.
* The 'zfs_deadman_synctime_ms' default value was reduced from
1000s to 600s.
* Depending on how ztest fails there may be no cache file to
move. This should not be considered fatal, collect the logs
which are available and carry on.
* Add deadman test cases for spa_deadman() and zio_wait().
* Increase default zfs_deadman_checktime_ms to 60s.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7079
Brian Behlendorf [Fri, 19 Jan 2018 17:22:37 +0000 (09:22 -0800)]
OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL
usr/src/uts/common/sys/fs/zfs.h
Change ZPROP_INVAL and ZPROP_CONT from macros to enum values. Clang
and GCC both prefer to use unsigned ints to store enums. That was
causing tautological comparison warnings (and likely eliminating
error handling code at compile time) whenever a zfs_prop_t or
zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT. Making the
error flags be explicity enum values forces the enum types to be
signed.
ZPROP_INVAL was also compared against two different enum types. I
had to change its name to ZPOOL_PROP_INVAL whenever its compared to
a zpool_prop_t. There are still some places where ZPROP_INVAL or
ZPROP_CONT is compared to a plain int, in code that doesn't know
whether the int is storing a zfs_prop_t or a zpool_prop_t.
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74
Closes #7061
Brian Behlendorf [Fri, 19 Jan 2018 17:20:58 +0000 (09:20 -0800)]
OpenZFS 8641 - "zpool clear" and "zinject" don't work on "spare" or "replacing" vdevs
Add "spare" and "replacing" to the list of interior vdev types in
zpool_vdev_is_interior(), alongside the existing "mirror" and "raidz".
This fixes running "zinject -d" and "zpool clear" on spare and replacing
vdevs.
Authored by: Alan Somers <asomers@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8641
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9a36801382
Closes #7060
Matthew Thode [Thu, 18 Jan 2018 18:20:34 +0000 (18:20 +0000)]
Run zfs load-key if needed in dracut
'zfs load-key -a' will only be called if needed. If a dataset not
needed for boot does not have its key loaded (home directories for
example) boot can still continue.
zfs:AUTO was not working via dracut, so we still need the generator
script to do its thing.
Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #6982
Closes #7004
LOLi [Thu, 18 Jan 2018 18:15:41 +0000 (19:15 +0100)]
Fix Debian packaging on ARMv7/ARM64
When building packages on Debian-based systems specify the target
architecture used by 'alien' to convert .rpm packages into .deb: this
avoids detecting an incorrect value which results in the following
errors:
<package>.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system
<package>.armv7l.rpm is for architecture armel ; the package cannot be built on this system
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7046
Closes #7058
John L. Hammond [Wed, 17 Jan 2018 20:24:42 +0000 (14:24 -0600)]
Emit an error message before MMP suspends pool
In mmp_thread(), emit an MMP specific error message before calling
zio_suspend() so that the administrator will understand why the pool
is being suspended.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John L. Hammond <john.hammond@intel.com>
Closes #7048
Sean Eric Fagan [Mon, 6 Nov 2017 23:13:23 +0000 (15:13 -0800)]
OpenZFS 8959 - Add notifications when a scrub is paused or resumed
Authored by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
- Brought #defines in eventdefs.h in line with ZFS on Linux format.
- Updated zfs-events.5 with the new events.
Brian Behlendorf [Wed, 17 Jan 2018 18:17:16 +0000 (10:17 -0800)]
Fix shellcheck v0.4.6 warnings
Resolve new warnings reported after upgrading to shellcheck
version 0.4.6. This patch contains no functional changes.
* egrep is non-standard and deprecated. Use grep -E instead. [SC2196]
* Check exit code directly with e.g. 'if mycmd;', not indirectly
with $?. [SC2181] Suppressed.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7040
Brian Behlendorf [Fri, 12 Jan 2018 17:36:26 +0000 (09:36 -0800)]
Force ztest to always use /dev/urandom
For ztest, which is solely for testing, using a pseudo random
is entirely reasonable. Using /dev/urandom ensures the system
entropy pool doesn't get depleted thus stalling the testing.
This is a particular problem when testing in VMs.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7017
Closes #7036
Brian Behlendorf [Wed, 10 Jan 2018 18:49:27 +0000 (10:49 -0800)]
Support -fsanitize=address with --enable-asan
When --enable-asan is provided to configure then build all user
space components with fsanitize=address. For kernel support
use the Linux KASAN feature instead.
When using gcc version 4.8 any test case which intentionally
generates a core dump will fail when using --enable-asan.
The default behavior is to disable core dumps and only newer
versions allow this behavior to be controled at run time with
the ASAN_OPTIONS environment variable.
Additionally, this patch includes some build system cleanup.
* Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS,
and AM_LDFLAGS. Any additional flags should be added on a
per-Makefile basic. The --enable-debug and --enable-asan
options apply to all user space binaries and libraries.
* Compiler checks consolidated in always-compiler-options.m4
and renamed for consistency.
* -fstack-check compiler flag was removed, this functionality
is provided by asan when configured with --enable-asan.
* Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and
DEBUG_LDFLAGS.
* Moved default kernel build flags in to module/Makefile.in and
split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These
flags are set with the standard ccflags-y kbuild mechanism.
* -Wframe-larger-than checks applied only to binaries or
libraries which include source files which are built in
both user space and kernel space. This restriction is
relaxed for user space only utilities.
* -Wno-unused-but-set-variable applied only to libzfs and
libzpool. The remaining warnings are the result of an
ASSERT using a variable when is always declared.
* -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped
because they are Solaris specific and thus not needed.
* Ensure $GDB is defined as gdb by default in zloop.sh.
Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7027
Brian Behlendorf [Wed, 10 Jan 2018 18:41:30 +0000 (10:41 -0800)]
Disable history_004_pos
Occasionally observed failure of history_004_pos due to the test
case not being 100% reliable. In order to prevent false positives
disable this test case until it can be made reliable.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7026
Closes #7028
Richard Yao [Wed, 10 Jan 2018 00:18:19 +0000 (19:18 -0500)]
Fix incompatibility with Reiser4 patched kernels
In ZFSOnLinux, our sources and build system are self contained such that
we do not need to make changes to the Linux kernel sources. Reiser4 on
the other hand exists solely as a kernel tree patch and opts to make
changes to the kernel rather than adapt to it. After Linux 4.1 made a
VFS change that replaced new_sync_read with do_sync_read, Reiser4's
maintainer decided to modify the kernel VFS to export the old function.
This caused our autotools check to misidentify the kernel API as
predating Linux 4.1 on kernels that have been patched with Reiser4
support, which breaks our build.
Reiser4 really should be patched to stop doing this, but lets modify our
check to be more strict to help the affected users of both filesystems.
Also, we were not checking the types of arguments and return value of
new_sync_read() and new_sync_write() . Lets fix that too.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #6241
Closes #7021
Alex Zhuravlev [Mon, 8 Jan 2018 18:57:47 +0000 (10:57 -0800)]
Use zap_count instead of cached z_size for unlink
As a performance optimization Lustre does not strictly update
the SA_ZPL_SIZE when adding/removing from non-directory entries.
This results in entries which cannot be removed through the ZPL
layer even though the ZAP is empty and safe to remove.
Resolve this issue by checking the zap_count() directly instead
on relying on the cached SA_ZPL_SIZE. Micro-benchmarks show no
significant performance impact due to the additional overhead
of using zap_count().
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7019
As part of the refactoring of ab9f4b0b824ab4cc64a4fa382c037f4154de12d6,
several uint64_t-s and uint8_t-s were changed to other types. This
caused ZoL github issue #6981, an overflow of a size_t on a 32-bit ARM
machine. In absense of any strong motivation for the type changes, this
simply puts them back, modulo the changes accumulated for ABD.
Attempt to reduce the number of comments posted by codecov
to PR requests. Based on the codecov documenation setting
"require_changes=yes" and "behavior=once" should result in
a single comment under most circumstances.
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.
This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.
This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6171
Closes #6852
Closes #6989
LOLi [Thu, 28 Dec 2017 18:15:32 +0000 (19:15 +0100)]
Fix 'zpool add' handling of nested interior VDEVs
When replacing a faulted device which was previously handled by a spare
multiple levels of nested interior VDEVs will be present in the pool
configuration; the following example illustrates one of the possible
situations:
NAME STATE READ WRITE CKSUM
testpool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
spare-0 DEGRADED 0 0 0
replacing-0 DEGRADED 0 0 0
/var/tmp/fault-dev UNAVAIL 0 0 0 cannot open
/var/tmp/replace-dev ONLINE 0 0 0
/var/tmp/spare-dev1 ONLINE 0 0 0
/var/tmp/safe-dev ONLINE 0 0 0
spares
/var/tmp/spare-dev1 INUSE currently in use
This is safe and allowed, but get_replication() needs to handle this
situation gracefully to let zpool add new devices to the pool.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6678
Closes #6996
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race *is* present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
SOLUTION
========
The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.
Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.
Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.
Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.
lidongyang [Fri, 22 Dec 2017 18:19:51 +0000 (05:19 +1100)]
Call commit callbacks from the tail of the list
Our zfs backed Lustre MDT had soft lockups while under heavy metadata
workloads while handling transaction callbacks from osd_zfs.
The problem is zfs is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.
This patch corrects this, it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.
A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes #6986
This patch simply removes 2 empty files that were accidentally
added a part of the scrub priority patch.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6990
Tom Caputi [Thu, 21 Dec 2017 17:13:06 +0000 (12:13 -0500)]
Support re-prioritizing asynchronous prefetches
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.
This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6921
Closes #6926
Simon Guest [Wed, 20 Dec 2017 17:42:07 +0000 (06:42 +1300)]
vdev_id: new slot type ses
This extends vdev_id to support a new slot type, ses, for SCSI Enclosure
Services. With slot type ses, the disk slot numbers are determined by
using the device slot number reported by sg_ses for the device with
matching SAS address, found by querying all available enclosures.
This is primarily of use on systems with a deficient driver omitting
support for bay_identifier in /sys/devices. In my testing, I found that
the existing slot types of port and id were not stable across disk
replacement, so an alternative was required.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Simon Guest <simon.guest@tesujimath.org>
Closes #6956
LOLi [Tue, 19 Dec 2017 21:02:40 +0000 (22:02 +0100)]
Handle invalid options in arc_summary
If an invalid option is provided to arc_summary.py we handle any error
thrown from the getopt Python module and print the usage help message.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6983
LOLi [Tue, 19 Dec 2017 18:49:33 +0000 (19:49 +0100)]
ZTS: Fix create-o_ashift test case
The function that fills the uberblock ring buffer on every device label
has been reworked to avoid occasional failures caused by a race
condition that prevents 'zpool sync' from writing some uberblock
sequentially: this happens when the pool sync ioctl dispatch code calls
txg_wait_synced() while we're already waiting for a TXG to sync.
Brian Behlendorf [Mon, 18 Dec 2017 18:28:27 +0000 (10:28 -0800)]
Fix multihost stale cache file import
When the multihost property is enabled it should be impossible to
import an active pool even using the force (-f) option. This patch
prevents a forced import from succeeding when importing with a
stale cache file.
The root cause of the problem is that the kernel modules trusted
the hostid provided in configuration. This is always correct when
the configuration is generated by scanning for the pool. However,
when using an existing cache file the hostid could be stale which
would result in the activity check being skipped.
Resolve the issue by always using the hostid read from the label
configuration where the best uberblock was found.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6933
Closes #6971
LOLi [Sun, 17 Dec 2017 22:14:07 +0000 (23:14 +0100)]
Honor --with-mounthelperdir where applicable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6962
LOLi [Sun, 17 Dec 2017 22:08:48 +0000 (23:08 +0100)]
Fix --with-systemd on Debian-based distributions (#6963)
These changes propagate the "--with-systemd" configure option to the
RPM spec file, allowing Debian-based distributions to package
systemd-related files.
Brian Behlendorf [Sun, 17 Dec 2017 22:02:29 +0000 (14:02 -0800)]
Remove lib/libspl/include/sys/frame.h
The functionality provided by this header is not required by any
of the ZFS user space code. Minimal functionality was provided
in commit c28a677 which added include/sys/frame.h.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6960
Closes #6972
David Qian [Thu, 7 Dec 2017 08:43:13 +0000 (16:43 +0800)]
Enable QAT support in zfs-dkms RPM
Enable QAT accelerated gzip compression in zfs-dkms RPM package when
environment variant ICP_ROOT is set to QAT drive source code folder
and QAT hardware presence. Otherwise, use default gzip compression.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: David Qian <david.qian@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6932