Ryan Moeller [Tue, 3 Sep 2019 19:44:08 +0000 (15:44 -0400)]
Use the right booleans
TRUE and FALSE happen to be defined, but we should use B_TRUE and
B_FALSE for the sake of consistency.
Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9264
Pavel Zakharov [Tue, 3 Sep 2019 18:29:52 +0000 (14:29 -0400)]
zvol_wait script should ignore partially received zvols
Partially received zvols won't have links in /dev/zvol.
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Closes #9260
Always refuse receving non-resume stream when resume state exists
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).
Additionally, distinguish incremental and resume streams in error
messages.
Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9252
Igor K [Tue, 3 Sep 2019 17:46:41 +0000 (20:46 +0300)]
ZTS: Fix removal_cancel.ksh
Create a larger file to extend the time required to perform the
removal. Occasional failures were observed due to the removal
completing before the cancel could be requested.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #9259
George Wilson [Tue, 3 Sep 2019 02:17:51 +0000 (22:17 -0400)]
maxinflight can overflow in spa_load_verify_cb()
When running on larger memory systems, we can overflow the value of
maxinflight. This can result in maxinflight having a value of 0 causing
the system to hang.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #9272
Andrea Gelmini [Tue, 3 Sep 2019 01:17:39 +0000 (03:17 +0200)]
Fix typos
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9251
Andrea Gelmini [Tue, 3 Sep 2019 01:14:53 +0000 (03:14 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9250
Andrea Gelmini [Tue, 3 Sep 2019 01:13:19 +0000 (03:13 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9249
Andrea Gelmini [Tue, 3 Sep 2019 01:12:01 +0000 (03:12 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9247
Andrea Gelmini [Tue, 3 Sep 2019 01:10:31 +0000 (03:10 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9246
Andrea Gelmini [Tue, 3 Sep 2019 01:08:56 +0000 (03:08 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9244
Andrea Gelmini [Tue, 3 Sep 2019 01:07:35 +0000 (03:07 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9243
Andrea Gelmini [Tue, 3 Sep 2019 00:58:26 +0000 (02:58 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9242
Andrea Gelmini [Tue, 3 Sep 2019 00:56:41 +0000 (02:56 +0200)]
Fix typos in module/zfs/
Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9240
Andrea Gelmini [Tue, 3 Sep 2019 00:53:27 +0000 (02:53 +0200)]
Fix typos in lib/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9237
Andrea Gelmini [Fri, 30 Aug 2019 23:53:48 +0000 (01:53 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9248
Andrea Gelmini [Fri, 30 Aug 2019 23:52:00 +0000 (01:52 +0200)]
Fix typos in tests/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9245
Andrea Gelmini [Fri, 30 Aug 2019 21:32:18 +0000 (23:32 +0200)]
Fix typos in module/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9241
Andrea Gelmini [Fri, 30 Aug 2019 21:26:07 +0000 (23:26 +0200)]
Fix typos in modules/icp/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9239
Andrea Gelmini [Fri, 30 Aug 2019 16:53:15 +0000 (18:53 +0200)]
Fix typos in include/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9238
Andrea Gelmini [Fri, 30 Aug 2019 16:46:52 +0000 (18:46 +0200)]
Fix typos in etc/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9236
Andrea Gelmini [Fri, 30 Aug 2019 16:44:43 +0000 (18:44 +0200)]
Fix typos in contrib/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9235
Andrea Gelmini [Fri, 30 Aug 2019 16:43:30 +0000 (18:43 +0200)]
Fix typos in cmd/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9234
Andrea Gelmini [Fri, 30 Aug 2019 16:41:35 +0000 (18:41 +0200)]
Fix typos in man/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9233
Andrea Gelmini [Fri, 30 Aug 2019 16:40:30 +0000 (18:40 +0200)]
Fix typos in config/
Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #9232
Igor K [Fri, 30 Aug 2019 16:32:25 +0000 (19:32 +0300)]
Fix refquota_007_neg.ksh
Must use 'zfs' instead of '$ZFS' which is undefined.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #9257
Paul Dagnelie [Fri, 30 Aug 2019 16:28:31 +0000 (09:28 -0700)]
Prevent metaslab_sync panic due to spa_final_dirty_txg
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.
Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9185
Closes #9186
Closes #9231
Closes #9253
Ryan Moeller [Thu, 29 Aug 2019 20:11:29 +0000 (16:11 -0400)]
Simplify deleting partitions in libtest
Eliminate unnecessary code duplication. We can use a for-loop instead
of a while-loop. There is no need to echo $DISKSARRAY in a subshell or
return 0. Declare all variables with typeset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9224
Ryan Moeller [Thu, 29 Aug 2019 18:03:09 +0000 (14:03 -0400)]
Use compatible arg order in tests
BSD getopt() and getopt_long() want options before arguments.
Reorder arguments to zfs/zpool in tests to put all the options first.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9228
Paul Dagnelie [Thu, 29 Aug 2019 17:20:36 +0000 (10:20 -0700)]
Keep more metaslabs loaded
With the other metaslab changes loaded onto a system, we can
significantly reduce the memory usage of each loaded metaslab and
unload them on demand if there is memory pressure. However, none
of those changes actually result in us keeping more metaslabs loaded.
If we don't keep more metaslabs loaded, we will still have to wait
for demand-loading to finish when no loaded metaslab can satisfy our
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while
unloading is currently done on a txg-count basis. Txgs can take a
widely varying range of times, from tenths of a second to several
seconds. This can result in confusing, hard to predict behavior.
This change simply adds a time-based component to metaslab unloading.
A metaslab will remain loaded for one minute and 8 txgs (by default)
after it was last used, unless it is evicted due to memory pressure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-65016
External-issue: DLPX-65047
Closes #9197
Pavel Zakharov [Wed, 28 Aug 2019 22:02:58 +0000 (18:02 -0400)]
zfs_handle used after being closed/freed in change_one callback
This is a typical case of use after free. We would call zfs_close(zhp)
which would free the handle, and then call zfs_iter_children() on that
handle later. This change ensures that the zfs_handle is only closed
when we are ready to return.
Running `zfs inherit -r sharenfs pool` was failing with an error
code without any error messages. After some debugging I've pinpointed
the issue to be memory corruption, which would cause zfs to try to
issue an ioctl to the wrong device and receive ENOTTY.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Issue #7967
Closes #9165
Tony Nguyen [Wed, 28 Aug 2019 21:56:54 +0000 (15:56 -0600)]
Use smaller default slack/delta value for schedule_hrtimeout_range()
For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.
This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.
Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
and smaller gains as thread count increases. At >64 threads, 2-5%
decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
leveling out around 64 threads. At >64 threads, 5-10% decrease in
performance was observed.
- 128K sequential write sees 1-5 for the 128K. No observed regression at
high thread count.
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes #9217
Don Brady [Wed, 28 Aug 2019 17:44:46 +0000 (11:44 -0600)]
Tag ABD pages for exclusion in kernel crash dumps
Tag the ABD data pages so that they can be identified for exclusion
from kernel crash dumps. Eliminating the zfs file data allows for
significantly smaller crash dump files. Note that ZFS in illumos has
always excluded the zfs data pages from a kernel crash dump.
This change tags ARC scatter data pages so they can be identified from
the makedumpfile(8) command. That command is used to create smaller
dump files by ignoring some memory regions and using compression. It
already filters file data from the VFS page cache and will now be able
to exclude ZFS file data pages from the dump file.
A corresponding change to makeumpfile(8) is required to identify ZFS
data pages.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8899
Chunwei Chen [Wed, 28 Aug 2019 17:42:02 +0000 (10:42 -0700)]
Fix zil replay panic when TX_REMOVE followed by TX_CREATE
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.
We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
Ryan Moeller [Wed, 28 Aug 2019 17:38:40 +0000 (13:38 -0400)]
Prefer `for (;;)` to `while (TRUE)`
Defining a special constant to make an infinite loop is excessive,
especially when the name clashes with symbols commonly defined on
some platforms (ie FreeBSD).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Kennedy <john.kennedy@delphix.com Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9219
Andriy Gapon [Tue, 27 Aug 2019 20:45:53 +0000 (23:45 +0300)]
zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.
After this change, zfs_check_userprops() only validates the properties
without any permission checks. The permissions are checked individually
for each snapshotted dataset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9179
Closes #9180
Richard Allen [Tue, 27 Aug 2019 20:44:02 +0000 (21:44 +0100)]
Fix Plymouth passphrase prompt in initramfs script
Entering the ZFS encryption passphrase under Plymouth wasn't working
because in the ZFS initrd script, Plymouth was calling zfs via
"--command", which wasn't passing through the filesystem argument to
zfs load-key properly (it was passing through the single quotes around
the filesystem name intended to handle spaces literally,
which zfs load-key couldn't understand).
Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Garrett Fields <ghfields@gmail.com> Signed-off-by: Richard Allen <belperite@gmail.com>
Issue #9193
Closes #9202
Tom Caputi [Tue, 27 Aug 2019 16:55:51 +0000 (12:55 -0400)]
Fix deadlock in 'zfs rollback'
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.
This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9203
Ryan Moeller [Mon, 26 Aug 2019 18:48:31 +0000 (14:48 -0400)]
Restore :: in Makefile.am
The double-colon looked like a typo, but it's actually an obscure
feature. Rules with :: may appear multiple times and are run
independently of one another in the order they appear. The use of ::
for distclean-local was conventional, not accidental.
Add comments to indicate the intentional use of double-colon rules.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9210
Ryan Moeller [Mon, 26 Aug 2019 01:30:39 +0000 (21:30 -0400)]
Split argument list, satisfy shellcheck SC2086
Split the arguments for ${TEST_RUNNER} across multiple lines for
clarity. Also added quotes in the message to match the invoked command.
Unquoted variables in argument lists are subject to splitting. In this
particular case we can't quote the variable because it is an optional
argument. Use the method suggested in the description linked below,
instead.
The technique is to use an unquoted variable with an alternate value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9212
Brian Behlendorf [Fri, 23 Aug 2019 00:37:48 +0000 (17:37 -0700)]
ZTS: Fix in-tree dbufstats test case
Commit a887d653 updated the dbufstats such that escalated privileges
are required. Since all tests under cli_user are run with normal
privileges move this test case to a location where it will be run
required privileges.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9118
Closes #9196
Ryan Moeller [Fri, 23 Aug 2019 00:26:51 +0000 (20:26 -0400)]
Make slog test setup more robust
The slog tests fail when attempting to create pools using file vdevs
that already exist from previous test runs. Remove these files in the
setup for the test.
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9194
Brian Behlendorf [Thu, 22 Aug 2019 17:36:57 +0000 (10:36 -0700)]
ZTS: Use decimal values when setting tunables
The mdb_set_uint32 function requires that the values passed in be
decimal. This was overlooked initially because the matching Linux
function accepts both decimal and hexadecimal values.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #9125
Closes #9195
Brian Behlendorf [Thu, 22 Aug 2019 16:53:45 +0000 (09:53 -0700)]
Fix automake program name transformations (#9190)
Automake can perform program name transformations at install time.
However, arc_summary has its own name transformation taking place,
which interferes with the automake transforms. The automake transforms
must be taken into account in order to resolve the conflict.
Document ZFS_DKMS_ENABLE_DEBUGINFO in userland configuration
Document the ZFS_DKMS_ENABLE_DEBUGINFO option in the userland
configuration file, as done with the other ZFS_DKMS_* options.
It has been introduced with commit e45c1734a665 ("dkms: Enable
debuginfo option to be set with zfs sysconfig file") but isn't
mentioned anywhere other than the 'dkms.conf' file (generated).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes #9191
Ryan Moeller [Thu, 22 Aug 2019 16:46:09 +0000 (12:46 -0400)]
Dedup IOC enum values in libzfs_input_check
Reuse enum value ZFS_IOC_BASE for `('Z' << 8)`.
This is helpful on FreeBSD where ZFS_IOC_BASE has a different value and
`('Z' << 8)` is wrong.
Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9188
Ryan Moeller [Thu, 22 Aug 2019 16:44:11 +0000 (12:44 -0400)]
Enhance ioctl number checks
When checking ZFS_IOC_* numbers, print which numbers are wrong rather
than silently failing.
Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9187
Brian Behlendorf [Thu, 22 Aug 2019 15:53:44 +0000 (08:53 -0700)]
ZTS: Fix vdev_zaps_005_pos on CentOS 6
The ancient version of blkid (v2.17.2) used in CentOS 6 will not
detect the newly created pool unless it has been written to.
Force a pool sync so `zpool import` will detect the newly created
pool.
Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9199
Ryan Moeller [Wed, 21 Aug 2019 16:01:59 +0000 (12:01 -0400)]
Minor cleanup in Makefile.am
Split long lines where adding license info to dist archive.
Remove extra colon from target line.
Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9189
Ryan Moeller [Tue, 20 Aug 2019 21:45:26 +0000 (17:45 -0400)]
Fix automake program name transformations
Automake can perform program name transformations at install time.
However, arc_summary has its own name transformation taking place,
which interferes with the automake transforms. The automake transforms
must be taken into account in order to resolve the conflict.
Matthew Ahrens [Tue, 20 Aug 2019 18:34:52 +0000 (11:34 -0700)]
Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing. With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.
By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes #8744
There are two different deadlock scenarios, but they share a common
link, which is
thread 1 holding sa_lock and trying to get zap->zap_rwlock:
zap_lockdir_impl+0x858/0x16c0 [zfs]
zap_lockdir+0xd2/0x100 [zfs]
zap_lookup_norm+0x7f/0x100 [zfs]
zap_lookup+0x12/0x20 [zfs]
sa_setup+0x902/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
and thread 2 trying to get sa_lock, either in sa_setup:
sa_setup+0x742/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
sa_build_index+0x13d/0x790 [zfs]
sa_handle_get_from_db+0x368/0x500 [zfs]
zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
zfs_znode_alloc+0x3da/0x1a40 [zfs]
zfs_zget+0x39a/0x6e0 [zfs]
zfs_root+0x101/0x160 [zfs]
zfs_domount+0x91f/0xea0 [zfs]
From there, there are different locking paths back to something
holding zap->zap_rwlock.
The deadlock scenarios involve multiple different ZFS filesystems
being mounted. sa_lock is common to these scenarios, and the sa
struct involved is private to a mount. Therefore, these must be
referring to different sa_lock instances and these deadlocks can't
occur in practice.
The fix, from Brian Behlendorf, is to remove sa_lock from lockdep
coverage by initializing it with MUTEX_NOLOCKDEP.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #9110
colmbuckley [Mon, 19 Aug 2019 22:11:47 +0000 (23:11 +0100)]
Set "none" scheduler if available (initramfs)
Existing zfs initramfs script logic will attempt to set the 'noop'
scheduler if it's available on the vdev block devices. Newer kernels
have the similar 'none' scheduler on multiqueue devices; this change
alters the initramfs script logic to also attempt to set this scheduler
if it's available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Garrett Fields <ghfields@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #9042
Paul Dagnelie [Mon, 19 Aug 2019 22:06:53 +0000 (15:06 -0700)]
Add more refquota tests
It used to be possible for zfs receive (and other operations related
to clone swap) to bypass refquotas. This can cause a number of issues,
and there should be an automated test for it.
Added tests for rollback and receive not overriding refquota.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9139
Paul Dagnelie [Fri, 16 Aug 2019 15:08:21 +0000 (08:08 -0700)]
Cap metaslab memory usage
On systems with large amounts of storage and high fragmentation, a huge
amount of space can be used by storing metaslab range trees. Since
metaslabs are only unloaded during a txg sync, and only if they have
been inactive for 8 txgs, it is possible to get into a state where all
of the system's memory is consumed by range trees and metaslabs, and
txgs cannot sync. While ZFS knows how to evict ARC data when needed,
it has no such mechanism for range tree data. This can result in boot
hangs for some system configurations.
First, we add the ability to unload metaslabs outside of syncing
context. Second, we store a multilist of all loaded metaslabs, sorted
by their selection txg, so we can quickly identify the oldest
metaslabs. We use a multilist to reduce lock contention during heavy
write workloads. Finally, we add logic that will unload a metaslab
when we're loading a new metaslab, if we're using more than a certain
fraction of the available memory on range trees.
Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9128
* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions
At least debian needs /etc/default/zfs and /etc/zfs/zfs-functions for
its initramfs. Include both in build when initramfs is configured.
* contrib/initramfs: include 60-zvol.rules and zvol_id
Include 60-zvol.rules and zvol_id and set udev as predependency instead
of debians zdev. This makes debians additional zdev hook unneeded.
* Correct initconfdir substitution for some distros
Not every Linux distro is using @sysconfdir@/default but @initconfdir@
which is already determined by configure. Let's use it.
* systemd: prevent possible conflict between systemd and sysvinit
Systemd will not load a sysvinit service if a unit exists with the same
name. This prevents conflicts between sysvinit and systemd.
In ZFS there is one sysvinit service that does not have a systemd
service but a target counterpart, zfs-import.target.
Usually it does not make any sense to install both but it is possisble.
Let's prevent any conflict by masking zfs-import.service by default.
This does not harm even if init.d/zfs-import does not exist.
Reviewed-by: Chris Wedgwood <cw@f00f.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Alex Ingram <reimu@reimuhakurei.net> Tested-by: Dreamcat4 <dreamcat4@gmail.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #7904
Closes #9089
dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call in one of the above function but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Assert that a dnode's bonuslen never exceeds its recorded size
This patch introduces an assertion that can catch pitfalls in
development where there is a mismatch between the size of
reads and writes between a *_phys structure and its respective
in-core structure when bonus buffers are used.
This debugging-aid should be complementary to the verification
done by ztest in ztest_verify_dnode_bt().
A side to this patch is that we now clear out any extra bytes
past a bonus buffer's new size when the buffer is shrinking.
Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8348
Paul Zuchowski [Thu, 15 Aug 2019 14:27:13 +0000 (10:27 -0400)]
Make txg_wait_synced conditional in zfsvfs_teardown
The call to txg_wait_synced in zfsvfs_teardown should
be made conditional on the objset having dirty data.
This can prevent unnecessary txg_wait_synced during
some unmount operations.
Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9115
Paul Dagnelie [Wed, 14 Aug 2019 03:24:43 +0000 (20:24 -0700)]
Prevent race in blkptr_verify against device removal
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run
into a race condition where that vdev is temporarily unavailable. This
happens when a device removal operation and the old vdev_t has been
removed from the array, but the new indirect vdev has not yet been
inserted.
We hold the spa_config_lock while doing our sensitive verification.
To ensure that we don't deadlock, we only grab the lock if we don't
have config_writer held. In addition, I had to const the tags of the
refcounts and the spa_config_lock arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9112
Chunwei Chen [Wed, 14 Aug 2019 03:21:27 +0000 (20:21 -0700)]
Fix out-of-order ZIL txtype lost on hardlinked files
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.
We fix this by only calling zil_remove_async when the file is fully
unlinked.
Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
Prakash Surya [Wed, 14 Aug 2019 03:18:53 +0000 (20:18 -0700)]
Fix device expansion when VM is powered off
When running on an ESXi based VM, I've found that "zpool online -e" will
not expand the zpool, if the disk was expanded in ESXi while the VM was
powered off.
For example, take the following scenario:
1. VM running on top of VMware ESXi
2. ZFS pool created with a given device "sda" of size 8GB
3. VM powered off
4. Device "sda" size expanded to 16GB
5. VM powered on
6. "zpool online -e" used on device "sda"
In this situation, after (2) the zpool will be roughly 8GB in size.
After (6), the expectation is the zpool's size will expand to roughly
16GB in size; i.e. expand to the new size of the "sda" device.
Unfortunately, I've seen that after (6), the zpool size does not change.
What's happening is after (5), the EFI label of the "sda" device will be
such that fields "efi_last_u_lba", "efi_last_lba", and "efi_altern_lba"
all reflect the new size of the disk; i.e. "33554398", "33554431", and
"33554431" respectively.
Thus, the check that we perform in "efi_use_whole_disk":
if ((efi_label->efi_altern_lba == 1) || (efi_label->efi_altern_lba
>= efi_label->efi_last_lba)) {
This will return true, and then we return from the function without
having expanded the size of the zpool/device.
In contrast, if we remove steps (3) and (5) in the sequence above, i.e.
the device is expanded while the VM is powered on, things change. In
that case, the fields "efi_last_u_lba" and "efi_altern_lba" do not
change (i.e. they still reflect the old 8GB device size), but the
"efi_last_lba" field does change (i.e. it now reflects the new 16GB
device size). Thus, when we evaluate the same conditional in
"efi_use_whole_disk", it'll return false, so the zpool is expanded.
Taking all of this into account, this PR updates "efi_use_whole_disk" to
properly expand the zpool when the underlying disk is expanded while the
VM is powered off.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9111
Allan Jude [Wed, 14 Aug 2019 03:16:23 +0000 (23:16 -0400)]
Mark dsl_livelist_should_disable() static
This function is not used outside of dsl_dataset.c
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #9154
George Wilson [Tue, 13 Aug 2019 14:11:57 +0000 (08:11 -0600)]
spa_load_verify() may consume too much memory
When a pool is imported it will scan the pool to verify the integrity
of the data and metadata. The amount it scans will depend on the
import flags provided. On systems with small amounts of memory or
when importing a pool from the crash kernel, it's possible for
spa_load_verify to issue too many I/Os that it consumes all the memory
of the system resulting in an OOM message or a hang.
To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.
Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
Tomohiro Kusumi [Tue, 13 Aug 2019 13:58:02 +0000 (22:58 +0900)]
Change boolean-like uint8_t fields in znode_t to boolean_t
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9092
Richard Yao [Tue, 13 Aug 2019 13:46:12 +0000 (09:46 -0400)]
Drop KMC_NOEMERGENCY
This is not implemented. If it were implemented, using it would risk
deadlocks on pre-3.18 kernels. Lets just drop it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #9119
Introduce getting holds and listing bookmarks through ZCP
Consumers of ZFS Channel Programs can now list bookmarks,
and get holds from datasets. A minor-refactoring was also
applied to distinguish between user and system properties
in ZCP.
Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Ported-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Dan Kimmel <dan.kimmel@delphix.com>
OpenZFS-issue: https://illumos.org/issues/8862
Closes #7902
Paul Dagnelie [Mon, 5 Aug 2019 21:34:27 +0000 (14:34 -0700)]
Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.
This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
lockdep false positive - move txg_kick() outside of ->dp_lock
This fixes a lockdep warning by breaking a link between ->tx_sync_lock
and ->dp_lock.
The deadlock envisioned by lockdep is this:
thread 1 holds db->db_mtx and tries to get dp->dp_lock:
dsl_pool_dirty_space+0x70/0x2d0 [zfs]
dbuf_dirty+0x778/0x31d0 [zfs]
thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
dmu_buf_will_dirty_impl
dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
bpobj_iterate_impl+0xbe6/0x1410 [zfs]
thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
bpobj_space+0x63/0x470 [zfs]
dsl_scan_active+0x340/0x3d0 [zfs]
txg_sync_thread+0x3f2/0x1370 [zfs]
thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock
txg_kick+0x61/0x420 [zfs]
dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]
This patch is orginally from Brian Behlendorf and slightly simplified
by me.
It breaks this cycle in thread 4 by moving the call from
dsl_pool_need_dirty_delay to txg_kick outside the section controlled
by dp->dp_lock.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #9094
Channel programs that many users find useful should be included with zfs
in the /contrib directory. This is the first of these contributions. A
channel program to recursively take snapshots of datasets with the
property com.sun:auto-snapshot=true.
9072 handle error of zap_cursor_retrieve() for log spacemap zap
In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve()
to return errors other than the expected ENOENT (e.g. when we are at
the end of the zap). Ensure that these error cases are handled
correctly by the import path.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9074
mismerged log spacemap comment for metaslab_verify_weight_and_frag
When the log spacemap commit was merged in ZoL, the
metaslab_verify_unflushed_changes() debugging function
was deleted as the feature was pretty much stable by
then. Unfortunately though there was a reference to
it from a comment in metaslab_verify_weight_and_frag().
This patch deletes the reference and pastes that
comment as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9097
pkconfig files get installed to $datarootdir/pkgconfig but rpm expects
them to be at $datadir. This works when $datarootdir==$datadir which is
the case most of the time but will fail when they differ.
* install: make initramfs-tools path static
Since initramfs-tools' path is nothing we can control as it is an
external package it does not make any sense to install zfs additions
anywhere else. Simply use /usr/share/initramfs-tools as path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9087
When creating hundreds of clones (for example using containers with
LXD) cloning slows down as the number of clones increases over time.
The reason for this is that the fetching of the clone information
using a small zcmd buffer requires two ioctl calls, one to determine
the size and a second to return the data. However, this requires
gathering the data twice, once to determine the size and again to
populate the zcmd buffer to return it to userspace.
These are expensive ioctl() calls, so instead, make the default buffer
size much larger: 256K.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9084
Matthew Ahrens [Tue, 30 Jul 2019 16:18:30 +0000 (09:18 -0700)]
Improve performance by using dmu_tx_hold_*_by_dnode()
In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand. This eliminates some calls to dnode_hold(), which can be
expensive. This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.
We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.
This also removes some unnecessary casts that are in the area.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9081
Brian Behlendorf [Mon, 29 Jul 2019 19:46:56 +0000 (12:46 -0700)]
Revert "Develop tests for issues #5866 and #8858"
This reverts commit 693c1fc478cc8118dd0168c4815c0ae3be41c9c3. This
change resulted in a kmem leak being observed in existing code which
needs to be identified and addressed.
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8978
Closes #9090
Brian Behlendorf [Mon, 29 Jul 2019 01:15:26 +0000 (18:15 -0700)]
Fix channel programs on s390x
When adapting the original sources for s390x the JMP_BUF_CNT was
mistakenly halved due to an incorrect assumption of the size of
a unsigned long. They are 8 bytes for the s390x architecture.
Increase JMP_BUF_CNT accordingly.
Authored-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Colin Ian King <canonical.com> Tested-by: Colin Ian King <canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8992
Closes #9080
George Wilson [Mon, 29 Jul 2019 01:13:56 +0000 (21:13 -0400)]
Race between zfs-share and zfs-mount services
When a system boots the zfs-mount.service and the
zfs-share.service can start simultaneously. What may be
unclear is that sharing a filesystem will first mount
the filesystem if it's not already mounted. This means
that both service can race to mount the same fileystem.
This race can result in a SEGFAULT or EBUSY conditions.
This change explicitly defines the start ordering between the
two services such that the zfs-mount.service is solely
responsible for mounting filesystems eliminating the race
between "zfs mount -a" and "zfs share -a" commands.
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #9083
Paul Zuchowski [Sat, 27 Jul 2019 00:52:13 +0000 (20:52 -0400)]
Develop tests for issues #5866 and #8858
Provide zfstest coverage for these two issues which
were a panic accessing extended attributes and
a problem comparing 64 bit and 32 bit generation
numbers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Issue #5866
Issue #8858
Closes #8978
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.
https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.
Matthew Ahrens [Fri, 26 Jul 2019 19:07:48 +0000 (12:07 -0700)]
zed crashes when devid not present
zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
gs_devid is NULL, but the nvl has a "devid" entry.
zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is
present in nvl, but then later it and zfs_agent_iter_vdev() assume that
DEV_IDENTIFIER is present and thus gs_devid is set.
Typically this is not a problem because usually either all vdevs have
devid's, or none of them do. Since zfs_agent_iter_vdev() first checks if
the vdev has devid before dereferencing gs_devid, the problem isn't
typically encountered. However, if some vdevs have devid's and some do
not, then the problem is easily reproduced. This can happen if the pool
has been moved from a system that has devid's to one that does not.
The fix is for zfs_agent_iter_vdev() to only try to match the devid's if
both nvl and gsp have devid's present.
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65090
Closes #9054
Closes #9060
Sara Hartse [Fri, 26 Jul 2019 17:54:14 +0000 (10:54 -0700)]
Fast Clone Deletion
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.
This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.
We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8416
Matthew Ahrens [Thu, 25 Jul 2019 18:57:58 +0000 (11:57 -0700)]
Replace zf_rwlock with a mutex
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex. Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9062