Brian Behlendorf [Thu, 30 Aug 2018 20:38:09 +0000 (13:38 -0700)]
ZTS: Fix zfs_create_013_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7847
Tom Caputi [Wed, 29 Aug 2018 18:33:33 +0000 (14:33 -0400)]
OpenZFS 9403 - assertion failed in arc_buf_destroy()
Assertion failed in arc_buf_destroy() when concurrently reading
block with checksum error.
Porting notes:
* The ability to zinject decompression errors has been added, but
this only works at the zio_decompress() level, where we have all
of the info we need to match against the user's zinject options.
* The decompress_fault test has been added to test the new zinject
functionality
* We attempted to set zio_decompress_fail_fraction to (1 << 18) in
ztest for further test coverage. Although this did uncover a few
low priority issues, this unfortuantely also causes ztest to
ASSERT in many locations where the code is working correctly since
it is designed to fail on IO errors. Developers can manually set
this variable with the '-o' option to find and debug issues.
Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Tom Caputi <tcaputi@datto.com>
OpenZFS-issue: https://illumos.org/issues/9403
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9
Closes #7822
Tom Caputi [Mon, 20 Aug 2018 20:42:17 +0000 (16:42 -0400)]
Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.
While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
Brian Behlendorf [Mon, 27 Aug 2018 17:04:21 +0000 (10:04 -0700)]
Direct IO support
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g. readahead) while preventing contention for system
memory between the database and kernel caches.
On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.
Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.
1. Minimize cache effects of the I/O.
By design the ARC is already scan-resistant which helps mitigate
the need for special O_DIRECT handling. Data which is only
accessed once will be the first to be evicted from the cache.
This behavior is in consistent with Illumos and FreeBSD.
Future performance work may wish to investigate the benefits of
immediately evicting data from the cache which has been read or
written with the O_DIRECT flag. Functionally this behavior is
very similar to applying the 'primarycache=metadata' property
per open file.
2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.
No additional alignment or length restrictions are imposed.
3. O_DIRECT _MAY_ perform unbuffered IO operations directly
between user memory and block device.
No unbuffered IO operations are currently supported. In order
to support features such as transparent compression, encryption,
and checksumming a copy must be made to transform the data.
4. O_DIRECT _MAY_ imply O_DSYNC (XFS).
O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide
O_DSYNC to request synchronous semantics.
5. O_DIRECT _MAY_ disable file locking that serializes IO
operations. Applications should avoid mixing O_DIRECT
and normal IO or mmap(2) IO to the same file. This is
particularly true for overlapping regions.
All I/O in ZFS is locked for correctness and this locking is not
disabled by O_DIRECT. However, concurrently mixing O_DIRECT,
mmap(2), and normal I/O on the same file is not recommended.
This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations. Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224
Closes #7823
Brian Behlendorf [Sun, 26 Aug 2018 19:59:33 +0000 (12:59 -0700)]
Remove %changelog from spec file
Remove the %changelog section from the spec files since it does
not get updated in the master branch. Not only does this mean
the information is stale, but it can result in 'make deb' failing
to build packages, issue #7825. This section should be updated
for tagged releases.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7825
Closes #7827
Fix a bunch of truncation compiler warnings that show up
on Fedora 28 (GCC 8.0.1).
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7368
Closes #7826
Closes #7830
Brian Behlendorf [Thu, 23 Aug 2018 16:34:34 +0000 (09:34 -0700)]
Add libaio-devel BuildRequires
The zfs-test package needs a build requirement on the libaio-devel
package. Without it ./configure will correctly determine that
mmap_libaio cannot be built and it will be skipped.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7821
Closes #7824
Tom Caputi [Mon, 20 Aug 2018 18:03:56 +0000 (14:03 -0400)]
Fix issues with raw receive_write_byref()
This patch fixes 2 issues with raw, deduplicated send streams. The
first is that datasets who had been completely received earlier in
the stream were not still marked as raw receives. This caused
problems when newly received datasets attempted to fetch raw data
from these datasets without this flag set.
The second problem was that the arc freeze checksum code was not
consistent about which locks needed to be held while performing
its asserts. The proper locking needed to run these asserts is
actually fairly nuanced, since the asserts touch the linked list
of buffers (requiring the header lock), the arc_state (requiring
the b_evict_lock), and the b_freeze_cksum (requiring the
b_freeze_lock). This seems like a large performance sacrifice and
a lot of unneeded complexity to verify that this relatively small
debug feature is working as intended, so this patch simply removes
these asserts instead.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7701
Olaf Faaland [Mon, 20 Aug 2018 17:05:23 +0000 (10:05 -0700)]
Skip import activity test in more zdb code paths
Since zdb opens the pools read-only, it cannot damage the pool in the
event the pool is already imported either on the same host or on
another one.
If the pool vdev structure is changing while zdb is importing the
pool, it may cause zdb to crash. However this is unlikely, and in any
case it's a user space process and can simply be run again.
For this reason, zdb should disable the multihost activity test on
import that is normally run.
This commit fixes a few zdb code paths where that had been overlooked.
It also adds tests to ensure that several common use cases handle this
properly in the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Gu Zheng <guzheng2331314@163.com> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7797
Closes #7801
DeHackEd [Mon, 20 Aug 2018 16:55:18 +0000 (12:55 -0400)]
Don't modify argv[] in user tools
argv[] gets modified during string parsing for input arguments. This
is reflected in the live process listing. Don't do that.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net>
Closes #7760
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
bunder2015 [Sun, 19 Aug 2018 04:19:41 +0000 (00:19 -0400)]
ZTS: events path cleanup
Removing hardcoded paths in events.cfg
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7805
bunder2015 [Sun, 19 Aug 2018 04:18:31 +0000 (00:18 -0400)]
ZTS: largest_pool_001 path cleanup
Removing hardcoded paths in largest_pool_001
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7804
Brian Behlendorf [Sun, 19 Aug 2018 04:16:12 +0000 (21:16 -0700)]
ZTS: Fix import_cache_device_replaced
Allow the 'zpool replace' to run slowly without overwhelming the vdev
queues by setting zfs_scan_vdev_limit=128k. This limits the number of
concurrent slow IOs which need to be handled. The net effect is the
test case runs approximately 3x faster putting it well under the 10
minute per-test time limit.
Rename import_cache* test cases to imprt_cachefile*. Originally
these were renamed due to a maximum tar name limit, this limit was
removed by commit 1dfde3d9b.
Replaced instances of /var/tmp in zpool_import.cfg with $TEST_BASE_DIR.
Reviewed-by: bunder2015 <omfgbunder@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7765
Closes #7802
LOLi [Sat, 18 Aug 2018 22:47:41 +0000 (00:47 +0200)]
'zfs holds' scripted mode is not documented
This change simply documents the existing "scripted mode" option in
both command help and man page.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7798
LOLi [Sat, 18 Aug 2018 20:10:36 +0000 (22:10 +0200)]
Fix arcstat.py handling of unsupported options
This change allows the arcstat.py script to handle unsupported options
gracefully and print both error and usage messages when one such option
is provided.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7799
Brian Behlendorf [Fri, 17 Aug 2018 17:01:47 +0000 (10:01 -0700)]
ZTS: Fix reservation_001_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7796
Brian Behlendorf [Wed, 15 Aug 2018 16:53:44 +0000 (09:53 -0700)]
Fix traverse_impl() kmem leak
The error path must free the memory allocated by this function or
it will be leaked. In practice, this would leak only a few bytes
of memory under rare circumstances and thus is unlikely to have
caused any real problems. This issue was caught by the kmemleak.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7791
Brian Behlendorf [Wed, 15 Aug 2018 16:52:28 +0000 (09:52 -0700)]
Use posix format for dist tarballs
Traditionally Automake has defaulted to the V7 tar format when
creating tarballs for distributions. One of the many limitions
of this format is a 99 character maximum path + file name limit.
This can cause problems when adding new test cases to the ZTS
due to the depth of the sub-tree and descriptive test names.
This change switches the build system to the posix (aliased as
pax) tar format which conforms to the POSIX.1-2001 specification.
This format does not suffer from the V7 limitations, was designed
to be compatible, and will become the default format in future
versions of GNU tar.
Tom Caputi [Mon, 18 Jun 2018 19:47:12 +0000 (15:47 -0400)]
Check encrypted dataset + embedded recv earlier
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not cleanup
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
Tom Caputi [Fri, 13 Oct 2017 17:09:04 +0000 (13:09 -0400)]
Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:
* Receiving an unencrypted stream as encrypted without
needing to create a "dummy" encrypted parent so that
encryption can be inheritted.
* Allowing users to change their keylocation on receive,
so long as the receiving dataset is an encryption root.
* Allowing users to explicitly exclude or override the
encryption property from an unencrypted properties stream,
allowing it to be received as encrypted.
* Receiving a recursive heirarchy of unencrypted datasets,
encrypting the top-level one and forcing all children to
inherit the encryption.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
bunder2015 [Mon, 13 Aug 2018 15:24:02 +0000 (11:24 -0400)]
ZTS: delegate group path cleanup
Removing hardcoded paths in delegate group tests
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7778
bunder2015 [Mon, 13 Aug 2018 15:22:41 +0000 (11:22 -0400)]
ZTS: acl group path cleanup
Removing hardcoded paths in acl group tests
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7777
bunder2015 [Mon, 13 Aug 2018 15:16:55 +0000 (11:16 -0400)]
ZTS: inuse_004 path cleanup
Removing hardcoded path in inuse_004
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7775
bunder2015 [Mon, 13 Aug 2018 15:15:53 +0000 (11:15 -0400)]
ZTS: projectquota_002 path cleanup
Removing hardcoded path in projectquota_002
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7774
Olaf Faaland [Mon, 13 Aug 2018 15:12:25 +0000 (08:12 -0700)]
MMP should not suspend pool in ztest
When running ztest, never suspend the pool due to failed or delayed
MMP writes.
There are many sources of long delays within ztest, such as device
opens, closes, etc. which in combination, may delay MMP writes too
long and cause MMP to suspend the pool.
Some of these delays also affect real pools, and should be fixed.
That is being worked separately.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7776
Brian Behlendorf [Sun, 12 Aug 2018 16:38:53 +0000 (09:38 -0700)]
ZTS: Test case reliability
* Both cli_root/zpool_import/import_cache_device_replaced, and
redundancy/redundancy_004_neg have been observed to fail for
spurious reasons ~1% of the time. Add them to the exception
list and reference the open Github issue.
* Speed up replacement/replacement_001_pos to prevent it from
exceeding the 10 minute per test limit and getting KILLED.
File vdev creation switched to truncate -s, redundant raidz1
testing pass dropped, fixed some minor formating issues.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7766
LOLi [Fri, 3 Aug 2018 21:56:25 +0000 (23:56 +0200)]
Allow inherited properties in zfs_check_settable()
This change modifies how 'checksum' and 'dedup' properties are verified
in zfs_check_settable() handling the case where they are explicitly
inherited in the dataset hierarchy when receiving a recursive send
stream.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7755
Closes #7576
Closes #7757
Don Brady [Fri, 3 Aug 2018 21:50:51 +0000 (15:50 -0600)]
zfs_ioc_unload_key can drop extra spa ref
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7759
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7763
Matthew Ahrens [Thu, 2 Aug 2018 22:51:45 +0000 (15:51 -0700)]
Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).
On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching. We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().
This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
John Gallagher [Thu, 2 Aug 2018 21:03:56 +0000 (14:03 -0700)]
Add missing checks to zpl_xattr_* functions
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #5866
Closes #7761
Nathan Lewis [Thu, 2 Aug 2018 18:59:24 +0000 (11:59 -0700)]
Add support for selecting encryption backend
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
that control the crypto implementation. At the moment there is a
choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
with as key schedules generated with various implementations
are not necessarily interchangable.
Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes #7102
Closes #7103
George Wilson [Tue, 31 Jul 2018 21:51:15 +0000 (17:51 -0400)]
Fix OpenZFS 9337 mismerge
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337, zfs get all is slow due to uncached metadata, was
merged in it ended up removing logic required by OpenZFS 9577,
remove zfs_dbuf_evict_key, and inadvertently reintroduced the
bug that 9577 was designed to fix.
This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #7758
Fix deadlock between zfs umount & snapentry_expire
zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a
writer and calls zfsctl_snapshot_unmount_cancel(), which waits
for snapentry_expire() if present (when snap is automounted).
This snapentry_expire() itself then waits for zfs_snapshot_lock
as a reader, resulting in a deadlock.
The fix is to only hold the zfs_snapshot_lock over the tree
lookup and removal. After a successful lookup the lock can
be dropped and zfs_snapentry_t will remain valid until the
reference taken by the lookup is released.
Paul Dagnelie [Mon, 12 Feb 2018 20:56:06 +0000 (12:56 -0800)]
OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========
We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time. Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group. There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch. The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations. If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk. As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
Performance Results
===================
Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.
Future Work
===========
Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized. Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Porting Notes:
* Fix reservation test failures by increasing tolerance.
Brian Behlendorf [Tue, 31 Jul 2018 17:17:44 +0000 (10:17 -0700)]
Add missing zfs-dracut RPM dependencies
The zfs-dracut package requires the hostid, basename, head, awk,
and grep utilities be installed. The first three are provided by
coreutils but additional dependencies are required for awk and grep.
Reviewed-by: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7729
Closes #7747
Don Brady [Wed, 27 Sep 2017 01:45:19 +0000 (19:45 -0600)]
OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To insure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
were intentially not added as tunable module parameters.
OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations
= Motivation
While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.
An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists. Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg
Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.
= What this patch does
In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).
= Performance Analysis
To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.
The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.
From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts. Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().
From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).
Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
zpool_do_events_nvprint().
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
Paul Dagnelie [Tue, 20 Sep 2016 17:02:29 +0000 (10:02 -0700)]
OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth times
As reported by https://github.com/zfsonlinux/zfs/issues/4996, there is
yet another hole birth issue. In this one, if a block is entirely holes,
but the birth times are not all the same, we lose that information by
creating one hole with the current txg as its birth time.
The ZoL PR's fix approach is incorrect. Ultimately, the problem here is
that when you truncate and write a file in the same transaction group,
the dbuf for the indirect block will be zeroed out to deal with the
truncation, and then written for the write. During this process, we will
lose hole birth time information for any holes in the range. In the case
where a dnode is being freed, we need to determine whether the block
should be converted to a higher-level hole in the zio pipeline, and if
so do it when the dnode is being synced out.
Porting Notes:
* The DMU_OBJECT_END change in zfs_znode.c was already applied.
* Added test cases from #5675 provided by @rincebrain for hole_birth
issues. These test cases should be pushed upstream to OpenZFS.
* Updated mk_files which is used by several rsend tests so the
files created are a little more interesting and may contain holes.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9438
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/738e2a3c
External-issue: DLPX-46861
Closes #7746
Brian Behlendorf [Mon, 30 Jul 2018 16:23:45 +0000 (09:23 -0700)]
ZTS: Fix reservation_017_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7750
Brian Behlendorf [Mon, 30 Jul 2018 16:22:30 +0000 (09:22 -0700)]
Add rwsem_tryupgrade for 4.9.20-rt16 kernel
The RT rwsem implementation was changed to allow multiple readers
as of the 4.9.20-rt16 patch set. This results in a build failure
because the existing implementation was forced to directly access
the rwsem structure which has changed.
While this could be accommodated by adding additional compatibility
code. This patch resolves the build issue by simply assuming the
rwsem can never be upgraded. This functionality is a performance
optimization and all callers must already handle this case.
Converting the last remaining use of __SPIN_LOCK_UNLOCKED to
spin_lock_init() was additionally required to get a clean build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7589
Systemd binaries necessary for mounting an encrypted root dataset
weren't copied to initramfs generated by dracut. This patch fixes
this and copies these binaries unconditionally, that is
regardless of whether native ZFS encryption is used for the
root dataset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Diamantopoulos <georgediam@gmail.com>
Closes #7607
Closes #7719
Toomas Soome [Wed, 1 Jun 2016 16:18:10 +0000 (19:18 +0300)]
OpenZFS 8906 - uts: illumos rootfs should support salted cksum
Porting notes:
* As of grub-2.02 these checksums are not supported. However, as
pointed out in #6501 there are alternatives such as EFISTUB which
work and have no such restriction. A warning was added to the
checksum property section of the zfs.8 man page.
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #6501
Closes #7714
Matthew Ahrens [Thu, 11 Jan 2018 16:54:38 +0000 (08:54 -0800)]
OpenZFS 9442 - decrease indirect block size of spacemaps
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.
Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
and dmu_object_alloc_dnsize() functions to use a common shared
dmu_object_alloc_impl() function.
Brian Behlendorf [Wed, 25 Jul 2018 17:15:39 +0000 (10:15 -0700)]
ZTS: Add reservation_008_pos exception
The reservation_008_pos test case has been observed to fail in
a non-dangerous way in approximately 5% of automated test runs.
Add the test case to the list of possible expected failures
until the test case can be made perfectly reliable.
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7741
Closes #7742
Feng Sun [Wed, 25 Jul 2018 16:52:27 +0000 (00:52 +0800)]
Introduce kstat dmu_tx_dirty_frees_delay
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit 539d33c7(OpenZFS 6569 - large file delete can starve out write ops).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #7718
sara hartse [Thu, 7 Sep 2017 20:10:53 +0000 (13:10 -0700)]
OpenZFS 9457 - libzfs_import.c:add_config() has a memory leak
A memory leak occurs on lines 209 and 213 because the config is not
freed in the error case. The interface to add_config() seems less than
ideal - it would be better if it copied any data necessary from the
config and the caller freed it.
Porting notes:
* This issue had already been resolved on Linux by adding the missing
calls to nvlist_free(). But we'll adopt the upstream fix to keep
the behavior of the code consistent.
Authored by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9457
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/be86bb8a
Closes #7713
Matthew Ahrens [Fri, 23 Jun 2017 19:32:16 +0000 (12:32 -0700)]
OpenZFS 9338 - moved dnode has incorrect dn_next_type
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
While investigating a different problem, I noticed that moved dnodes
(those processed by dnode_move_impl() via kmem_move()) have an incorrect
dn_next_type. This could cause the on-disk dn_type to be changed to an
invalid value. The fix to copy the dn_next_type in dnode_move_impl().
Porting notes:
* For the moment this potential issue cannot occur on Linux since
the SPL does not provide the kmem_move() functionality.
Tom Caputi [Tue, 24 Jul 2018 19:20:04 +0000 (15:20 -0400)]
Refactor arc_hdr_realloc_crypt()
The arc_hdr_realloc_crypt() function is responsible for converting
a "full" arc header to an extended "crypt" header and visa versa.
This code was originally written with a bcopy() so that any new
members added to arc headers would automatically be included
without requiring a code change. However, in practice this (along
with small differences in kmem_cache implementations between
various platforms) has caused a number of hard-to-find problems in
ports to other operating systems. This patch solves this problem
by making all member copies explicit and adding ASSERTs for fields
that cannot be set during the transfer. It also manually resets the
old header after the reallocation is finished so it can be properly
reallocated and reused.
Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7711
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.
The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.
This was apparently a regression introduced in commit 00c405b4b5e8,
which was an incorrect port of this OpenZFS commit:
Matthew Ahrens [Tue, 24 Jul 2018 05:08:18 +0000 (22:08 -0700)]
PR's should provide motivation & context first
It's often necessary to understand why a change is made, before
understanding the exact changes that are made. Context provides
background, which by definition is necessary to understand prior to the
substance of the Pull Request.
Change the PR template to request "Motivation and Context" first, before
"Description".
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7737
Brian Behlendorf [Mon, 23 Jul 2018 22:40:15 +0000 (15:40 -0700)]
Add support for autoexpand property
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure. Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.
Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev. The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.
From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid. If
a match is found the ZED reopens the pool vdev. This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read. Otherwise, it wouldn't
be possible to report the maximum possible expansion size.
Finally, if the property autoexpand=on a vdev expansion will be
attempted. After performing some sanity checks on the disk to
verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated. The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.
In order to make all of the above possible the following changes
were required:
* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
These tests now create a pool which is layered on a loopback,
scsi_debug, and file vdev. This allows for testing of non-
partitioned block device (loopback), a partition block device
(scsi_debug), and a file which does not receive udev change
events. This provided for better test coverage, and by removing
the layering on ZFS volumes there issues surrounding layering
one pool on another are avoided.
* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
This allows for matching by guid rather than path which is a
more reliable way for the ZED to reference a vdev.
* Fixed zfs_zevent_wait() signal handling which could result
in the ZED spinning when a signal was not handled.
* Removed vdev_disk_rrpart() functionality which can be abandoned
in favor of kernel provided blkdev_reread_part() function.
* Added a rwlock which is held as a writer while a disk is being
reopened. This is important to prevent errors from occurring
for any configuration related IOs which bypass the SCL_ZIO lock.
The zpool_reopen_007_pos.ksh test case was added to verify IO
error are never observed when reopening. This is not expected
to impact IO performance.
Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.
* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
ZFS volumes. This is as good as a unique physical path, while the
volumes are not used in the test cases anymore for other reasons
this improvement was included.
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #120
Closes #2437
Closes #5771
Closes #7366
Closes #7582
Closes #7629
Matthew Ahrens [Tue, 10 Jul 2018 17:49:50 +0000 (13:49 -0400)]
OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:
The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf. Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.
Performance Testing:
- Disabled kmem_flags
- Tuned dbuf_cache_max_bytes very low (128K)
- Tuned zfs_arc_max very low (64M)
Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.
Results:
| Test | Time (trunk / diff) | I/Os (trunk / diff) |
+------------------------+---------------------+---------------------+
| zpool import | 0:05 / 0:06 | 12.9k / 12.9k |
| zfs get all (uncached) | 1:36 / 0:53 | 16.7k / 5.7k |
| zfs get all (cached) | 1:36 / 0:51 | 16.0k / 6.0k |
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
Don Brady [Fri, 11 Aug 2017 21:28:17 +0000 (15:28 -0600)]
OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
OpenZFS 9479 - fix wrong format specifier for vdev_id
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Brian Behlendorf [Wed, 11 Jul 2018 22:49:10 +0000 (15:49 -0700)]
Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock. This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.
-- THREAD A --
spa_sync()
spa_sync_upgrades()
rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B
-- THREAD B --
mount_fs()
zpl_mount()
zpl_mount_impl()
dmu_objset_hold()
dmu_objset_hold_flags()
dsl_pool_hold()
dsl_pool_config_enter()
rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
sget()
sget_userns()
grab_super()
down_write(&s->s_umount); <- Waiting on C
-- THREAD C --
cleanup_mnt()
deactivate_super()
down_write(&s->s_umount);
deactivate_locked_super()
zpl_kill_sb()
kill_anon_super()
generic_shutdown_super()
sync_filesystem()
zpl_sync_fs()
zfs_sync()
zil_commit()
txg_wait_synced() <- Waiting on A
Brian Behlendorf [Wed, 11 Jul 2018 20:10:40 +0000 (13:10 -0700)]
Fix kernel unaligned access on sparc64
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time. Otherwise
fallback to using the sa_copy_data() function.
The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space. In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7642
Closes #7684
Matthew Ahrens [Mon, 7 Aug 2017 16:29:34 +0000 (09:29 -0700)]
OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"
Ztest failed with the following crash.
::status
debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter
file: /usr/bin/amd64/ztest
initial argv: /usr/bin/amd64/ztest
threading model: raw lwps
status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1
panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid
value type 'function' for key 'error')
The problem is that zcp_convert_return_values() assumes that there's
exactly one value on the stack, but that isn't always true. It ends up
putting the wrong thing on the stack which is then consumed by
zcp_convert_return values, which either adds the wrong message to the
nvlist, or blows up.
The fix is to make sure that callers of zcp_convert_return_values()
clear the stack before pushing their error message, and
zcp_convert_return_values() should VERIFY that the stack is the expected
size.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Brian Behlendorf [Wed, 11 Jul 2018 04:23:17 +0000 (21:23 -0700)]
Reduce zdb output when pool contains checkpoint
When running zdb without additional arguments against a pool containing
a checkpoint the entire checkpoint spacemap should not be dumped. Make
this behavior conditional upon passing the -mmmm option as described in
the zdb(8) man page.
-mmmm Display every spacemap record.
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7702
Matthew Ahrens [Mon, 4 Dec 2017 19:29:40 +0000 (11:29 -0800)]
OpenZFS 9454 - ::zfs_blkstats should count embedded blocks
When we do a scrub or resilver, ZFS counts the different types of blocks,
which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to
count embedded blocks.
Porting notes:
* Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional
as part of the sequential resilver functionality. Since phys_birth
would be zero that case should never happen as described above. This
is confirmed by the code coverage analysis. Remove the conditional
to realign that aspect of this function with OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
OpenZFS 9456 - ztest failure in zil_commit_waiter_timeout
Problem
=======
Illumos bug 8373 was integrated, which now presents a code path where
"dmu_tx_assign" can fail. When "dmu_tx_assign" fails, it will not issue
the lwb that was passed in to "zil_lwb_write_issue". As a result, when
"zil_lwb_write_issue" returns, the lwb will still be in the "opened"
state, just as it was when "zil_lwb_write_issue" was originally called.
Solution
========
As a result of this new call path, the failed assertion needs to be
modified to be aware of this new possibility. Thus, we can only assert
that the lwb is no longer in the "opened" state if the returned lwb is
non-null, since we cannot differentiate between the case of
"dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to
"zil_lwb_write_issue".
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.
The problem can be seen by attempting to create a dataset using the -p
option with many levels:
Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K. Enough time has passed
that 16K can be reasonably assumed to be the default value. It
was increased in the 3.15 kernel released in June of 2014.
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Remove the dependency on partitionable devices for the clean_mirror
and scrub_mirror test cases. This allows for the setup and cleanup
of the test cases to be simplified by removing the need for complex
partitioning.
This change also resolves a issue where the clean_mirror devices
were not being properly damaged since the device name was not a
full path. The result being loopX files were being left in the
top level test_results directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7434
Closes #7690
Add a default 4 KiB ashift for Amazon EC2 NVMe devices on instances with
NVMe ephemeral devices, such as the types c5d, f1, i3 and m5d.
As per the official documentation [1] a 4096 byte blocksize should be
used to match the underlying hardware.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Troels Nørgaard <tnn@tradeshift.com>
Closes #7676
Paul Dagnelie [Fri, 6 Jul 2018 16:27:37 +0000 (09:27 -0700)]
Cause autogen.sh to fail if autoreconf fails
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7683
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
New encoding
============
The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.
Other references:
=================
The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit. Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
Ahmed Gahnem [Tue, 26 Jun 2018 22:30:56 +0000 (16:30 -0600)]
OpenZFS 9184 - Add ZFS performance test for fixed blocksize random read/write IO
This change introduces a new performance test which does random reads
and writes, but instead of using `bssplit` to determine the block size,
it uses a fixed blocksize. Additionally, some new IO sizes are added to
other tests and timestamp data is recorded with the performance data.
Authored by: Ahmed Gahnem <ahmedg@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Ported-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Requires-builders: perf
Brian Behlendorf [Fri, 29 Jun 2018 16:40:32 +0000 (09:40 -0700)]
ZTS: Improve enospc tests
The enospc_002_pos test case would frequently fail due a command
succeeding when it was expected to fail due to lack of space.
In order to make this far less likely, files are created across
multiple transaction groups in order to consume as many unused
blocks as possible.
The dependency that the tests run on a partitioned block device
has been removed. It's simpler to use sparse files.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7663
Tom Caputi [Thu, 28 Jun 2018 21:55:11 +0000 (17:55 -0400)]
Fix 'zfs recv' of non large_dnode send streams
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that that a value of 0 is treated as
DNODE_MIN_SLOTS.
Tested-by: DHE <git@dehacked.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7617
Closes #7662
Tom Caputi [Thu, 28 Jun 2018 16:20:34 +0000 (12:20 -0400)]
Raw receive fix and encrypted objset security fix
This patch fixes two problems with the encryption code. First, the
current code does not correctly prohibit the DMU from updating
dn_maxblkid during object truncation within a raw receive. This
usually only causes issues when the truncating DRR_FREE record is
aggregated with DRR_FREE records later in the receive, so it is
relatively hard to hit.
Second, this patch fixes a security issue where reading blocks
within an encrypted object did not guarantee that the dnode block
itself had ever been verified against its MAC. Usually the
verification happened anyway when the bonus buffer was read, but
some use cases (notably zvols) might never perform the check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7632
Details about the motivation of this feature and its usage can
be found in this blogpost:
https://sdimitro.github.io/post/zpool-checkpoint/
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
* renames members of "struct metaslab trees to be shorter without
losing meaning
* space_map_{alloc,truncate}() accept a block size as a
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for metaslab
space maps in terms of bandwirdth since they are scattered all
over the disk. But for other space maps this default is probably
not what we want. Examples are device removal's vdev_obsolete_sm
or vdev_chedkpoint_sm from this review. Both of these have a
1:1 relationship with each vdev and could benefit from a bigger
block size.
Porting notes:
* The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.
* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
to block device backed pools.
* ZTS:
* Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".
* Don't use large dd block sizes on /dev/urandom under Linux in
checkpoint_capacity.
* Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
its attempts to fill the pool
* Create the base and nested pools with sync=disabled to speed up
the "setup" phase.
* Clear labels in test pool between checkpoint tests to avoid
duplicate pool issues.
* The import_rewind_device_replaced test has been marked as "known
to fail" for the reasons listed in its DISCLAIMER.
OpenZFS 9591 - ms_shift can be incorrectly changed
ms_shift can be incorrectly changed changed in MOS config for
indirect vdevs that have been historically expanded
According to spa_config_update() we expect new vdevs to have
vdev_ms_array equal to 0 and then we go ahead and set their metaslab
size. The problem is that indirect vdevs also have vdev_ms_array == 0
because their metaslabs are destroyed once their removal is done.
As a result, if a vdev was expanded and then removed may have its
ms_shift changed if another vdev was added after its removal.
Fortunately this behavior does not cause any type of crash or bad
behavior in the kernel but it can confuse zdb and anyone doing any kind
of analysis of the history of the pools.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651
OpenZFS-issue: https://illumos.org/issues/9591a
External-issue: DLPX-58879
Closes #7644
Matthew Ahrens [Wed, 20 Jun 2018 21:07:50 +0000 (14:07 -0700)]
Remove suffix from zio taskq names
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace. This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.
We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.
Note that we already support multiple taskq's with the same name. This
happens when there are multiple pools. In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7646
Brian Behlendorf [Wed, 20 Jun 2018 21:03:13 +0000 (14:03 -0700)]
ZTS: Adopt OpenZFS test analysis script
Adopt and extend the OpenZFS ZTS results analysis script for use
with ZFS on Linux. This allows for automatic analysis of tests
which may be skipped for a variety or reasons or which are not
entirely reliable.
In addition to the list of 'known' failures, which have been updated
for ZFS on Linux, there in a new 'maybe' section. This mapping
include tests which might be correctly skipped depending on the
test environment. This may be because of a missing dependency or
lack of required kernel support. This list also includes tests
which normally pass but might on occasion fail for a harmless
reason.
The script was also extended include a reason for why a given test
might be skipped or may fail. The reason will be included after
the test in the "results other than PASS that are expected" section.
For failures it is preferable to set the reason to the GitHub issue
number and for skipped tests several generic reasons are available.
You may also specify a custom reason if needed.
All tests were added back in to the linux.run file even if they are
expected to failed. There is value in running tests which may not
pass, the expected results for these tests has been encoded in
the new analysis script.
All tests which were disabled because they ran more slowly on a
32-bit system have been re-enabled. Developers working on 32-bit
systems should assess what it reasonable for their environment.
The unnecessary dependency on physical block devices was removed for
the checksum, grow_pool, and grow_replicas test groups so they are
no longer skipped. Updated the filetest_001_pos test case to run
properly now that it is enabled and moved the grow tests in to a
single directory.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7638
Brian Behlendorf [Wed, 20 Jun 2018 04:52:45 +0000 (21:52 -0700)]
Linux 4.14 compat: blk_queue_stackable()
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212. This change
resulted in the default elevator being used which can negatively
impact performance.
Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator. Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().
Finally, it was observed that the elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269. Update the
comment to clearly specify which are expected to export the
elevator_change() symbol.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7645
Brian Behlendorf [Wed, 20 Jun 2018 04:51:18 +0000 (21:51 -0700)]
Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe. As part of this change the current_time() function was
also updated to return the timespec64 type.
Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode. It should
be used when working with inode timestamps to ensure matching types.
The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t. Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.
Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other. They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
Tom Caputi [Mon, 18 Jun 2018 21:10:54 +0000 (17:10 -0400)]
Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
Richard Yao [Fri, 15 Jun 2018 22:16:29 +0000 (18:16 -0400)]
copy-builtin: SPL must be in Kbuild first
The recent SPL merge caused a regression in kernels with ZFS integrated
into the sources where our modules would be initialized in alphabetical
order, despite icp requiring the spl module be loaded first. This caused
kernels with ZFS builtin to fail to boot.
We resolve this by adding a special case for the spl that lists it
first. It is somewhat ugly, but it works.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Thode <prometheanfire@gentoo.org> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #7595
Closes #7606
John Gallagher [Fri, 15 Jun 2018 22:10:42 +0000 (15:10 -0700)]
Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Reviewed-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618