Morgan Jones [Mon, 19 Jun 2017 16:43:16 +0000 (16:43 +0000)]
Add kpreempt_disable/enable around CPU_SEQID uses
In zfs/dmu_object and icp/core/kcf_sched, the CPU_SEQID macro
should be surrounded by `kpreempt_disable` and `kpreempt_enable`
calls to avoid a Linux kernel BUG warning. These code paths use
the cpuid to minimize lock contention and it is safe to reschedule
the process to a different processor at any time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Morgan Jones <me@numin.it>
Closes #6239
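A minimal user-space sketch of the pattern described above. The kernel primitives are replaced by hypothetical stand-ins (preempt_depth, fake_cpu_id and pick_slot() are illustration only, not ZFS code); the point is that the CPU id is read between the disable/enable pair and is used only as a hint.

```c
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel primitives. */
static int preempt_depth;		/* models the preempt count */
static uint32_t fake_cpu_id;		/* models the current CPU id */
#define	kpreempt_disable()	((void)preempt_depth++)
#define	kpreempt_enable()	((void)preempt_depth--)
#define	CPU_SEQID		(fake_cpu_id)

/*
 * Pick a per-CPU slot. The id only spreads contention, so it is
 * fine if the thread migrates right after kpreempt_enable().
 */
static uint32_t
pick_slot(uint32_t nslots)
{
	uint32_t slot;

	kpreempt_disable();
	slot = CPU_SEQID % nslots;
	kpreempt_enable();
	return (slot);
}
```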
Don Brady [Sat, 17 Jun 2017 00:21:11 +0000 (18:21 -0600)]
Inject zinject(8) a percentage amount of dev errs
In the original form of device error injection, it was an all or nothing
situation. To help simulate intermittent error conditions, you can now
specify a real number percentage value. This is also very useful for our
ZFS fault diagnosis testing and for injecting intermittent errors during
load testing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6227
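One way a percentage-based injection decision could look, as a sketch only. should_inject() and its parameters are hypothetical names; the actual zinject/zio_inject code may scale and compare differently.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to inject an error for one I/O.
 * percent: injection rate as a real number in [0, 100].
 * rnd:     a uniformly distributed 32-bit random value from the caller.
 */
static bool
should_inject(double percent, uint32_t rnd)
{
	/* Scale the percentage onto the 32-bit range: 100% maps to 2^32. */
	uint64_t threshold = (uint64_t)(percent / 100.0 * 4294967296.0);

	return ((uint64_t)rnd < threshold);
}
```

With this scaling, 0 never injects and 100 always injects, since every 32-bit value is below 2^32.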
LOLi [Thu, 15 Jun 2017 18:08:45 +0000 (20:08 +0200)]
Fix zvol_state_t->zv_open_count race
5559ba0 added zv_state_lock to protect zvol_state_t internal data:
this, however, doesn't guard zv->zv_open_count and
zv->zv_disk->private_data in zvol_remove_minors_impl().
Fix this by taking zv->zv_state_lock before we check its zv_open_count.
Richard Yao [Tue, 13 Jun 2017 16:18:08 +0000 (12:18 -0400)]
Make zvol operations use _by_dnode routines
This continues what was started in 0eef1bde31d67091d3deed23fe2394f5a8bf2276 by fully converting zvols
to avoid unnecessary dnode_hold() calls. This saves a small amount
of CPU time and slightly improves latencies of operations on zvols.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@prophetstor.com>
Closes #6058
Cleanup zpool_import_all_001_pos to no longer use devices.
The test is meant to test zpool import -a and by no longer
requiring devices, a number of dependencies are no longer
necessary.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6198
Brian Behlendorf [Mon, 12 Jun 2017 16:45:32 +0000 (09:45 -0700)]
Use log_must_busy in destroy_pool
The log function log_must_busy was added in commit e623aea2 for
this purpose. Update destroy_pool to use it.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6217
kpande [Fri, 9 Jun 2017 16:51:13 +0000 (12:51 -0400)]
Add missing \n for "invalid optionusage" output
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Jack Draak <jackdraak@gmail.com> Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #6203
Paul Dagnelie [Thu, 7 Jul 2016 22:00:51 +0000 (15:00 -0700)]
OpenZFS 8056 - zfs send size estimate is inaccurate for some zvols
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
The send size estimate for a zvol can be too low, if the size of the
record headers (dmu_replay_record_t's) is a significant portion of the
size. This is typically the case when the data is highly compressible,
especially with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes
that blocks are the size of the "recordsize" property (128KB). However,
for zvols, the blocks are the size of the "volblocksize" property (8KB).
Therefore, we estimate that there will be 16x fewer record headers than
there really will be.
The fix is to check the type of the object set (whether it is a zvol or
not) and pick the appropriate property. In addition, while we are at it,
we also add the size of the BEGIN and END records to the estimate.
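The arithmetic behind the 16x figure can be sketched as follows. The function names are hypothetical, and hdr_size stands in for sizeof (dmu_replay_record_t), which this sketch does not define.

```c
#include <stdint.h>

/* Number of per-block record headers needed for `bytes` of data. */
static uint64_t
record_count(uint64_t bytes, uint64_t blocksize)
{
	return ((bytes + blocksize - 1) / blocksize);	/* round up */
}

/*
 * Header overhead in bytes. hdr_size is passed in by the caller and
 * stands for the size of one record header.
 */
static uint64_t
header_overhead(uint64_t bytes, uint64_t blocksize, uint64_t hdr_size)
{
	/* +2 accounts for the BEGIN and END records. */
	return ((record_count(bytes, blocksize) + 2) * hdr_size);
}
```

For 1 GiB of data, a 128KB block size gives 8192 headers while an 8KB block size gives 131072, which is the 16x gap described above.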
Matthew Ahrens [Tue, 28 Mar 2017 22:31:49 +0000 (15:31 -0700)]
OpenZFS 8156 - dbuf_evict_notify() does not need dbuf_evict_lock
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should
do the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention. It isn't necessary to hold
the lock, because if we make the wrong choice occasionally, nothing bad
will happen. This commit results in a ~60% performance improvement for
ARC-cached sequential reads.
Matthew Ahrens [Fri, 13 May 2016 04:16:36 +0000 (21:16 -0700)]
OpenZFS 8199 - multi-threaded dmu_object_alloc()
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock. To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.
The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs. When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128). Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.
This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.
Work sponsored by Intel Corp.
Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
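A single-threaded sketch of the per-CPU "next object" scheme described above. The struct and function names are hypothetical, NCPU is arbitrary, and the lock is only counted rather than taken, so this shows the chunk bookkeeping and not real concurrency.

```c
#include <stdint.h>

#define	NCPU	8	/* hypothetical CPU count */
#define	CHUNK	128	/* objects per chunk (4 blocks of dnodes) */

struct objset {
	uint64_t os_obj_next;		/* next unclaimed chunk */
	uint64_t cpu_next[NCPU];	/* per-CPU "next object" */
	unsigned lock_acquisitions;	/* counts os_obj_lock grabs */
};

static void
objset_init(struct objset *os)
{
	os->os_obj_next = CHUNK;	/* sketch: skip the first chunk */
	os->lock_acquisitions = 0;
	for (int i = 0; i < NCPU; i++)
		os->cpu_next[i] = 0;	/* chunk boundary forces a refill */
}

static uint64_t
object_alloc(struct objset *os, int cpu)
{
	if (os->cpu_next[cpu] % CHUNK == 0) {
		/* End of chunk: claim a new one (under os_obj_lock). */
		os->lock_acquisitions++;
		os->cpu_next[cpu] = os->os_obj_next;
		os->os_obj_next += CHUNK;
	}
	return (os->cpu_next[cpu]++);
}

/* Allocate n objects on one CPU; returns the last id handed out. */
static uint64_t
object_alloc_many(struct objset *os, int cpu, unsigned n)
{
	uint64_t last = 0;

	for (unsigned i = 0; i < n; i++)
		last = object_alloc(os, cpu);
	return (last);
}
```

In this model, 256 allocations on one CPU touch the shared counter only twice, once per chunk of 128.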
OpenZFS 7578 - Fix/improve some aspects of ZIL writing
- After some ZIL changes 6 years ago, zil_slog_limit got partially broken
because zl_itx_list_sz was not updated when async itxs were upgraded to sync.
Because of other changes made around that time, zl_itx_list_sz is not
actually required to implement the functionality, so this patch removes
some unneeded, broken code and variables.
- The original idea of zil_slog_limit was to reduce the chance of SLOG abuse
by a single heavy logger, which increased latency for other (more latency
critical) loggers, by pushing the heavy log out into the main pool instead of
the SLOG. Besides a huge latency increase for heavy writers, this implementation
caused a double write of all data, since the log records were explicitly
prepared for the SLOG. Since we now have an I/O scheduler, I've found it can be
much more efficient to reduce the priority of heavy logger SLOG writes from
ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leaving them
on the SLOG.
- The existing ZIL implementation had a problem with space efficiency when it
had to write large chunks of data into log blocks of limited size. In some
cases efficiency dropped to almost as low as 50%. For a ZIL stored on
spinning rust, that also cut log write speed in half, since the head had to
uselessly fly over allocated but unwritten areas. This change improves
the situation by offloading the problematic operations from z*_log_write() to
zil_lwb_commit(), which knows the real state of log block allocation and
can split large requests into pieces much more efficiently. As a side
effect, it also removes one of the two data copy operations done by the ZIL
code in the WR_COPIED case.
- While there, untangle and unify the code of the z*_log_write() functions.
Also, zfs_log_write(), like zvol_log_write(), can now handle writes crossing
a block boundary, which may also improve efficiency if the ZPL is made to do that.
Sponsored by: iXsystems, Inc.
Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <ryao@gentoo.org> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
Closes #6191
Matthew Ahrens [Thu, 23 Mar 2017 16:07:27 +0000 (09:07 -0700)]
OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffers
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
When writing pre-compressed buffers, arc_write() requires that
the compression algorithm used to compress the buffer matches
the compression algorithm requested by the zio_prop_t, which is
set by dmu_write_policy(). This makes dmu_write_policy() and its
callers a bit more complicated.
We simplify this by making arc_write() trust the caller to supply
the type of pre-compressed buffer that it wants to write,
and override the compression setting in the zio_prop_t.
Commit torvalds/linux@9e8925b6 allowed for kernels to be built
without support for mandatory locking (MS_MANDLOCK). This will
result in 'zfs mount' failing when the nbmand=on property is set
if the kernel is built without CONFIG_MANDATORY_FILE_LOCKING.
Unfortunately we can not reliably detect prior to the mount(2) system
call if the kernel was built with this support. The best we can do
is check if the mount failed with EPERM and if we passed 'mand'
as a mount option and then print a more useful error message. e.g.
filesystem 'tank/fs' has the 'nbmand=on' property set, this mount
option may be disabled in your kernel. Use 'zfs set nbmand=off'
to disable this option and try to mount the filesystem again.
Additionally, switch the default error message case to use
strerror() to produce a more human readable message.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4729
Closes #6199
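A sketch of the error-message selection described above, assuming the mount options arrive as a comma-separated string. has_option() and mount_error_message() are hypothetical helpers, not the libzfs functions, and the message is abbreviated.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if the comma-separated list `opts` contains `name`. */
static bool
has_option(const char *opts, const char *name)
{
	size_t len = strlen(name);

	for (const char *p = opts; *p != '\0'; ) {
		size_t n = strcspn(p, ",");	/* length of this token */

		if (n == len && strncmp(p, name, len) == 0)
			return (true);
		p += n;
		if (*p == ',')
			p++;
	}
	return (false);
}

/* Fill buf with a message for a failed mount(2) that set errno to err. */
static void
mount_error_message(int err, const char *fs, const char *opts,
    char *buf, size_t buflen)
{
	if (err == EPERM && has_option(opts, "mand")) {
		/* Likely a kernel without mandatory locking support. */
		(void) snprintf(buf, buflen, "filesystem '%s' has the "
		    "'nbmand=on' property set, this mount option may be "
		    "disabled in your kernel", fs);
	} else {
		/* Default case: human readable text from strerror(). */
		(void) snprintf(buf, buflen, "%s", strerror(err));
	}
}
```

Matching whole tokens keeps an option such as 'nomand' from being mistaken for 'mand'.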
zpool_create_024_pos, zvol_misc_002_pos, write_dirs_002_pos are slow
on the buildbot 32-bit builder. Skip the test cases for now on 32-bit
builders.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6195
The number of blocks which can be freed per TXG is controlled
by the zfs_free_max_blocks module option (defaults to 100,000).
Both speed up this test case and reduce the memory requirements
by only creating 4 TXGs worth of blocks to be freed.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5479
Closes #6192
Håkan Johansson [Mon, 5 Jun 2017 20:53:09 +0000 (22:53 +0200)]
Allow add of raidz and mirror with same redundancy
Allow new members to be added to a pool mixing raidz and mirror vdevs
without giving -f, as long as they have matching redundancy. This case
was missed in #5915, which only handled zpool create.
Add zfstest zpool_add_010_pos.ksh, with test of zpool create
followed by zpool add of mixed raidz and mirror vdevs.
Add some more mixed raidz and mirror cases to zpool_create_006_pos.ksh.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Haakan Johansson <f96hajo@chalmers.se>
Issue #5915
Closes #6181
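"Matching redundancy" can be illustrated with a small sketch, assuming redundancy means the number of device failures a vdev can survive. The helper names are hypothetical and are not taken from zpool_vdev.c.

```c
#include <stdbool.h>

/* Device failures an N-way mirror can survive. */
static unsigned
mirror_redundancy(unsigned children)
{
	return (children - 1);
}

/* Device failures a raidz vdev with `nparity` parity devices can survive. */
static unsigned
raidz_redundancy(unsigned nparity)
{
	return (nparity);
}

/* Two vdevs can be mixed without -f when their redundancy is equal. */
static bool
redundancy_matches(unsigned a, unsigned b)
{
	return (a == b);
}
```

Under this assumption a 2-way mirror matches raidz1 and a 3-way mirror matches raidz2, while a 2-way mirror and raidz2 would still need -f.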
Users can now provide their own scripts to be run
with 'zpool iostat/status -c'. User scripts should be
placed in ~/.zpool.d to be included in zpool's
default search path.
Provide a script which can be used with
'zpool iostat|status -c' that will return the type of
device (hdd, ssd, file).
Provide a script to get various values from smartctl
when using 'zpool iostat/status -c'.
Allow users to define the ZPOOL_SCRIPTS_PATH
environment variable which can be used to override
the default 'zpool iostat/status -c' search path.
Allow the ZPOOL_SCRIPTS_ENABLED environment
variable to enable or disable 'zpool status/iostat -c'
functionality.
Use the new smart script to provide the serial command.
Install /etc/sudoers.d/zfs file which contains the sudoer
rule for smartctl as a sample.
Allow 'zpool iostat/status -c' tests to run in tree.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6121
Closes #6153
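A sketch of how an environment override of the search path might be resolved. The default value below is an assumption for illustration (only ~/.zpool.d is named above), and treating only an unset variable as "use the default" is also an assumption.

```c
#include <stdlib.h>

/* Assumed default for illustration; only ~/.zpool.d is named in the log. */
#define	DEFAULT_SCRIPTS_PATH	"/etc/zfs/zpool.d:~/.zpool.d"

/* Pick the override when one is supplied, else the default path. */
static const char *
resolve_scripts_path(const char *env_value)
{
	return (env_value != NULL ? env_value : DEFAULT_SCRIPTS_PATH);
}

/* Look up ZPOOL_SCRIPTS_PATH in the environment and resolve it. */
static const char *
scripts_search_path(void)
{
	return (resolve_scripts_path(getenv("ZPOOL_SCRIPTS_PATH")));
}
```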
LOLi [Fri, 2 Jun 2017 14:17:00 +0000 (16:17 +0200)]
Fix "snapdev" property issues
When inheriting the "snapdev" property we don't always call
zfs_prop_set_special(): this prevents device nodes from being created in
certain situations. Because "snapdev" is the only *special* property
that is also inheritable we need to call zfs_prop_set_special() even
when we're not reverting it to the received value ('zfs inherit -S').
Additionally, fix a NULL pointer dereference accidentally introduced in 5559ba0 that can be triggered when setting the "snapdev" property to
the value "hidden" twice.
Finally, add a new test case "zvol_misc_snapdev" to the ZFS Test Suite.
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6131
Closes #6175
Closes #6176
Chunwei Chen [Thu, 25 May 2017 22:56:12 +0000 (15:56 -0700)]
Fix import wrong spare/l2 device when path change
If, for example, your aux device was /dev/sdc, but the aux device has since
been removed and /dev/sdc now points to another device, zpool import will
still use that device and corrupt it.
The problem is that spa_validate_aux in spa_import, rather than validating
the on-disk label, would actually write a label to disk. We remove these
calls since spa_load_{spares,l2cache} seem to do everything we need and
they actually validate the on-disk label.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6158
Chunwei Chen [Wed, 24 May 2017 22:11:23 +0000 (15:11 -0700)]
Fix import finding spare/l2cache when path changes
When a spare or l2cache device path changes, zpool import will not fix up
the path as it does for a normal vdev. The issue is that when you supply a
pool name argument to zpool import, it uses the name to filter out devices
which don't have the pool name in their label. Since spare and l2cache
devices never have that in the label, they always get filtered out.
We fix this by making sure we never filter out a spare or l2cache
device.
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6158
We no longer perform automated filebench testing. Remove
references to it for the automated testing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6186
LOLi [Wed, 31 May 2017 19:52:12 +0000 (21:52 +0200)]
Fix memory leak in zvol_set_volsize()
Move kmem_free() so it's called for every error path: this is
preferred over making `dmu_object_info_t doi` local to accommodate
older kernels with limited stacks.
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6177
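The "single free on every error path" shape can be sketched in user space as follows. tracked_alloc(), tracked_free() and set_volsize_sketch() are hypothetical, and fail_step merely simulates which step fails.

```c
#include <errno.h>
#include <stdlib.h>

static int live_allocs;		/* sketch-only leak counter */

static void *
tracked_alloc(size_t n)
{
	void *p = malloc(n);

	if (p != NULL)
		live_allocs++;
	return (p);
}

static void
tracked_free(void *p)
{
	if (p != NULL) {
		live_allocs--;
		free(p);
	}
}

/* fail_step selects which step fails (0 = none). */
static int
set_volsize_sketch(int fail_step)
{
	int error = 0;
	/* Heap allocated to keep the stack frame small. */
	char *doi = tracked_alloc(64);

	if (doi == NULL)
		return (ENOMEM);
	if (fail_step == 1) {
		error = EIO;
		goto out;
	}
	if (fail_step == 2) {
		error = EINVAL;
		goto out;
	}
out:
	tracked_free(doi);	/* reached from success and every error */
	return (error);
}
```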
kpande [Wed, 31 May 2017 14:30:07 +0000 (10:30 -0400)]
Explain reason for Signed-off-by in CONTRIBUTING
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #6183
Alek P [Fri, 26 May 2017 18:42:10 +0000 (08:42 -1000)]
Don't dirty bpobj if it has no entries
In certain cases (dsl_scan_sync() is one), we may end up calling
bpobj_iterate() on an empty bpobj. Even though we don't end up
modifying the bpobj it still gets dirtied, causing unneeded writes
to the pool.
This patch adds an early bail from bpobj_iterate_impl() if bpobj
is empty to prevent unneeded writes.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6164
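A simplified sketch of the early bail. The struct and its fields are hypothetical stand-ins for the real bpobj state; the only point is that the emptiness check runs before the object is marked dirty.

```c
#include <stdbool.h>
#include <stdint.h>

struct bpobj_sketch {
	uint64_t num_blkptrs;	/* entries stored directly */
	uint64_t num_subobjs;	/* nested sub-objects */
	bool dirty;		/* set once the object is marked dirty */
};

/* Returns the number of direct entries visited. */
static uint64_t
bpobj_iterate_sketch(struct bpobj_sketch *bpo)
{
	/* Early bail: nothing to visit, so do not dirty the object. */
	if (bpo->num_blkptrs == 0 && bpo->num_subobjs == 0)
		return (0);

	bpo->dirty = true;	/* models marking the buffer dirty */
	return (bpo->num_blkptrs);
}
```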
This reverts commit 959f56b99366c8727647b5b19fb3d47555c96cf3.
An issue was uncovered by the new zvol_misc_snapdev test case
which needs to be investigated and resolved.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6174
Issue #6131
Yuri Pankov [Wed, 24 May 2017 11:11:47 +0000 (07:11 -0400)]
OpenZFS 8077 - zfs-tests suite fails zpool_get_002_pos
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: bunder2015 <omfgbunder@gmail.com>
Porting Notes:
* Also corrected a quoting mistake found in our copy
LOLi [Thu, 25 May 2017 23:43:46 +0000 (01:43 +0200)]
Fix "snapdev" property inheritance behaviour
When inheriting the "snapdev" property we don't always call
zfs_prop_set_special(): this prevents device nodes from being created in
certain situations. Because "snapdev" is the only *special* property
that is also inheritable we need to call zfs_prop_set_special() even
when we're not reverting it to the received value ('zfs inherit -S').
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6131
Chunwei Chen [Wed, 24 May 2017 23:02:04 +0000 (16:02 -0700)]
config: allow --with-linux without --with-linux-obj
Don't use `uname -r` to determine kernel build directory when the user
specified kernel source with --with-linux. Otherwise, the user is forced
to use --with-linux-obj even if they are the same directory, which is
very counterintuitive.
LOLi [Thu, 25 May 2017 16:55:55 +0000 (18:55 +0200)]
Linux 4.12 compat: fix super_setup_bdi_name() call
Provide a format parameter to super_setup_bdi_name() so we don't
create duplicate names in '/devices/virtual/bdi' sysfs namespace which
would prevent us from mounting more than one ZFS filesystem at a time.
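A user-space sketch of why a format parameter avoids duplicate names: each call appends a distinct sequence number. make_bdi_name() and bdi_seq are hypothetical, the format string is an assumption, and the counter here is not atomic as a kernel counter would need to be.

```c
#include <stdio.h>

static unsigned long bdi_seq;	/* hypothetical sequence counter */

/* Build a unique name of the form "<name>-<seq>" into buf. */
static void
make_bdi_name(char *buf, size_t buflen, const char *name)
{
	(void) snprintf(buf, buflen, "%.28s-%lu", name, ++bdi_seq);
}
```

Two successive calls with the name "zfs" yield different strings, so two mounts would not collide in the namespace.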
Brian Behlendorf [Thu, 18 May 2017 19:57:21 +0000 (15:57 -0400)]
Add zpool events tests
* events_001_pos - Verify the expected events are generated when
invoking the various zpool sub-commands. These events must
appear in `zpool events` and be consumed by the ZED.
* events_002_pos - Verify the ZED consumes events which were
generated while it wasn't running when it is started.
Additionally, verify that events are only processed once.
As part of this change the default.cfg used by the test suite
was changed to a default.cfg.in file. This was needed so the
install location of all zed scripts, not only the enabled ones,
could be reliably determined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6128
Brian Behlendorf [Fri, 19 May 2017 00:22:04 +0000 (20:22 -0400)]
Enable xattr tests
Updated the xattr_common.ksh helper functions to use the attr
command on Linux to manipulate xattrs. Added an xattr.cfg file
and reworked the user/group functionality to be consistent with
the existing delegate test cases. The intent of each test
case was preserved.
* xattr_001_pos, xattr_002_neg - Updated to verify xattr=on
and xattr=sa style xattrs.
* xattr_003_neg - Use user_run helper instead of su.
* xattr_004_pos - Updated to work with ext2 xattrs.
* xattr_007_neg - Updated to use attr instead of runat.
* xattr_008_pos, xattr_009_neg, xattr_010_neg -
Test cases disabled since they aren't applicable to Linux.
* xattr_011_pos - Updated to expect the behavior of GNU
versions of the tested utilities.
* xattr_012_pos - Updated to use xattrtest to create many
small xattrs instead of a single large one.
* xattr_013_pos - Updated to use attr instead of runat.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6128
Brian Behlendorf [Fri, 19 May 2017 00:21:15 +0000 (20:21 -0400)]
Enable remaining tests
Enable most of the remaining test cases which were previously
disabled. The required fixes are as follows:
* cache_001_pos - No changes required.
* cache_010_neg - Updated to use losetup under Linux. Loopback
cache devices are allowed, ZVOLs as cache devices are not.
Disabled until all the builders pass reliably.
* cachefile_001_pos, cachefile_002_pos, cachefile_003_pos,
cachefile_004_pos - Set set_device_dir path in cachefile.cfg,
updated CPATH1 and CPATH2 to reference unique files.
* zfs_clone_005_pos - Wait for udev to create volumes.
* zfs_mount_007_pos - Updated mount options to expected Linux names.
* zfs_mount_009_neg, zfs_mount_all_001_pos - No changes required.
* zfs_unmount_005_pos, zfs_unmount_009_pos, zfs_unmount_all_001_pos -
Updated to expect -f to not unmount busy mount points under Linux.
* rsend_019_pos - Observed to occasionally take a long time on both
32-bit systems and the kmemleak builder.
* zfs_written_property_001_pos - Switched sync(1) to sync_pool.
* devices_001_pos, devices_002_neg - Updated create_dev_file() helper
for Linux.
* exec_002_neg.ksh - Fixed mmap_exec.c to preserve errno. Updated
test case to expect EPERM from Linux as described by mmap(2).
* grow_pool_001_pos - Added missing setup.ksh and cleanup.ksh
scripts from OpenZFS.
* history_004_pos, history_006_neg, history_008_pos - Fixed by
previous commits and were not enabled. No changes required.
* zfs_allow_010_pos - Added missing spaces after assorted zfs
commands in delegate_common.kshlib.
* inuse_* - Illumos dump device tests skipped. Remaining test
cases updated to correctly create required partitions.
* large_files_001_pos - Fixed largest_file.c to accept EINVAL
as well as EFBIG as described in write(2).
* link_count_001 - Added nproc to required commands.
* umountall_001 - Updated to use umount -a.
* online_offline_001_* - Pull in OpenZFS change to file_trunc.c
to make the '-c 0' option run the test in a loop. Included
online_offline.cfg file in all test cases.
* rename_dirs_001_pos - Updated to use the rename_dir test binary,
pkill restricted to exact matches and total runtime reduced.
* slog_013_neg, write_dirs_002_pos - No changes required.
* slog_013_pos.ksh - Updated to use losetup under Linux.
* slog_014_pos.ksh - ZED will not be running, manually degrade
the damaged vdev as expected.
* nopwrite_varying_compression, nopwrite_volume - Forced pool
sync with sync_pool to ensure up to date property values.
* Fixed typos in ZED log messages. Refactored zed_* helper
functions to resolve all-syslog exit=1 errors in zedlog.
* zfs_copies_005_neg, zfs_get_004_pos, zpool_add_004_pos,
zpool_destroy_001_pos, largest_pool_001_pos, clone_001_pos.ksh,
clone_001_pos - Skip until layering pools on zvols is solid.
* largest_pool_001_pos - Limited to 7eb pool, maximum
supported size is 8eb-1 on Linux.
* zpool_expand_001_pos, zpool_expand_003_neg - Requires
additional support from the ZED, updated skip reason.
* zfs_rollback_001_pos, zfs_rollback_002_pos - Properly cleanup
busy mount points under Linux between test loops.
* privilege_001_pos, privilege_003_pos, rollback_003_pos,
threadsappend_001_pos - Skip with log_unsupported.
* snapshot_016_pos - No changes required.
* snapshot_008_pos - Increased LIMIT from 512K to 2M and added
sync_pool to avoid false positives.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6128
4a3a99 lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
d5e7ca LZ4 : fix the data abort issue
bea2b5 lib/lz4: Pull out constant tables
99b7e9 lz4: fix system halt at boot kernel on x86_64
Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #5975
Closes #5973
Alek P [Fri, 19 May 2017 19:33:11 +0000 (12:33 -0700)]
Implemented zpool sync command
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)' but 'zpool sync'
will return when data has hit the main storage instead of potentially
just the ZIL as is the case with the 'sync(2)' cmd.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6122
Tony Hutter [Fri, 19 May 2017 19:30:16 +0000 (12:30 -0700)]
Force fault a vdev with 'zpool offline -f'
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline. Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED. The -f faults
persist across imports, unless they were set with the temporary
(-t) flag. Both persistent and temporary faults can be cleared
with zpool clear.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6094
Tom Caputi [Fri, 19 May 2017 00:35:49 +0000 (20:35 -0400)]
Fixed small memory leak in ereport handling
One pre-check in zfs_ereport_start() was being called after
the nvlists were being allocated. This simply corrects that
issue.
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6140
Brian Behlendorf [Thu, 18 May 2017 17:02:16 +0000 (10:02 -0700)]
Fix large dnode send stream flag conflict
Bit 21 of the send stream flags was inadvertently used for two
different features under concurrent development. To avoid any
future compatibility problems the large dnode flag is being
switched to bit 23 which is unused.
The large dnode feature has only been present in pre-releases of
ZoL and dnodesize defaults to legacy which is compatible with
existing OpenZFS implementations. Users with dnodesize=auto
needing to use zfs send/recv must update ZoL on both the
source and destination systems.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6139
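A sketch of why sharing a bit is a compatibility problem. The macro names are hypothetical; only the bit positions 21 and 23 come from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

#define	FEATURE_OTHER_BIT21	(1ULL << 21)	/* hypothetical other feature */
#define	FEATURE_LARGE_DNODE_OLD	(1ULL << 21)	/* conflicting assignment */
#define	FEATURE_LARGE_DNODE_NEW	(1ULL << 23)	/* after the fix */

/* Test whether a stream's feature flags include `feature`. */
static bool
has_feature(uint64_t flags, uint64_t feature)
{
	return ((flags & feature) != 0);
}
```

A stream that sets only the other bit-21 feature would be read as a large dnode stream under the old assignment, and not under the new one.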
Boris Protopopov [Wed, 10 May 2017 17:51:29 +0000 (13:51 -0400)]
Introduce zv_state_lock
The lock is designed to protect internal state of zvol_state_t and
to avoid taking spa_namespace_lock (e.g. in dmu_objset_own() code path)
while holding zvol_state_lock. Refactor the code accordingly.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3484
Closes #6065
Closes #6134
Fix lock order inversion with zvol_open() as it did not account
for use of zvols as vdevs. The latter use cases resulted in the
lock order inversion deadlocks that involved spa_namespace_lock
and bdev->bd_mutex.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6065
Issue #6134
Isaac Huang [Sat, 13 May 2017 00:28:03 +0000 (18:28 -0600)]
Skip spurious resilver IO on raidz vdev
On a raidz vdev, a block that does not span all child vdevs, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, the current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.
This patch eliminates such spurious IOs.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5316
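A sketch of the check, assuming the usual raidz layout in which a block's sectors (parity included) go to consecutive children starting at column (offset >> ashift) % children. The function name and the child_has_outage array are hypothetical simplifications.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a block at `offset` with physical size `psize` touches
 * any child that had an outage, and therefore needs to be resilvered.
 */
static bool
raidz_need_resilver(const bool *child_has_outage, uint64_t dcols,
    uint64_t nparity, uint64_t ashift, uint64_t offset, uint64_t psize)
{
	uint64_t b = offset >> ashift;			/* starting sector */
	uint64_t s = ((psize - 1) >> ashift) + 1;	/* size in sectors */
	uint64_t f = b % dcols;				/* first column */

	/* The block spans every child, so any outage affects it. */
	if (s + nparity >= dcols)
		return (true);

	/* Otherwise check only the children the block actually uses. */
	for (uint64_t c = 0; c < s + nparity; c++) {
		if (child_has_outage[(f + c) % dcols])
			return (true);
	}
	return (false);
}
```

On a 6-wide raidz1 with 4K sectors, a single-sector block at offset 0 uses only children 0 and 1, so an outage on child 5 does not require resilvering it.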
Brian Behlendorf [Thu, 11 May 2017 21:27:57 +0000 (14:27 -0700)]
Enable additional test cases
Enable additional test cases, in most cases this required a few
minor modifications to the test scripts. In a few cases a real
bug was uncovered and fixed. And in a handful of cases where pools
are layered on pools the test case will be skipped until this is
supported. Details below for each test case.
* zpool_add_004_pos - Skip test on Linux until adding zvols to pools
is fully supported and deadlock free.
* zpool_add_005_pos.ksh - Skip dumpadm portion of the test which isn't
relevant for Linux. The find_vfstab_dev, find_mnttab_dev, and
save_dump_dev functions were updated accordingly for Linux. Add
O_EXCL to the in-use check to prevent the -f (force) option from
working for mounted filesystems and improve the resulting error.
* zpool_add_006_pos - Update test case such that it doesn't depend
on nested pools. Switch to truncate from mkfile to reduce space
requirements and speed up the test case.
* zpool_clear_001_pos - Speed up test case by filling filesystem to
25% capacity.
* zpool_create_002_pos, zpool_create_004_pos - Use sparse files for
file vdevs in order to avoid increasing the partition size.
* zpool_create_006_pos - 6ba1ce9 allows raidz+mirror configs with
similar redundancy. Updating the valid_args and forced_args cases.
* zpool_create_011_neg - Fix to correctly create the extra partition.
Modified zpool_vdev.c to use fstat64_blk() wrapper which includes
the st_size even for block devices.
* zpool_create_012_neg - Updated to properly find swap devices.
* zpool_create_014_neg, zpool_create_015_neg - Updated to use
swap_setup() and swap_cleanup() wrappers which do the right thing
on Linux and Illumos. Removed '-n' option which succeeds under
Linux due to differences in the in-use checks.
* zpool_create_016_pos.ksh - Skipped test case isn't useful.
* zpool_create_020_pos - Added missing / to cleanup() function.
Remove cache file prior to test to ensure a clean environment
and avoid false positives.
* zpool_destroy_001_pos - Removed test case which creates a pool on
a zvol. This is more likely to deadlock under Linux and has never
been completely supported on any platform.
* zpool_destroy_002_pos - 'zpool destroy -f' is unsupported on Linux.
Mount points must not be busy in order to be unmounted.
* zfs_destroy_001_pos - Handle EBUSY error which can occur with
volumes when racing with udev.
* zpool_expand_001_pos, zpool_expand_003_neg - Skip test on Linux
until adding zvols to pools is fully supported and deadlock free.
The test could be modified to use loop-back devices but it would
be preferable to use the test case as is for improved coverage.
* zpool_export_004_pos - Updated test case such that it doesn't
depend on nested pools. Normal file vdev under /var/tmp are fine.
* zpool_import_all_001_pos - Updated to skip partition 1, which is
known as slice 2, on Illumos. This prevents overwriting the
default TESTPOOL which was causing the failure.
* zpool_import_002_pos, zpool_import_012_pos - No changes needed.
* zpool_remove_003_pos - No changes needed
* zpool_upgrade_002_pos, zpool_upgrade_004_pos - Root cause addressed
by upstream OpenZFS commit 3b7f360.
* zpool_upgrade_007_pos - Disabled in test case due to known failure.
Opened issue https://github.com/zfsonlinux/zfs/issues/6112
* zvol_misc_002_pos - Updated to use ext2.
* zvol_misc_001_neg, zvol_misc_003_neg, zvol_misc_004_pos,
zvol_misc_005_neg, zvol_misc_006_pos - Moved to skip list, these
test cases could be updated to use Linux's crash dump facility.
* zvol_swap_* - Updated to use swap_setup/swap_cleanup helpers.
File creation switched from /tmp to /var/tmp. Enabled minimal
useful tests for Linux, skip test cases which aren't applicable.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3484
Issue #5634
Issue #2437
Issue #5202
Issue #4034
Closes #6095
Matthew Ahrens [Mon, 24 Apr 2017 16:34:36 +0000 (09:34 -0700)]
OpenZFS 8063 - verify that we do not attempt to access inactive txg
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru>
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.
Porting Notes:
- ASSERTV added to txg_sync_waiting() for unused variable.
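The check described in this commit can be illustrated with a minimal Python sketch. This is a userspace model, not the ZFS code: the function names, the explicit open_txg argument, and the four-slot state array are assumptions made for the example. Only the count of three active TXGs comes from the text above.

```python
# Hypothetical model of the "only touch active TXGs" assertion; not ZFS source.
TXG_CONCURRENT_STATES = 3  # open, quiescing, syncing (from the commit text)
TXG_SLOTS = 4              # assumed number of per-txg slots for this sketch


def is_active_txg(txg, open_txg):
    """Return True if txg is one of the three TXGs that may hold live state.

    Assumes the active TXGs are the open one and the two immediately
    before it (quiescing and syncing).
    """
    oldest_active = open_txg - (TXG_CONCURRENT_STATES - 1)
    return oldest_active <= txg <= open_txg


def modify_per_txg_state(state, txg, open_txg, value):
    """Store value in the per-txg slot, asserting that txg is active."""
    assert is_active_txg(txg, open_txg), "attempt to access inactive txg"
    state[txg % TXG_SLOTS] = value
```

With open_txg = 10, TXGs 8 through 10 pass the check, while 7 (already synced) and 11 (not yet open) trip the assertion.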
Matthew Ahrens [Wed, 10 May 2017 17:32:40 +0000 (10:32 -0700)]
OpenZFS 8166 - zpool scrub thinks it repaired offline device
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com>
If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged. When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started. The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.
The fix is to never clear the DTL of offline devices. Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.
The problem can be worked around by running "zpool scrub" after
"zpool online".
Tom Caputi [Wed, 10 May 2017 17:25:27 +0000 (13:25 -0400)]
Add missing arc_free_cksum() to arc_release()
The arc layer tracks checksums of its data in the arc header
so that it can ensure that buffers haven't changed when they're
not supposed to. This checksum is only maintained while there
is an uncompressed buffer still attached to the header.
Unfortunately there is a missing call to arc_free_cksum() in
arc_release() that can trigger ASSERTs. This has not been a
common issue because the checksums are only maintained for
debug builds and triggering the bug requires writing a block
(and therefore calling arc_release()) while a compressed buffer
is still being used on a debug build. This simply corrects the
issue.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6105
LOLi [Tue, 9 May 2017 23:21:09 +0000 (01:21 +0200)]
Add property overriding (-o|-x) to 'zfs receive'
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.
This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.
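The effect of "-o" and "-x" on the properties of a received dataset can be modelled with a short Python sketch. It is only an illustration of the semantics described above: the function name is hypothetical, excluded properties are simply dropped from the result, and rejecting a property that is both overridden and excluded is an assumption of the sketch rather than documented behaviour.

```python
def effective_properties(received, overrides=None, excludes=None):
    """Return the properties a received dataset would end up with.

    received  -- dict of properties carried in the send stream
    overrides -- dict modelling "-o property=value" arguments
    excludes  -- iterable modelling "-x property" arguments
    """
    overrides = overrides or {}
    excludes = set(excludes or ())
    both = excludes & set(overrides)
    if both:
        # Treating this as an error is an assumption of the sketch.
        raise ValueError("overridden and excluded: %s" % sorted(both))
    # Excluded properties are dropped; the receiver would fall back to
    # whatever value applies locally.
    result = {k: v for k, v in received.items() if k not in excludes}
    # Overrides replace received values and may add new properties.
    result.update(overrides)
    return result
```

For example, overriding mountpoint and excluding canmount leaves compression as sent, replaces mountpoint, and omits canmount, all without touching the source dataset.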
Document the existence of `createtxg` and `guid` native properties
in man pages and zfs command output.
One of the great features of ZFS is incremental replication of
snapshots, possibly between pools on different machines.
Shell scripts are commonly used to automate this procedure. They have to
find the most recent common snapshot between both sides and then
perform incremental send & recv.
Currently, scripts rely on the sorting order of `zfs list`, which
defaults to `createtxg`, and the assumption that snapshot names on
either side do not change.
By making `createtxg` and `guid` part of the public ZFS interface,
scripts are enabled to use
a) `createtxg` to determine the logical & temporal order of snapshots
(the creation property is not an equivalent substitute since
multiple snapshots may be created within one second)
b) `guid` to uniquely identify a snapshot, independent of its current
display name
This has the potential of making scripts safer and more correct.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #6102
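The snapshot matching that `guid` and `createtxg` make possible could be sketched as follows. This is a hedged illustration in Python, not part of the commit: the Snapshot tuple and function name are hypothetical, and the sketch assumes the caller has already collected name, guid and createtxg for the snapshots on each side.

```python
from collections import namedtuple

# Hypothetical record for one snapshot as seen by a replication script.
Snapshot = namedtuple("Snapshot", ["name", "guid", "createtxg"])


def latest_common_snapshot(source, destination):
    """Return the newest source snapshot that also exists on the destination.

    Snapshots are matched by guid, so a rename on either side does not
    matter. Ordering uses the source's createtxg, since that reflects
    creation order on the source pool. Returns None if nothing matches.
    """
    dest_guids = {snap.guid for snap in destination}
    common = [snap for snap in source if snap.guid in dest_guids]
    if not common:
        return None
    return max(common, key=lambda snap: snap.createtxg)
```

The returned snapshot would be the starting point for the next incremental send. A result of None means no common base exists and a full send is required.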
LOLi [Tue, 9 May 2017 22:22:46 +0000 (00:22 +0200)]
Fix NULL pointer dereference in 'zfs create'
A race condition between 'zpool export' and 'zfs create' can crash the
latter: this is because we never check libzfs`zpool_open() return
value in libzfs`zfs_create().
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6096
LOLi [Tue, 9 May 2017 18:51:40 +0000 (20:51 +0200)]
Fix zfs .deb package warning in prerm script
Debian zfs package generated by alien doesn't call the prerm script
(rpm's %preun) with an integer as first parameter, which results in
the following warning:
"zfs.prerm: line 2: [: remove: integer expression expected"
Modify the if-condition to avoid the warning.
Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6108
Richard Yao [Fri, 11 Jul 2014 18:35:58 +0000 (14:35 -0400)]
Enable Linux read-ahead for a single page on ZVOLs
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.
Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #5902
RageLtMan [Sat, 18 Mar 2017 04:51:36 +0000 (00:51 -0400)]
Disable write merging on ZVOLs
The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.
Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.
Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.
Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
The send-c_volume test case has been observed to occasionally
fail on 32-bit systems. Until this issue is fully understood
disable this test case.
The rsend_014_pos test case can occasionally fail due to an
EBUSY during export. This can lead to subsequent test failures.
Resolve the issue by retrying the export on EBUSY. Additionally,
remove the gratuitous use of eval.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6088
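The retry-on-EBUSY approach used for the export can be shown with a generic Python helper. The test suite itself is written in ksh, so this is only a sketch of the pattern: the function name, attempt count, and delay are assumptions.

```python
import errno
import time


def retry_on_ebusy(operation, attempts=5, delay=1.0, sleep=time.sleep):
    """Call operation(), retrying while it raises OSError with errno EBUSY.

    Any other error, or EBUSY on the final attempt, is re-raised.
    The sleep argument exists so callers can substitute a stub.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as err:
            if err.errno != errno.EBUSY or attempt == attempts:
                raise
            sleep(delay)
```

Bounding the number of attempts keeps a pool that is busy for a real reason from hanging the test run. The last EBUSY is surfaced to the caller instead.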
* zfs_destroy_001_pos - Unable to reproduce the failures locally.
Re-enabled to determine observed buildbot failure rate.
* zfs_destroy_005_neg - Updated for expected Linux behavior.
Busy mount points, even snapshots, are expected to fail.
* zfs_destroy_010_pos - Resolved transient EBUSY with retry.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5635
Issue #5893
Closes #6091
LOLi [Wed, 3 May 2017 16:31:05 +0000 (18:31 +0200)]
More ashift improvements
This commit allows higher ashift values (up to 16) in 'zpool create'.
The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability to safely roll back a pool to recover it.
Since b02fe35 the largest uberblock size we support is 8K: this
allows us to store a minimum number of 16 uberblocks in the vdev
label, even with higher ashift values.
Additionally change 'ashift' pool property behaviour: if set it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.
Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.
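The arithmetic behind the "minimum number of 16 uberblocks" can be checked with a small Python sketch. The 128K ring size and the 8K maximum uberblock size come from the text above. The 1K minimum slot size and the constant names are assumptions of this example.

```python
VDEV_UBERBLOCK_RING = 128 * 1024  # ring size stated in the commit text
MAX_UBERBLOCK_SHIFT = 13          # 8K maximum uberblock size (from the text)
MIN_UBERBLOCK_SHIFT = 10          # assumed 1K minimum; not stated in the text


def uberblock_count(ashift):
    """Number of uberblock slots in the label ring for a given ashift.

    The slot size follows the ashift but is clamped to the range
    [1K, 8K], so large ashift values no longer shrink the ring.
    """
    shift = min(max(ashift, MIN_UBERBLOCK_SHIFT), MAX_UBERBLOCK_SHIFT)
    return VDEV_UBERBLOCK_RING >> shift
```

Under these assumptions, ashift=16 yields 16 slots, the same as ashift=13. Without the 8K cap, a 64K slot size would leave only 128K / 64K = 2 uberblocks to roll back to.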
Olaf Faaland [Tue, 2 May 2017 20:55:24 +0000 (13:55 -0700)]
Write label 2,3 uberblocks when vdev expands
When vdev_psize increases, the location of labels 2 and 3 changes
because their location is relative to the end of the device.
The configs for labels 2 and 3 are written during the next spa_sync()
because the vdev is added to the dirty config list. However, the
uberblock rings are not re-written in their new location, leaving the
device vulnerable to the beginning of the device being overwritten or
damaged.
This patch copies the uberblock ring from label 0 to labels 2 and 3,
in their new locations, at the next sync after vdev_psize increases.
Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks
are copied.
Reviewed-by: BearBabyLiu <liu.huang@zte.com.cn> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5108
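Why labels 2 and 3 move when the vdev grows can be seen from a short Python model of the label layout. The commit text only says these labels are placed relative to the end of the device. The 256K label size, the constant names, and the function are assumptions made for illustration.

```python
VDEV_LABELS = 4
LABEL_SIZE = 256 * 1024  # assumed label size; not stated in the commit text


def label_offset(psize, label):
    """Byte offset of a label on a device of psize bytes.

    Labels 0 and 1 sit at the front of the device; labels 2 and 3 sit
    at the end. psize is assumed to be a multiple of LABEL_SIZE.
    """
    if not 0 <= label < VDEV_LABELS:
        raise ValueError("label must be in 0..3")
    if label < VDEV_LABELS // 2:
        return label * LABEL_SIZE
    return psize - (VDEV_LABELS - label) * LABEL_SIZE
```

Growing psize leaves the offsets of labels 0 and 1 unchanged but shifts labels 2 and 3, which is why their uberblock rings have to be rewritten at the new location.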
LOLi [Tue, 2 May 2017 20:43:53 +0000 (22:43 +0200)]
Add zfs_nicebytes() to print human-readable sizes
Some 'zfs', 'zpool' and 'zdb' output strings can be confusing to the
user when no units are specified. This adds a new zfs_nicenum_format
"ZFS_NICENUM_BYTES" used to print bytes in their human-readable form.
Additionally, update some test cases to use machine-parsable 'zfs get'.
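The idea of printing byte counts with an explicit unit can be sketched in Python. This is not a port of zfs_nicebytes() and makes no claim to match its exact output. The function name, the unit letters, and the one-decimal rounding are assumptions chosen for the example.

```python
def nice_bytes(num_bytes):
    """Format a byte count with a binary-unit suffix (B, K, M, ...)."""
    if num_bytes < 0:
        raise ValueError("byte count must be non-negative")
    units = ["B", "K", "M", "G", "T", "P", "E"]
    value = float(num_bytes)
    index = 0
    # Divide by 1024 until the value fits the current unit.
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    if index == 0:
        return "%dB" % num_bytes
    if value == int(value):
        return "%d%s" % (value, units[index])
    return "%.1f%s" % (value, units[index])
```

Always attaching a suffix, including "B" for small values, is what removes the ambiguity the commit describes: a bare "512" could be bytes or blocks, while "512B" cannot.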
When multiple filesystems are in use, memory pressure causes arc_cache
to collapse to a minimum. Allow arc_cache to maintain proportional size
even when hit rates are disproportionate. We do this only via evictable
size from the kernel shrinker, thus it's only in effect under memory
pressure.
AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #6035
Don't run the reaper if we didn't shrink the cache
Calling it when nothing is evictable will cause extra kswapd CPU usage.
Also, if we didn't shrink, it's unlikely there is memory to reap because
we likely just called it microseconds ago. The exception is if we are in
direct reclaim.
You can see how hard this is being hit in kswapd with a light test
workload:
Lock contention, by itself, shouldn't indicate a stop condition to the
kernel's slab shrinker. Doing so can cause stalls when the kernel is
trying to free large parts of the cache such as is done by drop_caches.
Also, perhaps arc_reclaim_lock should be a spinlock, and this code
eliminated.
AKAMAI: zfs: CR 3593801 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Move arcstat_need_free increment from all direct calls to when
arc_reclaim_lock is busy and we exit without doing anything. Data will
be reclaimed in the reclaim thread. The previous location meant that we
both reclaim the memory in this thread, and also schedule the same
amount of memory for reclaim in arc_reclaim, effectively doubling the
requested reclaim.
AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
jxiong [Tue, 2 May 2017 17:06:18 +0000 (10:06 -0700)]
minor improvement to abd_free_pages()
A loop is not needed to free the pages in a single scatterlist entry
because the entry should be a single page or a compound page. The pages
can be freed with one call to __free_pages() in both cases.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Closes #6057
jxiong [Tue, 2 May 2017 17:04:30 +0000 (10:04 -0700)]
Guarantee PAGESIZE alignment for large zio buffers
In the current implementation, only zio buffers of 16KB and larger are
guaranteed PAGESIZE alignment. This breaks Lustre since it assumes
that 'arc_buf_t::b_data' must be page aligned when zio buffers are
greater than or equal to PAGESIZE.
This patch makes zio buffers PAGESIZE aligned when their size is
at least PAGESIZE.
This change may waste a small amount of memory, but that should be
fine because, after the introduction of ABD, zio buffers are used to
hold data temporarily and live in memory for a short while.
Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes #6084
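The alignment rule from this commit, and the kind of memory waste it mentions, can be illustrated with a small Python sketch. The 4K page size, the 512-byte alignment for smaller buffers, and both function names are assumptions. Only the rule "size >= PAGESIZE implies PAGESIZE alignment" comes from the text.

```python
PAGESIZE = 4096    # assumed page size for the example
SMALL_ALIGN = 512  # assumed alignment for sub-page buffers


def zio_buf_alignment(size, pagesize=PAGESIZE):
    """Alignment a buffer of `size` bytes gets under the rule described above."""
    return pagesize if size >= pagesize else SMALL_ALIGN


def round_up(size, align):
    """Round size up to a multiple of align (align must be a power of two)."""
    return (size + align - 1) & ~(align - 1)
```

For instance, if a 5K buffer has to occupy whole page-aligned space, round_up(5120, 4096) gives 8192, so 3072 bytes go unused. That is the sort of overhead the commit accepts because these buffers are short-lived.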
All filesystems were converted to dynamically allocated BDIs. The
destruction of backing_dev_info structures is handled as part of
super block destruction. Refactor the code to abstract away the
details of creating and destroying a BDI.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6089
OpenZFS 7786 - zfs`vdev_online() needs better notification about state changes
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: bunder2015 <omfgbunder@gmail.com>
OpenZFS-issue: https://www.illumos.org/issues/7786
OpenZFS-commit: http://github.com/openzfs/openzfs/commit/db8498f
Closes #6074
Chunwei Chen [Thu, 23 Feb 2017 00:08:04 +0000 (16:08 -0800)]
Reinstate zvol_taskq to fix aio on zvol
Commit 37f9dac removed the zvol_taskq for processing zvol requests.
This was removed as part of switching to make_request_fn and was
motivated by a concern at the time over dispatch latency.
However, this also made all bio requests synchronous, and caused
serious performance issues as the bio submitter would wait for
every bio it submitted, effectively making the IO depth 1.
This patch reinstates zvol_taskq, and to make sure overlapped I/Os
are ordered properly, we take a range lock in zvol_request, and pass
it along with the bio to the I/O functions zvol_{write,discard,read}.
In order to facilitate benchmarks a zvol_request_sync module
option was added to switch between sync and async request handling.
For the moment, the default behavior is synchronous but this is
likely to change pending additional testing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5824
Tim Chase [Tue, 25 Apr 2017 04:01:04 +0000 (23:01 -0500)]
Update documentation for zfs_vdev_queue_depth_pct
It was documented as being related to zfs_vdev_async_max_active
when it is actually related to zfs_vdev_async_write_max_active.
Also, expand the documentation to describe the allocation throttle
which was introduced as part of OpenZFS 7090 in 3dfb57a.
Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #6064
Dan Kimmel [Tue, 11 Apr 2017 21:56:54 +0000 (21:56 +0000)]
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7628 - create long versions of ZFS send / receive options
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Ported-by: bunder2015 <omfgbunder@gmail.com> Ported-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Most of 7252 was already picked up during ABD work. This
commit represents the gap from the final commit to openzfs.
- Fixed split_large_blocks check in do_dump()
- An alternate version of the write_compressible() function was
implemented for Linux which does not depend on fio. The behavior
of fio differs significantly based on the exact version.
- mkholes was replaced with truncate for Linux.
After running for a long time with QAT compression, the variable
"inst_num" is overflowed by "atomic_inc_32_nv", which causes its
neighboring variable to be overwritten. Change its definition from U16
to U32.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #6051
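How a 32-bit atomic increment on a 16-bit variable can corrupt its neighbor is shown by the following Python model. It is a sketch only: the function name is hypothetical, and it assumes a little-endian layout in which the 16-bit "inst_num" is stored directly before the neighboring 16-bit variable.

```python
import struct


def inc32_on_u16_pair(inst_num, neighbor):
    """Apply a 32-bit increment to memory holding two adjacent 16-bit values.

    Models a little-endian layout where inst_num is stored first and
    neighbor directly after it (both are assumptions of this sketch).
    Returns the new (inst_num, neighbor) pair.
    """
    raw = struct.pack("<HH", inst_num, neighbor)  # 4 bytes of "memory"
    (word,) = struct.unpack("<I", raw)            # view them as one u32
    word = (word + 1) & 0xFFFFFFFF                # the 32-bit increment
    return struct.unpack("<HH", struct.pack("<I", word))
```

While inst_num is below 0xFFFF the neighbor is untouched, which is why the bug only appears after a long run. Once inst_num wraps, the carry spills into the neighbor's bytes. Widening the variable to 32 bits makes the increment match the storage it operates on.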
Matthew Ahrens [Thu, 13 Apr 2017 21:35:00 +0000 (14:35 -0700)]
OpenZFS 8025 - dbuf_read() creates unnecessary zio_root() for bonus buf
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
dbuf_read() creates a zio_root() to track and wait for all the zio's
that may happen as part of this call. However, if the blkptr_t for
this buffer is NULL or a hole, we will not create any more zio's, so
this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.
The fix is to only create/destroy the zio_root() in dbuf_read() if the
blkptr is not NULL and not a hole.
Porting Notes:
- The error handling for when dbuf_read_impl() fails which was
originally added in commit 5f6d0b6f5 has been preserved.
Don Brady [Mon, 24 Apr 2017 17:31:45 +0000 (11:31 -0600)]
Fixed zdb -e regression for active cacheless pools
zdb -e for active cache-less pools fails:
$ sudo zpool create -o cachefile=none basic mirror sdk sdl
$ sudo zdb -e -b basic
zdb: can't open 'basic': No such file or directory
This is a recent regression introduced by commit c30d8de.
Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6059