Adding VPATH support, commit 47a4a6f, required that a `src`
and `obj` line be added to the top of the Makefiles. They
must be removed from the Makefiles when builtin.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/spl#481
Issue zfsonlinux/spl#498
Chunwei Chen [Mon, 23 Nov 2015 23:06:46 +0000 (15:06 -0800)]
Linux 4.4 compat: xattr operations takes xattr_handler
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
Chunwei Chen [Mon, 23 Nov 2015 22:47:29 +0000 (14:47 -0800)]
Linux 4.4 compat: make_request_fn returns blk_qc_t
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
tuxoko [Fri, 30 Oct 2015 23:10:01 +0000 (16:10 -0700)]
Fix zfs_dirty_data_max overflow on 32-bit
On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow,
causing it to be smaller than zfs_dirty_data_sync, and will cause txg being
delayed while no one write to disk. The end result is horrendous write speed.
On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it
can reach speed on par with 64-bit VM.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3973
Chunwei Chen [Tue, 17 Nov 2015 00:39:52 +0000 (16:39 -0800)]
Fix snapshot automount behavior when concurrent or fail
When concurrent threads accessing the snapdir, one will succeed the user
helper mount while others will get EBUSY. However, the original code treats
those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which
causes repeated avl_add and thus panic.
Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from user helper mount. And it will cause
strange things as doing follow_down_one will fail and then follow_up will jump
up to the mountpoint of the filesystem and confuse the hell out of VFS.
The patch fix both behavior by returning 0 immediately for the EBUSY threads.
Note, this will have a side effect for the second case where the VFS will
retry several times before returning ELOOP.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4018
Jason Zaman [Sat, 24 Oct 2015 06:01:08 +0000 (14:01 +0800)]
sysmacros: Make P2ROUNDUP not trigger int overflow
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
Signed-off-by: Jason Zaman <jason@perfinion.com> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3949
Brian Behlendorf [Mon, 16 Nov 2015 17:47:43 +0000 (09:47 -0800)]
zimport.sh: Add configure/make option support
Allow the following environment variables to control the build
behavior of the zimport.sh script. This can be useful when you
want a debug build or require specific build options. The
default values are:
CONFIG_OPTIONS=""
MAKE_OPTIONS="-s -j$(nproc)"
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Mon, 16 Nov 2015 23:00:38 +0000 (15:00 -0800)]
Follow 0/-E convention for module load errors
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned. This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter. This is what was causing the hard failures
in the automated testing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
AndCycle [Tue, 10 Nov 2015 14:01:26 +0000 (22:01 +0800)]
Obey arc_meta_limit default size when changing arc_max
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit. Otherwise, the arc_meta_limit
may be set the same as arc_max.
Signed-off-by: AndCycle <andcycle@andcycle.idv.tw> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
The TEST file is provided as a hint to the automated test infra-
structure. It controls which regression tests are run and how they
are run. This file along with any lines in the commit messages
which start with TEST_* are sourced by the test scripts and can
be used to override the default values. For complete details see:
https://github.com/zfsonlinux/zfs-buildbot/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of gcc 5.1.1 20150618 (Red Hat 5.1.1-4) the -Werror=maybe-uninitialized
check detects that 'snapname' in recv_incremental_replication() may not be
initialized. Explicitly initialize the variable to resolved the warning.
libzfs_sendrecv.c: In function ‘recv_incremental_replication’:
libzfs_sendrecv.c:2019:2: error: ‘snapname’ may be used uninitialized in
(void) snprintf(buf, sizeof (buf), "%s@%s", fsname, snapname);
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Chunwei Chen [Fri, 9 Oct 2015 19:27:01 +0000 (12:27 -0700)]
Fix fail path in zfs_znode_alloc
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().
Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.
We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
Chunwei Chen [Tue, 13 Oct 2015 21:13:52 +0000 (14:13 -0700)]
Fix use-after-free in vdev_disk_physio_completion
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.
We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.
Justin T. Gibbs [Tue, 13 Oct 2015 21:09:45 +0000 (14:09 -0700)]
Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
James Lee [Sun, 20 Sep 2015 02:00:36 +0000 (22:00 -0400)]
zfs-import: Perform verbatim import using cache file
This change modifies the import service to use the default cache file
to perform a verbatim import of pools at boot. This fixes code that
searches all devices and imported all visible pools.
Using the cache file is in keeping with the way ZFS has always worked,
how Solaris, Illumos, FreeBSD, and systemd performs imports, and is how
it is written in the man page (zpool(1M,8)):
All pools in this cache are automatically imported when the
system boots.
Importantly, the cache contains important information for importing
multipath devices, and helps control which pools get imported in more
dynamic environments like SANs, which may have thousands of visible
and constantly changing pools, which the ZFS_POOL_EXCEPTIONS variable
is not equipped to handle. Verbatim imports prevent rogue pools from
being automatically imported and mounted where they shouldn't be.
The change also stops the service from exporting pools at shutdown.
Exporting pools is only meant to be performed explicitly by the
administrator of the system.
The old behavior of searching and importing all visible pools is
preserved and can be switched on by heeding the warning and toggling
the ZPOOL_IMPORT_ALL_VISIBLE variable in /etc/default/zfs.
Signed-off-by: James Lee <jlee@thestaticvoid.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3777
Closes #3526
Brian Behlendorf [Tue, 13 Oct 2015 16:17:01 +0000 (09:17 -0700)]
Fix 'arc_c < arc_c_min' panic
Strictly enforce keeping 'arc_c >= arc_c_min'. The ASSERTs are
left in place to catch this in a debug build but logic has been
added to gracefully handle in a production build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3904
For consistency all systemd unit files and init scripts now share
the same names. This prevents an issue where the zed is started
twice on systems where both the systemd and sysv infrastructure is
installed concurrently.
For backward compatibility a 'zed' alias has been added. This
allows the user to interact with the service using either the
name 'zed' or 'zfs-zed'.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3837
Modern versions of dkms cleanup the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 866c162 which verifies the existance of the
zfs.release build product would never be true.
This patch resolves the issue by updating the conditional to check
in the explicitly installed zfs_config.h file for the version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3862
* Fix regression - "OVERLAY_MOUNTS" should have been "DO_OVERLAY_MOUNTS".
* Fix update-rc.d commands in postinst. Thanx to subzero79@GitHub.
* Fix make sure a filesystem exists before trying to mount in mount_fs()
* Fix local variable usage.
* Fix to read_mtab():
* Strip control characters (space - \040) from /proc/mounts GLOBALY,
not just first occurrence.
* Don't replace unprintable characters ([/-. ]) for use in the variable
name with underscore. No need, just remove them all together.
* Add check_boolean() to check if a user configure option is
set ('yes', 'Yes', 'YES' or any combination there of) OR '1'.
Anything else is considered 'unset'.
* Add a ZFS_POOL_IMPORT to the default config.
* This is a semi colon separated list of pools to import ONLY.
* This is intended for systems which have _a lot_ of pools (from
a SAN for example) and it would be to many to put in the
ZFS_POOL_EXCEPTIONS variable..
* Add a config option "ZPOOL_IMPORT_OPTS" for adding additional options
to "zpool import".
* Add documentation and the chance of overriding the ZPOOL_CACHE
variable in the config file.
* Remove "sort" from find_pools() and setup_snapshot_booting().
Sometimes not available, and not really necessary.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3816
When doing uioskip to skip an iovec to the very end, the current loop
condition will falsely check pass the end of iovec. We fix this checking
uio_iovcnt first.
The solution is to delete the assertion. This could potentially
occur in uiomove as well, which contains analogous assertions
that appear similarly unnecessary, so we remove those as well.
Reported-by: Jonathan Vasquez <jvasquez1011@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3792
Brian Behlendorf [Thu, 24 Sep 2015 23:32:25 +0000 (16:32 -0700)]
Fix synchronous behavior in __vdev_disk_physio()
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in. This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.
This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in a RCU stalls when a disk
was removed from the system. Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.
This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.
Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.
Brian Behlendorf [Wed, 23 Sep 2015 22:59:04 +0000 (15:59 -0700)]
Avoid blocking in arc_reclaim_thread()
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking. Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf(). This will deadlock, see
issue #3822 for full backtraces showing the problem.
To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function. This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer. Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3808
Issue #3834
Issue #3822
Brian Behlendorf [Wed, 23 Sep 2015 21:28:43 +0000 (14:28 -0700)]
Disable zpl_nr_cached_objects() callback
The zpl_nr_cached_objects() function has been disabled because in the
current code it doesn't provide any critical functionality and it may
result in a deadlock under certain circumstances. However, because
we expect to need these hooks in the future this code has not been
entirely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3719
Brian Behlendorf [Wed, 23 Sep 2015 20:00:28 +0000 (13:00 -0700)]
Allow NFS activity to defer snapshot unmounts
Accessing a snapshot via NFS should cause an auto-unmount of that
snapshot to be deferred until such as time as the snapshot is idle.
This is analogous to the zpl_revalidate logic employed by locally
mounted snapshots.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3794
Commit torvalds/linux@4246a0b63bd8f56a1469b12eafeb875b1041a451
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.
bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592bf ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
Don Brady [Thu, 17 Sep 2015 23:55:22 +0000 (17:55 -0600)]
Add large block support to zpios(1) benchmark
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.
`zpios ... -S size | --blocksize size ...`
This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`
Example run below
(non realistic results from a VM and output abbreviated for space)
Tab-indent continuation lines in the "scan:" section of "zpool status".
All other sections use a tab, which makes them easy to parse. Only the
"scan:" section had its continuation lines indented with four spaces.
This makes them consistent with the others.
Signed-off-by: Remy Blank <remy.blank@pobox.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3769
Ned Bass [Wed, 16 Sep 2015 09:49:09 +0000 (02:49 -0700)]
Honor xattr=sa dataset property
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit 0282c4137e7409e6d85289f4955adf07fac834f5. There are two issues with the
mount option handling:
* Libzfs has historically included "xattr" in its list of default mount
options. This overrides the dataset property, so the dataset is always
configured to use directory-based xattrs even when the xattr dataset
property is set to off or sa. Address this by removing "xattr" from
the set of default mount options in libzfs.
* There was no way to enable system attribute-based extended attributes
using temporary mount options. Add the mount options "saxattr" and
"dirxattr" which enable the xattr behavior their names suggest. This
approach has the advantages of mirroring the valid xattr dataset
property values and following existing conventions for mount option
names.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3787
Richard Yao [Fri, 18 Sep 2015 12:32:52 +0000 (08:32 -0400)]
Discard on zvols should not exceed the length of a block
37f9dac592bf5889c3efb305c48ac39b4c7dd140 replaced the end-start
calculation with a cached value, but neglected to update it on discard
operations. This can cause us to discard data not requested, causing
data loss on zvols.
Reported-by: Richard Connon <richard.connon@zynstra.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3798
Arne Jansen [Fri, 11 Sep 2015 16:18:56 +0000 (09:18 -0700)]
Illumos 6214 - zpools going south
6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t. This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held. See Illumos 6214
for a detailed analysis of the race.
HDR_GET_COMPRESS() macro was removed from arc_buf_info().
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3757
Brian Behlendorf [Tue, 18 Aug 2015 20:51:20 +0000 (13:51 -0700)]
Prefetch start and end of volumes
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume. Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
Richard Yao [Mon, 7 Sep 2015 16:03:19 +0000 (12:03 -0400)]
Reintroduce IO accounting on zvols on Linux 3.19+
zfsonlinux/zfs@e20cd6f7a8922709b1aa2ecefd783390102d79e0 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503bc40e32d7f54a9b817264e81ce131b4 provided
suitable symbols for restoring this functionality 4 months later. We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3741
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 24923575255422525836565501
timestamp message 1441141408 spa=tank async request task=1 1441141408 txg 70 open pool version 5000; software version 5000/5; ... 1441141409 spa=tank async request task=32 1441141409 txg 72 import pool version 5000; software version 5000/5; ... 1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3728
Brian Behlendorf [Fri, 28 Aug 2015 21:54:32 +0000 (14:54 -0700)]
Support accessing .zfs/snapshot via NFS
This patch is based on the previous work done by @andrey-ve and
@yshui. It triggers the automount by using kern_path() to traverse
to the known snapshout mount point. Once the snapshot is mounted
NFS can access the contents of the snapshot.
Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server. To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally. System administators desiring this
functionally must explicitly enable it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2797
Closes #1655
Closes #616
Richard Yao [Fri, 10 Oct 2014 15:23:23 +0000 (11:23 -0400)]
Support secure discard on zvols
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.
Richard Yao [Fri, 4 Jul 2014 22:43:47 +0000 (18:43 -0400)]
zvol processing should use struct bio
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Brian Behlendorf [Mon, 31 Aug 2015 23:46:01 +0000 (16:46 -0700)]
Add temporary mount options
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #985
Closes #3351
Richard Yao [Mon, 31 Aug 2015 15:36:32 +0000 (11:36 -0400)]
VDEV_REQ_FUA should be mapped to REQ_FUA
Pre-2.6.37 kernels support REQ_FUA in request flags, but not in BIO
flags. zvols are the only consumer of VDEV_REQ_FUA and since they are
passed requests, they should be obey the REQ_FUA flag like later
kernels. This optimization will only matter on 2.6.36 and 2.6.37 because
the zvol rework changes things to use bio, where we no longer are able
to distinguish on earlier kernels
Tim Chase [Mon, 31 Aug 2015 01:59:23 +0000 (20:59 -0500)]
Dbuf hash table should be sized as is the arc hash table
Commit 49ddb315066e372f31bda29a5c546a9eccc8b418 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table. The dbuf hash table's size should be determined
similarly.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3721
Allow for easy turning of a pools reserved free space. Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve. Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3724
James Lee [Sun, 30 Aug 2015 18:36:41 +0000 (14:36 -0400)]
Reorder zfs-* services to allow /var on separate dataset
ZED depends on /var. When /var is a separate dataset, it must be
mounted before starting ZED. This change moves the zfs-zed service
from starting first, to starting after zfs-mount, but before zfs-share.
As discussed in issue #3513, ZED does not need to start first in order
to consume events made during the zfs-import and zfs-mount services.
The events will be queued and can be handled later in the boot process.
ZED may, however, handle sharing in the future, so it should be started
before the zfs-share service.
This commit also stops the zfs-import service from writing temp files
to /var/tmp on shutdown and it corrects the return code for the OpenRC
service.
Other OpenRC-specific changes noted in issue #3513 were reitereated in
issue #3715 and committed in da619f3.
Signed-off-by: James Lee <jlee@thestaticvoid.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3513
Richard Yao [Sat, 29 Aug 2015 16:01:07 +0000 (12:01 -0400)]
Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
Before read locking z_teardown_inactive_lock, we need to check if we have
already had write lock on it. Otherwise, we would deadlock on ourself when
doing rollback:
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Fri, 24 Apr 2015 23:21:13 +0000 (16:21 -0700)]
Linux 3.18 compat: Snapshot auto-mounting
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels. And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients. This patch makes
the following core improvements.
* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.
* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process. This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.
* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.
* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds. This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.
* Snapshots in active use will not be automatically unmounted. As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.
* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated. This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots. This patch modifies the
behavior such that only negative dentries are invalidated. This solves
the issue and may result in a performance improvement.
Andrey Vesnovaty [Mon, 12 Aug 2013 18:47:04 +0000 (21:47 +0300)]
zfsctl: No need to sync ctldir inodes
There's no metadata to write to disk for ctldir inodes. So we check if
a inode belongs to the ctldir in zpl_commit_metadata, and returns
immediately if it is.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2797
loli10K [Sat, 29 Aug 2015 19:52:44 +0000 (21:52 +0200)]
Fix small typo
Add a missing space to the zfs_vdev_sync_write_min_active module
parameter description.
Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3714
Tim Chase [Thu, 27 Aug 2015 19:50:39 +0000 (14:50 -0500)]
Initialize the taskq entry embedded within struct vdev
As part of the stack reduction effort in 50b25b2187134ac7b19cf93bd35a420223f1d343, a zio_t containing a taskq_ent
was added to struct vdev_queue which itself is part of struct vdev.
The taskq entry should be initialized as is currently done in zio_create()
for newly-created bare zio_t object. The rationale is the same as is
described in f467b05a265abcfb8e5a3269f79d08f36a58646a.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3709
Add new keyword 'slot' to vdev_id.conf
This selects from where to get the slot number for a SAS/SATA disk
Needed to enable access to the physical position of a disk in a
Supermicro 2027R-AR24NV .
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3693
Tim Chase [Sun, 23 Aug 2015 14:58:11 +0000 (09:58 -0500)]
Allow recovery from corrupted snapshot maps
If the ZAP object containing a snapshot map is corrupted due to an
unrecoverable checksum error or otherwise, dsl_dataset_name() will
normally panic the system due to its VERIFY.
This patch attempts to allow a recovery avenue from such situations by
manufacturing a descriptive snapshot name and then ignoring the error.
Scrubbing a pool with this type of corruption will then show the affected
object in the error list rather than panicking.
The recovery code is only enabled when the zfs_recover module parameter
is set.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3705
Brian Behlendorf [Mon, 24 Aug 2015 21:18:48 +0000 (14:18 -0700)]
Check large block feature flag on volumes
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create. Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.
In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3591
Brian Behlendorf [Fri, 28 Aug 2015 00:01:59 +0000 (17:01 -0700)]
Limit max_hw_sectors_kb to 16M
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.
This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs. The solution is to restore the maximum
size to ~10M. This patch specifically changes it to 16M which is
close enough.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3684
Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.
With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.
The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3511
Closes #3640
Chunwei Chen [Mon, 11 May 2015 14:22:56 +0000 (22:22 +0800)]
Add compatibility layer for {kmap,kunmap}_atomic
Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2.
We use zfs_{kmap,kunmap}_atomic as wrappers and always take 2 argument, but
ignore the 2nd for newer kernel.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Tue, 28 Jul 2015 23:45:17 +0000 (16:45 -0700)]
Linux 4.2 compat: vfs_rename()
The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place. This way there can never
exist a partial version of the file, it's all or nothing.
Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms. Even doing
basic I/O to a file system from within the kernel is strongly
discouraged. In order to support this at all the vn_rename()
implementation ends up being complex and fragile. So fragile
that recent Linux 4.2 changes have broken it.
While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely. Then all this complex, fragile code can
be removed. Achieving this is straight forward because
config_write() is the only consumer of vn_rename().
This patch reworks spa_config_write() to update the cache file
in place. The file will be truncated, written out, and then
synced to disk. If an error is encountered the file will be
unlinked leaving the system in a consistent state.
This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file. However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.
If this were to somehow happen it poses no risk to pool. Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.
Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3653
Brian Behlendorf [Tue, 18 Aug 2015 18:20:22 +0000 (11:20 -0700)]
Handle zap_lookup() failure in ddt_object_load()
Failing to lookup a name in the spa_ddt_stat_object should not result
in a panic in ddt_object_load(). The error can be safely returned to
the caller for handling resulting in a useful user error message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3370
Chris Dunlop [Sun, 9 Aug 2015 12:38:18 +0000 (22:38 +1000)]
Linux 4.1 compat: configure bdi_setup_and_register()
Pull struct backing_dev_info off the stack: by linux-4.1 it's grown
past our 1024 byte stack frame warning limit resulting in an incorrect
configure result.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #3671
Tim Chase [Thu, 30 Jul 2015 04:11:32 +0000 (23:11 -0500)]
ztest: display non-index properties properly at verbose level 6
At verbosity levels of 6 or greater, ztest_dsl_prop_set_uint64() attempts
to display the value of all properties as indexed values regardless of
whether the property is an indexed value or simply an un-indexed integer.
This patch causes the numeric value of the property to be displayed if
zfs_prop_index_to_string() fails.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3649
Chris Dunlap [Tue, 28 Jul 2015 22:52:40 +0000 (15:52 -0700)]
Rework zed_notify_email for configurable PROG/OPTS
This commit reworks the zed_notify_email() function to allow
configuration of the mail executable and command-line arguments.
ZED_EMAIL_PROG specifies the name or path of the executable responsible
for sending notifications via email. This variable defaults to "mail".
ZED_EMAIL_OPTS specifies command-line options passed to ZED_EMAIL_PROG.
The following keyword substitutions are performed:
- @ADDRESS@ is replaced with the recipient email address(es)
- @SUBJECT@ is replaced with the notification subject
This variable defaults to "-s '@SUBJECT@' @ADDRESS@".
ZED_EMAIL_ADDR replaces ZED_EMAIL (although the latter is retained
for backward compatibility). This variable can contain multiple
addresses as long as they are delimited by whitespace.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3634
Closes #3631
Brian Behlendorf [Tue, 28 Jul 2015 18:30:00 +0000 (11:30 -0700)]
Update arc_memory_throttle() to check pageout
This brings the behavior of arc_memory_throttle() back in sync with
illumos. The updated memory throttling policy roughly goes like this:
* Never throttle if more than 10% of memory is free. This threshold
is configurable with the zfs_arc_lotsfree_percent module option.
* Minimize any throttling of kswapd even when free memory is below
the set threshold. Allow it to write out pages as quickly as
possible to help alleviate the memory pressure.
* Delay all other threads when free memory is below the set threshold
in order to avoid compounding the memory pressure. Buffers will be
evicted from the ARC to reduce the issue.
The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3637
Brian Behlendorf [Mon, 27 Jul 2015 20:17:32 +0000 (13:17 -0700)]
Update arc_available_memory() to check freemem
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages. This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim. Conceptually
this brings arc_available_memory() back in sync with illumos.
It is also desirable that the target amount of free memory be tunable
on a system. While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.
zfs_arc_sys_free - The target number of bytes the ARC should leave
as free memory on the system. This value can
checked in /proc/spl/kstat/zfs/arcstats and
setting this module option will override the
default value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3637
Brian Behlendorf [Mon, 27 Jul 2015 18:55:03 +0000 (11:55 -0700)]
Bound zvol_threads module option
The zvol_threads module option should be bounded to a reasonable
range. The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most. The default value of 32 is a reasonable
default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3614
Chunwei Chen [Thu, 5 Mar 2015 19:52:26 +0000 (11:52 -0800)]
Fix "BUG: Bad page state" caused by writeback flag
Commit d958324 fixed the deadlock between page lock and range lock by
unlocking the page lock before acquiring the range lock. However,
this created a new issue #3075.
The problem is that if we can't set the write back bit before releasing
the page lock. Then other processes will be unaware that the page is
under active write back. They may therefore truncate the page,
invalidate the page, or not honor the sync semantics.
To workaround this problem we re-dirty the page before dropping the
page lock. While this doesn't prevent the page from being truncated
it does ensure it won't be invalidated. Then the range lock and the
page lock are reacquired in the correct deadlock-free order.
Once both locks are safely held the page state can be rechecked. If
all is well and the page is in the expect state the dirty bit can be
removed, the write back bit set, and the page removed from the skip
count. If not the page will be handled as appropriate.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3075
Brian Behlendorf [Fri, 24 Jul 2015 17:08:31 +0000 (10:08 -0700)]
Align thread priority with Linux defaults
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority. Non-I/O filesystem
processes run with the default priority. ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active. The priorities have been updated to the following:
Note that under Linux the meaning of a processes priority is inverted
with respect to illumos. High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
This patch depends on https://github.com/zfsonlinux/spl/pull/466.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3607