Brian Behlendorf [Mon, 20 Jun 2016 21:28:51 +0000 (14:28 -0700)]
Add log_must_{retry,busy} helpers
Add helpers which automatically retry the provided command when
the error message matches the provided keyword. This provides an
easy way to handle the asynchronous nature of some ZFS commands.
For example, the `zfs destroy` command may need to be retried in
the case where the block device is unexpectedly busy. This can be
accomplished as follows:
log_must_busy $ZFS destroy ...
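The retry-on-keyword idea can be sketched in C as follows. This is only an illustration of the technique, not the ksh helper itself: cmd_fn, retry_on_keyword() and fake_destroy() are hypothetical names, and a real helper would likely also wait between attempts.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical command callback: returns 0 on success, otherwise nonzero
 * after writing an error message into errbuf.
 */
typedef int (*cmd_fn)(char *errbuf, size_t errlen, void *arg);

/*
 * Run cmd up to max_tries times. Retry only while the command fails AND
 * its error message contains keyword; any other failure ends the loop.
 */
static int
retry_on_keyword(cmd_fn cmd, void *arg, const char *keyword, int max_tries)
{
	char errbuf[256];
	int rc = -1;

	for (int i = 0; i < max_tries; i++) {
		errbuf[0] = '\0';
		rc = cmd(errbuf, sizeof (errbuf), arg);
		if (rc == 0)
			return (0);
		if (strstr(errbuf, keyword) == NULL)
			break;	/* unrelated failure: do not retry */
	}
	return (rc);
}

/* Example command: reports "busy" until the counter in arg reaches zero. */
static int
fake_destroy(char *errbuf, size_t errlen, void *arg)
{
	int *busy_left = arg;

	if (*busy_left > 0) {
		(*busy_left)--;
		snprintf(errbuf, errlen, "cannot destroy: dataset is busy");
		return (1);
	}
	return (0);
}
```

Calling retry_on_keyword(fake_destroy, &n, "busy", 5) succeeds as long as the command stops reporting "busy" within five attempts.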
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5002
liuhuang [Sun, 21 Aug 2016 23:40:54 +0000 (07:40 +0800)]
Update zfs_mount_005_pos.ksh and zfs_mount_010_neg.ksh
Update zfs_mount_005_pos.ksh and zfs_mount_010_neg.ksh to reflect
the expected Linux behavior. The is_linux wrapper is used so the
test case may be used on Linux and non-Linux platforms.
Signed-off-by: liuhuang <liu.huang@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5000
cao [Tue, 30 Aug 2016 11:32:22 +0000 (19:32 +0800)]
Delete unused zfsctl_snapdir_inactive declaration
zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7
this declaration remains even though the implementation was
removed in commit 278bee93. Also removed fastreboot_disable_highpil,
which is likewise unused.
Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5042
For quite some time I have been thinking about the possibility of
prefetching ZFS indirection tables while doing sequential reads or writes.
Recent changes in the predictive prefetcher made that much easier to
do. My tests on a zvol with 16KB block size on a 5x striped and 2x
mirrored pool of 10 disks show almost double the throughput on sequential
reads, and almost triple on sequential rewrites. While for reads a similar
effect can be achieved by increasing the maximal prefetch distance
(though at a higher memory cost), for rewrites there is no other
solution so far.
Authored by: Alexander Motin <mav@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6322
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413
Closes #5040
Porting notes:
- Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due
to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'
- Difference from upstream in module/zfs/dmu_zfetch.c,
uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance
- Variables have been initialized at the beginning of the function
(void dmu_zfetch) to resemble the order of occurrence and account
for C99, C11 mode errors.
GeLiXin [Thu, 25 Aug 2016 08:40:20 +0000 (16:40 +0800)]
Fix: Build warnings with different gcc optimization levels in debug mode
This fix resolves warnings reported when compiling with different gcc
optimization levels in debug mode.
Test tools:
gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago)
List of warnings:
CFLAGS=-O1 ./configure --enable-debug ;make
../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’:
../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function
../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here
CFLAGS=-Os ./configure --enable-debug ; make
libzfs_dataset.c: In function ‘zfs_prop_set_list’:
libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5022
Brian Behlendorf [Thu, 25 Aug 2016 20:24:01 +0000 (20:24 +0000)]
Fix cv_timedwait_hires
The user space implementation of cv_timedwait_hires() was always passing
a relative time to pthread_cond_timedwait() when an absolute time is
expected. This was accidentally introduced in commit 206971d2.
Replace two magic values with their corresponding preprocessor macro.
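The relative-versus-absolute distinction can be illustrated with a small helper. This is a sketch under assumptions, not the code from the fix: abs_deadline() and NSEC_PER_SEC are names chosen here, and rel_ns is assumed to be non-negative.

```c
#include <stdint.h>
#include <time.h>

#define	NSEC_PER_SEC	1000000000LL	/* named constant instead of a magic value */

/*
 * Convert a relative delay in nanoseconds into the absolute deadline that
 * pthread_cond_timedwait() expects, given the current time in now.
 */
static struct timespec
abs_deadline(struct timespec now, int64_t rel_ns)
{
	struct timespec ts;
	int64_t nsec = (int64_t)now.tv_nsec + rel_ns % NSEC_PER_SEC;

	ts.tv_sec = now.tv_sec + (time_t)(rel_ns / NSEC_PER_SEC);
	if (nsec >= NSEC_PER_SEC) {
		/* carry the overflowing nanoseconds into the seconds field */
		ts.tv_sec++;
		nsec -= NSEC_PER_SEC;
	}
	ts.tv_nsec = (long)nsec;
	return (ts);
}
```

A caller would typically obtain now from clock_gettime(CLOCK_REALTIME, &now) and hand the returned deadline to pthread_cond_timedwait().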
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #5024
GeLiXin [Thu, 11 Aug 2016 03:15:37 +0000 (11:15 +0800)]
Add zfs_arc_meta_limit_percent tunable
ARC will evict meta buffers that exceed the arc_meta_limit. Pending further
investigation into whether meta buffers deserve special protection,
this tunable makes arc_meta_limit adjustable for different workloads.
People can set zfs_arc_meta_limit_percent to any value when loading zfs.ko
with insmod, so a range check is added to guarantee a suitable arc_meta_limit.
As suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style
tunable as well.
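A percent-style tunable with a range check might look roughly like the sketch below. The helper name and the choice to clamp at 100 percent are assumptions made for illustration; the actual range check in the patch may differ.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive a metadata limit from a percent tunable.
 * Percentages above 100 are clamped so the limit never exceeds arc_max.
 */
static uint64_t
meta_limit_from_percent(uint64_t arc_max, uint64_t pct)
{
	if (pct > 100)
		pct = 100;
	/* divide first so the multiplication cannot overflow 64 bits */
	return (arc_max / 100 * pct);
}
```

Dividing before multiplying trades a little precision for overflow safety, which seems a reasonable choice for a value that is only a soft limit.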
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4957
GeLiXin [Mon, 22 Aug 2016 03:20:22 +0000 (11:20 +0800)]
Fix: Array bounds read in zprop_print_one_property()
If the loop index i reaches (ZFS_GET_NCOLS - 1), cbp->cb_columns[i + 1]
actually reads the data of cbp->cb_colwidths[0], which means the array
subscript is above the array bounds.
Luckily cbp->cb_colwidths[0] is always 0 and it seems we haven't
looped enough times to exceed the array bounds so far, but it remains
a latent risk.
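The general shape of such a fix is to test the index before touching element i + 1. The sketch below uses stand-in names (NCOLS, COL_NONE, is_last_column) rather than the real libzfs structures.

```c
#include <stddef.h>

#define	NCOLS		4U	/* stand-in for ZFS_GET_NCOLS */
#define	COL_NONE	0	/* stand-in for the "no column" terminator */

/*
 * Return nonzero when column i is the last one in use. The bounds check
 * comes first, so cols[i + 1] is never read when i == NCOLS - 1.
 */
static int
is_last_column(const int cols[NCOLS], size_t i)
{
	return (i == NCOLS - 1 || cols[i + 1] == COL_NONE);
}
```

Because || short-circuits, the out-of-bounds read simply cannot happen for the final index.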
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5003
Matthew Ahrens [Wed, 20 Jul 2016 22:42:13 +0000 (15:42 -0700)]
OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:
This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.
Sponsored by: Intel Corp.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7004
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109
Closes #4641
Closes #4972
Matthew Ahrens [Wed, 20 Jul 2016 22:39:55 +0000 (15:39 -0700)]
OpenZFS 7003 - zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7003
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108
Closes #4972
heary-cao [Sat, 6 Aug 2016 07:08:51 +0000 (15:08 +0800)]
Fix spa config generate memory leak in spa_load_best function
When a spa load retry succeeds and spa recovery is requested, the
generated config may be leaked in the spa_load_best function. Always free
the generated config when it is not assigned to the spa.
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4940
Paul Dagnelie [Tue, 9 Aug 2016 21:06:39 +0000 (23:06 +0200)]
OpenZFS 7176 - Yet another hole birth issue
This is another bug in the long line of hole-birth related issues. In
this particular case, it was discovered that a previous hole-birth fix
(illumos bug 6513, commit bc77ba73) did not cover as many cases as we
thought it did. While the fix worked in the case of hole-punching
(writing zeroes to a large part of a file), it did not deal with
truncation, and then writing beyond the new end of the file.
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7176
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad
Upstream-bugs: DLPX-46009
Porting notes:
- Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ;
keep previous position of the initialization
Nikolay Borisov [Tue, 16 Aug 2016 20:00:16 +0000 (23:00 +0300)]
Fix do_link portion of ctime test
From the man page of dirname: "Both dirname() and basename()
may modify the contents of path, so it may be desirable to pass
a copy when calling one of these functions." And in fact on Linux
dirname actually changes the contents of the passed parameter, as is
evident from the following failure when running the ctime test:
link(/root/zfs-mount, /root/zfs-mount/link_file)
Fix this by creating a copy of the input parameter and passing that
to dirname, thus not compromising the original parameter and allowing
the creation of the hard link to succeed.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4977
Matthew Ahrens [Thu, 21 Jul 2016 05:50:26 +0000 (22:50 -0700)]
It is not necessary to zero struct dbuf_hold_impl_data
Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a
considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to
`kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`,
which is around 2KiB. This structure is used as a stack, to limit the size of
the C stack as dbuf_hold() calls itself recursively. We make a recursive call
to hold the parent's dbuf when the requested dbuf is not found. The vast
majority of the time, the parent or grandparent indirect dbuf is cached, so the
number of recursive calls is very low. However, we initialize this entire
array for every call to dbuf_hold().
To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()`
instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all
members of the struct before they are used. I observed ~5% performance
improvement on a workload which creates many files.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4974
- Benchmark memory block is increased to 128KiB to reflect real block sizes more
accurately. Measurements include all three stages needed for checksum generation,
i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset
the overhead of the time function.
- Fastest implementation selects native and byteswap methods independently in
benchmark. To support this, new function pointers `init_byteswap()/fini_byteswap()`
are introduced.
- Implementation mutex lock is replaced by atomic variable.
- To save time, benchmark is not executed in userspace. Instead, the highest supported
implementation is used as the fastest. Default userspace selector is still 'cycle'.
- `fletcher_4_native/byteswap()` methods use incremental methods to finish
calculation if data size is not multiple of vector stride (currently 64B).
- Added `fletcher_4_native_varsize()` special purpose method for use when buffer size
is not known in advance. The method does not enforce 4B alignment on buffer size, and
will ignore last (size % 4) bytes of the data buffer.
- Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows
throughput for all supported implementations (in B/s), native and byteswap,
as well as the code [fastest] is running.
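The variable-size behavior described in the list above (ignoring the last size % 4 bytes) can be shown with a scalar sketch. This is not the vectorized implementation from the patch: f4_sums_t and fletcher4_varsize_sketch() are illustrative names, and only the native-endian case is covered.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four running sums, as in a Fletcher-4 style checksum. */
typedef struct {
	uint64_t a, b, c, d;
} f4_sums_t;

/*
 * Scalar Fletcher-4 style checksum over native-endian 32-bit words.
 * Any trailing (size % 4) bytes are ignored rather than rejected.
 */
static f4_sums_t
fletcher4_varsize_sketch(const void *buf, size_t size)
{
	f4_sums_t s = { 0, 0, 0, 0 };
	const unsigned char *p = buf;
	size_t nwords = size / sizeof (uint32_t);

	for (size_t i = 0; i < nwords; i++) {
		uint32_t w;

		/* memcpy keeps the load safe for unaligned buffers */
		memcpy(&w, p + i * sizeof (w), sizeof (w));
		s.a += w;
		s.b += s.a;
		s.c += s.b;
		s.d += s.c;
	}
	return (s);
}
```

An 11-byte buffer and its first 8 bytes therefore produce the same sums, because the final 3 bytes never form a complete word.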
Fletcher4 implementation using avx512f instruction set
Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per
loop iteration. Size alignment of main fletcher4 methods is adjusted
accordingly. New implementation is called 'avx512f'.
Note: byteswap method can be implemented more efficiently when avx512bw hardware
becomes available. Currently, it is ~ 2x slower than native method.
Table shows result of full (native) fletcher4 calculation for different buffer size:
Add support for AVX-512 family of instruction sets
This patch adds compiler and runtime tests (user and kernel) for following
instruction sets: avx512f, avx512cd, avx512er, avx512pf, avx512bw, avx512dq,
avx512vl, avx512ifma, avx512vbmi.
Note: Linux support for the AVX-512F (Foundation) instruction set started
with Linux v3.15
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
GeLiXin [Tue, 9 Aug 2016 09:49:51 +0000 (17:49 +0800)]
Fix incorrect pool state after import
After importing a raidz pool which has a vdev with a bad label, zpool status
shows the right state of the device, but the wrong state of the pool.
The pool state should be DEGRADED, not ONLINE.
The label is examined in vdev_validate during spa_load_impl; the bad
label is detected, but its state is not propagated to the parent.
There are other chances to propagate state in the following vdev_load
if we fail to load the DTL, but our pool is raidz1, which can tolerate a
faulted disk. So we lose the last chance to correct the pool state.
Propagate the leaf vdev's state to parent if its label was corrupted,
as is done elsewhere in vdev_validate.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com>
Closes #4948
Hans Rosenfeld [Wed, 27 Jul 2016 22:29:15 +0000 (15:29 -0700)]
OpenZFS 5997 - FRU field not set during pool creation and never updated
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com> Signed-off-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283
Porting Notes:
In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.
Events which would be logged as sysevents on illumos have
been mapped to the 'sysevent' class for Linux. In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.
The scripts provided by ZoL have been updated, however users who
provide scripts for any of the following events will need to rename
them based on the new subclass names.
luozhengzheng [Fri, 12 Aug 2016 09:41:28 +0000 (17:41 +0800)]
Fix a typo in ZIL write handling comment
The following comment in zil.h
* WR_COPIED:
* If we know we'll immediately be committing the
* transaction (FSYNC or FDSYNC), then we allocate a larger
* log record here for the data and copy the data in.
The word "the" should be "then".
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4961
Jason Zaman [Thu, 11 Aug 2016 15:59:03 +0000 (23:59 +0800)]
icp: mark asm files with noexec stack
If there is no explicit note in the .S files, the obj file will mark it
as requiring an executable stack. This is unneeded and causes issues on
hardened systems.
More info:
https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart
Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4947
Closes #4962
Jason Zaman [Tue, 9 Aug 2016 16:56:56 +0000 (00:56 +0800)]
icp: add no_const for PaX Compat
The constify plugin will automatically constify a class of types that contain
only function pointers. The icp structs fail to build if this is enabled with
the following error. The no_const attribute makes the plugin skip those
structs.
module/icp/spi/kcf_spi.c: In function ‘copy_ops_vector_v1’:
module/icp/spi/kcf_spi.c:61:16: error: assignment of read-only location ‘*dst_ops->cou.cou_v1.co_control_ops’
*((dst)->ops) = *((src)->ops);
^
module/icp/spi/kcf_spi.c:74:2: note: in expansion of macro ‘KCF_SPI_COPY_OPS’
KCF_SPI_COPY_OPS(src_ops, dst_ops, co_control_ops);
^
Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4947
Closes #4962
Brian Behlendorf [Thu, 11 Aug 2016 21:58:13 +0000 (14:58 -0700)]
Reorder HAVE_BIO_RW_* checks
The HAVE_BIO_RW_* #ifdef's must appear before the REQ_* #ifdef's
in the bio_is_flush() and bio_is_discard() macros. Linux 2.6.32
era kernels defined both sets of values, and the HAVE_BIO_RW_* variants
must be used in this case. This resulted in a panic in zconfig test 5.
Matthew Ahrens [Wed, 13 Jan 2016 18:45:08 +0000 (10:45 -0800)]
OpenZFS 7263 - deeply nested nvlist can overflow stack
nvlist_pack() and nvlist_unpack() are implemented recursively, which can
cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist
which contains an nvlist, which contains an nvlist, which...
Unprivileged users can pass an nvlist to the kernel via certain ioctls
on /dev/zfs, which the kernel will unpack without additional permission
checking or validation. Therefore, an unprivileged user can cause the
kernel's stack to overflow and panic.
Ideally, these functions would be implemented non-recursively. As a
quick fix, this patch limits the depth of the recursion and returns an
error when attempting to pack and unpack a deeply-nested nvlist.
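The depth-limiting approach can be sketched with a toy nested structure. The node_t type, the MAX_DEPTH value and the EINVAL return code are all assumptions made for illustration; the patch itself operates on nvlists and may use a different limit and error.

```c
#include <errno.h>
#include <stddef.h>

#define	MAX_DEPTH	20	/* assumed limit, chosen for illustration */

/* Minimal stand-in for a nested container: each node may hold one child. */
typedef struct node {
	struct node *child;
} node_t;

/*
 * Walk the nesting recursively, but refuse to descend past MAX_DEPTH so
 * that untrusted input cannot exhaust the stack.
 */
static int
walk_nested(const node_t *n, int depth)
{
	if (n == NULL)
		return (0);
	if (depth >= MAX_DEPTH)
		return (EINVAL);
	return (walk_nested(n->child, depth + 1));
}
```

The recursion stays, but its stack usage is now bounded by a constant instead of by attacker-controlled input.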
Signed-off-by: Adam Leventhal <ahl@delphix.com> Signed-off-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Prakash Surya <prakash.surya@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/7263
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d
Non-Linux OpenZFS implementations require additional support to be
used as a root pool. This code should simply be removed to avoid
confusion and improve readability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
Linux 4.8 compat: Fix removal of bio->bi_rw member
All users of bio->bi_rw have been replaced with compatibility wrappers.
This allows the kernel specific logic to be abstracted away, and for
each of the supported cases to be documented with the wrapper. The
updated interfaces are as follows:
* void blk_queue_set_write_cache(struct request_queue *, bool, bool)
* boolean_t bio_is_flush(struct bio *)
* boolean_t bio_is_fua(struct bio *)
* boolean_t bio_is_discard(struct bio *)
* boolean_t bio_is_secure_erase(struct bio *)
* VDEV_WRITE_FLUSH_FUA
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
Build user-space with different gcc optimization levels
This fix resolves warnings reported during compiling of user-space
libraries with different gcc optimization levels.
Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora).
The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast.
List of warnings:
[GCC 4.9.2 -Os]
libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 4.9.2 -Og]
fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 4.9.2 -Ofast]
u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1584:13: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 6.1.1]
Resolved in PR #4907
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4937
Chunwei Chen [Tue, 9 Aug 2016 00:26:21 +0000 (17:26 -0700)]
Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointer
Starting from Linux 4.7, get_acl will set acl cache pointer to temporary
sentinel value before calling i_op->get_acl. Therefore we can't compare
against ACL_NOT_CACHED and return.
Since Linux 3.14, get_acl already checks the cache for us, so we
disable this in zpl_get_acl.
Linux 4.7 also does set_cached_acl for us, so we disable it in zpl_get_acl.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4944
Closes #4946
GeLiXin [Tue, 2 Aug 2016 02:58:42 +0000 (10:58 +0800)]
Fix call zfs_get_name() with invalid parameter
zfs_get_name() expects a parameter of type zfs_handle_t *zhp, but
libzfs_dataset_cmp() actually passes it an invalid parameter of type
zfs_handle_t **zhp, which may trigger a coredump if called.
libzfs_dataset_cmp() has worked normally so far only because all the
callers pass it datasets of type ZFS_TYPE_FILESYSTEM, in which case we
compare their mountpoints and return, luckily.
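The pointer-level mismatch is easy to reproduce in any qsort()-style comparator over an array of pointers. The sketch below assumes that usage and relies on stand-in types (handle_t, handle_name, handle_cmp), not the libzfs ones.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a dataset handle. */
typedef struct handle {
	const char *name;
} handle_t;

static const char *
handle_name(const handle_t *h)
{
	return (h->name);
}

/*
 * qsort() comparator for an array of handle_t pointers. Each argument
 * points at an array element, i.e. it is really a handle_t **, so it must
 * be dereferenced once before being passed to handle_name().
 */
static int
handle_cmp(const void *a, const void *b)
{
	const handle_t *ha = *(const handle_t * const *)a;
	const handle_t *hb = *(const handle_t * const *)b;

	return (strcmp(handle_name(ha), handle_name(hb)));
}
```

Passing a or b straight through would compile, since void * converts silently, which is probably why this kind of bug survives until the affected branch is actually executed.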
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4919
The posix_acl_valid() function has been updated to require a
user namespace. Filesystem callers should normally provide the
user_ns from the super block associated with the ACL; the
zpl_posix_acl_valid() wrapper has been added for this purpose.
See https://github.com/torvalds/linux/commit/0d4d717f for
complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4922
Nikolay Borisov [Wed, 3 Aug 2016 17:19:04 +0000 (20:19 +0300)]
Linux 4.8 compat: new s_user_ns member of struct super_block
Kernel 4.8 paved the way to enabling mounting a file system inside a
non-init user namespace. To facilitate this a s_user_ns member was
added holding the userns in which the filesystem's instance was
mounted. This enables doing the uid/gid translation relative to
this particular user namespace and not the default init_user_ns.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4928
arc_meta_limit should be updated when arc_max is changed.
When arc_max is increased, arc_meta_limit will not be updated to 3/4
of the new arc_c_max value. This was done originally to preserve any
existing maximum value. This turned out to be counterintuitive to
users and this fix changes that behavior. If zfs_arc_meta_limit is
non-default, it will be picked up later in the ARC tuning function.
Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4893
Fix gcc -Warray-bounds check for dump_object() in zdb
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) an array bounds warning
is detected in the zdb dump_object() function. The analysis is
correct but difficult to interpret because this is implemented as a
macro. Rework ZDB_OT_NAME into a function and remove the case
detected by gcc, which is a side effect of the DMU_OT_IS_VALID() macro.
zdb.c: In function ‘dump_object’:
zdb.c:1931:288: error: array subscript is outside array bounds
[-Werror=array-bounds]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #4907
Brian Behlendorf [Sat, 30 Jul 2016 00:10:11 +0000 (17:10 -0700)]
Fix gcc self-comparison warning
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is
detected by gcc in metaslab_alloc(). Resolve the warning by passing
a physical size of 0 to BP_SET_BIRTH() as is done by other callers.
module/zfs/metaslab.c: In function ‘metaslab_alloc’:
module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates
to true [-Werror=tautological-compare]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4907
Add a `make lint` target which maps to a cppcheck target. As with
the shellcheck target it will only run when cppcheck is installed.
This allows a `make lint` build check to be incrementally added to
the automated testing for distributions which provide cppcheck.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4915
Config of a hot spare or l2cache device will leak memory in function
add_config(). At the start of this function, when dealing with a
config which belongs to a hot spare not currently in use or a l2cache
device the config should be freed.
Signed-off-by: liaoyuxiangqin <guo.yong33@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4910
Leaks reported by using AddressSanitizer, GCC 6.1.0
Direct leak of 4097 byte(s) in 1 object(s) allocated from:
#1 0x414f73 in process_options cmd/ztest/ztest.c:721
Direct leak of 5440 byte(s) in 17 object(s) allocated from:
#1 0x41bfd5 in umem_alloc ../../lib/libspl/include/umem.h:88
#2 0x41bfd5 in ztest_zap_parallel cmd/ztest/ztest.c:4659
#3 0x4163a8 in ztest_execute cmd/ztest/ztest.c:5907
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4896
Colin Ian King [Fri, 29 Jul 2016 11:40:30 +0000 (12:40 +0100)]
libzfs: Fix missing va_end call on ENOSPC and EDQUOT cases
The switch statement in function zfs_standard_error_fmt for the
ENOSPC and EDQUOT cases returns immediately and unlike all other
cases in the switch this does not perform the va_end call.
Perform a break which ends up calling va_end rather than returning
immediately.
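The break-instead-of-return pattern looks roughly like this. format_error() and its error codes are made up for the sketch and are not the libzfs function.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Format a message for an error code. Every case uses break so that the
 * single va_end() at the bottom runs on all paths.
 */
static int
format_error(int code, char *out, size_t outlen, const char *fmt, ...)
{
	va_list ap;
	int rc;

	va_start(ap, fmt);
	switch (code) {
	case 1:		/* special case: fixed message, arguments unused */
		snprintf(out, outlen, "out of space");
		rc = -1;
		break;	/* not "return": va_end() must still run */
	default:
		vsnprintf(out, outlen, fmt, ap);
		rc = 0;
		break;
	}
	va_end(ap);
	return (rc);
}
```

Keeping a single exit point after va_end() makes it hard for a later case to reintroduce the same omission.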
Found by static analysis with CoverityScan 0.8.5
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4900
Nikolay Borisov [Fri, 29 Jul 2016 17:02:59 +0000 (20:02 +0300)]
Move assignment of i_blkbits field
Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time
zfs_inode_update_impl is called. Since this value never changes
move its assignment to inode creation time.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4906
Nikolay Borisov [Fri, 29 Jul 2016 16:43:23 +0000 (19:43 +0300)]
Unify license of icp module with the rest of zfs
The newly added icp module uses a hardcoded value of CDDL for the license,
however in local development one might want to change that to something
else in order to facilitate compiling against a kernel with lock debugging
enabled. All modules of zfs use the ZFS_META_LICENSE string, which is replaced
with the value held in the META file. One can modify the value in the META file
once and then rerun configure to have all modules' licenses changed.
Change the icp module license string to be ZFS_META_LICENSE so that it
falls under the same paradigm.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4905
In zfs_ioc_log_history() function the tsd_set() function is called
with NULL which causes the zfs_allow_log_destroy() to be run. In
this case the passed value will be NULL. This is normally entirely
safe because strfree() maps directly to kfree() which may be passed
a NULL. However, since alternate implementations of strfree() may
not handle this gracefully add a check for NULL.
Observed under an embedded Linux 2.6.32.41 kernel running the
automated testing while running the ZFS Test Suite.
Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4872
New REQ_OP_* definitions have been introduced to separate the
WRITE, READ, and DISCARD operations from the flags. This included
changing the encoding of bi_rw. It places REQ_OP_* in high order
bits and other stuff in low order bits. This encoding is done
through the new helper function bio_set_op_attrs. For complete
details refer to:
Brian Behlendorf [Wed, 27 Jul 2016 18:06:17 +0000 (18:06 +0000)]
Linux 4.8 compat: REQ_PREFLUSH
The REQ_FLUSH flag was renamed REQ_PREFLUSH to avoid confusion with
REQ_OP_FLUSH. See https://github.com/torvalds/linux/commit/28a8f0d3
for complete details.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
Brian Behlendorf [Wed, 27 Jul 2016 02:23:53 +0000 (02:23 +0000)]
Linux 4.8 compat: submit_bio()
The rw argument has been removed from submit_bio/submit_bio_wait.
Callers are now expected to set bio->bi_rw instead of passing it
in. See https://github.com/torvalds/linux/commit/4e49ea4a for
complete details.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
Brian Behlendorf [Wed, 29 Jun 2016 20:59:51 +0000 (13:59 -0700)]
Fix zdb crash with 4K-only devices
Here's the problem - on 4K native devices in userland on
Linux using O_DIRECT, buffers must be 4K aligned or I/O
will fail with EINVAL, causing zdb (and others) to coredump.
Since userland probably doesn't need optimized buffer caches,
we just force 4K alignment on everything.
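Forcing 4K alignment on an allocation can be done with posix_memalign(), as in the sketch below. alloc_io_buf() and IO_ALIGN are names invented here; the commit's actual change to the userland allocator is not shown in this log.

```c
#include <stdint.h>
#include <stdlib.h>

#define	IO_ALIGN	4096	/* assumed worst-case logical sector size */

/*
 * Allocate a buffer whose address is a multiple of IO_ALIGN, as O_DIRECT
 * I/O on a 4K native device requires. Release it with free().
 */
static void *
alloc_io_buf(size_t size)
{
	void *buf = NULL;

	if (posix_memalign(&buf, IO_ALIGN, size) != 0)
		return (NULL);
	return (buf);
}
```

Note that O_DIRECT generally constrains the transfer length and file offset as well, so an aligned address is necessary but may not be sufficient on its own.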
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #4479
Brian Behlendorf [Tue, 26 Jul 2016 18:15:46 +0000 (11:15 -0700)]
Enable history test cases
Updated test case history_001_pos.ksh so it can run in tree. The
original test case assumed /usr/sbin/zfs and /usr/sbin/zpool were
the only valid locations for these utilities. The same modification
has already been made to history_common.kshlib.
The only other failing test case was history_010_pos and that was
the result of the ":linux" suffix not being appended when checking
the long output in the test case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4882
Colin Ian King [Wed, 27 Jul 2016 08:26:38 +0000 (09:26 +0100)]
Avoid integer overflow on computation of refquota_slack
DMU_MAX_ACCESS should be cast to a uint64_t otherwise the
multiplication of DMU_MAX_ACCESS with spa_asize_inflation will
be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS
is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will
lead to an overflow.
Found by static analysis with CoverityScan 0.8.5
CID 150942 (#1 of 1): Unintentional integer overflow
(OVERFLOW_BEFORE_WIDEN)
overflow_before_widen: Potentially overflowing expression 67108864 * spa_asize_inflation with type int (32 bits, signed)
is evaluated using 32-bit arithmetic, and then used in a context
that expects an expression of type uint64_t (64 bits, unsigned).
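The overflow-before-widen pattern reported above can be sketched in isolation. This is an illustrative example only: SKETCH_DMU_MAX_ACCESS mirrors the 64 * 1024 * 1024 value quoted in the message, and both function names are hypothetical. The truncated variant uses unsigned arithmetic so that the wraparound is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Same value the commit message quotes for DMU_MAX_ACCESS (64 MiB). */
#define SKETCH_DMU_MAX_ACCESS (64 * 1024 * 1024)

/*
 * Widen before multiplying: the cast makes the whole product 64-bit,
 * so an inflation factor of 64 or more no longer overflows.
 */
static uint64_t
slack_widened(int inflation)
{
    assert(inflation >= 0);
    return ((uint64_t)SKETCH_DMU_MAX_ACCESS * (uint64_t)inflation);
}

/*
 * What a 32-bit product keeps: only the low 32 bits. Unsigned math is
 * used to show the truncation without relying on signed overflow,
 * which is undefined behavior in C.
 */
static uint32_t
slack_truncated(int inflation)
{
    return ((uint32_t)SKETCH_DMU_MAX_ACCESS * (uint32_t)inflation);
}
```

With an inflation factor of 64 the widened product is 2^32, while the 32-bit product wraps to zero.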
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4889
Brian Behlendorf [Thu, 23 Jun 2016 22:25:17 +0000 (15:25 -0700)]
Multi-thread 'zpool import' for blkid
Commit 519129f added support to multi-thread 'zpool import' for
the case where block devices are scanned for under /dev/. This
commit generalizes that logic and applies it to the case where
device names are acquired from libblkid.
The zpool_find_import_scan() and zpool_find_import_blkid()
functions create an AVL tree containing each device name. Each
entry in this tree is dispatched to a taskq where the function
zpool_open_func() validates the device by opening it and reading
the label. This may result in additional entries being added
to the tree and those device paths being verified.
This is largely how the upstream OpenZFS code behaves but due to
significant differences the non-Linux code has been dropped for
readability. Additionally, this code makes use of taskqs and
kmutexs which are normally not available to the command line tools.
Special care has been taken to allow their use in the import
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #4794
Tim Chase [Wed, 13 Jul 2016 12:42:40 +0000 (07:42 -0500)]
Limit the amount of dnode metadata in the ARC
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
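The arithmetic implied by the two tunables can be sketched as follows. This is a simplified model under stated assumptions, not the ARC code itself: the function names are hypothetical, all quantities are treated as bytes, and the real implementation may convert to other units before pruning.

```c
#include <assert.h>
#include <stdint.h>

/* Default dnode limit: 10% of the metadata limit, per the description. */
static uint64_t
dnode_limit_default(uint64_t meta_limit)
{
    return (meta_limit / 10);
}

/*
 * Amount to try to prune: reduce_percent of the excess over the limit,
 * or nothing when dnode usage is at or below the limit.
 */
static uint64_t
dnode_prune_target(uint64_t dnode_size, uint64_t dnode_limit,
    uint64_t reduce_percent)
{
    assert(reduce_percent <= 100);
    if (dnode_size <= dnode_limit)
        return (0);
    return ((dnode_size - dnode_limit) * reduce_percent / 100);
}
```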
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_meta_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_used as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
Tim Chase [Fri, 8 Jul 2016 15:33:01 +0000 (10:33 -0500)]
Fix sync behavior for disk vdevs
Prior to b39c22b, which was first generally available in the 0.6.5
release, ZoL never actually submitted synchronous read or write
requests to the Linux block layer. This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.
In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b. In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.
The original rationale for introducing synchronous operations in b39c22b
was to hurry certain requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency. This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.
To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.
For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
is used for the same purpose.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4858
Brian Behlendorf [Mon, 25 Jul 2016 21:15:01 +0000 (14:15 -0700)]
Fix uninitialized variable in avl_add()
Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.
module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
avl_insert(tree, new_node, where);
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Nikolay Borisov [Sun, 22 May 2016 11:15:57 +0000 (14:15 +0300)]
Remove znode's z_uid/z_gid member
Remove duplicate z_uid/z_gid members which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
Nikolay Borisov [Mon, 30 May 2016 17:37:36 +0000 (20:37 +0300)]
Check whether the kernel supports i_uid/gid_read/write helpers
Since the concept of a kuid and the need to translate from it to
ordinary integer type was added in kernel version 3.5 implement necessary
plumbing to be able to detect this condition during compile time. If
the kernel doesn't support the kuid then just fall back to directly
accessing the respective struct inode's members.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
Tom Caputi [Fri, 22 Jul 2016 20:19:29 +0000 (16:19 -0400)]
Fix for metaslab_fastwrite_unmark() assert failure
Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.
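The accounting problem and the fix can be modeled in a few lines. This is a toy sketch: struct vdev_sketch and the three helpers are hypothetical stand-ins, with pending_fastwrite playing the role of the vdev_pending_fastwrite field named above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of a vdev. */
struct vdev_sketch {
    uint64_t pending_fastwrite; /* outstanding fastwrite bytes */
};

static void
fastwrite_mark(struct vdev_sketch *vd, uint64_t bytes)
{
    vd->pending_fastwrite += bytes;
}

static void
fastwrite_unmark(struct vdev_sketch *vd, uint64_t bytes)
{
    /* The count must never drop below zero. */
    assert(vd->pending_fastwrite >= bytes);
    vd->pending_fastwrite -= bytes;
}

/* Carry the outstanding count over when a vdev is replaced. */
static void
vdev_replace_sketch(struct vdev_sketch *newvd, const struct vdev_sketch *oldvd)
{
    newvd->pending_fastwrite = oldvd->pending_fastwrite;
}
```

Without the copy in vdev_replace_sketch(), an unmark on the new vdev would trip the assertion, which mirrors the failure described above.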
Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4267
Tom Caputi [Thu, 21 Jul 2016 03:29:51 +0000 (23:29 -0400)]
Fix for compilation error when using the kernel's CONFIG_LOCKDEP
Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
Tom Caputi [Thu, 12 May 2016 14:51:24 +0000 (10:51 -0400)]
Illumos Crypto Port module added to enable native encryption in zfs
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).
Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will call deactivate_locked_super and call into
zfs_preumount.
In order to make sure we don't touch anything that has already been freed,
we must make sure s_fs_info is NULL in the fail path so zfs_preumount can
easily check that.
- Implementation lock replaced with atomic variable
- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`
- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813
- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392
- Minor fixes and cleanups
- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
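The NULL-pointer fallback in the last item can be sketched as a small dispatch helper. Everything here is hypothetical: the ops structure, the function signature and the return codes exist only to show which routine ran, and the real parity methods take different arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature; real parity routines take other arguments. */
typedef int (*gen_parity_fn)(void);

/* Hypothetical ops table: a NULL entry means no optimized routine. */
struct raidz_ops_sketch {
    const char *name;
    gen_parity_fn gen_parity;
};

enum { RAN_ORIGINAL = 0, RAN_OPTIMIZED = 1 };

static int original_gen_parity(void) { return (RAN_ORIGINAL); }
static int optimized_gen_parity(void) { return (RAN_OPTIMIZED); }

/* Run the selected method, or the original one when the slot is NULL. */
static int
run_gen_parity(const struct raidz_ops_sketch *impl)
{
    assert(impl != NULL);
    if (impl->gen_parity == NULL)
        return (original_gen_parity());
    return (impl->gen_parity());
}
```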
Wait iput_async before evict_inodes to prevent race
Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes. This means it could
destroy the inode while we are still using it.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4854
In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue
was fixed in gcc 4.6.
To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.
Disable zimport testing against master where this flaw exists:
TEST_ZIMPORT_VERSIONS="installed"
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4862
Roman Strashkin [Tue, 12 Jul 2016 17:53:53 +0000 (20:53 +0300)]
Fix filesystem destroy with receive_resume_token
It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receive'. Try to remove the hidden child (%recv) and after
that try to remove the target dataset. If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4818
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.
The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswap-ed
version. The average performance of the two is used to select the
fastest implementation available on the host system.
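The selection rule described above (average of the native and byteswap results, highest average wins) can be sketched like this. The structure, field names and bandwidth units are assumptions made for the example; the module's actual benchmark code is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark record for one implementation. */
struct impl_bench {
    const char *name;
    uint64_t native_bw;   /* measured throughput, native order */
    uint64_t byteswap_bw; /* measured throughput, byteswapped */
};

/* Return the entry with the highest average of the two results. */
static const struct impl_bench *
pick_fastest(const struct impl_bench *impls, size_t n)
{
    const struct impl_bench *best = NULL;
    uint64_t best_avg = 0;

    assert(impls != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t avg = (impls[i].native_bw + impls[i].byteswap_bw) / 2;

        if (best == NULL || avg > best_avg) {
            best = &impls[i];
            best_avg = avg;
        }
    }
    return (best);
}
```

Averaging means an implementation that is very fast in only one of the two modes can lose to one that is consistently good in both.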
Adds a pair of fields to an existing zcommon module parameter:
- zfs_fletcher_4_impl (str)
"sse2" - new SSE2 implementation if available
"ssse3" - new SSSE3 implementation if available
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789
Chris Dunlop [Thu, 14 Jul 2016 14:44:38 +0000 (00:44 +1000)]
Use native inode->i_nlink instead of znode->z_links
A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.
We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.
In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c).
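One way to account for the width difference is to clamp the 64-bit on-disk count when filling in the 32-bit field. The sketch below is an assumption about how such a conversion could look, not a copy of what this commit does; links_to_nlink() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a 64-bit on-disk link count to a 32-bit value, saturating
 * at UINT32_MAX instead of silently wrapping.
 */
static uint32_t
links_to_nlink(uint64_t disk_links)
{
    if (disk_links > UINT32_MAX)
        return (UINT32_MAX);
    return ((uint32_t)disk_links);
}
```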
Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4838
Issue #227
Tim Chase [Sun, 10 Jul 2016 14:09:02 +0000 (09:09 -0500)]
Prevent null dereferences when accessing dbuf kstat
In arc_buf_info(), the arc_buf_t may have no header. If not, don't try
to fetch the arc buffer stats and instead just zero them.
The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4837
Brian Behlendorf [Thu, 14 Jul 2016 16:27:33 +0000 (09:27 -0700)]
Enable zpool_upgrade test cases
Creating the pool in a striped rather than mirrored configuration
provides enough space for all upgrade tests to run. Test case
zpool_upgrade_007_pos still fails and must be investigated so
it has been left disabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4852
Gvozden Neskovic [Tue, 28 Jun 2016 17:49:53 +0000 (19:49 +0200)]
Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.
The patch covers low-end and older x86 CPUs. Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower. Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.
Peng [Wed, 8 Jun 2016 07:22:07 +0000 (15:22 +0800)]
Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.
Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
db_blkptr = NULL and it's dirtied.
Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
xattr fits into the bonus buffer, so it's removed. The dbuf is
undirtied in this txg, but it's still referenced and cannot be
destroyed.
Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
(not done yet).
* A new change makes the spill buffer necessary again.
sa_build_layouts() ends up calling dbuf_find() to locate the
dbuf. It finds the old dbuf because it has not been destroyed yet
(it will be destroyed when the previous write is done and there
are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
referenced, so it's not destroyed.
Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
directly copied into the dnode, overwriting the blkptr area because,
in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
gets corrupted.
Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937
zp->z_xattr_parent will pin the parent. This causes a serious problem
when unlinking a file with xattrs. Because the unlinked file is pinned,
it is never purged immediately, and because of that the xattr objects
are never marked as unlinked. The unlinked objects therefore stay
around until the cache is shrunk or the filesystem is unmounted.
This change partially reverts e89260a. This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
fh_to_dentry should return ESTALE when generation mismatch
When the generation does not match, it usually means the file pointed to by
the file handle was deleted. We should return ESTALE to indicate this. We
return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.
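The two-step error mapping can be sketched in userland. Both functions are hypothetical simplifications of the roles played by zfs_vget and zpl_fh_to_dentry: the first reports a generation mismatch as ENOENT, the second translates that into ESTALE for the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Lower layer: a generation mismatch is reported as ENOENT. */
static int
vget_sketch(uint64_t handle_gen, uint64_t current_gen)
{
    return (handle_gen == current_gen ? 0 : ENOENT);
}

/* Upper layer: ENOENT from the lookup becomes ESTALE for the caller. */
static int
fh_to_dentry_error_sketch(int err)
{
    assert(err >= 0);
    return (err == ENOENT ? ESTALE : err);
}
```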
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
Allowing XATTR access through an export handle is a very bad idea. It
would allow users to write whatever they want in fields where they
otherwise could not.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
Certain ioctl operations will call get_zfs_sb, which holds an active
count on sb without checking whether it's active or not. This can result
in a use-after-free. We fix this by using atomic_inc_not_zero to make sure
we get an active sb.
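The semantics relied on here can be shown with a userland stand-in built on C11 atomics. This is not the kernel's implementation; ref_get_not_zero() is a hypothetical name that mimics the behavior of atomic_inc_not_zero: take a reference only while the count is still non-zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userland stand-in for atomic_inc_not_zero(): increment the count
 * and return true only if it was non-zero at the time of the update.
 */
static bool
ref_get_not_zero(atomic_int *count)
{
    int old;

    assert(count != NULL);
    old = atomic_load(count);
    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return (true);
    }
    return (false);
}
```

A count that has already reached zero is left untouched, so a caller never revives an object that is being torn down.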
Gvozden Neskovic [Wed, 29 Jun 2016 20:31:25 +0000 (22:31 +0200)]
Allow building with `CFLAGS="-O0"`
If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, new opt level -Og is introduced for debugging, which
does not trigger this warning.
Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4799
Brian Behlendorf [Wed, 29 Jun 2016 18:26:30 +0000 (11:26 -0700)]
Merge branch 'illumos-2605'
Adds support for resuming interrupted zfs send streams and includes
all related send/recv bug fixes from upstream OpenZFS.
Unlike the upstream implementation this branch does not change
the existing ioctl interface. Instead a new ZFS_IOC_RECV_NEW ioctl
was added to support resuming zfs send streams. This was done by
applying the original upstream patch and then reverting the ioctl
changes in a follow up patch. For this reason there are a handful
of commits between the relevant patches on this branch which are
not interoperable. This was done to make it easier to extract
the new ZFS_IOC_RECV_NEW and submit it upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4742
Brian Behlendorf [Tue, 28 Jun 2016 20:31:21 +0000 (13:31 -0700)]
Vectorized fletcher_4 must be 128-bit aligned
The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned. This is because both the AVX2 and SSE implementations process
four 32-bit words per iteration. Fall back to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.
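The alignment test and the scalar path can be sketched together. This is a minimal illustration, not the module's code: the function names are hypothetical, and the scalar loop follows the usual Fletcher-4 recurrence of four running 64-bit sums over 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar Fletcher-4: one 32-bit word per iteration, four running sums. */
static void
fletcher4_scalar(const uint32_t *words, size_t nwords, uint64_t out[4])
{
    uint64_t a = 0, b = 0, c = 0, d = 0;

    assert(words != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++) {
        a += words[i];
        b += a;
        c += b;
        d += c;
    }
    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
}

/* A vectorized routine may only be chosen for 128-bit aligned buffers. */
static int
can_use_vectorized(const void *buf)
{
    return (((uintptr_t)buf & 15) == 0);
}
```

A caller would check can_use_vectorized() first and use the scalar routine whenever it returns zero.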
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330