Brian Behlendorf [Fri, 20 Feb 2015 18:28:25 +0000 (10:28 -0800)]
Fix O_APPEND open(2) flag
As described in flags section of open(2):
O_APPEND:
The file is opened in append mode. Before each write(2), the
file offset is positioned at the end of the file, as if with
lseek(2). O_APPEND may lead to corrupted files on NFS filesys-
tems if more than one process appends data to a file at once.
This is because NFS does not support appending to a file, so the
client kernel has to simulate it, which can't be done without a
race condition.
This issue was originally overlooked because normally the generic
VFS code handles this for a filesystem. However, because ZFS explictly
registers a zpl_write() function it's responsible for the seek.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3124
Steffen Müthing [Mon, 16 Feb 2015 03:08:04 +0000 (04:08 +0100)]
Add required files to initramfs
The dracut module installs the udev rules and the vdev_id utility for creating
the /dev/disk/by-vdev/ names, but omits some additional utilities and the
config file required by vdev_id.
Signed-off-by: Steffen M<C3><BC>thing <steffen.muething@iwr.uni-heidelberg.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3110
When loading the ZFS kernel modules they should not populate the
spa namespace using the cache file. This behavior isn't consistent
with other Linux kernel modules and we need to move away from it.
Removing this makes the whole startup process predictable with four
basic steps which are driven by the init system.
This change also helps lay the groundwork for eventually removing
the kobj_* compatibility code on the kernel side. It may need to
be preserved in userspace because libzfs_init() depends on it.
This is why the conditional must be wrapped with an #ifdef _KERNEL.
Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2820
Brian Behlendorf [Thu, 12 Feb 2015 23:05:21 +0000 (15:05 -0800)]
Skip bad DVAs during free by setting zfs_recover=1
When a bad DVA is encountered in metaslab_free_dva() the system
should treat it as fatal. This indicates that somehow a damaged
DVA was written to disk and that should be impossible.
However, we have seen a handful of reports over the years of pools
somehow being damaged in this way. Since this damage can render
otherwise intact pools unimportable, and the consequence of skipping
the bad DVA is only leaked free space, it makes sense to provide
a mechanism to ignore the bad DVA. Setting the zfs_recover=1 module
option will cause the DVA to be ignored which may allow the pool to
be imported.
Since zfs_recover=0 by default any pool attempting to free a bad DVA
will treat it as a fatal error preserving the current behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3099
Issue #3090
Issue #2720
Brian Behlendorf [Thu, 30 Oct 2014 18:11:00 +0000 (11:11 -0700)]
Change VERIFY to ASSERT in mutex_destroy()
There have been multiple reports of 'zdb' tripping the VERIFY in
mutex_destroy() because pthread_mutex_destroy() returns EBUSY.
Exactly how this can happen still needs to be explained, but this
doesn't strictly need to be fatal for non-debug builds. Therefore,
this patch converts the VERIFY to an ASSERT until the root cause
is determined and resolved.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2027
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.
This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2930
Brian Behlendorf [Fri, 30 Jan 2015 19:25:19 +0000 (11:25 -0800)]
Retire zio_cons()/zio_dest()
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code. They were introduced as a workaround
to avoid issue #2523. Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches. Today
this code works well and there's no compelling reason to keep
this functionality. In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold. Convert them to use vmem_alloc() so the
best allocator is used.
Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings. Concurrently the large allocation warning threshold
was increased from 8k to 32k. The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k. This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3057
Closes #3079
Closes #3081
Tim Chase [Mon, 17 Nov 2014 04:27:58 +0000 (22:27 -0600)]
Produce a full snapshot list for zfs send -p
In order to accelerate zfs receive operations in the face of many
property-containing snapshots, commit 0574855 changed the header nvlist
("fss") of a send stream to exclude snapshots which aren't part of the
stream. This, however, would cause zfs receive -F to erroneously remove
snapshots; it would remove any snapshot which wasn't listed in the header
nvlist.
This patch restores the full list of snapshots in fss[<id>[snaps]] but
still suppresses the properties of non-sent snapshots and also removes a
consistency check in which an error is raised if a listed snapshot does
not have any properties in fss[<id>[snapprops]].
The 0574855 commit also introduced a bug in which zfs send -p of a
complete stream (zfs send -p pool/fs@snap) would exclude the snapshot
properties in fss[<id>[snapprops]]. This patch detects the last snapshot
in a series when no "from" snapshot has been specified and includes its
properties.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2907
Brian Behlendorf [Tue, 18 Nov 2014 22:29:04 +0000 (17:29 -0500)]
Don't read space maps during import for readonly pools
Normally when importing a pool the space maps for all top level
vdevs are read from disk. The space maps will be required latter
when an allocation is performed and free blocks need to be located.
However, if the pool is imported readonly then we are guaranteed
that no allocations can occur. In this case the space maps need
not be loaded.. A similar argument can be made for the DTLs
(dirty time logs).
Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
Lukas Wunner [Wed, 4 Feb 2015 09:45:19 +0000 (10:45 +0100)]
Fix loop in Dracut shutdown script
The shell executes each command of a pipeline in a subshell,
thus $ret always had the same value after the while loop that
it had before the loop (http://mywiki.wooledge.org/BashFAQ/024),
signaling success even if some of the zpools could not be exported.
Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3083
Justin T. Gibbs [Sun, 23 Nov 2014 03:14:24 +0000 (19:14 -0800)]
Illumos 5311 - traverse_dnode may report success when it should not
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ned Bass [Wed, 4 Feb 2015 21:57:50 +0000 (13:57 -0800)]
Fix SA header size accounting
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.
Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.
Also improve the comments in sa_find_sizes().
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3070
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list. In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2553
Closes #2495
Chunwei Chen [Fri, 23 Jan 2015 08:05:04 +0000 (16:05 +0800)]
Read spl_hostid module parameter before gethostid()
If spl_hostid is set via module parameter, it's likely different from
gethostid(). Therefore, the userspace tool should read it first before
falling back to gethostid().
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3034
Tim Chase [Tue, 3 Feb 2015 05:55:20 +0000 (23:55 -0600)]
Spurious ENOMEM returns when reading dbufs kstat
Commit 7b2d78a046aa4695d434478a439a9438521d73af fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller. This caused spurious
ENOMEM errors when reading the dbufs kstat.
This commit causes the actual number of characters written to be returned.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3072
avg [Thu, 6 Nov 2014 11:08:02 +0000 (11:08 +0000)]
fix l2arc compression buffers leak
Commit log from FreeBSD:
We have observed that arc_release() can be called concurrently with a
l2arc in-flight write. Also, we have observed that arc_hdr_destroy()
can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
flag in similar circumstances.
Previously the l2arc headers would be freed while leaking their
associated compression buffers. Now the buffers are placed on
l2arc_free_on_write list for delayed freeing. This is similar to
what was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.
In addition to fixing the discovered leaks this change also adds
some protective code to assert that a compression buffer associated
with a l2arc header is never leaked.
A new kstat l2_cdata_free_on_write is added. It keeps a count
of delayed compression buffer frees which previously would have
been leaks.
Tested by: Vitalij Satanivskij <satan@ukr.net> et al
Requested by: many
MFC after: 2 weeks
Sponsored by: HybridCluster / ClusterHQ
Brian Behlendorf [Thu, 29 Jan 2015 23:09:51 +0000 (15:09 -0800)]
Use zio buffers in zil_itx_create()
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K. This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.
In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure. This way we're always sure the
right allocation functions are used. Notice that in the current
code kmem_free() and vmem_free() were both used. This happened to
work because these wrappers map to the same internal SPL function.
This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings. There
was no deadlock here just repeated allocation failures resulting
in increased latency.
Chris Dunlap [Thu, 15 Jan 2015 00:31:21 +0000 (16:31 -0800)]
Cleanup _zed_event_add_nvpair()
When _zed_event_add_var() was updated to be the common routine
for adding zedlet environment variables, an additional snprintf()
was added to the processing of each nvpair. This commit changes
_zed_event_add_nvpair() to directly call _zed_event_add_var()
for nvpair non-array types, thereby removing a superfluous call to
snprintf(). For consistency, the helper functions for converting
nvpair array types are similarly adjusted to add variables.
The _zed_event_value_is_hex() and _zed_event_add_var() functions have
been moved up in the file since forward declarations are not used,
but no changes have been made to these functions.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3042
Chris Dunlap [Sun, 19 Oct 2014 19:05:07 +0000 (12:05 -0700)]
Protect against adding duplicate strings in ZED
The zed_strings container stores strings in an AVL, but does not
check for duplicate strings being added. Within the AVL, strings
are indexed by the string value itself. avl_add() requires the node
being added must not already exist in the tree, and will assert()
if this is not the case.
This should not cause problems in practice. ZED uses this container
in two places. In zed_conf.c, it is used to store the names of
enabled zedlets as zed scans the zedlet directory listing; duplicate
entries cannot occur here since duplicate names cannot occur within
a directory. In zed_event.c, it is used to store the environment
variables (as "NAME=VALUE" strings) that will be passed to zedlets;
duplicate strings here should never happen unless there is a bug
resulting in a duplicate nvpair or environment variable.
This commit protects against adding a duplicate to a zed_strings
container by first checking for the string being added, and removing
the previous entry should one exist. This implements a "last one
wins" policy.
This commit also changes the prototype for zed_strings_add() to allow
the string key (by which it is indexed in the AVL) to differ from
the string value. By adding zedlet environment variables using the
variable name as the key, multiple adds for the same variable name
will result in only the last value being stored.
Finally, this commit routes all additions of zedlet environment
variables through the updated _zed_event_add_var(). This ensures
all zedlet environment variable names are properly converted.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3042
Thank to commit a4430fce691d492aec382de0dfa937c05ee16500 we're
now correctly returning EROFS when opening a zvol on a read-only
pool. Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().
In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
Brian Behlendorf [Thu, 30 Oct 2014 20:17:33 +0000 (13:17 -0700)]
Make `zpool import -d|-c` behave consistently
When importing pools with zpool import -aN there is inconsistent
behavior between '-d /dev/disk/by-id' (or another path) and
'-c /etc/zfs/zpool.cache'.
The difference in behavior is caused by zpool_find_import_cached()
returning an empty nvlist_t when there are no pools to import but
zpool_find_import_impl() returns NULL for the same situation. The
behavior of zpool_find_import_cached() is arguably more correct
because it allows returning NULL to be used for an error case and
not an empty set.
This change resolves the issue by updating get_configs() such that
it returns an empty set instead of NULL when no config is found.
The updated behavior will now always return 0 for this case.
$ zpool import -aN; echo $?
no pools available to import
0
$ zpool import -aN -d /var/tmp/; echo $?
no pools available to import
0
$ zpool import -aN -c /etc/zfs/zpool.cache; echo $?
no pools available to import
0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2080
Brian Behlendorf [Wed, 28 Jan 2015 19:06:14 +0000 (11:06 -0800)]
Merge branch 'arc_summary_draft_v2'
Add a port of arc_summary.py to ZFS on Linux, arc_summary.py is a
standard tool in FreeBSD and Illumos. The version of the script used
for the port originally came from FreeNAS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2920
Kyle Blatter [Mon, 12 Jan 2015 21:31:24 +0000 (13:31 -0800)]
Replace sysctl summary with tunables summary.
The original script displayed tunable parameters using sysctl calls.
This patch modifies this by displaying tunable parameters found in
/sys/modules/zfs/parameters/. modinfo calls are used to capture
descriptions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Kyle Blatter [Wed, 5 Nov 2014 18:08:45 +0000 (10:08 -0800)]
Refactor arc_summary to simplify -p processing
The -p option is used to specify a specific page of output to be
displayed. If omitted, all output pages would be displayed.
arc_summary, as it stood, had really kludgy processing code for
processing the -p option. It relied on a try-except block which was
treated as an if statement and in normal operation would fail any time a
user didn't specify the -p option on the command line. When the
exception was thrown, the script would then display all output pages.
This happened whether the -p option was omitted or malformed. Thus, in
the principle use case, an exception would be raised in order to run the
script as desired. The same except code would be called regardless of
the exception, however, and malformed -p arguments would also cause the
script to execute.
Additionally, this required the function which handles the case where
all output pages were to be displayed, _call_all, to be potentially
called from several locations within main.
This commit refactors the option processing code to simplify it and make
it easier to catch runtime errors in the script. This is done by
specializing the try-except block to only have an exception when the -p
argument is malformed. When the -p option is correctly selected by the
user, it calls a function in the unSub array directly, which will only
display one page of output.
Finally in the context of this refactoring the page breaks have been
removed. Pages seem to have been added into the output in the FreeNAS
version of the script. This patch removes pages from the output to more
closely resemble the freebsd version of the script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
cburroughs [Tue, 11 Feb 2014 15:14:55 +0000 (10:14 -0500)]
Modified arc_summary.py to run on linux
1) Comment out stat sections whose kstats are not currently available
2) Port most of arc_summary to use spl kstats
3) Enable l2arc stats
5) Include compressed l2size
4) Minor style fixes / cleanup
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cburroughs <chris.burroughs@gmail.com> Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
cburroughs [Tue, 11 Feb 2014 14:15:50 +0000 (09:15 -0500)]
Add arc_summary.py from FreeNAS
The arc_summary script is a useful utility for administrators on
other ZFS platforms. It provides a quick and easy way to get a high
level view of the current ARC state.
Historically this was a perl script but it was rewritten in python
for FreeNAS. We've decided to adopt the python version instead of
the perl version for a few reasons.
1) ZoL has no existing perl dependencies, but it does have a python
dependency for scripts such as arcstat.py and dbufstat.py. Using
python for arc_summary.py helps us minimize dependencies.
2) Most major Linux distributions already depend heavily on python
for their core infrastructure. This means it's very likely to
be available even very early in the boot process.
Original source:
https://github.com/freenas/freenas/blob/master/gui/tools/arc_summary.py
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cburroughs <chris.burroughs@gmail.com> Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Tim Chase [Sun, 19 Oct 2014 03:50:01 +0000 (22:50 -0500)]
Fix removal of SA in sa_modify_attrs()
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR(). The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3028
Richard Yao [Sat, 11 Oct 2014 15:01:37 +0000 (11:01 -0400)]
Use kmem_vasprintf() in log_internal()
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.
It also brings this function almost back in sync with Illumos. This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2791
Issue #2781
Brian Behlendorf [Fri, 16 Jan 2015 22:42:46 +0000 (14:42 -0800)]
Merge branch 'kmem-rework'
The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms. This simplifies the process of porting changes to
Linux from other platforms. This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting. The following key Linux specific
changes have been reverted.
* KM_PUSHPAGE changed back to KM_SLEEP. All contexts where
it is unsafe to perform IO have been marked with PF_FSTRANS.
This context specific mechanism is now used exclusively
and the KM_PUSHPAGE mechanism has been retired.
* The KM_NODEBUG flag has been retired. Allocations larger
than 32K should use vmem_alloc()/vmem_free(). Depending
on the size of the allocation either kmalloc() or vmalloc()
will be used internally, but no warning will be printed.
* Pre-allocated vdev IO buffers and the dedicated SA spill
block cache have been retired. It is now safe and reliable
to allocate buffers of the needed size without fear of
deadlocking. This reduces our memory footprint and paves
the way for larger block sizes.
Depends on zfsonlinux/spl#414.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2918
Brian Behlendorf [Tue, 16 Dec 2014 19:44:24 +0000 (11:44 -0800)]
Revert "SA spill block cache"
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations. Instead a small dedicated
cache of preallocated SA buffers was kept.
This solution was viable while the maximum block size was limited
to 128K. But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc(). However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.
Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.
This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Sat, 13 Dec 2014 00:40:21 +0000 (16:40 -0800)]
Revert "Pre-allocate vdev I/O buffers"
Commit 86dd0fd added preallocated I/O buffers. This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos. A
deadlock in this situation is no longer possible.
However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky. Either case is acceptable
here because we can safely abort the aggregation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As part of the spl kmem/vmem refactoring the kmem_cache_* functions
were split in to their own kmem_cache.h header. This was done in
part so that kmem_* consumers would not be forced to include the
kmem_cache_* functions which mask several Linux SLAB/SLAB functions.
Because of this we now much explicitly include kmem_cache.h in the
zfs_context.h. However, consumers such as Lustre which need access
to the KM_FLAGS but not the kmem_cache_* functions can now safely
just include kmem.h.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Fri, 21 Nov 2014 00:09:39 +0000 (19:09 -0500)]
Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings
us back in line with upstream. In some cases this means simply
swapping the flags back. For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.
The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory. This is again the
same as upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate. The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.
A careful reader will notice that not all callers have been changed
to vmem_alloc(). Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k. This is desirable because it minimizes the need
for Linux specific code changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Yao [Mon, 3 Nov 2014 14:42:44 +0000 (09:42 -0500)]
Use is_vmalloc_addr() in vdev_disk.c
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Sun, 13 Jul 2014 18:35:19 +0000 (14:35 -0400)]
Mark IO pipeline with PF_FSTRANS
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim. This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction. For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range(). Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read(). The page must be unlocked before taking
the range lock. This patch corrects that issue.
In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
Ned Bass [Tue, 23 Dec 2014 00:54:43 +0000 (16:54 -0800)]
Document zfs_flags module parameter
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter. Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.
zfs_flags (int)
Set additional debugging flags. The following flags may be
bitwise-or'd together.
Ned Bass [Mon, 15 Dec 2014 21:53:00 +0000 (13:53 -0800)]
Don't use AC_LANG_SOURCE for conftest.h source
Using AC_LANG_SOURCE with some versions of autoconf is problematic if
the given source is to be written to a header file. Such versions assume
the contents are to be written to conftest.c and generate shell code to
that effect. The contents of the test program to detect support for
Linux tracepoints were consequently malformed (containing the source for
conftest.h) so the build system incorrectly disabled tracepoints
support. Fix this in ZFS_LINUX_TRY_COMPILE_HEADER by passing the header
source directly to ZFS_LINUX_COMPILE_IFELSE.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
Ned Bass [Sat, 13 Dec 2014 02:07:39 +0000 (18:07 -0800)]
Remove duplicate typedefs from trace.h
Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.
Removal of the typedefs means we can't built tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included form zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.
These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
Brian Behlendorf [Fri, 19 Dec 2014 20:57:54 +0000 (12:57 -0800)]
Fix zfs_putpage() lock inversion
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock. To
prevent this we must always manipulate the writeback bit while
holding the range lock. The exact deadlock is as follows:
------ Process A ------ ------ Process B ------
zpl_writepages zpl_fallocate
write_cache_pages zpl_fallocate_common
zpl_putpage zfs_space
zfs_putpage (set bit) zfs_freesp
zfs_range_lock (wait on lock) zfs_free_range (take lock)
[has not yet initiated I/O, truncate_inode_pages_range
the bit will not be cleared] wait_on_page_writeback (wait on bit)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
Ned Bass [Wed, 17 Dec 2014 19:01:42 +0000 (11:01 -0800)]
vdev_id: use mawk-compatible regular expression
Slot mapping in vdev_id doesn't work on systems using mawk as the 'awk'
alternative. A regular expression in map_slot() contains an unquoted
empty string following the alternation (|) operator, which results in an
"missing operand" error with mawk. The solution is to rearrange the
expression so the alternation has two operands.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/pkg-zfs#136
Closes zfsonlinux/zfs#2965
Improve systemd script to not leave stale sharetab
The systemd script zfs-share.service does 'zfs share -a' to share
any required datasets. Unfortunately, /etc/dfs/sharetab is stale
from the previous boot. Delete it before we share.
Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2883
Brian Behlendorf [Mon, 20 Oct 2014 21:37:47 +0000 (14:37 -0700)]
Fix snapshots with dirty inodes
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode. This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2812
Isaac Huang [Tue, 21 Oct 2014 18:20:10 +0000 (12:20 -0600)]
bio_alloc() with __GFP_WAIT never returns NULL
Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL. However, we want to keep
the error handling in case this behavior changes in the futre.
Plus fix a small style issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2703
Ned Bass [Fri, 14 Nov 2014 18:21:53 +0000 (10:21 -0800)]
Explicitly include SPL compat headers
Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages. Include a
few compatiblity headers explicitly to cope with that change. Also,
sort some linux-specific inclusions alphabetically.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2898
Ned Bass [Fri, 7 Nov 2014 02:18:32 +0000 (18:18 -0800)]
Fix improper null-byte termination handling
Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.
- The snprintf() function always produces a null-byte terminated string
for non-negative return values, so it is not necessary to write out a
null-byte as a separate step.
- Also, it is unsafe to use the return value of snprintf() as an offset
for placing a null-byte, because if the output was truncated the return
value is the number of bytes that _would_ have been written had enough
space been available. Therefore the return value may index beyond the
array boundaries.
- Finally, snprintf() accounts for the null-byte when limiting its output
size, so there is no need to pass it a size parameter that is one less
than the buffer size.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2875
smh [Thu, 16 Oct 2014 02:23:27 +0000 (02:23 +0000)]
Prevent ZFS leaking pool free space
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.
In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.
The only way to import such a pool would be to specify -o readonly=on
to the zpool import.
zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.
This issue was filed as illumos 5347 and a more comprehensive fix is
under review. Once that change is finalized it will be integrated, in
the meanwhile the FreeBSD fix has been merged to prevent the issue.
Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Ahrens mahrens@delphix.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2896
Tim Chase [Tue, 11 Nov 2014 05:26:33 +0000 (23:26 -0600)]
Undirty freed spill blocks.
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written. This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.
The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:
The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884
Prakash Surya [Fri, 13 Jun 2014 17:54:48 +0000 (10:54 -0700)]
Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
I should also node the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth paramter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Thu, 6 Nov 2014 21:34:17 +0000 (13:34 -0800)]
cstyle: allow right paren on its own line
Make the style checker script accept right parentheses on their own
lines. This is motivated by the Linux tracepoints macro
DECLARE_EVENT_CLASS.
The code within TP_fast_assign() (a parameter of DECLARE_EVENT_CLASS)
is normal C assignments terminated by semicolons. But the style
checker forbids us from following a semicolon with a non-blank and
from preceding a right parenthesis with white space. Therefore the
closing parenthesis must go on the next line, yet the style checker
foribs us from indenting it for readability. Relaxing the
no-non-blank-after-semicolon rule would open the door to too many bad
style practices. So instead we relax the
no-white-space-before-right-paren rule if the parenthesis is on its
own line. The relaxation is overriden with the -p option so we still
have a way to catch misuse of this style.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Thu, 23 Oct 2014 23:59:27 +0000 (16:59 -0700)]
Fix dprintf format specifiers
Fix a few dprintf format specifiers that disagreed with their argument
types. These came to light as compiler errors when converting dprintf
to use the Linux trace buffer. Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Wed, 22 Oct 2014 00:59:33 +0000 (17:59 -0700)]
Move a few internal ARC strucutres to arc_impl.h
Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.
Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Prakash Surya [Mon, 6 Oct 2014 14:32:36 +0000 (16:32 +0200)]
Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO
5213 panic in metaslab_init due to space_map_open returning ENXIO
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Isaac Huang [Thu, 30 Oct 2014 19:29:58 +0000 (13:29 -0600)]
Fix inaccurate field descriptions
The field descriptions from arcstat.py -v for the demand accesses
are inaccurate. They all begin with "Demand Data" yet the fields
actually covered both demand data and demand meta-data accesses.
Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2842
Chris Wedgwood [Thu, 23 Oct 2014 23:00:41 +0000 (16:00 -0700)]
Reduce buf/dbuf mutex contention
Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.
This increase in hash table size adds approximating 0.5M to
our fixed memory footprint. This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized. In the meanwhile, this small
change significantly reduces contention for certain workloads.
Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #1291
Alex Zhuravlev [Thu, 13 Nov 2014 18:09:05 +0000 (10:09 -0800)]
Export symbols for ZIL interface
These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL. In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2892
Change the zvol helper program to replace any embedded spaces
in the pool or dataset names with '+' to ensure we have valid
symlinks.
The '+' character was choosen because it is not a valid character
for a dataset name but it is allowed by udev. This ensures that
all dataset names with an embedded space will be translated to
a unique /dev/zvol/ symlink.
Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Signed-off-by: Darik Horn <dajhorn@vanadac.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2834
Marcel Wysocki [Wed, 29 Oct 2014 11:11:43 +0000 (12:11 +0100)]
Add config/compile to config/.gitignore
This file may be added by automake and therefore should be added
to config/.gitignore. For the full list of possible auxiliary
programs see the full automake documentation.
Richard Yao [Thu, 18 Sep 2014 13:47:16 +0000 (09:47 -0400)]
Make systemd-modules-load.service file directory configurable
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. We could handle this by disabling systemd support on
prefix because systemd does not check these paths, but the Gentoo
Council decided that small files such as these should be installed.
That means disabling systemd support on prefix is not an acceptable
workaround. As a consequence, we need some way of control the directory
into which these files are installed.
Making this configurable increases our compliance with the
freedesktop.org specification, which allows these files to be installed
into /etc/modules-load.d:
Richard Yao [Fri, 29 Aug 2014 18:16:41 +0000 (14:16 -0400)]
Make directory into which mount.zfs is installed configurable
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. I could script a workaround inside the ebuild, but it
seemed to make more sense to make this more configurable.
Signed-off-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
Richard Yao [Fri, 29 Aug 2014 17:09:52 +0000 (13:09 -0400)]
Search /usr/local/src for SPL Object Directory
Since we changed the default location for the kernel headers to respect
--prefix in the SPL, we must search that location to prevent user builds
from breaking.
Signed-off-by: Richard Yao <richard.yao@clusterhq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
Tim Chase [Thu, 2 Oct 2014 12:21:08 +0000 (07:21 -0500)]
Linux 3.12 compat: shrinker semantics
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects. It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.
This patch adds support for the new @scan_objects return value semantics.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2837
Matthew Ahrens [Sat, 13 Sep 2014 13:40:05 +0000 (15:40 +0200)]
Illumos 5164-5165 - space map fixes
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.
The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz. The upstream commit incorrectly
used space_map_blksize.
Alex Reece [Mon, 22 Sep 2014 23:42:03 +0000 (01:42 +0200)]
Illumos 4958 zdb trips assert on pools with ashift >= 0xe
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Brian Behlendorf [Thu, 23 Oct 2014 22:26:49 +0000 (15:26 -0700)]
Fix zdb segfault
On 32-bit systems setting 'zfs_arc_max = 256M' in zdb results in the
following segmentation fault. Rather than reverting 0ec0724 which
introduced this flaw this code is only used for 64-bit builds.
Matthew Ahrens [Tue, 16 Sep 2014 20:24:48 +0000 (22:24 +0200)]
Illumos 5169-5171 - zdb fixes
5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Matthew Ahrens [Wed, 17 Sep 2014 07:14:39 +0000 (09:14 +0200)]
Illumos 5178 - zdb -vvvvv on old-format pool fails in dump_deadlist()
5178 zdb -vvvvv on old-format pool fails in dump_deadlist()
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
ilovezfs [Sat, 4 Oct 2014 05:20:43 +0000 (22:20 -0700)]
Fix zpool create -t ENOENT bug.
In userland we need to switch over to the temporary name once the
pool has been created, otherwise the root dataset won't mount
and the error "cannot open 'the_real_name': dataset does not exist"
is printed.
Signed-off-by: ilovezfs <ilovezfs@icloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2760
Brian Behlendorf [Wed, 10 Sep 2014 18:59:03 +0000 (11:59 -0700)]
Handle block pointers with a corrupt logical size
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything. This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block. If a blocks checksum is valid then its contents are
trusted by the higher layers.
This system works exceptionally well as long as bad data is never
written with a valid checksum. If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in kernel panic.
One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.
To prevent this from happening the arc_read() function has been
updated to detect this specific case. If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2678
Ned Bass [Thu, 16 Oct 2014 20:52:56 +0000 (13:52 -0700)]
Remove checks for mandatory locks
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp(). Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose. Rather than emulating those interfaces we remove the
redundant checks.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804
Matthew Ahrens [Sat, 13 Sep 2014 14:02:18 +0000 (16:02 +0200)]
Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy
5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Matthew Ahrens [Fri, 12 Sep 2014 03:45:50 +0000 (05:45 +0200)]
Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness
When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!
Steps to reproduce:
* zpool create test c1t1d0
* zfs create test/fs
* zfs snapshot test/fs@a
* zfs clone test/fs@a test/clone
* zfs destroy -d test/fs@a
* zfs clone test/fs@a test/clone2
* zfs snapshot test/clone2@a
* zfs hold hld test/clone2@a
* zfs release hld test/clone2@a
* zfs list -r -t all test
<test/clone2@a has been destroyed>
We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.
5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Matthew Ahrens [Fri, 12 Sep 2014 03:28:35 +0000 (05:28 +0200)]
Illumos 3693 - restore_object uses at least two transactions to restore an object
Restore_object should not use two transactions to restore an object:
* one transaction is used for dmu_object_claim
* another transaction is used to set compression, checksum and most
importantly bonus data
* furthermore dmu_object_reclaim internally uses multiple transactions
* dmu_free_long_range frees chunks in separate transactions
* dnode_reallocate is executed in a distinct transaction
The fact the dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups. Also, if
the first transaction group is successfully written to a permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.
3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon
Tim Chase [Tue, 7 Oct 2014 13:01:01 +0000 (08:01 -0500)]
Don't perform ACL-to-mode translation on empty ACL
In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to
create a traditional mode value based on an ACL. If no ACL exists, this
processing shouldn't be done. Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
and also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:
# create a file with typical mode of 664
echo test > test
chown anyuser test
ls -l test
and the mode will show up as all zeroes. Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1264