granicus.if.org Git

Update SAs when an inode is dirtied

Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it in to a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
Closes #1140

Update 69-vdev.rules .gitignore

Commit 2957f38 renamed 60-vdev.rules to 69-vdev.rules but failed
to update the .gitignore file to reflect this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Avoid ELOOP on auto-mounted snapshots

Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816

Linux 3.7 compat, schedule_delayed_work()

Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers. While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.

I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel. But frankly
the compatibility code to use this kernel interface has always
been problematic.

Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1053

Switch KM_SLEEP to KM_PUSHPAGE

When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning. See commit
b8d06fca089fae4680c3a552fc55c512bfb02202 for additional details.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1138

Revert "Fix unlink/xattr deadlock"

This reverts commit b00131d43ca344d4b205a03ab3eb771a060e5087 which
is no longer needed due to e89260a1c8851ce05ea04b23606ba438b271d890.

This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently. This effectively prevents
the deadlock condition from ever occuring.

Therefore we can safely revert back to the upstream synchronous
cleanup code. This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #457

Preserve inode mtime/ctime in .writepage()

When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistaken rollback to the previous on-disk znode state.

Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode.  This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764

Fix 'zpool create' segfault due to bad syntax

Incorrect syntax should never cause a segfault.  In this case
listing multiple comma delimited options after '-o' triggered
the problem.  For example:

  zpool create -o ashift=12,listsnaps=on

This patch resolves the issue by wrapping the calls which use
hdr with a NULL test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1118

vdev_id support for device link aliases

Add a vdev_id feature to map device names based on already defined
udev device links.  To increase the odds that vdev_id will run after
the rules it depends on, increase the vdev.rules rule number from 60
to 69.  With this change, vdev_id now provides functionality analogous
to zpool_id and zpool_layout, paving the way to retire those tools.

A defined alias takes precedence over a topology-derived name, but the
two naming methods can otherwise coexist. For example, one might name
drives in a JBOD with the sas_direct topology while naming an internal
L2ARC device with an alias.

For example, the following lines in vdev_id.conf will result in the
creation of links /dev/disk/by-vdev/{d1,d2}, each pointing to the same
target as the device link specified in the third field.

  #     by-vdev
  #     name     fully qualified or base name of device link
  alias d1       /dev/disk/by-id/wwn-0x5000c5002de3b9ca
  alias d2       wwn-0x5000c5002def789e

Also perform some minor vdev_id cleanup, such as removal of the unused
-s command line option.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #981

Directory xattr znodes hold a reference on their parent

Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent.  Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it.  Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.

This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.

The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX.  Otherwise we
run the risk of deadlocking on the open create TX.

Ideally the security xattr should be fully constructed
before the new inode is unlocked.  However, doing so would
require far more extensive changes to ZFS.

This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #671

Implemented sharing datasets via SMB using libshare

Add the initial support for the 'smbshare' option using the
existing libshare infrastructure. Because this implementation
relies on usershares samba version 3.0.23 is required.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #493

Make zpool attach -o ashift=... actually work

Commit df83110856950c8e7b16a7e94cdf42b8531b9cc8 missed update to
getopt() call, while delivering all the rest. This commit adds
"o" to getopt().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #566

Add load_nvlist() error handling

Add the missing error handling to load_nvlist().  There's no good
reason this needs to be fatal.  All callers of load_nvlist() do
correctly handle an error condition and it is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1120

Allow GPT+EFI vdevs for root pools

Commit 57a4edd allows the bootfs property to be set on any pool.
However, many of the zpool commands still prevent you from using
EFI labeled devices for the root pool.  For example:

    # zpool attach rpool /dev/sda /dev/sdb
    cannot label 'sdb': EFI labeled devices are not supported on
    root pools. on root devices.

For non-Solaris builds such as Linux disable this error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1077

Disable page allocation warnings for super block

Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings. The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1101

Verify --with-linux source directory exists

Previously this check was only performed when ./configure was
attempting to autodetect your kernel source directory. But we
should also handle the case where --with-linux was provided
and is obviously wrong. This way we catch the error before
invoking make and compiling the source with an incorrect
autoconf results.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#162

vdev_id fails to handle complex device topologies

While expanding positional parameters shell requires non-single
digits to be enclosed in braces. When the SAS topology is
non-trivial the number of positional parameters generated internally
by vdev_id script (using set -- ...) easily crosses single digit limit
and vdev_id fails to generate links.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1119

Make vdev_id POSIX sh compatible

Full bash may not be available in all environments where udev helpers
run, such as in an initial ramdisk. To avoid breakage in this case,
remove use of bash-specific features such as variable arrays and the
`declare' keyword from the vdev_id script.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #870

Fix NULL deref when zvol_alloc() fails

If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown. To avoid this entirely the zv->objset
line is moved up in to the success block.

Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1109

Increase ZFS_OBJ_MTX_SZ to 256

Increasing this limit costs us 6144 bytes of memory per mounted
filesystem, but this is small price to pay for accomplishing
the following:

* Allows for up to 256-way concurreny when performing lookups
  which helps performance when there are a large number of
  processes.

* Minimizes the likelyhood of encountering the deadlock
  described in issue #1101.  Because vmalloc() won't strictly
  honor __GFP_FS there is still a very remote chance of a
  deadlock.  See the zfsonlinux/spl@043f9b57 commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1101

Recreate minors when renaming zvols

When a zvol with snapshots is renamed the device files under
/dev/zvol/ are not renamed. This patch resolves the problem
by destroying and recreating the minors with the new name so
the links can be recreated bu udev.

Original-patch-by: Suman Chakravartula <schakrava@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #408

mount.zfs: canonicalize mount point for mtab

Canonicalize the mount point passed to the mount.zfs helper.
This way a clean path is always added to mtab which ensures
the umount can properly locate and remove the entry.

Test case:
$ mkdir /mnt/foo
$ mount -t zfs zpool/foo /mnt/../mnt/foo////
$ umount /mnt/foo
$ cat /etc/mtab | grep zpool/foo
zpool/foo /mnt/../mnt/foo//// zfs rw 0 0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #573

Merge branch 'ashift'

This branch adds some overdue ashift improvements.

  * Add '-o ashift' to 'zpool add' and 'zpool attach'
  * Improve AF hard disk detection
  * Allow 'zpool import' to handle increases in ashift

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add "-o ashift" to zpool add and zpool attach

When adding devices to an existing pool "ashift" property is
auto-detected.  However, if this property was overridden at
the pool creation time (i.e. zpool create -o ashift=12 tank ...)
this may not be what the user wants.  This commit lets the user
specify the value of "ashift" property to be used with newly
added drives. For example,

    zpool add -o ashift=12 tank disk1
    zpool attach -o ashift=12 tank disk1 disk2

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #566

Improve AF hard disk detection

Use the bdev_physical_block_size() interface to determine the
minimize write size which can be issued without incurring a
read-modify-write operation.  This is used to set the ashift
correctly to prevent a performance penalty when using AF hard
disks.

Unfortunately, this interface isn't entirely reliable because
it's not uncommon for disks to misreport this value.  For this
reason you may still need to manually set your ashift with:

  zpool create -o ashift=12 ...

The solution to this in the upstream Illumos source was to add
a white list of known offending drives.  Maintaining such a list
will be a burden, but it still may be worth doing if we can
detect a large number of these drives.  This should be considered
as future work.

Reported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #916

Illumos #2671: zpool import should not fail if vdev ashift has increased

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Refererces to Illumos issue:
      https://www.illumos.org/issues/2671

This patch has been slightly modified from the upstream Illumos
version.  In the upstream implementation a warning message is
logged to the console.  To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.

The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool.  Depending on your workload
this may impact pool performance.  Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.

NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #955

zfs-0.6.0-rc12

Fix hard coded path in 60-vdev.rules.in

The udev data directory was hard coded in 60-vdev.rules.in. That causes
a problem when a distribution changes the location of the directory.
This was not an issue in the past because virtually all distributions
used the same path, but that is beginning to change following a decision
by the systemd developers to change the directory location to reflect
their take-over of udev maintainership. The testing branch of Gentoo
Linux adopted this change, which enabled the hardcoded directory
location to trigger a regression.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1085

Fix "allocating allocated segment" panic

Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().

Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record. The mmap
case which relies on zfs_putpage() was not taking the range
lock. This code was accidentally dropped when the function
was rewritten for the Linux VFS.

This patch adds the required range locking to zfs_putpage().

It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upsteam
code and there's no harm in being extra careful here.

Original-patch-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #541

Fix zvol+btrfs hang

When using a zvol to back a btrfs filesystem the btrfs mount
would hang.  This was due to the bio completion callback used
in btrfs assuming that lower level drivers would never modify
the bio->bi_io_vecs after they were submitted via bio_submit().
If they are modified btrfs will miscalculate which pages need
to be unlocked resulting in a hang.

It's worth mentioning that other file systems such as ext[234]
and xfs work fine because they do not make the same assumption
in the bio completion callback.

The most straight forward way to fix the issue is to present
the semantics expected by btrfs.  This is done by cloning the
bios attached to each request and then using the clones bvecs
to perform the required accounting.  The clones are freed after
each read/write and the original unmodified bios are linked back
in to the request.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469

Merge remote branch 'eris/stats'

Bring in support for the new KSTAT_TYPE_TXG type. This allows for
additional visibility in to the txg handling.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Log I/Os longer than zio_delay_max (30s default)

There have been reports of ZFS deadlocking due to what appears to
be a lost IO.  This patch addes some debugging to determine the
exact state of the IO which neither 1) completed, 2) failed, or
3) timed out after zio_delay_max (30) seconds.

This information will be logged using the ZFS FMA infrastructure
as a 'delay' event and posted to the internal zevent log.  By
default the last 64 events will be kept in the log but the limit
is configurable via the zfs_zevent_len_max module option.

To dump the contents of the log use the 'zpool events -v' command
and look for the resource.fs.zfs.delay event.  It will include
various information about the pool, vdev, and zio which may shed
some light on the issue.

In the context of this change the 120 second kernel blocked thread
watchdog has been disabled for synchronous IOs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #930

Add txgs-<pool> kstat file

Create a kstat file which contains useful statistics about the
last N txgs processed.  This can be helpful when analyzing pool
performance.  The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ddt_object_count() error handling

The interface for the ddt_zap_count() function assumes it can
never fail.  However, internally ddt_zap_count() is implemented
with zap_count() which can potentially fail.  Now because there
was no way to return the error to the caller a VERIFY was used
to ensure this case never happens.

Unfortunately, it has been observed that pools can be damaged in
such a way that zap_count() fails.  The result is that the pool can
not be imported without hitting the VERIFY and crashing the system.

This patch reworks ddt_object_count() so the error can be safely
caught and returned to the caller.  This allows a pool which has
be damaged in this way to be safely rewound for import.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #910

Revert "Don't ashift-align vdev read requests."

This reverts commit a5c20e2a0a9046c06d86615fbf51dc04f12bba14 which
accidentally introduced a regression for real 4k sector devices.
See issue #1065 for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1065

Remove 'Resized bio's/dio' warning

The following warning was originally added to provide visibility
in to how often a dio gets heavily fragmented in to over 16 bios.
This can happen due to constraints imposed by the block device
and may have a negitive impact on performance but is otherwise
harmless. To prevent needless confusion and worry the message
has been removed.

kernel: WARNING: Resized bio's/dio to 32

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update spare and cache device names on import

During 'zpool import' all ZPOOL_CONFIG_PATH names are supposed
to be updated by fix_paths().  This was not happening for spare
and cache devices because the proper names were getting filtered
out of the pool_list_t->names.  Interestingly, the names were
being filtered because the spare and cache devices do not
contain the pool name in their vdev label.

The fix is to exclude the device path from the list only if:

  1) has a valid ZPOOL_CONFIG_POOL_NAME key in the label, and
  2) that pool name does not match the specified pool name.

Since the label is valid and because it does properly store the
vdev guid it will be correctly assembled without the pool name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #725

Allow 'zpool replace' to use short device names

The 'zpool replace' command would fail when given a short name
because unlike on other platforms the short name cannot be
deterministically expanded to a single path.  Multiple path
prefixes must be checked and in addition the partition suffix
for whole disks is determined by the prefix.

To handle this complexity a zfs_strcmp_pathname() function was
added which takes either a short or fully qualified device name.
Short names will be expanded using the prefixes in the default
import search path, or the ZPOOL_IMPORT_PATH environment variable
if it's defined.  All posible expansions are then compared against
the comparison path.  Care is taken to strip redundant slashes to
ensure legitimate matches are not missed.

In the context of this work the existing zfs_resolve_shortname()
function was extended to consider the ZPOOL_IMPORT_PATH when set.
The zfs_append_partition() interface was also simplified to take
only a single buffer.

The vast majority of these changes rework existing Linux specific
code which was originally written to accomidate udev.  However,
there is some minimal cleanup which removes Illumos specific code.
This was done to improve readability but the basic flow and intent
of the upstream code was maintained.

These changes are the logical conclusion of the previos work to
adjust the 'zpool import' search behavior, see commit 44867b6a.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #544
Closes #976

Quote snapshot and mountpoint for .zfs automount

When automounting a snapshot in the .zfs/snapshot directory
make sure to quote both the dataset name and the mount point.
This ensures that if either component contains spaces, which
are allowed, they get handled correctly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1027

Merge branch 'zil-performance'

This brnach brings some ZIL performance optimizations, with
significant increases in synchronous write performance for
some workloads and pool configurations.

See the individual commit messages for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1013

Use the slog even with logbias=throughput.

In the current code, logbias=throughput implies the following:
1) All synchronous writes are logged in indirect mode.
2) The slog is not used.

(1) makes sense because it avoids writing the data twice, which is
obviously a good thing when the user wants maximum pool throughput.

(2), however, is a surprising decision. Considering all writes are
indirect, the log record doesn't contain the actual data, only pointers
to DMU blocks. As a result, log records written in logbias=throughput
mode are quite small, and as such, it doesn't make any sense to write
them to the main pool since slogs are usually optimized for small
synchronous writes.

In fact, the current behavior is actually harmful for performance,
because log blocks and data blocks from dmu_sync() seldom have the same
allocation size and as a result are usually allocated from different
metaslabs. This means that if a spindle has to write both log blocks and
DMU blocks (which is likely to happen under heavy load), it will have to
seek between the two. Allocating the log blocks from the slog pool
instead of the main pool avoids these unnecessary seeks.

This commit makes ZFS use the slog on datasets with logbias=throughput.
Real-life performance testing shows a 50% synchronous write performance
increase with some large commit sizes, and no negative effect in other
cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013

Add FASTWRITE algorithm for synchronous writes.

Currently, ZIL blocks are spread over vdevs using hint block pointers
managed by the ZIL commit code and passed to metaslab_alloc(). Spreading
log blocks accross vdevs is important for performance: indeed, using
mutliple disks in parallel decreases the ZIL commit latency, which is
the main performance metric for synchronous writes. However, the current
implementation suffers from the following issues:

1) It would be best if the ZIL module was not aware of such low-level
details. They should be handled by the ZIO and metaslab modules;

2) Because the hint block pointer is managed per log, simultaneous
commits from multiple logs might use the same vdevs at the same time,
which is inefficient;

3) Because dmu_write() does not honor the block pointer hint, indirect
writes are not spread.

The naive solution of rotating the metaslab rotor each time a block is
allocated for the ZIL or dmu_sync() doesn't work in practice because the
first ZIL block to be written is actually allocated during the previous
commit. Consequently, when metaslab_alloc() decides the vdev for this
block, it will do so while a bunch of other allocations are happening at
the same time (from dmu_sync() and other ZILs). This means the vdev for
this block is chosen more or less at random. When the next commit
happens, there is a high chance (especially when the number of blocks
per commit is slightly less than the number of the disks) that one disk
will have to write two blocks (with a potential seek) while other disks
are sitting idle, which defeats spreading and increases the commit
latency.

This commit introduces a new concept in the metaslab allocator:
fastwrites. Basically, each top-level vdev maintains a counter
indicating the number of synchronous writes (from dmu_sync() and the
ZIL) which have been allocated but not yet completed. When the metaslab
is called with the FASTWRITE flag, it will choose the vdev with the
least amount of pending synchronous writes. If there are multiple vdevs
with the same value, the first matching vdev (starting from the rotor)
is used. Once metaslab_alloc() has decided which vdev the block is
allocated to, it updates the fastwrite counter for this vdev.

The rationale goes like this: when an allocation is done with
FASTWRITE, it "reserves" the vdev until the data is written. Until then,
all future allocations will naturally avoid this vdev, even after a full
rotation of the rotor. As a result, pending synchronous writes at a
given point in time will be nicely spread over all vdevs. This contrasts
with the previous algorithm, which is based on the implicit assumption
that blocks are written instantaneously after they're allocated.

metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to
manually increase or decrease fastwrite counters, respectively. They
should be used with caution, as there is no per-BP tracking of fastwrite
information, so leaks and "double-unmarks" are possible. There is,
however, an assert in the vdev teardown code which will fire if the
fastwrite counters are not zero when the pool is exported or the vdev
removed. Note that as stated above, marking is also done implictly by
metaslab_alloc().

ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to
the metaslab when allocating (assuming ZIO does the allocation, which is
only true in the case of dmu_sync). This flag will also trigger an
unmark when zio_done() fires.

A side-effect of the new algorithm is that when a ZIL stops being used,
its last block can stay in the pending state (allocated but not yet
written) for a long time, polluting the fastwrite counters. To avoid
that, I've implemented a somewhat crude but working solution which
unmarks these pending blocks in zil_sync(), thus guaranteeing that
linguering fastwrites will get pruned at each sync event.

The best performance improvements are observed with pools using a large
number of top-level vdevs and heavy synchronous write workflows
(especially indirect writes and concurrent writes from multiple ZILs).
Real-life testing shows a 200% to 300% performance increase with
indirect writes and various commit sizes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013

Add atomic_sub_* functions to libspl.

Both the SPL and the ZFS libspl export most of the atomic_* functions,
except atomic_sub_* functions which are only exported by the SPL, not by
libspl. This patch remedies that by implementing atomic_sub_* functions
in libspl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013

Merge branch 'condvar'

Auditing the code to verify that all instances of cv_signal() and
cv_broadcast() are called under the proper associated mutex turned
up several races. None of these have been conclusively seen in the
wild but the following patch set resolves them.

For reference, from the cv_signal(9F) man page:

  cv_signal() signals the condition and wakes one blocked thread.
  All blocked threads can be unblocked by calling cv_broadcast().
  You must acquire the mutex passed into cv_wait() before calling
  cv_signal() or cv_broadcast()

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1048

Condition variable usage, zp->r_{rd,wr}_cv

The following incorrect usage of cv_broadcast() was caught by
code inspection. The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Condition variable usage, zilog->zl_cv_batch

The following incorrect usage of cv_signal and cv_broadcast()
was caught by code inspection. The cv_signal and cv_broadcast()
functions must be called under the associated mutex to preventing
racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Condition variable usage, zevent_cv

The following incorrect usage of cv_broadcast() was caught by
code inspection. The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Do not return /dev/loop-control in unused_loop_device

The function unused_loop_device in /usr/libexec/zfs/common.sh
returns /dev/loop-control on the first call. This device is NOT
a loop device (https://github.com/torvalds/linux/commit/770fe30)
it is a control device. This in turn causes the script zconfig.sh
to fail with:

zpool-create.sh: Error 1 creating /tmp/zpool-vdev0 ->
/dev/loop-control loopback

The patch makes the function return /dev/loop[0-9]* which are
loop devices.

Signed-off-by: Andrew Reid <ColdCanuck@nailedtotheperch.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #797

Switch KM_SLEEP to KM_PUSHPAGE

In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1038

Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE

Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520

Return positive error number in zfsctl_shares_lookup.

Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes #626 #885 #947 #977

Disable ztest deadman timer

The ztest deadman timer has been causing false positives in the
testing VMs. To make it easier to spot possible regressions
I'm disabling this timer. The buildbot test infrastructure
will still mark ztest instances which take to long to complete
as failures.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1018

Merge branch 'linux-3.6'

This branch adds the required compatibility code to support the
Linux 3.6 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #873

Linux 3.6 compat, iops->mkdir()

Use .mkdir instead of .create in 3.3 compatibility check. Linux 3.6
modifies inode_operations->create's function prototype. This causes
an autotools Linux 3.3. compatibility check for a function prototype
change in create, mkdir and mknode to fail. Since mkdir and mknode
are unchanged, we modify the check to examine it instead.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873

Linux 3.6 compat, iops->create()

As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create. Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873

Linux 3.6 compat, iops->lookup()

As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup. Instead
only the inamedata->flags are passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873

Linux 3.6 compat, sget()

As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.

ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873

Linux 3.6 compat, sops->write_super() removed

The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.

ZFS never made use of this callback so it can simply be removed
from the super_operations structure.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873

Don't ashift-align vdev read requests.

Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.

This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.

For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.

We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.

As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
  the path that the additional, useless padding data has to travel.
  For example, when using 4k-sector drives that lie about their sector
  size, using 512b read requests instead of 4k means that there will
  be less data traveling down the ATA/SCSI interface, even though the
  drive actually reads 4k from the platter.

The only exception is raidz, because raidz needs to read the whole
allocated block for parity.

This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1022

Remove vmem_size() consumers

There are currently three vmem_size() consumers all of which are
part of the ARC implemention.  However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.

* arc_evict_needed() - This is actually dead code.  Arena support
was never added to the SPL and zio_arena is always NULL.  This
support isn't needed so we simply remove this dead code.

* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free.  However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC.  So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.

* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #831

Fix zfs_txg_timeout module parameter

Allow the zfs_txg_timeout variable to be dynamically tuned at run
time.  By pulling it down out of the variable declaration it will
be evaluted each time through the loop.

The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c.  This prevents
potential type mismatches if the global variable needs to be used
elsewhere.

Move the module_param() code in to the same source file where
zfs_txg_timeout is declared.  This is the most logical location.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix zfs_write_limit_max integer size mismatch on 32-bit systems

Commit c409e4647f221ab724a0bd10c480ac95447203c3 introduced a
number of module parameters. This required several types to be
changed to accomidate the required module parameters Linux macros.

Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart. If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.

The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.

We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #749

Make zfs_immediate_write_sz a module paramater

zfs_immediate_write_sz variable is a tunable, but lacks proper
module_param() instrumentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1032

txg is spelled as tgx in places

Term 'transaction group' is commonly abbreviated as txg in ZFS sources.
There are some places (Linux specific MODULE_PARAM_DESC() macros)
where it is incorrectly spelled as 'tgx'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1030

zfs.8: add missing info about dedup, mlslabel

These sections were missing from the `zfs.8` man page. Add them.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1026

Switch KM_SLEEP to KM_PUSHPAGE

Prevent snapshot_check to initiate I/O during memory allocation.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1023

Set default zvol elevator to noop

It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device.  Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available.  This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.

We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes.  It's
better to simply manually tune the scheduler for these kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1017

Align DISCARD requests on zvols.

Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.

Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.

This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1010

Merge branch 'illumos-ztest'

This branch is a port of the ztest backwards compatibility
testing option. It includes the original upstream Illumos
patch plus several followup patches to address concerns in
the original change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Realpath arg 2 must be a minimum of PATH_MAX

The realpath(3) function expects that when a buffer is passed
for the 'resolved_path' that it be at least PATH_MAX in length.
If it's not a buffer overflow may occur.

Therefore the passed buffer size is changed from MAXNAMELEN to
MAXPATHLEN.  We also take this opertunity to dynamically allocate
the buffer to keep it off the stack.

  warning: call to '__realpath_chk_warn' declared with attribute
  warning: second argument of realpath must be either NULL or at
  least PATH_MAX bytes long buffer [enabled by default]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Verify the return value for warn_unused_result functions

Under Linux the following functions are flagged with the
attribute warn_unused_result, this triggers a warning when
ever they are used without checking the return value.

To handle this case we check the result VERIFY().  It's
better to detect this immediately on failure rather than
segfault farther down in the function.

  ../../cmd/ztest/ztest.c:6033:2: warning:
  ignoring return value of 'asprintf', declared with
  attribute warn_unused_result [-Wunused-result]
  ../../cmd/ztest/ztest.c:739:3: warning:
  ignoring return value of 'realpath', declared with
  attribute warn_unused_result [-Wunused-result]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Replace tempnam() with mkstemp()

The use of tempnam() is racy and it should be avoided in favor of
mkstemp().  According to the Linux tempnam(3) man page.

  "Although tempnam() generates names that are difficult to guess,
  it is nevertheless possible that between the time that tempnam()
  returns a pathname, and the time that the program opens it, another
  program might create that pathname using open(2), or create it as
  a symbolic link.  This can lead to security holes.  To avoid such
  possibilities, use the open(2) O_EXCL flag to open the  pathname.
  Or better yet, use mkstemp(3) or tmpfile(3)."

This issue was flagged by gcc.

  ztest.o: In function `setup_data_fd': cmd/ztest/ztest.c:5822:
  warning: the use of `tempnam' is dangerous, better use `mkstemp'

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Minimize ztest stack frame size

To ensure ztest behaves as similarly as possible to the kernel
implementation of ZFS we attempt to honor the kernel stack limits.
This includes keeping the individual stack frame sizes under 1K
in size. We currently use gcc to detect and enforce this limit.

Therefore to get this building cleanly with full debugging enabled
the stack usage in the following functions has been reduced by
moving the buffer to the heap.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Use dynamic file descriptor numbers in ztest.

Currently, ztest expects to get 3 and 4 as the file descriptors for
data and random files, respectively. This is quite fragile and breaks
easily if ztest is run with these file descriptors already opened
(e.g. in a complex shell script).

This patch fixes the issue by removing the assumptions on the file
descriptor numbers that open() returns.

For the random file (/dev/urandom), the new code doesn't rely on a
shared file descriptor; instead, it reopens the file in the child.

For the data file, the new code writes the file descriptor number into
a "ZTEST_FD_DATA" environment variable so that it can be recovered
after the execv() call.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix mmap() usage in ztest.

illumos/illumos-gate@ad135b5d644628e791c3188a6ecbd9c257961ef8
Illumos changeset: 13700:2889e2596bd6

Note that this is only a partial port of the aforementioned Illumos
changeset.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

Ported to zfsonlinux by: Etienne Dechamps <etienne.dechamps@ovh.net>

Illumos #1950: ztest backwards compatibility testing option.

illumos/illumos-gate@420dfc9585ff67e83ee7800a7ad2ebe1a9145983
Illumos changeset: 13571:a5771a96228c

1950 ztest backwards compatibility testing option

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Switch KM_SLEEP to KM_PUSHPAGE

This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fc for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1002

Illumos #3100: zvol rename fails with EBUSY when dirty.

illumos/illumos-gate@2e2c135528b3edfe9aaf67d1f004dc0202fa1a54
Illumos changeset: 13780:6da32a929222

3100 zvol rename fails with EBUSY when dirty

Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #995

Illumos #2399: zfs manual page does not document use of "zfs diff".

illumos/illumos-gate@3b8be6bf4fd2c744dfb8b5ce2a6c85ad0a2c8f75
Illumos changeset: 13773:00c2a08cf1bb

2399 zfs manual page does not document use of "zfs diff"

Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>

Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #940

Illumos #3129, #3130

illumos/illumos-gate@d6afdce20f8481c95471dd821bc8ec0dbde66213
Illumos changeset: 13794:7c5e0e746b2c

3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
     0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3129
  https://www.illumos.org/issues/3130

Ported by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #994

Temporarily disable the reguid test.

Currently, ztest fails with the following error:

error: Pool 'ztest' has encountered an uncorrectable I/O failure
and the failure mode property for this pool is set to panic.

We know how to fix it (see issue #939), but it may take some time
before we get around to merging the fix, which has some heavy
dependencies.

In the mean time, it is not ideal to be unable to use ztest just
because of a small isolated issue, so this patch works around the
problem by disabling the reguid test. This is just a temporary hack to
keep ztest usable.

The reguid test will be enabled again when the proper fix is merged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #997

Fix ztest vdev file paths.

Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:

snprintf(path, sizeof (path), ztest_dev_template, ...);

This worked fine until 40b84e7aec6392187722e61e5a4a853b530bf60f, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.

As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".

This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989

Fix VOP_CLOSE() in userspace.

Currently, for unknown reasons, VOP_CLOSE() is a no-op in userspace.
This causes file descriptor leaks. This is especially problematic with
long ztest runs, since zpool.cache is opened repeatedly and never
closed, resulting in resource exhaustion (EMFILE errors).

This patch fixes the issue by making VOP_CLOSE() do what it is supposed
to do.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989

Create threads in detached state in userspace.

Currently, thread_create(), when called in userspace, creates a
joinable (i.e. not detached thread). This is the pthread default.

Unfortunately, this does not reproduce kthreads behavior (kthreads
are always detached). In addition, this contradicts the original
Solaris code which creates userspace threads in detached mode.

These joinable threads are never joined, which leads to a leakage of
pthread thread objects ("zombie threads"). This in turn results in
excessive ressource consumption, and possible ressource exhaustion in
extreme cases (e.g. long ztest runs).

This patch fixes the issue by creating userspace threads in detached
mode. The only exception is ztest worker threads which are meant to be
joinable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989

Modify vdev_elevator_switch() to use elevator_change()

As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.

Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #906

Force 4K blocksize when testing ext2 on zvol.

Currently, mkfs.ext2 on zconfig.sh zvols tries to use a 8K blocksize,
probably because by default zvol exposes an optimal I/O size of 8K.

Unfortunately, a ext2 blocksize of 8K is not supported by the kernel,
so the resulting filesystem is unmountable.

This patch fixes the issue by making sure the blocksize is 4K. We have
to use -F to force it else mkfs.ext2 won't allow us to use a blocksize
smaller than the optimal I/O size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #979

Implement .commit_metadata hook for NFS export

In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook. All it takes there
is to make sure changes are committed to ZIL. Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #969

zvol_probe should return NULL when the device isn't found.

Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #949
Closes #931
Closes #789
Closes #743
Closes #730

Illumos #2703: add mechanism to report ZFS send progress

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
https://www.illumos.org/issues/2703

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #1948: zpool list should show more detailed pool info

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
https://www.illumos.org/issues/1948

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #685

Switch KM_SLEEP to KM_PUSHPAGE

This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca089fae4680c3a552fc55c512bfb02202
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973

Seg fault 'zpool import -d /dev/disk/by-id -a'

Introduced by commit 44867b6d6effc1628dd00c36821ab044f89fb988.
We should of course check to ensure best isn't NULL before
attempting to dereference it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #974

zfs-0.6.0-rc11

Illumos #2088 zdb could use a reasonable manual page

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
https://www.illumos.org/issues/2088

Ported by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #682

Improve `zpool import` search behavior

The goal of this change is to make 'zpool import' prefer to use
the peristent /dev/mapper or /dev/disk/by-* paths.  These are far
preferable to the devices in /dev/ whos names are not persistent
and are determined by the order in which a device is detected.

This patch improves things by changing the default search path from
just to the top level /dev/ directory to (in order):

  /dev/disk/by-vdev   - Custom rules, use first if they exist
  /dev/disk/zpool     - Custom rules, use first if they exist
  /dev/mapper         - Use multipath devices before components
  /dev/disk/by-uuid   - Single unique entry and persistent
  /dev/disk/by-id     - May be multiple entries and persistent
  /dev/disk/by-path   - Encodes physical location and persistent
  /dev/disk/by-label  - Custom persistent labels
  /dev                - UNSAFE device names will change

The default search path can be overriden by setting the
ZPOOL_IMPORT_PATH environment variable.  This must be a colon
delimited list of paths which are searched for vdevs.  If the
'zpool import -d' option is specified only those listed paths
will be searched.

Finally, when multiple paths to the same device are found.  If one
of the paths is an exact match for the path used last time to import
the pool it will be used.  When there are no exact matches the
prefered path will be determined by the provided search order.

This means you can still import a pool and force specific names by
providing the -d <path> option.  And the prefered names will persist
as long as those paths exist on your system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #965

Switch KM_SLEEP to KM_PUSHPAGE

This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca089fae4680c3a552fc55c512bfb02202
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917

ZFS replay transaction error 5

When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.

The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #933

Clear PG_writeback for sync I/O error case

Commit 2b2861362f7dd09cc3167df8fddb6e2cb475018a accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.

The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit. Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #961

Fix zdb printf format string for ZIL data blocks

Without this fix the zdb printouts of ZIL data blocks look full of FF
due to printf() handling its arguments as int by default.

Here is the output before the fix

                TX_WRITE            len   4136, txg 1093817, seq 149231
                        foid 4242, offset 0, length f68
                        G FFFFFF8EFFFFFF87FFFFFF91FFFFFFCC 1c
FFFFFFAFFFFFFFC9FFFFFFBAZ FFFFFFC3

And the same after the fix

                TX_WRITE            len   4136, txg 1093817, seq 149231
                        foid 4242, offset 0, length f68
                        G 8E8791CC 1cAFC9BAZ C3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #962

Move iput() after zfs_inode_update()

When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed. This will result in a
NULL dereference as shown by the stack trace is issue #782.

This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #782