granicus.if.org Git

Support building a zfs-modules-dkms sub package

This commit adds support for building a zfs-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.

By default zfs-modules-dkms-* sub package will be built as part of
the 'make rpm' target.  Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.

Examples:

    # To build packaged binaries as well as a dkms packages
    $ ./configure && make rpm

    # To build only the packaged binary utilities and dkms packages
    $ ./configure && make rpm-utils rpm-dkms

Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
      supported for building the dkms sub package.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #535

Add '--with-spl-timeout' option

When checking for the SPL Module.symvers file, a timeout can now be
passed in which will pause the configure step while it waits for this
file to be generated. By default, the configure behavior is unchanged as
a timeout of 0 is used. If a positive number of seconds is passed,
configure will wait that number of seconds for the Module.symvers file
before moving on.

The main motivation for this change was to support parallel execution of
'./configure && make' for the SPL and ZFS packages in preparation of
supporting DKMS based packages.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #1693: persistent 'comment' field for a zpool

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
https://www.illumos.org/issues/1693

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #678

Set zvol discard_granularity to the volblocksize.

Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862

Add missing dependencies to ./copy-builtin

ZFS depends on EFI_PARTITION, ZLIB_DEFLATE and ZLIB_INFLATE, but when
ZFS is integrated with the kernel source tree, menuconfig does not
enforce these dependencies. This can cause build failures in the case of
ZLIB_DEFLATE and ZLIB_INFLATE where symbols are not found. This can also
cause runtime failures in the case of EFI_PARTITION, where the kernel
will not understand GPT partitions when creating pools from raw disks.
We solve this by making menuconfig aware of these dependencies.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #854

Limit the number of blocks to discard at once.

The number of blocks that can be discarded in one BLKDISCARD ioctl on a
zvol is currently unlimited. Some applications, such as mkfs, discard
the whole volume at once and they use the maximum possible discard size
to do that. As a result, several gigabytes discard requests are not
uncommon.

Unfortunately, if a large amount of data is allocated in the zvol, ZFS
can be quite slow to process discard requests. This is especially true
if the volblocksize is low (e.g. the 8K default). As a result, very
large discard requests can take a very long time (seconds to minutes
under heavy load) to complete. This can cause a number of problems, most
notably if the zvol is accessed remotely (e.g. via iSCSI), in which case
the client has a high probability of timing out on the request.

This patch solves the issue by adding a new tunable module parameter:
zvol_max_discard_blocks. This indicates the maximum possible range, in
zvol blocks, of one discard operation. It is set by default to 16384
blocks, which appears to be a good tradeoff. Using the default
volblocksize of 8K this is equivalent to 128 MB. When using the maximum
volblocksize of 128K this is equivalent to 2 GB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #858

Illumos #1644, #1645, #1646, #1647, #1708

1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
     destroying multiple snapshots
1708 adjust size of zpool history data

References:
  https://www.illumos.org/issues/1644
  https://www.illumos.org/issues/1645
  https://www.illumos.org/issues/1646
  https://www.illumos.org/issues/1647
  https://www.illumos.org/issues/1708

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  This change has reordered
all of the new ioctl()s to the end of the list.  This should
help minimize this issue in the future.

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>

Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #826
Closes #664

Adding grub2 mkconfig support patch

Added simply for convenience until this, or an equivilant, change
is merged in the upstream grub2 source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #847

Allow '-o remount' for non-legacy datasets

This is done for compatibility with existing Linux infrastructure.

In particular, when using zfs as a root filesystem there are init
scripts which as part of shutdown remount root read-only. Also,
the new systemd infrastructure being used by Fedora expects to be
able to remount a file system read-write.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #847

Merge branch 'builtin-clean'

Support in-tree builtin module building.

These commits add support for compiling the ZFS module as a built-in
kernel module by copying the module code into the kernel source tree.
Here's the procedure:

  - Create your kernel configuration (`.config` file) as usual. This
    has to be done first so that ZFS's configure script is able to
    detect kernel features correctly.
  - Run `make prepare scripts` inside the kernel source tree.
  - Run `./configure --enable-linux-builtin --with-linux=/usr/src/linux-...`
    inside the ZFS directory.
  - Run `./copy-builtin /usr/src/linux-...` inside the ZFS directory.
  - In the kernel source tree, enable the `CONFIG_ZFS` option (e.g. using
    `make menuconfig`). Note that this option depends on `CONFIG_SPL`
    (see zfsonlinux/spl@744038069d3dc65e721b5b8cc5c37d8c7fcbd8c0).
  - Build the kernel as usual.

ZFS module parameters can be set at boot time using the following syntax
on the kernel command line: `zfs.parameter_name=parameter_value`.

Note that you also need to rebuild the userspace tools (see
zfsonlinux/zfs@f09398cec665259a4c2f96726680fbd3b0a3bac3).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

Use /sys/module instead of /proc/modules.

When libzfs checks if the module is loaded or not, it currently reads
/proc/modules and searches for a line matching the module name.

Unfortunately, if the module is included in the kernel itself (built-in
module), then /proc/modules won't list it, so libzfs will wrongly conclude
that the module is not loaded, thus making all ZFS userspace tools unusable.

Fortunately, all loaded modules appear as directories in /sys/module, even
built-in ones. Thus we can use /sys/module in lieu of /proc/modules to fix
the issue.

As a bonus, the code for checking becomes much simpler.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

Add script for builtin module building.

This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building ZFS as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the zfs source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the zfs
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

When checking for symbol exports, try compiling.

This patch adds a new autoconf function: ZFS_LINUX_TRY_COMPILE_SYMBOL.
This new function does the following:

- Call LINUX_TRY_COMPILE with the specified parameters.
- If unsuccessful, return false.
- If successful and we're configuring with --enable-linux-builtin,
return true.
- Else, call CHECK_SYMBOL_EXPORT with the specified parameters and
return the result.

All calls to CHECK_SYMBOL_EXPORT are converted to
LINUX_TRY_COMPILE_SYMBOL so that the tests work even when configuring
for builtin on a kernel which doesn't have loadable module support, or
hasn't been built yet.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

Fake modpost stage for LINUX_COMPILE.

Currently, when building a test case, we're compiling an entire Linux
module from beginning to end. This includes the MODPOST stage, which
generates a "conftest.mod.c" file with some boilerplate module
declaration code.

This poses a problem when configuring for built-in on kernels which have
loadable module support disabled. In this case conftest.mod.c is
referencing disabled code, resulting in a compilation failure, thus
breaking the tests.

This patch fixes the issue by faking the modpost stage when the
--enable-linux-builtin option is provided. It does so by forcing the
modpost command to be /bin/true, and using an empty conftest.mod.c file.
The test module still compiles fine, although the result isn't loadable,
but we don't really care at this point.

Note it is important to preserve the modpost stage when building out of
tree. The ZFS_AC_KERNEL_BLK_END_REQUEST, ZFS_AC_KERNEL_BLK_QUEUE_FLUSH,
and ZFS_AC_KERNEL_BLK_RQ_BYTES configure checks all depend on it to
identify GPL-only symbols.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

Make configure builtin-aware.

This patch adds a new option to configure: --enable-linux-builtin. When
this option is used, the following happens:

- Compilation of kernel modules is disabled.

- A failure to find UTS_RELEASE is followed by a suggestion to run
"make prepare" on the kernel source tree.

This patch also adds a new test which tries to compile an empty module
as a basic toolchain sanity test. If it fails and the option was
specified, the error is followed by a suggestion to run "make scripts"
on the kernel source tree.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

Don't build packages that haven't been selected.

Currently, when configure --with-config is used, selective compilation
is only effective for the simple "make" case. Package builders (e.g.
make rpm) still build everything (utils and modules). This patch fixes
that.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851

Linux 3.5 compat, end_writeback() changed to clear_inode()

The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict(). This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.

However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics. This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #784

Linux 3.5 compat, iops->truncate_range() removed

The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784

Linux 3.5 compat, eops->encode_fh() takes inodes

The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes. This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784

Fix NULL pointer dereference on PaX/GRSecurity patched Linux 3.3 and later kernels

Support for PaX/GRSecurity patched kernels was developed against Linux
3.2. Unfortunately, an autotools check introduced for a Linux 3.3 API
fails on PaX/GRSecurity patched kernels. This causes the module to be
built against the Linux 3.2 ABI, which results in a NULL pointer
dereference at runtime.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes #794
Closes #809

Disable .zfs directory on 32-bit systems

The .zfs control directory implementation currently relies on
the fact that there is a direct 1:1 mapping from an object id
to its inode number.  This works well as long as the system
uses a 64-bit value to store the inode number.

Unfortunately, the Linux kernel defines the inode number as
an 'unsigned long' type.  This means that for 32-bit systems
will only have 32-bit inode numbers but we still have 64-bit
object ids.

This problem is particularly acute for the .zfs directories
which leverage those upper 32-bits.  This is done to avoid
conflicting with object ids which are allocated monotonically
starting from 0.  This is likely to also be a problem for
datasets on 32-bit systems with more than ~2 billion files.

The right long term fix must remove the simple 1:1 mapping.
Until that's done the only safe thing to do is to disable the
.zfs directory on 32-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ddt_object_load() error handling

Add the missing error handling to ddt_object_load().  There's no
good reason this needs to be fatal.  It is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add 'inline' keyword

The '__attribute__((always_inline))' does not strictly imply
'inline'.  Newer versions of gcc detect this misuse and issue
the following warning.  Including the missing 'inline' resolves
the build warning.

    ./module/zfs/dsl_scan.c:758:1:error: always_inline function
    might not be inlinable [-Werror=attributes]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix build failures on PaX/GRSecurity patched kernels

Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.

To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link(). This avoids compiler errors
in the PaX/GRSecurity dialect.

Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459c59bc74b7a7cd3af7311da2b1cd2c3.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #484

Move partition scanning from userspace to module.

Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808

Move zfs.release generation to configure step

Previously, the zfs.release file was created at 'make install' time.
This is slightly problematic when the file is needed without running
'make install'. Because of this, the step creating the file was removed
from 'make install' and replaced with a more appropriate zfs.release.in
file.

As a result, the zfs.release file will now be created earlier as part
of the 'configure' step as opposed to the 'make install' step.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add PowerPC to supported VTOCs

This code was was inherited from Solaris which was careful to define
the expected VTOC for various supported architectures. While this
check may have made sense there it's something we should be able to
safely drop under Linux.

However, I'm not quite ready to do that yet. So for the moment I'm
just doing the very safe thing of adding PowerPC as a supported type.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix efi_use_whole_disk() when efi_nparts == 128.

Commit e5dc681a changed EFI_NUMPAR from 9 to 128. This means that the
on-disk EFI label has efi_nparts = 128 instead of 9. The index of the
reserved partition, however, is still 8. This breaks
efi_use_whole_disk(), which uses efi_nparts-1 as the index of the
reserved partition.

This commit fixes efi_use_whole_disk() when the index of the reserved
partition is not efi_nparts-1. It rewrites the algorithm and makes it
more robust by using the order of the partitions instead of their
numbering. It assumes that the last non-empty partition is the reserved
partition, and that the non-empty partition before that is the data
partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808

Use the right device path when relabeling.

Currently, zpool_vdev_online() calls zpool_relabel_disk() with a short
partition device name, which is obviously wrong because (1)
zpool_relabel_disk() expects a full, absolute path to use with open()
and (2) efi_write() must be called on an opened disk device, not a
partition device.

With this patch, zpool_relabel_disk() gets called with a full disk
device path. The path is determined using the same algorithm as
zpool_find_vdev().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808

Fix error handling for "zpool online -e".

The error handling code around zpool_relabel_disk() is either inexistent
or wrong. The function call itself is not checked, and
zpool_relabel_disk() is generating error messages from an unitialized
buffer.

Before:

    # zpool online -e homez sdb; echo $?
    `: cannot relabel 'sdb1': unable to open device: 2
    0

After:

    # zpool online -e homez sdb; echo $?
    cannot expand sdb: cannot relabel 'sdb1': unable to open device: 2
    1

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808

Illumos #1949, #1953

1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
https://www.illumos.org/issues/1949
https://www.illumos.org/issues/1953

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665

Illumos #1748: desire support for reguid in zfs

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665

Relicense zfs.gentoo.in from GPLv2 to 2-clause BSD

As the Gentoo sys-fs/zfs maintainer, I receive license compatibility
questions and at times, those questions can be harassing. I feel that
the presence of the GPL in Gentoo's package metadata promotes such
questions. zfs.gentoo.in is the only GPLv2 licensed file in ZFS, so I
have taken the liberty of contacting all contributors to this file to
request permission to relicense it.

All of the contributors to this file have agreed to relicense it under
the 2-clause BSD license. I have added their Signed-offs to this commit,
in order of first contribution. Thank you everyone for being so
understanding.

Signed-off-by: devsk <devsku@gmail.com>
Signed-off-by: Alexey Shvetsov <alexxy@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Tselischev <andrewtselischev@gmail.com>
Signed-off-by: Zachary Bedell <zac@thebedells.org>
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Kyle Fuller <inbox@kylefuller.co.uk>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes #819

Use ULL suffix in constants

The lack of the ULL suffix causes warnings such as the following on
32-bit systems:

  In function 'zfsctl_is_snapdir':
  zfs-0.6.0//module/zfs/zfs_ctldir.c:151: warning: integer constant
  is too large for 'long' type

We add the ULL suffix to fix that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #813

Update incorrect ddt_zap_lookup() assertion

When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated. The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>

Remove Chaos 4.x RPM support

The Chaos 4.x distribution is based on RHEL 5.x which is no longer
supported by ZoL since it uses a 2.6.18 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support debug and debug-devel sub packages

This commit adds support for building debug and debug-devel sub packages
of the zfs-modules main package. This is to allow building packages
which are built against a debug kernel. By default, only packages are
built against a regular non-debug kernel. This can be toggled by passing
the '--with kernel-debug' parameter to rpmbuild.

Examples:

    # To build packages against only the non-debug kernel
    $ rpmbuild --rebuild --with kernel --without kernel-debug $SRPM

    # To build packages against only the debug kernel
    $ rpmbuild --rebuild --without kernel --with kernel-debug $SRPM

    # To build packages against debug and non-debug kernel
    $ rpmbuild --rebuild --with kernel --with kernel-debug $SRPM

Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are supported
      for building the debug and debug-devel packages.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ZIL statistics.

The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.

This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.

Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #786

ZFS 0.6.0-rc9

Speed up 'zfs list -t snapshot -o name -s name'

FreeBSD #xxx:  Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:

  # zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot
properties.

Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:

    before:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 525s
        warm cache: 218s

    after:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 1.7s
        warm cache: 1.1s

NOTE: This patch only appears in FreeBSD.  If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450

Add zvol_inhibit_dev module option.

ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.

When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.

The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make zvol_remove_link() print a more useful error message

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Mark zdev.conf as a config file

Prevent 'rpm -Uvh *.rpm" from automatically replacing your vdev.conf
file by flagging it as a non replacable config file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #486

Workaround for failing zvol_id

This is not a proper fix. It is just a workaround for the stack
smashing detected by gcc in zvol_id. We simply disable the gcc
stack protector for now when building the zvol_id udev helper.
Once the root cause is resolved this patch should be reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issues #569

Make zil_slog_limit a tunable module parameter.

zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.

The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #783

Retry removal of busy minors

When failing to remove a zvol device link because it's busy, wait
a bit and retry in a loop instead of giving up immediately. This
technique is similar to the loop in zpool_label_disk_wait(), with
the same goal: waiting for the asynchronous udev processes to finish
their work.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #692

Include <locale.h> to avoid error: 'LC_ALL' undeclared.

When compiling ZFS with CFLAGS=-O0 it will trigger the following error.
Resolve the issue by properly including locale.h.

  ../../cmd/mount_zfs/mount_zfs.c: In function 'main':
  ../../cmd/mount_zfs/mount_zfs.c:318:2: warning: implicit declaration
      of function 'setlocale' [-Wimplicit-function-declaration]
  ../../cmd/mount_zfs/mount_zfs.c:318:19: error: 'LC_ALL' undeclared
      (first use in this function)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #724

Linux 3.4 compat, d_make_root() replaces d_alloc_root()

torvalds/linux@adc0e91ab142abe93f5b0d7980ada8a7676231fe introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776

Honor logbias when writing to ZVOLs.

The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #774

Improve CONFIG_DEBUG_LOCK_ALLOC error message

The configure script error message for kernels built with
CONFIG_DEBUG_LOCK_ALLOC may give the impression that the issue is
strictly with license compliance. To avoid confusion add some words
indicating that the linking stage will fail if the build continues.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #773

Improve 'zpool import' EBUSY error message

When a device is already open O_EXCL by another process the
`zpool import` will correctly fail.  However, the default failure
message isn't very helpful.  It may in fact be harmful if you
take its advise and destroy your pool.

    cannot import 'tank': pool is busy
            Destroy and re-create the pool from
            a backup source.

Improve the error message in the EBUSY case to simply print a
message indicating that the devices are current in use.  The user
will need to manually identify which process has the device open
exclusively and why.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add /dev/mapper/ to search path

When creating pools short device names may be used when those
devices appear in certain well known locations under /dev/.
This change adds /dev/mapper/ to that list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add vdev_id for JBOD-friendly udev aliases

vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name.  The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive.  This is particularly helpful when it
comes to tasks like replacing failed drives.  Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory.  The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.

The only currently supported topologies are sas_direct and sas_switch:

o  sas_direct - a channel is uniquely identified by a PCI slot and a
   HBA port

o  sas_switch - a channel is uniquely identified by a SAS switch port

A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'.  In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.

vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch.  The script
could be extended to support other topologies as well.  The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file.  However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.

vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.

Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #713

Extend CONFIG_DEBUG_LOCK_ALLOC check

The CONFIG_DEBUG_LOCK_ALLOC check at configure time was added to
detect when mutex_lock() is defined as a GPL-only symbol. However,
the check as written only inferred this from this configuration
setting, it never actually checked. This change introduces that
missing check to prevent false positives.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Define the needed ISA types for ARM

Add the minimum required ISA types to support the ARM architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Revert "Disable direct reclaim on zvols"

This reverts commit ce90208cf9e04df966429f115d8831371ea9afce. This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels. This issue was filed as #710.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710

Add mdadm and bc dependencies

The zfs-test package additionally depends on mdadm and bc to
run the zfault.sh tests.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #690

Linux 3.3 compat, iops->create()/mkdir()/mknod()

The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701

Disable direct reclaim on zvols

Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #342

Update ARC memory limits to account for SLUB internal fragmentation

23bdb07d4e4c435205d25d3efdb5fef2d089ce5e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.

This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660

Integrate ARC more tightly with Linux

Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem.  It would attempt to recognize low memory situtions
before they occured and evict data from the cache.  It would also
make assessments about if there was enough free memory to perform
a specific operation.

This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available.  To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces.  While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory.  This has manifested
itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface.  That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback.  The Linux VM will call this
function when it needs memory.  The ARC is then responsible for
attempting to free the requested amount of memory if possible.

Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.

* Removed the hdr_recl() reclaim callback which is redundant
  with the broader arc_shrinker_func().

* Reduced arc_grow_retry to 5 seconds from 60.  This is now used
  internally in the ARC with arc_no_grow to indicate that direct
  reclaim was recently performed.  This typically indicates a
  rapid change in memory demands which the kswapd threads were
  unable to keep ahead of.  As long as direct reclaim is happening
  once every 5 seconds arc growth will be paused to avoid further
  contributing to the existing memory pressure.  The more common
  indirect reclaim paths will not set arc_no_grow.

* arc_shrink() has been extended to take the number of bytes by
  which arc_c should be reduced.  This allows for a more granual
  reduction of the arc target.  Since the kernel provides a
  reclaim value to the arc_shrinker_func() this value is used
  instead of 1<<arc_shrink_shift.

* arc_reclaim_needed() has been removed.  It was used to determine
  if the system was under memory pressure and relied extensively
  on Solaris specific VM interfaces.  In most case the new code
  just checks arc_no_grow which indicates that within the last
  arc_grow_retry seconds direct memory reclaim occurred.

* arc_memory_throttle() has been updated to always include the
  amount of evictable memory (arc and page cache) in its free
  space calculations.  This space is largely available in most
  call paths due to direct memory reclaim.

* The Solaris pageout code was also removed to avoid confusion.
  It has always been disabled due to proc_pageout being defined
  as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add zfs_mdcomp_disable module option

Expose the zfs_mdcomp_disable variable as a module option. This
can be used to disable compression of zfs meta data which is
enabled by default. This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #2067: uninitialized variables in zfs(1M) may make snapshots undestroyable

Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed by: Milan Jurik <milan.jurik@xylab.cz>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
https://www.illumos.org/issues/2067

Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #1946: incorrect formatting when listing output of multiple pools with zpool iostat -v

Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Reference to Illumos issue:
https://www.illumos.org/issues/1946

Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #952: separate intent logs should be obvious in 'zpool iostat' output

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererce to Illumos issue:
https://www.illumos.org/issues/952

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #607

Illumos #1909: disk sync write perf regression when slog is used post oi_148

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererces to Illumos issue:
https://www.illumos.org/issues/1909

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #680

Use KM_PUSHPAGE in l2arc_write_buffers

There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:

    crash> bt 4123
    PID: 4123   TASK: ffff88062f8c1500  CPU: 6   COMMAND: "l2arc_feed"
      0 [ffff88062511d610] schedule at ffffffff814eeee0
      1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
      2 [ffff88062511d748] mutex_lock at ffffffff814f041b
      3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
      4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
      5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
      6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
      7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
      8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
      9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
     10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
     11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
     12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
     13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
     14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
     15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
     16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
     17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
     18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
     19 [ffff88062511dee8] kthread at ffffffff81090a86
     20 [ffff88062511df48] kernel_thread at ffffffff8100c14a

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

ZFS list snapshot property alias

Add support for the `zfs list -t snap` alias which is available under
Oracle Solaris 11.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #640

ZFS snapshot alias

For consistency, and because it's handy, add the 'zfs snap' alias which
was introduced by Oracle Solaris 11. This includes an update to the
man page to reflect all the available alias (snap, umount, and recv).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #640

Illumos #1346: zfs incremental receive may leave behind temporary clones

1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
https://www.illumos.org/issues/1346
https://www.illumos.org/issues/1356

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #647

Illumos #1475: zfs spill block hold can access invalid spill blkptr

Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue:
https://www.illumos.org/issues/1475

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648

Illumos #1951: leaking a vdev when removing an l2cache device

1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References to Illumos issues:
  https://www.illumos.org/issues/1951
  https://www.illumos.org/issues/1952
  https://www.illumos.org/issues/1954

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #650

OS-926: zfs panic in zfs_fill_zplprops_impl()

This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling. This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #652

Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_tsd

vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

References to Illumos issue:
https://www.illumos.org/issues/1680

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #655

Improve error message consistency

Signed-off-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Document the zle compression algorithm

Signed-off-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Export additional dsl symbols

Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions. They allow
us to cleanly register a callback which is called when a
dataset property is modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fixed a NULL pointer dereference bug in zfs_preumount

When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.

This bug can be reproduced using this shell script:

#!/bin/sh
(
while true; do
zfs create -o mountpoint=legacz tank/bar
zfs destroy tank/bar
done
) &

(
while true; do
mount -t zfs tank/bar /mnt
umount /mnt
done
) &

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #639

Make Gentoo initscript use modinfo

The -l parameter to modprobe has been removed from the latest upstream
code and this change has entered Gentoo. Using modinfo as a substitute
addresses this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #636

Print human readable error message for ENOENT

A cryptic error code is printed when mounting a legacy dataset to a
non-existent mountpoint. This patch changes this behavior to print
"mount point '%s' does not exist", which is similar to the error
message printed when mounting procfs.

The single quotes were added to be consistent with the existing EBUSY
error message, which is the only difference between this error message
and the one that is printed when the same condition occurs when mounting
procfs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #633

Properly expose the mfu ghost list kstats

Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats. This was harmless but
confusing since memory usage could be over reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove hard-coded 80 column output

When stdout is detected to be a tty use the number of columns
specified by the terminal. If that fails fall back to a default
80 column width. In the non-tty case allow for 999 column lines.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

ZFS 0.6.0-rc8

Fix executable permissions

Caught by lint, this permission change was accidentally introduced
by commit 42cb3819f1a1f536105faac81ffc150f3da90a80. Restore the
correct permissions and while I'm at it add a missing whack-bang
to config/ltmain.sh.

lint: executable-not-elf-or-script: zpool_main.c zfs_main.c

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #620

Add --enable-debug-dmu-tx configure option

Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Enhance a dmu_tx_dirty_buf() assertion

The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.

SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ZFS_META_RELEASE to module load/unload messages

Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Account for .zfs ctldir inodes

Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used. This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().

In my option there's a kernel bug here too which allows this to
happen. They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #617

Add .zfs control directory

Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173

Add zio constructor/destructor

Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496

Revert "Add zio constructor/destructor"

This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issues logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix).  I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1e07b0265f0661ed7851d3aa8d3e75e7a9.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604

ZFS 0.6.0-rc7

Add missing NULL in zpl_xattr_handlers

The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated. This NULL was
accidentally missing which could result in a NULL dereference.

Interestingly this issue only manifested itself on certain 32-bit
systems. Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594

Use stderr for 'no pools/datasets available' error

The 'zfs list' and 'zpool list' commands output the message
'no datasets/pools available' to stdout. This should go to
stderr and only the available datasets/pools should go to
stdout. Returning nothing to stdout is expected behavior
when there is nothing to list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #581

Add sa_spill_rele() interface

Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle. This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold. Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add zio constructor/destructor

Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496

Fix distribution detection

Improve the distribution detection by moving the tests for
distribution specific files first. The Ubuntu and Debian
checks are left for last because they are the least likely
to be unique. This is particularly true in the case of Debian
since so many distributions are based on Debian.

Since this is currently only used to identify the correct
packaging method for this system the result in many instances
is simply cosmetic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Align parition end on 1 MiB boundary

Some devices have exhibited sensitivity to the ending alignment of
partitions.  In particular, even if the first partition begins at 1
MiB, we have seen many sd driver task abort errors with certain SSDs
if the first partition doesn't end on a 1 MiB boundary.  This occurs
when the vdev label is read during pool creation or importation and
causes a delay of about 30 seconds per device.  It can also be
simulated with dd when the pool isn't imported:

  dd if=/dev/sda1 of=/dev/null bs=262144 count=1

For the record, this problem was observed with SMARTMOD
SG9XCA2E200GE01 200GB SSDs.  Unfortunately I don't have a good
explanation for this behavior. It seems to have something to do with
highly fragmented single-sector requests being issued to the device,
which it may not support.  With end-aligned partitions at least
page-sized requests were queued and issued to the driver according
to blktrace. In any case, aligning the partition end is a fairly
innocuous work-around, wasting at most 1 MiB of space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #574

Use SA_HDL_PRIVATE for SA xattrs

A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit().  If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made.  This is done to
prevent data from leaking in to the syncing txg.  As a result
the original dirty spill block will remain cached.

Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache.  It's far safer and simpler just to use a private
handle for xattrs.  Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.

These forever dirty buffers can be noticed in the arcstats under
the anon_size.  On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase.  Eventually, if enough dirty data builds up
your system it will appear to hang.  This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.

As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513

Cleanly support debug packages

Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs

  # For example:
  $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm

Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers.  This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>