granicus.if.org Git

Illumos #3829

3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
https://www.illumos.org/issues/3829
illumos/illumos-gate@bb6e70758d0c30c09f148026d6e686e21cfc8d18

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3818

3818 zpool status -x should report pools with removed l2arc devices
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
https://www.illumos.org/issues/3818
illumos/illumos-gate@7f2416ef64fb43dab18d9b36c0da64bea37c0df3

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3740

3740 Poor ZFS send / receive performance due to snapshot
     hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3740
  illumos/illumos-gate@a7a845e4bf22fd1b2a284729ccd95c7370a0438c

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. 13fe019870c8779bf2f5b3ff731b512cf89133ef introduced a merge conflict
   in dsl_dataset_user_release_tmp where some variables were moved
   outside of the preprocessor directive.

2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
   conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
   because this commit refactors the code, adding a new KM_SLEEP
   allocation. It is not clear to me whether this should be converted
   to KM_PUSHPAGE.

3. We had a merge conflict in libzfs_sendrecv.c because of copyright
   notices.

4. Several small C99 compatibility fixed were made.

Illumos #3745, #3811

3745 zpool create should treat -O mountpoint and -m the same
3811 zpool create -o altroot=/xyz -O mountpoint=/mnt ignores
     the mountpoint option
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3745
  https://www.illumos.org/issues/3811
  illumos/illumos-gate@8b713775314bbbf24edd503b4869342d8711ce95

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3744

3744 zfs shouldn't ignore errors unmounting snapshots
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3744
  illumos/illumos-gate@fc7a6e3fefc649cb65c8e2a35d194781445008b0

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. There is no clear way to distinguish between a failure when we
   tried to unmount the snapdir of a zvol (which does not exist)
   and the failure when we try to unmount a snapdir of a dataset,
   so the changes to zfs_unmount_snap() were dropped in favor of
   an altered Linux function that unconditionally returns 0.

Illumos #3743

3743 zfs needs a refcount audit
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
https://www.illumos.org/issues/3743
illumos/illumos-gate@b287be1ba86043996f49b1cc34c80cc620f9b841

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3742

3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3742
  illumos/illumos-gate@f7170741490edba9d1d9c697c177c887172bc741

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The change to zfs_vfsops.c was dropped because it involves
   zfs_mount_label_policy, which does not exist in the Linux port.

Illumos #3741

3741 zfs needs better comments
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
https://www.illumos.org/issues/3741
illumos/illumos-gate@3e30c24aeefdee1631958ecf17f18da671781956

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3699, #3739

3699 zfs hold or release of a non-existent snapshot does not output error
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Shrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3699
  https://www.illumos.org/issues/3739
  illumos/illumos-gate@013023d4ed2f6d0cf75380ec686a4aac392b4e43

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3582, #3584

3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
https://www.illumos.org/issues/3582
illumos/illumos-gate@0689f76

Ported by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

6977619 NULL pointer deference in sa_handle_get_from_db()

References:
illumos/illumos-gate@44bffe012cad6481c82ad67bacd6b40bd29def2b

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

6939941 problem with moving files in zfs

References:
illumos/illumos-gate@d39ee142a97a7c58f60f7b52c62409f2ff64b234

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. This commit was so old that only two lines applied to the modern
code base.

Illumos #3642, #3643

3642 dsl_scan_active() should not issue I/O to determine if async
     destroying is active
3643 txg_delay should not hold the tc_lock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/3642
  https://www.illumos.org/issues/3643
  illumos/illumos-gate@4a92375985c37d61406d66cd2b10ee642eb1f5e7

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting Notes:

1. The alignment assumptions for the tx_cpu structure assume that
   a kmutex_t is 8 bytes.  This isn't true under Linux but tc_pad[]
   was adjusted anyway for consistency since this structure was
   never carefully aligned in ZoL.  If careful alignment does impact
   performance significantly this should be reworked to be portable.

Illumos #3645, #3692

3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3645
  https://www.illumos.org/issues/3692
  illumos/illumos-gate@de8d9cff565e928d0ace86f3ea0e2b15094d61df

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1792
Issue #1775

Illumos #3598

3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3598
  illumos/illumos-gate@be6fd75a69ae679453d9cda5bff3326111e6d1ca

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. include/sys/zfs_context.h has been modified to render some new
   macros inert until dtrace is available on Linux.

2. Linux-specific changes have been adapted to use SET_ERROR().

3. I'm NOT happy about this change.  It does nothing but ugly
   up the code under Linux.  Unfortunately we need to take it to
   avoid more merge conflicts in the future.  -Brian

Illumos #3517

3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
https://www.illumos.org/issues/3517
illumos/illumos-gate@efb4a871d8fd510a833bdca610528dde5ed69e42

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Illumos #3603, #3604: bobj improvements

3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely
3871 GCC 4.5.3 does not like issue 3604 patch
Reviewed by: Henrik Mattson <henrik.mattson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3603
  https://www.illumos.org/issues/3604
  https://www.illumos.org/issues/3871
  illumos/illumos-gate@d04756377ddd1cf28ebcf652541094e17b03c889

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Note that the patch from Illumos issue 3871 is not accepted into Illumos
at the time of this writing. It is something that I wrote when porting
this. Documentation is in the Illumos issue.

Illumos #3588

3588 provide zfs properties for logical (uncompressed) space
     used and referenced
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3588
  illumos/illumos-gate@77372cb0f35e8d3615ca2e16044f033397e88e21

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #3578, #3579

3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3578
  https://www.illumos.org/issues/3579
  illumos/illumos-gate@9eb57f7f3fbb970d4b9b89dcd5ecf543fe2414d5

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #3561, #3116

3561 arc_meta_limit should be exposed via kstats
3116 zpool reguid may log negative guids to internal SPA history
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3561
  https://www.illumos.org/issues/3116
  illumos/illumos-gate@20128a0826f9c53167caa9215c12f08beee48e30

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:

1. The spa change was accidentally included in the libzfs_core merge.

2. "Add missing arcstats" (1834f2d8b715d25bafbb0e4a099994f45c3211ae)
   already implemented these kstats a few years ago.

Illumos #3537

3537 want pool io kstats

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  http://www.illumos.org/issues/3537
  illumos/illumos-gate@c3a6601

Ported by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:

1. The patch was restructured to take advantage of the existing
   spa statistics infrastructure.  To accomplish this the kstat
   was moved in to spa->io_stats and the init/destroy code moved
   to spa_stats.c.

2. The I/O kstat was simply named <pool> which conflicted with the
   pool directory we had already created.  Therefore it was renamed
   to <pool>/io

3. An update handler was added to allow the kstat to be zeroed.

Illumos #3522

3522 zfs module should not allow uninitialized variables
Reviewed by: Sebastien Roy <seb@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3522
  illumos/illumos-gate@d5285cae913f4e01ffa0e6693a6d8ef1fbea30ba

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:

1. ZFSOnLinux had already addressed many of these issues because of
   its use of -Wall. However, the manner in which they were addressed
   differed. The illumos fixes replace the ones previously made in
   ZFSOnLinux to reduce code differences.

2. Part of the upstream patch made a small change to arc.c that might
   address zfsonlinux/zfs#1334.

3. The initialization of aclsize in zfs_log_create() differs because
   vsecp is a NULL pointer on ZFSOnLinux.

4. The changes to zfs_register_callbacks() were dropped because it
   has diverged and needs to be resynced.

Add cstyle.pl utility and cstyle.1 man page

Cstyle is the C source style checker used by Illumos.  Since the
original ZFS source was written using these style guidelines they
must also be followed by ZoL for consistency.

The checker has been added to the scripts directory and may be
run on a per file basis.  New patches should be careful to avoid
introducing new style warnings.

Additionally, the 'checkstyle' target has been added to the top
level Makefile and can be used to check the entire source tree.
While Zol has historically attempted to follow the SunOS style
guide the lack of a rigorous style checker has allowed various
warning to be introduced.  Currently there are 2211 reported
style violations and we want to gradually eliminate these from
the tree.

Note the cstyle.1 man page is provided under man/man1/cstyle.1
but since it is a developer utility it is not installed along
with the other man pages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add missing code to zfs_debug.{c,h}

This is required to make Illumos 3962 merge.

Signed-off-by: Richard Yao <ryao@gentoo.org>

Add missing copyright notices from Illumos

This resolves merge conflicts when merging Illumos #3588 and Illumos #4047.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Fix incorrect usage of strdup() in zfs_unmount_snap()

Modifying the length of a string returned by strdup() is incorrect
because strfree() is allowed to use strlen() to determine which slab
cache was used to do the allocation.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Fix order of function calls in zio_free_sync()

The resolution of a merge conflict when merging Illumos #3464 caused us
to invert the order couple of function calls in zio_free_sync() versus
what they are in Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Reintroduce uio_prefaultpages()

This was accidentally removed by overzealous commenting.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Posix ACL Support

This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems.  Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set.  The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.

By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property.  Set the property to
'posixacl' to enable them.  If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.

This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).

  http://www.tuxera.com/community/posix-test-suite/

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #170

Improve xattr property documentation

Extend the xattr property section of zfs(8) such that it covers
both styles of supported xattr. A short discussion of the benefits
and drawbacks of each type is presented to allow users to make an
informed choice.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170

Prevent xattr remove from creating xattr directory

Attempting to remove an xattr from a file which does not contain
any directory based xattrs would result in the xattr directory
being created. This behavior is non-optimal because it results
in write operations to the pool in addition to the expected error
being returned.

To prevent this the CREATE_XATTR_DIR flag is only passed in
zpl_xattr_set_dir() when setting a non-NULL xattr value. In
addition, zpl_xattr_set() is updated similarly such that it will
return immediately if passed an xattr name which doesn't exist
and a NULL value.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170

Add script to fix file names in upstream patches

Added a simple sed script to do a search and replace on the Illumos
ZFS file names and replace them with the ZFS on Linux equivalent.

Example usage:

    # Replace Illumos paths with Linux paths
    $ ./scripts/zfs2zol-patch.sed arc.c.patch > arc.c.patch.linux

    # Ensure the script worked as expected
    $ diff arc.c.patch arc.c.patch.linux

    # Apply the patch using Linux paths
    $ patch -p1 < arc.c.patch.linux

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1679

Restructure zfs_readdir() to fix regressions

This does the following:

1. It creates a uint8_t type value, which is initialized to DT_DIR on
dot directories and ZFS_DIRENT_TYPE(zap.za_first_integer) otherwise.
This resolves a regression where we return unintialized values as the
directory entry type on dot directories. This was accidentally
introduced by commit 8170d281263e52ff33d7fba93ab625196844df36.

2. It restructures zfs_readdir() code to use `uint64_t offset` like
Illumos instead of `loff_t *pos`. This resolves a regression where
negative ZAP cursors were treated as if they were dot directories.

3. It restructures the function to more closely match the structure of
zfs_readdir() on Illumos and removes the unused variable outcount, which
was only used on Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1750

Add -p switch to "zpool get"

This works the same as the -p switch to "zfs get", displaying full
resolution values for appropriate attributes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1813

Introduce zpool_get_prop_literal interface

This change introduces zpool_get_prop_literal. It's an expanded version
of zpool_get_prop taking one additional boolean parameter. With this
parameter set to B_FALSE it will behave identically to zpool_get_prop.
Setting it to B_TRUE will return full precision numbers for the
following properties:

ZPOOL_PROP_SIZE
ZPOOL_PROP_ALLOCATED
ZPOOL_PROP_FREE
ZPOOL_PROP_FREEING
ZPOOL_PROP_EXPANDSZ
ZPOOL_PROP_ASHIFT

Also introduced is a wrapper function for zpool_get_prop making it
use zpool_get_prop_literal in the background.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1813

Corrected "zfs list -t <type>" syntax

in man page and in command help.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1805

Merge branch 'kstat'

This branch updates several of the zfs kstats to take advantage
of the improved raw kstat functionality.  In addition, two new
kstats and a script called dbufstat.py are introduced.

Updated+New Kstats
* dbufs        - Stats for all dbufs in the dbuf_hash
* <pool>/txgs  - Stats for the last N txgs synced to disk
* <pool>/reads - Stats for rhe last N reads issues by the ARC
* <pool>/dmu_tx_assign - Histogram of tx assign times

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add dbufstat.py command

The dbufstat.py command was added to provide a conveniant way to
easily determine what ZFS is caching.  The script consumes the
raw /proc/spl/kstat/zfs/dbufs kstat data can consolidates it in
to a more human readable form.  This was designed primarily as
a tool to aid developers but it may also be useful for advanced
users who want more visibility in to what the ARC is caching.

When run without options dbufstat.py will default to showing a
list of all objects with at least one buffer present in the
cache.  The total cache space consumed by that object will be
printed on the right along with the object type.  Similar to the
arcstats.py command the -x option may used to display additional
fields.

Two other modes of operation are also supported by dbufstat.py
and the expectation is additional display modes may be added as
needed.  The -t option will summerize the total number of bytes
cached for each object type, and the -b option will show every
dbuf currently cached.

The script was designed to be consistent with arcstat.py and
includes most of the same options and funcationality.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add visibility in to cached dbufs

Currently there is no mechanism to inspect which dbufs are being
cached by the system.  There are some coarse counters in arcstats
by they only give a rough idea of what's being cached.  This patch
aims to improve the current situation by adding a new dbufs kstat.

When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash.  For each dbuf it will dump detailed information
about the buffer.  It will also dump additional information about
the referenced arc buffer and its related dnode.  This provides a
more complete view in to exactly what is being cached.

With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working.  For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming.  Or a
similar list could be generated based on dnode type.  Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>

Add visibility in to dmu_tx_assign times

This change adds a new kstat to gain some visibility into the
amount of time spent in each call to dmu_tx_assign. A histogram
is exported via the new dmu_tx_assign file. The information
contained in this histogram is the frequency dmu_tx_assign
took to complete given an interval range.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add visibility in to txg sync behavior

This change is an attempt to add visibility in to how txgs are being
formed on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to txg. These entries are then exported through the kstat
interface, which can then be interpreted in userspace.

For each txg, the following information is exported:

* Unique txg number (uint64_t)
* The time the txd was born (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted)
* The number of reserved bytes for the txg (uint64_t)
* The number of bytes read during the txg (uint64_t)
* The number of bytes written during the txg (uint64_t)
* The number of read operations during the txg (uint64_t)
* The number of write operations during the txg (uint64_t)
* The time the txg was closed (hrtime_t)
* The time the txg was quiesced (hrtime_t)
* The time the txg was synced (hrtime_t)

Note that while the raw kstat now stores relative hrtimes for the
open, quiesce, and sync times. Those relative times are used to
calculate how long each state took and these deltas and printed by
output handlers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add visibility in to arc_read

This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.

For each arc_read call, the following information is exported:

* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])

From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.

There is still some work to be done, but this should serve as a good
starting point.

Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Revert "Add txgs-<pool> kstat file"

This reverts commit e95853a331529a6cb96fdf10476c53441e59f4e1.

Revert "Add new kstat for monitoring time in dmu_tx_assign"

This reverts commit 92334b14ec378b1693573b52c09816bbade9cf3e.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Increase default udev wait time

When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/.  However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer.  If you also throw in some cheap dodgy hardware
it may take even longer.

To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds.  This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1646

Linux 3.11 compat: Rename LZ4 symbols

Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict
whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the
kernel's .config. We rename the symbols to avoid the conflict.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1789

Dedup-related documentation additions for zpool and zdb.

Document the "-D" and "-T" options and the optional interval
and count or "zpool status".

Also for zpool's man page, use a consistent order for the
various "-T" options to match the program's help output.

Document the effect of additional "-D" options for zdb.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1786

Add missing dsl pool configuration lock

The semantics introduced by the restructured sync task of illumos
3464 require this lock when calling dmu_snapshot_list_next().
The pool is locked/unlocked for each iteration to reduce the
chance of long-running locks.

This was accidentally missed when doing the original port because
ZoL's control directory code is Linux-specific and is in a
different file than in illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1785

Illumos #3552

3552 condensing one space map burns 3 seconds of CPU in spa_sync()
     thread (fix race condition)

References:
  https://www.illumos.org/issues/3552
  illumos/illumos-gate@03f8c366886542ed249a15d755ae78ea4e775d9d

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@e51be06697762215dc3b679f8668987034a5a048, which
ported the Illumos 3552 changes. This fix was added to upstream
rather quickly, but at the time of the port, no one spotted it and
the race was rare enough that it passed our regression tests. I
discovered this when comparing our metaslab.c to the illumos
metaslab.c.

Without this change it is possible for metaslab_group_alloc() to
consume a large amount of cpu time.  Since this occurs under a
mutex in a rcu critical section the kernel will log this to the
console as a self-detected cpu stall as follows:

  INFO: rcu_sched self-detected stall on CPU { 0}
  (t=60000 jiffies g=11431890 c=11431889 q=18271)

Closes #1687
Closes #1720
Closes #1731
Closes #1747

Fix libzfs_core changes to follow GNU libtool guidelines

The GNU libtool documentation states to start with a version of 0:0:0,
rather than 1:1:0. Illumos uses the name libzfs_core.so.1, so to be
consistent, we should go with 1:0:0.

http://www.gnu.org/software/libtool/manual/libtool.html#Updating-version-info

The GNU libtool documentation also provides guidence on how the version
information should be incremented. Doing this does a SONAME bump of the
libzfs and libzpool libraries. This is particularly important on Gentoo
because a SONAME bump enables portage to retain the older libraries
until any packages that link to them are rebuilt. The main example of
this is GRUB2's grub2-mkconfig, which will break unless it is rebuilt
against the new libraries.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751

Generate libraries with correct DT_NEEDED entries

Libraries that depend on other libraries should list them in ELF's
DT_NEEDED field so that programs linking to them do not need to specify
those libraries unless they depend on them as well. This is not the case
in the current code and the consequence is that anything that needs a
library must know its dependencies. This is fragile and caused GRUB2's
configure script to break when a dependency was added on libblkid in
libzfs.

This resolves that problem by using LIBADD/LDADD to specify libraries in
Makefile.am instead of LDFLAGS. This ensures that proper DT_NEEDED
entries are generated and prevents GRUB2's configure script from
breaking in the presence of a libblkid dependency. This also removes
unneeded dependencies from various files.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751

Fix libblkid support

libblkid support is dormant because the autotools check is broken and
liblkid identifies ZFS vdevs as "zfs_member", not "zfs". We fix that
with a few changes:

First, we fix the libblkid autotools check to do a few things:

1. Make a 64MB file, which is the minimum size ZFS permits.
2. Make 4 fake uberblock entries to make libblkid's check succeed.
3. Return 0 upon success to make autotools use the success case.
4. Include stdlib.h to avoid implicit declration of free().
5. Check for "zfs_member", not "zfs"
6. Make --with-blkid disable autotools check (avoids Gentoo sandbox violation)
7. Pass '-lblkid' correctly using LIBS not LDFLAGS.

Second, we change the libblkid support to scan for "zfs_member", not
"zfs".

This makes --with-blkid work on Gentoo.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751

Update detach section of zpool(8)

The detach section of the zpool(8) man page now suggests the
offline command. Using offline may be more appropriate for
certain situations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1776

Export symbols dsl_pool_config_{enter,exit}

These are needed by consumers (i.e. Lustre) who wish to use the
dsl_prop_register() interface to register callbacks when pool
properties of interest change. This interface requires that the
DSL pool configuration lock is held when called.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1762

Fix memory leak false positive in log_internal()

When building the spl with --enable-debug-kmem-tracking a memory
leak is detected in log_internal().  This happens to be a false
positive because the memory was freed using strfree() instead of
kmem_free().  All kmem_alloc()'s must be released with kmem_free()
to ensure correct accounting.

  SPL: kmem leaked 135/5641311 bytes
  address          size  data             func:line
  ffff8800cba7cd80 135   ZZZZZZZZZZZZZZZZ log_internal:456

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Update drive database

Add Corsair Force GS drive (obtained from drive_id)
Add Kingston HyperX 3K (obtained from drive_id)
Add OCZ Vertex 4 drive (obtained from drive_id)
Add Samsung SM843T enterprise drive (obtained from drive_id)
Add entries for additional sizes of Intel 320/330/335/520 series
Add Cruical C400 (obtained from Illumos user's sd.conf)
Add Toshiba SSD (obtained from Illumos user's sd.conf)
Add Samsung's first SLC SSD (obtained from drive_id)
Add OCZ Core Series (obtained from drive_id)
Add Intel DC S3700 (obtained from drive_id)

Notes:

1. The drive identifer obtained for the Samsung SM843T was MZ7WD480. The
rest were extrapolated. The additional entries were checked with Google
to verify that such drives exist in the wild.

2. The additional entries for Intel drives were extrapolated from
existing entries. The additional entries were checked with Google to
verify that such drives exist in the wild.

3. The "ATA C400-MTFDDAC512M" and "ATA TOSHIBA THNSNH51" entries
are from the sd.conf of gcbirzan on freenode. Additional entries were
extrapolated from them and checked with Google.

4. I obtained the Samsung MCCOE64G entry from an actual drive. The
Samsung MCCOE32G entry was extrapolated from it and checked with
Google.

5. I obtained the SSDSC2BA10 from a 100GB Intel DC S3700 drive and
extrapolated the entries for the additional models.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1752

Export addition dsl_prop_* symbols

The recent sync task restructuring in 13fe019 introduced several
new symbols which should be exported for use by consumers such
as Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Allocate the ioctl "output" nvlist with KM_PUSHPAGE.

Some ZFS errors such as certain snapshot failures can occur in
the sync task context. Because they may require additional memory
allocations, the initial nvlist must be allocated with KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737

Fix several new KM_SLEEP warnings

A handful of allocations now occur in the sync path and need
to use KM_PUSHPAGE. These were introduced by commit 13fe019.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737

Fix spa_deadman() TQ_SLEEP warning

The spa_deadman() and spa_sync() functions can both be run in the
spa_sync context and therefore should use TQ_PUSHPAGE instead of
TQ_SLEEP.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1734
Closes #1749

Removing unneeded mutex for reading vq_pending_tree size

Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded:

* no data is modified
* only vq_pending_tree is read
* in case garbage is returned (eg. vq_pending_tree being updated
while the read is made) the worst case would be that a single
read could be queued on a mirror side which more busy than thought

The benefit of this change is streamlining of the code path since
it is taken for *every* mirror member on *every* read.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1739

Reduce the stack usage of dsl_dataset_remove_clones_key

dataset_remove_clones_key does recursion, so if the recursion goes
deep it can overrun the linux kernel stack size of 8KB. I have seen
this happen in the actual deployment, and subsequently confirmed it by
running a test workload on a custom-built kernel that uses 32KB stack.

See the following stack trace as an example of the case where it would
have run over the 8KB stack kernel:

        Depth    Size   Location    (42 entries)
        -----    ----   --------
  0)    11192      72   __kmalloc+0x2e/0x240
  1)    11120     144   kmem_alloc_debug+0x20e/0x500
  2)    10976      72   dbuf_hold_impl+0x4a/0xa0
  3)    10904     120   dbuf_prefetch+0xd3/0x280
  4)    10784      80   dmu_zfetch_dofetch.isra.5+0x10f/0x180
  5)    10704     240   dmu_zfetch+0x5f7/0x10e0
  6)    10464     168   dbuf_read+0x71e/0x8f0
  7)    10296     104   dnode_hold_impl+0x1ee/0x620
  8)    10192      16   dnode_hold+0x19/0x20
  9)    10176      88   dmu_buf_hold+0x42/0x1b0
10)    10088     144   zap_lockdir+0x48/0x730
11)     9944     128   zap_cursor_retrieve+0x1c4/0x2f0
12)     9816     392   dsl_dataset_remove_clones_key.isra.14+0xab/0x190
13)     9424     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
14)     9032     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
15)     8640     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
16)     8248     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
17)     7856     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
18)     7464     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
19)     7072     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
20)     6680     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
21)     6288     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
22)     5896     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
23)     5504     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
24)     5112     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
25)     4720     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
26)     4328     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
27)     3936     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
28)     3544     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
29)     3152     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
30)     2760     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
31)     2368     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
32)     1976     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
33)     1584     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
34)     1192     232   dsl_dataset_destroy_sync+0x311/0xf60
35)      960      72   dsl_sync_task_group_sync+0x12f/0x230
36)      888     168   dsl_pool_sync+0x48b/0x5c0
37)      720     184   spa_sync+0x417/0xb00
38)      536     184   txg_sync_thread+0x325/0x5b0
39)      352      48   thread_generic_wrapper+0x7a/0x90
40)      304     128   kthread+0xc0/0xd0
41)      176     176   ret_from_fork+0x7c/0xb0

This change reduces the stack usage in dsl_dataset_remove_clones_key
by allocating structures in heap, not in stack.  This is not a fundamental
fix, as one can create an arbitrary large data set that runs over any
fixed size stack, but this will make the problem far less likely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Closes #1726

Fix zpl_mknod() return values

The zpl_mknod() function was incorrectly negating its return value.
This doesn't cause any problems in the success case, but it does
prevent us from returning the correct error code for a failure.
The implementation of this function is now consistent with all
the other zpl_* functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1717

Fix uninitialized variables

When compiling on an ARM device using gcc 4.7.3 several variables
in the zfs_obj_to_path_impl() function were flagged as uninitialized.
To resolve the warnings explicitly initialize them to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1716

Stop runtime pointer modifications in autotools checks

c38367c73f592ca9729ba0d5e70b5e3bc67e0745 was meant to eliminate runtime
function pointer modifications in autotools checks because they were
prone to false negatives on kernels hardened by the PaX project.
Unfortunately, I missed the xattr_handler and super_block->s_bdi
autotools checks. Recent changes to PaX constified
xattr_handler->get/set, which lead me to discover this oversight.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1433

Fix dmu_objset_find_dp() KM_SLEEP warning

After the restructuring in 13fe019 The 'zfs rename' command will
result in a KM_SLEEP being called in the sync context. This may
deadlock due to reclaim so it was changed to KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1711

Illumos #3464

3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
https://www.illumos.org/issues/3464
illumos/illumos-gate@3b2aab18808792cbd248a12f1edf139b89833c13

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1495

Illumos #2882, #2883, #2900

2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2882
  https://www.illumos.org/issues/2883
  https://www.illumos.org/issues/2900
  illumos/illumos-gate@4445fffbbb1ea25fd0e9ea68b9380dd7a6709025

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1293

Porting notes:

WARNING: This patch changes the user/kernel ABI.  That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules.  Ensure you load the matching kernel
modules from master after updating the utilities.  Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:

  $ zpool list
  failed to read pool configuration: bad address
  no pools available

  $ zfs list
  no datasets available

Add zvol minor device creation to the new zfs_snapshot_nvl function.

Remove the logging of the "release" operation in
dsl_dataset_user_release_sync().  The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos.  This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

Squash some "may be used uninitialized" warning/erorrs.

Fix some printf format warnings for %lld and %llu.

Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.

Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.

Tag zfs-0.6.2

META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Use directory xattrs for symlinks

There is currently a subtle bug in the SA implementation which
can crop up which prevents us from safely using multiple variable
length SAs in one object.

Fortunately, the only existing use case for this are symlinks with
SA based xattrs. Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1468

Revert "Evict meta data from ghost lists + l2arc headers"

This reverts commit fadd0c4da1e2ccd6014800d8b1a0fd117dd323e8 which
introduced a regression in honoring the meta limit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Close #1660

Implement database to workaround misreported physical sector sizes

This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.

This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:

```
I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.
```

The database was created with the help of the freenode and ZFSOnLinux
communities.

Some notes:

1. The following drives both were confirmed to lie via reports in IRC
and they contain capacity information in their identifiers:

INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M

The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.

2. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
solely in page size (and slightly in capacity). One uses 4096-byte pages
and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
We could detect the page size by checking the capacity, but that would
unnecessarily complicate the code.

3. It is possible for updated drive firmware to correctly report the
sector size. There were reports of a few advanced format drives doing
that. One report stated that the vendor changed the identification
string while another was unclear on this. Both reports involved WDC
models.

4. Google was used to determine the size of pages in the listed flash
devices. Reports of 8192-byte pages took precedence over reports of
4096-byte pages.

5. Devices behind USB adapters can have their identification strings
altered. Identification strings obtained across USB adapters are
omitted and no attempt is made to correct for alterations made by USB
adapters when doing comparisons against the database. Two entries in the
Open Solaris database that appear to have been altered by a USB
adapter were omitted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1652

Linux 3.11 compat: fops->iterate()

Commit torvalds/linux@2233f31aade393641f0eaed43a71110e629bb900
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.

To handle this the code was reworked to use the new ->iterate
interface.  Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.

Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions.  Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.

Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function.  The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names.  Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1653
Issue #1591

Fix z_wr_iss_h zio_execute() import hang

Because we need to be more frugal about our stack usage under
Linux.  The __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy.  Those two contexts are the sync
thread and what ever thread is performing spa initialization.

Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed.  If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will mis-
interpret the context and re-dispatch it again.  The system
will get stuck spinning re-dispatching the zio and making no
forward progress.

To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.

In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1455

Don't specifically open /etc/mtab - it is done in libzfs_init()
a few lines further down and we can share the open file handle.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498

No point in rewind() mtab in zfs_unshare_proto(). We're not really
reading the file, but instead use libzfs_mnttab_find() which does
the nessesary freopen() for us.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498

Use setmntent() OR fopen()

For the same reasons it's used in libzfs_init(), this was just
overlooked because zinject gets minimal use.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498

Fix for re-reading /etc/mtab in zfs_is_mounted()

When /etc/mtab is updated on Linux it's done atomically with
rename(2).  A new mtab is written, the existing mtab is unlinked,
and the new mtab is renamed to /etc/mtab.  This means that we
must close the old file and open the new file to get the updated
contents.  Using rewind(3) will just move the file pointer back
to the start of the file, freopen(3) will close and open the file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1611

Illumos #3098 zfs userspace/groupspace fail

3098 zfs userspace/groupspace fail without saying why when run as non-root
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
https://www.illumos.org/issues/3098
illumos/illumos-gate@70f56fa69343b013f47e010537cff8ef3a7a40a5

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1596

Illumos #3618 ::zio dcmd does not show timestamp data

3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  http://www.illumos.org/issues/3618
  illumos/illumos-gate@c55e05cb35da47582b7afd38734d2f0d9c6deb40

Notes on porting to ZFS on Linux:

The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.

One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:

    zfs_vdev_time_shift = 6
        to
    zfs_vdev_time_shift = 29

The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.

(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)

Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1643

Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS

When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1589

Evict meta data from ghost lists + l2arc headers

When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists. Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.

To handle this case arc_adjust_meta() has been entended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.

If this is insufficient we request that the VFS release
some inodes and dentries. This will result in the release
of some dnodes which are counted as 'other' metadata.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Allow arc_evict_ghost() to only evict meta data

The default behavior of arc_evict_ghost() is to start by evicting
data buffers.  Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.

This is ideal for honoring arc_c since it's preferable to keep the
meta data cached.  However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.

To avoid this issue the arc_evict_ghost() is now passed a fourth
argumented describing which buffer type to start with.  The
arc_evict() function already behaves exactly like this for a
same reason so this is consistent with the existing code.

All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change.  New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #3964 L2ARC should always compress metadata buffers

3964 L2ARC should always compress metadata buffers
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>

References:
https://www.illumos.org/issues/3964

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1379

Illumos #3137 L2ARC compression

3137 L2ARC compression
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@aad02571bc59671aa3103bb070ae365f531b0b62
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H <james@kagisoft.co.uk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1379

Return -1 from arc_shrinker_func()

This is analogous to SPL commit zfsonlinux/spl@b9b3715. While
we don't have clear evidence of systems getting caught here
indefinately like in the SPL this ensures that it will never
happen.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1579

Return correct type and offset from zfs_readdir

zfs_readdir() is used by getdents(), which provides a list of all files
in directory, their types and an offset that be used by llseek() to seek
to the next directory entry.

On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop. The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.

Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X
do. ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.

Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1624

Illumos #3639 zpool.cache should skip over readonly pools

3639 zpool.cache should skip over readonly pools
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
illumos/illumos-gate@fb02ae025247e3b662600e5a9c1b4c33ecab7d72
https://www.illumos.org/issues/3639

Normally we don't list pools that are imported read-only in the cache
file, however you can accidentally get one into the cache file by
importing and exporting a read-write pool while a read-only pool is
imported:

$ zpool import -o readonly test1
$ zpool import test2
$ zpool export test2
$ zdb -C

This is a problem because if the machine reboots we import all pools in
the cache file as read-write.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Write dirty inodes on close

When the property atime=on is set operations which only access
and inode do cause an atime update.  However, it turns out that
dirty inodes with updated atimes are only written to disk when
the inodes get evicted from the cache.  Somewhat surprisingly
the source suggests that this isn't a ZoL specific issue.

This behavior may in part explain why zfs's reclaim logic has
been observed to be slow.  When reclaiming inodes its likely
that they have a dirty atime which will force a write to disk.

Obviously we don't want to force a write to disk for every
atime update, these needs to be batched.  The right way to
do this is to fully implement the .dirty_inode and .write_inode
callbacks.  However, to do that right requires proper unification
of some fields in the znode/inode.  Then we could just mark the
inode dirty and leave it to the VFS to call .write_inode
periodically.

Until that work gets done we have to settle for some middle
ground.  The simplest and safest thing we can do for now is
to write the dirty inode on last close.  This should prevent
the majority of inodes in the cache from having dirty atimes
and not drastically increase the number of writes.

Some rudimentally testing to show how long it takes to drop
500,000 inodes from the cache shows promising results.  This
is as expected because we're no longer do lots of IO as part
of the eviction, it was done earlier during the close.

w/out patch: ~30s to drop 500,000 inodes with drop_caches.
with patch:  ~3s to drop 500,000 inodes with drop_caches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix man page for the sync property

The help output of for zfs set/get says that sync can be one of

standard | always | disabled

but the man pages claim it can be

sync=default | always | disabled

the accepted value is standard, this changes the manpage to give the
correct values.

Signed-off-by: Steven Burgess <sburgess@dattobackup.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1634

Fix the default checksum algorithm in the manpage

The manpage reports fletcher2, but in zio.h ZIO_CHECKSUM_ON_VALUE
is defined to ZIO_CHECKSUM_FLETCHER_4.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1628

Add kmod repo integration

When the kmod packaging infrastructure was originally added the
dependency on the rpmfusion yum repositories was disabled.  This
was done at the time in favour of getting local builds working.

Now the time has come to conditionally re-enable that functionality
so we can properly provide binary kmod packages.

  ./configure --with-config=srpm
  make SRPM_DEFINE_KMOD='--define="repo rpmfusion"' srpm-kmod
  mock rebuild zfs-kmod-x.y.z-r.el6.src.rpm

One nice benefit of finishing this work is that the generic and
fedora spl-kmod spec files can be merged again.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Export additional dmu symbols

The dmu_prefetch, dmu_free_long_range, dmu_free_object,
dmu_prealloc, dmu_write_policy, and dmu_sync symbols have
been exported so they may be used by other modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

dmu_tx: Fix possible NULL pointer dereference

dmu_tx_hold_object_impl can return NULL on error. Check for this
condition prior to dereferencing pointer. This can only occur if
the passed object was invalid or unallocated.

Signed-off-by: Nathaniel Clark <Nathaniel.Clark@misrule.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1610

Remove b_thawed from arc_buf_hdr_t

The code involving b_thawed appears to be dead, so lets discard it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614

Remove arc_data_buf_alloc()/arc_data_buf_free()

These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614

Remove zio_alloc_arena

We declare zio_alloc_arena using extern, but it does not appear to exist
anywhere in the code. This permits undefined behavior, so lets remove
it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614

Make arc+l2arc module options writable

The l2arc module options can be made safely writable. This allows
the options to be changed without unloading/loading the modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Change l2arc_norw default to zero

These days modern SSDs can efficiently service concurrent reads
and writes. When this flag was added that wasn't really the
case for a variety of SSD controllers. But now we can set the
default value to take advantage of this parallelism and only
disable this as needed for specific troublesome hardware.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix inaccurate arcstat_l2_hdr_size calculations

Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc, under this circumstance both arc_buf_hdr_t and
l2arc_buf_hdr_t will be allocated. However the current logic only
cares for memory that l2arc_buf_hdr takes up when the buffer's
state transfers from or to arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calcuation in the
following simple way:

1. When allocate a l2arc_buf_hdr_t we add its memory consumption
   instantly and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
   the buffer only stays in l2arc we should also add the memory
   its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
   arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE
   in step 1 and the same for transfering arc bufs from l2arc only
   state.

The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory
and tests were set upped in the following way:

1. Fdisked a SATA disk into two partitions, one partition for zpool
   storage and the other one was used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
   prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
   in /proc/spl/kstat/zfs/arcstats. After the reading ended the
   l2_hdr_size and l2_size were shown like this:

      l2_size             4       4403780608
      l2_hdr_size         4       0

   which was weird.

4. After applying this patch and reran step 1-3, the results were
   as following:

      l2_size             4       4306443264
      l2_hdr_size         4       535600

   these numbers made sense, on 64-bit systems the
   sizeof(l2arc_buf_hdr_t) is 16 bytes.  Assue all blocks cached by
   l2arc are 128KB, so 535600/16*128*1024=4387635200, since not all
   blocks are equal-sized, the theoretical result will be a little
   bigger, as we can see.

Since I'm familiar with systemtap instrumentation tool I used it to
examine what had happened. The script looked like this:

probe module("zfs").function("arc_chage_state")
{
if ($new_state == $arc_l2_only)
printf("change arc buf to arc_l2_only\n")
}

It will print out some information each time we call funciton
arc_chage_state if the argument new_state is arc_l2_only.  I
gathered the trace logs and found that none of the arc bufs ran
into arc state arc_l2_only when the tests was running, this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2_only when the pool or the filesystem was offlined.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>