granicus.if.org Git

Assert alignment in umem_alloc_aligned

Valgrind suggests that the address we are returning is not properly
aligned, so lets add an assertion.

==87740==  Address 0x1012a22a is 554 bytes inside a block of size 4,096
alloc'd
==87740==    at 0x4C2BBA0: memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740==    by 0x4C2BCC7: posix_memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740==    by 0x52FA845: zio_buf_alloc (umem.h:101)
==87740==    by 0x52F6226: zil_alloc_lwb (zil.c:463)
==87740==    by 0x52F8559: zil_commit (zil.c:566)
==87740==    by 0x40611D: ztest_freeze (ztest.c:5909)
==87740==    by 0x4066A7: ztest_init (ztest.c:6048)
==87740==    by 0x407AF4: main (ztest.c:6226)

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2174

sighandler() should take 2 arguments

Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)

Since no special signal handling was done in sighandler(),
it's simpler to just set SIGINT handler to SIG_DFL, which
terminates the script.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2179

Revert "sighandler() should take 2 arguments"

This reverts commit 0bb89b6c594259829556f6dea5a89e722f214fd3 in
favor of a cleaner implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2182

Fix regression introduced in port of Illumos #3744

Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66774959499a89bf3680d09c6e541ce8, which
ported the Illumos 3744 changes.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2156

Remove unused option -r from 'zpool import'

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2178

Switch ztest mmap(2) ASSERTs to VERIFYs

This is just a small bit of cleanup to ensure ztest fails early
on systems where mmap(2) is not functioning. For the automated
testing which is the primary consumer of ztest there is no
functional change because debugging is always enabled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2177

Free props in ztest_init()

Valgrind complained about this and it's absolutely right. The
props nvlist was not being freed in ztest_init.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2174

sighandler() should take 2 arguments

Stopping arcstat.py with ^C always ends up with error:

TypeError: sighandler() takes no arguments (2 given)

This patch corrects the error by updating the signal handler
to take the correct number of arguments.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2182

Make arcstat.py default to one header per screen

Today arcstat.py prints one header every hdr_intr (20 by default)
lines. It would be more consistent with out utilities like vmstat
if hdr_intr defaulted to terminal window size, i.e. one header
per screenful of outputs.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2183

Fix NAME section of manpages zhack and fsck.zfs.

In Debian GNU/Linux a program called 'linitian' is used to make sure
that packages conforms to the Debian GNU/Linux packaging guide lines.
This fixes the problem reported as:

W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man1/zhack.1.gz
W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man8/fsck.zfs.8.gz

Not something that ZoL needs to addhere to, but every other man page
have their NAME section in a special way - why not these two as well?

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2161

Illumos #4088 use after free in arc_release()

4088 use after free in arc_release()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
https://www.illumos.org/issues/4088
illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772

From the illumos issue:

A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2162

Use KM_PUSHPAGE in spa_add() for spa_label_features.

The spa_label_features nvlist is used in the sync context during pool
version upgrade.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2168

Export symbols dsl_sync_task{_nowait}

These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Improve reporting of tx assignment wait times

Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign().  This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.

One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded.  However, this case should not often arise in practice,
because most callers currently use one of these forms:

  dmu_tx_assign(tx, TXG_WAIT);
  dmu_tx_assign(tx, waited ?  TXG_WAITED : TXG_NOWAIT);

The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed.  Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.

Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram.  Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1915

replace nreserved with ndirty in txgs kstat

The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007bf97cdf34869c1ffbd0ce753873a3d.

Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead.  To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io().  This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io().  To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().

Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696

dmu_tx kstat cleanup

A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.

- The sync task restructuring commit
  13fe019870c8779bf2f5b3ff731b512cf89133ef removed the code
  that bumpted dmu_tx_quota. The counter is now bumped in two
  cases, instead of just the one case as before (after the result
  of dsl_dataset_check_quota call). The second case is where
  we check the requested reservation against the actual pool size,
  as this is an implicit quota of sorts.

- The write throttle restructuring commit
  e8b96c6007bf97cdf34869c1ffbd0ce753873a3d makes dmu_tx_how and
  dmu_tx_inflight obsolete, so they are removed.

Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1914

Invalidate Linux buffer cache on vdevs upon each flush

Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.

A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.

This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.

Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2150

Inform OpenRC that ZFS uses mtab

p_l in #zfsonlinux reported that he had issues mounting filesystems that
were resolved by adding rc_need="mtab" to /etc/init.d/zfs. Closer
inspection revealed that we do have a race, but it is not clear how this
race caused mounting to fail. What is clear is that this race should be
fixed, so lets add the proper `use mtab` line to handle it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2148

Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel

4574 get_clones_stat does not call zap_count in non-debug kernel

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
https://www.illumos.org/issues/4574
illumos/illumos-gate@03d1795fa6f720eafbee821ad37f4343c391cfe4

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2147

Fix zap_lookup() in feature_is_supported().

The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2141

Include l2asize in arcstat

For consistency with upstream pull in the l2asize update after
reworking it from Perl to Python.

References:
https://github.com/mharsch/arcstat/pull/11
https://github.com/mharsch/arcstat/pull/12

Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2122

Remove recursion from dsl_dir_willuse_space()

Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.

Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2069

Merge branch 'arc-changes'

This stack of patches has been empirically shown to drastically improve
the hit rate of the ARC for certain workloads. As a result, fewer reads
to disk are required, which is generally a good thing and can
drastically improve performance if the workload is disk limited.

For the impatient, I'll summarize the results of the tests performed:

    * Test 1 - Creating many empty directories. This test saw 99.9%
               fewer reads and 12.8% more inodes created when running
               *with* these changes.

    * Test 2 - Creating many empty files. This test saw 4% fewer reads
               and 0% more inodes created when running *with* these
               changes.

    * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer
               reads and 4.9% more inodes created when running *with*
               these changes.

    * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer
               reads and 0% more inodes created (but took 6.9% fewer
               seconds to complete) when running *with* these changes.

    * Test 5 - Rsync'ing a dataset with many empty directories. This
               test saw 36.2% fewer reads and 66.2% more inodes created
               when running *with* these changes.

    * Test 6 - Rsync'ing a dataset with many empty files. This test saw
               30.9% fewer reads and 0% more inodes created (but took
               24.3% fewer seconds to complete) when running *with*
               these changes.

    * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw
               30.8% fewer reads and 173.3% more inodes created when
               running *with* these changes.

For the patient, the following consists of more a more detailed
description of the tests performed and the results gathered.

All the tests were run using identical machines, each with a pool
consisting of 5 mirror pairs with 2TB 7200 RPM disks. Each test was run
twice, once *without* this set of patches and again *with* this set of
patches to highlight the performance changes introduced. The first four
workloads tested were:

    ** NOTE: None of these tests were run to completion. They ran for a
             set amount of time and then were terminated or hit ENOSPC.

    1. Creating many empty directories:

       * fdtree -d 10 -l 8 -s 0 -f 0 -C
         -> 111,111,111 Directories
         ->           0 Files
         ->           0 KiB File Data

    2. Creating many empty files:

       * fdtree -d 10 -l 5 -s 0 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         ->             0 KiB File Data

    3. Creating many 4 KiB files:

       * fdtree -d 10 -l 5 -s 1 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         -> 4,444,440,000 KiB File Data

    4. Creating many 4096 KiB files:

       * fdtree -d 10 -l 5 -s 1024 -f 10000 -C
         ->           111,111 Directories
         ->     1,111,110,000 Files
         -> 4,551,106,560,000 KiB File Data

Results for these first four tests are below:

                  | Time (s) |   inodes |  reads |    writes |
                --+----------+----------+--------+-----------+
    Test 1 Before |    65069 | 37845363 | 831975 |   3214646 |
    Test 1 After  |    65069 | 42703608 |    778 |   3327674 |
                --+----------+----------+--------+-----------+
    Test 2 Before |    65073 | 54257583 | 208647 |   2413056 |
    Test 2 After  |    65069 | 54255782 | 200038 |   2533759 |
                --+----------+----------+--------+-----------+
    Test 3 Before |    65068 | 49857744 | 487130 |   5533348 |
    Test 3 After  |    65071 | 52294311 |  16078 |   5648354 |
                --+----------+----------+--------+-----------+
    Test 4 Before |    34854 |  2448329 | 385870 | 162116572 |
    Test 4 After  |    32419 |  2448329 |   2339 | 162175706 |
                --+----------+----------+--------+-----------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

As you can see from the table above, running with this patch stack
*significantly* reduced the number of reads performed in 3 out of the 4
tests (due to an improved ARC hit rate).

In addition to the tests described above, which specifically targeted
creates only, three other workloads were tested. These additional tests
were targeting rsync performance against the datasets created in the
previous tests. A brief description of the workloads and results for
these tests are below:

    ** NOTE: Aside from (6), these tests didn't run to completion. They
             ran for a set amount of time and then were terminated.

    5. Rsync the dataset created in Test 1 to a new dataset:

       * rsync -a /tank/test-1 /tank/test-5

    6. Rsync the dataset created in Test 2 to a new dataset:

       * rsync -a /tank/test-2 /tank/test-6

    7. Rsync the dataset created in Test 3 to a new dataset:

       * rsync -a /tank/test-3 /tank/test-7

Results for Test 5, 6, and 7 are below:

                  | Time (s) |   inodes |    reads |  writes |
                --+----------+----------+----------+---------+
    Test 5 Before |    93041 | 17921014 | 47632823 | 4094848 |
    Test 5 After  |    93029 | 29785847 | 30376206 | 4484459 |
                --+----------+----------+----------+---------+
    Test 6 Before |    15290 | 54264474 |  6018331 |  733087 |
    Test 6 After  |    11573 | 54260826 |  4155661 |  617285 |
                --+----------+----------+----------+---------+
    Test 7 Before |    93057 | 10093749 | 41561635 | 3659098 |
    Test 7 After  |    93045 | 27587043 | 28773151 | 5612234 |
                --+----------+----------+----------+---------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2110

Set "arc_meta_limit" to 3/4 arc_c_max by default

Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Split "data_size" into "meta" and "data"

Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Prioritize "metadata" in arc_get_data_buf

When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Remove "arc_meta_used" from arc_adjust calculation

Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Prune metadata from ghost lists in arc_adjust_meta

To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Revert "Return -1 from arc_shrinker_func()"

This reverts commit c11a12bc3b2e5ee9a6bd74e26f1a396b6025fbd4.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Disable arc_p adapt dampener by default

It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Allow "arc_p" to drop to zero or grow to "arc_c"

Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Disable aggressive arc_p growth by default

For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Adjust arc_p based on "bytes" in arc_shrink

In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753f1657c05a4287226eeda1f53ae431b8a7
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Set zfs_arc_min to 4MB

Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over a 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the mimimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110

Merge branch 'issue-2094'

This patch stack was designed to accomplish the following:

* Cleanly address the pool incompatibility which was accidentally
  introduced post-0.6.2 but pre-0.6.3.  This required adding
  infrastructure which can be used from now on to notify system
  administrators of errata which affect their pool.

* Add additional automated regression testing to ensure this type
  of compatibility issue is caught prior to the change being merged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2094

Add erratum for issue #2094

ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit.  Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.

The additional field has been removed by the previous commit.  However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied.  This will return the pools to a state in which they
may be imported by other implementations.

The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted.  A message similar to one of the following means your
pool must be scrubbed to restore compatibility.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
         however there is a compatibility issue which should be corrected
         by running 'zpool scrub'
    see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...

$ zpool status
  pool: zol-0.6.2-173
state: ONLINE
  scan: pool compatibility issue detected.
   see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...

If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported.  Further advice on how to proceed will be
provided by the error message as follows.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
         active asynchronous destroy.  Revert to an earlier version and
         allow the destroy to complete before updating.
         see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...

Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:

zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.

The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys.  Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.

This patch does have one limitation that should be mentioned.  It will not
detect errata #2 on a pool unless errata #1 is also present.  It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.

End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL.  The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root.  This will display a non-zero
value for any pool with an active asynchronous destroy.

Lastly, it is expected that no user data has been lost as a result of
this erratum.

Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094

Add generic errata infrastructure

From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool.  These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate.  The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation.  Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.

To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected.  This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.

To add a new errata the following changes must be made:

* A new errata identifier must be assigned by adding a new enum value
  to the zpool_errata_t type.  New enums must be added to the end to
  preserve the existing ordering.

* Code must be added to detect the issue.  This does not strictly
  need to be done at pool import time but doing so will make the
  errata visible in 'zpool import' as well as 'zpool status'.  Once
  detected the spa->spa_errata member should be set to the new enum.

* If possible code should be added to clear the spa->spa_errata member
  once the errata has been resolved.

* The show_import() and status_callback() functions must be updated
  to include an informational message describing the errata.  This
  should include an action message describing what an administrator
  should do to address the errata.

* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
  updated to describe the errata.  This space can be used to provide
  as much additional information as needed to fully describe the errata.
  A link to this documentation will be automatically generated in the
  output of 'zpool import' and 'zpool status'.

Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094

Use expected zpool_status_t type

Both the zpool_import_status() and zpool_get_status() functions
return the zpool_status_t enum. This explicit type should be used
rather than the more generic int type.

This patch makes no functional change and should only be considered
code cleanup. It happens to have been done in the context of #2094
because that's when I noticed this issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094

Revert changes to zbookmark_t

Commit 1421c89142376bfd41e4de22ed7c7846b9e41f95 added a field to
zbookmark_t that unintentinoally caused a disk format change. This
negatively affected backward compatibility and platform portability.
Therefore, this field is being removed.

The function that field permitted is left unimplemented until a later
patch that will reimplement the field in a way that does not affect the
disk format.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094

Add zimport.sh compatibility test script

Verify that an assortment of known good reference pools can be imported
using different versions of the ZoL code.

By default references pools for the major ZFS implementation will be
checked against the most recent ZoL tags and the master development branch.
Alternate tags or branches may be verified with the '-s <src-tag> option.
Passing the keyword "installed" will instruct the script to test whatever
version is installed.

Preferentially a reference pool is used for all tests.  However, if one
does not exist and the pool-tag matches one of the src-tags then a new
reference pool will be created using binaries from that source build.
This is particularly useful when you need to test your changes before
opening a pull request.

New reference pools may be added by placing a bzip2 compressed tarball
of the pool in the scripts/zpool-example directory and then passing
the -p <pool-tag> option.  To increase the test coverage reference pools
should be collected for all the major ZFS implementations.  Having these
pools easily available is also helpful to the developers.

Care should be taken to run these tests with a kernel supported by all
the listed tags.  Otherwise build failure will cause false positives.

EXAMPLES:

The following example will verify the zfs-0.6.2 tag, the master branch,
and the installed zfs version can correctly import the listed pools.
Note there is no reference pool available for master and installed but
because binaries are available one is automatically constructed.  The
working directory is also preserved between runs (-k) preventing the
need to rebuild from source for multiple runs.

zimport.sh -k -f /var/tmp/zimport \
     -s "zfs-0.6.1 zfs-0.6.2 master installed" \
     -p "all master installed"

--------------------- ZFS on Linux Source Versions --------------
                zfs-0.6.1       zfs-0.6.2       master          0.6.2-180
-----------------------------------------------------------------
Clone SPL       Skip Skip Skip Skip
Clone ZFS       Skip Skip Skip Skip
Build SPL       Skip Skip Skip Skip
Build ZFS       Skip Skip Skip Skip
-----------------------------------------------------------------
zevo-1.1.1      Pass Pass Pass Pass
zol-0.6.1       Pass Pass Pass Pass
zol-0.6.2-173   Fail Fail Pass Pass
zol-0.6.2       Pass Pass Pass Pass
master          Fail Fail Pass Pass
installed       Pass Pass Pass Pass

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #2094

Propagate errors when registering "relatime" property callback.

Various errors can occur when registering property callbacks. As the
author's comments indicate, the code is very paranoid about preserving
the first-seen error when registering callbacks. This patch causes an
error seen while registering the "relatime" callback to not clobber a
previously-seen error.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2117

Add default FILEDIR path to zpool-config scripts

Allow the caller of the zpool-create.sh script to override
the default path where file vdevs are created. This allows
for greated flexibilty when scripting.

Additionally, update the default path from /tmp/ to /var/tmp/
because these days /tmp/ is likely a ramdisk. Even though
these files are sparse they may grow large in which case they
should be backed by a physical device.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2120

Add explicit Conflicts for zfs-fuse packages

zfs-fuse provides the same commands and man page names as ZoL.
Changing the names on either side would make each incompatible with
all existing documentation about ZFS. Providing bit identical files
is not possible due to differing codebases.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1866

Fix zconfig.sh test 9

Commit ba6a240 adjusted the behavior of 'zfs create -V'. The
caller is no longer guaranteed that udev will have finished
creating the /dev/ entries by the time to command exits. It
is therefore required that we explicitly block waiting for
udev to settle for this test to run reliably.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add systemd unit files for ZFS startup

This adds systemd unit files replacing the functionality offered by
the SysV init script found in etc/init.d.

It has been developed and tested on Fedora 19, Fedora 20
and openSuSE 13.1.

Four unit files and one target are offered.

zfs-import-cache.service:
    Import pools from /etc/zfs/zpool.cache. This unit will wait for
    udev to settle.
zfs-import-scan.service:
    Import pools by scanning /dev/disk/by-id for zvols. This unit will
    only run if /etc/zfs/zpool.cache is not present. This unit will wait
    for udev to settle
zfs-mount.service:
    Mount ZFS native filesystems. It contains a dependency to be loaded
    before local-fs.target.
zfs-share.service:
    Share NFS/SMB filesystems. This unit contains a dependency that
    will cause it to be restarted whenever the smb or nfs-server unit
    is restarted, restoring the shares added.
zfs.target:
    This target pulls in the other units in order to start ZFS. It's
    the only unit that can be enabled/disabled, all other services
    are static and pulled in by dependencies. It will honour zfs=off
    and zfs=no options on the kernel command line.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2108

Fix corrupted l2_asize in arcstats

Commit e0b0ca9 accidentally corrupted the l2_asize displayed in
arcstats. This was caused by changing the l2arc_buf_hdr.b_asize
member from an int to uint32_t type. There are places in the
code where this field is cast to a uint64_t resulting in the
b_hits member being treated as part of b_asize.

To resolve the issue the type has been changed to a uint64_t,
and the b_hits member is placed after the enum to prevent the
size of the structure from increasing.

This is a good example of exactly why it's a bad idea to use
ambiguous types (int) in these structures.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1990

4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Refences:
https://www.illumos.org/issues/4188
illumos/illumos-gate@bb411a08b05466bfe0c7095b6373bbc1587e259a

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2091

Illumos 4504 traverse_visitbp: visit group before user

4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>

References:
  https://illumos.org/issues/4504
  http://code.delphix.com/illumos-4504
  http://svnweb.freebsd.org/base?view=revision&revision=260812

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2079

Implement relatime.

Add the "relatime" property. When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access. This behavior
is compatible with the Linux "relatime" mount option.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2064
Closes #1917

Fix error message in zpios

The chunksize must always be strictly smaller than the regionsize.

Signed-off-by: Andrew Uselton <andrew.c.uselton@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2072

Call gethrtime() only once per new txg creation

When transitioning current open TXG into QUIESCE state and opening
a new one txg_quiesce() calls gethrtime():
  - to mark the birth time of the new TXG
  - to record the SPA txg history kstat
  - implicitely inside spa_txg_history_add()

These timestamps are practically the same, so that the first one
can be used instead of the other two.  The only visible difference
is that inside spa_txg_history_add() the time spent in kmem_zalloc()
will be counted towards the opened TXG.

Since at this point the new TXG already exists (tx->tx_open_txg
has been already incremented) it is actually a correct accounting.

In any case this extra work is only happening when spa_txg_history
kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect
the normal processing in any way.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Issue #2075

Add additional state TXG_STATE_WAIT_FOR_SYNC for txg.

In several cases when digging into kstats we can found two txgs
in SYNC state, e.g.

txg     birth            state  nreserved  nread      nwritten ...
985452  258127184872561  C      0          373948416  2376272384 ...
985453  258129016180616  C      0          378173440  28793344 ...
985454  258129016271523  S      0          0          0 ...
985455  258130864245986  S      0          0          0 ...
985456  258130867458851  O      0          0          0 ...

However only first txg (985454) is really syncing at this moment.
The other one (985455) marked as SYNCED is actually in a post-QUIESCED
state and waiting to start sync.   So, the new TXG_STATE_WAIT_FOR_SYNC
state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to
reveal this situation.

txg     birth            state  nreserved  nread      nwritten ...
1086896 235261068743969  C      0          163577856  8437248 ...
1086897 235262870830801  C      0          280625152  822594048 ...
1086898 235264172219064  S      0          0          0 ...
1086899 235264936134407  W      0          0          0 ...
1086900 235264936296156  O      0          0          0 ...

Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2075

Use enum type(zfetch_dirn_t) instead

Fix code with zfetch_dirn_t, which is more readable and clear.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2068

Allow chown/chgrp when no ACL SAs exist.

From the comment in the commit:

Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES
SA in which case ENOENT is returned from zfs_acl_node_read() when the
SA can't be located. Allow chown/chgrp to succeed in these cases rather
than returning an error that makes no sense in the context of the caller.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfs-osx/zfs#86
Closes #1911
Closes #2029

vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE)

The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on. In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1928

Remove incorrect use of EXTRA_DIST for man pages

Setting the 'dist_' prefix is the correct way to instruct Automake
to include these files in the distribution. The EXTRA_DIST variable
is reserved for files which are not covered by the automatic rules.

http://www.gnu.org/software/automake/manual/automake.html#Basics

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

vdev_id: support per-channel slot mappings

The vdev_id udev helper currently applies slot renumbering rules to
every channel (JBOD) in the system.  This is too inflexible for systems
with non-homogeneous storage topologies.  The "slot" keyword now takes
an optional third parameter which names a channel to which the mapping
will apply.  If the third parameter is omitted then the rule applies to
all channels.  The first-specified rule that can match a slot takes
precedence.  Therefore a channel-specific rule for a given slot should
generally appear before a generic rule for the same slot number.  In
this way a custom slot mapping can be applied to a particular channel
and a default mapping applied to the rest.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2056

Use long holds in zvol_set_volsize()

Under Linux the zvol_set_volsize() function was originally written
to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the
dmu_objset_own()/dmu_objset_disown() interfaces were added but
the ZVOL code wasn't updated to take advantage of them.

This was never an issue but after the dsl_pool_config changes
the code now takes the config lock twice. The cleanest solution
is to shift to using dmu_objset_own() which takes a long hold
on the dataset and does not hold the dsl pool lock.

This patch also slightly restructures the existing code such
that it more closely resembles the upstream Illumos code.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2039

Disable GCCs aggressive loop optimization

GCC >+ 4.8's aggressive loop optimization breaks some of the iterators
over the dn_blkptr[] pseudo-array in dnode_phys. Since dn_blkptr[] is
defined as a single-element array, GCC believes an iterator can only
access index 0 and will unroll the loop into a single iteration.

One way to resolve the issue would be to cast the array to a pointer
and fix all the iterators that might break.  The only loop where it
is known to cause a problem is this loop in dmu_objset_write_ready():

    for (i = 0; i < dnp->dn_nblkptr; i++)
            bp->blk_fill += dnp->dn_blkptr[i].blk_fill;

In the common case where dn_nblkptr is 3, the loop is only executed a
single time and "i" is equal to 1 following the loop.

The specific breakage caused by this problem is that the blk_fill of
root block pointers wouldn't be set properly when more than one blkptr
is in use (when no indrect blocks are needed).

The simple reproducing sequence is:

zpool create tank /tank.img
zdb -ddddd tank 0

Notice that "fill=31", however, there are two L0 indirect blocks with
"F=31" and "F=5". The fill count should be 36 rather than 31. This
problem causes an assert to be hit in a simple "zdb tank" when built
with --enable-debug.

However, this approach was not taken because we need to be absolutely
sure we catch all instances of this unwanted optimization.  Therefore,
the build system has been updated to detect if GCC supports the
aggressive loop optimization.  If it does the optimization will be
explicitly disabled using the -fno-aggressive-loop-optimization option.

Original-fix-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2010
Closes #2051

Properly link zpool command to libblkid

31fc19399e597e3391f19f1392ab120f1de0d5f2 incorrectly removed $(LIBBLKID)
from cmd/zpool/Makefile.am. This meant that the toolchain was not given
-lblkid, which resulted in the following build failure on Ubuntu 13.10:

/usr/bin/ld: zpool_vdev.o: undefined reference to symbol
'blkid_put_cache@@BLKID_1.0'
/lib/x86_64-linux-gnu/libblkid.so.1: error adding symbols: DSO missing
from command line
collect2: error: ld returned 1 exit status

That commit reworked various Makefile.am to follow best practices, so we
reintroduce $(LIBBLKID) in a manner consistent with that, rather than
explicitly reverting the change.

Reproduction of this issue was done on a Gentoo Linux system by
executing the following commands:

zfs create -o mountpoint=/mnt/ubuntu-13.10 rpool/ROOT/ubuntu-13.10
debootstrap --variant=buildd --arch amd64 saucy /mnt/ubuntu-13.10 http://archive.ubuntu.com/ubuntu/
mount -o bind /dev /mnt/ubuntu-13.10/dev/
mount -o bind /proc/ /mnt/ubuntu-13.10/proc/
mount -o bind /sys/ /mnt/ubuntu-13.10/sys/
cp /etc/resolv.conf /mnt/ubuntu-13.10/etc/
(cd /mnt/ubuntu-13.10/root/ && git clone git://github.com/zfsonlinux/zfs.git)
chroot /mnt/ubuntu-13.10/
apt-get install git autoconf libtool zlib1g-dev uuid-dev libblkid-dev
\#apt-get install alien fakeroot vim
cd /root/zfs
./autogen.sh
./configure --with-config=user --prefix=/usr
make

That will create a Ubuntu 13.10 chroot, fetch the sources and build
test. At this point, cmd/zpool/Makefile.am was modified and the
following commands were run to verify that the build issue was resolved:

git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make

Although it is not shown here, the absence of libblkid-dev enables ZFS
to build successfully without the patch. This could explain how this
escaped detection until recently. A test without libblkid-dev was done
to verify that the patch did not cause a regression in the absence of
libblkid:

apt-get remove libblkid-dev
git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make

Additionally, the commands themselves were tested against my live system
from within the chroot to ensure basic functionality. My live system had
corresponding kernel modules already installed and basic commands such
as `zpool list` and `zfs list` worked without incident. Lastly, this
patch was also build tested on Gentoo Linux, where it caused no
problems.

At time of writing, these steps can be used to reproduce these results
on any modern Linux system that has debootstrap installed. On Gentoo,
installing debootstrap can be done with `emerge dev-util/debootstrap`.
The current ZFSOnLinux HEAD revision as of writing is
fd23720ae14dca926800ae70e6c8f4b4f82efc08. Once this is fixed in HEAD,
either that revision or another before this fix and after
31fc19399e597e3391f19f1392ab120f1de0d5f2 will be needed to reproduce
this issue.

Lastly, it remains to be seen why the toolchains on the systems
performing regression tests did not catch this. This is not a
ZFS-specific issue, but it is something that we will want to explore in
the future.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2038

Prevent duplicate mnttab cache entries

Under Linux its possible to mount the same filesystem multiple
times in the namespace.  This can be done either with bind mounts
or simply with multiple mount points.  Unfortunately, the mnttab
cache code is implemented using an AVL tree which does not support
duplicate entries.  To avoid this issue this patch updates the
code to check for a duplicate entry before adding a new one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Martin <mgmartin.mgm@gmail.com>
Closes #2041

Drain iput taskq outside z_teardown_lock

It's unsafe to drain the iput taskq while holding the z_teardown_lock
as a writer. This is because when the last reference on an inode is
dropped it may still have pages which need to be written to disk.
This will be done through zpl_writepages which will acquire the
z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're
holding the lock as a writer in zfs_sb_teardown the unmount will
deadlock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #1988

Force LZ4_FORCE_SW_BITCOUNT for Sparc

This change was proposed for Sparc but it's not clear to me
why it's required. Proper support exists in the lz4 code to
detect the endianness and the required builtins are available
for gcc. Still I'm including the patch because it will only
impact Sparc and it may resolve a case which hasn't occured
to me.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700

Fix zfs_getattr_fast types

On Sparc sp->blksize will be a 64-bit value which is then cast
incorrectly to a 32-bit value. For big endian systems this
results in an incorrect value for sp->blksize. To resolve the
problem local variables of the correct size are used and then
assigned to sp->blksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700

Fix nvlist 'Bus Error' for Sparc

The mis-aligned memory accesses in nvpair_native_embedded() and
nvpair_native_embedded_array() will cause a 'Bus Error' for
architectures such as Sparc which not fully byte addressible.
To avoid this issue care is taken to avoid dereferencing the
potentially mis-aligned packed nvlist_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700

Use local variable to read zp->z_mode

When accessing the zp->z_mode through the SA bulk interface we
expect that 64-bits are available to hold the result.  However,
on 32-bit platforms mode_t will only be 32-bits so we cannot
pass it to SA_ADD_BULK_ATTR().  Instead a local uint64_t variable
must be used and the result assigned to zp->z_mode.

This went unnoticed on 32-bit little endian platforms because
the bytes happen to end up in the correct 32-bits.  But on big
endian platforms like Sparc the zp->z_mode will always end up
set to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700

Define the needed ISA types for Sparc

Add the minimum required ISA types to support the Sparc
architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700

Add ddt, ddt_entry, and l2arc_hdr caches

Back the allocations for ddt tables+entries and l2arc headers with
kmem caches. This will reduce the cost of allocating these commonly
used structures and allow for greater visibility of them through the
/proc/spl/kmem/slab interface.

Signed-off-by: John Layman <jlayman@sagecloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1893

Remove unconditional sharetab update

Removes the unconditional sharetab update when running any zfs
command. This means the sharetab might become out of date if
users are manually adding/removing shares with exportfs. But
we shouldn't punish all callers to zfs in order to handle that
unlikely case. In the unlikely event we observe issues because
of this it can always be added back to just the share/unshare
call paths where we need an up to date sharetab.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #845

Enable /etc/mtab cache to improve performance

Re-enable the /etc/mtab cache to prevent the zfs command from
having to repeatedly open and read from the /etc/mtab file.
Instead an AVL tree of the mounted filesystems is created and
used to vastly speed up lookups. This means that if non-zfs
filesystems are mounted concurrently the 'zfs mount' will not
immediately detect them.  In practice that will rarely happen
and even if it does the absolute worst case would be a failed
mount.  This was originally disabled out of an abundance of
paranoia.

NOTE: There may still be some parts of the code which do not
consult the mtab cache.  They should be updated to check the
mtab cache as they as discovered to be a problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #845

Add UNSHARING of filesystems and EXPORTING pools

As a 'stop' action ensure the filesystem is unshared before
it is unmounted, just in case. Additionally, export the pool
so it may be cleanly imported by a different host.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2003

Fix the creation of ZPOOL_HIST_CMD pool history entries.

Move the libzfs_fini() after the zpool_log_history() call so the
ZPOOL_HIST_CMD entry can get written.

Fix the handling of saved_poolname in zfsdev_ioctl()
which was broken as part of the stack-reduction work in
a16878805388c4d96cb8a294de965071d138a47b.

Since ZoL destroys the TSD data in which the previously successful
ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY
ioctl has a very important restriction: it can only successfully write
a long entry following a successful ioctl() if no intervening vops have
been performed. Some of zfs subcommands do perform intervening vops and
to do the logging themselves. At the moment, the "create" and "clone"
subcommands have been modified appropriately.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1998

Properly handle updates of variably-sized SA entries.

During the update process in sa_modify_attrs(), the sizes of existing
variably-sized SA entries are obtained from sa_lengths[]. The case where
a variably-sized SA was being replaced neglected to increment the index
into sa_lengths[], so subsequent variable-length SAs would be rewritten
with the wrong length. This patch adds the missing increment operation
so all variably-sized SA entries are stored with their correct lengths.

Previously, a size-changing update of a variably-sized SA that occurred
when there were other variably-sized SAs in the bonus buffer would cause
the subsequent SAs to be corrupted. The most common case in which this
would occur is when a mode change caused the ZPL_DACL_ACES entry to
change size when a ZPL_DXATTR (SA xattr) entry already existed.

The following sequence would have caused a failure when xattr=sa was in
force and would corrupt the bonus buffer:

open(filename, O_WRONLY | O_CREAT, 0600);
...
lsetxattr(filename, ...); /* create xattr SA */
chmod(filename, 0650); /* enlarges the ACL */

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1978

Register correct handlers for nvlist_{dup,pack,unpack}

This change is related to commit 81eaf15 which ensured the correct
allocation handlers were installed for nvlist_alloc(). The nvlist
functions nvlist_dup(), nvlist_pack(), and nvlist_unpack() suffer
from the same issue and have been updated accordingly.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1937

Add full SELinux support

Four new dataset properties have been added to support SELinux.  They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8).  When one of
these properties is set to something other than 'none'.  That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.

For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:

  $ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media

This will ensure the filesystem is automatically mounted with that
rootcontext.  It is equivalent to manually specifying the rootcontext
with the -o option like this:

  $ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media

By default all four contexts are set to 'none'.  Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.

Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #1504

cstyle: Resolve C style issues

The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written.  Others were
caused when the common code was slightly adjusted for Linux.

This patch contains no functional changes.  It only refreshes
the code to conform to style guide.

Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request.  The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1821

cstyle: Allow spaces in all comments

Update the cstyle.pl script to allow pictures in all comments not
just header comments. Recent changes from Illumos such as d3cc8b1
have relocated various pictures in the standard block comments to
make the code more readable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1821

cstyle: Exclude several files from 'make checkstyle'

The zfs_config.h header and *.mod.c files are both products
of the build process. They must be excluded from the style
check because they are not part of the pristine source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1821

Illumos #4208

4208 Typo in zfs_main.c: "posxiuser"

Reviewed by: Sonu Pillai <sonu.pillai@delphix.com>
Reviewed by: Will Guyette <will.guyette@delphix.com>
Reviewed by: Eric Diven <eric.diven@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
https://www.illumos.org/issues/4208
illumos/illumos-gate@f38cb554a534c6df738be3f4d23327e69888e634

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1986

Add zfs_send_corrupt_data module option

Tuning setting to ignore read/checksum errors when sending data.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1982
Issue #1897

Cause zfs.spec to place dracut files properly

This is an extension of commit ffb2111. As the fedora conditional
has been added, this allows centos/rhel-6 to fall back to the
proper directory (/usr/share/dracut)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1984

Handle acl flags from util-linux mount command

Add acl, noacl and posixacl to option_map, avoiding ENOENT error
case when mount from util-linux-2.24 execs mount.zfs with any of
those flags

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: renelson <bnelson@nelsonbe.com>
Issue #1968

Fix grammar in parse_options() error message

A minor grammar error was corrected in in the parse_options()
error handling for the ENOENT case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: renelson <bnelson@nelsonbe.com>
Issue #1968

Fix z_sync_cnt decrement in zfs_close

The comment in zfs_close states that "Under Linux the zfs_close() hook
is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close
is associated with every successful struct file creation/deletion, which
should always be balanced.

Here is an example of what's wrong:

Process A B
open(O_SYNC)
z_sync_cnt = 1
open(O_SYNC)
z_sync_cnt = 2
close()
z_sync_cnt = 0

So z_sync_cnt is 0 even if B still has the file with O_SYNC.

Also moves the generic_file_open call before zfs_open to ensure that in
the case generic_file_open fails z_sync_cnt is not incremented. This
is safe because generic_file_open has no side effects.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1962

Silence e2fsck warning in zconfig.sh

When running zconfig.sh test 7 and 8 cause the following warning to
be printed to the console.  It's caused because we're snapshoting a
mounted ext2 filesystem which is not in a 'clean' state.  This is
to be expected since we have no guarentees about the on-disk
consistency of the filesystem.

EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended

To silence the warning and preserve the intent of these test cases
they have been updated to unmount the filesystem prior to snapshoting
them.  This ensures the ext2 filesystem is in a consistent state
when the snapshot is taken.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1972

cstyle: zvol.c

Update zvol.c to conform to the style guidelines, verified by
running cstyle.pl on the source file. This patch contains
no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #1821

Update zfs(8) Snapshots section

The Snapshots section of the zfs(8) man page is incorrect and should
have been updated as part of #1312. Snapshots of volumes can be
accessed independently and their visibility is determined by the
'snapdev=hidden|visible' property. This is analogous to the existing
'snapdir=hidden|visible' property.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1921

Sync /dev/zfs ioctl ordering

In order to minimize any future disruption caused by the addition
and removal /dev/zfs ioctls this patch makes the following changes.

1) Sync ZoL's ioctl ordering such that it matches Illumos.  For
   historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID
   ioctls were out of order.

2) Move Linux and FreeBSD specific ioctls in to their own reserved
   ranges.  This allows us to preserve the existing ordering when
   new ioctls are added by either Illumos or FreeBSD.  When an
   ioctl is no longer needed it should be retired in place.

This change alters the ZFS user/kernel ABI so make sure you rebuild
both your user and kernel modules.  However, it should allow for a
much stabler interface going forward.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1973

Remove ZFC_IOC_*_MINOR ioctl()s

Early versions of ZFS coordinated the creation and destruction
of device minors from userspace.  This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel.  This significantly simplified the code.

However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux.  This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel.  Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes.  This
will minimize, but not eliminate, the disruption to end users.

Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code.  While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.

NOTES:

1) This patch introduces one subtle change in behavior which
   could not be easily avoided.  Prior to this change callers
   of 'zfs create -V ...' were guaranteed that upon exit the
   /dev/zvol/ block device link would be created or an error
   returned.  That's no longer the case.  The utilities will no
   longer block waiting for the symlink to be created.  Callers
   are now responsible for blocking, this is why a 'udev_wait'
   call was added to the 'label' function in scripts/common.sh.

2) The read-only behavior of a ZVOL now solely depends on if
   the ZVOL_RDONLY bit is set in zv->zv_flags.  The redundant
   policy setting in the gendisk structure was removed.  This
   both simplifies the code and allows us to safely leverage
   set_disk_ro() to issue a KOBJ_CHANGE uevent.  See the
   comment in the code for futher details on this.

3) Because __zvol_create_minor() and zvol_alloc() may now be
   called in a sync task they must use KM_PUSHPAGE.

References:
  illumos/illumos-gate@681d9761e8516a7dc5ab6589e2dfe717777e1123

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1969

Illumos #4121 vdev_label_init read only

4121 vdev_label_init should treat request as succeeded when pool
     is read only
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/4121
  illumos/illumos-gate@973c78e94bf9634782164382c9e291bf81161fa5

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1863

Fix atime handling.

Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime. However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.

This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.

It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.

The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949

Fix zstream_t incorrect type

The DMU zfetch code organizes streams with lists not avl trees. A
avl_node_t was mistakenly used for a list_node_t in the zstream_t
type. This is incorrect (but harmless) and when unnoticed because:

1) The list functions explicitly cast the value preventing a warning,
2) sizeof(avl_node_t) >= sizeof(list_node_t) so no overrun occurs, and
3) The calculated offset is the same regardless of the type.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1946

Remove MAX when initializing arc_c_max

The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.

The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.

Signed-off-by: david.chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1941

Fix multipath bug in vdev_id caused by inconsistent field numbering

The bug is caused by multipath output like this:

35000c50056bd77a7 dm-15 HP,MB3000FCWDH
size=2.7T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:16:0 sdq  65:0    active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 4:0:52:0 sdfp 130:176 active undef running

Note that the pipe symbols mean that the field numbering is different
between the sdq and sdfp lines.  The fix edits out the pipe symbols.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1692

Revert "Use directory xattrs for symlinks"

This reverts commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7.

A proper fix for Issue #1648 was landed under Issue #1890, so this is no
longer needed.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1648

sa_find_sizes() may compute wrong SA header size

Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header.  This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.

The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode.  The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.

A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648).  Note the
corrupt link target name:

$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec  6 15:40 link -> 90123456701234567

Commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7 worked around this bug
by forcing xattr's on symlinks to be stored in directory format.  This
change implements a proper fix, so the workaround can now be reverted.

The reference link below contains a reproducer for FreeBSD.

References:
  http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1890

Update init script to allow verbose mounts

Allow verbose mounts to make is easier to monitor progress when
mounting a large number of filesystems.

This functionality is disabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1929

Update init script to allow /dev/disk/by-id import

Many people prefer to use by-id at import time instead of using
the cache file. This can be a much better solution than the cache
file in some environments so we're adding some infrastructure to
allow it.

This functionality is disabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1929

Fix 'zfs diff' shares error

When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented.  Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.

However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist.  Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.

There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist.  This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.

This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Booker <https://github.com/gbooker>
Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes #1426
Closes #481

Add module versioning

Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion".  See:

  /sys/module/zavl/version
  /sys/module/zavl/srcversion
  /sys/module/znvpair/version
  /sys/module/znvpair/srcversion
  /sys/module/zunicode/version
  /sys/module/zunicode/srcversion
  /sys/module/zcommon/version
  /sys/module/zcommon/srcversion
  /sys/module/zfs/version
  /sys/module/zfs/srcversion
  /sys/module/zpios/version
  /sys/module/zpios/srcversion

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1923