Aaron Fineman [Wed, 18 Dec 2013 02:33:40 +0000 (02:33 +0000)]
Cause zfs.spec to place dracut files properly
This is an extension of commit ffb2111. As the fedora conditional
has been added, this allows centos/rhel-6 to fall back to the
proper directory (/usr/share/dracut)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1984
Chunwei Chen [Tue, 17 Dec 2013 18:18:25 +0000 (10:18 -0800)]
Fix z_sync_cnt decrement in zfs_close
The comment in zfs_close states that "Under Linux the zfs_close() hook
is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close
is associated with every successful struct file creation/deletion, which
should always be balanced.
Here is an example of what's wrong:
Process A B
open(O_SYNC)
z_sync_cnt = 1
open(O_SYNC)
z_sync_cnt = 2
close()
z_sync_cnt = 0
So z_sync_cnt is 0 even if B still has the file with O_SYNC.
Also moves the generic_file_open call before zfs_open to ensure that in
the case generic_file_open fails z_sync_cnt is not incremented. This
is safe because generic_file_open has no side effects.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1962
Brian Behlendorf [Fri, 13 Dec 2013 19:29:06 +0000 (11:29 -0800)]
Silence e2fsck warning in zconfig.sh
When running zconfig.sh test 7 and 8 cause the following warning to
be printed to the console. It's caused because we're snapshoting a
mounted ext2 filesystem which is not in a 'clean' state. This is
to be expected since we have no guarentees about the on-disk
consistency of the filesystem.
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
To silence the warning and preserve the intent of these test cases
they have been updated to unmount the filesystem prior to snapshoting
them. This ensures the ext2 filesystem is in a consistent state
when the snapshot is taken.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1972
Brian Behlendorf [Thu, 12 Dec 2013 22:55:19 +0000 (14:55 -0800)]
Update zfs(8) Snapshots section
The Snapshots section of the zfs(8) man page is incorrect and should
have been updated as part of #1312. Snapshots of volumes can be
accessed independently and their visibility is determined by the
'snapdev=hidden|visible' property. This is analogous to the existing
'snapdir=hidden|visible' property.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1921
Brian Behlendorf [Fri, 13 Dec 2013 22:49:33 +0000 (14:49 -0800)]
Sync /dev/zfs ioctl ordering
In order to minimize any future disruption caused by the addition
and removal /dev/zfs ioctls this patch makes the following changes.
1) Sync ZoL's ioctl ordering such that it matches Illumos. For
historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID
ioctls were out of order.
2) Move Linux and FreeBSD specific ioctls in to their own reserved
ranges. This allows us to preserve the existing ordering when
new ioctls are added by either Illumos or FreeBSD. When an
ioctl is no longer needed it should be retired in place.
This change alters the ZFS user/kernel ABI so make sure you rebuild
both your user and kernel modules. However, it should allow for a
much stabler interface going forward.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1973
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace. This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel. This significantly simplified the code.
However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux. This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel. Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes. This
will minimize, but not eliminate, the disruption to end users.
Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code. While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.
NOTES:
1) This patch introduces one subtle change in behavior which
could not be easily avoided. Prior to this change callers
of 'zfs create -V ...' were guaranteed that upon exit the
/dev/zvol/ block device link would be created or an error
returned. That's no longer the case. The utilities will no
longer block waiting for the symlink to be created. Callers
are now responsible for blocking, this is why a 'udev_wait'
call was added to the 'label' function in scripts/common.sh.
2) The read-only behavior of a ZVOL now solely depends on if
the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant
policy setting in the gendisk structure was removed. This
both simplifies the code and allows us to safely leverage
set_disk_ro() to issue a KOBJ_CHANGE uevent. See the
comment in the code for futher details on this.
3) Because __zvol_create_minor() and zvol_alloc() may now be
called in a sync task they must use KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1969
George Wilson [Thu, 12 Dec 2013 18:19:54 +0000 (10:19 -0800)]
Illumos #4121 vdev_label_init read only
4121 vdev_label_init should treat request as succeeded when pool
is read only
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Tim Chase [Tue, 10 Dec 2013 22:36:42 +0000 (16:36 -0600)]
Fix atime handling.
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime. However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.
This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.
It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.
The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
Shen Yan [Tue, 10 Dec 2013 06:58:53 +0000 (14:58 +0800)]
Fix zstream_t incorrect type
The DMU zfetch code organizes streams with lists not avl trees. A
avl_node_t was mistakenly used for a list_node_t in the zstream_t
type. This is incorrect (but harmless) and when unnoticed because:
1) The list functions explicitly cast the value preventing a warning,
2) sizeof(avl_node_t) >= sizeof(list_node_t) so no overrun occurs, and
3) The calculated offset is the same regardless of the type.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1946
david.chen [Mon, 9 Dec 2013 07:55:01 +0000 (15:55 +0800)]
Remove MAX when initializing arc_c_max
The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.
The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.
Signed-off-by: david.chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1941
James Pan [Fri, 6 Dec 2013 22:16:40 +0000 (14:16 -0800)]
sa_find_sizes() may compute wrong SA header size
Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header. This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.
The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode. The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.
A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648). Note the
corrupt link target name:
$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567
Commit 6a7c0ccca44ad02c476a111d8f7911fc8b12fff7 worked around this bug
by forcing xattr's on symlinks to be stored in directory format. This
change implements a proper fix, so the workaround can now be reverted.
The reference link below contains a reproducer for FreeBSD.
Update init script to allow /dev/disk/by-id import
Many people prefer to use by-id at import time instead of using
the cache file. This can be a much better solution than the cache
file in some environments so we're adding some infrastructure to
allow it.
This functionality is disabled by default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1929
When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented. Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.
However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist. Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.
There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist. This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.
This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Graham Booker <https://github.com/gbooker> Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes #1426
Closes #481
Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion". See:
Matthew Ahrens [Thu, 29 Aug 2013 03:01:20 +0000 (20:01 -0700)]
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Matthew Ahrens [Fri, 22 Nov 2013 23:13:18 +0000 (15:13 -0800)]
Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.
Porting Notes:
zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
As part of zfs_sb_teardown() there is an assertion that all inodes
which are part of the zsb->z_all_znodes list have at least one
reference on them. This is always true for the standard unmount
case but there are two other cases where it is not strictly true.
* zfs_ioc_rollback() - This is the most common case and it results
from the fact that we aren't unmounting the filesystem. During a
normal unmount the MS_ACTIVE flag will be cleared on the super block
causing iput_final() to evict the inode when its reference count
drops to zero. However, during a rollback MS_ACTIVE remains set
since we're rolling back a live filesystem and need to preserve the
existing super block. This allows inodes with a zero reference count
to stay in the cache thereby violating the assertion.
* destroy_inode() / zfs_sb_teardown() - There exists a small race
between dropping the last reference on an inode and removing it from
the zsb->z_all_znodes list. This is unlikely to occur but could also
trigger the assertion which is incorrect. The inode may safely have
a zero reference count in this case.
Since allowing a zero reference count on the inode is expected and
safe for both of these cases the simplest thing to do is remove the
ASSERT. This code is only enabled for default builds so removing
this entirely is a very safe change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1417
Closes #1536
Richard Yao [Wed, 13 Nov 2013 14:30:21 +0000 (09:30 -0500)]
Drive database update
Added:
Adata S396 (obtained from drive_id)
Apple MacBookAir3,1 SSD (obtained from drive_id)
Apple MacBookPro10,1 SSD (obtained from drive_id)
Intel 510 (obtained from drive_id)
Intel 710 (obtained from drive_id)
Intel DC S3500 (obtained from drive_id)
Netapp LUN (obtained from illumos user's sd.conf)
OCZ Agility 3 (obtained from drive_id)
OCZ Vertex (obtained from drive_id)
Samsung PM800 (obtained from drive_id)
Sandisk U100 (obtained from drive_id)
Sun Comstar (obtained from illumos user's sd.conf)
Notes:
1. The entries for the Intel DC S3500 were extrapolated from the 800GB
model's entry, which is "ATA INTEL SSDSC2BB80".
2. The entires for the Intel 710 were extrapolated from the 120GG
model's entry, which is "ATA INTEL SSDSA2BZ12".
3. The entires for the Intel 510 were extrapolated from the 250GB
model's entry, which is "ATA INTEL SSDSC2MH25".
4. The entires for the Apple MacBookPro10,1 SSD were extrapolated from
the 512GB model's entry, which is "ATA APPLE SSD SM512E". Google
searches suggest that this is a rebadged Samsung 830.
5. The entires for the Apple MacBookAir3,1 SSD were extrapolated from
the 128GB model's entry, which is "ATA APPLE SSD TS128C". Google
searches suggest that this is a rebadged Kingston SSDNow V+ 100 (based
on Toshiba).
6. Sun Comstar is an iSCSI Target, so we cannot tell what the correct
sector size is through this method. We list it only for reference
purposes, but it is commented out. Similarly, it is not clear what the
right thing to do for Netapp is, so we comment it out.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1907
Tim Chase [Wed, 20 Nov 2013 13:56:56 +0000 (07:56 -0600)]
Some nvlist allocations in hold processing need to use KM_PUSHPAGE.
This should hopefully catch the rest of the allocations in the
user hold/release processing that were missed by commit 65c67ea86e9f112177f1ad32de8e780f10798a64.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1852
Closes #1855
In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.
The solution implemented here is composed of two parts:
- I added a new callback system to the ZIL, which allows the caller to
be notified when its ITX gets written to stable storage. One nice
thing is that the callback is called not only in zil_commit() but
in zil_sync() as well, which means that the caller doesn't have to
care whether the write ended up in the ZIL or the DMU: it will get
notified as soon as it's safe, period. This is an improvement over
dmu_tx_callback_register() that was used previously, which only
supports DMU writes. The rationale for this change is to allow
zpl_putpage() to be notified when a ZIL commit is completed without
having to block on zil_commit() itself.
- zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
from issuing ZIL commits. zpl_writepages() will issue the commit
itself instead of relying on zpl_putpage() to do it, thus nicely
batching the writes. Note, however, that we still have to call
write_cache_pages() again in SYNC mode because there is an edge case
documented in the implementation of write_cache_pages() whereas it
will not give us all dirty pages when running in non-SYNC mode. Thus
we need to run it at least once in SYNC mode to make sure we honor
persistency guarantees. This only happens when the pages are
modified at the same time msync() is running, which should be rare.
In most cases there won't be any additional pages and this second
call will do nothing.
Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1849
Closes #907
Trey Dockendorf [Fri, 15 Nov 2013 19:36:24 +0000 (13:36 -0600)]
Change zfs-dkms requirement
Version 2.2.0.3-20 of dkms in the EPEL/Fedora repositories added the
necessary patches to support ZoL, Therefore, the zfs-dkms requirement
on dkms is set to match that version or higher. This allows us to
drop the custom dkms build in the ZoL EPEL/Fedora repositories.
Brian Behlendorf [Fri, 15 Nov 2013 17:59:09 +0000 (09:59 -0800)]
Add I/O Read/Write Accounting
Because ZFS bypasses the page cache we don't inherit per-task I/O
accounting for free. However, the Linux kernel does provide helper
functions allow us to perform our own accounting. These are most
commonly used for direct IO which also bypasses the page cache, but
they can be used for the common read/write call paths as well.
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #313
Closes #1275
This is a first draft of a zfs-module-parameters(5) man page. I have
just extracted the parameter name and its description with modinfo,
then checked the source what type it is and its default value.
This will need more work, preferably someone that actually know these
values and what to use them for.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1856
On some platforms symbols provided by libzfs_core and used by
libzfs were not available to the linker. To avoid this issue
libzfs_core has been added to the list of required libraries
when building utilities which depend on libzfs. This should
have been handled properly by libtool and it's still not
entirely clear why it wasn't on all platforms.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1841
There's a missing semicolon and equals sign in the first hunk of this
commit in config/kernel-bdi.m4. This results in the test always
failing. The effects were noticed when rrdtool, a tool which modifies
files by mmap() and msync(), would have data never get saved to disk
in spite of the files working while the mounted filesystem remains
mounted.
Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #1889
Cyril Plisko [Mon, 26 Aug 2013 06:04:38 +0000 (09:04 +0300)]
Tighten zfs dependency on zfs-kmod
Make zfs depend on the same version of zfs-kmod, rather than on same or
better. When yum repository contains a number of versions the dependency
resolution breaks on trying to install non-latest version.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1677
Brian Behlendorf [Wed, 13 Nov 2013 19:05:17 +0000 (11:05 -0800)]
Reduce stack for traverse_visitbp() recursion
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp(). This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath. To minimize the stack usage
for this call path the following changes were made:
1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
to prevent them from being inlined by gcc. This reduced the
stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
vdev_mirror_io_start() from 144 to 128 bytes.
2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
from the stack to the heap. This reduced the stack usage of
zfsdev_ioctl() from 368 to 112 bytes.
3) The major saving came from slimming down traverse_visitbp() from
from 224 to 144 bytes. Since this function is called recursively
the 80 bytes saved per invokation adds up. The following changes
were made:
a) The 'hard' local variable was replaced by a TD_HARD() macro.
b) The 'pd' local variable was replaced by 'td->td_pfd' references.
c) The zbookmark_t was moved to the heap. This does cost us an
additional memory allocation per recursion by that cost should
still be minimal. The cost could be further reduced by adding
a dedicated zbookmark_t slab cache.
d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
restructured to use the minimum amount of stack. This includes
removing the 'cbp' local variable.
Overall for the offending use case roughly 1584 of total stack space
has been saved. This is enough to avoid overflowing the stack on
stock kernels with 8k stacks. See #1778 for additional details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1778
Tim Chase [Sun, 10 Nov 2013 15:00:54 +0000 (09:00 -0600)]
Some nvlist allocations in hold processing need to use KM_PUSHPAGE.
Commit 95fd54a1c5b93bb2aa3e7dffc28c784b1e21a8bb restructured the
hold/release processing and moved some of the work into the sync task.
A number of nvlist allocations now need to use KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1852
Closes #1855
Tim Chase [Sun, 10 Nov 2013 01:22:06 +0000 (19:22 -0600)]
Fix rollback of mounted filesystem regression
The Illumos #3875 patch reverted a part of ZoL's 7b3e34b which added
special-case error handling for zfs_rezget(). The error handling dealt
with the case in which an all-ones object number ended up being passed
to dnode_hold() and causing an EINVAL to be returned from zfs_rezget().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1859
Closes #1861
Bassu [Fri, 8 Nov 2013 21:16:38 +0000 (02:16 +0500)]
Explain 'zfs list -t snap -o name -s name' speedup
Commit 0cee240 from FreeBSD dramatically speeds up 'zfs list'
performance assuming you're only interested in the dataset
names. This optimization should be mentioned in the man page
to allow end users to take advantage of it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1847
Tim Chase [Thu, 7 Nov 2013 05:55:18 +0000 (23:55 -0600)]
Handle concurrent snapshot automounts failing due to EBUSY.
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1819
Massimo Maggi [Sat, 2 Nov 2013 23:40:26 +0000 (00:40 +0100)]
Honor CONFIG_FS_POSIX_ACL kernel option
The required Posix ACL interfaces are only available for kernels
with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix
ACL support for these kernels. All major distribution kernels
enable CONFIG_FS_POSIX_ACL by default.
If your kernel does not support Posix ACLs the following warning
will be printed at ZFS module load time.
"ZFS: Posix ACLs disabled by kernel"
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1825
A couple of kmem_alloc() allocations were using KM_SLEEP in
the sync thread context. These were accidentally introduced
by the recent set of Illumos patches. The solution is to
switch to KM_PUSHPAGE.
George Wilson [Fri, 4 Oct 2013 22:13:23 +0000 (14:13 -0800)]
Illumos #4168, #4169, #4170
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Matthew Ahrens [Fri, 30 Aug 2013 09:19:35 +0000 (01:19 -0800)]
Illumos #4082
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
George Wilson [Thu, 29 Aug 2013 18:56:49 +0000 (10:56 -0800)]
Illumos #3954, #4080, #4081
3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Matthew Ahrens [Wed, 21 Aug 2013 04:11:52 +0000 (20:11 -0800)]
Illumos #4047
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Matthew Ahrens [Wed, 14 Aug 2013 19:42:31 +0000 (11:42 -0800)]
Illumos #3996
3996 want a libzfs_core API to rollback to latest snapshot
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
3956 ::vdev -r should work with pipelines
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:
1. zfs_dbgmsg_print() is only used in userland. Since we do not have
mdb on Linux, it does not make sense to make it available in the
kernel. This means that a build failure will occur if any future
kernel patch depends on it. However, that is unlikely given that
this functionality was added to support zdb.
2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels.
This preserves the existing behavior of minimal noise when running
with -V, and -VV.
3. In vdev_config_generate() the call to nvlist_alloc() was not
changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in
the txg_sync context.
George Wilson [Wed, 7 Aug 2013 18:24:34 +0000 (10:24 -0800)]
Illumos #3949, #3950, #3952, #3953
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. The deadman thread was removed from ztest during the original
port because it depended on Solaris thr_create() interface.
This functionality should be reintroduced using the more
portable pthreads.
Steven Hartland [Tue, 6 Aug 2013 17:50:40 +0000 (09:50 -0800)]
Illumos #3973
3973 zfs_ioc_rename alters passed in zc->zc_name
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Matthew Ahrens [Mon, 29 Jul 2013 18:58:53 +0000 (10:58 -0800)]
Illumos #3834
3834 incremental replication of 'holey' file systems is slow
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Matthew Ahrens [Wed, 3 Jul 2013 16:13:38 +0000 (08:13 -0800)]
Illumos #3836
3836 zio_free() can be processed immediately in the common case
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Matthew Ahrens [Thu, 16 May 2013 21:18:06 +0000 (14:18 -0700)]
Illumos #3112, #3113, #3114
3112 ztest does not honor ZFS_DEBUG
3113 ztest should use watchpoints to protect frozen arc bufs
3114 some leaked nvlists in zfsdev_ioctl
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Amdur <Matt.Amdur@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
The /proc/self/cmd watchpoint interface is specific to Solaris.
Therefore, the #3113 implementation was reworked to use the more
portable mprotect(2) system call. When the pages are watched they
are marked read-only for protection. Any write to the protected
address range immediately trigger a SIGSEGV. The pages are marked
writable again when they are unwatched.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
3875 panic in zfs_root() after failed rollback
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Matthew Ahrens [Mon, 29 Jul 2013 18:55:16 +0000 (10:55 -0800)]
Illumos #3888
3888 zfs recv -F should destroy any snapshots created since
the incremental source
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
1. Commit 1fde1e37208c2f56c72c70a06676676f04b65998 wrapped a
declaration in dsl_dataset_modified_since_lastsnap in ASSERTV().
The ASSERTV() and local variable have been removed to avoid an
unused variable warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Richard Yao <ryao@gentoo.org>
Issue #1775
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Matthew Ahrens [Thu, 20 Jun 2013 22:43:17 +0000 (14:43 -0800)]
Illumos #3829
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Steven Hartland [Wed, 19 Jun 2013 06:36:40 +0000 (22:36 -0800)]
Illumos #3818
3818 zpool status -x should report pools with removed l2arc devices
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Steven Hartland [Sat, 25 May 2013 02:06:23 +0000 (02:06 +0000)]
Illumos #3740
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. 13fe019870c8779bf2f5b3ff731b512cf89133ef introduced a merge conflict
in dsl_dataset_user_release_tmp where some variables were moved
outside of the preprocessor directive.
2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
because this commit refactors the code, adding a new KM_SLEEP
allocation. It is not clear to me whether this should be converted
to KM_PUSHPAGE.
3. We had a merge conflict in libzfs_sendrecv.c because of copyright
notices.
4. Several small C99 compatibility fixed were made.
Will Andrews [Tue, 11 Jun 2013 17:13:47 +0000 (09:13 -0800)]
Illumos #3745, #3811
3745 zpool create should treat -O mountpoint and -m the same
3811 zpool create -o altroot=/xyz -O mountpoint=/mnt ignores
the mountpoint option
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. There is no clear way to distinguish between a failure when we
tried to unmount the snapdir of a zvol (which does not exist)
and the failure when we try to unmount a snapdir of a dataset,
so the changes to zfs_unmount_snap() were dropped in favor of
an altered Linux function that unconditionally returns 0.
Will Andrews [Tue, 11 Jun 2013 17:13:38 +0000 (09:13 -0800)]
Illumos #3743
3743 zfs needs a refcount audit
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Will Andrews [Tue, 11 Jun 2013 17:12:34 +0000 (09:12 -0800)]
Illumos #3742
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Will Andrews [Tue, 11 Jun 2013 17:12:34 +0000 (09:12 -0800)]
Illumos #3741
3741 zfs needs better comments
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Martin Matuska [Thu, 23 May 2013 17:07:25 +0000 (13:07 -0400)]
Illumos #3699, #3739
3699 zfs hold or release of a non-existent snapshot does not output error
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Shrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Adam Leventhal [Wed, 28 Aug 2013 23:05:48 +0000 (16:05 -0700)]
Illumos #3582, #3584
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>
George Wilson [Tue, 23 Apr 2013 17:31:42 +0000 (09:31 -0800)]
Illumos #3642, #3643
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting Notes:
1. The alignment assumptions for the tx_cpu structure assume that
a kmutex_t is 8 bytes. This isn't true under Linux but tc_pad[]
was adjusted anyway for consistency since this structure was
never carefully aligned in ZoL. If careful alignment does impact
performance significantly this should be reworked to be portable.
Matthew Ahrens [Wed, 10 Apr 2013 21:54:56 +0000 (13:54 -0800)]
Illumos #3645, #3692
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Matthew Ahrens [Fri, 8 Mar 2013 18:41:28 +0000 (10:41 -0800)]
Illumos #3598
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. include/sys/zfs_context.h has been modified to render some new
macros inert until dtrace is available on Linux.
2. Linux-specific changes have been adapted to use SET_ERROR().
3. I'm NOT happy about this change. It does nothing but ugly
up the code under Linux. Unfortunately we need to take it to
avoid more merge conflicts in the future. -Brian
Yuri Pankov [Thu, 7 Mar 2013 01:57:09 +0000 (17:57 -0800)]
Illumos #3517
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
Matthew Ahrens [Fri, 5 Jul 2013 19:37:16 +0000 (15:37 -0400)]
Illumos #3603, #3604: bobj improvements
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely
3871 GCC 4.5.3 does not like issue 3604 patch
Reviewed by: Henrik Mattson <henrik.mattson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Note that the patch from Illumos issue 3871 is not accepted into Illumos
at the time of this writing. It is something that I wrote when porting
this. Documentation is in the Illumos issue.
Matthew Ahrens [Fri, 22 Feb 2013 09:23:09 +0000 (01:23 -0800)]
Illumos #3588
3588 provide zfs properties for logical (uncompressed) space
used and referenced
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
George Wilson [Wed, 20 Feb 2013 21:30:36 +0000 (13:30 -0800)]
Illumos #3578, #3579
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
George Wilson [Sun, 17 Feb 2013 20:00:54 +0000 (12:00 -0800)]
Illumos #3561, #3116
3561 arc_meta_limit should be exposed via kstats
3116 zpool reguid may log negative guids to internal SPA history
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
1. The patch was restructured to take advantage of the existing
spa statistics infrastructure. To accomplish this the kstat
was moved in to spa->io_stats and the init/destroy code moved
to spa_stats.c.
2. The I/O kstat was simply named <pool> which conflicted with the
pool directory we had already created. Therefore it was renamed
to <pool>/io
3. An update handler was added to allow the kstat to be zeroed.
Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:
1. ZFSOnLinux had already addressed many of these issues because of
its use of -Wall. However, the manner in which they were addressed
differed. The illumos fixes replace the ones previously made in
ZFSOnLinux to reduce code differences.
2. Part of the upstream patch made a small change to arc.c that might
address zfsonlinux/zfs#1334.
3. The initialization of aclsize in zfs_log_create() differs because
vsecp is a NULL pointer on ZFSOnLinux.
4. The changes to zfs_register_callbacks() were dropped because it
has diverged and needs to be resynced.
Brian Behlendorf [Wed, 30 Oct 2013 18:19:53 +0000 (11:19 -0700)]
Add cstyle.pl utility and cstyle.1 man page
Cstyle is the C source style checker used by Illumos. Since the
original ZFS source was written using these style guidelines they
must also be followed by ZoL for consistency.
The checker has been added to the scripts directory and may be
run on a per file basis. New patches should be careful to avoid
introducing new style warnings.
Additionally, the 'checkstyle' target has been added to the top
level Makefile and can be used to check the entire source tree.
While Zol has historically attempted to follow the SunOS style
guide the lack of a rigorous style checker has allowed various
warning to be introduced. Currently there are 2211 reported
style violations and we want to gradually eliminate these from
the tree.
Note the cstyle.1 man page is provided under man/man1/cstyle.1
but since it is a developer utility it is not installed along
with the other man pages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Yao [Tue, 8 Oct 2013 21:59:42 +0000 (17:59 -0400)]
Fix incorrect usage of strdup() in zfs_unmount_snap()
Modifying the length of a string returned by strdup() is incorrect
because strfree() is allowed to use strlen() to determine which slab
cache was used to do the allocation.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Richard Yao [Mon, 7 Oct 2013 11:30:22 +0000 (07:30 -0400)]
Fix order of function calls in zio_free_sync()
The resolution of a merge conflict when merging Illumos #3464 caused us
to invert the order couple of function calls in zio_free_sync() versus
what they are in Illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Massimo Maggi [Mon, 28 Oct 2013 16:22:15 +0000 (09:22 -0700)]
Posix ACL Support
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems. Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.
By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property. Set the property to
'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.
This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).
http://www.tuxera.com/community/posix-test-suite/
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #170
Brian Behlendorf [Mon, 28 Oct 2013 18:57:15 +0000 (11:57 -0700)]
Improve xattr property documentation
Extend the xattr property section of zfs(8) such that it covers
both styles of supported xattr. A short discussion of the benefits
and drawbacks of each type is presented to allow users to make an
informed choice.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170
Brian Behlendorf [Mon, 28 Oct 2013 16:07:00 +0000 (09:07 -0700)]
Prevent xattr remove from creating xattr directory
Attempting to remove an xattr from a file which does not contain
any directory based xattrs would result in the xattr directory
being created. This behavior is non-optimal because it results
in write operations to the pool in addition to the expected error
being returned.
To prevent this the CREATE_XATTR_DIR flag is only passed in
zpl_xattr_set_dir() when setting a non-NULL xattr value. In
addition, zpl_xattr_set() is updated similarly such that it will
return immediately if passed an xattr name which doesn't exist
and a NULL value.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170