behlendo [Mon, 3 Nov 2008 22:02:15 +0000 (22:02 +0000)]
* spl-09-fix-kmem-track-oops.patch
This fixes an oops when unloading the modules, in the case where memory
tracking was enabled and there were memory leaks. The comment in the
code explains what was the problem.
* spl-10-fix-assert-verify-ndebug.patch
This fixes ASSERT*() and VERIFY*() macros in non-debug builds. VERIFY*()
macros are supposed to check the condition and panic even in production
builds, and ASSERT*() macros don't need to evaluate the arguments.
Also some 32-bit fixes.
behlendo [Mon, 3 Nov 2008 21:51:33 +0000 (21:51 +0000)]
Under Solaris KM_SLEEP ensures success (or at least you hang forever).
That said when working with a finite resource like memory failure really
is always a possibility. It would be far better longer term if the ZFS
code could be weened off this assumption and properly handle the cases
where an allocation fails. Still I've applied the patch to spl-0.3.4
since this layer is supposed to emulate Solaris as closely as possible.
behlendo [Mon, 3 Nov 2008 21:06:04 +0000 (21:06 +0000)]
Add a SPL_AC_TYPE_ATOMIC64_T test to configure for systems which do
already supprt atomic64_t types.
* spl-07-kmem-cleanup.patch
This moves all the debugging code from sys/kmem.h to spl-kmem.c, because
the huge macros were hard to debug and were bloating functions that
allocated memory. I also fixed some other minor problems, including
32-bit fixes and a reported memory leak which was just due to using the
wrong free function.
behlendo [Mon, 3 Nov 2008 20:21:08 +0000 (20:21 +0000)]
Apply a nice fix caught by Ricardo,
* spl-04-fix-taskq-spinlock-lockup.patch
Fixes a deadlock in the BIO completion handler, due to the taskq code
prematurely re-enabling interrupts when another spinlock had disabled
them in the IDE IRQ handler.
behlendo [Mon, 3 Nov 2008 20:07:20 +0000 (20:07 +0000)]
Reviewed and applied spl-01-rm-gpl-symbol-set_cpus_allowed.patch
from Ricardo which removes a dependency on the GPL-only symbol
set_cpus_allowed(). Using this symbol is simpler but in the name
of portability we are adopting a spinlock based solution here
to remove this dependency.
behlendo [Mon, 3 Nov 2008 19:53:23 +0000 (19:53 +0000)]
Reviewed and applied spl-00-rm-gpl-symbol-notifier_chain.patch
from Ricardo which removes a dependency on the GPL-only symbol
needed for a panic time notifier. This funcationality was never
used and this improves our portability.
behlendo [Sun, 10 Aug 2008 03:50:36 +0000 (03:50 +0000)]
Add class / device portability code. Two autoconf tests
were added to cover the 3 possible APIs from 2.6.9 to
2.6.26. We attempt to use the newest interfaces and if
not available fallback to the oldest. This a rework of
some changes proposed by Ricardo for RHEL4.
behlendo [Tue, 5 Aug 2008 04:16:09 +0000 (04:16 +0000)]
Start bringing in Ricardo's spl-00-rhel4-compat.patch, a few chunks
at a time as I audit it. This chunk finishes moving the SPL entirely
off the linux slab on to the SPL implementation. It differs slightly
from the proposed version in that the spl continues to export to
all the Solaris types and functions. These do conflict with the
Linux slab so a module usings these interfaces must not include the
SPL slab if they also intend to use the linux slab. Or they must
explcitly #undef the macros which remap the functioin to their
spl_* equivilants.
A nice side of effect of dropping the entire linux slab is we
don't need to autoconf checks anymore. They kept messing with
the slab API endlessly!
- Remove hash functionality from slab in favor of direct lookups
based of the spl_kmem_obj_t tacked on the end of each object.
This actually isn't so back because we are now allocing large
chunks for the slab and partitioning it ourselves. So there's
not a ton of wasted space. We may suffer a performance hit
however due to alignment issues.
- Remove remaining depenancies on the linux slab implementation.
We're standing on our own now for better or worse.
- Rework slabs to be either kmem or vmem based. If neither
KMC_VMEM of KMC_KMEM are specified we make a decent guess
about what will work best for their based on the object
size. Additionally we provide a kmem_virt() function caller
can use to see if they have a virtual or physical address.
behlendo [Sat, 28 Jun 2008 20:03:11 +0000 (20:03 +0000)]
Remove stray call to spl_cache_free() and remove all the
cycle count which was costing me overhead. It was hurting
performance pretty badly for heavily used caches. I'm also
thinking the hash may be hurting me as well and it might
be worth sticking a pointer in to a little space after the
alloced object.
behlendo [Sat, 28 Jun 2008 05:04:46 +0000 (05:04 +0000)]
Victory! I've reworked caches with large objects which are
based by vmalloc()'ed memory. I now alloc a slab which is
roughly 32*spl_obj_size and in this block of memory I place
the slab descriptor, slab object descriptors, and objects
themselves. This greatly reduces vmalloc lock contention.
Still some minor cleanup remains and fine tuning but
it's working pretty well.
behlendo [Fri, 27 Jun 2008 21:40:11 +0000 (21:40 +0000)]
Further slab improvements, I'm getting close to something which works
well for the expected workloads. Improvement in this commit include:
- Added DEBUG_KMEM_TRACKING #define which can optionally be set
when DEBUG_KMEM is defined to do per allocation tracking. This
allows us to get all the lightweight kmem debugging enabled by
default which is pretty light weight, and only when looking
for a memory leak we can briefly enable the per alloc tracking.
- Added set_normalized_timespec() in to SPL to simply using
the timespec() primatives from within a module.
- Added per-spinlock cycle counters to the slab in an attempt
to run down a lock contention issue. The contended lock
was in vmalloc() but I'm going to leave the cycle counters
in place for a little while until I'm convinced there arn't
other locking improvement possible in the slab.
- Added a proc interface to the slab to export per slab
cache statistics to /proc/spl/kmem/slab for analysis.
- Reworked spl_slab_alloc() function to allocate from kmem for
small allocation and vmem for large allocations. This improved
things considerably but futher work is needed.
behlendo [Thu, 26 Jun 2008 19:49:42 +0000 (19:49 +0000)]
Fix for memory corruption caused by overruning the magazine
when repopulating it. Plus I fixed a few more suble races in
that part of the code which were catching me. Finally I fixed
a small race in kmem_test8.
behlendo [Wed, 25 Jun 2008 20:57:45 +0000 (20:57 +0000)]
Implement per-cpu local caches. This seems to have bough me another
factor of 10x improvement on SMP system due to reduced lock contention.
This may put me in the ballpark of what is needed. We can still further
improve things on NUMA systems by creating an additional L3 cache per
memory node instead of the current global pool. With luck this won't
be needed. I should also take another look at the locking now that
everything is working. There's a good chance I can tighten it up a
little bit and improve things a little more.
behlendo [Tue, 24 Jun 2008 17:18:15 +0000 (17:18 +0000)]
The first locking issue was due to the semaphore I used. I was trying
to be overly clever and the context switch when the semaphore was busy
was destroying performance. Converting to a simple spin lock bough me
a factor of 50 or so. That said it's still not good enough. Tests
show bad performance and we are still CPU bound. The logical fix is
I need to implement per-cpu hot caches to minimize the SMP contention.
Linux and Solaris both have this, I was hoping to do without but it
looks like that's not to be.
behlendo [Mon, 23 Jun 2008 23:54:52 +0000 (23:54 +0000)]
Add another kmem test to check for lock contention in the slab
allocator. I have serious contention issues here and I needed
a way to easily measure how much the following batch of changes
will improve things. Currently things are quite bad when the
allocator is highly contended, and interestingly it seems to
get worse in a non-linear fashion... I'm not sure why yet.
I'll figure it out tomorrow.
behlendo [Fri, 13 Jun 2008 23:41:06 +0000 (23:41 +0000)]
* : modules/sys/kmem-slab.c : Re-implemented the slab to no
longer be based on the linux slab but to be its own complete
implementation. The new slab behaves much more like the
Solaris slab than the Linux slab.
behlendo [Wed, 4 Jun 2008 06:00:46 +0000 (06:00 +0000)]
Fixes:
1) Ensure mutex_init() never fails in the case of ENOMEM by retrying
forever. I don't think I've ever seen this happen but it was clear
after code inspection that if it did we would immediately crash.
2) Enable full debugging in check.sh for sanity tests. Might as well
get as much debug as we can in the case of a failure.
3) Reworked list of kmem caches tracked by SPL in to a hash with the
key based on the address of the kmem_cache_t. This should speed
up the constructor/destructor/shrinker lookup needed now for newer
kernel which removed the destructor support.
4) Updated kmem_cache_create to handle the case where CONFIG_SLUB
is defined. The slub would occasionally merge slab caches which
resulted in non-unique keys for our hash lookup in 3). To fix this
we detect if the slub is enabled and then set the needed flag
to prevent this merging from ever occuring.
5) New kernels removed the proc_dir_entry pointer from items
registered by sysctl. This means we can no long be sneaky and
manually insert things in to the sysctl tree simply by walking
the proc tree. So I'm forced to create a seperate tree for
all the things I can't easily support via sysctl interface.
I don't like it but it will do for now.
behlendo [Tue, 3 Jun 2008 20:58:55 +0000 (20:58 +0000)]
Fix missing return resulting in a double unlock of &files->file_lock
and a hang on subsequent sys_close. I'm not quite sure why the Fedora
kernel caught this bug the Chaos kernel did not, but I'm glad!
behlendo [Mon, 2 Jun 2008 17:28:49 +0000 (17:28 +0000)]
Breaking the world for a little bit. If anyone is going to continue
working on this branch for the next few days I suggested you work
off of the 0.3.1 tag. The following changes are fairly extensive
and are designed to make the SPL compatible with all kernels in
the range of 2.6.18-2.6.25. There were 13 relevant API changes
between these releases and I have added the needed autoconf tests
to check for them. However, this has not all been tested extensively.
I'll sort of the breakage on Fedora Core 9 and RHEL5 this week.
behlendo [Mon, 19 May 2008 02:49:12 +0000 (02:49 +0000)]
- Properly fix the debug support for all the ASSERT's, VERIFIES, etc can be
compiled out when doing performance runs.
- Bite the bullet and fully autoconfize the debug options in the configure
time parameters. By default all the debug support is disable in the core
SPL build, but available to modules which enable it when building against
the SPL. To enable particular SPL debug support use the follow configure
options:
behlendo [Thu, 15 May 2008 17:10:30 +0000 (17:10 +0000)]
Rework condition variable implementation to be consistent with
other primitive implementations. Additionally ensure that GFP_ATOMIC
is use for allocations when in interrupt context.
behlendo [Mon, 12 May 2008 18:54:08 +0000 (18:54 +0000)]
Enhanse the thread interface to do something quasi inteligent
with the function name passed to be used as a thread name. Leaving
the trailing _thread is just redundant so just strip it this
make the thread names far more readable.
behlendo [Fri, 9 May 2008 21:21:33 +0000 (21:21 +0000)]
Stability hack. Under Solaris when KM_SLEEP is set kmem_cache_alloc()
may not fail. To get this behavior I'd added a retry to the shim layer
even though it is abusive to the VM, at least it should prevent the crash.
Additionally I added a proc counter so I can easily check how often this
is happening. It should be fairly rare, but likely will get worse and
worse the longer the machine has been up.
behlendo [Thu, 8 May 2008 23:21:47 +0000 (23:21 +0000)]
Add an almost feature complete implemenation of kstat. I chose
not to support a few flags (we assert if they are used), and I
did not add the libkstat interface and instead exported everything
to proc for easy access.
behlendo [Tue, 6 May 2008 20:38:28 +0000 (20:38 +0000)]
Lots of fixes here:
- Detailed kmem memory allocation tracking. We can now get on
spl module unload a list of all memory allocations which were
not free'd and where the original alloc was. E.g.
- Shift to using rwsems in kmem implmentation, to simply locking and
improve concurency.
- Shift to using rwsems in mutex implementation, additionally ensure we
never sleep in the init function if non-zero preempt_count or
interrupts are disabled as can happen in a slab cache ctor/dtor.
- Other minor formating fixes and such.
TODO:
- Finish the vmem memory allocation tracking
- Vet all other SPL primatives for potential sleeping during *_init. I
suspect the rwlock implemenation does this and should be fixes just
like the mutex implemenation.
behlendo [Mon, 5 May 2008 20:18:49 +0000 (20:18 +0000)]
Commit adaptive mutexes. This seems to have introduced some new
crashes but it's not clear to me yet if these are a problem with
the mutex implementation or ZFSs usage of it.
Minor taskq fixes to add new tasks to the end of the pending list.
New an improved taskq implementation for the SPL. It allows a
configurable number of threads like the Solaris version and almost
all of the options are supported. Unfortunately, it appears to have
made absolutely no difference to our performance numbers. I need
to keep looking for where we are bottle necking.
Make sure that when calling __vmem_alloc that we
do not have __GFP_ZERO set. Once the memory is allocated
then zero out the memory if __GFP_ZERO is passed to
__vmem_alloc.
Be careful to never use any of the debug infrastructure either
before the debug subsystem is fully set up, or after the debug
subsystem has been torn down.
Stack usage is my enemy. Trade cpu cycles in the debug code to
ensure I never add anything to the stack I don't absolutely need.
All this debug code could be removed from a production build
anyway so I'm not so worried about the performance impact. We
may also consider revisting the mutex and condvar implementation
to ensure no additional stack is used there.
Initial indications are I have reduced the worst case stack
usage to 9080 bytes. Still to large for the default 8k stacks
so I have been forced to run with 16k stacks until I can
reduce the worst offenders.
More fixes to ensure we get good debug logs even if we're in the
process of destroying the stacks. Threshhold set fairly aggressively
top 80% of stack usage.
Update SPL to use new debug infrastructure. This means:
- Replacing all BUG_ON()'s with proper ASSERT()'s
- Using ENTRY,EXIT,GOTO, and RETURN macro to instument call paths
First commit of lustre style internal debug support. These
changes bring over everything lustre had for debugging with
two exceptions. I dropped by the debug daemon and upcalls
just because it made things a little easier. They can be
readded easily enough if we feel they are needed.
Everything compiles and seems to work on first inspection
but I suspect there are a handful of issues still lingering
which I'll be sorting out right away. I just wanted to get
all these changes commited and safe. I'm getting a little
paranoid about losing them.
* modules/spl/spl-kmem.c : Make sure to disable interrupts
when necessary to avoid deadlocks. We were seeing the deadlock
when calling kmem_cache_generic_constructor() and then an interrupt
forced us to end up calling kmem_cache_generic_destructor()
which caused our deadlock.
- Add some spinlocks to cover all the private data in the mutex. I don't
think this should fix anything but it's a good idea regardless.
- Drop the lock before calling the construct/destructor for the slab
otherwise we can't sleep in a constructor/destructor and for long running
functions we may NMI.
- Do something braindead, but safe for the console debug logs for now.
Add hw_serial support based on a usermodehelper which runs
at spl module load time can calls hostid. The resolved hostid
is then fed back in to a proc entry for latter use. It's
not a pretty thing, but it will work for now. The hw_serial
is required for things such as 'zpool status' to work.
Adjust the condition variables to simply sleep uninteruptibly.
This way we don't have to contend with superious wakeups which
it appears ZFS is not so careful to handle anyway. So this is
probably for the best.
Fix race in rwlock implementation which can occur when
your task is rescheduled to a different cpu after you've
taken the lock but before calling RW_LOCK_HELD is called.
We need the spinlock to ensure there is a wmb() there.
- Fix write-only behavior in vn-open()
- Ensure we have at least 1 write-only splat test
- Fix return codes for vn_* Solaris does not use negative return
codes in the kernel. So linux errno's must be inverted.
Update the thread shim to use the current kernel threading API.
We need to use kthread_create() here for a few reasons. First
off to old kernel_thread() API functioin will be going away.
Secondly, and more importantly if I use kthread_create() we can
then properly implement a thread_exit() function which terminates
the kernel thread at any point with do_exit(). This fixes our
cleanup bug which was caused by dropping a mutex twice after
thread_exit() didn't really exit.
Correctly implement atomic_cas_ptr() function. Ideally all of these
atomic operations will be rewritten anyway with the correct arch
specific assembly. But not today.
- Remapped ldi_handle_t to struct block_device * which is much more useful
- Added liunx block device headers to sunldi.h
- Made __taskq_dispatch safe for interrupt context where it turns out we
need to be useing it.
- Fixed NULL const/dest bug for kmem slab caches
- Places debug __dprintf debugging messages under a spin_lock_irqsave
so it's safe to use then in interrupt handlers. For debugging only!
Apparently it's OK for done to be NULL, which was not clear in the
Solaris man page. Anyway, since apparently this usage is accectable
I've updated the function to handle it.
Double large kmalloc warning size to 4 pages. It was 2 pages, and ideally
it should be dropped to one page but in the short term we should be able
to easily live with 4 page allocations.
Fix the nvlist bug, it turns out the user space side of things were
packing the nvlists correctly as little endian, and the kernel space
side of things due to a missing #define were unpacking them as big endian.