Implement per-cpu local caches

This seems to have bought me another 10x improvement on SMP systems
due to reduced lock contention, which may put me in the ballpark of
what is needed. We can still improve things further on NUMA systems
by creating an additional L3 cache per memory node in place of the
current global pool. With luck this won't be needed. I should also
take another look at the locking now that everything is working.
There's a good chance I can tighten it up a little and squeeze out a
bit more performance.
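
For illustration only, here is a minimal user-space sketch of the general
idea: a small lock-free array of objects per CPU, with the lock taken only
when a local cache runs empty or full. It is not the code in this commit.
The names (cache_alloc, cache_free, struct magazine), the sizes, and the
lock_hits counter are all hypothetical, malloc/free stand in for the global
pool, and the caller passes the CPU index explicitly where a kernel would
use the current CPU id with preemption disabled.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NCPU     4   /* hypothetical CPU count */
#define MAG_SIZE 8   /* hypothetical per-CPU cache capacity */
#define OBJ_SIZE 64  /* hypothetical object size in bytes */

/* Per-CPU local cache: touched only by its owning CPU, so no lock. */
struct magazine {
    void *objs[MAG_SIZE];
    int   avail;
};

struct cache {
    pthread_mutex_t lock;       /* protects the global pool only */
    unsigned long   lock_hits;  /* how often the slow path took the lock */
    struct magazine cpu[NCPU];
};

static void cache_init(struct cache *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->lock_hits = 0;
    for (int i = 0; i < NCPU; i++)
        c->cpu[i].avail = 0;
}

static void *cache_alloc(struct cache *c, int cpu)
{
    struct magazine *m;

    assert(cpu >= 0 && cpu < NCPU);
    m = &c->cpu[cpu];

    if (m->avail == 0) {
        /* Slow path: refill the whole local cache under one lock hold. */
        pthread_mutex_lock(&c->lock);
        c->lock_hits++;
        while (m->avail < MAG_SIZE) {
            void *obj = malloc(OBJ_SIZE);  /* stand-in for the global pool */
            if (obj == NULL)
                break;
            m->objs[m->avail++] = obj;
        }
        pthread_mutex_unlock(&c->lock);
        if (m->avail == 0)
            return NULL;
    }

    /* Fast path: pop from the local cache without locking. */
    return m->objs[--m->avail];
}

static void cache_free(struct cache *c, int cpu, void *obj)
{
    struct magazine *m;

    assert(cpu >= 0 && cpu < NCPU);
    m = &c->cpu[cpu];

    if (m->avail == MAG_SIZE) {
        /* Slow path: return half of the local cache to the global pool. */
        pthread_mutex_lock(&c->lock);
        c->lock_hits++;
        while (m->avail > MAG_SIZE / 2)
            free(m->objs[--m->avail]);
        pthread_mutex_unlock(&c->lock);
    }

    /* Fast path: push onto the local cache without locking. */
    m->objs[m->avail++] = obj;
}
```

In this sketch, MAG_SIZE allocations followed by MAG_SIZE frees on one CPU
take the lock once instead of 2 * MAG_SIZE times, which is the kind of
contention reduction described above.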

kmem_lock: time (sec)   slabs          objs                hash
kmem_lock:              tot/max/calc   tot/max/calc        size/depth
kmem_lock: 0.000999926  6/6/1          192/192/32          32768/0
kmem_lock: 0.000999926  4/4/2          128/128/64          32768/0
kmem_lock: 0.000999926  4/4/4          128/128/128         32768/0
kmem_lock: 0.000999926  4/4/8          128/128/256         32768/0
kmem_lock: 0.000999926  4/4/16         128/128/512         32768/0
kmem_lock: 0.000999926  4/4/32         128/128/1024        32768/0
kmem_lock: 0.000999926  4/4/64         128/128/2048        32768/0
kmem_lock: 0.000999926  8/8/128        256/256/4096        32768/0
kmem_lock: 0.003999704  24/23/256      768/736/8192        32768/1
kmem_lock: 0.012999038  44/41/512      1408/1312/16384     32768/1
kmem_lock: 0.051996153  96/93/1024     3072/2976/32768     32768/2
kmem_lock: 0.181986536  187/184/2048   5984/5888/65536     32768/3
kmem_lock: 0.655951469  342/339/4096   10944/10848/131072  32768/4

git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@136 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c