Henrik Gramner [Sat, 11 May 2013 21:39:09 +0000 (23:39 +0200)]
x86inc: Utilize the shadow space on 64-bit Windows
Store XMM6 and XMM7 in the shadow space in functions that clobbers them.
This way we don't have to adjust the stack pointer as often,
reducing the number of instructions as well as code size.
~2x faster coeff_level_run.
Faster CAVLC encoding: {1%,2%,7%} overall with {superfast,medium,slower}.
Uses the same pshufb LUT abuse trick as in the previous ads_mvs patch.
Steve Borho [Thu, 21 Feb 2013 18:48:40 +0000 (12:48 -0600)]
OpenCL lookahead
OpenCL support is compiled in by default, but must be enabled at runtime by an
--opencl command line flag. Compiling OpenCL support requires perl. To avoid
the perl requirement use: configure --disable-opencl.
When enabled, the lookahead thread is mostly off-loaded to an OpenCL capable GPU
device. Lowres intra cost prediction, lowres motion search (including subpel)
and bidir cost predictions are all done on the GPU. MB-tree and final slice
decisions are still done by the CPU. Presets which do not use a threaded
lookahead will not use OpenCL at all (superfast, ultrafast).
Because of data dependencies, the GPU must use an iterative motion search which
performs more total work than the CPU would do, so this is not work efficient
or power efficient. But if there are spare GPU cycles to spare, it can often
speed up the encode. Output quality when OpenCL lookahead is enabled is often
very slightly worse in quality than the CPU quality (because of the same data
dependencies).
x264 must compile its OpenCL kernels for your device before running them, and in
order to avoid doing this every run it caches the compiled kernel binary in a
file named x264_lookahead.clbin (--opencl-clbin FNAME to override). The cache
file will be ignored if the device, driver, or OpenCL source are changed.
x264 will use the first GPU device which supports the required cl_image
features required by its kernels. Most modern discrete GPUs and all AMD
integrated GPUs will work. Intel integrated GPUs (up to IvyBridge) do not
support those necessary features. Use --opencl-device N to specify a number of
capable GPUs to skip during device detection.
Switchable graphics environments (e.g. AMD Enduro) are currently not supported,
as some have bugs in their OpenCL drivers that cause output to be silently
incorrect.
Developed by MulticoreWare with support from AMD and Telestream.
Fiona Glaser [Mon, 4 Mar 2013 23:19:47 +0000 (15:19 -0800)]
weightp: improve scale/offset search, chroma
Rescale the scale factor if the offset clips. This makes weightp more effective
in fades to/from white (and an other situation that requires big offsets).
Search more than 1 scale factor and more than 1 offset, depending on --subme.
Try to find the optimal chroma denominator instead of hardcoding it.
Overall improvement: a few percent in fade-heavy clips, such as a sample from
Avatar: TLA.
Fiona Glaser [Tue, 19 Feb 2013 21:48:44 +0000 (13:48 -0800)]
Add slices-max feature
The H.264 spec technically has limits on the number of slices per frame. x264
normally ignores this, since most use-cases that require large numbers of
slices prefer it to. However, certain decoders may break with extremely large
numbers of slices, as can occur with some slice-max-size/mbs settings.
When set, x264 will refuse to create any slices beyond the maximum number,
even if slice-max-size/mbs requires otherwise.
Fiona Glaser [Fri, 15 Feb 2013 01:22:02 +0000 (17:22 -0800)]
Add slice-min-mbs feature
Works in conjunction with slice-max-mbs and/or slice-max-size to avoid overly
small slices.
Useful with certain decoders that barf on extremely small slices.
If slice-min-mbs would be violated as a result of slice-max-size, x264 will
exceed slice-max-size and print a warning.
Fiona Glaser [Fri, 8 Feb 2013 23:34:38 +0000 (15:34 -0800)]
quant_4x4x4: quant one 8x8 block at a time
This reduces overhead and lets us use less branchy code for zigzag, dequant,
decimate, and so on.
Reorganize and optimize a lot of macroblock_encode using this new function.
~1-2% faster overall.
Includes NEON and x86 versions of the new function.
Using larger merged functions like this will also make wider SIMD, like
AVX2, more effective.
Add AvxSynth support to the AviSynth input module.
Uses dlopen to load AvxSynth on Linux and OS X.
Allows the use of --demuxer avs for AvxSynth, though the only source filter it
can currently use is FFMS2.
Add a local copy of avxsynth_c.h and its dependent headers in extras/ so that
users don't need to actually have AvxSynth development headers installed to
enable support for it (mirroring the AviSynth behavior).
Fiona Glaser [Sat, 2 Feb 2013 20:37:08 +0000 (12:37 -0800)]
x86: detect Bobcat, improve Atom optimizations, reorganize flags
The Bobcat has a 64-bit SIMD unit reminiscent of the Athlon 64; detect this
and apply the appropriate flags.
It also has an extremely slow palignr instruction; create a flag for this to
avoid massive penalties on palignr-heavy functions.
Improve Atom function selection and document exactly what the SLOW_ATOM flag
covers.
Add Atom-optimized SATD/SA8D/hadamard_ac functions: simply combine the ssse3
optimizations with the sse2 algorithm to avoid pmaddubsw, which is slow on
Atom along with other SIMD multiplies.
Drop TBM detection; it'll probably never be useful for x264.
Invert FastShuffle to SlowShuffle; it only ever applied to one CPU (Conroe).
Detect CMOV, to fail more gracefully when run on a chip with MMX2 but no CMOV.
Fiona Glaser [Sat, 19 Jan 2013 06:55:46 +0000 (22:55 -0800)]
x86: optimize and clean up predictor checking
Branchlessly handle elimination of candidates in MMX roundclip asm.
Add a new asm function, similar to roundclip, except without the round part.
Optimize and organize the C code, and make both subme>=3 and subme<3 consistent.
Add lots of explanatory comments and try to make things a little more understandable.
~5-10% faster with subme>=3, ~15-20% faster with subme<3.
Fiona Glaser [Thu, 10 Jan 2013 21:15:52 +0000 (13:15 -0800)]
Improve lookahead-threads auto selection
Smarter decision to improve fast-first-pass performance in 2-pass encodes.
Dramatically improves CPU utilization on multi-core systems.
Tested on a quad-core Ivy Bridge (12 threads, 1080p):
Fast first pass:
veryfast: ~7% faster
faster: ~11% faster
fast/medium: ~15% faster
slow/slower: ~42% faster
veryslow: ~55% faster
CRF/1-pass:
veryfast: ~9% faster
(all others remained the same)
Diego Biurrun [Thu, 17 Jan 2013 19:30:37 +0000 (11:30 -0800)]
x86inc: rename program_name to private_prefix
Synced from libav.
The new name is more descriptive and will allow defining a separate public
prefix for externally visible library symbols.
Ronald S. Bultje [Wed, 30 Jan 2013 17:48:14 +0000 (09:48 -0800)]
x86-32: use simple nop codes for <= sse
The "CentaurHauls family 6 model 9 stepping 8" family of CPUs (flags:
fpu vme de pse tsc msr cx8 sep mtrr pge mov pat mmx fxsr sse up rng
rng_en ace ace_en) SIGILLs on long nop codes.
Henrik Gramner [Tue, 11 Dec 2012 15:05:34 +0000 (16:05 +0100)]
x86inc: Use VEX-encoded instructions in AVX functions
Automatically use VEX-encoding in AVX/AVX2/XOP/FMA3/FMA4 functions for all instructions that exists in a VEX-encoded version.
This change makes it easier to extend existing code to use AVX2.
Also add support for AVX emulation of a few instructions that were missing before.
Loren Merritt [Sun, 2 Dec 2012 15:56:30 +0000 (15:56 +0000)]
x86inc: activate REP_RET automatically
Now RET checks whether it immediately follows a branch, so the programmer dosen't have to keep track of that condition.
REP_RET is still needed manually when it's a branch target, but that's much rarer.
The implementation involves lots of spurious labels, but that's ok because we strip them.
x86inc: support stack mem allocation and re-alignment in PROLOGUE
Use this in 8-bit loopfilter functions so they can be used if
there is no aligned stack (e.g. x86-32 MSVC or ICC 10.x).
Anton Khirnov [Tue, 13 Nov 2012 20:01:24 +0000 (21:01 +0100)]
lavf input: allocate AVFrame correctly
Allocate AVFrames correctly with avcodec_alloc_frame().
This caused crashes with newer libavcodecs that try to free frame extradata.