Anton Khirnov [Tue, 13 Nov 2012 20:01:24 +0000 (21:01 +0100)]
lavf input: allocate AVFrame correctly
Allocate AVFrames correctly with avcodec_alloc_frame().
This caused crashes with newer libavcodecs that try to free frame extradata.
Attempt to optimize PPS pic_init_qp in 2-pass mode
Small compression improvement; up to ~0.5% in extreme cases.
Helps more with small slice sizes (tiny resolutions or slice-max-size).
Note that this changes the 2-pass stats file format.
Improve slice header QP selection
Use the first macroblock of each slice instead of the last of the previous.
Lets us pick a reasonable initial QP for the first slice too.
Slightly improved compression.
Fiona Glaser [Thu, 11 Oct 2012 20:27:48 +0000 (13:27 -0700)]
Update level dpb size calculation to match newer H.264 spec
Doesn't actually change encoding behavior, but makes it more correct.
Warning messages should now be accurate at higher bit depths and non-4:2:0.
Technically, since it redefines x264_level_t, this is an API version increment.
Diego Biurrun [Wed, 31 Oct 2012 19:23:54 +0000 (12:23 -0700)]
x86inc: only define program_name if the macro is unset.
This allows overriding the value from outside the file.
This can be useful if x86inc.asm is used outside of x264.
Brad Smith [Tue, 21 Aug 2012 06:58:19 +0000 (23:58 -0700)]
Remove special-casing for OpenBSD pthread handling
Previously it was policy to use -pthread, but OpenBSD now recommends -lpthread.
its been libpthread anyway and policy has changed to stop using -pthread.
Faster predictor checking with subme<3
Fix a typo that made an early-skip less effective.
Avoid a relatively unpredictable branch.
Slightly changed output due to the typo-fix.
~50 cycles faster on Core i7.
Fiona Glaser [Tue, 26 Jun 2012 01:01:29 +0000 (18:01 -0700)]
Try 8x8 transform analysis even when sub8x8 partitions are present
Turn off the sub8x8 partitions, try it, and turn them back on if it didn't help.
Small compression improvement with p4x4 on (~0.1-0.5%).
Also update related comments.
Fiona Glaser [Sat, 9 Jun 2012 01:19:59 +0000 (18:19 -0700)]
Support changing resolutions between passes with macroblock-tree
Implement a basic separable bilinear filter to rescale the quantizer offsets.
Structure inspired by swscale, but floating-point instead of fixed-point.
Not as optimized as it could be, but it's quite fast already.
Example compression penalties on a 720p video game recording:
First pass with 720p and second as 480p: ~-1.5% (vs. same res)
First pass with 480p and second as 720p: ~-3% (vs. same res)
x86inc: import patches from libav
Allow manual invocation of WIN64_SPILL_XMM even under INIT_MMX
SSE version of mova is movaps rather than movdqa.
YMM version of movnta.
Add mp size for named arguments.
Fix DEFINE_ARGS when used outside of a cglobal.
Define a few more cpuflags.
3-argument wrappers for a few more instructions.
Fiona Glaser [Tue, 8 May 2012 22:42:56 +0000 (15:42 -0700)]
Threaded lookahead
Split each lookahead frame analysis call into multiple threads. Has a small
impact on quality, but does not seem to be consistently any worse.
This helps alleviate bottlenecks with many cores and frame threads. In many
case, this massively increases performance on many-core systems. For example,
over 100% faster 1080p encoding with --preset veryfast on a 12-core i7 system.
Realtime 1080p30 at --preset slow should now be feasible on real systems.
For sliced-threads, this patch should be faster regardless of settings (~10%).
By default, lookahead threads are 1/6 of regular threads. This isn't exacting,
but it seems to work well for all presets on real systems. With sliced-threads,
it's the same as the number of encoding threads.
Fiona Glaser [Thu, 29 Mar 2012 21:14:07 +0000 (14:14 -0700)]
Add mb_info API for signalling constant macroblocks
Some use-cases of x264 involve encoding video with large constant areas of the frame.
Sometimes, the caller knows which areas these are, and can tell x264.
This API lets the caller do this and adds internal tracking of modifications to macroblocks to avoid problems.
This is really only suitable without B-frames.
An example use-case would be using x264 for VNC.
Steven Walters [Thu, 29 Mar 2012 01:15:04 +0000 (21:15 -0400)]
ICL/MSVS: Fix shared library generation and usage
MSVS requires exported variables to be declared with the DATA keyword, and requires that imported variables be declared with dllimport.
This does not fix x264 cli being unable to use a shared library built by ICL however.
Anton Mitrofanov [Mon, 12 Mar 2012 06:08:18 +0000 (23:08 -0700)]
Fix clobbering of mutex/cvs
Regression in r2183.
Bizarrely seemed to work on many platforms, but crashed on win64 and may have been slower.
Only affected sliced threads during encoding, but could cause crashes on x264 encoder close even without sliced threads.
Fiona Glaser [Fri, 24 Feb 2012 21:34:39 +0000 (13:34 -0800)]
Sliced-threads: do hpel and deblock after returning
Lowers encoding latency around 14% in sliced threads mode with preset superfast.
Additionally, even if there is no waiting time between frames, this improves parallelism, because hpel+deblock are done during the (singlethreaded) lookahead.
For ease of debugging, dump-yuv forces all of the threads to wait and finish instead of setting b_full_recon.
Fiona Glaser [Wed, 22 Feb 2012 21:33:36 +0000 (13:33 -0800)]
x86inc: switch to amdnops
Recent AMD CPUs' instruction decoders choke horribly on extremely long nops (i.e. with 4 prefixes).
Won't affect much, since we don't use ALIGN much.
Fiona Glaser [Wed, 15 Feb 2012 00:54:03 +0000 (16:54 -0800)]
BMI1 decimate functions
Intel was nice enough to make tzcnt equal to "rep bsf", which is backwards-compatible.
This means we don't actually have to add new functions to make it work.
Fiona Glaser [Thu, 9 Feb 2012 22:23:52 +0000 (14:23 -0800)]
Add row-reencoding support to VBV for improved accuracy
Extremely accurate, possibly 100% so (I can't get it to fail even with difficult VBVs).
Does not yet support rows split on slice boundaries (occurs often with slice-max-size/mbs).
Still inaccurate with sliced threads, but better than before.
Add an small per-MB cost penalty for lowres
Helps avoid VBV predictors going nuts with very low-cost MBs.
One particular case this fixes is zero-cost MBs: adaptive quantization decreases the QP a lot, but (before this patch), no cost penalty gets factored in for this, because anything times zero is zero.
Henrik Gramner [Wed, 1 Feb 2012 22:52:48 +0000 (23:52 +0100)]
Fix incorrect zero-extension assumptions in x86_64 asm
Some x264 asm assumed that the high 32 bits of registers containing "int" values would be zero.
This is almost always the case, and it seems to work with gcc, but it is *not* guaranteed by the ABI.
As a result, it breaks with some other compilers, like Clang, that take advantage of this in optimizations.
Accordingly, fix all x86 code by using intptr_t instead of int or using movsxd where neccessary.
Also add checkasm hack to detect when assembly functions incorrectly assumes that 32-bit integers are zero-extended to 64-bit.
Fiona Glaser [Thu, 19 Jan 2012 22:56:54 +0000 (14:56 -0800)]
TBM, AVX2, FMA3, BMI1, and BMI2 CPU detection support
TBM and BMI1 are supported by Trinity/Piledriver.
The others (and BMI1) will probably appear in Intel's upcoming Haswell.
Also update x86inc with AVX2 stuff.
Anton Mitrofanov [Sun, 22 Jan 2012 18:13:52 +0000 (22:13 +0400)]
Use TV range algorithm for bit-depth conversions
Such sources are more common, so better to be correct for the common case.
This also produces less error for the case of full range than the previous algorithm produced for the case of TV range.
Loren Merritt [Fri, 6 Jan 2012 15:53:29 +0000 (15:53 +0000)]
CABAC trellis opts part 4: x86_64 asm
Another 20% faster.
18k->12k codesize.
This patch series may have a large impact on encoding speed.
For example, 24% faster at --preset slower --crf 23 with 720p parkjoy.
Overall speed increase is proportional to the cost of trellis (which is proportional to bitrate, and much more with --trellis 2).
Loren Merritt [Thu, 22 Dec 2011 17:56:06 +0000 (17:56 +0000)]
CABAC trellis opts part 2: C optimizations
Hoist the branch on coef value out of the loop over node contexts.
Special cases for each possible coef value (0,1,n).
Special case for dc-only blocks.
Template the main loop for two common subsets of nodes, to avoid a bunch of branches about which nodes are live.
Use the nonupdating version of cabac_size_decision in more cases, and omit those bins from the node struct.
CABAC offsets are now compile-time constants.
Change TRELLIS_SCORE_MAX from a specific constant to anything negative, which is cheaper to test.
Remove dct_weight2_zigzag[], since trellis has to lookup zigzag[] anyway.