Fiona Glaser [Wed, 31 Mar 2010 08:44:07 +0000 (01:44 -0700)]
Massive cosmetic and syntax cleanup
Convert all applicable loops to use C99 loop index syntax.
Clean up most inconsistent syntax in ratecontrol.c, visualize, ppc, etc.
Replace log(x)/log(2) constructs with log2, and similar with log10.
Fix all -Wshadow violations.
Fix visualize support.
Fiona Glaser [Fri, 26 Mar 2010 22:33:20 +0000 (15:33 -0700)]
New "superfast" preset, much faster intra analysis
Especially at the fastest settings, intra analysis was taking up the majority of MB analysis time.
This patch takes a ton more shortcuts at the fastest encoding settings, decreasing compression 0.5-5% but improving speed greatly.
Also rearrange the fastest presets a bit: now we have ultrafast, superfast, veryfast, faster.
superfast is the old veryfast (but much faster due to this patch).
veryfast is between the old veryfast and faster.
faster is the same as before except with MB-tree on.
Encoding with subme >= 5 should be unaffected by this patch.
Fiona Glaser [Tue, 23 Mar 2010 21:00:58 +0000 (14:00 -0700)]
Add tune for still image compression
There has been some demand for this from companies looking to use x264 for still image compression (it can outperform JPEG or JPEG-2000 by a factor of 2 or more).
Still image compression is a bit different; because temporal stability isn't an issue, we can get away with far more powerful psy settings.
Fiona Glaser [Sun, 21 Mar 2010 00:07:12 +0000 (17:07 -0700)]
Much faster non-RD intra analysis
Since every pred mode costs at least 1 bit, move that part into the initial SATD cost.
This lets i4x4/i8x8 analysis terminate earlier.
If the cost of the predicted mode is less than the cost of signalling any other mode, early-terminate the analysis.
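The early-termination rule can be sketched as follows. This is a minimal illustration, not the x264 implementation: it assumes CAVLC-style signalling costs of 1 bit for the predicted mode and 4 bits for any other mode, and the function name, the precomputed satd array, and lambda are all hypothetical. A real encoder would compute each SATD lazily so that the early exit actually skips work.

```c
#include <assert.h>

#define N_MODES 9   /* H.264 defines nine 4x4 intra prediction modes */

/* Hypothetical sketch: satd[m] is the distortion of mode m, pred_mode is the
 * mode predicted from neighbouring blocks, lambda scales bits into cost units.
 * Returns the chosen mode; *checked reports how many modes were examined. */
static int pick_i4x4_mode( const int satd[N_MODES], int pred_mode, int lambda, int *checked )
{
    assert( pred_mode >= 0 && pred_mode < N_MODES && lambda > 0 );
    /* Every mode costs at least 1 bit, so that bit belongs in the base cost.
     * Only the extra bits of a non-predicted mode matter here
     * (assumed: 4 bits - 1 bit = 3 bits). */
    const int extra = 3 * lambda;
    int best_mode = pred_mode;
    int best_cost = satd[pred_mode];
    *checked = 1;
    /* SATD is non-negative, so every other mode costs at least `extra`.
     * If the predicted mode is already cheaper than that, stop. */
    if( best_cost < extra )
        return best_mode;
    for( int m = 0; m < N_MODES; m++ )
    {
        if( m == pred_mode )
            continue;
        int cost = satd[m] + extra;
        (*checked)++;
        if( cost < best_cost )
        {
            best_cost = cost;
            best_mode = m;
        }
    }
    return best_mode;
}
```

With a well-predicted block (small SATD for the predicted mode) only one mode is examined; otherwise all nine are.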
Fiona Glaser [Sun, 14 Mar 2010 08:25:02 +0000 (00:25 -0800)]
Various motion estimation optimizations
Faster method of checking MV range.
Predict MVs and cache MVs/MVDs for bidir qpel-RD.
A whole bunch of other minor optimizations.
Slightly better performance and compression.
Fiona Glaser [Sun, 14 Mar 2010 08:19:59 +0000 (00:19 -0800)]
Overhaul macroblock_cache_rect
Unify the rectangle functions into a single one similar to ffmpeg's fill_rectangle.
Remove all cases of variable-size cache_rect calls; create a function-pointer-based system for handling such cases.
Should greatly decrease code size required for such calls.
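A rough sketch of the idea, under stated assumptions: one generic rectangle fill, plus a small table of fixed-size wrappers that callers with a run-time size can index instead of passing variable width and height. The stride, element type, enum, and all function names here are hypothetical and do not mirror the actual x264 or ffmpeg code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_STRIDE 8   /* hypothetical row stride of the cache, in bytes */

/* Single generic routine: fill a w x h rectangle of bytes with val. */
static void cache_rect( uint8_t *dst, int w, int h, uint8_t val )
{
    assert( w > 0 && w <= CACHE_STRIDE && h > 0 );
    for( int y = 0; y < h; y++ )
        memset( dst + y * CACHE_STRIDE, val, (size_t)w );
}

/* Fixed-size wrappers: with constant w and h the compiler can specialize
 * each one, so call sites do not carry the generic loop themselves. */
static void rect_4x4( uint8_t *d, uint8_t v ) { cache_rect( d, 4, 4, v ); }
static void rect_4x2( uint8_t *d, uint8_t v ) { cache_rect( d, 4, 2, v ); }
static void rect_2x4( uint8_t *d, uint8_t v ) { cache_rect( d, 2, 4, v ); }
static void rect_2x2( uint8_t *d, uint8_t v ) { cache_rect( d, 2, 2, v ); }

/* Function-pointer table for callers whose size is only known at run time. */
typedef void (*rect_fn)( uint8_t *dst, uint8_t val );
enum { RECT_4x4, RECT_4x2, RECT_2x4, RECT_2x2, RECT_COUNT };
static const rect_fn rect_table[RECT_COUNT] = { rect_4x4, rect_4x2, rect_2x4, rect_2x2 };
```

A caller with a run-time partition size would then write something like rect_table[RECT_2x4]( cache + offset, value ).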
Fiona Glaser [Sun, 7 Mar 2010 12:10:30 +0000 (04:10 -0800)]
Much more accurate B-skip detection at 2 < subme < 7
Use the same method that x264 uses for P-skip detection.
This significantly improves quality (1-6%), but at a significant speed cost as well (5-20%).
It also may have a very positive visual effect in cases where the inaccurate skip detection resulted in slightly-off vectors in B-frames.
Such vectors could cause slight blurring or non-smooth motion in low-complexity frames at high quantizers.
Not all instances of this problem are solved: the only universal solution is non-locally-optimal mode decision, which x264 does not currently have.
Kieran Kunhya [Fri, 5 Mar 2010 20:43:02 +0000 (20:43 +0000)]
Save a few bits in slice headers
Don't override the maximum ref index in the slice header if it's the same as the default.
Also update the naming of the relevant variables in the PPS.
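The saving can be illustrated with a small sketch. It assumes a P-slice, where the header carries a 1-bit num_ref_idx_active_override_flag followed, only when the flag is set, by num_ref_idx_l0_active_minus1 coded as ue(v). The helper names are hypothetical and this is not the x264 bitstream writer.

```c
#include <assert.h>

/* Length in bits of an unsigned Exp-Golomb code ue(v). */
static int ue_size( int v )
{
    assert( v >= 0 );
    int n = 0;
    for( v++; v > 1; v >>= 1 )
        n++;
    return 2 * n + 1;
}

/* Bits spent on reference-count signalling in a P-slice header.
 * pps_default_active: active L0 refs implied by the PPS default
 *                     (num_ref_idx_l0_default_active_minus1 + 1).
 * slice_active:       active L0 refs this slice actually uses. */
static int ref_override_bits( int pps_default_active, int slice_active )
{
    assert( pps_default_active >= 1 && slice_active >= 1 );
    int bits = 1;                               /* override flag, always present */
    if( slice_active != pps_default_active )    /* override only when it differs */
        bits += ue_size( slice_active - 1 );
    return bits;
}
```

When the slice uses the same count as the PPS default, only the flag bit is written; B-slices would additionally carry the L1 count, which this sketch leaves out.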
Fiona Glaser [Fri, 19 Mar 2010 21:44:10 +0000 (14:44 -0700)]
"CRF-max" support with VBV
This is a rather curious feature that may have more use than is initially obvious.
In CRF mode with VBV enabled, CRF-max allows the user to specify a quality level which the encoder will never go below, even due to the effects of VBV.
This is not the same as qpmax, which is not aware of issues like scene complexity.
Setting this WILL cause VBV underflows in any situation where the encoder would have needed to exceed the relevant CRF to avoid underflow.
Why might one want to do this even if it would cause VBV underflows?
In the case of streaming, particularly ultra-low-latency streaming, it may be preferable to drop frames than to display frames that are of too low a quality.
Thus, in extremely complex scenes, rather than display completely awful video, the streaming server could simply drop to a lower framerate.
Scenecuts, which normally look terrible under situations like single-frame VBV, could be handled by just displaying them a bit later and dropping frames to compensate.
In other words, it's better to see the scenecut 150ms delayed than for it to look like a blocky mess for 150ms.
On the caller-side, this would be handled by detecting the output size of x264's frames and dropping future frames to compensate if necessary.
This can also be used in normal encoding simply to ensure that VBV does not hurt quality too much (at the cost of potentially causing underflows).
This can help quite a lot when using single-frame VBV and sliced threads, where VBV can often be somewhat unstable.
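The caller-side handling described above might look roughly like this. It is only a sketch under simple assumptions: single-frame VBV, a constant channel rate, and a caller that compensates for an oversized frame by dropping whole frames. The function name and parameters are hypothetical and not part of the x264 API.

```c
#include <assert.h>

/* After receiving an encoded frame of frame_bits, return how many upcoming
 * frames the caller should drop so that the channel can catch up.
 * bitrate is in bits per second, fps in frames per second. */
static int frames_to_drop( int frame_bits, int bitrate, int fps )
{
    assert( frame_bits > 0 && bitrate > 0 && fps > 0 );
    int budget = bitrate / fps;          /* bits the channel carries per frame interval */
    assert( budget > 0 );
    if( frame_bits <= budget )
        return 0;                        /* frame fits: nothing to drop */
    /* ceil(frame_bits / budget) intervals are needed; one is the frame's own. */
    return ( frame_bits - 1 ) / budget;
}
```

For example, at 3 Mbit/s and 30 fps the per-frame budget is 100,000 bits, so a 250,000-bit scenecut frame would be followed by two dropped frames.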
Kieran Kunhya [Tue, 2 Mar 2010 08:57:10 +0000 (00:57 -0800)]
Blu-ray support: NAL-HRD, VFR ratecontrol, filler, pulldown
x264 can now generate Blu-ray-compliant streams for authoring Blu-ray Discs!
Compliance tested using Sony BD-ROM Verifier 1.21.
Thanks to The Criterion Collection for sponsoring compliance testing!
An example command, using constant quality mode, for 1080p24 content:
x264 --crf 16 --preset veryslow --tune film --weightp 0 --bframes 3 --nal-hrd vbr --vbv-maxrate 40000 --vbv-bufsize 30000 --level 4.1 --keyint 24 --b-pyramid strict --slices 4 --aud --colorprim "bt709" --transfer "bt709" --colormatrix "bt709" --sar 1:1 <input> -o <output>
This command is much more complicated than usual due to the very complicated restrictions the Blu-ray spec has.
Most options after "tune" are required by the spec.
--weightp 0 is not, but there are known buggy Blu-ray player chipsets (Mediatek, notably) that will incorrectly decode video encoded with --weightp 1 or 2.
Furthermore, note the Blu-ray spec has very strict limitations on allowed resolution/fps combinations.
Examples include 1080p @ 24000/1001fps (NTSC FILM) and 720p @ 60000/1001fps.
Detailed features introduced in this patch:
Full NAL-HRD compliance, with both VBR (no filler) and CBR (filler) modes.
Can be enabled with --nal-hrd vbr/cbr.
libx264 now returns HRD timing information to the caller in the form of an x264_hrd_t.
x264cli doesn't currently use it, but this information is critical for compliant TS muxing.
Full VFR ratecontrol support: VBV, 1-pass ABR, and 2-pass modes.
This means that, even without knowing the average framerate, x264 can achieve a correct bitrate in target bitrate modes.
Note that this changes the statsfile format; first-pass encodes made before this patch will have to be re-run.
Pulldown support: libx264 allows the calling application to specify a pulldown mode for each frame.
This is similar to the way that RFFs (Repeat Field Flags) work in MPEG-2.
Note that libx264 does not modify timestamps: it assumes the calling application has set timestamps correctly for pulldown!
x264cli contains an example implementation of caller-side pulldown code.
Pic_struct support: necessary for pulldown and allows interlaced signalling.
Also signal TFF vs BFF with delta_poc_bottom: should significantly improve interlaced compression.
--tff and --bff should be preferred to the old --interlaced in order to tell x264 what field order to use.
Huge thanks to Alex Giladi and Lamont Alston for their work on code that eventually became part of this patch.
Yusuke Nakamura [Mon, 1 Mar 2010 05:42:19 +0000 (21:42 -0800)]
Timecode input/output
--tcfile-in allows a user to specify a timecode v1 or v2 file to override input timestamps.
Useful for dealing with VFR input, especially when FFMS/LAVF support isn't available.
--tcfile-out writes a timecode v2 file containing the timecodes of the output file.
New --timebase option allows a user to change the stream timebase.
Intended primarily for forcing timebase with timecode files if necessary.
When using --seek, note that x264 will seek in the timecode file as well.
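How timecodes and the timebase relate can be sketched as follows, assuming the timecode file gives one timestamp in milliseconds per frame (as timecode v2 files commonly do) and that the timebase is a num/den fraction of a second. The function name is hypothetical and the rounding policy is an assumption, not taken from x264.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a timestamp in milliseconds to ticks of a num/den timebase,
 * rounding to the nearest tick. One tick lasts num/den seconds. */
static int64_t ms_to_ticks( double ms, int64_t num, int64_t den )
{
    assert( ms >= 0 && num > 0 && den > 0 );
    return (int64_t)( ms * (double)den / ( 1000.0 * (double)num ) + 0.5 );
}
```

With a 1/24000 timebase, a timecode of 41.708 ms maps to tick 1001, which is one frame interval of 24000/1001 fps content.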
Alex Wright [Sun, 28 Feb 2010 09:29:15 +0000 (01:29 -0800)]
Mixed-refs support for B-frames
Small speed cost, usually a few percent at most. Generally has lowest cost in cases when it isn't very useful. Up to ~2% better compression overall on highly complex sources.
Also fix a few minor bugs in B-frame analysis and various bits of cleanup.
Holger Lubitz [Tue, 23 Mar 2010 23:54:39 +0000 (00:54 +0100)]
Faster cabac_encode_decision_asm
Minimizes instruction count, which also means smaller code.
Various other slight changes to allow more instruction level parallelism.
Holger Lubitz [Tue, 23 Mar 2010 22:13:54 +0000 (23:13 +0100)]
Faster hpel_filter
On ssse3, use pmaddubsw for h filter too (similar to v filter).
Change 32-bit v and c filters to write the result non-temporal.
Add commented-out defines to disable non-temporal operation.
Hardly any black magic here, but still a measurable win especially for ssse3.
Fiona Glaser [Thu, 25 Feb 2010 10:07:48 +0000 (02:07 -0800)]
Fix regression in r1449
Incorrectly placed thread MV check could result in rare thread MV internal errors, esp. with --non-deterministic.
These weren't fatal errors (x264 could recover and continue with slight compression loss).
Fiona Glaser [Wed, 24 Feb 2010 11:49:32 +0000 (03:49 -0800)]
Fix one bug, one corner case in VBV
qp_novbv wasn't set correctly for B-frames.
Disable ABR code for frames with zero complexity.
Disable ABR code for CBR mode; it is completely unnecessary and can have negative consequences.
Fiona Glaser [Tue, 23 Feb 2010 01:33:17 +0000 (17:33 -0800)]
Faster probe_skip, 2x2 DC transform handling
Move the 2x2 DC DCT into the dct_dc asm function to avoid some store-to-load forwarding penalties and extra register loads.
Use dct_dc as part of the early termination in probe_skip.
x86 asm partially by Holger Lubitz.
ARM NEON asm by David Conrad.
Anton Mitrofanov [Sun, 21 Feb 2010 21:21:11 +0000 (13:21 -0800)]
New algorithm for AQ mode 2
Combines the auto-ness of AQ2 with a new formula that uses var^0.25 instead of log(var).
Works better with MB-tree than the old AQ mode 2 and should give higher SSIM.
Fiona Glaser [Sun, 21 Feb 2010 11:56:06 +0000 (03:56 -0800)]
Make b-pyramid normal the default
Now that b-pyramid works with MB-tree and is spec compliant, there's no real reason not to make it default.
Improves compression 0-5% depending on the video.
Also allow 0/1/2 to be used as aliases for none/strict/normal (for conciseness).
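The aliasing can be pictured with a tiny parser. This is a hypothetical sketch of how a name or its numeric alias could map to the same value; it is not the option-parsing code in x264.

```c
#include <assert.h>
#include <string.h>

/* Map a b-pyramid argument to 0 (none), 1 (strict) or 2 (normal).
 * Accepts either the name or its single-digit alias; returns -1 otherwise. */
static int parse_b_pyramid( const char *s )
{
    static const char *const names[3] = { "none", "strict", "normal" };
    assert( s != NULL );
    for( int i = 0; i < 3; i++ )
        if( !strcmp( s, names[i] ) || ( s[0] == '0' + i && s[1] == '\0' ) )
            return i;
    return -1;
}
```

Under this sketch, "--b-pyramid 1" and "--b-pyramid strict" would select the same mode.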
Fiona Glaser [Sun, 21 Feb 2010 09:56:12 +0000 (01:56 -0800)]
Move presets, tunings, and profiles into libx264
Now any application calling libx264 can use them.
Full documentation and guidelines for usage are included in x264.h.
Anton Mitrofanov [Fri, 19 Feb 2010 18:45:22 +0000 (10:45 -0800)]
Faster, more accurate psy-RD caching
Keep more variants of cached Hadamard scores and only calculate them when necessary.
Results in more calculation, but simpler lookups.
Slightly more accurate due to internal rounding in SATD and SA8D functions.
Fiona Glaser [Fri, 19 Feb 2010 01:01:38 +0000 (17:01 -0800)]
Much faster and more efficient MVD handling
Store MV deltas as clipped absolute values.
This means CABAC no longer has to calculate absolute values in MV context selection.
This also lets us cut the memory spent on MVDs by a factor of 2, speeding up cache_mvd and reducing memory usage by 32*threads*(num macroblocks) bytes.
On a Core i7 encoding 1080p, this is about 3 megabytes saved.
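Both points can be sketched in a few lines. The first part assumes the standard CABAC rule that the MVD context increment depends on whether the sum of the neighbouring absolute MVDs exceeds 2 or 32; the clip bound of 33 is an assumption chosen so that a clipped value still pushes the sum past 32. The second part reproduces the memory figure, assuming 12 threads (1.5 times the 8 logical cores of a Core i7). All names are hypothetical, not taken from x264.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MVD_CLIP 33   /* assumed bound: any value >= 33 already forces sum > 32 */

/* Store an MV delta component as a clipped absolute value in one byte. */
static uint8_t store_mvd( int mvd )
{
    int a = abs( mvd );
    return (uint8_t)( a < MVD_CLIP ? a : MVD_CLIP );
}

/* Context increment from stored values: no abs() needed at this point. */
static int mvd_ctx_inc( uint8_t left, uint8_t top )
{
    int sum = left + top;
    return ( sum > 2 ) + ( sum > 32 );
}

/* Reference version working on the raw signed deltas. */
static int mvd_ctx_inc_ref( int left, int top )
{
    int sum = abs( left ) + abs( top );
    return ( sum > 2 ) + ( sum > 32 );
}

/* Bytes saved according to the formula in the message:
 * 32 * threads * (number of macroblocks). */
static long mvd_bytes_saved( int width, int height, int threads )
{
    assert( width > 0 && height > 0 && threads > 0 );
    long mb_w = ( width + 15 ) / 16;    /* macroblocks are 16x16 pixels */
    long mb_h = ( height + 15 ) / 16;
    return 32L * threads * mb_w * mb_h;
}
```

For 1920x1080 the frame is 120 x 68 = 8160 macroblocks, so with 12 threads the formula gives 32 * 12 * 8160 = 3,133,440 bytes, which matches the "about 3 megabytes" quoted above.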
Fiona Glaser [Sat, 13 Feb 2010 19:19:38 +0000 (11:19 -0800)]
Don't even try direct temporal when it would give junk MVs
In PbBbP pyramid structure, the last "b" cannot use temporal because L0Ref0(L1Ref0) != L0Ref0.
Don't even bother analyzing it, just use spatial.
Should improve speed and direct auto effectiveness in CRF and 1-pass modes when b-pyramid is used.
Also makes --direct temporal useful with --b-pyramid, since it will fall back to spatial for frames where temporal is broken.
Fiona Glaser [Sat, 13 Feb 2010 08:52:31 +0000 (00:52 -0800)]
Make the ABR buffer consider the distance to the end of the video
Should improve bitrate accuracy in 2-pass mode.
May also slightly improve quality by allowing more variation earlier-on in a file.
Also fix abr_buffer with 1-pass: there it does something very different from what it does in 2-pass.
Thus, the earlier change that increased it based on threads caused 1-pass ABR to be somewhat less accurate.
Fiona Glaser [Sat, 13 Feb 2010 05:15:12 +0000 (21:15 -0800)]
Backport various speed tweak ideas from ffmpeg
Add mv0 early termination to spatial direct calculation
Up to twice as fast direct mv calculation on near-motionless video.
Branchless CAVLC level code adjustment based on trailing ones.
A few clocks faster.
Check tc value before clipping in C version of deblock functions.
Much faster, but nobody uses those anyways.
Fiona Glaser [Fri, 12 Feb 2010 11:33:54 +0000 (03:33 -0800)]
Implement direct temporal + interlaced
This was much easier than I expected.
It will also be basically useless until TFF/BFF support gets in, since it requires delta_poc_bottom to be set correctly to work well.
Fiona Glaser [Wed, 10 Feb 2010 21:44:28 +0000 (13:44 -0800)]
Allow longer keyints with intra refresh
If a long keyint is specified (longer than macroblock width-1), the refresh will simply not occur all the time.
In other words, a refresh will take place, and then x264 will wait until keyint is over to start another refresh.
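The resulting schedule can be sketched as below, assuming that one full refresh takes (macroblock width - 1) frames, as the message implies, and that a new refresh starts at the beginning of each keyint period. The function name is hypothetical and the real scheduling in x264 may differ in detail.

```c
#include <assert.h>

/* Returns 1 if an intra refresh is in progress at this frame, 0 if the
 * encoder is idling until the current keyint period ends. */
static int refresh_active( int frame, int keyint, int mb_width )
{
    assert( frame >= 0 && keyint > 0 && mb_width > 1 );
    int refresh_len = mb_width - 1;        /* frames one full refresh takes (assumed) */
    return frame % keyint < refresh_len;   /* active only at the start of each period */
}
```

For 1280-pixel-wide video (80 macroblocks) and keyint 250, the refresh would run for the first 79 frames of each period and then pause for the remaining 171.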
Fiona Glaser [Wed, 10 Feb 2010 20:12:29 +0000 (12:12 -0800)]
Overhaul sliced-threads VBV
Make predictors thread-local and allow each thread to poll the others to get their predicted sizes.
Many, many other tweaks to improve quality with small VBV and sliced threads.
Note this may somewhat increase the risk of a VBV underflow in such extreme situations (single-frame VBV).
This is tolerable, as most relevant use-cases are better off with a few rare underflows (even if they have to drop a slice) than consistent low quality.