Fiona Glaser [Sat, 15 May 2010 21:48:58 +0000 (14:48 -0700)]
Overhaul CABAC: faster, less cache usage
Horribly munge up the CABAC tables to allow deduplication of some data.
Saves 256 bytes of L1d cache in non-RD, 512 bytes in RD.
Add asm versions of bypass and terminal; save L1i cache by re-using putbyte code.
Further optimize encode_decision.
All 3 primary CABAC functions fit in under 256 bytes of code total on x86_64.
Fiona Glaser [Sat, 8 May 2010 19:07:13 +0000 (12:07 -0700)]
Add API function to trigger intra refresh
Useful for interactive applications where the encoder knows that packet loss has occurred on the client.
Full documentation is in x264.h.
Fiona Glaser [Sat, 8 May 2010 18:58:22 +0000 (11:58 -0700)]
Fix intra refresh behavior with I-frames
Intra refresh still allows I-frames (for scenecuts/etc).
Now I-frames count as a full refresh, as opposed to instantly triggering a refresh.
Fiona Glaser [Tue, 4 May 2010 04:27:16 +0000 (21:27 -0700)]
Don't force row QPs to integer values with VBV
VBV should no longer raise the bitrate of the video. That is, at a given quality level or average bitrate, turning on VBV should only lower the bitrate.
This isn't quite true if adaptive quant is off, but nobody should be doing that anyway.
Also may result in slightly more accurate per-row VBV ratecontrol.
Fiona Glaser [Sun, 2 May 2010 18:41:36 +0000 (11:41 -0700)]
Improve temporal MV prediction
Predict based on the results of p16x16 search, not final MVs.
This lets us get predictions even if mode decision chose intra.
Also improves cache coherency.
Deduplicate asm constants, automate name prefixing
Auto-prefix global constants with x264_ in cextern.
Eliminate x264_ prefix from asm files; automate it in cglobal.
Deduplicate asm constants wherever possible to save data cache (move them to a new const-a.asm).
Remove x264_emms() entirely on non-x86 (don't even call an empty function).
Add cextern_naked for a non-prefixed cextern (used in checkasm).
Remove reordering restrictions from weightp
Apparently the spec does allow two consecutive copies of the same frame in the reference list.
This involves an incredibly ugly hack to wrap around the frame number.
Very slight compression improvement.
Fix issues with extremely large timebases
With timebase denominators >= 2^30, x264 would silently overflow and cause odd issues.
Now x264 will explicitly fail with timebase denominators >= 2^31 and work with timebase denominators 2^31 > x >= 2^30.
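A minimal sketch of the kind of range check described above. The helper name and the rejection of a zero denominator are assumptions made for illustration; this is not x264's actual validation code.

```c
#include <stdint.h>

/* Hypothetical helper (not x264's actual code): accept only timebase
 * denominators strictly below 2^31. Rejecting zero is an extra assumption
 * of this sketch. */
static int timebase_den_is_valid( uint64_t den )
{
    const uint64_t limit = UINT64_C(1) << 31;
    return den > 0 && den < limit;
}
```

With this check, a denominator of 2^30 is accepted and 2^31 is rejected up front instead of overflowing later.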
Move deblocking/hpel into sliced threads
Instead of doing both as a separate pass, do them during the main encode.
This requires disabling deblocking between slices (disable_deblock_idc == 2).
Overall performance gain is about 11% on --preset superfast with sliced threads.
Doesn't reduce the amount of actual computation done: only better parallelizes it.
Fix various early terminations with slices
Neighbouring type values (type_top, etc) are now loaded even if the MB isn't available for prediction.
Significant overall performance increase (5-10% or more) with lots of slices (e.g. with slice-max-size).
Add miscompilation check for x264_clz
Running a Phenom-optimized build of x264 (e.g. -march=amdfam10) on a non-Phenom CPU didn't SIGILL; instead it would silently produce incorrect output.
Now, instead, it will error out loudly.
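A sketch of how such a self-test could look, assuming the failure mode is a clz that returns the index of the highest set bit (which is what BSR computes, and what an LZCNT encoding typically decodes to on CPUs without LZCNT). All function names here are hypothetical; x264's real check may differ.

```c
#include <stdint.h>

/* Portable reference: count leading zeros of a 32-bit value. */
static int clz32_reference( uint32_t x )
{
    if( !x )
        return 32;
    int n = 0;
    while( !(x & 0x80000000u) )
    {
        x <<= 1;
        n++;
    }
    return n;
}

/* Simulated broken clz: returns the highest set bit index (BSR semantics).
 * Assumption of this sketch, used only to exercise the self-test. */
static int clz32_as_bsr( uint32_t x )
{
    return 31 - clz32_reference( x );
}

/* Hypothetical self-test: compare the build's fast clz against the
 * reference on a few probe values. Returns 0 on success, -1 on mismatch. */
static int clz_selftest( int (*clz_fn)( uint32_t ) )
{
    static const uint32_t probes[] = { 1, 392, 0x8000, 0x80000000u };
    for( int i = 0; i < 4; i++ )
        if( clz_fn( probes[i] ) != clz32_reference( probes[i] ) )
            return -1;
    return 0;
}
```

A caller would run the self-test once at startup and refuse to encode if it returns -1.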
Fix floating-point exception in level-checking
Doesn't cause any issues for x264cli, but might impact some calling apps that care (e.g. Delphi apps).
Alex Wright [Wed, 7 Apr 2010 15:25:55 +0000 (01:25 +1000)]
Early termination in 16x8/8x16 search
Combine the actual cost of the first partition with the predicted cost of the second to avoid searching the second when possible.
Reduces the number of times the second partition is searched by up to ~75% in non-RD mode, ~10% in RD mode.
Negligible effect on compression.
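The early termination above can be sketched roughly as follows. The callback type, the demo structure, and the way the predicted cost is passed in are all assumptions for illustration; how x264 derives the predicted cost is not described here.

```c
#include <limits.h>

/* Hypothetical callback: runs the motion search for one partition and
 * returns its cost. */
typedef int (*partition_cost_fn)( int partition_idx, void *opaque );

/* Search a two-partition mode (16x8 or 8x16). Returns the total cost, or
 * INT_MAX if the mode was abandoned after the first partition. */
static int search_two_partitions( partition_cost_fn search, void *opaque,
                                  int predicted_cost_second,
                                  int best_cost_so_far )
{
    int cost_first = search( 0, opaque );
    /* Actual cost of partition 0 plus predicted cost of partition 1:
     * if that already cannot beat the best mode, skip the second search. */
    if( cost_first + predicted_cost_second >= best_cost_so_far )
        return INT_MAX;
    return cost_first + search( 1, opaque );
}

/* Demo state: fixed per-partition costs plus a count of searches run. */
typedef struct
{
    int cost[2];
    int searches;
} demo_costs_t;

static int demo_search( int idx, void *opaque )
{
    demo_costs_t *d = opaque;
    d->searches++;
    return d->cost[idx];
}
```

In the demo, the `searches` counter makes it visible whether the second partition was actually searched.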
Make MV prediction work across slice boundaries
Should improve motion search with lots of small slices, e.g. with slice-max-size.
Still restricted by sliced threads (won't cross the boundary between two threadslices).
The output-changing part of the previous patch.
Cleanup and simplification of macroblock_load
Doesn't do anything now, but will be useful for many future changes.
Splitting out neighbour calculation will make MBAFF implementation easier.
Calculation of neighbour_frame value (actual neighbouring MBs, ignoring slices) will be useful for some future patches.
Fiona Glaser [Wed, 31 Mar 2010 08:44:07 +0000 (01:44 -0700)]
Massive cosmetic and syntax cleanup
Convert all applicable loops to use C99 loop index syntax.
Clean up most inconsistent syntax in ratecontrol.c, visualize, ppc, etc.
Replace log(x)/log(2) constructs with log2, and similar with log10.
Fix all -Wshadow violations.
Fix visualize support.
Fiona Glaser [Fri, 26 Mar 2010 22:33:20 +0000 (15:33 -0700)]
New "superfast" preset, much faster intra analysis
Especially at the fastest settings, intra analysis was taking up the majority of MB analysis time.
This patch takes a ton more shortcuts at the fastest encoding settings, decreasing compression 0.5-5% but improving speed greatly.
Also rearrange the fastest presets a bit: now we have ultrafast, superfast, veryfast, faster.
superfast is the old veryfast (but much faster due to this patch).
veryfast is between the old veryfast and faster.
faster is the same as before except with MB-tree on.
Encoding with subme >= 5 should be unaffected by this patch.
Fiona Glaser [Tue, 23 Mar 2010 21:00:58 +0000 (14:00 -0700)]
Add tune for still image compression
There has been some demand for this from companies looking to use x264 for still image compression (it can outperform JPEG or JPEG 2000 by a factor of 2 or more).
Still image compression is a bit different; because temporal stability isn't an issue, we can get away with far more powerful psy settings.
Fiona Glaser [Sun, 21 Mar 2010 00:07:12 +0000 (17:07 -0700)]
Much faster non-RD intra analysis
Since every pred mode costs at least 1 bit, move that part into the initial SATD cost.
This lets i4x4/i8x8 analysis terminate earlier.
If the cost of the predicted mode is less than the cost of signalling any other mode, early-terminate the analysis.
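A rough sketch of the termination rule, assuming the predicted mode costs 1 bit to signal and any other mode costs 4 bits, and that the SATD values are already available in an array. In a real encoder each SATD would be computed on demand, so `modes_tested` stands in for the work saved. The function name and layout are hypothetical.

```c
#define NUM_MODES 9

/* Pick an intra mode from SATD costs. Mode signalling cost is folded into
 * the initial cost, so the predicted mode can win before any other mode
 * is evaluated. Bit costs (1 and 4) are assumptions of this sketch. */
static int pick_intra_mode( const int satd[NUM_MODES], int predicted_mode,
                            int lambda, int *modes_tested )
{
    const int bits_predicted = 1, bits_other = 4;
    int best_mode = predicted_mode;
    int best_cost = satd[predicted_mode] + lambda * bits_predicted;
    *modes_tested = 1;

    /* Any other mode costs at least lambda * bits_other even with zero
     * distortion, so nothing can beat the predicted mode here. */
    if( best_cost < lambda * bits_other )
        return best_mode;

    for( int m = 0; m < NUM_MODES; m++ )
    {
        if( m == predicted_mode )
            continue;
        int cost = satd[m] + lambda * bits_other;
        (*modes_tested)++;
        if( cost < best_cost )
        {
            best_cost = cost;
            best_mode = m;
        }
    }
    return best_mode;
}
```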
Fiona Glaser [Sun, 14 Mar 2010 08:25:02 +0000 (00:25 -0800)]
Various motion estimation optimizations
Faster method of checking MV range.
Predict MVs and cache MVs/MVDs for bidir qpel-RD.
A whole bunch of other minor optimizations.
Slightly better performance and compression.
Fiona Glaser [Sun, 14 Mar 2010 08:19:59 +0000 (00:19 -0800)]
Overhaul macroblock_cache_rect
Unify the rectangle functions into a single one similar to ffmpeg's fill_rectangle.
Remove all cases of variable-size cache_rect calls; create a function-pointer-based system for handling such cases.
Should greatly decrease code size required for such calls.
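One way to picture the design: a single generic fill routine, fixed-size wrappers the compiler can specialize, and a function-pointer table for sizes known only at run time. The names, the 8-element stride, and the set of sizes below are hypothetical and do not reflect x264's actual cache layout.

```c
#include <stdint.h>

#define CACHE_STRIDE 8  /* hypothetical row stride of the cache, in elements */

/* Generic fill: write val into a w x h rectangle starting at dst. */
static inline void fill_rect_u8( uint8_t *dst, int w, int h, uint8_t val )
{
    for( int y = 0; y < h; y++ )
        for( int x = 0; x < w; x++ )
            dst[y * CACHE_STRIDE + x] = val;
}

/* Fixed-size wrappers: w and h are constants, so each can be specialized. */
static void fill_4x4( uint8_t *dst, uint8_t val ) { fill_rect_u8( dst, 4, 4, val ); }
static void fill_4x2( uint8_t *dst, uint8_t val ) { fill_rect_u8( dst, 4, 2, val ); }
static void fill_2x4( uint8_t *dst, uint8_t val ) { fill_rect_u8( dst, 2, 4, val ); }
static void fill_2x2( uint8_t *dst, uint8_t val ) { fill_rect_u8( dst, 2, 2, val ); }

typedef void (*fill_fn)( uint8_t *dst, uint8_t val );

/* Table indexed by a size code, for callers whose size varies at run time.
 * A call through the table replaces an inlined variable-size fill. */
static const fill_fn fill_by_size[4] = { fill_4x4, fill_4x2, fill_2x4, fill_2x2 };
```

Call sites with a constant size use the wrapper directly; call sites with a variable size pay one indirect call instead of carrying their own copy of the general loop.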