Fiona Glaser [Wed, 5 Aug 2009 00:46:33 +0000 (17:46 -0700)]
Macroblock-tree ratecontrol
On by default; can be turned off with --no-mbtree.
Uses a large lookahead to track temporal propagation of data and weight quality accordingly.
Requires a very large separate statsfile (2 bytes per macroblock) in multi-pass mode.
Doesn't work with b-pyramid yet.
Note that MB-tree inherently measures quality different from the standard qcomp method, so bitrates produced by CRF may change somewhat.
This makes the "medium" preset a bit slower. Accordingly, make "fast" slower as well, and introduce a new preset "faster" between "fast" and "veryfast".
All presets "fast" and above will have MB-tree on.
Add a new option, --rc-lookahead, to control the distance MB tree looks ahead to perform propagation analysis.
Default is 40; larger values will be slower and require more memory but give more accurate results.
This value will be used in the future to control ratecontrol lookahead (VBV).
Add a new option, --no-psy, to disable all psy optimizations that don't improve PSNR or SSIM.
This disables psy-RD/trellis, but also other more subtle internal psy optimizations that can't be controlled directly via external parameters.
Quality improvement from MB-tree is about 2-70% depending on content.
Strength of MB-tree adjustments can be tweaked using qcompress; higher values mean lower MB-tree strength.
Note that MB-tree may perform slightly suboptimally on fades; this will be fixed by weighted prediction, which is coming soon.
Fiona Glaser [Tue, 4 Aug 2009 03:52:30 +0000 (20:52 -0700)]
Various 1-pass VBV tweaks
Make predictors have an offset in addition to a multiplier.
This primarily fixes issues in sources with lots of extremely static scenes, such as anime and CGI.
We tried linear regressions, but they were very unreliable as predictors.
Also allow VBV to be slightly more aggressive in raising QPs to avoid not having enough bits left in some situations.
Up to 1db improvement on some clips.
15% faster refine_bidir_satd, 10% faster refine_bidir_rd (or less with trellis=2)
re-roll a loop (saves 44KB code size, which is the cause of most of this speed gain)
don't re-mc mvs that haven't changed
Faster bidir_rd plus some bugfixes
Cache chroma MC during refine_bidir_rd and use both the luma and chroma caches to skip MC in macroblock_encode.
Fix incorrect call to rd_cost_part; refine_bidir_rd output was incorrect for i8>0.
Remove some redundant clips.
~12% faster refine_bidir_rd.
Add QPRD support as subme=10
Refactor trellis lambda selection to be done in analyse_init instead of in trellis.
This will allow for more easy adaption of lambda later on; for now it allows constant lambda across variable QPs.
QPRD is only available with adaptive quantization enabled and generally improves SSIM and visual quality.
Additionally, weight the SSD values from RD based on the relative QP offset for chroma; helps visually at high QPs where chroma has a lower QP than luma.
This fixes some visual artifacts created by QPRD at high QPs.
Note that this generally hurts PSNR and SSIM, and so is only on when psy-RD is on.
Anton Mitrofanov [Sat, 18 Jul 2009 23:30:18 +0000 (16:30 -0700)]
New AQ algorithm option
"Auto-variance" uses log(var)^2 instead of log(var) and attempts to adapt strength per-frame.
Generates significantly better SSIM; on by default with --tune ssim.
Whether it generates visually better quality is still up for debate.
Available as --aq-mode 2.
Totally new preset system for x264.c (not libx264), new defaults
Other new features include "tune" and "profile" settings; see --help for more details.
Unlike most other settings, "preset" and "tune" act before all other options.
However, "profile" acts afterwards, overriding all other options.
Our defaults have also changed: new defaults are --subme 7 --bframes 3 --8x8dct --no-psnr --no-ssim --threads auto --ref 3 --mixed-refs --trellis 1 --weightb --crf 23 --progress.
Users will hopefully find these changes to greatly improve usability.
Fiona Glaser [Tue, 30 Jun 2009 22:20:32 +0000 (15:20 -0700)]
Early termination for chroma encoding
Faster chroma encoding by terminating early if heuristics indicate that the block will be DC-only.
This works because the vast majority of inter chroma blocks have no coefficients at all, and those that do are almost always DC-only.
Add two new helper DSP functions for this: dct_dc_8x8 and var2_8x8. mmx/sse2/ssse3 versions of each.
Early termination is disabled at very low QPs due to it not being useful there.
Performance increase is ~1-2% without trellis, up to 5-6% with trellis=2.
Increase is greater with lower bitrates.
David Conrad [Fri, 26 Jun 2009 20:09:44 +0000 (13:09 -0700)]
Fix bug in checkasm
frame_init_lowres_core check didn't check the C plane.
However, all x86 and PPC assembly was correct regardless of the unit test being incorrect.
Fiona Glaser [Sat, 20 Jun 2009 01:49:55 +0000 (18:49 -0700)]
Various CABAC optimizations and cleanups
Faster CABAC CBF context calculation for inter blocks.
Add x264_constant_p(), will probably be useful in the future as well.
Simpler subpartition functions.
Clean up and optimize mvd_cpn a bit more.
Various other minor optimizations.
Fiona Glaser [Mon, 4 May 2009 11:19:28 +0000 (04:19 -0700)]
More CABAC and CAVLC optimizations
Simplified function calling for block_residual_write_(cabac|cavlc) and improved sigmap coding.
Tried making 0/1-bit specific versions of CABAC asm, but benefit was minimal under GCC 4.3.
Helped a decent bit under 3.4, but you shouldn't be using such old versions anyways.
Add "coded blocks" stat to output information.
This measures the total percentage of blocks, intra and inter, which have nonzero coefficients.
"y,uvAC,uvDC" refers to luma, chroma DC, and chroma AC blocks.
Note that skip blocks are included in this stat.
Various CABAC optimizations
Move calculation of b_intra out of the core residual loop and hardcode it where applicable.
Inlining cabac_mb_mvd was unnecessary and wasted tremendous amounts of code size. Inlining only cache_mvd is faster and significantly smaller.
Fiona Glaser [Tue, 17 Mar 2009 18:01:57 +0000 (11:01 -0700)]
SSE2 zigzag_interleave
Replace PHADD with FastShuffle (more accurate naming).
This flag represents asm functions that rely on fast SSE2 shuffle units, and thus are only faster on Phenom, Nehalem, and Penryn CPUs.
Holger Lubitz [Sat, 7 Mar 2009 02:16:30 +0000 (18:16 -0800)]
Vastly faster SATD/SA8D/Hadamard_AC/SSD/DCT/IDCT
Heavily optimized for Core 2 and Nehalem, but performance should improve on all modern x86 CPUs.
16x16 SATD: +18% speed on K8(64bit), +22% on K10(32bit), +42% on Penryn(64bit), +44% on Nehalem(64bit), +50% on P4(32bit), +98% on Conroe(64bit)
Similar performance boosts in SATD-like functions (SA8D, hadamard_ac) and somewhat less in DCT/IDCT/SSD.
Overall performance boost is up to ~15% on 64-bit Conroe.
Fiona Glaser [Wed, 4 Mar 2009 00:21:52 +0000 (16:21 -0800)]
Slightly faster 8x16 SAD on Penryn Core 2
Same as MMX 8x16 cacheline SAD, but calls SSE2 8x16 SAD in non-cacheline case.
Only Nehalem benefits from sizes smaller than 8x16, and Nehalem doesn't use cacheline functions, so no smaller versions are included.
Fiona Glaser [Thu, 26 Feb 2009 22:29:50 +0000 (14:29 -0800)]
Remove non-pre scenecut
Add support for no-b-adapt + pre-scenecut (patch by BugMaster)
Pre-scenecut was generally better than regular scenecut in terms of accuracy and regular scenecut didn't work in threaded mode anyways.
Add no-scenecut option (scenecut=0 is now no scenecut; previously it was -1)
Fix an incorrect bias towards P-frames near scenecuts with B-adapt 2.
Simplify pre-scenecut code.
Add AltiVec version of hadamard_ac. 2.4x faster than the C version.
Note this this implementation is pretty naive and should be improved
by implementing what's discussed in this ML thread:
date: Mon, Feb 2, 2009 at 6:58 PM
subject: Re: [x264-devel] [PATCH] AltiVec implementation of hadamard_ac routines
Manuel Rommel [Tue, 10 Feb 2009 20:06:47 +0000 (12:06 -0800)]
fix 10l in 75b495f2723fcb77f
Original thread:
date: Mon, Feb 9, 2009 at 9:37 PM
subject: [x264-devel] commit: Spare a vec_perm and a vec_mergeh though using a LUT of permutation vectors . (Guillaume Poirier )
Fiona Glaser [Fri, 6 Feb 2009 10:59:36 +0000 (02:59 -0800)]
More intra pred asm optimizations
SSSE3 version of predict_8x8_hu
SSE2 version of predict_8x8c_p
SSSE3 versions of both planar prediction functions
Optimizations to predict_16x16_p_sse2
Some unnecessary REP_RETs -> RETs.
SSE2 version of predict_8x8_vr by Holger.
SSE2 version of predict_8x8_hd.
Don't compile MMX versions of some of the pred functions on x86_64.
Remove now-useless x86_64 C versions of 4x4 pred functions.
Rewrite some of the x86_64-only C functions in asm.
Manuel Rommel [Sun, 8 Feb 2009 20:35:51 +0000 (21:35 +0100)]
Speed-up mc_chroma_altivec by using vec_mladd cleverly, and unrolling.
Also put width == 2 variant in its own scalar function because it's faster
than a vectorized one.
Holger Lubitz [Wed, 4 Feb 2009 20:46:17 +0000 (12:46 -0800)]
Merging Holger's GSOC branch part 2: intra prediction
Assembly versions of most remaining 4x4 and 8x8 intra pred functions.
Assembly version of predict_8x8_filter.
A few other optimizations.
Primarily Core 2-optimized.