Fiona Glaser [Tue, 25 May 2010 23:13:59 +0000 (16:13 -0700)]
Detect Atom CPU, enable appropriate asm functions
I'm not going to actually optimize for this pile of garbage unless someone pays me.
But it can't hurt to at least enable the correct functions based on benchmarks.
Also save some cache on Intel CPUs that don't need the decimate LUT due to having fast bsr/bsf.
Fiona Glaser [Tue, 18 May 2010 23:48:00 +0000 (16:48 -0700)]
Rewrite deblock strength calculation, add asm
Rewrite is significantly slower, but is necessary to make asm possible.
Similar concept to ffmpeg's deblock strength asm.
Roughly one order of magnitude faster than C.
Overall, with the asm, saves ~100-300 clocks in deblocking per MB.
Kieran Kunhya [Thu, 20 May 2010 16:45:16 +0000 (17:45 +0100)]
Add "Fake interlaced" option
This encodes all frames progressively yet flags the stream as interlaced.
This makes it possible to encode valid 25p and 30p Blu-Ray streams.
Also put the pulldown help section in a more appropriate place.
Fiona Glaser [Sat, 15 May 2010 21:48:58 +0000 (14:48 -0700)]
Overhaul CABAC: faster, less cache usage
Horribly munge up the CABAC tables to allow deduplication of some data.
Saves 256 bytes of L1d cache in non-RD, 512 bytes in RD.
Add asm versions of bypass and terminal; save L1i cache by re-using putbyte code.
Further optimize encode_decision.
All 3 primary CABAC functions fit in under 256 bytes of code total on x86_64.
Fiona Glaser [Sat, 8 May 2010 19:07:13 +0000 (12:07 -0700)]
Add API function to trigger intra refresh
Useful for interactive applications where the encoder knows that packet loss has occurred on the client.
Full documentation is in x264.h.
Fiona Glaser [Sat, 8 May 2010 18:58:22 +0000 (11:58 -0700)]
Fix intra refresh behavior with I-frames
Intra refresh still allows I-frames (for scenecuts/etc).
Now I-frames count as a full refresh, as opposed to instantly triggering a refresh.
Fiona Glaser [Tue, 4 May 2010 04:27:16 +0000 (21:27 -0700)]
Don't force row QPs to integer values with VBV
VBV should no longer raise the bitrate of the video. That is, at a given quality level or average bitrate, turning on VBV should only lower the bitrate.
This isn't quite true if adaptive quant is off, but nobody should be doing that anyways.
Also may result in slightly more accurate per-row VBV ratecontrol.
Fiona Glaser [Sun, 2 May 2010 18:41:36 +0000 (11:41 -0700)]
Improve temporal MV prediction
Predict based on the results of p16x16 search, not final MVs.
This lets us get predictions even if mode decision chose intra.
Also improves cache coherency.
Deduplicate asm constants, automate name prefixing
Auto-prefix global constants with x264_ in cextern.
Eliminate x264_ prefix from asm files; automate it in cglobal.
Deduplicate asm constants wherever possible to save data cache (move them to a new const-a.asm).
Remove x264_emms() entirely on non-x86 (don't even call an empty function).
Add cextern_naked for a non-prefixed cextern (used in checkasm).
Remove reordering restrictions from weightp
Apparently the spec does allow two consecutive copies of the same frame in the reference list.
This involves an incredibly ugly hack to wrap around the frame number.
Very slight compression improvement.
Fix issues with extremely large timebases
With timebase denominators >= 2^30 , x264 would silently overflow and cause odd issues.
Now x264 will explicitly fail with timebase denominators >= 2^31 and work with timebase denominators 2^31 > x >= 2^30.
Move deblocking/hpel into sliced threads
Instead of doing both as a separate pass, do them during the main encode.
This requires disabling deblocking between slices (disable_deblock_idc == 2).
Overall performance gain is about 11% on --preset superfast with sliced threads.
Doesn't reduce the amount of actual computation done: only better parallelizes it.
Fix various early terminations with slices
Neighbouring type values (type_top, etc) are now loaded even if the MB isn't available for prediction.
Significant overall performance increase (as high as 5-10%+) with lots of slices (e.g. with slice-max-size).
Add miscompilation check for x264_clz
Running a Phenom-optimized build of x264 (e.g. -march=amdfam10) on a non-Phenom CPU didn't SIGILL; instead it would silently produce incorrect output.
Now, instead, it will error out loudly.
Fixing floating-point exception in level-checking
Doesn't cause any issues for x264cli, but might impact some calling apps that care (e.g. Delphi apps).
Alex Wright [Wed, 7 Apr 2010 15:25:55 +0000 (01:25 +1000)]
Early termination in 16x8/8x16 search
Combine the actual cost of the first partition with the predicted cost of the second to avoid searching the second when possible.
Reduces the number of times the second partition is searched by up to ~75% in non-RD mode, ~10% in RD mode.
Negligible effect on compression.
Make MV prediction work across slice boundaries
Should improve motion search with lots of small slices, e.g. with slice-max-size.
Still restricted by sliced threads (won't cross the boundary between two threadslices).
The output-changing part of the previous patch.
Cleanup and simplification of macroblock_load
Doesn't do anything now, but will be useful for many future changes.
Splitting out neighbour calculation will make MBAFF implementation easier.
Calculation of neighbour_frame value (actual neighbouring MBs, ignoring slices) will be useful for some future patches.
Fiona Glaser [Wed, 31 Mar 2010 08:44:07 +0000 (01:44 -0700)]
Massive cosmetic and syntax cleanup
Convert all applicable loops to use C99 loop index syntax.
Clean up most inconsistent syntax in ratecontrol.c, visualize, ppc, etc.
Replace log(x)/log(2) constructs with log2, and similar with log10.
Fix all -Wshadow violations.
Fix visualize support.