Fiona Glaser [Fri, 4 Jun 2010 04:31:10 +0000 (21:31 -0700)]
Take more shortcuts in i4x4/i8x8 analysis
Based on the scores of the H and V modes, rule out modes which are unlikely.
Small compression loss (0.1-0.5%) and large speed gain (10-30% faster intra analysis).
Not enabled in slower encoding modes.
Also make C versions of the merged SATD functions in order to eliminate branches based on their availability.
Fiona Glaser [Wed, 2 Jun 2010 08:07:44 +0000 (01:07 -0700)]
Add API function to fix x264_picture_t initialization
Calling applications that do not use x264_picture_alloc need to use x264_picture_init to initialize x264_picture_t structures.
Previously, if the calling application didn't zero x264_picture_t, Bad Things could happen.
Oskar Arvidsson [Tue, 1 Jun 2010 23:35:38 +0000 (01:35 +0200)]
Convert to a unified "pixel" type for pixel data
Necessary for future high bit-depth support.
Various macros and extra types have been introduced to make operations on variable-size pixels more convenient.
Fiona Glaser [Fri, 28 May 2010 21:27:22 +0000 (14:27 -0700)]
Add API tool to apply arbitrary quantizer offsets
The calling application can now pass a "map" of quantizer offsets to apply to each frame.
An optional callback to free the map can also be included.
This allows all kinds of flexible region-of-interest coding and similar.
Fiona Glaser [Thu, 27 May 2010 21:27:32 +0000 (14:27 -0700)]
x86 assembly code for NAL escaping
Up to ~10x faster than C depending on CPU.
Helps the most at very high bitrates (e.g. lossless).
Also make the C code faster and simpler.
Fiona Glaser [Tue, 25 May 2010 19:42:44 +0000 (12:42 -0700)]
Overhaul deblocking again
Move deblock strength calculation to immediately after encoding to take advantage of the data that's already in cache.
Keep the deblocking itself as per-row.
Fiona Glaser [Tue, 25 May 2010 23:13:59 +0000 (16:13 -0700)]
Detect Atom CPU, enable appropriate asm functions
I'm not going to actually optimize for this pile of garbage unless someone pays me.
But it can't hurt to at least enable the correct functions based on benchmarks.
Also save some cache on Intel CPUs that don't need the decimate LUT due to having fast bsr/bsf.
Fiona Glaser [Tue, 18 May 2010 23:48:00 +0000 (16:48 -0700)]
Rewrite deblock strength calculation, add asm
Rewrite is significantly slower, but is necessary to make asm possible.
Similar concept to ffmpeg's deblock strength asm.
Roughly one order of magnitude faster than C.
Overall, with the asm, saves ~100-300 clocks in deblocking per MB.
Kieran Kunhya [Thu, 20 May 2010 16:45:16 +0000 (17:45 +0100)]
Add "Fake interlaced" option
This encodes all frames progressively yet flags the stream as interlaced.
This makes it possible to encode valid 25p and 30p Blu-Ray streams.
Also put the pulldown help section in a more appropriate place.
Fiona Glaser [Sat, 15 May 2010 21:48:58 +0000 (14:48 -0700)]
Overhaul CABAC: faster, less cache usage
Horribly munge up the CABAC tables to allow deduplication of some data.
Saves 256 bytes of L1d cache in non-RD, 512 bytes in RD.
Add asm versions of bypass and terminal; save L1i cache by re-using putbyte code.
Further optimize encode_decision.
All 3 primary CABAC functions fit in under 256 bytes of code total on x86_64.
Fiona Glaser [Sat, 8 May 2010 19:07:13 +0000 (12:07 -0700)]
Add API function to trigger intra refresh
Useful for interactive applications where the encoder knows that packet loss has occurred on the client.
Full documentation is in x264.h.
Fiona Glaser [Sat, 8 May 2010 18:58:22 +0000 (11:58 -0700)]
Fix intra refresh behavior with I-frames
Intra refresh still allows I-frames (for scenecuts/etc).
Now I-frames count as a full refresh, as opposed to instantly triggering a refresh.
Fiona Glaser [Tue, 4 May 2010 04:27:16 +0000 (21:27 -0700)]
Don't force row QPs to integer values with VBV
VBV should no longer raise the bitrate of the video. That is, at a given quality level or average bitrate, turning on VBV should only lower the bitrate.
This isn't quite true if adaptive quant is off, but nobody should be doing that anyways.
Also may result in slightly more accurate per-row VBV ratecontrol.
Fiona Glaser [Sun, 2 May 2010 18:41:36 +0000 (11:41 -0700)]
Improve temporal MV prediction
Predict based on the results of p16x16 search, not final MVs.
This lets us get predictions even if mode decision chose intra.
Also improves cache coherency.
Deduplicate asm constants, automate name prefixing
Auto-prefix global constants with x264_ in cextern.
Eliminate x264_ prefix from asm files; automate it in cglobal.
Deduplicate asm constants wherever possible to save data cache (move them to a new const-a.asm).
Remove x264_emms() entirely on non-x86 (don't even call an empty function).
Add cextern_naked for a non-prefixed cextern (used in checkasm).
Remove reordering restrictions from weightp
Apparently the spec does allow two consecutive copies of the same frame in the reference list.
This involves an incredibly ugly hack to wrap around the frame number.
Very slight compression improvement.
Fix issues with extremely large timebases
With timebase denominators >= 2^30 , x264 would silently overflow and cause odd issues.
Now x264 will explicitly fail with timebase denominators >= 2^31 and work with timebase denominators 2^31 > x >= 2^30.
Move deblocking/hpel into sliced threads
Instead of doing both as a separate pass, do them during the main encode.
This requires disabling deblocking between slices (disable_deblock_idc == 2).
Overall performance gain is about 11% on --preset superfast with sliced threads.
Doesn't reduce the amount of actual computation done: only better parallelizes it.
Fix various early terminations with slices
Neighbouring type values (type_top, etc) are now loaded even if the MB isn't available for prediction.
Significant overall performance increase (as high as 5-10%+) with lots of slices (e.g. with slice-max-size).