Anton Mitrofanov [Sun, 29 Aug 2010 12:35:32 +0000 (16:35 +0400)]
Fix bug in 2pass if the first P-frames are all skip
last_qscale_for was read before being initialized in this case, resulting
in the value from the previous iteration being used instead.
Fiona Glaser [Sat, 21 Aug 2010 07:15:53 +0000 (00:15 -0700)]
CAVLC "trellis"
~3-10% improved compression with CAVLC.
--trellis is now a valid option with CAVLC.
Perhaps more importantly, this means psy-trellis now works with CAVLC.
This isn't a real trellis; it's actually just a simplified QNS.
But it takes enough shortcuts that it's still roughly as fast as a trellis; just not quite optimal.
Thus the name is a bit of a misnomer, but we're reusing the option name because it does the same thing.
A real trellis would be better, but CAVLC is much harder to trellis than CABAC.
I'm not aware of any published polynomial-time solutions that come reasonably close to optimal.
Fiona Glaser [Tue, 10 Aug 2010 23:55:05 +0000 (16:55 -0700)]
Deblock-aware RD
Small quality gain (~0.5%) at lower bitrates, potentially larger with QPRD.
May help more with psy, maybe not.
Enabled at subme >= 9. Small speed cost (a few %).
Yasuhiro Ikeda [Tue, 3 Aug 2010 13:07:36 +0000 (22:07 +0900)]
Work around bug in fps/timestamp handling with lavf input
reordered_opaque in lavf doesn't work correctly in the identity case (no reordering).
Fixes incorrect output for some file types (e.g. raw in mov).
Mike Matsnev [Sun, 1 Aug 2010 19:08:20 +0000 (12:08 -0700)]
Fix aspect ratio writing in the MKV muxer
The braindead Matroska spec dictates aspect ratio to be measured in pixels instead of, well, an actual aspect ratio.
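As a rough illustration of what "measured in pixels" means here, the sketch below turns a coded frame size plus a sample aspect ratio into a display width and height. The helper name, the struct, and the choice of which dimension to stretch are all assumptions made for this example; the log does not show the muxer's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for this sketch: display dimensions in pixels. */
typedef struct { int64_t width, height; } display_size_t;

/* Hypothetical helper: derive display dimensions from the coded size and a
 * sample aspect ratio (sar_num:sar_den). Only one dimension is stretched, so
 * the other stays equal to the coded size. */
display_size_t mkv_display_size( int width, int height, int sar_num, int sar_den )
{
    assert( width > 0 && height > 0 );
    display_size_t d = { width, height };
    if( sar_num <= 0 || sar_den <= 0 )
        return d;                                        /* unknown SAR: assume square pixels */
    if( sar_num > sar_den )
        d.width  = (int64_t)width  * sar_num / sar_den;  /* wide pixels: stretch horizontally */
    else
        d.height = (int64_t)height * sar_den / sar_num;  /* tall pixels: stretch vertically */
    return d;
}
```

For example, a 1440x1080 frame with a 4:3 sample aspect ratio would be described as 1920x1080 display pixels.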
invalidate_reference fixes
invalidate_reference didn't actually invalidate the immediate previous frame, only frames that came before that.
Make sure that reordering is forced when invalidate_reference is used, so that the reference list is correct decoder-side.
Steven Walters [Sun, 25 Jul 2010 23:45:27 +0000 (19:45 -0400)]
Filtering system-related fixes
Fix configure to check for outdated libavutil in resize filter support.
Do not print an explicit error message in ffms when requesting a frame beyond the number of frames in the source.
Mention in --*help that filtering options can be specified as name=value.
Fix the shadowing warning in the resize filter on POSIX systems.
Improve reference_invalid support
Reference invalidation can now be used to invalidate multiple frames at a time, rather than being limited to one per encoder_encode call.
Prevent some cases of cache aliasing.
Avoid cases where image strides were a large power of 2.
Core 2: +3% speed at widths 898..960, +6% at widths 1922..1984, most other resolutions unaffected.
Nehalem and AMD: similar amount of speedup, but fewer resolutions affected.
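One way to avoid strides that are a large power of 2 is to align the stride as usual and then nudge it by one alignment unit whenever it lands on such a boundary. The sketch below shows that idea only; the function names and the padding, alignment, and "disalign" values are assumptions picked for illustration, not values taken from this commit.

```c
#include <assert.h>

/* Round x up to a multiple of align (align must be a power of two). */
int align_up( int x, int align )
{
    return (x + align - 1) & ~(align - 1);
}

/* Hypothetical sketch: choose a stride that is aligned for SIMD access but is
 * never a multiple of `disalign`, a larger power of two at which rows would
 * map onto the same cache sets. */
int pick_stride( int width, int padding, int align, int disalign )
{
    assert( align > 0 && !(align & (align - 1)) );
    assert( disalign > align && !(disalign & (disalign - 1)) );
    int stride = align_up( width + 2 * padding, align );
    if( !(stride & (disalign - 1)) )
        stride += align;   /* step off the power-of-two boundary */
    return stride;
}
```

With 32 pixels of padding on each side, 64-byte alignment, and a 1024-byte disalign value, a 960-wide plane would get a stride of 1088 instead of 1024, while a 1280-wide plane keeps its natural stride of 1344.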
Fix another PCM bug
CABAC assumes that NNZ is 0 or 1, not the number of actual nonzero coefficients.
Didn't actually break the output; only had a tiny effect on RD.
Convert x264 to use NV12 pixel format internally
~1% faster overall on Conroe, mostly due to improved cache locality.
Also allows improved SIMD on some chroma functions (e.g. deblock).
This change also extends the API to allow direct NV12 input, which should be a bit faster than YV12.
This isn't currently used in the x264cli, as swscale does not have fast NV12 conversion routines, but it might be useful for other applications.
Note this patch disables the chroma SIMD code for PPC and ARM until new versions are written.
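For readers unfamiliar with the layout: NV12 keeps the luma plane as-is and stores chroma as a single plane of interleaved U and V samples, rather than two separate planes. A minimal C sketch of the conversion for the chroma part follows; the function name and signature are made up for this example and are not x264's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interleave separate U and V planes into one NV12-style
 * UV plane. Strides are in bytes; dst needs 2*chroma_width bytes per row. */
void interleave_chroma( uint8_t *dst, ptrdiff_t dst_stride,
                        const uint8_t *src_u, ptrdiff_t u_stride,
                        const uint8_t *src_v, ptrdiff_t v_stride,
                        int chroma_width, int chroma_height )
{
    assert( chroma_width >= 0 && chroma_height >= 0 );
    for( int y = 0; y < chroma_height; y++ )
    {
        for( int x = 0; x < chroma_width; x++ )
        {
            dst[2*x+0] = src_u[x];   /* U sample first */
            dst[2*x+1] = src_v[x];   /* then the co-located V sample */
        }
        dst   += dst_stride;
        src_u += u_stride;
        src_v += v_stride;
    }
}
```

Keeping U and V adjacent means one load fetches both chroma components for a pixel pair, which is the kind of locality the commit message refers to.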
Steven Walters [Mon, 5 Jul 2010 21:37:47 +0000 (17:37 -0400)]
Add video filtering system to x264cli
Similar to mplayer's -vf system.
Supports some basic operations like resizing and cropping. Will support more in the future.
See the help for more details.
Improve scenecut detection a bit
Put a minimum value on the scenecut threshold; makes x264 more likely to catch successive scenecuts (but might increase the odds of false detection).
This also fixes scenecut detection with keyint=infinite.
Also print keyint=infinite in the x264 SEI and statsfile correctly.
Oskar Arvidsson [Fri, 2 Jul 2010 02:06:08 +0000 (04:06 +0200)]
Support for 9 and 10-bit encoding
Output bit depth is specified at compile time via --bit-depth.
There is currently almost no assembly code available for high-bit-depth modes, so encoding will be very slow.
Input is still 8-bit only; this will change in the future.
Note that very few H.264 decoders support >8 bit depth currently.
Also note that the quantizer scale differs for higher bit depth. For example, for 10-bit, the quantizer (and crf) ranges from 0 to 63 instead of 0 to 51.
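The range change follows from H.264 extending the quantizer range by 6 steps for every bit of sample depth above 8. A small sketch of that arithmetic, with a function name invented for this example:

```c
#include <assert.h>

/* Hypothetical helper: largest quantizer value for a given sample bit depth.
 * 8-bit video tops out at 51; each extra bit adds 6 more steps. */
int qp_max_for_bit_depth( int bit_depth )
{
    assert( bit_depth >= 8 );
    return 51 + 6 * (bit_depth - 8);
}
```

This gives 51 for 8-bit, 57 for 9-bit, and 63 for 10-bit, so a quantizer or crf value tuned for 8-bit output does not mean the same thing at a higher bit depth.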
Fiona Glaser [Wed, 30 Jun 2010 20:55:46 +0000 (13:55 -0700)]
Support infinite keyint (--keyint infinite).
This just means x264 won't insert non-scenecut keyframes.
Useful for streaming when using interactive error recovery or some other mechanism that makes keyframes unnecessary.
Also change POC logic to limit POC/framenum LSB size (to save bits per slice).
Also fix a bug in the CPB underflow detection code (didn't affect the bitstream, just resulted in the failure to print certain warning messages).
Fiona Glaser [Wed, 30 Jun 2010 20:06:22 +0000 (13:06 -0700)]
Don't check i16x16 planar mode unless previous modes were useful
Saves ~160 clocks per MB at subme=1, ~270 per MB at subme>1 (measured on Core i7).
Negligible effect on compression.
Lamont Alston [Tue, 29 Jun 2010 17:11:42 +0000 (10:11 -0700)]
Make open-GOP Blu-ray compatible
Blu-ray is even more braindamaged than we thought.
Accordingly, open-gop options are now "normal" and "bluray", as opposed to display and coded.
Normal should be used in all cases besides Blu-ray authoring.
Fiona Glaser [Mon, 28 Jun 2010 22:02:33 +0000 (15:02 -0700)]
Callback feature for low-latency per-slice output
Add a callback to allow the calling application to send slices immediately after they are encoded.
Also add some extra information to the x264_nal_t structure to help inform such a calling application how the NAL units should be ordered.
Fiona Glaser [Thu, 24 Jun 2010 00:29:34 +0000 (17:29 -0700)]
Interactive encoder control: error resilience
In low-latency streaming with few clients, it is often feasible to modify encoder behavior in some fashion based on feedback from clients.
One possible application of this is error resilience: if a packet is lost, mark the associated frame (and any referenced from it) as lost.
This allows quick recovery from errors with minimal expense bit-wise.
The new i_dpb_size parameter allows a calling application to tell x264 to use a larger DPB size than required by the number of reference frames.
This lets x264 and the client keep a large buffer of old references to fall back to in case of lost frames.
If no recovery is possible even with the available buffer, x264 will force a keyframe.
This initial version does not support B-frames or intra refresh.
Recommended usage is to set keyint to a very large value, so that keyframes do not occur except as necessary for extreme error recovery.
Full documentation is in x264.h.
Move DTS/PTS calculation to before encoding each frame instead of after.
Improve documentation of x264_encoder_intra_refresh.
Fiona Glaser [Thu, 17 Jun 2010 21:50:07 +0000 (14:50 -0700)]
Lookaheadless MB-tree support
Uses past motion information instead of future data from the lookahead.
Not as accurate, but better than nothing in zero-latency compression when a lookahead isn't available.
Currently resets on keyframes, so only available if intra-refresh is set, to avoid pops on non-scenecut keyframes.
Not on by default with any preset/tune combination; must be enabled explicitly if --tune zerolatency is used.
Also slightly modify encoding presets: disable rc-lookahead in the fastest presets.
Enable MB-tree in "veryfast", albeit with a very short lookahead.
Lamont Alston [Wed, 16 Jun 2010 17:05:17 +0000 (10:05 -0700)]
Open-GOP support
Allows B-frames immediately prior to keyframes (in display order).
This helps reduce keyframe popping and improve compression with short keyframe intervals.
Due to a staggering display of braindamage in the Blu-ray spec, two open-GOP modes are available.
The two modes calculate keyframe interval differently: one based on coded distance and one based on display distance.
The latter is superior compression-wise, but for no comprehensible reason, Blu-ray requires the former if open-GOP is used.
Steven Walters [Wed, 9 Jun 2010 22:14:52 +0000 (18:14 -0400)]
Use threadpools to avoid unnecessary thread creation
Tiny performance improvement with fast settings and lots of threads.
May help more on some OSs with slow thread creation, like OS X.
Unify the inconsistent abbreviations of "synchronized" to "sync".
Fiona Glaser [Sat, 19 Jun 2010 08:41:07 +0000 (01:41 -0700)]
Improve 2-pass bitrate prediction
Adapt based on distance to the end in bits, not in frames.
Helps in videos with absurdly simple end sections, e.g. black frames.
Fiona Glaser [Sat, 19 Jun 2010 10:27:33 +0000 (03:27 -0700)]
Improve HRD accuracy
In a staggering display of brain damage, the spec requires all HRD math to be done in infinite precision despite the output being of quite limited precision.
Accordingly, convert buffer management to work in units of timescale.
These accumulating rounding errors probably didn't cause any real problems, but might in theory cause issues in very picky muxers on extremely long-running streams.
Fiona Glaser [Tue, 22 Jun 2010 21:20:46 +0000 (14:20 -0700)]
Use -fno-tree-vectorize to avoid miscompilation
Some versions of gcc have been reported to attempt (and fail) to vectorize a loop in plane_expand_border.
This results in a segfault, so to limit the possible effects of gcc's utter incompetence, we're turning off vectorization entirely.
It's not like it ever did anything useful to begin with.
Holger Lubitz [Wed, 9 Jun 2010 11:59:06 +0000 (13:59 +0200)]
Faster mbtree_propagate asm
Replace fp division by multiply with the reciprocal.
Only ~12% faster on Penryn, but over 80% faster on AMD K8.
Also make checkasm slightly more tolerant to rounding error.
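The general trick can be shown in plain C, assuming for simplicity a divisor that stays constant across the loop. This is a generic illustration, not the mbtree_propagate assembly itself; all three function names are hypothetical. Because a reciprocal multiply can round differently from a true division, the comparison helper accepts a small relative error, which is the same reason a test harness needs some tolerance.

```c
#include <assert.h>

/* Baseline: one floating-point division per element. */
void scale_div( float *dst, const float *src, float divisor, int n )
{
    assert( divisor != 0.0f );
    for( int i = 0; i < n; i++ )
        dst[i] = src[i] / divisor;
}

/* Same result up to rounding: compute the reciprocal once, then multiply. */
void scale_recip( float *dst, const float *src, float divisor, int n )
{
    assert( divisor != 0.0f );
    const float recip = 1.0f / divisor;
    for( int i = 0; i < n; i++ )
        dst[i] = src[i] * recip;
}

/* Return 1 if every pair of elements agrees within a relative tolerance. */
int close_enough( const float *a, const float *b, int n, float rel_tol )
{
    for( int i = 0; i < n; i++ )
    {
        float diff = a[i] - b[i];
        float ma = a[i] < 0 ? -a[i] : a[i];
        float mb = b[i] < 0 ? -b[i] : b[i];
        if( diff < 0 )
            diff = -diff;
        if( diff > rel_tol * (ma > mb ? ma : mb) )
            return 0;
    }
    return 1;
}
```

An exact equality check between the two versions could fail on inputs where the last bit differs, even though both results are acceptable.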
Fiona Glaser [Mon, 7 Jun 2010 21:26:05 +0000 (14:26 -0700)]
Template load_pic_pointers based on interlaced
Significantly speeds up cache_load in the non-interlaced case.
Also various other minor optimizations in cache_load and cache_save.
Fiona Glaser [Fri, 4 Jun 2010 04:31:10 +0000 (21:31 -0700)]
Take more shortcuts in i4x4/i8x8 analysis
Based on the scores of the H and V modes, rule out modes which are unlikely.
Small compression loss (0.1-0.5%) and large speed gain (10-30% faster intra analysis).
Not enabled in slower encoding modes.
Also make C versions of the merged SATD functions in order to eliminate branches based on their availability.
Fiona Glaser [Wed, 2 Jun 2010 08:07:44 +0000 (01:07 -0700)]
Add API function to fix x264_picture_t initialization
Calling applications that do not use x264_picture_alloc need to use x264_picture_init to initialize x264_picture_t structures.
Previously, if the calling application didn't zero x264_picture_t, Bad Things could happen.
Oskar Arvidsson [Tue, 1 Jun 2010 23:35:38 +0000 (01:35 +0200)]
Convert to a unified "pixel" type for pixel data
Necessary for future high bit-depth support.
Various macros and extra types have been introduced to make operations on variable-size pixels more convenient.
Fiona Glaser [Fri, 28 May 2010 21:27:22 +0000 (14:27 -0700)]
Add API tool to apply arbitrary quantizer offsets
The calling application can now pass a "map" of quantizer offsets to apply to each frame.
An optional callback to free the map can also be included.
This allows all kinds of flexible region-of-interest coding and similar.
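As a sketch of how an application might build such a map for region-of-interest coding, the code below fills one float per 16x16 macroblock in raster order and applies an offset to every macroblock that overlaps a rectangle. The helper name, its parameters, and the per-macroblock raster layout are assumptions for this example; x264.h is the authority on the real layout and on how the map and its free callback are attached to a picture.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a quantizer-offset map with one float per
 * 16x16 macroblock (raster order, assumed layout). Macroblocks overlapping
 * the pixel rectangle [roi_x0,roi_x1) x [roi_y0,roi_y1) get roi_offset; all
 * others get 0. The caller releases the map with free(). */
float *build_roi_offsets( int width, int height,
                          int roi_x0, int roi_y0, int roi_x1, int roi_y1,
                          float roi_offset )
{
    assert( width > 0 && height > 0 );
    int mb_w = (width  + 15) / 16;
    int mb_h = (height + 15) / 16;
    float *map = calloc( (size_t)mb_w * mb_h, sizeof(*map) );
    if( !map )
        return NULL;
    for( int mb_y = 0; mb_y < mb_h; mb_y++ )
        for( int mb_x = 0; mb_x < mb_w; mb_x++ )
        {
            int px = mb_x * 16, py = mb_y * 16;   /* top-left pixel of this macroblock */
            if( px < roi_x1 && px + 16 > roi_x0 && py < roi_y1 && py + 16 > roi_y0 )
                map[mb_y * mb_w + mb_x] = roi_offset;
        }
    return map;
}
```

A negative offset lowers the quantizer inside the region and so spends more bits there. Since the map comes from calloc, free would be a natural candidate for the optional free callback.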
Fiona Glaser [Thu, 27 May 2010 21:27:32 +0000 (14:27 -0700)]
x86 assembly code for NAL escaping
Up to ~10x faster than C depending on CPU.
Helps the most at very high bitrates (e.g. lossless).
Also make the C code faster and simpler.
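For context, NAL escaping inserts an emulation-prevention byte (0x03) wherever the payload would otherwise contain two zero bytes followed by a byte of 0x03 or less, so that start codes cannot appear inside a NAL unit. The scalar sketch below shows the rule itself; it is an illustration written for this note, not the C or assembly code from this commit, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy src to dst, inserting 0x03 before any byte <= 0x03
 * that follows two consecutive zero bytes. dst must have room for
 * src_len + src_len/2 bytes (the all-zero worst case). Returns bytes written. */
size_t nal_escape( uint8_t *dst, const uint8_t *src, size_t src_len )
{
    assert( dst && src );
    size_t out = 0;
    int zeros = 0;                       /* consecutive zero bytes already written */
    for( size_t i = 0; i < src_len; i++ )
    {
        if( zeros >= 2 && src[i] <= 0x03 )
        {
            dst[out++] = 0x03;           /* emulation-prevention byte */
            zeros = 0;
        }
        dst[out++] = src[i];
        zeros = src[i] == 0 ? zeros + 1 : 0;
    }
    return out;
}
```

Every byte has to pass through this loop, which is why the cost grows with bitrate and why lossless encodes benefit most from a faster version.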
Fiona Glaser [Tue, 25 May 2010 19:42:44 +0000 (12:42 -0700)]
Overhaul deblocking again
Move deblock strength calculation to immediately after encoding to take advantage of the data that's already in cache.
Keep the deblocking itself as per-row.
Fiona Glaser [Tue, 25 May 2010 23:13:59 +0000 (16:13 -0700)]
Detect Atom CPU, enable appropriate asm functions
I'm not going to actually optimize for this pile of garbage unless someone pays me.
But it can't hurt to at least enable the correct functions based on benchmarks.
Also save some cache on Intel CPUs that don't need the decimate LUT due to having fast bsr/bsf.