Jingning Han [Fri, 10 Jan 2014 02:01:30 +0000 (18:01 -0800)]
Enable skipping reference frame check in rd loop
This commit allows encoder to compare the SAD cost associated with
the best motion vector predictor, per frame. If one reference frame
has this cost more than 4 times of the best SAD cost given by other
reference frames, skip NEARESTMV, NEARMV, ZEROMV mode check of this
reference frame.
This setting is turned on in speed 2 and above. Compression quality
change in speed 2:
derf -0.014%
yt -0.097%
hd -0.023%
stdhd 0.046%
It reduces the speed 2 runtime of test sequences:
pedestrian_area_1080p 4000 kbps 310763 ms -> 303595 ms
bluesky_1080p 6000 kbps 259852 ms -> 251920 ms
Jingning Han [Thu, 9 Jan 2014 20:43:40 +0000 (12:43 -0800)]
Optimze inv 16x16 DCT with 10 non-zero coeffs - P2
This commit further optimizes SSE2 operations in the second 1-D
inverse 16x16 DCT, with (<10) non-zero coefficients. The average
runtime of this module goes down from 779 cycles -> 725 cycles.
Jingning Han [Tue, 7 Jan 2014 22:35:02 +0000 (14:35 -0800)]
Optimze inv 16x16 DCT with 10 non-zero coeffs - P1
This commit is the first patch optimizing SSE2 implementation of inverse
16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row)
transformation. It exploits the fact that only top-left 4x4 block contains
non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coeffients.
The average runtime of idct16x16_10 unit is reduced from
883 cycles -> 779 cycles (12% faster).
For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
down from 310651 ms -> 305910 ms. The decoding speed goes up from
80.37 fps -> 80.87 fps.
levytamar82 [Sun, 29 Dec 2013 08:23:50 +0000 (01:23 -0700)]
AVX2 Variance Optimization
Optimizing the variance functions: vp9_variance16x16, vp9_variance32x32,
vp9_variance64x64, vp9_variance32x16, vp9_variance64x32,
vp9_mse16x16 by migrating to AVX2
some of the functions were optimized by processing 32 elements instead of 16.
some of the functions were optimized by processing 2 loop strides of 16
elements in a single 256 bit register
This optimization gives between 2.4% - 2.7% user level performance gain
and 42% function level gain.
Jingning Han [Tue, 7 Jan 2014 17:53:38 +0000 (09:53 -0800)]
Fix an issue in motion vector prediction stage
The previous implementation stops motion vector prediction test when
the zero motion vector appears for the second time. This commit fixes
it by simply skipping the second time check on zero mv and continuing
on to next mv candidate.
It slightly improves stdhd in speed 2 by 0.06% on average. Most static
sequences are not affected. A few hard ones, like jet, ped, and riverbed
were improved by 0.1 - 0.2%.
Jingning Han [Mon, 6 Jan 2014 21:29:16 +0000 (13:29 -0800)]
Remove avoid_frame_with_high_error from RD loop
The feature undergoes prior assumption that the recursive partition
size search from 4x4 to 64x64, hence utilizing information from small
blocks to determine early termination in large block rate-distortion
optimization search. The current codebase is now going from top down.
The previous function might go with not properly initialized values,
hence removed.
Tested on pedestrian_area_1080p at 4000 kbps running under speed 2.
No visible difference in runtime observed.
Deb Mukherjee [Fri, 3 Jan 2014 23:41:57 +0000 (15:41 -0800)]
Corerctly sets frame type in the 2 pass case
This patch sets frame types correctly in the new
vp9_get_second_pass_params() function called prior
to encode_frame_to_data_rate() function, so that the
latter function can just work with what is passed to
it. This will allow multiple vp9_get_second_pass_params()
to be created for various encode strategies without
messing with the core encode function.
There is no difference in derf and yt. stdhd/hd are pending.
Jingning Han [Fri, 3 Jan 2014 23:05:25 +0000 (15:05 -0800)]
Tune IDCT8_1D macro function interface
This commit adds input/output ports for IDCT8_1D macro function to
provide more flexibility in variable use. It allows to skip several
buffer swap operations.
Jingning Han [Thu, 2 Jan 2014 23:33:38 +0000 (15:33 -0800)]
Rework idct8x8_10 SSE2 implementation
This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits
the fact that only top-left 4x4 block contains non-zero coefficients,
and hence reduces the instructions needed.
The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
frames coded at 4000kbps, the average decoding speed goes up from
79.3 fps to 79.7 fps.
Paul Wilkins [Thu, 2 Jan 2014 15:45:06 +0000 (15:45 +0000)]
Modified Handling of min and max vbr rates.
In two pass encodes bits are allocated to each frame
according to a modified error score for the frame as a
fraction of the modified error score for the clip or section.
Previously a minimum rate per frame was reserved and
subtracted from the bits allocatable by the two pass code.
The vbr max section rate was enforced by clipping the
actual number of bits allocated.
In this patch the min and max vbr rates are enforced
instead by clipping the modified error scores for each frame
rather than the number of bits allocated.
Small gains for all test sets (psnr and SSIM) ranging from
~ +0.05 for YT psnr up to ~ +0.25 for Std-hd SSIM.
Dmitry Kovalev [Fri, 27 Dec 2013 22:01:12 +0000 (14:01 -0800)]
Removing vpx_codec_vp9x_cx and internal experimental flag.
vpx_codec_vp9x_cx is not used internally. Experimental flag from
vp9_extracfg is also not really used. YUV 4:4:4 just works after these
changes (you have to specify --profile=1 for the encoder).
Jingning Han [Fri, 20 Dec 2013 23:24:22 +0000 (15:24 -0800)]
Adaptive motion control on ref and search range
This commit takes a preliminary attempt to refine the motion search
control. It detects the SAD associated with mv predictor per reference
frame, and based on which to determine whether the encoder wants to
reduce the motion search range (if the predicted mv provides fairly
small SAD), or to skip the current reference frame (if there exists
another ref frame that gives much smaller SAD cost).
This feature is turned on in the settings of speed 1 and above.
In speed 1, compression performance changed
derf -0.018%
yt -0.043%
hd -0.045%
stdhd -0.281%