Yunqing Wang [Wed, 22 Feb 2017 20:24:16 +0000 (12:24 -0800)]
Improve VP9 encoder threading test for better coverage
Re-organized the encoder threading tests and grouped tests into
4 parts. Added PSNR checking test to make sure the PSNR variation
is within a small range.
Johann [Fri, 17 Feb 2017 01:57:44 +0000 (17:57 -0800)]
consolidate block_error functions
vp9_highbd_block_error_8bit_c was a very simple wrapper around
vp9_block_error_c. The SSE2 implemention was practically identical to
the non-HBD one. It was missing some minor improvements which only
went into the original version.
In quick speed tests, the AVX implementation showed minimal
improvement over SSE2 when it does not detect overflow. However, when
overflow is detected the function is run a second time. The
OperationCheck test seems to trigger this case and reverses any
speed benefits by running ~60% slower. AVX2 on the other hand is
always 30-40% faster.
Jerome Jiang [Wed, 22 Feb 2017 22:24:02 +0000 (14:24 -0800)]
Make vp9_scale_and_extend_frame_ssse3 work for hbd when bitdepth = 8.
Only works for bitdepth = 8 when compiled with high bitdepth flag.
4x speed ups for handling 1:2 down/upsampling.
Validated manually for:
1) Dynamic resize for a single layer encoding
2) SVC encoding with 3 spatial layers
Results are bitexact with the patch and the speed gain (~4x) in the
scaling was verified.
Marco [Mon, 13 Feb 2017 18:16:42 +0000 (10:16 -0800)]
vp9: Incorporate source sum_diff into non-rd partition thresholds.
Increase the variance partition thresholds for superblocks that
have low sum-diff (from source analysis prior to encoding frame).
Use it for now only for speed >= 7 or for denoising on.
Small change on metrics for rtc set: less than ~0.1 avgPNSR decrease
on RTC set, for both speed 7 and 8.
Marco [Tue, 21 Feb 2017 04:15:40 +0000 (20:15 -0800)]
vp9: Fix for non-rd pickmode for high-bitdepth build.
Use the simple block_yrd under certain conditions.
The optimization code is completed but the speed is still slower
(~6% on 720p) than the low-bitdepth build.
For now, use the more complex block_yrd under certain conditions
(always use it for speed <= 5, otherwise use it on key frames and for
bsize >= 32x32).
This gives about ~2-3% gain in quality for speed 7 on RTC set
(over high bitdepth build), with about the same encoder fps as the
low bitdepth build.
paulwilkins [Wed, 15 Feb 2017 16:41:38 +0000 (16:41 +0000)]
Change to prediction decay calculation.
This change subtracts out low complexity intra regions that are also low
error in the inter domain, in the calculation of the frame prediction decay.
The rationale here his that low complexity regions (such as sky) do not imply
high prediction decay in the same way as high error intra or neutral blocks.
The effect of this is small in most clips but in a few clips it can be > 10%.
(E.g. In to tree)
Johann Koenig [Thu, 16 Feb 2017 20:34:48 +0000 (20:34 +0000)]
Merge changes I267050a5,Iebade0ef,Id96a8df3
* changes:
quantize_fp_32x32 highbd ssse3: enable existing function
quantize_fp highbd ssse3: use tran_low_t for coeff
quantize_fp highbd sse2: use tran_low_t for coeff
Johann [Thu, 16 Feb 2017 01:17:45 +0000 (17:17 -0800)]
bitdepth conversion: really use num elements
The previous implementation confused bit/bytes/elements. It was using
'32' as the multiplier but that was mistakenly adopted because a 32x32
transform embedded the stride.
Marco [Wed, 15 Feb 2017 21:51:14 +0000 (13:51 -0800)]
vp9: Some code cleanup for aq-mode = 3.
The weight segment needs to only be computed once per frame,
so remove it from the funciton vp9_cyclic_refresh_rc_bits_per_mb(),
which is called within a loop inside vp9_rc_regulate_q.
Marco [Wed, 15 Feb 2017 17:18:34 +0000 (09:18 -0800)]
Vp9: Speed 8 aq-mode=3: Reduce computation in estimating bits per mb.
vp9_compute_qdelta_by_rate has almost 2% overhead in profiling on Nexus 6.
Reduce the calling of that function in speed 8 by estimating the delta-q.
Both rtc and rtc_derf show little/no change in avg psnr/ssim.
Encoding speed is 2~3% faster on Nexus 6.
paulwilkins [Wed, 15 Feb 2017 10:33:10 +0000 (10:33 +0000)]
Disconnect ARF breakout from frame boost.
This small change replaces the frame boost check in the arf group
length break out clause with a test against a prediction decay value.
The boost value is in fact partly dependent on the decay value but
this change means that the per frame boost calculation can be adjusted
without influencing the group length calculation.
The value chosen gives a close match on all the test sets with the previous
code (on average) but it was noted that a lower threshold was slightly better
for 1080P and up and a slightly higher value for small image sizes.
(Yunqing Wang)
This patch implements the row-based multi-threading within tiles in
the encoding pass, and substantially speeds up the multi-threaded
encoder in VP9.
Speed tests at speed 1 on STDHD(using 4 tiles) set show that the
average speedups of the encoding pass(second pass in the 2-pass
encoding) is 7% while using 2 threads, 16% while using 4 threads,
85% while using 8 threads, and 116% while using 16 threads.
Johann [Tue, 14 Feb 2017 22:26:09 +0000 (14:26 -0800)]
Use 'packssdw' for loading tran_low_t values
This matches bitdepth_conversion_sse2.asm and produces substantially
better assembly. The old way had lots of 'movzwl' and 'shl' and storing
back to memory before loading into an xmm register.