Geza Lore [Wed, 28 Oct 2015 14:35:04 +0000 (14:35 +0000)]
Add AVX vectorized vp9_diamond_search_sad
This function now has an AVX intrinsics version which is about 80%
faster compared to the C implementation. This provides a 2-4% total
speed-up for encode, depending on encoding parameters. The function
utilizes 3 properties of the cost function lookup table, constructed
in 'cal_nmvjointsadcost' and 'cal_nmvsadcosts'.
For the joint cost:
- mvjointsadcost[1] == mvjointsadcost[2] == mvjointsadcost[3]
For the component costs:
- For all i: mvsadcost[0][i] == mvsadcost[1][i]
(equal per component cost)
- For all i: mvsadcost[0][i] == mvsadcost[0][-i]
(Cost function is even)
These must hold, otherwise the AVX version of the function cannot be used.
This breaks 32-bit builds:
runtime error: load of misaligned address 0xf72fdd48 for type 'const
__m128i' (vector of 2 'long long' values), which requires 16 byte
alignment
+ _mm_set1_epi64x is incompatible with some versions of visual studio
Marco [Fri, 6 Nov 2015 00:00:15 +0000 (16:00 -0800)]
vp9: Updates to noise estimation.
Add threshold/condition on spatial_variance and brightness level.
Modification to normalization of block variance.
Change resolution limit below which we disable noise estimation.
Geza Lore [Wed, 28 Oct 2015 14:35:04 +0000 (14:35 +0000)]
Add AVX vectorized vp9_diamond_search_sad
This function now has an AVX intrinsics version which is about 80%
faster compared to the C implementation. This provides a 2-4% total
speed-up for encode, depending on encoding parameters. The function
utilizes 3 properties of the cost function lookup table, constructed
in 'cal_nmvjointsadcost' and 'cal_nmvsadcosts'.
For the joint cost:
- mvjointsadcost[1] == mvjointsadcost[2] == mvjointsadcost[3]
For the component costs:
- For all i: mvsadcost[0][i] == mvsadcost[1][i]
(equal per component cost)
- For all i: mvsadcost[0][i] == mvsadcost[0][-i]
(Cost function is even)
These must hold, otherwise the AVX version of the function cannot be used.
Marco [Fri, 30 Oct 2015 18:51:06 +0000 (11:51 -0700)]
Bias against non-zero mv for large blocks.
Change is only for real-time mode, speed > 5, and non-screen content mode.
Bias is based on block size and motion vector level (motion above some threshold).
Helps to improves stability in background from lightning changes.
PSNR/SSIM metrics on RTC set almost no change/neutral (within +/- 0.1).
Marco [Sun, 1 Nov 2015 22:40:05 +0000 (14:40 -0800)]
Move noise level estimate outside denoiser.
Source noise level estimate is also useful for
setting variance encoder parameters (variance thresholds,
qp-delta, mode selection, etc), so allow it to be used also
if denoising is not on.
Alex Converse [Wed, 14 Oct 2015 18:03:14 +0000 (11:03 -0700)]
palette: Replace rand() call with custom LCG.
The custom LCG is based on the POSIX recommend constants for a 16-bit
rand(). This implementation uses less computation than typical standard
library procedures which have been extended for 32-bit support, is
guaranteed to be reentrant, and identical everywhere.
base_frame_target is supposed to track the idealized bit
allocation based on error score and not the actual bits
allocated to each frame.
The clamping of this value based on the VBR min and max pct values
was causing a bug where in some cases the loop that adjusts the
active max quantizer for each GF group was running out of bits at
the end of a KF group. This caused a spike in Q and some ugly artifacts.
A second change makes sure that the calculation of the active
Q range for a group DOES, however, take account of clamping.
Jingning Han [Fri, 23 Oct 2015 00:25:00 +0000 (17:25 -0700)]
Use explicit block position in foreach_transformed_block
Add the row and column index to the argument list of unit functions
called by foreach_transformed_block wrapper. This avoids the
repeated internal parsing according to the block index.
Geza Lore [Thu, 15 Oct 2015 17:28:31 +0000 (18:28 +0100)]
Optimize vp9_highbd_block_error_8bit assembly.
A new version of vp9_highbd_error_8bit is now available which is
optimized with AVX assembly. AVX itself does not buy us too much, but
the non-destructive 3 operand format encoding of the 128bit SSEn integer
instructions helps to eliminate move instructions. The Sandy Bridge
micro-architecture cannot eliminate move instructions in the processor
front end, so AVX will help on these machines.
Further 2 optimizations are applied:
1. The common case of computing block error on 4x4 blocks is optimized
as a special case.
2. All arithmetic is speculatively done on 32 bits only. At the end of
the loop, the code detects if overflow might have happened and if so,
the whole computation is re-executed using higher precision arithmetic.
This case however is extremely rare in real use, so we can achieve a
large net gain here.
The optimizations rely on the fact that the coefficients are in the
range [-(2^15-1), 2^15-1], and that the quantized coefficients always
have the same sign as the input coefficients (in the worst case they are
0). These are the same assumptions that the old SSE2 assembly code for
the non high bitdepth configuration relied on. The unit tests have been
updated to take this constraint into consideration when generating test
input data.
Ronald S. Bultje [Tue, 20 Oct 2015 17:31:40 +0000 (13:31 -0400)]
vp10: clip MVs before adding to find_ref_mvs() list.
This causes the output of find_ref_mvs() to always be unique or zero.
A nice side-effect of this is that it also causes the output of
find_ref_mvs_sub8x8() to be unique-or-zero, and it will not ignore
available candidate MVs under certain conditions.