Jingning Han [Mon, 25 Apr 2016 21:26:53 +0000 (14:26 -0700)]
Rework motion vector precision limit
This commit enables 1/8 luma component motion vector precision
for all motion vector cases. It improves the compression performance
of lowres by 0.13% and hdres by 0.49%.
Yi Luo [Mon, 25 Apr 2016 16:41:11 +0000 (09:41 -0700)]
HBD hybrid transform 4x4 SSE4.1 optimization
- Optimization on tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Overall encoder speed improves ~4.5%-6%.
- Update bit-exact unit test against current C version.
Yi Luo [Wed, 20 Apr 2016 01:23:17 +0000 (18:23 -0700)]
Change hybrid transform function argument from TXFM_2D_CFG* to int
Unit tests show that manually developed SSE4.1 code would perform ~30%
better if the TXFM_2D_CFG configuration is set at a lower level. This
change only updates the function signature. There is no performance
impact.
x->blk_skip used to be uninitialized (leftover from encoding the
previous block), if cm->tx_mode != TX_MODE_SELECT (which is used with
higher --cpu-used or --rt options). This resulted in degraded coding
performance when using cm->tx_mode != TX_MODE_SELECT.
This fixes the VP10/EndToEndTestLarge.EndtoEndPSNRTest/40 unit test.
Also fixed an edge effect where encode_block in encodemb.c used the
formal width of the block (without cropping at the right edge), to
look up blk_skip, while select_tx_block in rdopt.c used the cropped
width to set blk_skip.
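The two issues above can be illustrated with a minimal sketch. It is not the code from the commit: the helper names, the MAX_BLKS_4X4 capacity, and the flat row-major layout are assumptions made for illustration only. It shows why the writer and the reader must agree on the stride (cropped versus formal width), and how stale flags from a previous block might be cleared.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLKS_4X4 256 /* hypothetical capacity, in 4x4 units */

/* Row-major index of a 4x4 unit; stride is the block width in 4x4 units. */
static int blk_skip_index(int row, int col, int stride) {
  return row * stride + col;
}

/* Width in 4x4 units after cropping at the right frame edge.
 * cols_past_edge counts the 4x4 columns that lie outside the frame. */
static int cropped_width_4x4(int formal_width, int cols_past_edge) {
  assert(cols_past_edge >= 0 && cols_past_edge < formal_width);
  return formal_width - cols_past_edge;
}

/* Clear flags left over from the previous block. */
static void reset_blk_skip(uint8_t *blk_skip, int count) {
  memset(blk_skip, 0, (size_t)count * sizeof(*blk_skip));
}
```

With a formal width of 4 and one column past the edge, unit (1, 0) lands at index 3 under the cropped stride but at index 4 under the formal stride, so a flag written with one stride is read back from the wrong slot with the other.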
x->blk_skip used to be uninitialized (leftover from encoding the
previous block), if cm->tx_mode != TX_MODE_SELECT (which is used with
higher --cpu-used or --rt options). This resulted in degraded coding
performance when using cm->tx_mode != TX_MODE_SELECT.
This fixes the VP10/EndToEndTestLarge.EndtoEndPSNRTest/40 unit test.
Jingning Han [Thu, 14 Apr 2016 19:37:10 +0000 (12:37 -0700)]
Handle zero motion vector residual
This commit handles the zero motion vector residuals for single
and compound reference modes, respectively. It improves the coding
performance by 0.13% with no additional encoding complexity.
Remove an unsuccessful adaption of overlap sizes in obmc experiment
We removed this adaption, which was intended to reduce the size of the
overlapped region if the neighboring block is a non-skip one. The
width/height of the overlapping region is now fixed at half of
the current block.
Jingning Han [Fri, 15 Apr 2016 23:51:10 +0000 (16:51 -0700)]
Refactor transform selection process
This commit re-arranges the transform type and size selection
process. It removes an unnecessary rate-distortion cost computation
step. Local experiments show that this speeds up the encoding
process by 6% for both the baseline and the ext-intra experiment.
This CL fixes the bug
rdopt.c:1687: choose_tx_size_from_rd: Assertion
`mbmi->tx_type == DCT_DCT' failed
It is caused by
1) mmx register access before double operation
2) different compiler behaviors
code:
int64_t a = INT64_MAX;
double b = 1. * INT64_MAX;
printf("a < b: %d\n", a < b);
result:
a < b: 0
code:
--target=x86-linux-gcc
int64_t a = INT64_MAX;
double b = 1. * INT64_MAX;
printf("a < b: %d\n", a < b);
result:
a < b: 1
I removed the double operation and tested it with the EXT_TX experiment.
The PSNR change is around 0.05%, which is considered noise level.
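A minimal sketch of why the comparison is fragile, and of an integer-only alternative. The function names are hypothetical and this is not the code from the commit. INT64_MAX is not representable in a double: with a 53-bit significand it rounds to 2^63 under the default rounding mode, so the outcome of a mixed int64_t/double comparison depends on whether the compiler keeps intermediates in wider registers.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when cand is strictly smaller than best, using integers only,
 * so the result does not depend on floating-point evaluation. */
static int rd_cost_is_better(int64_t cand, int64_t best) {
  return cand < best;
}

/* Returns 1 if INT64_MAX, stored as a real double, equals 2^63.
 * volatile forces the value to be stored at double precision. */
static int int64_max_rounds_up_in_double(void) {
  volatile double d = (double)INT64_MAX;
  return d == 9223372036854775808.0; /* 2^63 */
}
```

Comparing INT64_MAX against itself in the integer domain always yields 0, regardless of target or compiler.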
In the RD loop, evaluate OBMC (with its MV copied from the regular
inter predictor) even when wedge interinter is better than regular inter;
previously this case forced allow_obmc = 0. The early-termination
condition before this step is relaxed to avoid skipping too many OBMC
predictions. The overhead rates are properly calculated for these tools.
The logic of the bitstream syntax:
(a single ref) the interintra flag is sent first; only if it is 0 do we
send the obmc flag;
(compound refs) the obmc flag is sent first; only if it is 0 do we send
the wedge interinter flag.
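The signalling order described above can be sketched as follows. The struct, the function name, and the use of a plain int array in place of a bit writer are assumptions made for illustration, not the encoder's actual interface.

```c
#include <assert.h>

/* Hypothetical container for the three tool flags. */
typedef struct {
  int interintra;
  int obmc;
  int wedge_interinter;
} ToolFlags;

/* Appends the flags to out in the order they would be signalled and
 * returns how many flags were written (1 or 2). */
static int signal_tool_flags(int is_compound, const ToolFlags *f, int *out) {
  int n = 0;
  if (!is_compound) {
    out[n++] = f->interintra;               /* single ref: interintra first */
    if (!f->interintra) out[n++] = f->obmc; /* obmc only if interintra == 0 */
  } else {
    out[n++] = f->obmc;                     /* compound refs: obmc first */
    if (!f->obmc) out[n++] = f->wedge_interinter; /* wedge only if obmc == 0 */
  }
  return n;
}
```

In this sketch the second flag is never written when the first one is set, which matches the conditional ordering in the description.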
Yi Luo [Fri, 15 Apr 2016 19:26:27 +0000 (12:26 -0700)]
Improvement on hybrid transform 4x4 DCT_DCT SSE4.1 optimization
- Implemented Angie's new fwd txfm algorithm.
- ~100% faster than the last 64-bit version; 3 times faster than the
original C code.
- Passed bit-exact unit test.
With ext-ref enabled, it is possible that when trying to encode the
first true ALTREF frame after a keyframe, the previous ALTREF frame
(alias for the keyframe) is the same as one of the new LAST{2,3,4}
reference frames, and hence cpi->ref_frame_flags will have the ALTREF
bit clear, as computed by get_ref_frame_flags in encoder.c.
sf->alt_ref_search_fp forces the previous ALTREF frame to
be used as the only possible reference when encoding a new ALTREF
frame, but due to cpi->ref_frame_flags, some buffers will not be
initialized (see rdopt.c:7689 yv12_mb), leading to a segfault.
get_ref_frame_flags in encoder.c has been changed to prefer to keep
the LAST frame, then the ALTREF frame, then any of the LAST{2,3,4}
frames and then the GOLDEN frame in that order of preference in case
any of them are the same. This avoids the segfault and behaves the
same for the baseline.
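The preference order can be sketched as below. This is an illustration under stated assumptions, not the body of get_ref_frame_flags: the enum, the bit layout, and the buf_idx array (one buffer index per reference) are hypothetical. When several references map to the same buffer, only the most preferred one keeps its flag.

```c
#include <assert.h>

/* Hypothetical reference identifiers; bit i of the result means "usable". */
enum { REF_LAST, REF_LAST2, REF_LAST3, REF_LAST4, REF_GOLDEN, REF_ALTREF,
       REF_COUNT };

/* buf_idx[r] is the frame buffer that reference r points to. */
static unsigned dedup_ref_flags(const int buf_idx[REF_COUNT]) {
  /* Preference: LAST, then ALTREF, then LAST2/3/4, then GOLDEN. */
  static const int order[REF_COUNT] = { REF_LAST,  REF_ALTREF, REF_LAST2,
                                        REF_LAST3, REF_LAST4,  REF_GOLDEN };
  unsigned flags = 0;
  for (int i = 0; i < REF_COUNT; ++i) {
    int dup = 0;
    /* A reference is dropped if a more preferred one shares its buffer. */
    for (int j = 0; j < i; ++j)
      if (buf_idx[order[j]] == buf_idx[order[i]]) dup = 1;
    if (!dup) flags |= 1u << order[i];
  }
  return flags;
}
```

In this sketch, if LAST2 and ALTREF share a buffer, the ALTREF bit stays set and the LAST2 bit is cleared, so the ALTREF buffers would still be set up when ALTREF is the only permitted reference.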
Merge changes I92819356,I50b5a313,I807e60c6,I8a8df9fd into nextgenv2
* changes:
Branch dct to new implementation for bd12
Change dct32x32's range
Fit dct's stage range into 32-bit when bitdepth is 12
Pass tx_type into get_tx_scale
Jingning Han [Wed, 13 Apr 2016 22:48:54 +0000 (15:48 -0700)]
Speed up dynamic motion vector referencing system
Skip transform type search in modes with ref_mv_idx > 0. This
brings down the additional encoding time cost due to the DMR system
from 32% to 17%, at minimal coding performance regression.