Jingning Han [Thu, 10 Mar 2016 00:40:08 +0000 (16:40 -0800)]
Enable hybrid 1-D/2-D transform coding for highbd setting
This commit enables the hybrid 1-D/2-D transform coding scheme for
high bit-depth setting. It improves the compression performance of
ext-tx experiment by 0.98% for lowres_all set.
Jingning Han [Wed, 9 Mar 2016 16:58:07 +0000 (08:58 -0800)]
Add horizontal and vertical scan order for 1-D transform
This commit enables the 1-D transform to use Manhattan grid vertical
and horizontal scan order for transform coefficient entropy coding.
Enabled in inter prediction mode, the hybrid 1D/2D transform coding
scheme outperforms the 2D-DCT based coding system used in VP9 by
lowres_all 1.7%
hdres_all 1.4%
As one coding option, in addition to the existing 17 other transform
types in ext-tx experiment, the 1D/2D hybrid transform improves
the coding gains:
lowres_all 2.2% -> 3.0%
- Implemented fdst16_sse2(), fdst16_8col() against C version: fdst16().
- Turned on 7 DST related hybrid txfm types in vp10_fht16x16_sse2().
- Replaced vp10_fht10x10_c() with vp10_fht16x16_sse2() in
fwd_txfm_16x16().
- Added vp10_fht16x16_sse2() unit test against C version:
vp10_fht16x16_c() (--gtest_filter=*VP10Trans16x16*).
- Unit test passed.
- Speed improvement: 2.4%, 3.2%, 3.2%, for city_cif.y4m, garden_sif.y4m,
and mobile_cif.y4m.
Yi Luo [Mon, 7 Mar 2016 22:25:07 +0000 (14:25 -0800)]
Added vp10_fht8x8_sse2() unit test
- Inherited base class TransformTestBase to derived class VP10Trans8x8HT.
- Employed RunCoeffCheck() to test vp10_fht8x8_sse2() against C reference
function vp10_fht8x8_c().
- fdst8_sse2() related seven hybrid transform cases are covered in this
test.
- Test passed (4 test cases w/o EXT_TX; 16 test cases with EXT_TX).
Jingning Han [Sat, 5 Mar 2016 05:23:55 +0000 (21:23 -0800)]
Hybrid 1-D/2-D transform coding
This commit enables a hybrid 1-D/2-D transform coding scheme and
the accompany entropy coding system. It currently uses hybrid
1-D/2-D DCT transform coding. It provides coding performance gains:
Yi Luo [Mon, 29 Feb 2016 17:53:42 +0000 (09:53 -0800)]
Added vp10_fht4x4_sse2() unit test
Inherited class TransformTestBase to derived class VP10Trans4x4HT.
Employed RunCoeffCheck() to test vp10_fht4x4_sse2() against
C reference vp10_fht4x4_c().
fdst4_sse2() related seven hybrid transform cases are covered
in this test.
Wrote a header file for test base class. Some modification to
make sure the base class can be used for 8x8, 16x16, 32x32 cases.
All related tests passed.
Sarah Parker [Tue, 1 Mar 2016 18:12:13 +0000 (10:12 -0800)]
Adding speed feature interface for ext tx search
This sets up the interface for 3 speed features that progressively
eliminate a greater number of transforms in ext tx using
pre-trained support vector machines.
Each speed feature still needs to be implemented.
Alex Converse [Tue, 16 Feb 2016 21:41:01 +0000 (13:41 -0800)]
ANS: Switch from PDFs to CDFs.
Make the RANS implementation operate on cumulative distribution
functions rather than individual probability distribution functions.
CDFs have shown themselves more flexible to work with.
Reduces decoding memory usage from scaling O(num_distributions *
symbol_resolution) to O(num_distributions).
No bitstream change. This is an purely implementation change.
Yi Luo [Wed, 2 Mar 2016 21:45:52 +0000 (13:45 -0800)]
Fixed a computation bug in fdct16_sse2()
fdct16_sse2() was not bit-exact with C reference, fdct16().
The inconsistency was found by writing a unit test for
vp10_fht16x16_sse2(). Since the unit test needs a pending
change on the inherited base class. I will commit this unit
test after making a header file for this base class.
Passed the uncommitted unit test: vp10_fht16x16_test.cc.
Yunqing Wang [Tue, 16 Feb 2016 22:33:18 +0000 (14:33 -0800)]
Do sub-pixel motion search in up-sampled reference frames
Up-sampled the reference frames to 8 times in each dimension using
the 8-tap interpolation filter. In sub-pixel motion search, use the
up-sampled reference frames to find the best matching blocks. This
largely improved the motion search precision, and thus, improved
the compression quality. There was no change in decoder side.
Borg test and speed test results:
1. On derflr set,
Overall PSNR gain: 1.306%, and SSIM gain: 1.512%.
Average speed loss on derf set was 6.0%.
2. On stdhd set,
Overall PSNR gain: 0.754%, and SSIM gain: 0.814%.
On hevchd set,
Overall PSNR gain: 0.465%, and SSIM gain: 0.527%.
Speed loss on HD clips was 3.5%.
Fixes some issues introduced by a merge of two patches.
Also decouples the temporal interpolation filter from the switchable
filters for now for ease of experimentation with both separately.
Jingning Han [Fri, 26 Feb 2016 17:23:43 +0000 (09:23 -0800)]
Unify frame border extension operation
This commit unifies the encoder and decoder border extension and
motion compensated prediction process. Remove the decoder specific
flow to simplify the development flow.
Geza Lore [Mon, 22 Feb 2016 10:55:32 +0000 (10:55 +0000)]
Port interintra experiment from nextgen.
The interintra experiment, which combines an inter prediction and an
inter prediction have been ported from the nextgen branch. The
experiment is merged into ext_inter, so there is no separate configure
option to enable it.
Angie Chiang [Mon, 22 Feb 2016 22:11:05 +0000 (14:11 -0800)]
convolve8 sse2 test
This experiment shows that when frame size is 64x64
vpx_highbd_convolve8_sse2 and vpx_convolve8_sse2's speed are similar.
However when frame size becomes 1024x1024
vpx_highbd_convolve8_sse2 is around 50% slower than vpx_convolve8_sse2
we think the bottleneck is from memory IO
Yi Luo [Wed, 24 Feb 2016 00:59:38 +0000 (16:59 -0800)]
Implemented DST 8x8 with SSE2 intrinsics.
Implemented fdst8_sse2() function against C version: fdst8().
Added seven DST related hybrid transform types in vp10_fht8x8_sse2().
Replaced vp10_fht8x8_c() with vp10_fht8x8_sse2() in fwd_txfm_8x8().
Speedup: 18.1%, 11.5%, 22.0% based on speed test from
city_cif.y4m, garden_sif.y4m, mobile_cif.y4m.
Adds hooks to use 32x32 ext-tx. Also adds scan orders for the masked
transforms for 32x32.
Make macro USE_MSKTX_FOR_32X32 1 in blockd.h to support 32x32 masked
transforms for ext-tx.
Jingning Han [Tue, 23 Feb 2016 17:25:24 +0000 (09:25 -0800)]
Enable context based motion vector entropy coding
This commit enables a context based motion vector entropy coding
conditioned on dynamic reference motion vector list. This (along with
the previous CL) imporves the coding gains due to dynamic motion
vector referencing based entropy coding:
derf 0.1%
hevcmr 0.2%
stdhd 0.7%
hevchr 0.4%
Yue Chen [Thu, 18 Feb 2016 20:35:14 +0000 (12:35 -0800)]
Optimizing obmc rd decision by checking the real rd cost
Instead of using model_rd_for_sb() to estimate the cost and make the
decision on bmc/obmc, we use super_block_yrd/uvrd() to calculate and
compare the real rd costs of bmc and obmc.
Average bit-rate reduction(%) of obmc experiment:
derflr/derfhd/hevcmr/hevchd
2.353/TBD/TBD/TBD
Before the optimization, the coding gain was:
1.582/1.109/1.600/1.164
Note: there is still some mysterious bug because that compared to
the previous version, the performance at low bit rate drops a lot.