Yi Luo [Tue, 21 Mar 2017 00:18:10 +0000 (17:18 -0700)]
Make butterfly_self() signature consistent with butterfly()
- Refer to patch: 48fca113d inv_txfm_ssse3,butterfly: fix win32 abi
compatibility.
- Change four butterfly() calls to butterfly_self(), to simplify the
operations.
Marco [Thu, 16 Mar 2017 22:55:33 +0000 (15:55 -0700)]
vp9: Use sb content measure to bias against golden.
For each superblock, keep track of how far back from the current frame
the last significant content change was, and use that (along
with GF distance) to turn off GF search in non-rd pickmode.
Only enabled for speed >= 8.
avgPSNR on RTC/RTC_derf down by ~0.9/1.2.
Speedup on mac: ~3-5%.
Speedup on arm: 3.6% for VGA and 4.4% for HD.
Yunqing Wang [Thu, 16 Mar 2017 22:45:07 +0000 (15:45 -0700)]
Record the sum of tx block eobs in the partition block
The sum of tx block eobs is needed in the machine learning based partition
early termination. The eobs are first accumulated during tx search, and
then the value associated with the best tx_size is copied to ctx for later
use.
Now that the sum of eobs is calculated correctly, re-enabled the
ml_partition_search_early_termination speed feature.
Re-did the quality/speed test to check the impact of the fix.
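
The accumulate-then-copy step described above can be sketched roughly as
follows. This is a minimal illustration, not the libvpx code: TxCandidate,
PartitionCtx, MAX_TX_BLOCKS, sum_candidate_eobs() and pick_best_tx() are
hypothetical names, and the selection rule (lowest rd cost wins) is an
assumption.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TX_BLOCKS 16 /* hypothetical upper bound for this sketch */

/* Hypothetical result of trying one tx_size on a partition block. */
typedef struct {
  int64_t rd_cost;         /* cost of coding the block with this tx_size */
  int num_blocks;          /* number of tx blocks at this tx_size */
  int eobs[MAX_TX_BLOCKS]; /* end-of-block position of each tx block */
} TxCandidate;

/* Hypothetical context that is read later by the partition search. */
typedef struct {
  int sum_eobs;
} PartitionCtx;

/* Accumulate the eobs of all tx blocks of one candidate. */
int sum_candidate_eobs(const TxCandidate *cand) {
  int i, sum = 0;
  for (i = 0; i < cand->num_blocks; ++i) sum += cand->eobs[i];
  return sum;
}

/* Pick the candidate with the lowest rd cost and copy its eob sum to ctx.
 * Returns the index of the chosen candidate. */
int pick_best_tx(const TxCandidate *cands, int num_cands, PartitionCtx *ctx) {
  int i, best = 0, best_sum;
  assert(num_cands > 0);
  best_sum = sum_candidate_eobs(&cands[0]);
  for (i = 1; i < num_cands; ++i) {
    const int sum = sum_candidate_eobs(&cands[i]);
    if (cands[i].rd_cost < cands[best].rd_cost) {
      best = i;
      best_sum = sum;
    }
  }
  ctx->sum_eobs = best_sum; /* only the winner's sum is kept */
  return best;
}
```

The point of the sketch is that the sum stored in the context belongs to
the winning tx_size, not to whichever candidate happened to be searched last.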
Gabriel Marin [Wed, 14 Dec 2016 20:07:34 +0000 (12:07 -0800)]
Add a vector form of routine vp9_model_rd_from_var_lapndz
Add routine vp9_model_rd_from_var_lapndz_vec and call it from model_rd_for_sb
to model the rate and distortion for MAX_MB_PLANE Laplacian sources in
parallel. The caller ensures that all sources have non-zero variance.
Measured an 18% to 25% reduction in retired instructions, and 17% to 24%
reduction in instruction execution cost with different compilers for the
Laplacian modeling.
No change in behavior.
TEST=Verified that encoded files match bit for bit, with and without this
change.
BUG=b/33678225
Marco [Thu, 16 Mar 2017 19:47:44 +0000 (12:47 -0700)]
vp9: Fixes in non-rd pickmode for denoising with SVC.
Don't denoise spatial layer frames whose base layer is a key frame.
Disallow golden reference for SVC with denoising on frames
that will be denoised (highest layer), as this removes bad artifact.
Will re-enable when issue is resolved.
Marco [Wed, 15 Mar 2017 23:51:34 +0000 (16:51 -0700)]
vp9: Fix some issues with denoiser and SVC.
Fix the update of the denoiser buffer when the base
spatial layer is a key frame. And allow for better/lower
QP on high spatial layers when their base layer is key frame.
The PartialIDctTest.ResultsMatch test on vpx_idct32x32_135_add_neon()
in high bit-depth mode exposes 16-bit overflow in final stage of pass
2, when changing the test number from 1,000 to 1,000,000.
Change to use saturating add/sub for vpx_idct32x32_34_add_neon(),
vpx_idct32x32_135_add_neon() and vpx_idct32x32_1024_add_neon() in high
bit-depth mode.
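
To illustrate why a saturating add/sub avoids this kind of failure, here is
a scalar sketch of both behaviours on 16-bit lanes. The function names are
hypothetical and the code is not taken from the patch. NEON provides
saturating intrinsics such as vqaddq_s16() and vqsubq_s16(); whether the
patch uses exactly those is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Wrapping 16-bit add: the sum is reduced modulo 2^16, so a large positive
 * result comes back negative. (The final conversion to int16_t is
 * implementation-defined in C; gcc and clang wrap.) */
int16_t add_wrap_s16(int16_t a, int16_t b) {
  return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Saturating 16-bit add: compute in 32 bits, then clamp to the int16 range. */
int16_t add_sat_s16(int16_t a, int16_t b) {
  const int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}

/* Saturating 16-bit subtract, same idea. */
int16_t sub_sat_s16(int16_t a, int16_t b) {
  const int32_t diff = (int32_t)a - (int32_t)b;
  if (diff > INT16_MAX) return INT16_MAX;
  if (diff < INT16_MIN) return INT16_MIN;
  return (int16_t)diff;
}

/* Small self-check: out-of-range results clamp, in-range results are exact. */
void saturation_self_check(void) {
  assert(add_sat_s16(30000, 30000) == INT16_MAX);
  assert(sub_sat_s16(-30000, 30000) == INT16_MIN);
  assert(add_sat_s16(100, 200) == 300);
}
```

With saturation, an out-of-range intermediate clamps to the nearest
representable value instead of flipping sign.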
Yunqing Wang [Mon, 27 Feb 2017 22:26:15 +0000 (14:26 -0800)]
Apply machine learning-based early termination in VP9 partition search
This patch was based on Yang Xian's intern project code. Further modifications
were done.
1. Moved machine-learning related parameters into the context structure.
2. Corrected the calculation of sum_eobs.
3. Removed unused parameters and calculations.
4. Made it work with multiple tiles.
5. Added a speed feature for the machine-learning based partition search
early termination.
6. Re-organized the code.
Marco [Sat, 11 Mar 2017 06:50:43 +0000 (22:50 -0800)]
vp9: Fix condition for intra search in non-rd pickmode.
Fixes an issue when neither LAST nor golden is used as a reference,
in which case it's possible no encoding mode is set (since intra may be
skipped under certain conditions). Fix is to make sure intra is searched
if no inter mode is checked.
Issue can happen for temporal layer pattern#7 in vpx_temporal_svc_encoder.c
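
The guard described above amounts to something like the following sketch.
The function and parameter names are hypothetical, and the real condition
in non-rd pickmode involves more state than is shown here.

```c
#include <assert.h>

/* Decide whether the intra search must run.
 * skip_intra_hint: nonzero when the usual heuristics would skip intra.
 * num_inter_modes_checked: how many inter modes were actually evaluated. */
int should_search_intra(int skip_intra_hint, int num_inter_modes_checked) {
  assert(num_inter_modes_checked >= 0);
  /* If no inter mode was checked, intra is the only way to set a mode. */
  if (num_inter_modes_checked == 0) return 1;
  return !skip_intra_hint;
}
```

The fallback guarantees that at least one encoding mode is evaluated for
the block, even when no reference frame is usable.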
James Zern [Fri, 10 Mar 2017 07:36:11 +0000 (23:36 -0800)]
vp9/encoder: fix segfault on win32 using vs < 2015
shift the bsse[] member of the macroblock struct to the front to avoid
an incorrect offset (0) to the upper half of bsse[0] which leads to a
negative resulting in a crash. restrict this to visual studio versions
before 2015 (the bug was observed with 2013, fixed in 2015) to avoid any
potential cache impact on other platforms.
- Split the inv txfm into three parts to avoid stack spillover.
- Function level speed improves ~12%.
- Use function and macro to remove some repeated code.
James Zern [Thu, 9 Mar 2017 04:42:35 +0000 (20:42 -0800)]
move vp9_scale_and_extend_frame_c to vp9_frame_scale.c
this is similar to the x86 configuration and helps mitigate an issue
with a circular dependency between this function and the ssse3 variant
causing an outsized increase in binary size (~300K for chrome)
chrome.dll:
.text 255B000 -> 252B000
.data 7B000 -> 75000
-221184 bytes
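
The total can be re-derived from the two section sizes, assuming the values
above are hexadecimal byte counts (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Change in size of one section, in bytes (negative means it shrank). */
int64_t section_delta(uint32_t before, uint32_t after) {
  return (int64_t)after - (int64_t)before;
}

/* Combined change for the .text and .data sizes quoted above. */
int64_t chrome_dll_delta(void) {
  const int64_t text = section_delta(0x255B000, 0x252B000); /* -0x30000 */
  const int64_t data = section_delta(0x7B000, 0x75000);     /* -0x6000 */
  assert(text < 0 && data < 0); /* both sections got smaller */
  return text + data;
}
```

.text shrinks by 0x30000 = 196608 bytes and .data by 0x6000 = 24576 bytes,
which together give the 221184 bytes quoted above.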
Yunqing Wang [Wed, 8 Mar 2017 20:24:15 +0000 (12:24 -0800)]
Make the partition search early termination feature frame size dependent
The 2 thresholds (i.e. partition_search_breakout_dist_thr and
partition_search_breakout_rate_thr) are used by the partition search
early termination speed feature. This refactoring patch makes this
feature frame size dependent consistently throughout the code.
Linfeng Zhang [Tue, 7 Mar 2017 21:06:06 +0000 (13:06 -0800)]
Update vpx_idct32x32_1024_add_neon()
Most are cosmetic changes.
Speed is unchanged with clang 3.8, and about 5% faster with gcc 4.8.4.
Tried the strategy used in 8x8 and 16x16 (whose order of operations is
similar to the C code); though speed gets better with gcc, it's worse
with clang.
Tried to remove store_in_output(), but speed gets worse.