Peter Boström [Wed, 11 Jan 2017 17:28:03 +0000 (12:28 -0500)]
Add Y,U,V channel metrics and unweighted metrics.
Renames SSIM to VpxSSIM as an upscaled weighted SSIM metric, then prints
Y, U and V channels unweighted as well as a weighted but not scaled SSIM
score that's 8/1/1 parts Y/U/V (same as VpxSSIM).
Jingning Han [Mon, 9 Jan 2017 22:00:29 +0000 (14:00 -0800)]
Rework forward 8x8 2D-DCT ssse3 implementation
This commit reworks the SSSE3 implementation of the forward 8x8
2D-DCT. It uses a cyclic rotation approach to the temporary xmm
registers. It reduces the average cycles from 158 to 154. The SSE2
version uses 169 cycles.
Marco [Tue, 10 Jan 2017 00:38:49 +0000 (16:38 -0800)]
vp9: 1 pass cbr: Adjustments to usage of gf_cbr_boost and aq=3 mode.
When aq=3 mode is on and the gf_cbr_boost is set: make sure golden frame
is always refreshed, and don't incorporate segement cost in qp setting
on the boosted golden frame.
Better performance on RTC set with gf_cbr_boost on,
for example with gf_cbr_boost=50, gains from ~0.5-3%.
hui su [Fri, 2 Dec 2016 18:11:33 +0000 (10:11 -0800)]
Add support for VP9 level targeting
Constraints on encoder config:
-target_bandwidth is no larger than 80% of level bitrate limit
-target_bandwidth * (1 + max_over_shoot_pct) is no larger than
88% of level bitrate limit
-min_gf_interval is no smaller than level limit
-tile_columns is no larger than level limit
Constraints on rate control:
-current frame size plus previous three frames' size is no larger
than the CPB level limit
-current frame size is no larger than 50%/40%/20% of the CPB
level limit if it's a key/alt-ref/other frame.
Jerome Jiang [Thu, 5 Jan 2017 00:19:42 +0000 (16:19 -0800)]
vp9: Compute source sad for every superblock when partition copy is on.
The source sad could be used to copy the partition without going into
choose_partitioning function to speed up vp9 encoding. Computing source
sad takes little time. Speed test on Android and Linux shows little
encoding time gain (less than 1.4%).
Turned off for now since partition copy is turned off.
Jerome Jiang [Wed, 4 Jan 2017 19:22:51 +0000 (11:22 -0800)]
vp9: Set short circuit to level 3 for VGA for speed 8.
vp9: Set short circuit to level 3 for VGA for speed 8. Also change the
threshold_32x32 to 5/8*thresholds[1] to improve quality regression
caused to VGA clips.
Bug fix to avoid random crashes during ARNR filtering
The function 'vp9_find_best_sub_pixel_tree_pruned_more' is modified
to return INT_MAX for handling invalid MV cases from UINT32_MAX.
yunqingwang:
patch 3: rebased on top of the tree.
patch 4: The return type of vp9_find_best_sub_pixel_tree* was changed
to uint32_t to fix ubsan warnings. Changing UINT_MAX back to INT_MAX
was not quite right. Patch 4 modified vp9_temporal_filter.c to accept
uint32_t.
(Note: Inconsistency exists in vp9_find_best_sub_pixel_tree*, which
will be fixed in a separate CL.)
This change results in unsigned integer overflows (as reported by
-fsanitize=integer) in datarate_test.cc,
for many of --gtest_filter=VP9/DatarateOnePassCbrSvc.OnePassCbrSvc*:
unsigned integer overflow: 167198 - 185560 cannot be represented in type
'unsigned long'
As the encoder didn't change, but the input with the change to
(correctly) use Y4mVideoSource, this revert is merely masking the issue.
Gabriel Marin [Wed, 14 Dec 2016 19:07:50 +0000 (11:07 -0800)]
Remove superfluous conditional on 'shortcut'
Remove superfluous test. Produces a small improvement in instruction scheduling.
Measured a 1% to 1.5% reduction in execution time for routine vp9_optimize_b
with different compilers.
No change in behavior.
TEST=Verified that encoded files match bit for bit, with and without this
change.
BUG=b/33678225
Jerome Jiang [Mon, 19 Dec 2016 18:39:04 +0000 (10:39 -0800)]
vp9: Add feature to copy partition from the last frame.
Add feature to copy partition from the last frame.
The copy is only done under certain conditions that SAD is below threshold.
Feature is currently disabled, until threshold is tuned.
Feature will be initially used for Speed 8 (ARM).
Under extreme case of always copying partition for speed 8:
Encode time is reduced by 5.4% on rtc_derf and 7.8% on rtc.
Overall PSNR reduced by 2.1 on rtc_derf and 0.968 on rtc.
Gabriel Marin [Wed, 14 Dec 2016 00:22:48 +0000 (16:22 -0800)]
Simplify address arithmetic in vp9_optimize_b
Simplify address arithmetic on token_costs to reduce the number of generated
instructions that are used for address arithmetic inside routine
vp9_optimize_b. It also helps improve instruction scheduling depending on
compiler and optimization level.
Measured a 9.3% reduction in retired instructions and 5.3% reduction in
execution time for this routine with GCC v4.8.4 and optimization flags -O3,
and a reduction of up to 11.6% in execution time with other compilers.
No change in behavior.
TEST=Verified that encoded files match bit for bit, with and without this
change.
BUG=b/33678225
relocate the assignment to 'in' outside of the for loop. this quiets a
spurious warning in visual studio builds since: 86e340c enable vpx_idct32x32_1024_add_neon in hbd builds
Marco [Fri, 16 Dec 2016 19:15:57 +0000 (11:15 -0800)]
vp9: Change condition to enable recheck_zeromv_after_denoising.
For when denoising enabled: change condition to enable
the recheck_zeromv_after_denoising for only very high noise level.
This is causing an issue, so enabling it for very high noise
to effectively shut it off.
Johann [Fri, 16 Dec 2016 20:19:00 +0000 (12:19 -0800)]
postproc test: disable new down and across test
The new test is causing valgrind failures:
[ RUN ] SSE2/VpxPostProcDownAndAcrossMbRowTest.CheckCvsAssembly/0
==28923== Invalid read of size 16
28923== at 0x724016: ??? (deblock_sse2.asm:146)
Disable during investigation. The test is new but the code is not.
Jim Bankoski [Fri, 16 Dec 2016 16:50:55 +0000 (08:50 -0800)]
vp8 : use threading mutex's for tsan only.
To avoid decode performance hit of 2% when running on hyperthreaded
cores.
This patch only uses the mutex's when we are running tsan.
This is safe because 32 bit operations like read and store are atomic
on all the platforms we care about. Tsan warns about race situations,
but in this case either situation ( read occurs before write or write
before read) the worst case is that we go around one extra time in the
loop. So the ordering doesn't really matter.
That said a few other things have been tried :
for instance as per here:
webrtc/base/atomicops.h#52
In this patch they use:
__atomic_load_n(i, __ATOMIC_ACQUIRE);
__atomic_store_n(i, value, __ATOMIC_RELEASE);
This code works on gcc, clang ( replacing protected write and read), and
avoids tsan errors. Incurring no penalty in performance. In C11 its
replaced by straight atomic operands.
However there is no equivalent in the visual studio's we support as
int32 on all windows platforms is already atomic. To avoid tsan like
warnings on windows we'd need to use interlocked exchange and the
end result doesn't gain us any thing.
Marco [Wed, 14 Dec 2016 22:08:09 +0000 (14:08 -0800)]
vp9: Fix to usage of flag USE_ALTREF_FOR_ONE_PASS
The flag USE_ALTREF_FOR_ONE_PASS allows for alt-ref lookahead
in 1 pass vbr (from https://chromium-review.googlesource.com/#/c/365498).
This change is to make sure this macro flag only has effect if
the config flag cpi->oxcf.enable_auto_altef is also on.
No change in ytlive encoding, as USE_ALTREF_FOR_ONE_PASS is not
yet enabled.