[ RUN ] NEON/VpxHBDSubpelAvgVarianceTest.Ref/33
=================================================================
==535205==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff95bb0b89 at pc 0x00000116dabc bp 0xffffd09f6430 sp 0xffffd09f6428
READ of size 8 at 0xffff95bb0b89 thread T0
#0 0x116dab8 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
#1 0x116dab8 in highbd_var_filter_block2d_bil_w4
vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
#2 0x116dab8 in vpx_highbd_8_sub_pixel_avg_variance4x4_neon
vpx_dsp/arm/highbd_subpel_variance_neon.c:543:1
...
0xffff95bb0b89 is located 0 bytes to the right of 73-byte region
[0xffff95bb0b40,0xffff95bb0b89)
allocated by thread T0 here:
#0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
#1 0xce4a40 in vpx_memalign vpx_mem/vpx_mem.c:62:10
#2 0xce4a40 in vpx_malloc vpx_mem/vpx_mem.c:70:40
#3 0xa52238 in (anonymous namespace)::SubpelVarianceTest<unsigned
int (*)(unsigned char const*, int, int, int, unsigned char
const*, int, unsigned int*, unsigned char
const*)>::SetUp()
test/variance_test.cc:586:14
...
This is the same issue as: e33d4c276 disable vpx_highbd_*_sub_pixel_variance4x{4,8}_neon
They have highbd_var_filter_block2d_bil_w4 in common.
[ RUN ] NEON/VpxHBDSubpelVarianceTest.Ref/24
=================================================================
==450528==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8311a571 at pc 0x0000010ca52c bp 0xffffc63e96b0 sp 0xffffc63e96a8
READ of size 8 at 0xffff8311a571 thread T0
#0 0x10ca528 in load_unaligned_u16q vpx_dsp/arm/mem_neon.h:176:3
#1 0x10ca528 in highbd_var_filter_block2d_bil_w4
vpx_dsp/arm/highbd_subpel_variance_neon.c:49:21
#2 0x10ca528 in vpx_highbd_10_sub_pixel_variance4x8_neon
vpx_dsp/arm/highbd_subpel_variance_neon.c:257:1
...
0xffff8311a571 is located 0 bytes to the right of 113-byte region
[0xffff8311a500,0xffff8311a571)
allocated by thread T0 here:
#0 0x5f18b0 in malloc (test_libvpx+0x5f18b0)
#1 0xce4f90 in vpx_memalign vpx_mem/vpx_mem.c:62:10
#2 0xce4f90 in vpx_malloc vpx_mem/vpx_mem.c:70:40
#3 0xa4ad44 in (anonymous namespace)::SubpelVarianceTest<unsigned
int (*)(unsigned char const*, int, int, int, unsigned char
const*, int, unsigned int*)>::SetUp() test/variance_test.cc:586:14
James Zern [Tue, 7 Mar 2023 22:48:54 +0000 (22:48 +0000)]
Merge changes Ic021e82e,I2bce6f19,I250ab56e,I910692b1,Iefaa774d into main
* changes:
Implement highbd_d207_predictor using Neon
Implement highbd_d153_predictor using Neon
Implement d207_predictor using Neon
Implement d153_predictor using Neon
Implement highbd_d63_predictor using Neon
Salome Thirot [Wed, 1 Mar 2023 10:06:01 +0000 (10:06 +0000)]
Optimize vp9_block_error_fp_neon
Currently vp9_block_error_fp_neon is only used when
CONFIG_VP9_HIGHBITDEPTH is set to false. This patch optimizes the
implementation and uses tran_low_t instead of int16_t so that the
function can also be used in builds where vp9_highbitdepth is enabled.
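A minimal scalar sketch of why switching to tran_low_t lets one function serve both build types. It assumes tran_low_t is a 32-bit type when high bitdepth is enabled and a 16-bit type otherwise; the CONFIG_VP9_HIGHBITDEPTH default below and the name block_error_fp_ref are stand-ins for illustration, not the Neon code from this patch.

```c
#include <stdint.h>

/* Stand-in for the build-time setting; assumed value for this sketch only. */
#ifndef CONFIG_VP9_HIGHBITDEPTH
#define CONFIG_VP9_HIGHBITDEPTH 1
#endif

/* Assumption: coefficients widen to 32 bits in high bitdepth builds. */
#if CONFIG_VP9_HIGHBITDEPTH
typedef int32_t tran_low_t;
#else
typedef int16_t tran_low_t;
#endif

/* Hypothetical reference: sum of squared differences between the original
 * and dequantized coefficients. The same source compiles for either
 * coefficient width because it is written against tran_low_t. */
static int64_t block_error_fp_ref(const tran_low_t *coeff,
                                  const tran_low_t *dqcoeff, int block_size) {
  int64_t error = 0;
  for (int i = 0; i < block_size; ++i) {
    const int64_t diff = (int64_t)coeff[i] - dqcoeff[i];
    error += diff * diff;
  }
  return error;
}
```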
Jonathan Wright [Mon, 6 Mar 2023 17:52:13 +0000 (17:52 +0000)]
Optimize vpx_sum_squares_2d_i16_neon
Add an additional 32-bit vector accumulator to allow parallel
processing on CPUs that have more than one Neon multiply-accumulate
pipeline. Also use sum_neon.h horizontal-add helpers for reduction.
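The idea of a second accumulator can be sketched in portable scalar C. This is an illustration only: the function name is hypothetical, and it uses 64-bit scalar accumulators rather than the 32-bit Neon vectors described above. The two running sums do not depend on each other, so a CPU with more than one multiply-accumulate pipeline can advance both chains in parallel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar analogue of a two-accumulator sum of squares. */
static uint64_t sum_squares_i16_two_acc(const int16_t *src, size_t n) {
  uint64_t acc0 = 0; /* accumulates even-indexed elements */
  uint64_t acc1 = 0; /* accumulates odd-indexed elements */
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    /* The square of an int16_t always fits in int32_t. */
    acc0 += (uint64_t)((int32_t)src[i] * src[i]);
    acc1 += (uint64_t)((int32_t)src[i + 1] * src[i + 1]);
  }
  if (i < n) acc0 += (uint64_t)((int32_t)src[i] * src[i]); /* odd tail */
  return acc0 + acc1; /* final reduction of the two partial sums */
}
```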
George Steed [Mon, 6 Mar 2023 13:24:47 +0000 (13:24 +0000)]
Fix potential buffer over-read in highbd d117 predictor Neon
The load of `left[bs]` in the standard bitdepth d117 Neon implementation
triggered an address-sanitizer failure.
The highbd equivalent does not appear to trigger any asan failures when
running the VP9/ExternalFrameBufferMD5Test or
VP9/TestVectorTest.MD5Match tests, but for consistency with the standard
bitdepth implementation we adjust it to avoid the over-read.
Performance is roughly identical, with a 0.8% performance improvement on
average over the previous optimised code.
The implementation is mostly identical to the original but with an
adjustment to how data is loaded from the `left` array. In particular
the left array cannot be guaranteed to be larger than the block size, so
the read of e.g. `left[32]` in the `bs=32` case is not valid. This turns
out not to be a problem since the last lane loaded in this case is
unused. I have added comments in the code to explain why this is the
case.
Since we cannot load the last element directly, we instead construct it
from the previous aligned read. This seems to have an inconsistent
effect on performance, improving by up to 10% in some cases and
regressing by up to 10% on others. Either way it is still significantly
faster than the original C code.
Compared to the previous implementation attempt we now correctly match
the behaviour of the C code when handling the final element loaded from
the 'above' input array. In particular:
- The C code for a 4x4 block performs a full average of the last element
rather than duplicating the final element from the input 'above'
array.
- The C code for other block sizes performs a full average for the
stride=0 and stride=1, and otherwise shifts in duplicates of the final
element from the input 'above' array. Notably this shifting for later
strides _replaces_ the final element which we previously performed an
average on (see {d0,d1}_ext in the code).
It is worth noting that this difference is not caught by the existing
VP9HighbdIntraPredTest test cases since the test vector initialisation
contains this loop:
for (int x = block_size; x < 2 * block_size; x++) {
above_row_[x] = above_row_[block_size - 1];
}
Since AVG2(a, a) and AVG3(a, a, a) are simply 'a', such differences in
behaviour for the final element are not observed.
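The identity can be checked with a short sketch. The AVG2 and AVG3 definitions below are the usual rounded averages and are assumed to match the ones used by the C predictors; the checking function is hypothetical.

```c
#include <stdint.h>

/* Rounded two-tap and three-tap averages (assumed form). */
#define AVG2(a, b) (((a) + (b) + 1) >> 1)
#define AVG3(a, b, c) (((a) + 2 * (b) + (c) + 2) >> 2)

/* Returns 1 if both averages collapse to `a` for every 12-bit sample value,
 * which is why a constant tail in above_row_ hides the difference. */
static int averages_are_identity_on_equal_inputs(void) {
  for (uint32_t a = 0; a < (1u << 12); ++a) {
    if (AVG2(a, a) != a) return 0;
    if (AVG3(a, a, a) != a) return 0;
  }
  return 1;
}
```

A test that filled the second half of above_row_ with varying values would make AVG2 and AVG3 produce results that differ from plain duplication, exposing the mismatch.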
James Zern [Fri, 3 Mar 2023 23:33:16 +0000 (15:33 -0800)]
disable vp8_sixtap_predict16x16_neon
This causes various buffer overflows in the tests:
[ RUN ] NEON/SixtapPredictTest.TestWithPresetData/0
=================================================================
==22346==ERROR: AddressSanitizer: global-buffer-overflow on address
0x0000012b4a5b at pc 0x000000df0f60 bp 0xffffcf6e64b0 sp 0xffffcf6e64a8
READ of size 8 at 0x0000012b4a5b thread T0
#0 0xdf0f5c in vp8_sixtap_predict16x16_neon
vp8/common/arm/neon/sixtappredict_neon.c:1507:13
#1 0x8819e4 in (anonymous
namespace)::SixtapPredictTest_TestWithPresetData_Test::TestBody()
test/predict_test.cc:293:3
...
0x0000012b4a5b is located 2 bytes to the right of global variable
'kTestData' defined in '../test/predict_test.cc:237:24' (0x12b48a0) of
size 441
[ RUN ] NEON/SixtapPredictTest.TestWithRandomData/0
=================================================================
==22338==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8b5321fb at pc 0x000000df0f60 bp 0xfffff7e0cf30 sp 0xfffff7e0cf28
READ of size 8 at 0xffff8b5321fb thread T0
#0 0xdf0f5c in vp8_sixtap_predict16x16_neon
vp8/common/arm/neon/sixtappredict_neon.c:1507:13
#1 0x87d4c0 in (anonymous
namespace)::PredictTestBase::TestWithRandomData(void (*)(unsigned
char*, int, int, int, unsigned char*, int))
test/predict_test.cc:170:9
...
0xffff8b5321fb is located 2 bytes to the right of 441-byte region
[0xffff8b532040,0xffff8b5321f9)
allocated by thread T0 here:
#0 0x5fd4f0 in operator new[](unsigned long) (test_libvpx+0x5fd4f0)
#1 0x87c2e0 in (anonymous namespace)::PredictTestBase::SetUp()
test/predict_test.cc:47:12
#2 0x87d074 in non-virtual thunk to (anonymous
namespace)::PredictTestBase::SetUp() test/predict_test.cc
...
James Zern [Fri, 3 Mar 2023 20:56:29 +0000 (20:56 +0000)]
disable vpx_get4x4sse_cs_neon
This function causes a heap overflow in the tests:
[ RUN ] NEON/VpxSseTest.RefSse/0
=================================================================
==876922==ERROR: AddressSanitizer: heap-buffer-overflow on address
0xffff8949d903 at pc 0x000000dd95d4 bp 0xfffffdd7f260 sp 0xfffffdd7f258
READ of size 8 at 0xffff8949d903 thread T0
#0 0xdd95d0 in vpx_get4x4sse_cs_neon
vpx_dsp/arm/variance_neon.c:556:10
#1 0x9d4894 in (anonymous namespace)::MainTestClass<unsigned int
(*)(unsigned char const*, int, unsigned char const*,
int)>::RefTestSse() test/variance_test.cc:531:5
#2 0x9d4894 in (anonymous
namespace)::VpxSseTest_RefSse_Test::TestBody()
test/variance_test.cc:772:30
...
0xffff8949d903 is located 3 bytes to the right of 16-byte region
[0xffff8949d8f0,0xffff8949d900)
allocated by thread T0 here:
#0 0x5fd050 in operator new[](unsigned long) (test_libvpx+0x5fd050)
#1 0x9d3e04 in (anonymous namespace)::MainTestClass<unsigned int
(*)(unsigned char const*, int, unsigned char const*,
int)>::SetUp() test/variance_test.cc:299:12
This causes ASan errors:
[ RUN ] VP9/TestVectorTest.MD5Match/1
=================================================================
==837858==ERROR: AddressSanitizer: stack-buffer-overflow on address
0xffff82ecad40 at pc 0x000000c494d4 bp 0xffffe1695800 sp 0xffffe16957f8
READ of size 16 at 0xffff82ecad40 thread T0
#0 0xc494d0 in vpx_d117_predictor_32x32_neon (test_libvpx+0xc494d0)
#1 0x1040b34 in vp9_predict_intra_block (test_libvpx+0x1040b34)
#2 0xf8feec in decode_block (test_libvpx+0xf8feec)
#3 0xf8f588 in decode_partition (test_libvpx+0xf8f588)
#4 0xf7be5c in vp9_decode_frame (test_libvpx+0xf7be5c)
...
Address 0xffff82ecad40 is located in stack of thread T0 at offset 64 in
frame
#0 0x103fd3c in vp9_predict_intra_block (test_libvpx+0x103fd3c)
This frame has 2 object(s):
[32, 64) 'left_col.i' <== Memory access at offset 64 overflows this
variable
[96, 176) 'above_data.i'
Original change's description:
> Allow macroblock_plane to have its own rounding buffer
>
> Add 8 bytes buffer to macroblock_plane to support rounding factor.
>
> Change-Id: I3751689e4449c0caea28d3acf6cd17d7f39508ed
While porting this function to NEON, using the SSE4_1 implementation
as a base, I noticed that both produced files with different
checksums from the C reference implementation. After investigating
further, I found that this saturating pack was the culprit. Doing
the multiplication on the 32-bit values instead produces results
that match the C implementation.
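A small sketch of the effect described above, with hypothetical names and values: narrowing to 16 bits with saturation before multiplying clamps large inputs, so the product no longer matches the one computed on the full 32-bit value. The function affected in libvpx is not named in this message, so this only illustrates the arithmetic.

```c
#include <stdint.h>

/* Saturating narrow from 32 to 16 bits, as a pack instruction would do. */
static int16_t saturate_s16(int32_t v) {
  if (v > INT16_MAX) return INT16_MAX;
  if (v < INT16_MIN) return INT16_MIN;
  return (int16_t)v;
}

/* Narrow first, then multiply: large inputs are clamped before use. */
static int32_t mul_after_saturating_pack(int32_t value, int32_t factor) {
  return (int32_t)saturate_s16(value) * factor;
}

/* Multiply on the 32-bit value directly (callers keep the product in range). */
static int32_t mul_on_32bit(int32_t value, int32_t factor) {
  return value * factor;
}
```

For inputs that fit in 16 bits the two paths agree; they diverge only once the input exceeds the int16_t range, which is consistent with a mismatch that shows up as a checksum difference rather than a crash.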
Salome Thirot [Mon, 27 Feb 2023 17:58:18 +0000 (17:58 +0000)]
Optimize Neon implementation of high bitdepth MSE functions
Currently MSE functions just call the variance helpers but don't
actually use the computed sum. This patch adds dedicated helpers to
perform the computation of sse.
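The saving can be sketched in scalar C. Both helper names are hypothetical: the first mirrors a variance-style helper that tracks the sum and the sum of squared errors, the second drops the sum because MSE never reads it.

```c
#include <stdint.h>

/* Variance-style helper: accumulates both sum and sse. */
static void variance_helper(const uint16_t *src, int src_stride,
                            const uint16_t *ref, int ref_stride, int w, int h,
                            int64_t *sum, uint64_t *sse) {
  *sum = 0;
  *sse = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int32_t diff = (int32_t)src[x] - ref[x];
      *sum += diff;
      *sse += (uint64_t)((int64_t)diff * diff);
    }
    src += src_stride;
    ref += ref_stride;
  }
}

/* MSE-style helper: same traversal, but only the squared error is kept. */
static uint64_t mse_helper(const uint16_t *src, int src_stride,
                           const uint16_t *ref, int ref_stride, int w, int h) {
  uint64_t sse = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int32_t diff = (int32_t)src[x] - ref[x];
      sse += (uint64_t)((int64_t)diff * diff);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sse;
}
```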
James Zern [Tue, 28 Feb 2023 21:50:11 +0000 (21:50 +0000)]
Merge changes I892fbd2c,Ic59df16c,I7228327b,Ib4a1a2cb into main
* changes:
Implement highbd_d117_predictor using Neon
Implement highbd_d63_predictor using Neon
Implement d117_predictor using Neon
Implement d63_predictor using Neon
Salome Thirot [Fri, 24 Feb 2023 18:05:43 +0000 (18:05 +0000)]
Add Neon implementations of standard bitdepth MSE functions
Currently only vpx_mse16x16 has a Neon implementation. This patch adds
optimized Armv8.0 and Armv8.4 dot-product paths for all block sizes:
8x8, 8x16, 16x8 and 16x16.
Jonathan Wright [Sat, 25 Feb 2023 00:43:46 +0000 (00:43 +0000)]
Optimize transpose_neon.h helper functions
1) Use vtrn[12]q_[su]64 in vpx_vtrnq_[su]64* helpers on AArch64
targets. This produces half as many TRN1/2 instructions compared to
the number of MOVs that result from vcombine.
2) Use vpx_vtrnq_[su]64* helpers wherever applicable.
3) Refactor transpose_4x8_s16 to operate on 128-bit vectors.
James Zern [Fri, 24 Feb 2023 17:58:15 +0000 (17:58 +0000)]
Merge changes I65d86038,If3299fe5,I3ef1ff19 into main
* changes:
Add Neon implementation of high bitdepth 32x32 hadamard transform
Add Neon implementation of high bitdepth 16x16 hadamard transform
Add Neon implementation of high bitdepth 8x8 hadamard transform
James Zern [Wed, 22 Feb 2023 19:34:30 +0000 (11:34 -0800)]
vp9_block.h: rename diff struct to Diff
This matches the style guide and fixes some -Wshadow warnings related to
variables with the same name. Something similar was done in libaom in:
863b04994b Fix warnings reported by -Wshadow: Part2: av1 directory
Deepa K G [Thu, 16 Feb 2023 16:17:24 +0000 (21:47 +0530)]
Skip redundant iterations in joint motion search
In joint_motion_search, there are four iterations.
Even iterations search in the first reference frame
and odd iterations search in the second. The last two
iterations use the search result of the first two
iterations as the start point. If the search result does
not change, the last two iterations are not necessary and can
be skipped.
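One reading of this early exit, as a sketch rather than the encoder's actual code: the Mv type, toy_refine and joint_search_sketch are all hypothetical, and the real search is replaced by a toy step function so the control flow can be exercised.

```c
#include <stdint.h>

typedef struct {
  int row, col;
} Mv; /* hypothetical motion vector type */

/* Toy stand-in for a real refinement search: one step towards `target`. */
static Mv toy_refine(Mv start, Mv target) {
  Mv out = start;
  if (out.row < target.row) ++out.row;
  else if (out.row > target.row) --out.row;
  if (out.col < target.col) ++out.col;
  else if (out.col > target.col) --out.col;
  return out;
}

static int mv_equal(Mv a, Mv b) { return a.row == b.row && a.col == b.col; }

/* Runs up to four iterations; even ones refine mv[0], odd ones refine mv[1].
 * Returns the number of iterations actually executed. */
static int joint_search_sketch(Mv mv[2], const Mv target[2]) {
  Mv initial[2];
  int ite;
  initial[0] = mv[0];
  initial[1] = mv[1];
  for (ite = 0; ite < 4; ++ite) {
    const int id = ite % 2;
    /* If the first two iterations left both vectors unchanged, the last two
     * would start from the same point and find the same result. */
    if (ite == 2 && mv_equal(mv[0], initial[0]) && mv_equal(mv[1], initial[1]))
      break;
    mv[id] = toy_refine(mv[id], target[id]);
  }
  return ite;
}
```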
James Zern [Tue, 14 Feb 2023 02:46:51 +0000 (02:46 +0000)]
Merge changes Id74a6d9c,I5c31e0e9,Id5a2b2d9,I73182c97,I2f5916d5, ... into main
* changes:
Optimize vpx_highbd_comp_avg_pred_neon
Add Neon AvgPredTestHBD test suite
Specialize Neon high bitdepth avg subpel variance by filter value
Specialize Neon high bitdepth subpel variance by filter value
Refactor Neon high bitdepth avg subpel variance functions
Optimize Neon high bitdepth subpel variance functions
Salome Thirot [Thu, 9 Feb 2023 16:45:01 +0000 (16:45 +0000)]
Specialize Neon high bitdepth avg subpel variance by filter value
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to find the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
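A scalar sketch of the specialization, under stated assumptions: filter offsets are taken to be in eighths of a pixel (0 to 7) with taps (8 - offset, offset), and both function names are hypothetical. Offset 0 reduces to a copy and offset 4 to a rounded average, so only the remaining offsets need the full multiply path.

```c
#include <stdint.h>

/* General bilinear filter for any offset in 0..7.
 * src must hold at least w + pixel_step samples. */
static void bilinear_row_generic(const uint16_t *src, int pixel_step,
                                 int offset, uint16_t *dst, int w) {
  for (int x = 0; x < w; ++x) {
    const uint32_t a = src[x];
    const uint32_t b = src[x + pixel_step];
    dst[x] = (uint16_t)((a * (uint32_t)(8 - offset) + b * (uint32_t)offset + 4) >> 3);
  }
}

/* Specialized dispatch: cheaper paths for the two special offsets. */
static void bilinear_row(const uint16_t *src, int pixel_step, int offset,
                         uint16_t *dst, int w) {
  if (offset == 0) {
    for (int x = 0; x < w; ++x) dst[x] = src[x]; /* plain copy */
  } else if (offset == 4) {
    for (int x = 0; x < w; ++x) /* rounded average, no multiplies */
      dst[x] = (uint16_t)(((uint32_t)src[x] + src[x + pixel_step] + 1) >> 1);
  } else {
    bilinear_row_generic(src, pixel_step, offset, dst, w);
  }
}
```

The specialized paths return exactly what the general formula would, so the dispatch changes only the cost, not the output.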
Salome Thirot [Thu, 9 Feb 2023 14:16:30 +0000 (14:16 +0000)]
Specialize Neon high bitdepth subpel variance by filter value
Use the same specialization as for standard bitdepth. The rationale for
the specialization is as follows:
The optimal implementation of the bilinear interpolation depends on the
filter values being used. For both horizontal and vertical interpolation
this can simplify to just taking the source values, or averaging the
source and reference values - which can be computed more easily than a
bilinear interpolation with arbitrary filter values.
This patch introduces tests to find the optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes.
Salome Thirot [Wed, 8 Feb 2023 16:50:59 +0000 (16:50 +0000)]
Refactor Neon high bitdepth avg subpel variance functions
Use the same general code style as in the standard bitdepth Neon
implementation - merging the computation of vpx_highbd_comp_avg_pred
with the second pass of the bilinear filter to avoid storing and loading
the block again.
Also move vpx_highbd_comp_avg_pred_neon to its own file (like the
standard bitdepth implementation) since we're no longer using it for
averaging sub-pixel variance.
Salome Thirot [Tue, 7 Feb 2023 14:08:33 +0000 (14:08 +0000)]
Optimize Neon high bitdepth subpel variance functions
Use the same general code style as in the standard bitdepth Neon
implementation. Additionally, do not unnecessarily widen to 32-bit data
types when doing bilinear filtering - allowing us to process twice as
many elements per instruction.
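Why 16 bits can be enough, as a sketch: assuming samples of at most 12 bits and filter taps that sum to 8 (both assumptions of this example, as are the function names), the largest intermediate is 4095 * 8 + 4 = 32764, which fits in a uint16_t. Twice as many 16-bit lanes as 32-bit lanes fit in one vector register.

```c
#include <stdint.h>

/* Bilinear blend with every intermediate truncated to 16 bits, emulating a
 * 16-bit vector lane. offset is assumed to be in 0..7. */
static uint16_t bilinear_u16(uint16_t a, uint16_t b, unsigned offset) {
  const uint16_t blend = (uint16_t)(a * (8u - offset) + b * offset);
  return (uint16_t)((uint16_t)(blend + 4u) >> 3);
}

/* Same computation carried out in 32 bits, for comparison. */
static uint16_t bilinear_u32(uint16_t a, uint16_t b, unsigned offset) {
  const uint32_t blend = (uint32_t)a * (8u - offset) + (uint32_t)b * offset;
  return (uint16_t)((blend + 4u) >> 3);
}
```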
chiyotsai [Wed, 8 Feb 2023 21:54:46 +0000 (13:54 -0800)]
Remove CONFIG_CONSISTENT_RECODE flag
Currently, libvpx does not properly clear and re-initialize the memories
when it re-encodes a frame. As a result, out-of-date values are used in
the encoding process, and re-encoding a frame with the same parameter
will give different outputs.
This commit enables the code under CONFIG_CONSISTENT_RECODE to correct
this behavior. This change has a minor effect on the coding performance,
but it ensures valid values are used in the encoding process.
Furthermore, the flag is removed as it is now always turned on.
Jerome Jiang [Thu, 9 Feb 2023 19:37:33 +0000 (14:37 -0500)]
Merge tag 'v1.13.0'
Release v1.13.0 Ugly Duckling
2023-01-31 v1.13.0 "Ugly Duckling"
This release includes more Neon and AVX2 optimizations, adds a new codec
control to set per frame QP, upgrades GoogleTest to v1.12.1, and includes
numerous bug fixes.
- Upgrading:
This release is ABI incompatible with the previous release.
New codec control VP9E_SET_QUANTIZER_ONE_PASS to set per frame QP.
GoogleTest is upgraded to v1.12.1.
.clang-format is upgraded to clang-format-11.
VPX_EXT_RATECTRL_ABI_VERSION was bumped due to incompatible changes to the
feature of using external rate control models for vp9.
- Enhancement:
Numerous improvements on Neon optimizations.
Numerous improvements on AVX2 optimizations.
Additional ARM targets added for Visual Studio.
- Bug fixes:
Fix to calculating internal stats when frame dropped.
Fix to segfault for external resize test in vp9.
Fix to build system with replacing egrep with grep -E.
Fix to a few bugs with external RTC rate control library.
Fix to make SVC work with VBR.
Fix to key frame setting in VP9 external RC.
Fix to -Wimplicit-int (Clang 16).
Fix to VP8 external RC for buffer levels.
Fix to VP8 external RC for dynamic update of layers.
Fix to VP9 auto level.
Fix to off-by-one error of max w/h in validate_config.
Fix to make SVC work for Profile 1.
Jonathan Wright [Thu, 9 Feb 2023 11:57:10 +0000 (11:57 +0000)]
Optimize Neon high bitdepth convolve copy
Use standard loads and stores instead of the significantly slower
interleaving/de-interleaving variants. Also move all loads in loop
bodies above all stores as a mitigation against the compiler thinking
that the src and dst pointers alias (since we can't use restrict in
C89.)
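The load-before-store ordering can be sketched in scalar C; the function name is hypothetical and the real code uses Neon loads and stores. Reading every source value into locals before the first write means no load has to wait on a store that the compiler cannot prove is to a different address.

```c
#include <stdint.h>

/* Copies w samples, four at a time, with all loads ahead of all stores. */
static void copy_row_loads_first(const uint16_t *src, uint16_t *dst, int w) {
  int x;
  for (x = 0; x + 4 <= w; x += 4) {
    /* All loads first... */
    const uint16_t s0 = src[x + 0];
    const uint16_t s1 = src[x + 1];
    const uint16_t s2 = src[x + 2];
    const uint16_t s3 = src[x + 3];
    /* ...then all stores. */
    dst[x + 0] = s0;
    dst[x + 1] = s1;
    dst[x + 2] = s2;
    dst[x + 3] = s3;
  }
  for (; x < w; ++x) dst[x] = src[x]; /* remaining samples */
}
```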