Salome Thirot [Fri, 20 Jan 2023 11:42:06 +0000 (11:42 +0000)]
Specialize Neon averaging subpel variance by filter value
Use the same specialization for averaging subpel variance functions
as used for the non-averaging variants. The rationale for the
specialization is as follows:
The optimal implementation of the bilinear interpolation depends on
the filter values being used. For both horizontal and vertical
interpolation this can simplify to just taking the source values, or
averaging the source and reference values - which can be computed
more easily than a bilinear interpolation with arbitrary filter
values.
This patch introduces tests to find the most optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes
This is a backport of this libaom change[1].
After this change, the only differences between the code in libvpx and
libaom are due to libvpx being compiled with ISO C90, which forbids
mixing declarations and code [-Wdeclaration-after-statement].
Salome Thirot [Fri, 20 Jan 2023 11:21:02 +0000 (11:21 +0000)]
Refactor Neon averaging subpel variance functions
Merge the computation of vpx_comp_avg_pred into the second pass of the
bilinear filter - avoiding the overhead of loading and storing the
entire block again.
Salome Thirot [Fri, 20 Jan 2023 10:35:34 +0000 (10:35 +0000)]
Specialize Neon subpel variance by filter value for large blocks
The optimal implementation of the bilinear interpolation depends on
the filter values being used. For both horizontal and vertical
interpolation this can simplify to just taking the source values, or
averaging the source and reference values - which can be computed
more easily than a bilinear interpolation with arbitrary filter
values.
This patch introduces tests to find the most optimal bilinear
interpolation implementation based on the filter values being used.
This new specialization is only used for larger block sizes
(>= 16x16) as we need to be doing enough work to make the cost of
finding the optimal implementation worth it.
This is a backport of this libaom change[1].
After this change, the only differences between the code in libvpx and
libaom are due to libvpx being compiled with ISO C90, which forbids
mixing declarations and code [-Wdeclaration-after-statement].
Jerome Jiang [Thu, 12 Jan 2023 20:58:00 +0000 (15:58 -0500)]
Add codec control to set per frame QP
Use case is for 1 pass encoding.
Forces max_quantizer = min_quantizer and aq-mode = 0.
Applicalble to spatial layers, where user may set
the QP per spatial layer.
James Zern [Thu, 19 Jan 2023 03:19:01 +0000 (19:19 -0800)]
*/Android.mk: add a check for NDK_ROOT
This simplifies integration with the Android platform and avoids the
files from being used when a non-NDK build is performed. In that case
Android.bp is preferred.
Salome Thirot [Mon, 16 Jan 2023 16:44:04 +0000 (16:44 +0000)]
Refactor Neon implementation of variance functions
Refactor and optimize the Neon implementation of variance functions -
effectively backporting these libaom changes[1,2].
After this change, the only differences between the code in libvpx and
libaom are due to libvpx being compiled with ISO C90, which forbids
mixing declarations and code [-Wdeclaration-after-statement].
Marco Paniconi [Sat, 14 Jan 2023 03:46:10 +0000 (19:46 -0800)]
Fix to segfault for external resize test in vp9
Failure occurs for 1 pass non-realtime mode at speed 0.
Due to speed feautre rd_ml_partition.var_pruning, which
doesn't check for scaled reference in simple_motion_search().
Jonathan Wright [Wed, 18 May 2022 15:58:50 +0000 (16:58 +0100)]
Implement vertical convolutions using Neon USDOT instruction
Add additional AArch64 paths for vpx_convolve8_vert_neon and
vpx_convolve8_avg_vert_neon that use the Armv8.6-A USDOT (mixed-sign
dot-product) instruction. The USDOT instruction takes an 8-bit
unsigned operand vector and a signed 8-bit operand vector to produce
a signed 32-bit result. This is helpful because convolution filters
often have both positive and negative values, while the 8-bit pixel
channel data being filtered is all unsigned. As a result, the USDOT
convolution paths added here do not have to do the "transform the
pixel channel data to [-128, 128) and correct for it later" dance
that we have to do with the SDOT paths.
The USDOT instruction is optional from Armv8.2 to Armv8.5 but
mandatory from Armv8.6 onwards. The availability of the USDOT
instruction is indicated by the feature macro
__ARM_FEATURE_MATMUL_INT8. The SDOT paths are retained for use on
target CPUs that do not implement the USDOT instructions.
Jonathan Wright [Wed, 18 May 2022 13:14:56 +0000 (14:14 +0100)]
Implement horizontal convolutions using Neon USDOT instruction
Add additional AArch64 paths for vpx_convolve8_horiz_neon and
vpx_convolve8_avg_horiz_neon that use the Armv8.6-A USDOT (mixed-sign
dot-product) instruction. The USDOT instruction takes an 8-bit
unsigned operand vector and a signed 8-bit operand vector to produce
a signed 32-bit result. This is helpful because convolution filters
often have both positive and negative values, while the 8-bit pixel
channel data being filtered is all unsigned. As a result, the USDOT
convolution paths added here do not have to do the "transform the
pixel channel data to [-128, 128) and correct for it later" dance
that we have to do with the SDOT paths.
The USDOT instruction is optional from Armv8.2 to Armv8.5 but
mandatory from Armv8.6 onwards. The availability of the USDOT
instruction is indicated by the feature macro
__ARM_FEATURE_MATMUL_INT8. The SDOT paths are retained for use on
target CPUs that do not implement the USDOT instructions.
Jonathan Wright [Thu, 5 Jan 2023 12:20:03 +0000 (12:20 +0000)]
Use lane-referencing intrinsics in Neon convolution kernels
The Neon convolution helper functions take a pointer to a filter and
load the 8 values into a single Neon register. For some reason,
filter values 3 and 4 are then duplicated into their own separate
registers.
This patch modifies these helper functions so that they access filter
values 3 and 4 via the lane-referencing versions of the various Neon
multiply instructions. This reduces register pressure and tidies up
the source code quite a bit.
Jerome Jiang [Wed, 21 Dec 2022 16:13:40 +0000 (11:13 -0500)]
Remove references to deprecated NumPy type aliases
This change replaces references to a number of deprecated NumPy type
aliases (np.bool, np.int, np.float, np.complex, np.object, np.str)
with their recommended replacement
(bool, int, float, complex, object, str).
NumPy 1.24 drops the deprecated aliases
so we must remove uses before updating NumPy.
Anton Venema [Tue, 13 Dec 2022 18:27:37 +0000 (10:27 -0800)]
Add additional ARM targets for Visual Studio.
configure: Add an armv7-win32-vs16 target
configure: Add an armv7-win32-vs17 target
configure: Add an arm64-win64-vs16 target
configure: Add an arm64-win64-vs17 target
Hirokazu Honda [Thu, 17 Nov 2022 07:05:28 +0000 (16:05 +0900)]
vp9/rate_ctrl_rtc: Improve get cyclic refresh data
A client of the vp9 rate controller needs to know whether the
segmentation is enabled and the size of delta_q. It is also nicer to
know the size of map. This CL changes the interface to achieve these.
Marco Paniconi [Tue, 15 Nov 2022 06:11:19 +0000 (22:11 -0800)]
vp9-svc: Fixes to make SVC work with VBR
Prior to this CL SVC with VBR mode was broken.
Fixes made here to make VBR rate control work for SVC.
Rename is_one_pass_cbr_svc() --> is_one_pass_svc(),
as it can be used now for both CBR and VBR.
Johann [Mon, 14 Nov 2022 08:59:45 +0000 (17:59 +0900)]
quantize: remove vp9_regular_quantize_b_4x4
This was just a helper function which called vpx_quantize_b or
vpx_highbd_quantize_b. It also checked for skip_block, which was
necessary when webm:1439 was filed but does not appear to be
necessary now.
Removes a quantize variant and makes subsequent cleanups easier.
Refactor & optimize FHT functions further, use new butterfly functions
4x4 5% faster, 8x8 & 16x16 10% faster than previous versions.
Highbd 4x4 FHT version 2.27x faster than C version for --rt.
Sam James [Sun, 6 Nov 2022 04:11:59 +0000 (04:11 +0000)]
build: fix -Wimplicit-int (Clang 16)
Clang 16 will make -Wimplicit-int error by default which can, in addition to
other things, lead to some configure tests silently failing/returning the wrong result.
Fixes this error:
```
+/var/tmp/portage/media-libs/libvpx-1.12.0/temp/vpx-conf-1802-30624.c:1:15: error: type specifier missing, defaults to 'int'; ISO C99 and later do not support implicit int [-Wimplicit-int]
```
For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
or the (new) c-std-porting mailing list [3].
[0] https://lwn.net/Articles/913505/
[1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
[2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
[3] hosted at lists.linux.dev.
Bug: https://bugs.gentoo.org/879705
Change-Id: Id73a98944ab3c99a368b9da7a5e902ddff9d937f Signed-off-by: Sam James <sam@gentoo.org>
For --best quality, resulting function
vpx_highbd_fdct32x32_rd_neon takes 0.27% of cpu time in
profiling, vs 6.27% for the sum of scalar functions:
vpx_fdct32, vpx_fdct32.constprop.0, vpx_fdct32x32_rd_c for rd.
For --rt quality, the function takes 0.19% vs 4.57% for the scalar
version.
Overall, this improves encoding time by ~6% compared for highbd
for --best and ~9% for --rt.
[NEON] Optimize and homogenize Butterfly DCT functions
Provide a set of commonly used Butterfly DCT functions for use in
DCT 4x4, 8x8, 16x16, 32x32 functions. These are provided in various
forms, using vqrdmulh_s16/vqrdmulh_s32 for _fast variants, which
unfortunately are only usable in pass1 of most DCTs, as they do not
provide the necessary precision in pass2.
This gave a performance gain ranging from 5% to 15% in 16x16 case.
Also, for 32x32, the loads were rearranged, along with the butterfly
optimizations, this gave 10% gain in 32x32_rd function.
This refactoring was necessary to allow easier porting of highbd
32x32 functions -follows this patchset.
Marco Paniconi [Wed, 12 Oct 2022 07:10:47 +0000 (00:10 -0700)]
Fix to VP8 external RC for dynamic update of layers
On change/update of rc_cfg: when number of temporal
layers change call vp8_reset_temporal_layer_change(),
which in turn will call vp8_init_temporal_layer_context()
only for the new layers.