Henrik Gramner [Mon, 1 May 2017 12:54:32 +0000 (14:54 +0200)]
Rework pixel_var2
The functions are only ever called with pointers to fenc and fdec and the
strides are always constant so there's no point in having them as parameters.
Cover both the U and V planes in a single function call. This is more
efficient with SIMD, especially with the wider vectors provided by AVX2 and
AVX-512, even when accounting for losing the possibility of early termination.
Drop the MMX and XOP implementations, update the rest of the x86 assembly
to match the new behavior. Also enable high bit-depth in the AVX2 version.
Comment out the ARM, AARCH64, and MIPS MSA assembly for now.
Henrik Gramner [Sat, 25 Mar 2017 18:14:28 +0000 (19:14 +0100)]
x86: AVX-512 zigzag_scan_8x8_frame
The vperm* instructions ignores unused bits, so we can pack the permutation
indices together to save cache and just use a shift to get the right values.
Henrik Gramner [Thu, 11 May 2017 22:03:10 +0000 (00:03 +0200)]
checkasm: x86: More accurate ymm/zmm measurements
YMM and ZMM registers on x86 are turned off to save power when they haven't
been used for some period of time. When they are used there will be a
"warmup" period during which performance will be reduced and inconsistent
which is problematic when trying to benchmark individual functions.
Periodically issue "dummy" instructions that uses those registers to
prevent them from being powered down. The end result is more consitent
benchmark results.
Henrik Gramner [Sat, 25 Mar 2017 09:16:09 +0000 (10:16 +0100)]
x86: AVX-512 support
AVX-512 consists of a plethora of different extensions, but in order to keep
things a bit more manageable we group together the following extensions
under a single baseline cpu flag which should cover SKL-X and future CPUs:
* AVX-512 Foundation (F)
* AVX-512 Conflict Detection Instructions (CD)
* AVX-512 Byte and Word Instructions (BW)
* AVX-512 Doubleword and Quadword Instructions (DQ)
* AVX-512 Vector Length Extensions (VL)
On x86-64 AVX-512 provides 16 additional vector registers, prefer using
those over existing ones since it allows us to avoid using `vzeroupper`
unless more than 16 vector registers are required. They also happen to
be volatile on Windows which means that we don't need to save and restore
existing xmm register contents unless more than 22 vector registers are
required.
Also take the opportunity to drop X264_CPU_CMOV and X264_CPU_SLOW_CTZ while
we're breaking API by messing with the cpu flags since they weren't really
used for anything.
Henrik Gramner [Sun, 29 Jan 2017 20:38:43 +0000 (21:38 +0100)]
Support YUYV and UYVY packed 4:2:2 raw input
Packed YUV is arguably more common than planar YUV when dealing with raw
4:2:2 content.
We can utilize the existing plane_copy_deinterleave() functions with some
additional minor constraints (we cannot assume any particular alignment
or overread the input buffer).
Martin Storsjö [Fri, 24 Mar 2017 09:33:45 +0000 (11:33 +0200)]
configure: Check for -lshell32 before forcibly adding it into LDFLAGSCLI
When targeting the Windows Phone API subset, there is no shell32.lib.
When targeting Windows Phone/RT, the CLI itself won't be built, but
LDFLAGSCLI are included in all later cases of cc_check within configure.
Therefore only add -lshell32 there if it actually is usable.
Martin Storsjö [Thu, 4 May 2017 19:00:51 +0000 (22:00 +0300)]
arm: Always unconditionally declare .arch armv7-a
We already unconditionally declare .fpu neon and try to build all the
neon codepaths (but only execute them conditionally based on a runtime
check).
This fixes builds targeting armv6, where the rbit instruction isn't
available. This instruction is only used within a neon function in
any case, so there's little point in emulating it.
Martin Storsjö [Fri, 24 Mar 2017 09:33:40 +0000 (11:33 +0200)]
arm: Explicitly declare using the .text segment in the function macro
This fixes one issue in building with MS armasm via gas-preprocessor.
Without the .text segment specification, the object files assembled
fine, but linking failed. (armasm source files don't get the text/code
segment implied automatically if nothing is specified.)
Martin Storsjö [Fri, 24 Mar 2017 09:33:39 +0000 (11:33 +0200)]
osdep: Use the EXPAND macro on other cases of ALIGNED_ARRAY_EMU
EXPAND is already used on the other cases where ALIGNED_ARRAY_EMU
is used on all platforms (originally needed for ICL, later also
required by MSVC); apply the same change (originally from 21ba91ae)
for the cases that only are used on ARM.
This fixes use of ALIGNED_ARRAY_16 with MSVC when targeting ARM.
Martin Storsjö [Fri, 24 Mar 2017 09:33:35 +0000 (11:33 +0200)]
arm: Use commas between all macro arguments in arm assembly
The clang built-in assembler requires proper commas between all macro
arguments. As long as gas-preprocessor is used when building with clang,
this isn't an issue.
Henrik Gramner [Wed, 12 Apr 2017 21:26:32 +0000 (23:26 +0200)]
Windows: Add support for MSVC compilation with WSL
In Windows 10 version 1703 (Creators Update) WSL supports calling native
Windows binaries from the Bash shell, but it requires using full file
names including extension, e.g. `cl.exe` instead of `cl`.
We also don't have access to `cygpath`, so use a simple regex for
converting the dependencies to Unix paths that `make` can understand.
Henrik Gramner [Wed, 29 Mar 2017 14:43:57 +0000 (16:43 +0200)]
x86inc: Fix call with memory operands
We overload the `call` instruction with a macro, but it would misbehave when
the macro argument wasn't a valid identifier. Fix it by explicitly checking
if the argument is an identifier.
Henrik Gramner [Fri, 24 Mar 2017 23:02:11 +0000 (00:02 +0100)]
checkasm: Fix load_deinterleave_chroma_fdec test
The function only writes to parts of the destination buffer but the test
verifies the content of the entire buffer. The problem is that some earlier
IDCT functions clobbers the same part of the buffer with garbage when
benchmarked which would incorrectly cause test failures.
Fix this by explicitly zeroing the buffers beforehand.
Henrik Gramner [Fri, 24 Mar 2017 21:27:42 +0000 (22:27 +0100)]
checkasm: Fix compilation on hardened x86-64 ELF systems
Normal PC-relative relocations cannot be used for resolving the address of
external symbols on systems where ASLR results in the offset being larger
than 32 bits. We are required to to go through the PLT instead.
Martin Storsjö [Thu, 23 Mar 2017 13:05:37 +0000 (15:05 +0200)]
configure: Always enable PIC in aarch64 assembly for apple platforms
This is similar to what we do for 32-bit ARM assembly as well.
Fixes linker errors such as `ld: Absolute addressing not allowed in
arm64 code but used in '_x264_cabac_encode_terminal_asm' referencing
'_x264_cabac_range_lps' for architecture arm64`.
Martin Storsjö [Mon, 26 Dec 2016 22:22:48 +0000 (00:22 +0200)]
arm: Load mb_y properly in mbtree_propagate_list_internal_neon
The previous version, attempting to load two stack parameters at once,
only would have worked if they were interpreted and loaded as 32 bit
elements, not when loading them as 16 bit elements.
Martin Storsjö [Wed, 16 Nov 2016 08:57:31 +0000 (10:57 +0200)]
checkasm: aarch64: Add filler args to make sure all parameters are passed on the stack
This, combined with clobbering the stack space prior to the call,
increases the chances of finding cases where 32 bit parameters
are erroneously treated as 64 bit.