Henrik Gramner [Mon, 14 Aug 2017 21:13:44 +0000 (23:13 +0200)]
x86: Shrink the x86-64 cabac coeff_last tables
Use dword instead of qword entries. This cuts the size of the tables in half,
which allows each table to fit inside a single cache line.
When PIC is disabled dwords are enough to store absolute addresses.
When PIC is enabled we can store dword offsets relative to the start of
the table and simply add the address of the table to the offset in order
to calculate the full address. This approach also have the advantage of
eliminating a whole bunch of run-time .data relocations.
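
The PIC case described above can be modelled in C. This is only a sketch: the actual tables live in x86 assembly, and the struct, its contents, and the names `coeff_blob` and `blob_entry` are hypothetical. Each entry is a 32-bit offset from the start of the table, resolved by adding the table's own address, so the initializers are compile-time constants and need no run-time relocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical blob: a dword offset table followed by the data it refers to. */
struct coeff_blob {
    uint32_t offsets[3]; /* 4 bytes per entry instead of an 8-byte pointer */
    char     a[4];
    char     b[8];
    char     c[2];
};

static const struct coeff_blob blob = {
    { offsetof(struct coeff_blob, a),
      offsetof(struct coeff_blob, b),
      offsetof(struct coeff_blob, c) },
    "abc", "defghij", "k"
};

/* Full address = address of the table + stored dword offset. */
static const char *blob_entry(int i)
{
    return (const char *)&blob + blob.offsets[i];
}
```

With three entries the table shrinks from 24 to 12 bytes in this model.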
Henrik Gramner [Fri, 4 Aug 2017 22:09:52 +0000 (00:09 +0200)]
x86inc: Enable AVX emulation for floating-point pseudo-instructions
There are 32 pseudo-instructions for each floating-point comparison
instruction, but only 8 of them are actually valid in legacy-encoded mode.
The remaining 24 require the use of VEX-encoded (v-prefixed) instructions
and can therefore be disregarded for this purpose.
However, by merging both passes into the same function, we get the
following speedup:
var2_8x8_neon: 2312 1190 1389 1300
var2_8x16_neon: 4862 2130 2293 2422
Henrik Gramner [Wed, 15 Feb 2017 21:00:25 +0000 (22:00 +0100)]
Add support for levels 6, 6.1, and 6.2
These levels were added in the 2016-10 revision of the H.264 specification and
improve support for content with high resolutions and/or high frame rates.
Level 6.2 supports 8K resolution at 120 fps.
Also shrink the x264_levels array by using smaller data types.
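
As a rough check of the 8K at 120 fps claim, the macroblock arithmetic can be sketched in C. The struct and function names are hypothetical (x264's real `x264_level_t` has more fields), and the limits are the MaxMBPS and MaxFS values as recalled from Table A-1 of the specification, so they should be verified against the spec before being relied on. Real level selection also checks bitrate, DPB size, and per-dimension limits, which this sketch ignores.

```c
#include <stdint.h>

/* Hypothetical, reduced level entry using small data types. */
typedef struct {
    uint8_t  level_idc;  /* 60, 61, 62 */
    uint32_t mbps;       /* max macroblocks per second */
    uint32_t frame_size; /* max macroblocks per frame */
} level_limits_t;

/* Assumed limits for levels 6, 6.1 and 6.2. */
static const level_limits_t levels6[] = {
    { 60,  4177920, 139264 },
    { 61,  8355840, 139264 },
    { 62, 16711680, 139264 },
};

/* Returns the lowest level_idc in the table that fits, or 0 if none does. */
static int min_level6(int width, int height, int fps)
{
    uint32_t mbs  = (uint32_t)((width + 15) / 16) * (uint32_t)((height + 15) / 16);
    uint64_t rate = (uint64_t)mbs * (uint64_t)fps;
    for (int i = 0; i < (int)(sizeof(levels6) / sizeof(levels6[0])); i++)
        if (mbs <= levels6[i].frame_size && rate <= levels6[i].mbps)
            return levels6[i].level_idc;
    return 0;
}
```

For 7680x4320 this gives 480 x 270 = 129600 macroblocks per frame, or 15552000 per second at 120 fps, which fits under the assumed level 6.2 limit.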
Change V and H intra prediction in lossless (TransformBypassModeFlag == 1)
macroblocks to correctly adhere to the specification. Affects lossless
encoding with 8x8dct or mix of lossless with normal macroblocks.
8x8dct has already been disabled in lossless mode for some time due to
being out-of-spec but this will allow us to re-enable it.
Henrik Gramner [Tue, 23 May 2017 14:40:26 +0000 (16:40 +0200)]
x86: Avoid self-relative expressions on macho64
Functions that use self-relative expressions in the form of [foo-$$]
appear to cause issues on 64-bit Mach-O systems when assembled with nasm.
Temporarily disable those functions on macho64 until we've figured out
the root cause.
Henrik Gramner [Mon, 1 May 2017 12:54:32 +0000 (14:54 +0200)]
Rework pixel_var2
The functions are only ever called with pointers to fenc and fdec and the
strides are always constant so there's no point in having them as parameters.
Cover both the U and V planes in a single function call. This is more
efficient with SIMD, especially with the wider vectors provided by AVX2 and
AVX-512, even when accounting for losing the possibility of early termination.
Drop the MMX and XOP implementations and update the rest of the x86 assembly
to match the new behavior. Also enable high bit-depth in the AVX2 version.
Comment out the ARM, AARCH64, and MIPS MSA assembly for now.
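
A scalar C model of the reworked interface might look as follows. It is a sketch under assumptions, not the actual x264 reference code: the stride values, the half-stride placement of the V plane next to the U plane, and the name `var2_8xh` are all assumed here for illustration. The point is that pointers and strides are fixed, and one call walks both chroma planes.

```c
#include <stdint.h>

/* Assumed constant strides for the fenc and fdec buffers. */
#define FENC_STRIDE 16
#define FDEC_STRIDE 32

/* Sums the residual variance of the 8-wide U and V blocks in one pass.
 * ssd[0] and ssd[1] receive the per-plane sums of squared differences.
 * shift is log2(8*h): 6 for 8x8, 7 for 8x16. */
static int var2_8xh(const uint8_t *fenc, const uint8_t *fdec,
                    int ssd[2], int h, int shift)
{
    int sum_u = 0, sum_v = 0, sqr_u = 0, sqr_v = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++) {
            int du = fenc[x] - fdec[x];
            /* V is assumed to sit half a stride to the right of U. */
            int dv = fenc[x + FENC_STRIDE / 2] - fdec[x + FDEC_STRIDE / 2];
            sum_u += du; sqr_u += du * du;
            sum_v += dv; sqr_v += dv * dv;
        }
        fenc += FENC_STRIDE;
        fdec += FDEC_STRIDE;
    }
    ssd[0] = sqr_u;
    ssd[1] = sqr_v;
    return sqr_u - (sum_u * sum_u >> shift) +
           sqr_v - (sum_v * sum_v >> shift);
}
```

Because both planes are always processed, there is no early exit after the first plane; the commit message argues that wider SIMD more than makes up for that.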
Henrik Gramner [Sat, 25 Mar 2017 18:14:28 +0000 (19:14 +0100)]
x86: AVX-512 zigzag_scan_8x8_frame
The vperm* instructions ignore unused bits, so we can pack the permutation
indices together to save cache and just use a shift to get the right values.
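
The index-packing idea can be illustrated with a scalar C model. This is a hedged sketch, not the AVX-512 code itself: `permute16` and `pack_indices` are hypothetical helpers, and a 16-element permute with 4-bit indices stands in for the real instructions. Two index sets share one byte table; the first is read directly because the permute ignores the upper bits, and the second is obtained with a shift.

```c
#include <stdint.h>

/* Scalar model of a 16-element byte permute that, like vperm*, only looks
 * at the index bits it needs (here the low 4) and ignores the rest.
 * dst must not alias src. */
static void permute16(uint8_t dst[16], const uint8_t src[16],
                      const uint8_t idx[16], int shift)
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[(idx[i] >> shift) & 15];
}

/* Packs two sets of 4-bit indices into a single 16-byte table. */
static void pack_indices(uint8_t packed[16],
                         const uint8_t lo[16], const uint8_t hi[16])
{
    for (int i = 0; i < 16; i++)
        packed[i] = (uint8_t)(((hi[i] & 15) << 4) | (lo[i] & 15));
}
```

Calling `permute16` with `shift` set to 0 applies the first permutation and with `shift` set to 4 applies the second, so one table in cache serves both.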
Henrik Gramner [Thu, 11 May 2017 22:03:10 +0000 (00:03 +0200)]
checkasm: x86: More accurate ymm/zmm measurements
YMM and ZMM registers on x86 are turned off to save power when they haven't
been used for some period of time. When they are used there will be a
"warmup" period during which performance will be reduced and inconsistent
which is problematic when trying to benchmark individual functions.
Periodically issue "dummy" instructions that use those registers to
prevent them from being powered down. The end result is more consistent
benchmark results.
Henrik Gramner [Sat, 25 Mar 2017 09:16:09 +0000 (10:16 +0100)]
x86: AVX-512 support
AVX-512 consists of a plethora of different extensions, but in order to keep
things a bit more manageable we group together the following extensions
under a single baseline cpu flag which should cover SKL-X and future CPUs:
* AVX-512 Foundation (F)
* AVX-512 Conflict Detection Instructions (CD)
* AVX-512 Byte and Word Instructions (BW)
* AVX-512 Doubleword and Quadword Instructions (DQ)
* AVX-512 Vector Length Extensions (VL)
On x86-64 AVX-512 provides 16 additional vector registers. Prefer using
those over existing ones since it allows us to avoid using `vzeroupper`
unless more than 16 vector registers are required. They also happen to
be volatile on Windows which means that we don't need to save and restore
existing xmm register contents unless more than 22 vector registers are
required.
Also take the opportunity to drop X264_CPU_CMOV and X264_CPU_SLOW_CTZ while
we're breaking API by messing with the cpu flags since they weren't really
used for anything.
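
The grouping of the five extensions under one baseline flag can be sketched as a single mask test on the EBX output of CPUID leaf 7, subleaf 0. The macro and function names below are hypothetical and the code is not taken from x264; the bit positions are the documented CPUID feature bits. A real detection routine would additionally have to confirm through XGETBV that the OS saves the opmask and ZMM state, which this sketch leaves out.

```c
#include <stdint.h>

/* Feature bits in EBX of CPUID leaf 7, subleaf 0. */
#define CPUID7_EBX_AVX512F  (1u << 16)
#define CPUID7_EBX_AVX512DQ (1u << 17)
#define CPUID7_EBX_AVX512CD (1u << 28)
#define CPUID7_EBX_AVX512BW (1u << 30)
#define CPUID7_EBX_AVX512VL (1u << 31)

/* Hypothetical baseline: all five extensions must be present together. */
#define AVX512_BASELINE (CPUID7_EBX_AVX512F  | CPUID7_EBX_AVX512DQ | \
                         CPUID7_EBX_AVX512CD | CPUID7_EBX_AVX512BW | \
                         CPUID7_EBX_AVX512VL)

/* Returns 1 only if every extension in the baseline group is reported. */
static int has_avx512_baseline(uint32_t cpuid7_ebx)
{
    return (cpuid7_ebx & AVX512_BASELINE) == AVX512_BASELINE;
}
```

Taking the register value as a parameter keeps the check independent of how CPUID is actually executed.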
Henrik Gramner [Sun, 29 Jan 2017 20:38:43 +0000 (21:38 +0100)]
Support YUYV and UYVY packed 4:2:2 raw input
Packed YUV is arguably more common than planar YUV when dealing with raw
4:2:2 content.
We can utilize the existing plane_copy_deinterleave() functions with some
additional minor constraints (we cannot assume any particular alignment
or overread the input buffer).
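
For illustration, a plain scalar unpacking of one packed 4:2:2 row could look like this. It is a sketch that does not use or reproduce plane_copy_deinterleave(); the function name `unpack_422_row` and its parameters are hypothetical, and an even width is assumed. It reads exactly 2 * width bytes, so it neither overreads the input nor needs any particular alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte order per pixel pair: YUYV = Y0 U Y1 V, UYVY = U Y0 V Y1.
 * width is the luma width in pixels and is assumed to be even. */
static void unpack_422_row(uint8_t *y, uint8_t *u, uint8_t *v,
                           const uint8_t *src, size_t width, int uyvy)
{
    size_t luma   = uyvy ? 1 : 0; /* offset of the first luma sample */
    size_t chroma = uyvy ? 0 : 1; /* offset of the U sample */
    for (size_t x = 0; x < width / 2; x++) {
        const uint8_t *p = src + 4 * x;
        y[2 * x]     = p[luma];
        y[2 * x + 1] = p[luma + 2];
        u[x]         = p[chroma];
        v[x]         = p[chroma + 2];
    }
}
```

The two formats differ only in where the luma and chroma bytes sit within each 4-byte group, which is why a single routine with two offsets covers both.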
Martin Storsjö [Fri, 24 Mar 2017 09:33:45 +0000 (11:33 +0200)]
configure: Check for -lshell32 before forcibly adding it into LDFLAGSCLI
When targeting the Windows Phone API subset, there is no shell32.lib.
When targeting Windows Phone/RT, the CLI itself won't be built, but
LDFLAGSCLI are included in all later cases of cc_check within configure.
Therefore only add -lshell32 there if it actually is usable.
Martin Storsjö [Thu, 4 May 2017 19:00:51 +0000 (22:00 +0300)]
arm: Always unconditionally declare .arch armv7-a
We already unconditionally declare .fpu neon and try to build all the
neon codepaths (but only execute them conditionally based on a runtime
check).
This fixes builds targeting armv6, where the rbit instruction isn't
available. This instruction is only used within a neon function in
any case, so there's little point in emulating it.
Martin Storsjö [Fri, 24 Mar 2017 09:33:40 +0000 (11:33 +0200)]
arm: Explicitly declare using the .text segment in the function macro
This fixes one issue in building with MS armasm via gas-preprocessor.
Without the .text segment specification, the object files assembled
fine, but linking failed. (armasm source files don't get the text/code
segment implied automatically if nothing is specified.)
Martin Storsjö [Fri, 24 Mar 2017 09:33:39 +0000 (11:33 +0200)]
osdep: Use the EXPAND macro on other cases of ALIGNED_ARRAY_EMU
EXPAND is already used on the other cases where ALIGNED_ARRAY_EMU
is used on all platforms (originally needed for ICL, later also
required by MSVC); apply the same change (originally from 21ba91ae)
for the cases that only are used on ARM.
This fixes use of ALIGNED_ARRAY_16 with MSVC when targeting ARM.
Martin Storsjö [Fri, 24 Mar 2017 09:33:35 +0000 (11:33 +0200)]
arm: Use commas between all macro arguments in arm assembly
The clang built-in assembler requires proper commas between all macro
arguments. As long as gas-preprocessor is used when building with clang,
this isn't an issue.
Henrik Gramner [Wed, 12 Apr 2017 21:26:32 +0000 (23:26 +0200)]
Windows: Add support for MSVC compilation with WSL
In Windows 10 version 1703 (Creators Update) WSL supports calling native
Windows binaries from the Bash shell, but it requires using full file
names including extension, e.g. `cl.exe` instead of `cl`.
We also don't have access to `cygpath`, so use a simple regex for
converting the dependencies to Unix paths that `make` can understand.
Henrik Gramner [Wed, 29 Mar 2017 14:43:57 +0000 (16:43 +0200)]
x86inc: Fix call with memory operands
We overload the `call` instruction with a macro, but it would misbehave when
the macro argument wasn't a valid identifier. Fix it by explicitly checking
if the argument is an identifier.
Henrik Gramner [Fri, 24 Mar 2017 23:02:11 +0000 (00:02 +0100)]
checkasm: Fix load_deinterleave_chroma_fdec test
The function only writes to parts of the destination buffer but the test
verifies the content of the entire buffer. The problem is that some earlier
IDCT functions clobber the same part of the buffer with garbage when
benchmarked which would incorrectly cause test failures.
Fix this by explicitly zeroing the buffers beforehand.
Henrik Gramner [Fri, 24 Mar 2017 21:27:42 +0000 (22:27 +0100)]
checkasm: Fix compilation on hardened x86-64 ELF systems
Normal PC-relative relocations cannot be used for resolving the address of
external symbols on systems where ASLR results in the offset being larger
than 32 bits. We are required to go through the PLT instead.