Henrik Gramner [Sat, 31 Mar 2018 11:49:56 +0000 (13:49 +0200)]
x86inc: Optimize VEX instruction encoding
Most VEX-encoded instructions require an additional byte to encode when src2
is a high register (e.g. x|ymm8..15). If the instruction is commutative we
can swap src1 and src2 when doing so reduces the instruction length, e.g.
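The length rule behind this optimization can be modelled in a few lines of Python. This is a minimal sketch, not the x86inc macro code: the function names are hypothetical, and it only considers register operands, identified by register number (0..15).

```python
def vex_prefix_len(src2_reg, needs_3byte_form=False):
    """Length in bytes of the VEX prefix for a register-register instruction.

    The 2-byte form (C5) has no B bit, so a src2 (ModRM.rm) register
    numbered 8..15 forces the 3-byte form (C4). src1 is stored in VEX.vvvv
    and dst in ModRM.reg plus VEX.R, both of which the 2-byte form can hold.
    needs_3byte_form covers the other reasons for a 3-byte prefix
    (opcode maps 0F38/0F3A, or VEX.W=1).
    """
    return 3 if (needs_3byte_form or src2_reg >= 8) else 2


def order_sources(src1, src2, commutative, needs_3byte_form=False):
    """Return (src1, src2), swapped only when that shortens the encoding."""
    if commutative:
        swapped_len = vex_prefix_len(src1, needs_3byte_form)
        current_len = vex_prefix_len(src2, needs_3byte_form)
        if swapped_len < current_len:
            return src2, src1
    return src1, src2
```

For instance, `order_sources(1, 8, commutative=True)` swaps the operands because the high register then moves to the vvvv field, while a pair of two high registers is left alone since swapping gains nothing.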
Henrik Gramner [Sat, 2 Jun 2018 18:35:10 +0000 (20:35 +0200)]
Fix clang stack alignment issues
Clang emits aligned AVX stores for things like zeroing stack-allocated
variables when using -mavx, even with -fno-tree-vectorize set, which can
result in crashes if this occurs before we've realigned the stack.
Previously we only ensured that the stack was realigned before calling
assembly functions that access stack-allocated buffers, but this is
not sufficient. Fix the issue by changing the stack realignment to
instead occur immediately in all CLI, API and thread entry points.
Martin Storsjö [Fri, 30 Mar 2018 21:10:14 +0000 (00:10 +0300)]
configure: Only use gas-preprocessor with armasm for compiler=CL
This picks the right assembler automatically for arm and aarch64
llvm-mingw targets.
This doesn't get the right assembler for clang setups when clang
acts like MSVC and uses MSVC headers though (where it perhaps
should use armasm as before), but that's probably an even more
obscure setup.
Henrik Gramner [Sun, 22 Oct 2017 07:59:28 +0000 (09:59 +0200)]
input: Add a workaround for swscale overread bugs
swscale can read past the end of the input buffer, which may result in
crashes if such a read crosses a page boundary into an invalid page.
Work around this by adding some padding space at the end of the buffer when
using memory-mapped input frames. This may sometimes require copying the
last frame into a new buffer on Windows since the Microsoft memory-mapping
implementation has very limited capabilities compared to POSIX systems.
Martin Storsjö [Wed, 18 Oct 2017 07:40:02 +0000 (10:40 +0300)]
aarch64: Use ldurb/sturb for loads/stores with negative offsets
The assembler (both gas and clang/llvm) automatically fixes this, but
armasm64 doesn't. We can fix it in gas-preprocessor, but we should
also be using the right instruction form.
Martin Storsjö [Mon, 16 Oct 2017 19:50:26 +0000 (22:50 +0300)]
arm: Check for __ELF__ instead of !__APPLE__, for using .arch/.fpu
For Windows, when building with armasm, we already filtered these out
with gas-preprocessor.
By filtering them out already in the source, we can also build directly
with clang for Windows (which also requires wrapping the assembler in
gas-preprocessor for converting instructions to thumb form, but
gas-preprocessor doesn't and shouldn't filter them out in the clang
configuration).
Henrik Gramner [Sat, 14 Oct 2017 12:11:26 +0000 (14:11 +0200)]
Shrink the i4x4_mode cost_table array
Only 17 elements are actually used. It was originally padded to 64 bytes to
avoid cache line splits in the x86 assembly, but those haven't really been
an issue on x86 CPUs made in the past decade or so.
Benchmarking shows no performance impact from dropping the padding, so
might as well remove it and save some cache.
Henrik Gramner [Wed, 11 Oct 2017 16:02:26 +0000 (18:02 +0200)]
x86: Remove some legacy CPU detection hacks
Some ancient Pentium-M and Core 1 CPUs had slow SSE units, and using MMX
was preferable. Nowadays many assembly functions in x264 completely lack MMX
implementations and falling back to C code will likely make things worse.
Some misconfigured virtualized systems could sometimes also trigger this code
path and cause assertions.
* Use the codec parameters API instead of the AVStream codec field.
* Use av_packet_unref() instead of av_free_packet().
* Use the AVFrame pts field instead of pkt_pts.
Anton Mitrofanov [Fri, 22 Sep 2017 14:18:55 +0000 (17:18 +0300)]
Make ref and i4x4_mode costs global instead of static
Fixes some thread safety doubts and makes code cleaner.
Downside: slightly higher memory usage when calling multiple encoders from the same application.
Anton Mitrofanov [Fri, 22 Sep 2017 13:59:13 +0000 (16:59 +0300)]
configure: Improvements
Log result of pkg-config checks to config.log.
Fix lavf support detection for pkg-config fallback case.
Fix detection of linking dependencies errors for lavf/lsmash/gpac.
Cosmetics.
Add 'i_bitdepth' to x264_param_t with the corresponding '--output-depth' CLI
option to set the bit depth at runtime.
Drop the 'x264_bit_depth' global variable. Rather than hardcoding it to an
incorrect value, it's preferable to induce a linking failure. If applications
rely on this symbol this will make it more obvious where the problem is.
Add Makefile rules that compile modules with different bit depths. Assembly
on x86 is prefixed with the 'private_prefix' define, while all other archs
modify their function prefix internally.
Templatize the main C library, x86/x86_64 assembly, ARM assembly, AARCH64
assembly, PowerPC assembly, and MIPS assembly.
The depth and cache CLI filters heavily depend on bit depth size, so they
need to be duplicated for each value. This means having to rename these
filters, and adjust the callers to use the right version.
Unfortunately the threaded input CLI module inherits a common.h dependency
(input/frame -> common/threadpool -> common/frame -> common/common) which
is extremely complicated to address in a sensible way. Instead duplicate
the module and select the appropriate one at run time.
Each bitdepth needs different checkasm compilation rules, so split the main
checkasm target into two executables.
Henrik Gramner [Mon, 14 Aug 2017 21:13:44 +0000 (23:13 +0200)]
x86: Shrink the x86-64 cabac coeff_last tables
Use dword instead of qword entries. Cuts the size of the tables in half,
which allows each table to fit inside a single cache line.
When PIC is disabled dwords are enough to store absolute addresses.
When PIC is enabled we can store dword offsets relative to the start of
the table and simply add the address of the table to the offset in order
to calculate the full address. This approach also has the advantage of
eliminating a whole bunch of run-time .data relocations.
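The PIC case can be illustrated with a small Python model. This is only a sketch of the table-relative addressing idea, not the actual assembly: the helper names, the base address and the target addresses used in the test are made up for illustration.

```python
import struct


def build_tables(table_base, targets):
    """Pack the same entries as absolute qwords and as table-relative dwords.

    table_base and targets are hypothetical 64-bit addresses. The dword
    entries are signed because a target may lie before the table.
    """
    count = len(targets)
    qword_table = struct.pack(f"<{count}Q", *targets)
    dword_table = struct.pack(f"<{count}i", *(t - table_base for t in targets))
    return qword_table, dword_table


def lookup(dword_table, table_base, index):
    """Recover the full address: table address plus the stored 32-bit offset."""
    (offset,) = struct.unpack_from("<i", dword_table, 4 * index)
    return table_base + offset
```

Because the dword entries only depend on distances that are fixed at link time, they need no adjustment when the image is loaded at a different address, which is why the run-time relocations disappear.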
Henrik Gramner [Fri, 4 Aug 2017 22:09:52 +0000 (00:09 +0200)]
x86inc: Enable AVX emulation for floating-point pseudo-instructions
There are 32 pseudo-instructions for each floating-point comparison
instruction, but only 8 of them are actually valid in legacy-encoded mode.
The remaining 24 require the use of VEX-encoded (v-prefixed) instructions
and can therefore be disregarded for this purpose.
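As an illustration, the eight legacy-encodable predicates and their immediate values can be tabulated in Python. This is a sketch of the naming scheme only, not of the x86inc macros; the dictionary name is hypothetical, and the table expands the eight predicates for each of the four comparison instructions (cmpps, cmppd, cmpss, cmpsd).

```python
# Predicate names in immediate order: imm8 values 0..7 are the only ones
# accepted by the legacy (non-VEX) encodings.
PREDICATES = ["eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"]
SUFFIXES = ["ps", "pd", "ss", "sd"]

# Map each pseudo-instruction name to (real instruction, imm8 predicate),
# e.g. "cmpltps" stands for "cmpps" with an immediate of 1.
PSEUDO_OPS = {
    f"cmp{pred}{suffix}": (f"cmp{suffix}", imm)
    for suffix in SUFFIXES
    for imm, pred in enumerate(PREDICATES)
}
```

Looking up `PSEUDO_OPS["cmpltps"]` gives `("cmpps", 1)`, which is the form an emulation layer would have to emit when falling back from the v-prefixed spelling.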
However, by merging both passes into the same function, we get the
following speedup:
var2_8x8_neon: 2312 1190 1389 1300
var2_8x16_neon: 4862 2130 2293 2422
Henrik Gramner [Wed, 15 Feb 2017 21:00:25 +0000 (22:00 +0100)]
Add support for levels 6, 6.1, and 6.2
These levels were added in the 2016-10 revision of the H.264 specification and
improve support for content with high resolutions and/or high frame rates.
Level 6.2 supports 8K resolution at 120 fps.
Also shrink the x264_levels array by using smaller data types.
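The 8K at 120 fps figure can be checked with a short calculation. The sketch below is not x264 code: the table and function names are hypothetical, the limits are the MaxMBPS and MaxFS values from Table A-1 of the specification as recalled here, and only those two constraints are tested (bitrate, DPB and other limits are ignored).

```python
# level -> (MaxMBPS: macroblocks per second, MaxFS: macroblocks per frame)
LEVEL_LIMITS = {
    "6":   (4177920, 139264),
    "6.1": (8355840, 139264),
    "6.2": (16711680, 139264),
}


def fits_level(level, width, height, fps):
    """Check frame size and macroblock rate against the level limits."""
    max_mbps, max_fs = LEVEL_LIMITS[level]
    # A macroblock covers 16x16 luma samples; round partial blocks up.
    mbs_per_frame = ((width + 15) // 16) * ((height + 15) // 16)
    return mbs_per_frame <= max_fs and mbs_per_frame * fps <= max_mbps
```

A 7680x4320 frame is 480 x 270 = 129600 macroblocks, so 120 fps needs 15552000 macroblocks per second, which is within the level 6.2 limit but above the level 6.1 one.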
Change V and H intra prediction in lossless (TransformBypassModeFlag == 1)
macroblocks to correctly adhere to the specification. Affects lossless
encoding with 8x8dct or mix of lossless with normal macroblocks.
8x8dct has already been disabled in lossless mode for some time due to
being out-of-spec, but this will allow us to re-enable it.
Henrik Gramner [Tue, 23 May 2017 14:40:26 +0000 (16:40 +0200)]
x86: Avoid self-relative expressions on macho64
Functions that use self-relative expressions in the form of [foo-$$]
appear to cause issues on 64-bit Mach-O systems when assembled with nasm.
Temporarily disable those functions on macho64 until we've figured out
the root cause.
Henrik Gramner [Mon, 1 May 2017 12:54:32 +0000 (14:54 +0200)]
Rework pixel_var2
The functions are only ever called with pointers to fenc and fdec and the
strides are always constant, so there's no point in having them as parameters.
Cover both the U and V planes in a single function call. This is more
efficient with SIMD, especially with the wider vectors provided by AVX2 and
AVX-512, even when accounting for losing the possibility of early termination.
Drop the MMX and XOP implementations and update the rest of the x86 assembly
to match the new behavior. Also enable high bit-depth in the AVX2 version.
Comment out the ARM, AARCH64, and MIPS MSA assembly for now.