granicus.if.org Git - libx264/log
libx264
7 years agox86: AVX-512 zigzag_scan_4x4_frame
Henrik Gramner [Sat, 25 Mar 2017 18:14:22 +0000 (19:14 +0100)]
x86: AVX-512 zigzag_scan_4x4_frame

7 years agocheckasm: x86: More accurate ymm/zmm measurements
Henrik Gramner [Thu, 11 May 2017 22:03:10 +0000 (00:03 +0200)]
checkasm: x86: More accurate ymm/zmm measurements

YMM and ZMM registers on x86 are turned off to save power when they haven't
been used for some period of time. When they are used there will be a
"warmup" period during which performance will be reduced and inconsistent
which is problematic when trying to benchmark individual functions.

Periodically issue "dummy" instructions that use those registers to
prevent them from being powered down. The end result is more consistent
benchmark results.

7 years agox86: AVX-512 support
Henrik Gramner [Sat, 25 Mar 2017 09:16:09 +0000 (10:16 +0100)]
x86: AVX-512 support

AVX-512 consists of a plethora of different extensions, but in order to keep
things a bit more manageable we group together the following extensions
under a single baseline cpu flag which should cover SKL-X and future CPUs:
 * AVX-512 Foundation (F)
 * AVX-512 Conflict Detection Instructions (CD)
 * AVX-512 Byte and Word Instructions (BW)
 * AVX-512 Doubleword and Quadword Instructions (DQ)
 * AVX-512 Vector Length Extensions (VL)
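
As an illustration of the grouping above (not code from this commit), a single baseline flag could be derived by requiring all five feature bits at once. The bit positions are those of CPUID leaf 7, sub-leaf 0, register EBX; the macro names and `has_avx512_baseline()` are hypothetical, and the caller is assumed to have already obtained the EBX value.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits. Macro names are hypothetical. */
#define CPUID7_EBX_AVX512F  (1u << 16)
#define CPUID7_EBX_AVX512DQ (1u << 17)
#define CPUID7_EBX_AVX512CD (1u << 28)
#define CPUID7_EBX_AVX512BW (1u << 30)
#define CPUID7_EBX_AVX512VL (1u << 31)

/* Hypothetical helper: true only if F, CD, BW, DQ and VL are all reported. */
static bool has_avx512_baseline( uint32_t cpuid7_ebx )
{
    const uint32_t required = CPUID7_EBX_AVX512F  | CPUID7_EBX_AVX512DQ |
                              CPUID7_EBX_AVX512CD | CPUID7_EBX_AVX512BW |
                              CPUID7_EBX_AVX512VL;
    return (cpuid7_ebx & required) == required;
}
```

A complete detection routine would also need to confirm through XGETBV that the OS saves and restores the opmask and ZMM state; that step is omitted here.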

On x86-64 AVX-512 provides 16 additional vector registers. Prefer using
those over existing ones since it allows us to avoid using `vzeroupper`
unless more than 16 vector registers are required. They also happen to
be volatile on Windows which means that we don't need to save and restore
existing xmm register contents unless more than 22 vector registers are
required.

Also take the opportunity to drop X264_CPU_CMOV and X264_CPU_SLOW_CTZ while
we're breaking API by messing with the cpu flags since they weren't really
used for anything.

Big thanks to Intel for their support.

7 years agox86: Change assembler from yasm to nasm
Henrik Gramner [Sat, 18 Mar 2017 17:50:36 +0000 (18:50 +0100)]
x86: Change assembler from yasm to nasm

This is required to support AVX-512.

Drop `-Worphan-labels` from ASFLAGS since it's enabled by default in nasm.

Also change alignmode from `k8` to `p6` since it's more similar to `amdnop`
in yasm, e.g. use long nops without excessive prefixes.

7 years agox86: Add some additional cpuflag relations
Henrik Gramner [Sat, 6 May 2017 10:26:56 +0000 (12:26 +0200)]
x86: Add some additional cpuflag relations

Simplifies writing assembly code that depends on available instructions.

LZCNT implies SSE2
BMI1 implies AVX+LZCNT
AVX2 implies BMI2
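
The three relations above can be sketched as a small flag-expansion step. This is only an illustration in C with hypothetical flag values and a hypothetical `expand_cpuflags()` helper, not the actual assembly macros touched by the commit.

```c
#include <stdint.h>

/* Hypothetical flag bits for illustration; not x264's real values. */
enum
{
    FLAG_SSE2  = 1 << 0,
    FLAG_LZCNT = 1 << 1,
    FLAG_AVX   = 1 << 2,
    FLAG_BMI1  = 1 << 3,
    FLAG_BMI2  = 1 << 4,
    FLAG_AVX2  = 1 << 5
};

/* Add every flag implied by the ones already set. The checks are ordered so
 * that flags added by an earlier line are seen by the later ones. */
static uint32_t expand_cpuflags( uint32_t flags )
{
    if( flags & FLAG_AVX2 )  flags |= FLAG_BMI2;             /* AVX2 implies BMI2       */
    if( flags & FLAG_BMI1 )  flags |= FLAG_AVX | FLAG_LZCNT; /* BMI1 implies AVX+LZCNT  */
    if( flags & FLAG_LZCNT ) flags |= FLAG_SSE2;             /* LZCNT implies SSE2      */
    return flags;
}
```

With relations like these in place, code written for BMI1 can use LZCNT and AVX instructions without checking for them separately.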

Skip printing LZCNT under CPU capabilities when BMI1 or BMI2 is available,
and don't print FMA4 when FMA3 is available.

7 years agox86: Faster SSE2 pixel_sad_16x16 and 16x8
Henrik Gramner [Fri, 14 Apr 2017 14:16:49 +0000 (16:16 +0200)]
x86: Faster SSE2 pixel_sad_16x16 and 16x8

Also make the order of fenc/fdec arguments a bit more consistent.

7 years agomsvs/icl: Improve target host detection
Anton Mitrofanov [Sun, 14 May 2017 21:40:52 +0000 (00:40 +0300)]
msvs/icl: Improve target host detection

7 years agoppc: Optimize add8x8_idct_dc
Alexandra Hájková [Sat, 13 May 2017 17:14:52 +0000 (17:14 +0000)]
ppc: Optimize add8x8_idct_dc

Increases speedup compared to C from 2x to 6x.

7 years agoanalyse: Faster min/max MV clipping
Henrik Gramner [Sun, 19 Feb 2017 09:33:16 +0000 (10:33 +0100)]
analyse: Faster min/max MV clipping

Values only need to be clipped in one direction.
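
A minimal sketch of the idea, assuming the unclipped minimum can never exceed the upper limit and the unclipped maximum can never fall below the lower limit. Under that assumption a two-sided clamp does redundant work, and a single max or min per bound is enough. All names here are hypothetical.

```c
/* Two-sided clamp, shown for comparison. */
static int clip3( int v, int lo, int hi )
{
    return v < lo ? lo : v > hi ? hi : v;
}

typedef struct { int min, max; } mv_range_t;

/* One-sided clipping: the lower bound can only be too small and the upper
 * bound can only be too large, so each needs a single comparison. */
static mv_range_t clip_mv_range( int raw_min, int raw_max, int limit_min, int limit_max )
{
    mv_range_t r;
    r.min = raw_min > limit_min ? raw_min : limit_min; /* max( raw_min, limit_min ) */
    r.max = raw_max < limit_max ? raw_max : limit_max; /* min( raw_max, limit_max ) */
    return r;
}
```

When the stated assumption holds, `clip_mv_range()` gives the same bounds as applying `clip3()` to each value.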

7 years agoslicetype_mb_cost: Clip MVs based on MV range
Henrik Gramner [Thu, 16 Feb 2017 19:04:10 +0000 (20:04 +0100)]
slicetype_mb_cost: Clip MVs based on MV range

Improves cost calculations, especially when a short MV range is used.

7 years agoSupport YUYV and UYVY packed 4:2:2 raw input
Henrik Gramner [Sun, 29 Jan 2017 20:38:43 +0000 (21:38 +0100)]
Support YUYV and UYVY packed 4:2:2 raw input

Packed YUV is arguably more common than planar YUV when dealing with raw
4:2:2 content.

We can utilize the existing plane_copy_deinterleave() functions with some
additional minor constraints (we cannot assume any particular alignment
or overread the input buffer).
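
For illustration, a scalar sketch of what deinterleaving one packed 4:2:2 row involves. In YUYV the byte order is Y0 U0 Y1 V0, so the even bytes form the luma row and the odd bytes form an interleaved U/V row. The function name and signature are hypothetical; this is not the x264 routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Split a row of byte pairs into two rows: dst_a receives the even bytes and
 * dst_b the odd bytes. Reads exactly 2*pairs bytes from src, so it makes no
 * alignment assumption and never overreads the input. */
static void deinterleave_row( uint8_t *dst_a, uint8_t *dst_b,
                              const uint8_t *src, size_t pairs )
{
    for( size_t i = 0; i < pairs; i++ )
    {
        dst_a[i] = src[2*i];
        dst_b[i] = src[2*i+1];
    }
}
```

For YUYV input, `dst_a` is the luma row and `dst_b` the chroma row; for UYVY the two destinations are simply swapped. A row of `width` pixels contains `width` byte pairs.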

Enables assembly optimizations on x86.

7 years agox86: Utilize 3-arg instructions in AVX deblock
Henrik Gramner [Thu, 20 Apr 2017 19:58:23 +0000 (21:58 +0200)]
x86: Utilize 3-arg instructions in AVX deblock

Avoids some redundant register-register moves.

7 years agoconfigure: Support targeting ARM with MSVC tools
Martin Storsjö [Fri, 24 Mar 2017 09:33:46 +0000 (11:33 +0200)]
configure: Support targeting ARM with MSVC tools

Set up the right gas-preprocessor as assembler frontend in these cases,
using armasm as actual assembler.

Don't try to add the -mcpu -mfpu options in this case.

Check whether the compiler actually supports inline assembly.

Check for the ARMv7 features in a different way for the MSVC compiler.

7 years agoconfigure: Check for -lshell32 before forcibly adding it into LDFLAGSCLI
Martin Storsjö [Fri, 24 Mar 2017 09:33:45 +0000 (11:33 +0200)]
configure: Check for -lshell32 before forcibly adding it into LDFLAGSCLI

When targeting the Windows Phone API subset, there is no shell32.lib.

When targeting Windows Phone/RT, the CLI itself won't be built, but
LDFLAGSCLI are included in all later cases of cc_check within configure.
Therefore only add -lshell32 there if it actually is usable.

7 years agoarm: Always unconditionally declare .arch armv7-a
Martin Storsjö [Thu, 4 May 2017 19:00:51 +0000 (22:00 +0300)]
arm: Always unconditionally declare .arch armv7-a

We already unconditionally declare .fpu neon and try to build all the
neon codepaths (but only execute them conditionally based on a runtime
check).

This fixes builds targeting armv6, where the rbit instruction isn't
available. This instruction is only used within a neon function in
any case, so there's little point in emulating it.

7 years agoarm: Use .section .rodata for non-elf, non-mach platforms as well
Martin Storsjö [Fri, 24 Mar 2017 09:33:44 +0000 (11:33 +0200)]
arm: Use .section .rodata for non-elf, non-mach platforms as well

If targeting windows with armasm, gas-preprocessor can rewrite the
.section .rodata into the right construct for that platform.

7 years agogas-preprocessor: Support conversion of additional arm instructions into thumb
Martin Storsjö [Fri, 24 Mar 2017 09:33:41 +0000 (11:33 +0200)]
gas-preprocessor: Support conversion of additional arm instructions into thumb

Convert muls into mul+cmp.

Convert "and r0, sp, #xx" into "mov r0, sp", "and r0, r0, #xx".

Convert ldr with a too large shift into add+ldr. This only works in the
special case when the base register is the same as the target for the ldr.

7 years agoarm: Explicitly declare using the .text segment in the function macro
Martin Storsjö [Fri, 24 Mar 2017 09:33:40 +0000 (11:33 +0200)]
arm: Explicitly declare using the .text segment in the function macro

This fixes one issue in building with MS armasm via gas-preprocessor.
Without the .text segment specification, the object files assembled
fine, but linking failed. (armasm source files don't get the text/code
segment implied automatically if nothing is specified.)

7 years agoosdep: Use the EXPAND macro on other cases of ALIGNED_ARRAY_EMU
Martin Storsjö [Fri, 24 Mar 2017 09:33:39 +0000 (11:33 +0200)]
osdep: Use the EXPAND macro on other cases of ALIGNED_ARRAY_EMU

EXPAND is already used on the other cases where ALIGNED_ARRAY_EMU
is used on all platforms (originally needed for ICL, later also
required by MSVC); apply the same change (originally from 21ba91ae)
for the cases that only are used on ARM.

This fixes use of ALIGNED_ARRAY_16 with MSVC when targeting ARM.

7 years agoUpdate to the latest version of gas-preprocessor.pl
Martin Storsjö [Fri, 24 Mar 2017 09:33:38 +0000 (11:33 +0200)]
Update to the latest version of gas-preprocessor.pl

From http://git.libav.org/?p=gas-preprocessor.git

This update contains changes from myself only.

7 years agoarm: Skip using gas-preprocessor for iOS on arm as well
Martin Storsjö [Fri, 24 Mar 2017 09:33:37 +0000 (11:33 +0200)]
arm: Skip using gas-preprocessor for iOS on arm as well

The few constructs that differ can easily be handled within the
source itself - tested to be working since at least Xcode 6.

7 years agoarm: Use const macros in arm assembly where applicable
Martin Storsjö [Fri, 24 Mar 2017 09:33:36 +0000 (11:33 +0200)]
arm: Use const macros in arm assembly where applicable

This unifies the source code style, and allows building the code
with clang without gas-preprocessor.

7 years agoarm: Use commas between all macro arguments in arm assembly
Martin Storsjö [Fri, 24 Mar 2017 09:33:35 +0000 (11:33 +0200)]
arm: Use commas between all macro arguments in arm assembly

The clang built-in assembler requires proper commas between all macro
arguments. As long as gas-preprocessor is used when building with clang,
this isn't an issue.

7 years agoaarch64: Skip invoking gas-preprocessor for iOS
Martin Storsjö [Fri, 24 Mar 2017 09:33:34 +0000 (11:33 +0200)]
aarch64: Skip invoking gas-preprocessor for iOS

Clang can handle all the constructs used there these days, working
since Xcode 6 at least.

7 years agoaarch64: Use the const macro in the aarch64 checkasm assembly source
Martin Storsjö [Fri, 24 Mar 2017 09:33:33 +0000 (11:33 +0200)]
aarch64: Use the const macro in the aarch64 checkasm assembly source

This fixes building the source with clang for iOS without gas-preprocessor.

7 years agoWindows: Add support for MSVC compilation with WSL
Henrik Gramner [Wed, 12 Apr 2017 21:26:32 +0000 (23:26 +0200)]
Windows: Add support for MSVC compilation with WSL

In Windows 10 version 1703 (Creators Update) WSL supports calling native
Windows binaries from the Bash shell, but it requires using full file
names including extension, e.g. `cl.exe` instead of `cl`.

We also don't have access to `cygpath`, so use a simple regex for
converting the dependencies to Unix paths that `make` can understand.
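
The commit itself uses a regex in the build scripts. Purely to illustrate the kind of conversion involved, here is a C sketch that maps a drive-letter path onto the default WSL mount point `/mnt/<drive>`. The function name is hypothetical, and the handling of edge cases is an assumption.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Convert e.g. C:\dir\file.h into /mnt/c/dir/file.h.
 * Returns 0 on success, -1 if the input has no drive letter or if the
 * output buffer is too small. */
static int win_to_wsl_path( const char *win, char *out, size_t out_size )
{
    size_t len = strlen( win );
    if( len < 2 || !isalpha( (unsigned char)win[0] ) || win[1] != ':' )
        return -1;
    /* "/mnt/" (5) + drive letter (1) + remainder (len-2) + NUL (1) */
    if( len + 5 > out_size )
        return -1;
    snprintf( out, out_size, "/mnt/%c%s", tolower( (unsigned char)win[0] ), win + 2 );
    for( char *p = out; *p; p++ )
        if( *p == '\\' )
            *p = '/';
    return 0;
}
```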

7 years agocli: Improve the --fullhelp raw demuxer input-csp listing
Henrik Gramner [Sun, 29 Jan 2017 21:58:24 +0000 (22:58 +0100)]
cli: Improve the --fullhelp raw demuxer input-csp listing

Use the same logic for indentation as the lavf demuxer.

7 years agox86inc: Remove argument from WIN64_RESTORE_XMM
Anton Mitrofanov [Sat, 20 May 2017 18:17:59 +0000 (21:17 +0300)]
x86inc: Remove argument from WIN64_RESTORE_XMM

The use of rsp was pretty much hardcoded there and probably didn't work
otherwise with stack_size > 0.

7 years agox86inc: Prefer r14/r15 over r12/r13 on x86-64
Henrik Gramner [Sat, 22 Apr 2017 18:30:35 +0000 (20:30 +0200)]
x86inc: Prefer r14/r15 over r12/r13 on x86-64

Due to a peculiarity in the ModR/M addressing encoding, the r12 and r13
registers sometimes require an additional byte when used as a base register.

r14 and r15 don't have that issue, so prefer using them.
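
The encoding quirk can be sketched as follows. In ModR/M, an r/m field of 100b selects a SIB byte and, with mod=00, an r/m field of 101b selects disp32/RIP-relative addressing, so base registers whose low three bits match those values (rsp/r12 and rbp/r13) need extra encoding. The helper below is hypothetical and only counts the extra bytes for a plain `[base]` or `[base+disp]` operand.

```c
#include <stdbool.h>

/* Extra bytes needed to encode a memory operand with the given base register
 * (0-15), compared to a base register without encoding restrictions. */
static int extra_addr_bytes( int base_reg, bool has_disp )
{
    int low = base_reg & 7;
    if( low == 4 )              /* rsp/r12: a SIB byte is always required */
        return 1;
    if( low == 5 && !has_disp ) /* rbp/r13: [base] must be encoded as [base+0] */
        return 1;
    return 0;
}
```

For example, `[r12]` and `[r13]` each cost one extra byte, while `[r14]` and `[r15]` cost none.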

7 years agox86inc: Make REP_RET identical to RET in SSSE3+ functions
Henrik Gramner [Thu, 20 Apr 2017 17:16:51 +0000 (19:16 +0200)]
x86inc: Make REP_RET identical to RET in SSSE3+ functions

There's no point in emitting a rep prefix before ret on modern CPUs.

7 years agox86inc: Fix call with memory operands
Henrik Gramner [Wed, 29 Mar 2017 14:43:57 +0000 (16:43 +0200)]
x86inc: Fix call with memory operands

We overload the `call` instruction with a macro, but it would misbehave when
the macro argument wasn't a valid identifier. Fix it by explicitly checking
if the argument is an identifier.

7 years agoosdep: Rework alignment macros
Henrik Gramner [Sun, 29 Jan 2017 15:41:33 +0000 (16:41 +0100)]
osdep: Rework alignment macros

Drop ALIGNED_N and ALIGNED_ARRAY_N in favor of using explicit alignment.

This will allow us to increase the native alignment without unnecessarily
increasing the alignment of everything that's currently 32-byte aligned.

7 years agoMove cabac_block_residual function declarations
Vittorio Giovara [Mon, 30 Jan 2017 21:14:57 +0000 (22:14 +0100)]
Move cabac_block_residual function declarations

7 years agoRecursively delete conftest files
Vittorio Giovara [Mon, 30 Jan 2017 21:14:59 +0000 (22:14 +0100)]
Recursively delete conftest files

On OS X, one of the conftest files might be a directory named `conftest.dSYM`.

7 years agoDrop unused function declarations
Vittorio Giovara [Mon, 30 Jan 2017 21:14:56 +0000 (22:14 +0100)]
Drop unused function declarations

7 years agox86: Adjust cache64_ssse3 function suffixes
Vittorio Giovara [Fri, 27 Jan 2017 17:06:39 +0000 (18:06 +0100)]
x86: Adjust cache64_ssse3 function suffixes

Makes those function names more consistent with other similar functions.

7 years agomc: Mark a function only used within the file as static
Vittorio Giovara [Fri, 27 Jan 2017 15:21:16 +0000 (16:21 +0100)]
mc: Mark a function only used within the file as static

7 years agoppc: Drop two unused static functions
Vittorio Giovara [Fri, 27 Jan 2017 15:21:15 +0000 (16:21 +0100)]
ppc: Drop two unused static functions

7 years agocli: Verify that yuv/y4m input has at least one frame of data
Henrik Gramner [Fri, 19 May 2017 14:08:34 +0000 (16:08 +0200)]
cli: Verify that yuv/y4m input has at least one frame of data

Prevents a SIGBUS crash caused by attempting to access a memory-mapped
region beyond the end of the input file.
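
A minimal sketch of such a check, assuming 8-bit planar 4:2:0 input with chroma dimensions rounded up. The function names and the header-size parameter are hypothetical; the actual CLI code may compute the frame size differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size in bytes of one 8-bit I420 frame (assumed layout: Y plane followed by
 * U and V planes at half resolution, rounded up). */
static uint64_t i420_frame_size( uint32_t width, uint32_t height )
{
    uint64_t chroma_w = ((uint64_t)width  + 1) / 2;
    uint64_t chroma_h = ((uint64_t)height + 1) / 2;
    return (uint64_t)width * height + 2 * chroma_w * chroma_h;
}

/* True if the file holds at least one complete frame after the header, so
 * that mapping the first frame cannot reach past the end of the file. */
static bool has_at_least_one_frame( uint64_t file_size, uint64_t header_size,
                                    uint64_t frame_size )
{
    return frame_size > 0 && file_size >= header_size &&
           file_size - header_size >= frame_size;
}
```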

7 years agomips: Fix out-of-tree build
Kaustubh Raste [Fri, 14 Apr 2017 09:59:31 +0000 (15:29 +0530)]
mips: Fix out-of-tree build

Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>

7 years agocheckasm: Fix load_deinterleave_chroma_fdec test
Henrik Gramner [Fri, 24 Mar 2017 23:02:11 +0000 (00:02 +0100)]
checkasm: Fix load_deinterleave_chroma_fdec test

The function only writes to parts of the destination buffer but the test
verifies the content of the entire buffer. The problem is that some earlier
IDCT functions clobber the same part of the buffer with garbage when
benchmarked which would incorrectly cause test failures.

Fix this by explicitly zeroing the buffers beforehand.

7 years agocheckasm: Fix compilation on hardened x86-64 ELF systems
Henrik Gramner [Fri, 24 Mar 2017 21:27:42 +0000 (22:27 +0100)]
checkasm: Fix compilation on hardened x86-64 ELF systems

Normal PC-relative relocations cannot be used for resolving the address of
external symbols on systems where ASLR results in the offset being larger
than 32 bits. We are required to go through the PLT instead.

7 years agoaarch64: Fix building checkasm for iOS
Martin Storsjö [Thu, 23 Mar 2017 13:05:38 +0000 (15:05 +0200)]
aarch64: Fix building checkasm for iOS

On iOS, symbols are prefixed - this prefix gets added by the X()
macro.

7 years agoconfigure: Always enable PIC in aarch64 assembly for apple platforms
Martin Storsjö [Thu, 23 Mar 2017 13:05:37 +0000 (15:05 +0200)]
configure: Always enable PIC in aarch64 assembly for apple platforms

This is similar to what we do for 32-bit ARM assembly as well.

Fixes linker errors such as `ld: Absolute addressing not allowed in
arm64 code but used in '_x264_cabac_encode_terminal_asm' referencing
'_x264_cabac_range_lps' for architecture arm64`.

7 years agoppc: AltiVec plane_copy_deinterleave
Alexandra Hájková [Mon, 5 Dec 2016 10:28:53 +0000 (10:28 +0000)]
ppc: AltiVec plane_copy_deinterleave

7 years agoppc: AltiVec plane_copy_deinterleave_v210
Alexandra Hájková [Mon, 2 Jan 2017 12:56:48 +0000 (12:56 +0000)]
ppc: AltiVec plane_copy_deinterleave_v210

7 years agoppc: AltiVec plane_copy_deinterleave_rgb
Alexandra Hájková [Wed, 7 Dec 2016 19:48:02 +0000 (19:48 +0000)]
ppc: AltiVec plane_copy_deinterleave_rgb

Also add some missing vector types in ppccommon.h

7 years agoppc: Adjust AltiVec function suffix
Vittorio Giovara [Thu, 19 Jan 2017 16:43:57 +0000 (17:43 +0100)]
ppc: Adjust AltiVec function suffix

Architecture should always be the last element.

7 years agoMove the x264_mdate() declaration to the appropriate header
Vittorio Giovara [Mon, 9 Jan 2017 21:28:20 +0000 (22:28 +0100)]
Move the x264_mdate() declaration to the appropriate header

7 years agoarm/aarch64: Correctly prefix integral function symbols
Vittorio Giovara [Tue, 17 Jan 2017 16:04:19 +0000 (17:04 +0100)]
arm/aarch64: Correctly prefix integral function symbols

7 years agox86: Avoid using hardcoded function symbol prefixes
Anton Mitrofanov [Fri, 13 Jan 2017 13:57:51 +0000 (14:57 +0100)]
x86: Avoid using hardcoded function symbol prefixes

7 years agox86: AVX2 high bit-depth load_deinterleave_chroma
Henrik Gramner [Wed, 18 Jan 2017 20:57:14 +0000 (21:57 +0100)]
x86: AVX2 high bit-depth load_deinterleave_chroma

load_deinterleave_chroma_fenc: 50% faster than AVX
load_deinterleave_chroma_fdec: 25% faster than AVX

7 years agox86: AVX2 load_deinterleave_chroma_fenc
Henrik Gramner [Wed, 18 Jan 2017 20:46:55 +0000 (21:46 +0100)]
x86: AVX2 load_deinterleave_chroma_fenc

20% faster than SSSE3.

7 years agox86: AVX2 plane_copy_deinterleave
Henrik Gramner [Tue, 17 Jan 2017 20:59:47 +0000 (21:59 +0100)]
x86: AVX2 plane_copy_deinterleave

50% faster than SSSE3 in 8-bit.
25% faster than AVX in high bit-depth.

Also drop the MMX versions of deinterleave functions in favor of SSE2.

7 years agox86: AVX2 plane_copy_deinterleave_rgb
Henrik Gramner [Thu, 12 Jan 2017 21:16:53 +0000 (22:16 +0100)]
x86: AVX2 plane_copy_deinterleave_rgb

Around 15% faster than SSSE3.

7 years agox86: Faster plane_copy_deinterleave_rgb_sse2
Henrik Gramner [Thu, 12 Jan 2017 20:36:28 +0000 (21:36 +0100)]
x86: Faster plane_copy_deinterleave_rgb_sse2

50% faster than the previous SSE2 function.

7 years agox86util: Reduce code size of high bit-depth AVX LOAD_DIFF
Henrik Gramner [Sun, 15 Jan 2017 13:52:29 +0000 (14:52 +0100)]
x86util: Reduce code size of high bit-depth AVX LOAD_DIFF

AVX supports unaligned memory operands which makes the SATD code a bit denser.

7 years agoBump dates to 2017
Henrik Gramner [Sun, 1 Jan 2017 18:10:10 +0000 (19:10 +0100)]
Bump dates to 2017

7 years agoppc: Fix the pre-VSX vec_vsx_st() fallback macro
Alexandra Hájková [Sat, 21 Jan 2017 12:34:49 +0000 (12:34 +0000)]
ppc: Fix the pre-VSX vec_vsx_st() fallback macro

It would previously only work correctly with 8-bit data types.

Fixes compilation with --disable-vsx.

7 years agoFix plane_copy_deinterleave_v210 on big-endian
Alexandra Hájková [Wed, 18 Jan 2017 09:13:39 +0000 (09:13 +0000)]
Fix plane_copy_deinterleave_v210 on big-endian

7 years agoppc: Avoid instantiating unused plane_copy functions
Alexandra Hájková [Wed, 21 Dec 2016 13:13:43 +0000 (13:13 +0000)]
ppc: Avoid instantiating unused plane_copy functions

Those functions are currently only used in 8-bit mode and result in
warnings in other bit depths.

8 years agoarm: Load mb_y properly in mbtree_propagate_list_internal_neon
Martin Storsjö [Mon, 26 Dec 2016 22:22:48 +0000 (00:22 +0200)]
arm: Load mb_y properly in mbtree_propagate_list_internal_neon

The previous version, which attempted to load two stack parameters at once,
would only have worked if they were interpreted and loaded as 32 bit
elements, not when loading them as 16 bit elements.

8 years agoanalyse: Fix lambda table values
Anton Mitrofanov [Mon, 31 Oct 2016 11:39:52 +0000 (14:39 +0300)]
analyse: Fix lambda table values

8 years agoCosmetics
Anton Mitrofanov [Sat, 26 Nov 2016 12:30:58 +0000 (15:30 +0300)]
Cosmetics

Also make x264_weighted_reference_duplicate() static.

8 years agoppc: AltiVec store_interleave_chroma
Alexandra Hájková [Mon, 28 Nov 2016 14:04:10 +0000 (14:04 +0000)]
ppc: AltiVec store_interleave_chroma

8 years agoppc: AltiVec plane_copy_interleave
Alexandra Hájková [Mon, 28 Nov 2016 10:51:54 +0000 (10:51 +0000)]
ppc: AltiVec plane_copy_interleave

8 years agoppc: AltiVec plane_copy_swap
Alexandra Hájková [Sat, 26 Nov 2016 20:03:34 +0000 (20:03 +0000)]
ppc: AltiVec plane_copy_swap

8 years agoppc: AltiVec zigzag_interleave_8x8_cavlc
Alexandra Hájková [Wed, 23 Nov 2016 19:53:51 +0000 (20:53 +0100)]
ppc: AltiVec zigzag_interleave_8x8_cavlc

8 years agoppc: AltiVec zigzag_scan_8x8_frame
Alexandra Hájková [Wed, 23 Nov 2016 19:53:50 +0000 (20:53 +0100)]
ppc: AltiVec zigzag_scan_8x8_frame

8 years agoppc: AltiVec sub8x8_dct_dc
Alexandra Hájková [Mon, 14 Nov 2016 14:06:06 +0000 (15:06 +0100)]
ppc: AltiVec sub8x8_dct_dc

8 years agoppc: AltiVec add8x8_idct_dc
Alexandra Hájková [Mon, 14 Nov 2016 14:06:05 +0000 (15:06 +0100)]
ppc: AltiVec add8x8_idct_dc

8 years agocheckasm: aarch64: Add filler args to make sure all parameters are passed on the...
Martin Storsjö [Wed, 16 Nov 2016 08:57:31 +0000 (10:57 +0200)]
checkasm: aarch64: Add filler args to make sure all parameters are passed on the stack

This, combined with clobbering the stack space prior to the call,
increases the chances of finding cases where 32 bit parameters
are erroneously treated as 64 bit.

8 years agocheckasm: aarch64: Clobber the stack before calling functions
Martin Storsjö [Wed, 16 Nov 2016 08:57:30 +0000 (10:57 +0200)]
checkasm: aarch64: Clobber the stack before calling functions

8 years agoppc: Use vec_vsx_ld instead of VEC_LOAD/STORE macros
Alexandra Hájková [Tue, 1 Nov 2016 22:16:17 +0000 (23:16 +0100)]
ppc: Use vec_vsx_ld instead of VEC_LOAD/STORE macros

Remove VEC_LOAD*, some of VEC_STORE* macros, some PREP* macros and
VEC_DIFF_H_OFFSET macro.

Make sure the functions do not use deprecated primitives.

8 years agoppc: Provide fallbacks for older architectures
Luca Barbato [Tue, 1 Nov 2016 22:16:16 +0000 (23:16 +0100)]
ppc: Provide fallbacks for older architectures

8 years agoppc: Add VSX support to configure
Luca Barbato [Tue, 1 Nov 2016 22:16:14 +0000 (23:16 +0100)]
ppc: Add VSX support to configure

8 years agoppc: Manually unroll the horizontal prediction loop
Luca Barbato [Tue, 1 Nov 2016 22:16:13 +0000 (23:16 +0100)]
ppc: Manually unroll the horizontal prediction loop

Doubles the speedup of the function (from being slower than C to being
over twice as fast as C).

8 years agox86inc: Avoid using eax/rax for storing the stack pointer
Henrik Gramner [Sat, 8 Oct 2016 15:20:18 +0000 (17:20 +0200)]
x86inc: Avoid using eax/rax for storing the stack pointer

When allocating stack space with an alignment requirement that is larger
than the current stack alignment we need to store a copy of the original
stack pointer in order to be able to restore it later.

If we choose to use another register for this purpose we should not pick
eax/rax since it can be overwritten as a return value.

8 years agoShow the correct settings for --preset slow in --fullhelp
Henrik Gramner [Thu, 1 Dec 2016 15:05:16 +0000 (16:05 +0100)]
Show the correct settings for --preset slow in --fullhelp

The slow preset was recently adjusted but we forgot to update the
corresponding --fullhelp message to reflect the change.

8 years agocheckasm: arm/aarch64: Fix the amount of space reserved for stack parameters
Martin Storsjö [Mon, 14 Nov 2016 21:54:51 +0000 (23:54 +0200)]
checkasm: arm/aarch64: Fix the amount of space reserved for stack parameters

Even if MAX_ARGS - 2 (for arm) or MAX_ARGS - 6 (for aarch64) parameters
are passed on the stack to checkasm_checked_call, we actually only
need to store MAX_ARGS - 4 (for arm) or MAX_ARGS - 8 (for aarch64)
parameters on the stack when calling the tested function.

8 years agocheckasm: arm: preserve the stack alignment in x264_checkasm_checked_call
Janne Grunau [Mon, 14 Nov 2016 21:54:50 +0000 (23:54 +0200)]
checkasm: arm: preserve the stack alignment in x264_checkasm_checked_call

The stack used by x264_checkasm_checked_call_neon was a multiple of 4 bytes
when the checked function was called. AAPCS requires a double word (8 byte)
aligned stack at public interfaces. Since both calls are public interfaces,
the stack is misaligned when the checked function is called.

This can cause issues if code called within this (which includes
the C implementations) relies on the stack alignment.

8 years agoarm: Don't use vcmp.f64 for testing for an all-zeros register
Martin Storsjö [Wed, 16 Nov 2016 08:56:14 +0000 (10:56 +0200)]
arm: Don't use vcmp.f64 for testing for an all-zeros register

On iOS, vcmp.f64 can behave as if the register was zero if the
register (interpreted as an f64) was a denormal number.

The vcmp.f64 (and other VFP instructions) will trap to the kernel
(which is supposed to implement the FP operation, which it apparently
doesn't do properly on iOS) if the value is a denormal. If this happens,
the whole comparison ends up way more costly.
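
To see why this can happen with integer data, note that any 64-bit pattern whose upper bits are zero, such as a small integer, is a denormal when read as an IEEE 754 double. The C sketch below uses hypothetical helper names and only models the bit patterns; it contrasts that classification with a plain integer zero test, which does not depend on floating-point behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the 64-bit pattern, read as an IEEE 754 double, is a denormal:
 * exponent field all zeros and a non-zero mantissa. */
static bool f64_bits_are_denormal( uint64_t bits )
{
    uint64_t exponent = (bits >> 52) & 0x7ff;
    uint64_t mantissa = bits & ((1ULL << 52) - 1);
    return exponent == 0 && mantissa != 0;
}

/* Integer test for an all-zeros register; no floating-point compare involved. */
static bool reg_is_all_zero( uint64_t bits )
{
    return bits == 0;
}
```

A register holding the integer value 3, for instance, is a denormal as an f64 yet clearly not all zeros.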

8 years agoaarch64: Clear the upper half of int parameters in x264_plane_copy_core_neon
Janne Grunau [Wed, 16 Nov 2016 08:49:14 +0000 (10:49 +0200)]
aarch64: Clear the upper half of int parameters in x264_plane_copy_core_neon

8 years agoppc: Fix hadamard for little-endian
Luca Barbato [Tue, 1 Nov 2016 22:16:18 +0000 (23:16 +0100)]
ppc: Fix hadamard for little-endian

Extending to 16-bit works with flipped bytes.

8 years agoCorrectly signal max_dec_frame_buffering with --keyint 1
Anton Mitrofanov [Wed, 21 Sep 2016 21:17:48 +0000 (00:17 +0300)]
Correctly signal max_dec_frame_buffering with --keyint 1

According to E.2.1 it is inferred to be equal to 0 only if profile_idc is equal
to 44, 86, 100, 110, 122, or 244 and constraint_set3_flag is equal to 1.
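
The condition quoted above can be written as a small predicate. The function name is hypothetical; the profile_idc values are exactly those listed in the commit message.

```c
#include <stdbool.h>

/* True if max_dec_frame_buffering may be inferred to be 0 when it is not
 * signalled explicitly, per the condition cited from E.2.1. */
static bool infers_zero_max_dec_frame_buffering( int profile_idc,
                                                 bool constraint_set3_flag )
{
    if( !constraint_set3_flag )
        return false;
    switch( profile_idc )
    {
        case 44: case 86: case 100: case 110: case 122: case 244:
            return true;
        default:
            return false;
    }
}
```

For any other combination the value has to be signalled explicitly if it should be 0.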

8 years agox86: Faster pixel_ssim_4x4x2_core
Henrik Gramner [Sat, 17 Sep 2016 19:41:52 +0000 (21:41 +0200)]
x86: Faster pixel_ssim_4x4x2_core

8 years agox86: Deduplicate a constant in hpel_filter_c
Henrik Gramner [Sat, 17 Sep 2016 19:14:35 +0000 (21:14 +0200)]
x86: Deduplicate a constant in hpel_filter_c

8 years agox86: Faster pixel_ssd_nv12
Henrik Gramner [Sat, 17 Sep 2016 12:45:08 +0000 (14:45 +0200)]
x86: Faster pixel_ssd_nv12

Also drop the MMX2 version to simplify things.

8 years agox86: SSE zigzag_scan_4x4_field
Henrik Gramner [Sun, 11 Sep 2016 13:32:54 +0000 (15:32 +0200)]
x86: SSE zigzag_scan_4x4_field

Replaces the MMX2 version, one cycle faster.

Also change the checkasm test to use the correct alignment macro.

8 years agox86: AVX2 mbtree_propagate_list
Henrik Gramner [Wed, 7 Sep 2016 17:27:31 +0000 (19:27 +0200)]
x86: AVX2 mbtree_propagate_list

SIMD part is around 25% faster than AVX on Haswell, around 7%
faster when including the runtime of the scalar C wrapper.

8 years agox86: Move predict_16x16_dc_left calculations to asm
Henrik Gramner [Wed, 7 Sep 2016 17:26:42 +0000 (19:26 +0200)]
x86: Move predict_16x16_dc_left calculations to asm

1-2 cycles faster and avoids some code duplication to decrease code size.

Also drop the MMX2 implementation in favor of SSE2 to simplify things.

8 years agoavs: support for AviSynth+ high bit-depth pixel formats
Anton Mitrofanov [Thu, 18 Aug 2016 16:00:48 +0000 (19:00 +0300)]
avs: support for AviSynth+ high bit-depth pixel formats

8 years agoaarch64: implement x264_plane_copy_swap_neon
Janne Grunau [Fri, 26 Aug 2016 17:26:56 +0000 (20:26 +0300)]
aarch64: implement x264_plane_copy_swap_neon

plane_copy_swap_c: 27054
plane_copy_swap_neon: 4152

8 years agoVarious cosmetics of semicolon use
Anton Mitrofanov [Thu, 18 Aug 2016 19:14:22 +0000 (22:14 +0300)]
Various cosmetics of semicolon use

8 years agocli: Prefetch yuv/y4m input frames on Windows 8 and newer
Henrik Gramner [Thu, 28 Jul 2016 19:58:40 +0000 (21:58 +0200)]
cli: Prefetch yuv/y4m input frames on Windows 8 and newer

Use PrefetchVirtualMemory() (if available) on memory-mapped input frames.

Significantly improves performance when the source file is not already
present in the OS page cache by asking the OS to bring in those pages from
disk using large, concurrent I/O requests.

Most beneficial on fast encoding settings. Up to 40% faster overall with
--preset ultrafast, and up to 20% faster overall with --preset veryfast.

This API was introduced in Windows 8, so call it conditionally. On older
Windows systems the previous behavior remains unchanged.

8 years agoAdjust --preset slow
Henrik Gramner [Thu, 28 Jul 2016 17:34:04 +0000 (19:34 +0200)]
Adjust --preset slow

 * Swap --me umh for --trellis 2. They have a similar effect on performance
   but the latter gives slightly better results in most cases.
 * Change --b-adapt from 2 to 1. Negligible difference in quality since the
   b-adapt 1 improvements, but it's significantly faster.

Also remove a redundant assignment from veryfast (--me hex is set by default).

8 years agoratecontrol_new: Simplify an expression in HRD timescale calculation
Henrik Gramner [Thu, 28 Jul 2016 17:33:57 +0000 (19:33 +0200)]
ratecontrol_new: Simplify an expression in HRD timescale calculation

Also gets rid of a false positive static analyser integer division warning.

8 years agogcc: Enable __sync_fetch_and_add() on x86-64
Henrik Gramner [Thu, 28 Jul 2016 17:33:44 +0000 (19:33 +0200)]
gcc: Enable __sync_fetch_and_add() on x86-64

It was previously only enabled on 32-bit x86 for no reason, so 64-bit
systems had to use a mutex instead of a simple `lock xadd` instruction.

Note that this code is only used in some very specific configurations
involving sliced threads.
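
As a short illustration of the builtin in question (GCC and Clang), the wrapper below atomically adds to a counter and returns the value it held before the addition. The wrapper name is hypothetical.

```c
/* Atomically perform *counter += delta and return the previous value.
 * On x86 and x86-64 the compiler typically emits `lock xadd` for this. */
static int fetch_and_add( int *counter, int delta )
{
    return __sync_fetch_and_add( counter, delta );
}
```

Unlike a mutex-protected increment, this needs no lock object and cannot block.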

8 years agomips: Fix high bit-depth compilation
Anton Mitrofanov [Tue, 20 Sep 2016 15:48:22 +0000 (18:48 +0300)]
mips: Fix high bit-depth compilation

8 years agocheckasm: Fix compilation on Windows with --disable-thread
Henrik Gramner [Sat, 17 Sep 2016 13:53:59 +0000 (15:53 +0200)]
checkasm: Fix compilation on Windows with --disable-thread