Janne Grunau [Tue, 29 Jul 2014 10:06:24 +0000 (11:06 +0100)]
aarch64: NEON asm for missing x264_zigzag_* functions
zigzag_scan_4x4_field_neon, zigzag_sub_4x4_field_neon,
zigzag_sub_4x4ac_field_neon, zigzag_sub_4x4_frame_neon,
zigzag_sub_4x4ac_frame_neon more than 2 times faster
zigzag_scan_8x8_frame_neon, zigzag_scan_8x8_field_neon,
zigzag_sub_8x8_field_neon, zigzag_sub_8x8_frame_neon 4-5 times faster
Henrik Gramner [Wed, 8 Oct 2014 20:25:35 +0000 (22:25 +0200)]
checkasm: Serialize read_time() calls on x86
Improves the accuracy of benchmarks, especially in short functions.
To quote the Intel 64 and IA-32 Architectures Software Developer's Manual:
"The RDTSC instruction is not a serializing instruction. It does not necessarily
wait until all previous instructions have been executed before reading the counter.
Similarly, subsequent instructions may begin execution before the read operation
is performed. If software requires RDTSC to be executed only after all previous
instructions have completed locally, it can either use RDTSCP (if the processor
supports that instruction) or execute the sequence LFENCE;RDTSC."
RDTSCP would accomplish the same task, but it's only available since Nehalem.
This change makes SSE2 a requirement to run checkasm.
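A minimal sketch of the serialized counter read described above, assuming a GCC- or Clang-compatible compiler. The function name read_time_serialized, the 64-bit return type, the "memory" clobber, and the clock() fallback for non-x86 targets are illustrative assumptions, not the code from this commit.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read a timestamp only after earlier instructions
 * have completed locally. Not the actual checkasm read_time(). */
static inline uint64_t read_time_serialized(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    uint32_t lo, hi;
    /* LFENCE waits for all previous instructions to complete locally
     * before RDTSC reads the counter. LFENCE is an SSE2 instruction.
     * The "memory" clobber is an extra compiler-level barrier added
     * here as an assumption. */
    __asm__ volatile("lfence\n\t"
                     "rdtsc"
                     : "=a"(lo), "=d"(hi)
                     :
                     : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumed fallback so the sketch builds on other targets. */
    return (uint64_t)clock();
#endif
}
```

A benchmark would call the helper before and after the code under test and subtract the two readings.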
Henrik Gramner [Mon, 4 Aug 2014 23:42:51 +0000 (01:42 +0200)]
x86: Minor pixel_ssim_end4 improvements
Reduce the number of vector registers used from 7 to 5.
Eliminate some moves in the AVX implementation.
Avoid bypass delays for transitioning between int and float domains.
Janne Grunau [Fri, 18 Jul 2014 13:49:10 +0000 (14:49 +0100)]
aarch64: deblocking NEON asm
Deblock chroma/luma are based on libav's h264 aarch64 NEON deblocking
filter which was ported by me from the existing ARM NEON asm. No
additional persons to ask for a relicense.
Janne Grunau [Wed, 2 Apr 2014 14:31:28 +0000 (16:31 +0200)]
checkasm: add memory clobber to read_time inline asm
The memory clobber acts as a compiler barrier, preventing aggressive
reordering of read_time calls. gcc 4.8 reorders some of the initial read_time
calls after the second one when targeting arm.
Janne Grunau [Sun, 20 Jul 2014 11:34:27 +0000 (13:34 +0200)]
arm: move instructions after '.rept' to separate line
The gas manual states "Repeat the sequence of lines between the .rept
directive and the next .endr directive ...". GNU as seems to support
instructions on the same line as .rept anyway but the integrated
assembler in llvm trunk (to be released as 3.5 in August 2014) does not.
Diego Biurrun [Wed, 7 May 2014 19:43:15 +0000 (21:43 +0200)]
build: Add dependencies on x86inc.asm/x86util.asm for all .asm files
This is a little bit overzealous, but errs on the side of caution.
Generating full dependency information is also possible, but slightly
slows down the build as YASM cannot do it as a side effect of compilation.
Tal Aloni [Tue, 17 Jun 2014 22:10:56 +0000 (15:10 -0700)]
Fix frame-packing==5 with some decoders
The spec mandates that frame-packing==5 requires the SEI on every frame that
begins a view sequence (i.e. the input frames L0-R0-L1-R1 have 4 view sequences,
but if reordered by the encoder to L0-L1-R0-R1 there are now 2 view sequences).
For simplicity, we write the SEI on every frame.
This fixes frame-packing==5 3D playback on some decoders (PlayStation 3, Sony
W8 series, possibly others).
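The view-sequence counting in the example above can be sketched as follows. This assumes a "view sequence" is a maximal run of consecutive frames from the same view, which is how the L0-R0-L1-R1 example reads; the names view_t, begins_view_sequence and count_view_sequences are hypothetical and do not come from x264.

```c
#include <stddef.h>

/* Hypothetical view tag for each frame in coded order. */
typedef enum { VIEW_LEFT = 0, VIEW_RIGHT = 1 } view_t;

/* A frame begins a view sequence if it is the first frame or its view
 * differs from the previous frame's view (assumed interpretation). */
static int begins_view_sequence(const view_t *views, size_t i)
{
    return i == 0 || views[i] != views[i - 1];
}

/* Count the frames on which the SEI is strictly required. */
static size_t count_view_sequences(const view_t *views, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (begins_view_sequence(views, i))
            count++;
    return count;
}
```

Under this reading, L0-R0-L1-R1 needs the SEI on all four frames and L0-L1-R0-R1 on two of them. Writing it on every frame covers both orders without tracking the view of the previous frame.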
Janne Grunau [Tue, 1 Apr 2014 20:11:43 +0000 (22:11 +0200)]
arm: do not export every asm function
Based on Libav's libavutil/arm/asm.S. Also prevents having the same
label twice for every function on systems not defining EXTERN_ASM.
Clang's integrated assembler does not like it.
Fiona Glaser [Sun, 23 Feb 2014 18:36:55 +0000 (10:36 -0800)]
Macroblock tree overhaul/optimization
Move the second core part of macroblock tree into an assembly function;
SIMD-optimize roughly half of it (for x86). Roughly 25-65% faster mbtree,
depending on content.
Slightly change how mbtree handles the tradeoff between range and precision
for propagation.
Overall a slight (but mostly negligible) effect on SSIM and ~2% faster.
Henrik Gramner [Sun, 16 Feb 2014 20:24:54 +0000 (21:24 +0100)]
x86: Minor mbtree_propagate_cost improvements
Reduce the number of registers used from 7 to 6.
Reduce the number of vector registers used by the AVX2 implementation from 8 to 7.
Multiply fps_factor by 1/256 once per frame instead of once per macroblock row.
Use mova instead of movu for dst since it's guaranteed to be aligned.
Some cosmetics.
Henrik Gramner [Sun, 9 Feb 2014 22:58:04 +0000 (23:58 +0100)]
x86inc: Support arbitrary stack alignments
If the stack is known to be at least 32-byte aligned we can safely store ymm
registers on the stack without doing manual alignment.
Change ALLOC_STACK to always align the stack before allocating stack space for
consistency. Previously alignment would occur either before or after allocating
stack space depending on whether manual alignment was required or not.
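The "align first, then allocate" order can be modelled with plain address arithmetic. This is a sketch under stated assumptions, not the x86inc macro logic: align_down and alloc_aligned_stack are hypothetical names, the stack is assumed to grow downward, and the alignment is assumed to be a power of two.

```c
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two). */
static inline uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1);
}

/* Hypothetical model: align the stack pointer first, then subtract the
 * requested size padded up to a multiple of the alignment, so the
 * returned address is itself aligned. */
static inline uintptr_t alloc_aligned_stack(uintptr_t sp, uintptr_t size,
                                            uintptr_t align)
{
    uintptr_t padded = (size + align - 1) & ~(align - 1);
    return align_down(sp, align) - padded;
}
```

With a 32-byte alignment, the returned address is suitable for aligned ymm stores. When the incoming stack pointer is already known to be 32-byte aligned, the align_down step changes nothing, which is the case where manual alignment can be skipped.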