benchmarks on a Nexus 9 (nvidia denver):
101.3 cycles in x264_cabac_encode_decision_c, 67105369 runs, 3495 skips
97.3 cycles in x264_cabac_encode_decision_asm, 67105493 runs, 3371 skips
132.8 cycles in x264_cabac_encode_terminal_c, 1046950 runs, 1626 skips
116.1 cycles in x264_cabac_encode_terminal_asm, 1048424 runs, 152 skips
92.4 cycles in x264_cabac_encode_bypass_c, 16776192 runs, 1024 skips
89.6 cycles in x264_cabac_encode_bypass_asm, 16776453 runs, 763 skips
Cycle counts are not as stable as one would like. The dynamic code
optimisation seems to produce different results for small chnages in a
binary. Repeated runs with the same binary produce stable results
though (ignoring the first run).
Janne Grunau [Fri, 10 Oct 2014 08:29:15 +0000 (10:29 +0200)]
aarch64: NEON asm for intra chroma deblocking
deblock_h_chroma_420_intra, deblock_h_chroma_422_intra and
x264_deblock_h_chroma_intra_mbaff_neon are ~3 times faster.
deblock_chroma_intra[1] is ~4 times faster than C.
Janne Grunau [Thu, 14 Aug 2014 13:22:50 +0000 (14:22 +0100)]
aarch64: NEON asm for integral init
integral_init4h_neon and integral_init8h_neon are 3-4 times faster than
C. integral_init8v_neon is 6 times faster and integral_init4v_neon is 10
times faster.
Janne Grunau [Tue, 29 Jul 2014 10:06:24 +0000 (11:06 +0100)]
aarch64: NEON asm for missing x264_zigzag_* functions
zigzag_scan_4x4_field_neon, zigzag_sub_4x4_field_neon,
zigzag_sub_4x4ac_field_neon, zigzag_sub_4x4_frame_neon,
igzag_sub_4x4ac_frame_neon more than 2 times faster
zigzag_scan_8x8_frame_neon, zigzag_scan_8x8_field_neon,
zigzag_sub_8x8_field_neon, zigzag_sub_8x8_frame_neon 4-5 times faster
Henrik Gramner [Wed, 8 Oct 2014 20:25:35 +0000 (22:25 +0200)]
checkasm: Serialize read_time() calls on x86
Improves the accuracy of benchmarks, especially in short functions.
To quote the Intel 64 and IA-32 Architectures Software Developer's Manual:
"The RDTSC instruction is not a serializing instruction. It does not necessarily
wait until all previous instructions have been executed before reading the counter.
Similarly, subsequent instructions may begin execution before the read operation
is performed. If software requires RDTSC to be executed only after all previous
instructions have completed locally, it can either use RDTSCP (if the processor
supports that instruction) or execute the sequence LFENCE;RDTSC."
RDTSCP would accomplish the same task, but it's only available since Nehalem.
This change makes SSE2 a requirement to run checkasm.
Henrik Gramner [Mon, 4 Aug 2014 23:42:51 +0000 (01:42 +0200)]
x86: Minor pixel_ssim_end4 improvements
Reduce the number of vector registers used from 7 to 5.
Eliminate some moves in the AVX implementation.
Avoid bypass delays for transitioning between int and float domains.
Janne Grunau [Fri, 18 Jul 2014 13:49:10 +0000 (14:49 +0100)]
aarch64: deblocking NEON asm
Deblock chroma/luma are based on libav's h264 aarch64 NEON deblocking
filter which was ported by me from the existing ARM NEON asm. No
additional persons to ask for a relicense.
Janne Grunau [Wed, 2 Apr 2014 14:31:28 +0000 (16:31 +0200)]
checkasm: add memory clobber to read_time inline asm
The memory acts as compiler barrier preventing aggressive reordering
of read_time calls. gcc 4.8 reorders some of initial read_time calls
after the second when targeting arm.
Janne Grunau [Sun, 20 Jul 2014 11:34:27 +0000 (13:34 +0200)]
arm: move instructions after '.rept' to separate line
The gas manual states "Repeat the sequence of lines between the .rept
directive and the next .endr directive ...". GNU as seems to support
instructions on the same line as .rept anyway but the integrated
assembler in llvm trunk (to be released 3.5 in August 2014) does not.
Diego Biurrun [Wed, 7 May 2014 19:43:15 +0000 (21:43 +0200)]
build: Add dependencies on x86inc.asm/x86util.asm for all .asm files
This is a little bit overzealous, but errs on the side of caution.
Generating full dependency information is also possible, but slightly
slows down the build as YASM cannot do it as a sideeffect of compilation.
Tal Aloni [Tue, 17 Jun 2014 22:10:56 +0000 (15:10 -0700)]
Fix frame-packing==5 with some decoders
The spec mandates that frame-packing==5 requires the SEI on every frame that
begins a view sequence (i.e. the input frames L0-R0-L1-R1 have 4 view sequences,
but if reordered by the encoder to L0-L1-R0-R1 there are now 2 view sequences).
For simplicity, we write the SEI on every frame.
This fixes frame-packing==5 3D playback on some decoders (PlayStation 3, Sony
W8 series, possibly others).