Henrik Gramner [Sat, 19 Feb 2022 21:57:49 +0000 (22:57 +0100)]
x86inc: Enable 4-operand emulation for variable blend instructions
With legacy encoding, the last operand (the index) must be xmm0,
but aside from that, emulating non-destructive forms works
the same as for any other instruction.
Jessica Clarke [Sun, 19 Dec 2021 22:13:09 +0000 (22:13 +0000)]
configure: Always make shared imply PIC
Building a shared library without -fPIC does not make sense. On most
architectures, especially recent ones, doing so will give link-time
errors due to relocations in read-only sections like .text. On some
legacy architectures, including i386, it is allowed by default, but will
warn, and is highly discouraged due to the overheads it adds at library
load time. Most architectures were already listed here as having shared
imply PIC, but not all, such as i386 which ends up with unwanted text
relocations, as well as architectures not known to the build system
currently like RISC-V, which does not permit text relocations by
default. There is no good reason to want shared without PIC on any
architecture, so just remove the architecture list.
Henrik Gramner [Sun, 12 Dec 2021 22:15:51 +0000 (23:15 +0100)]
Remove thread priority tweaking
Back in 2009, when this was added, it improved scheduling of lookahead
threads on the operating systems prevalent at the time.
According to more recent testing by Intel, however, lowering thread
priorities does not improve performance on modern operating systems.
And more importantly, doing so on systems with heterogeneous CPU
topologies may actually result in a severe performance reduction.
Removing this code altogether eliminates the issue with performance
degradation on such systems, while having no noticeable impact on
regular systems with homogeneous CPU topologies.
The lookahead_thread main loop checks b_exit_thread and exits if it is set.
That flag is set by x264_lookahead_delete, which uses ifbuf.mutex to guard accessing it.
However, the read in the while-loop condition of lookahead_thread is not guarded,
and so TSAN sometimes reports a data race.
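The race described above comes from reading a shared flag without the mutex that its writer holds. One way to avoid that kind of report is to take the same lock for the read in the loop condition. The following is a minimal Python sketch of that pattern only; the class and method names (Lookahead, should_exit, request_exit) are hypothetical and do not come from the x264 sources, and it is not meant to reproduce the actual fix.

```python
import threading
import time


class Lookahead:
    """Toy worker whose exit flag is always accessed under one mutex."""

    def __init__(self):
        self.mutex = threading.Lock()   # stands in for ifbuf.mutex
        self.exit_thread = False        # stands in for b_exit_thread
        self.iterations = 0

    def should_exit(self):
        # Guarded read: the loop condition takes the same lock as the writer.
        with self.mutex:
            return self.exit_thread

    def request_exit(self):
        # Guarded write, as done by the "delete" side.
        with self.mutex:
            self.exit_thread = True

    def run(self):
        while not self.should_exit():
            self.iterations += 1
            time.sleep(0.001)


def demo():
    """Start the worker, ask it to stop, and report whether it stopped."""
    worker = Lookahead()
    thread = threading.Thread(target=worker.run)
    thread.start()
    worker.request_exit()
    thread.join(timeout=5)
    return not thread.is_alive()
```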
Henrik Gramner [Mon, 14 Jun 2021 10:20:01 +0000 (12:20 +0200)]
x86inc: Support memory operands in src1 in 3-operand instructions
Particularly in code that makes heavy use of macros it's possible
to end up with 3-operand instructions with a memory operand in src1.
In the case of SSE this works fine due to automatic move insertions,
but in AVX that fails since memory operands are only allowed in src2.
The main purpose of this feature is to minimize the amount of code
changes required to facilitate conversion of existing SSE code to AVX.
Support writing the content light level information SEI message
Use --cll to specify the maximum content light level (MaxCLL)
and the maximum frame average light level (MaxFALL) as described
by the CTA 861.3 specification.
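As a rough illustration of what the two values mean, the sketch below derives them from decoded frames. It assumes the usual reading of CTA 861.3: MaxCLL is the brightest pixel component anywhere in the stream, and MaxFALL is the highest per-frame average of each pixel's brightest component, both in cd/m^2. The helper names and the comma-separated "MaxCLL,MaxFALL" argument format are assumptions made for this example, not taken from the commit.

```python
def light_levels(frames):
    """Return (MaxCLL, MaxFALL) for an iterable of non-empty frames.

    Each frame is a list of (R, G, B) tuples in linear light, cd/m^2.
    """
    max_cll = 0.0
    max_fall = 0.0
    for frame in frames:
        # Per-pixel light level is the brightest of the three components.
        peaks = [max(r, g, b) for (r, g, b) in frame]
        max_cll = max(max_cll, max(peaks))
        max_fall = max(max_fall, sum(peaks) / len(peaks))
    return max_cll, max_fall


def cll_argument(max_cll, max_fall):
    """Format the two values as integers (assumed --cll argument layout)."""
    return f"{round(max_cll)},{round(max_fall)}"
```

For instance, `cll_argument(*light_levels(frames))` could be passed after `--cll` if the argument format assumed here is right.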
Phillip Blucas [Wed, 12 Jun 2019 14:52:38 +0000 (09:52 -0500)]
Support writing the mastering display color volume SEI message
Use --mastering-display to specify the properties of the reference display.
A formatted string with all 10 values is required: G,B,R primaries and
white point coordinates, plus max/min brightness. Coordinates are in
0.00002 increments. Brightness units are 0.0001 cd/m^2.
For example, a 1000 nit display with P3 primaries, a D65 white point and a 0.0001 nit black level:
--mastering-display G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
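The unit conversion can be checked with a short sketch. It scales CIE xy chromaticities by 1/0.00002 = 50000 and luminance in cd/m^2 by 1/0.0001 = 10000, then assembles the string in the order shown above. The function name and its parameters are hypothetical helpers for illustration; only the output format comes from the commit message.

```python
def mastering_display(green, blue, red, white_point, max_nits, min_nits):
    """Build a --mastering-display value from xy pairs and luminance in nits."""

    def xy(point):
        x, y = point
        # Coordinates are expressed in increments of 0.00002.
        return f"({round(x * 50000)},{round(y * 50000)})"

    def lum(nits):
        # Brightness is expressed in units of 0.0001 cd/m^2.
        return round(nits * 10000)

    return (
        f"G{xy(green)}B{xy(blue)}R{xy(red)}WP{xy(white_point)}"
        f"L({lum(max_nits)},{lum(min_nits)})"
    )


# Chromaticities that scale to the example string above.
example = mastering_display(
    green=(0.265, 0.690),
    blue=(0.150, 0.060),
    red=(0.680, 0.320),
    white_point=(0.3127, 0.3290),
    max_nits=1000,
    min_nits=0.0001,
)
```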
Henrik Gramner [Wed, 10 Feb 2021 14:40:32 +0000 (15:40 +0100)]
x86inc: Add stack probing on Windows
Large stack allocations on Windows need to use stack probing in order
to guarantee that all stack memory is committed before accessing it.
This is done by ensuring that the guard page(s) at the end of the
currently committed pages are touched prior to any pages beyond that.
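The ordering requirement can be modelled in a few lines. The sketch below lists the offsets below the current stack pointer that a probe sequence would touch for a given allocation size, nearest first, so that no access lands more than one page beyond the previous one. The 4096-byte page size and the function name are assumptions for illustration; this is not the code x86inc emits.

```python
PAGE_SIZE = 4096  # assumed page size


def probe_offsets(alloc_size, page_size=PAGE_SIZE):
    """Offsets (in bytes below the stack pointer) to touch, nearest first.

    One probe per full page, followed by a final probe at the end of the
    allocation, so the guard page is always hit before anything past it.
    """
    if alloc_size <= 0:
        raise ValueError("alloc_size must be positive")
    offsets = list(range(page_size, alloc_size, page_size))
    offsets.append(alloc_size)
    return offsets
```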
Janne Grunau [Thu, 1 Oct 2020 21:08:37 +0000 (21:08 +0000)]
aarch64/asm: optimize cabac asm
0.5% - 2% overall speedup on
`./x264 --threads X --profile high --preset veryfast --crf 15 -o /dev/null park_joy_420_720p50.y4m`
cabac is responsible for roughly 1/6 of the CPU use.
Branch mispredictions are reduced by 15% to 20%.
Janne Grunau [Thu, 1 Oct 2020 23:49:53 +0000 (01:49 +0200)]
aarch64/asm: optimize cabac_encode_terminal with extrinsic knowledge
Approach taken from x86 asm. The overall speedup is negligible.
cabac_encode_terminal is on average twice as fast on Cortex-A53 when
encoding with the following command:
./x264 --threads 1 --profile high --preset veryfast --crf 15 -o /dev/null park_joy_420_720p50.y4m
Should be called to free struct members allocated internally by libx264,
e.g. by x264_param_parse.
Partially based on videolan/x264!18 by Derek Buitenhuis.