granicus.if.org Git - libx264/log
libx264
Fiona Glaser [Fri, 10 May 2013 00:20:05 +0000 (17:20 -0700)]
x86-64: 64-bit variant of AVX2 hpel_filter

~5% faster than 32-bit.
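The half-pel planes that hpel_filter produces come from the H.264 six-tap luma filter (1, -5, 20, 20, -5, 1), rounded with (+16) >> 5 and clipped. For reference, here is a minimal scalar sketch of the horizontal case for 8-bit pixels. The function names are hypothetical; this is not x264's actual C reference code.

```c
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static inline uint8_t clip_pixel(int x)
{
    return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x);
}

/* Half-pel sample between src[0] and src[1] using the six-tap filter
 * (1, -5, 20, 20, -5, 1). src needs 2 valid pixels to the left and 3 to the
 * right. The sum has a gain of 32, so it is rounded with (+16) >> 5. */
static uint8_t hpel_h(const uint8_t *src)
{
    int sum = src[-2] - 5 * src[-1] + 20 * src[0]
            + 20 * src[1] - 5 * src[2] + src[3];
    return clip_pixel((sum + 16) >> 5);
}

/* Filter one row of `width` horizontal half-pel samples. */
static void hpel_filter_row(uint8_t *dst, const uint8_t *src, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = hpel_h(src + x);
}
```

A flat area stays flat (the taps sum to 32), and a hard edge from 0 to 255 lands on the midpoint, 128.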

Henrik Gramner [Mon, 6 May 2013 16:41:24 +0000 (18:41 +0200)]
x86: AVX2 high bit-depth denoise_dct

28->15 cycles

Also reorder instructions to use fewer registers, 3 cycles faster on Ivy Bridge with 64-bit Windows.

Henrik Gramner [Sat, 4 May 2013 16:48:58 +0000 (18:48 +0200)]
x86: AVX2 high bit-depth quant

quant_4x4: 13->6 cycles
quant_4x4_dc: 14->8 cycles
quant_8x8: 47->24 cycles
quant_4x4x4: 48->25 cycles

Fiona Glaser [Wed, 1 May 2013 21:32:11 +0000 (14:32 -0700)]
x86: AVX2 add16x16_idct_dc

27 -> 19 cycles

Fiona Glaser [Mon, 29 Apr 2013 23:16:54 +0000 (16:16 -0700)]
x86: faster AVX2 quant_4x4x4

10->9 cycles

Fiona Glaser [Sun, 28 Apr 2013 04:03:32 +0000 (21:03 -0700)]
x86: AVX2 intra_sad_x3_8x8c

30->22 cycles

Henrik Gramner [Sun, 28 Apr 2013 09:11:03 +0000 (11:11 +0200)]
x86: AVX2 high bit-depth intra_sad_x3_8x8

43->24 cycles

Fiona Glaser [Wed, 24 Apr 2013 21:22:15 +0000 (14:22 -0700)]
x86: AVX2 deblock strength

30->18 cycles

Henrik Gramner [Wed, 1 May 2013 15:42:48 +0000 (17:42 +0200)]
x86: Faster high bit-depth intra_sad_x3_4x4

20->16 cycles on Ivy Bridge

Fiona Glaser [Wed, 1 May 2013 00:36:46 +0000 (17:36 -0700)]
x86: faster SSSE3 hpel

~7% faster using the pmulhrsw trick from mc_chroma.
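pmulhrsw computes a rounded high multiply per 16-bit lane: ((a * b >> 14) + 1) >> 1. Assuming the mc_chroma trick is the usual one of multiplying by the constant 1 << (15 - n), that single multiply acts as a rounded right shift by n, replacing a separate add and shift. A scalar model of both sides (helper names hypothetical):

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of SSSE3 pmulhrsw.
 * Relies on arithmetic right shift of negative values, as all mainstream
 * compilers implement it. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * b;
    return (int16_t)(((prod >> 14) + 1) >> 1);
}

/* Rounded right shift by n (1 <= n <= 14): (x + 2^(n-1)) >> n. */
static int16_t round_shift(int16_t x, int n)
{
    return (int16_t)((x + (1 << (n - 1))) >> n);
}
```

For example, multiplying by 512 (1 << 9) gives the same result as (x + 32) >> 6 for every 16-bit input.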

Fiona Glaser [Mon, 29 Apr 2013 21:22:23 +0000 (14:22 -0700)]
x86-64: faster SSSE3 trellis

~2% faster trellis.

Fiona Glaser [Fri, 3 May 2013 00:10:26 +0000 (17:10 -0700)]
x86: 32-byte align the stack if possible

Avoids the need for manual 32-byte array alignment on compilers that support
-mpreferred-stack-boundary.
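Without a compiler-aligned stack, a function that wants a 32-byte-aligned local buffer (for example, for aligned 32-byte AVX2 loads and stores) has to over-allocate and round the pointer up by hand. A minimal sketch of that manual approach, which stack realignment makes unnecessary; both helpers are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Round a pointer up to the next multiple of `align` (a power of two). */
static void *align_up(void *p, uintptr_t align)
{
    uintptr_t u = (uintptr_t)p;
    return (void *)((u + align - 1) & ~(align - 1));
}

/* Hypothetical helper: copy 64 bytes into a 32-byte-aligned stack scratch
 * buffer and sum them, without relying on the compiler to align the stack. */
static int sum_aligned_scratch(const uint8_t *src)
{
    uint8_t raw[64 + 31];               /* over-allocate by align - 1 bytes */
    uint8_t *buf = align_up(raw, 32);   /* aligned 64-byte window inside raw */
    memcpy(buf, src, 64);
    int sum = 0;
    for (int i = 0; i < 64; i++)
        sum += buf[i];
    return sum;
}
```

The price is up to 31 wasted bytes per buffer plus the pointer arithmetic.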

Henrik Gramner [Sat, 11 May 2013 21:39:09 +0000 (23:39 +0200)]
x86inc: Utilize the shadow space on 64-bit Windows

Store XMM6 and XMM7 in the shadow space in functions that clobber them.
This way we don't have to adjust the stack pointer as often,
reducing the number of instructions as well as code size.

Henrik Gramner [Fri, 3 May 2013 21:06:10 +0000 (23:06 +0200)]
x86: Don't use explicitly aligned versions of SAD on AVX CPUs

On modern CPUs, movdqu isn't slower than movdqa when used on aligned data, and using the same code in both cases saves cache.

This was already done for the high bit-depth AVX2 implementation, but the aligned version still existed as dead code, so remove it.

Henrik Gramner [Fri, 3 May 2013 18:18:03 +0000 (20:18 +0200)]
x86: Add missing initializations for high bit-depth sad_aligned

Fiona Glaser [Mon, 13 May 2013 23:52:18 +0000 (16:52 -0700)]
x86: add Jaguar CPU detection

Henrik Gramner [Tue, 7 May 2013 15:21:03 +0000 (17:21 +0200)]
x86inc: Remove .rodata kludges

The Mach-O bug was fixed in yasm 0.8.0 and we don't support versions that old.

a.out was superseded by ELF on sane systems a few decades ago.

Henrik Gramner [Sat, 4 May 2013 14:21:32 +0000 (16:21 +0200)]
checkasm: Use 64-bit cycle counters

Prevents overflows that can occur in some cases.
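A cycle total kept in a 32-bit variable wraps around once it passes 2^32 - 1, which long benchmark runs of slow functions can reach. The sketch below (hypothetical helpers, not checkasm's code) shows the wraparound next to a 64-bit accumulator:

```c
#include <stdint.h>

/* Sum per-run cycle counts into a 32-bit total: wraps modulo 2^32. */
static uint32_t total_cycles_32(const uint32_t *runs, int n)
{
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += runs[i];           /* silently wraps past 4294967295 */
    return total;
}

/* Same sum with a 64-bit accumulator: no wrap for any realistic total. */
static uint64_t total_cycles_64(const uint32_t *runs, int n)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
        total += runs[i];
    return total;
}
```

With runs of 3e9, 2e9 and 1 cycles, the 32-bit total comes out as 705032705 instead of 5000000001.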

Henrik Gramner [Fri, 10 May 2013 11:55:32 +0000 (13:55 +0200)]
checkasm: Fix stack alignment bug

Fiona Glaser [Wed, 8 May 2013 17:48:41 +0000 (10:48 -0700)]
Fix invalid memcpy in sliced-threads

Likely didn't actually break in practice, but memcpy with src==dst
is incorrect.
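The C standard leaves memcpy undefined when the source and destination overlap, and src == dst is complete overlap. One way to avoid it, sketched here with a hypothetical helper (not necessarily how the sliced-threads code was fixed), is to skip the copy when the pointers are equal:

```c
#include <string.h>

/* Copy n bytes unless dst and src are the same buffer, in which case the
 * copy would be a no-op anyway. Callers must still guarantee that distinct
 * buffers do not partially overlap. */
static void copy_if_distinct(void *dst, const void *src, size_t n)
{
    if (dst != src)
        memcpy(dst, src, n);
}
```

If the buffers can partially overlap, memmove is the correct call instead.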

Fiona Glaser [Mon, 29 Apr 2013 19:14:01 +0000 (12:14 -0700)]
Fix two bugs in slice-min-mbs and slices-max

Slices-max broke slice-max-size when slices-max wasn't used.
Slice-min-mbs broke in rare cases near the end of a threadslice.

Fiona Glaser [Fri, 5 Apr 2013 01:00:23 +0000 (18:00 -0700)]
x86: SSSE3 LUT-based faster coeff_level_run

~2x faster coeff_level_run.
Faster CAVLC encoding: {1%,2%,7%} overall with {superfast,medium,slower}.
Uses the same pshufb LUT abuse trick as in the previous ads_mvs patch.
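CAVLC codes a block as its nonzero levels in reverse scan order plus the runs of zeros between them, and coeff_level_run gathers that information. The SSSE3 version does it with a pshufb lookup table; the scalar sketch below only shows what is being computed, and its struct layout and names are hypothetical rather than x264's:

```c
#include <stdint.h>

/* Hypothetical output: nonzero levels from the highest index down, the index
 * of the last nonzero coefficient, and a bitmask of nonzero positions from
 * which the zero runs between levels can be derived. */
typedef struct {
    int last;
    uint32_t mask;
    int16_t level[16];
} run_level;

/* Scalar reference for a 16-coefficient block. Returns the number of
 * nonzero levels; last is -1 and mask is 0 for an all-zero block. */
static int coeff_level_run16(const int16_t dct[16], run_level *rl)
{
    int total = 0;
    rl->last = -1;
    rl->mask = 0;
    for (int i = 15; i >= 0; i--) {
        if (dct[i] == 0)
            continue;
        if (rl->last < 0)
            rl->last = i;           /* first nonzero seen from the end */
        rl->level[total++] = dct[i];
        rl->mask |= 1u << i;
    }
    return total;
}
```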

Fiona Glaser [Mon, 25 Mar 2013 21:03:37 +0000 (14:03 -0700)]
x86-64: BMI2 cabac_residual functions

Fiona Glaser [Wed, 20 Mar 2013 22:08:35 +0000 (15:08 -0700)]
x86: SSSE3 ads_mvs

~55% faster ads in benchasm, ~15-30% in real encoding.
~4% faster "placebo" preset overall.

Henrik Gramner [Tue, 16 Apr 2013 21:27:53 +0000 (23:27 +0200)]
x86: AVX2 pixel_ssd_nv12_core

Henrik Gramner [Tue, 16 Apr 2013 21:27:50 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth pixel_ssd

Henrik Gramner [Tue, 16 Apr 2013 21:27:46 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth pixel_sad_x3/pixel_sad_x4

Also reduce the number of xmm registers used by sse2/ssse3 pixel_sad_x3.

Henrik Gramner [Tue, 16 Apr 2013 21:27:43 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth vsad

Henrik Gramner [Tue, 16 Apr 2013 21:27:39 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth pixel_sad

Also use loops instead of duplicating code; reduces code size by ~10kB with
negligible effect on performance.

Henrik Gramner [Tue, 16 Apr 2013 21:27:35 +0000 (23:27 +0200)]
x86: AVX2 high_bit_depth pixel_avg2, get_ref, mc_copy_w16, mc_luma

Also reduce the number of xmm registers used by mc_copy_* to avoid
saving and restoring xmm6 and xmm7 on 64-bit Windows.

Henrik Gramner [Tue, 16 Apr 2013 21:27:32 +0000 (23:27 +0200)]
x86: AVX2 nal_escape

Also rewrite the entire function to be faster and drop the AVX version which is no longer useful.

Henrik Gramner [Tue, 16 Apr 2013 21:27:29 +0000 (23:27 +0200)]
x86: AVX memzero_aligned

Henrik Gramner [Tue, 16 Apr 2013 21:27:25 +0000 (23:27 +0200)]
x86: AVX2 predict_16x16_dc

Henrik Gramner [Tue, 16 Apr 2013 21:27:22 +0000 (23:27 +0200)]
x86: AVX2 predict_8x8c_p/predict_8x16c_p

Henrik Gramner [Tue, 16 Apr 2013 21:27:18 +0000 (23:27 +0200)]
x86: AVX2 predict_16x16_p

Also fix the AVX implementation to correctly use the SSSE3 inline asm
instead of SSE2.

Henrik Gramner [Tue, 16 Apr 2013 21:27:14 +0000 (23:27 +0200)]
x86: AVX high bit-depth predict_16x16_v

Also restructure some code to reduce code size of various functions,
especially in high bit-depth.

Henrik Gramner [Tue, 16 Apr 2013 21:27:08 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth predict_4x4_h

Henrik Gramner [Tue, 16 Apr 2013 21:27:04 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth predict_16x16_h

Henrik Gramner [Tue, 16 Apr 2013 21:27:00 +0000 (23:27 +0200)]
x86: AVX2 high bit-depth predict_8x8c_h/predict_8x16c_h

Henrik Gramner [Tue, 16 Apr 2013 21:26:47 +0000 (23:26 +0200)]
x86util: Support ymm registers in HADD macros

Fiona Glaser [Wed, 27 Feb 2013 00:26:34 +0000 (16:26 -0800)]
x86: more AVX2 framework, AVX2 functions, plus some existing asm tweaks

AVX2 functions:
mc_chroma
intra_sad_x3_16x16
last64
ads
hpel
dct4
idct4
sub16x16_dct8
quant_4x4x4
quant_4x4
quant_4x4_dc
quant_8x8
SAD_X3/X4
SATD
var
var2
SSD
zigzag interleave
weightp
weightb
intra_sad_8x8_x9
decimate
integral
hadamard_ac
sa8d_satd
sa8d
lowres_init
denoise

Loren Merritt [Mon, 25 Feb 2013 21:16:45 +0000 (21:16 +0000)]
x86inc: create xm# and ym#, analogous to m#

For when we want to mix simd sizes within one function.

Fiona Glaser [Fri, 5 Apr 2013 23:08:35 +0000 (16:08 -0700)]
x86inc: fix AVX emulation of cmp(p|s)(s|d)

Fiona Glaser [Wed, 6 Feb 2013 01:15:00 +0000 (17:15 -0800)]
x86-64: cabac_block_residual assembly

RDO: ~20% faster than C
Bitstream: ~50% faster than C
1-2% faster overall, highest on preset superfast/fast/medium.

Steve Borho [Thu, 21 Feb 2013 18:48:40 +0000 (12:48 -0600)]
OpenCL lookahead

OpenCL support is compiled in by default, but must be enabled at runtime by an
--opencl command line flag. Compiling OpenCL support requires perl. To avoid
the perl requirement use: configure --disable-opencl.

When enabled, the lookahead thread is mostly off-loaded to an OpenCL-capable GPU
device.  Lowres intra cost prediction, lowres motion search (including subpel)
and bidir cost predictions are all done on the GPU.  MB-tree and final slice
decisions are still done by the CPU.  Presets which do not use a threaded
lookahead will not use OpenCL at all (superfast, ultrafast).

Because of data dependencies, the GPU must use an iterative motion search which
performs more total work than the CPU would do, so this is not work-efficient
or power-efficient. But if there are GPU cycles to spare, it can often
speed up the encode. Output quality with OpenCL lookahead enabled is often
very slightly worse than with the CPU lookahead (because of the same data
dependencies).

x264 must compile its OpenCL kernels for your device before running them, and in
order to avoid doing this every run it caches the compiled kernel binary in a
file named x264_lookahead.clbin (--opencl-clbin FNAME to override).  The cache
file will be ignored if the device, driver, or OpenCL source are changed.

x264 will use the first GPU device which supports the cl_image features
required by its kernels. Most modern discrete GPUs and all AMD integrated GPUs
will work. Intel integrated GPUs (up to Ivy Bridge) do not support those
features. Use --opencl-device N to specify a number of capable GPUs to skip
during device detection.

Switchable graphics environments (e.g. AMD Enduro) are currently not supported,
as some have bugs in their OpenCL drivers that cause output to be silently
incorrect.

Developed by MulticoreWare with support from AMD and Telestream.

Fiona Glaser [Mon, 4 Mar 2013 23:19:47 +0000 (15:19 -0800)]
weightp: improve scale/offset search, chroma

Rescale the scale factor if the offset clips. This makes weightp more effective
in fades to/from white (and other situations that require big offsets).

Search more than 1 scale factor and more than 1 offset, depending on --subme.

Try to find the optimal chroma denominator instead of hardcoding it.

Overall improvement: a few percent in fade-heavy clips, such as a sample from
Avatar: TLA.
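Explicit weighted prediction in H.264 maps each reference pixel through a scale, a log2 denominator and an offset, then clips; these are the parameters the weightp search tunes. A minimal 8-bit sketch of the per-pixel formula (names hypothetical):

```c
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static inline uint8_t clip_pixel(int x)
{
    return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x);
}

/* Weighted prediction of one pixel:
 *   denom >= 1: ((pix * scale + 2^(denom-1)) >> denom) + offset
 *   denom == 0: pix * scale + offset
 * followed by clipping to [0, 255]. */
static uint8_t weight_pixel(uint8_t pix, int scale, int denom, int offset)
{
    int v = denom > 0 ? ((pix * scale + (1 << (denom - 1))) >> denom) + offset
                      : pix * scale + offset;
    return clip_pixel(v);
}
```

For example, weight_pixel(200, 64, 6, 100) saturates at 255: scale 64 with denominator 6 is a weight of 1.0, so an offset of 100 pushes bright pixels past the 8-bit range.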

Fiona Glaser [Tue, 19 Feb 2013 21:48:44 +0000 (13:48 -0800)]
Add slices-max feature

The H.264 spec technically has limits on the number of slices per frame. x264
normally ignores these limits, since most use-cases that require large numbers
of slices prefer that it do so. However, certain decoders may break with
extremely large numbers of slices, as can occur with some slice-max-size/mbs
settings.

When set, x264 will refuse to create any slices beyond the maximum number,
even if slice-max-size/mbs requires otherwise.
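The slices-max rule amounts to a cap that overrides the other slice-splitting limits. A sketch of that decision with a hypothetical helper; this is not x264's actual slice code:

```c
/* Hypothetical helper: decide whether to close the current slice and open a
 * new one. size_limit_hit and mb_limit_hit stand for the slice-max-size and
 * slice-max-mbs conditions; max_slices <= 0 means no slices-max cap. */
static int should_start_new_slice(int slices_so_far, int max_slices,
                                  int size_limit_hit, int mb_limit_hit)
{
    if (!size_limit_hit && !mb_limit_hit)
        return 0;                   /* no limit asks for a new slice */
    if (max_slices > 0 && slices_so_far >= max_slices)
        return 0;                   /* the cap wins over the size/MB limits */
    return 1;
}
```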

Fiona Glaser [Fri, 15 Feb 2013 01:22:02 +0000 (17:22 -0800)]
Add slice-min-mbs feature

Works in conjunction with slice-max-mbs and/or slice-max-size to avoid overly
small slices.
Useful with certain decoders that barf on extremely small slices.

If slice-min-mbs would be violated as a result of slice-max-size, x264 will
exceed slice-max-size and print a warning.
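The slice-min-mbs fallback can be sketched as a small decision helper. The name and signature are hypothetical, and this is not x264's slice code:

```c
#include <stdio.h>

/* Hypothetical helper, called once slice-max-size has been exceeded: decide
 * whether to end the slice here. If that would leave a slice shorter than
 * min_mbs macroblocks, keep encoding into the current slice (overshooting
 * slice-max-size) and print a warning instead. */
static int end_slice_on_size(int mbs_in_slice, int min_mbs)
{
    if (mbs_in_slice < min_mbs) {
        fprintf(stderr, "warning: slice-min-mbs of %d forces slice-max-size "
                        "to be exceeded\n", min_mbs);
        return 0;
    }
    return 1;
}
```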

Anton Mitrofanov [Tue, 26 Mar 2013 14:56:21 +0000 (18:56 +0400)]
Disable mbtree asm with cpu-independent option

Results vary between versions because of differences in rounding.

Anton Mitrofanov [Tue, 26 Mar 2013 14:30:00 +0000 (18:30 +0400)]
Show "avs: no" with the --disable-avs option instead of an empty string

Tim Walker [Tue, 19 Mar 2013 22:42:43 +0000 (23:42 +0100)]
lavf input: don't use deprecated AVStream fields

Fixes building against newer libavcodecs from the Libav project.

Anton Mitrofanov [Tue, 26 Mar 2013 15:54:36 +0000 (19:54 +0400)]
Fix y4m input with C420paldv colorspace

Fiona Glaser [Sat, 2 Mar 2013 09:22:29 +0000 (01:22 -0800)]
x86: correctly check stack alignment for Atom hadamard_ac

Regression in r2265 (only affected compilers with broken stack alignment,
like ICL on win32).

Loren Merritt [Mon, 25 Feb 2013 21:23:55 +0000 (21:23 +0000)]
x86inc: fix some corner cases of SWAP

SWAP with >=3 named (rather than numbered) args
PERMUTE followed by SWAP with 2 named args
used to produce the wrong permutation

Fiona Glaser [Wed, 27 Feb 2013 21:30:22 +0000 (13:30 -0800)]
Fix array overreads that caused miscompilation in gcc 4.8

Fiona Glaser [Thu, 28 Feb 2013 21:32:37 +0000 (13:32 -0800)]
Fix undefined behavior in x264_ratecontrol_mb

Stefan Groenroos [Fri, 1 Mar 2013 20:35:34 +0000 (22:35 +0200)]
ARM: Fix bug in x264_quant_4x4x4_neon

Regression in r2273.

Stefan Groenroos [Mon, 25 Feb 2013 21:43:09 +0000 (23:43 +0200)]
ARM: update NEON mc_chroma to work with NV12 and re-enable it

Up to 10-15% faster overall.

Fiona Glaser [Thu, 14 Feb 2013 23:00:48 +0000 (15:00 -0800)]
CABAC/CAVLC: use the new bit-iterating macro here too

Fiona Glaser [Fri, 8 Feb 2013 23:34:38 +0000 (15:34 -0800)]
quant_4x4x4: quant one 8x8 block at a time

This reduces overhead and lets us use less branchy code for zigzag, dequant,
decimate, and so on.
Reorganize and optimize a lot of macroblock_encode using this new function.
~1-2% faster overall.

Includes NEON and x86 versions of the new function.
Using larger merged functions like this will also make wider SIMD, like
AVX2, more effective.
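A scalar sketch of the idea, modeled loosely on x264's C quant: each coefficient's magnitude is biased, multiplied by a quant factor and shifted down by 16, and the four 4x4 blocks of one 8x8 region are quantized together, with one bit per block reporting whether anything survived. The _ref names and the exact return convention are assumptions:

```c
#include <stdint.h>

/* Quantize one 4x4 block in place: q = ((|coef| + bias) * mf) >> 16 with the
 * sign restored. Returns 1 if any coefficient is nonzero afterwards. */
static int quant_4x4_ref(int16_t dct[16], const uint16_t mf[16],
                         const uint16_t bias[16])
{
    int32_t nz = 0;
    for (int i = 0; i < 16; i++) {
        int32_t c = dct[i];
        int32_t q;
        if (c > 0)
            q = (int32_t)(((uint64_t)bias[i] + (uint64_t)c) * mf[i] >> 16);
        else
            q = -(int32_t)(((uint64_t)bias[i] + (uint64_t)(-c)) * mf[i] >> 16);
        dct[i] = (int16_t)q;
        nz |= q;
    }
    return nz != 0;
}

/* Quantize the four 4x4 blocks of one 8x8 region; bit i of the result is set
 * if block i has any nonzero coefficient. */
static int quant_4x4x4_ref(int16_t dct[4][16], const uint16_t mf[16],
                           const uint16_t bias[16])
{
    int nza = 0;
    for (int i = 0; i < 4; i++)
        nza |= quant_4x4_ref(dct[i], mf, bias) << i;
    return nza;
}
```

A caller can then test that mask to skip zigzag, dequant and decimate for empty 4x4 blocks.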

Stephen Hutchinson [Wed, 13 Feb 2013 02:55:43 +0000 (21:55 -0500)]
Add AvxSynth support to the AviSynth input module.

Uses dlopen to load AvxSynth on Linux and OS X.

Allows the use of --demuxer avs for AvxSynth, though the only source filter it
can currently use is FFMS2.

Add a local copy of avxsynth_c.h and its dependent headers in extras/ so that
users don't need to actually have AvxSynth development headers installed to
enable support for it (mirroring the AviSynth behavior).

Based on a patch by 0x09 (tab@lavabit.com)

Fiona Glaser [Fri, 8 Feb 2013 08:13:15 +0000 (00:13 -0800)]
Eliminate some branchiness in ME/analysis

Faster, fewer branch mispredictions.
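A common way to remove a data-dependent branch is to turn the comparison into an all-ones or all-zeros mask and blend with it. This is a generic illustration of the technique, not the specific changes made in ME/analysis:

```c
/* Return the cheaper of two candidate costs without a conditional jump, and
 * report via *took_b whether candidate b won (ties keep a). */
static int min_cost_branchless(int cost_a, int cost_b, int *took_b)
{
    int mask = -(cost_b < cost_a);  /* all ones if b is cheaper, else 0 */
    *took_b = mask & 1;
    return (cost_a & ~mask) | (cost_b & mask);
}
```

Compilers often emit cmov for a plain ternary as well; the mask form just makes the branch-free intent explicit.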

Fiona Glaser [Thu, 7 Feb 2013 00:55:39 +0000 (16:55 -0800)]
Fix some store forwarding stalls
There are quite a few others, but fixing most of them doesn't help, or there's no
easy way to avoid them.

Fiona Glaser [Tue, 5 Feb 2013 09:23:23 +0000 (01:23 -0800)]
x86: faster AVX satd/sa8d/sa8d_satd/hadamard_ac

Use Conroe-style movddup in AVX transforms; both Sandy Bridge and Bulldozer
do movddup in the load unit, so it's totally free this way.

On Sandy Bridge:
~6% faster sa8d_satd
~5% faster hadamard_ac
~9% faster 32-bit satd
~2% faster sa8d

Fiona Glaser [Sat, 2 Feb 2013 20:37:08 +0000 (12:37 -0800)]
x86: detect Bobcat, improve Atom optimizations, reorganize flags

The Bobcat has a 64-bit SIMD unit reminiscent of the Athlon 64; detect this
and apply the appropriate flags.

It also has an extremely slow palignr instruction; create a flag for this to
avoid massive penalties on palignr-heavy functions.

Improve Atom function selection and document exactly what the SLOW_ATOM flag
covers.

Add Atom-optimized SATD/SA8D/hadamard_ac functions: simply combine the ssse3
optimizations with the sse2 algorithm to avoid pmaddubsw, which is slow on
Atom along with other SIMD multiplies.

Drop TBM detection; it'll probably never be useful for x264.

Invert FastShuffle to SlowShuffle; it only ever applied to one CPU (Conroe).

Detect CMOV, to fail more gracefully when run on a chip with MMX2 but no CMOV.
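CMOV availability is reported by CPUID leaf 1 in bit 15 of EDX (alongside MMX in bit 23 and SSE/SSE2 in bits 25/26). A minimal sketch of the check; the gating helper is hypothetical, and obtaining EDX is left to the caller:

```c
#include <stdint.h>

/* Feature bits in EDX returned by CPUID leaf 1. */
#define CPUID1_EDX_CMOV (1u << 15)
#define CPUID1_EDX_MMX  (1u << 23)
#define CPUID1_EDX_SSE  (1u << 25)
#define CPUID1_EDX_SSE2 (1u << 26)

static int has_cmov(uint32_t edx)
{
    return (edx & CPUID1_EDX_CMOV) != 0;
}

/* Hypothetical gate: only enable MMX-based code paths when cmov is also
 * reported, rather than hitting an illegal instruction later. */
static int mmx_paths_usable(uint32_t edx)
{
    return (edx & CPUID1_EDX_MMX) && has_cmov(edx);
}
```

With GCC or Clang on x86, EDX can be obtained with __get_cpuid(1, &eax, &ebx, &ecx, &edx) from <cpuid.h>.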

Oskar Arvidsson [Sat, 19 Jan 2013 00:47:09 +0000 (01:47 +0100)]
x86: combined SA8D/SATD dsp function

Speedup is most apparent for 8-bit (~30%), but gives some improvements
for 10-bit too (~12%).
64-bit only for now.
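SATD, one half of the combined function, is the sum of absolute values of the Hadamard-transformed difference block. A scalar 4x4 sketch for 8-bit pixels; the final halving follows x264's C reference convention as far as we know, and the function name is illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

/* 4x4 SATD: Hadamard-transform the difference block, then sum the absolute
 * values of the coefficients and halve the result. */
static int satd_4x4(const uint8_t *a, int stride_a,
                    const uint8_t *b, int stride_b)
{
    int d[4][4], t[4][4];
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = a[y * stride_a + x] - b[y * stride_b + x];

    /* Horizontal 4-point Hadamard on each row. */
    for (int y = 0; y < 4; y++) {
        int s01 = d[y][0] + d[y][1], d01 = d[y][0] - d[y][1];
        int s23 = d[y][2] + d[y][3], d23 = d[y][2] - d[y][3];
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = d01 + d23;
        t[y][3] = d01 - d23;
    }

    /* Vertical pass on each column, accumulating absolute values. */
    int sum = 0;
    for (int x = 0; x < 4; x++) {
        int s01 = t[0][x] + t[1][x], d01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], d23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum >> 1;
}
```

A uniform difference of 1 puts all its energy into the DC coefficient (16, halved to 8), while a single-pixel difference of 4 spreads across all 16 coefficients (64, halved to 32).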

Oskar Arvidsson [Tue, 29 Jan 2013 22:44:32 +0000 (23:44 +0100)]
x86: port SSE2+ SATD functions to high bit depth

Makes SATD 20-50% faster across all partition sizes but 4x4.

Oskar Arvidsson [Wed, 6 Feb 2013 01:07:53 +0000 (02:07 +0100)]
x86: faster high bit depth ssd

About 15% faster on average.

Fiona Glaser [Sat, 19 Jan 2013 06:55:46 +0000 (22:55 -0800)]
x86: optimize and clean up predictor checking
Branchlessly handle elimination of candidates in MMX roundclip asm.
Add a new asm function, similar to roundclip, except without the round part.
Optimize and organize the C code, and make both subme>=3 and subme<3 consistent.
Add lots of explanatory comments and try to make things a little more understandable.
~5-10% faster with subme>=3, ~15-20% faster with subme<3.
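The roundclip step rounds quarter-pel candidate vectors to full-pel, clips them to the search range and eliminates some candidates. A branchy scalar sketch, assuming the eliminated candidates are those equal to the full-pel predictor; the types and names are hypothetical:

```c
#include <stdint.h>

typedef struct { int16_t x, y; } mv_t;

static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Round quarter-pel candidates to full-pel with (+2) >> 2, clip each
 * component to [lo, hi], and drop candidates equal to pmv.
 * Returns the number of candidates written to dst. */
static int predictor_roundclip_ref(mv_t *dst, const mv_t *mvc, int n,
                                   mv_t lo, mv_t hi, mv_t pmv)
{
    int kept = 0;
    for (int i = 0; i < n; i++) {
        mv_t m = { (int16_t)clip3((mvc[i].x + 2) >> 2, lo.x, hi.x),
                   (int16_t)clip3((mvc[i].y + 2) >> 2, lo.y, hi.y) };
        if (m.x == pmv.x && m.y == pmv.y)
            continue;               /* duplicate of the predictor */
        dst[kept++] = m;
    }
    return kept;
}
```

The clip-only variant would be the same loop without the (+2) >> 2 rounding.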

Fiona Glaser [Tue, 22 Jan 2013 20:31:55 +0000 (12:31 -0800)]
Fix two bugs in predictor checking
pmv wasn't checked properly in some cases, nor was the zero vector.
Output-changing portion of the following patch.

Fiona Glaser [Thu, 10 Jan 2013 21:15:52 +0000 (13:15 -0800)]
Improve lookahead-threads auto selection
Smarter decision to improve fast-first-pass performance in 2-pass encodes.
Dramatically improves CPU utilization on multi-core systems.

Tested on a quad-core Ivy Bridge (12 threads, 1080p):
Fast first pass:
veryfast:     ~7% faster
faster:      ~11% faster
fast/medium: ~15% faster
slow/slower: ~42% faster
veryslow:    ~55% faster
CRF/1-pass:
veryfast:     ~9% faster
(all others remained the same)

Henrik Gramner [Sun, 27 Jan 2013 22:01:59 +0000 (23:01 +0100)]
x86: Use SSE instead of SSE2 for copying data

Reduces code size because movaps/movups is one byte shorter than movdqa/movdqu.
Also merge MMX and SSE versions of memcpy_aligned into a single macro.

Henrik Gramner [Sun, 13 Jan 2013 17:27:08 +0000 (18:27 +0100)]
64-bit cabac optimizations

~4% faster PIC

WIN64:
~3% faster and 16 bytes shorter cabac_encode_bypass
~8% faster cabac_encode_terminal
Benchmarked on Ivy Bridge

UNIX64:
One fewer instruction in cabac_encode_bypass

Mike Gorchak [Sun, 3 Feb 2013 07:35:00 +0000 (23:35 -0800)]
configure: add QNX support

Henrik Gramner [Sun, 20 Jan 2013 18:35:06 +0000 (19:35 +0100)]
Windows: Enable DEP and ASLR

Henrik Gramner [Thu, 17 Jan 2013 18:17:24 +0000 (19:17 +0100)]
x86inc: Set ELF hidden visibility for global constants

Diego Biurrun [Thu, 17 Jan 2013 10:18:31 +0000 (11:18 +0100)]
x86inc: Add cvisible macro for C functions with public prefix

This allows defining externally visible library symbols.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
Diego Biurrun [Thu, 17 Jan 2013 19:30:37 +0000 (11:30 -0800)]
x86inc: rename program_name to private_prefix
Synced from libav.
The new name is more descriptive and will allow defining a separate public
prefix for externally visible library symbols.

Fiona Glaser [Mon, 14 Jan 2013 13:35:30 +0000 (05:35 -0800)]
x264.h: improve x264_encoder_reconfig documentation

Henrik Gramner [Sat, 16 Feb 2013 18:36:50 +0000 (19:36 +0100)]
Cosmetics: stricter definition of parameterless functions

Neil [Mon, 28 Jan 2013 02:47:38 +0000 (10:47 +0800)]
Update "Install and compile x264" in doc/regression_test.txt

Anton Mitrofanov [Thu, 24 Jan 2013 08:11:26 +0000 (12:11 +0400)]
Fix possible non-determinism with mbtree + open-gop + sync-lookahead

Code assumed keyframe analysis would only pull one frame off the list; this
isn't true with open-gop.

Anton Mitrofanov [Mon, 25 Feb 2013 15:28:19 +0000 (19:28 +0400)]
x86: don't use the red zone on win64

Fiona Glaser [Mon, 11 Feb 2013 00:12:34 +0000 (16:12 -0800)]
x86-64: fix trellis asm with interlacing

Regression in r2145.
Assembly assumed array was [2][64] when it was actually [2][63].
Tiny (~0.1%) compression improvement.

Ronald S. Bultje [Wed, 30 Jan 2013 17:48:14 +0000 (09:48 -0800)]
x86-32: use simple nop codes for <= sse

The "CentaurHauls family 6 model 9 stepping 8" family of CPUs (flags:
fpu vme de pse tsc msr cx8 sep mtrr pge mov pat mmx fxsr sse up rng
rng_en ace ace_en) SIGILLs on long nop codes.

Loren Merritt [Tue, 8 Jan 2013 21:30:57 +0000 (21:30 +0000)]
Bump dates to 2013

Henrik Gramner [Mon, 17 Dec 2012 20:54:00 +0000 (21:54 +0100)]
x86inc: Drop tzcnt workaround

It is no longer needed now that we've bumped the version requirement of yasm to 1.2.0.

Fiona Glaser [Mon, 12 Nov 2012 18:28:53 +0000 (10:28 -0800)]
AVX2/FMA3 version of mbtree_propagate
First AVX2 function for testing.
Bump yasm version to 1.2.0 for AVX2 support.

Henrik Gramner [Tue, 11 Dec 2012 15:05:34 +0000 (16:05 +0100)]
x86inc: Use VEX-encoded instructions in AVX functions
Automatically use VEX-encoding in AVX/AVX2/XOP/FMA3/FMA4 functions for all instructions that exist in a VEX-encoded version.
This change makes it easier to extend existing code to use AVX2.
Also add support for AVX emulation of a few instructions that were missing before.

Loren Merritt [Sun, 2 Dec 2012 15:56:30 +0000 (15:56 +0000)]
x86inc: activate REP_RET automatically
Now RET checks whether it immediately follows a branch, so the programmer doesn't have to keep track of that condition.
REP_RET is still needed manually when it's a branch target, but that's much rarer.
The implementation involves lots of spurious labels, but that's ok because we strip them.

Ronald S. Bultje [Thu, 6 Dec 2012 23:40:13 +0000 (15:40 -0800)]
x86inc: support stack mem allocation and re-alignment in PROLOGUE
Use this in 8-bit loopfilter functions so they can be used if
there is no aligned stack (e.g. x86-32 MSVC or ICC 10.x).

Henrik Gramner [Mon, 17 Dec 2012 21:15:02 +0000 (22:15 +0100)]
Update config.guess and config.sub

Anton Mitrofanov [Tue, 8 Jan 2013 21:29:49 +0000 (13:29 -0800)]
Fix crash if the first frame is forced to a non-keyframe
This is obviously bad user input, but x264 shouldn't crash if it happens.

Bernhard Rosenkränzer [Sun, 30 Dec 2012 20:18:00 +0000 (12:18 -0800)]
Fix build on ARM with binutils >= 2.23.51.0.6
GAS doesn't seem to like spaces in vld1 anymore, so remove those.

Anton Mitrofanov [Fri, 23 Nov 2012 14:26:53 +0000 (18:26 +0400)]
Fix pthread_join emulation on win32 and BeOS
Doesn't actually affect x264, but it's more correct.

Fiona Glaser [Tue, 27 Nov 2012 15:50:51 +0000 (07:50 -0800)]
Fix typo in r2222
Slightly wrong numbers in level table.

Sergio Basto [Fri, 23 Nov 2012 02:02:50 +0000 (18:02 -0800)]
configure: fix gpac detection with -Wp,-D_FORTIFY_SOURCE=2

Sean McGovern [Fri, 23 Nov 2012 02:01:16 +0000 (18:01 -0800)]
Solaris: use sysconf to get processor count
Solaris responds correctly to the same value as Cygwin, so let's use that.
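A minimal sketch of a sysconf-based processor count, assuming the value in question is _SC_NPROCESSORS_ONLN (available on Linux, Solaris and Cygwin, among others); the helper name is hypothetical:

```c
#include <unistd.h>

/* Number of online processors, falling back to 1 if the query fails. */
static int cpu_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}
```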

Anton Khirnov [Tue, 13 Nov 2012 20:01:24 +0000 (21:01 +0100)]
lavf input: allocate AVFrame correctly
Allocate AVFrames correctly with avcodec_alloc_frame().
This caused crashes with newer libavcodecs that try to free frame extradata.

Anton Mitrofanov [Sat, 10 Nov 2012 23:44:02 +0000 (03:44 +0400)]
Fix crash when using libx264.dll compiled with ICL for X86_64