libx264
15 years agoFaster signed golomb coding
Fiona Glaser [Sat, 16 May 2009 03:07:59 +0000 (20:07 -0700)]
Faster signed golomb coding
3% faster CAVLC RDO and bitstream writing.
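For context, a minimal sketch of what signed Exp-Golomb coding computes, following the H.264 se(v) mapping (positive v maps to code number 2v-1, zero or negative v maps to -2v). The function names are hypothetical and this is the plain reference mapping, not the optimized routine from this commit; it assumes |v| is small enough that 2|v| fits in 32 bits.

```c
#include <stdint.h>

/* Map a signed value to its unsigned Exp-Golomb code number:
 * v > 0 -> 2v - 1, v <= 0 -> -2v. Assumes |v| < 2^30. */
static uint32_t se_to_code_num(int32_t v)
{
    return v > 0 ? (uint32_t)(2 * v - 1) : (uint32_t)(-2 * v);
}

/* Length in bits of the ue(v) codeword for code number k:
 * floor(log2(k+1)) leading zeros, followed by k+1 in binary. */
static int ue_size(uint32_t k)
{
    uint32_t t;
    int len = 0;
    for (t = k + 1; t > 1; t >>= 1)
        len++;                      /* len = floor(log2(k + 1)) */
    return 2 * len + 1;
}

/* Length in bits of the se(v) codeword for a signed value. */
static int se_size(int32_t v)
{
    return ue_size(se_to_code_num(v));
}
```

Small magnitudes get short codes: 0 takes one bit, +1 and -1 take three bits each.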

15 years agoFaster spatial direct MV prediction
Fiona Glaser [Thu, 14 May 2009 11:11:15 +0000 (04:11 -0700)]
Faster spatial direct MV prediction
unroll/tweak col_zero_flag

15 years agoMore CABAC and CAVLC optimizations
Fiona Glaser [Mon, 4 May 2009 11:19:28 +0000 (04:19 -0700)]
More CABAC and CAVLC optimizations
Simplified function calling for block_residual_write_(cabac|cavlc) and improved sigmap coding.
Tried making 0/1-bit specific versions of CABAC asm, but benefit was minimal under GCC 4.3.
Helped a decent bit under 3.4, but you shouldn't be using such old versions anyways.

15 years agoVarious optimizations in frametype lookahead
Fiona Glaser [Thu, 30 Apr 2009 05:54:52 +0000 (22:54 -0700)]
Various optimizations in frametype lookahead

15 years agoSome cosmetics/cleanup
Fiona Glaser [Mon, 27 Apr 2009 05:13:17 +0000 (22:13 -0700)]
Some cosmetics/cleanup
Move some macros to x86util.asm that should have been there to begin with.
Fix a typo that didn't cause any issues.

15 years agofix "incompatible types in initialization" compilation issues with GCC 4.3 (which...
Guillaume Poirier [Tue, 21 Apr 2009 21:18:44 +0000 (21:18 +0000)]
fix "incompatible types in initialization" compilation issues with GCC 4.3 (which is stricter than previous compiler version)

15 years agofix conversions between vectors with differing element types or numbers of subparts...
Guillaume Poirier [Tue, 21 Apr 2009 15:32:21 +0000 (17:32 +0200)]
fix conversions between vectors with differing element types or numbers of subparts errors

15 years agoAdd "coded blocks" stat to output information.
Fiona Glaser [Sat, 18 Apr 2009 23:07:53 +0000 (16:07 -0700)]
Add "coded blocks" stat to output information.
This measures the total percentage of blocks, intra and inter, which have nonzero coefficients.
"y,uvAC,uvDC" refers to luma, chroma DC, and chroma AC blocks.
Note that skip blocks are included in this stat.

15 years agoEnable asm predict_8x8_filter
Fiona Glaser [Sat, 18 Apr 2009 06:38:29 +0000 (23:38 -0700)]
Enable asm predict_8x8_filter
I'm not entirely sure how this snuck its way out of holger's intra pred patch.

15 years agoRemove various bits of dead code found by CLANG.
Fiona Glaser [Fri, 17 Apr 2009 13:00:39 +0000 (06:00 -0700)]
Remove various bits of dead code found by CLANG.

15 years agoSlightly faster SSE4 SA8D, SSE4 Hadamard_AC, SSE2 SSIM
Fiona Glaser [Tue, 14 Apr 2009 21:47:02 +0000 (14:47 -0700)]
Slightly faster SSE4 SA8D, SSE4 Hadamard_AC, SSE2 SSIM
shufps is the most underrated SSE instruction on x86.

15 years agoVarious CABAC optimizations
Fiona Glaser [Thu, 9 Apr 2009 09:14:41 +0000 (02:14 -0700)]
Various CABAC optimizations
Move calculation of b_intra out of the core residual loop and hardcode it where applicable.
Inlining cabac_mb_mvd was unnecessary and wasted tremendous amounts of code size.  Inlining only cache_mvd is faster and significantly smaller.

15 years agoCAVLC optimizations
Fiona Glaser [Wed, 8 Apr 2009 12:45:03 +0000 (05:45 -0700)]
CAVLC optimizations
faster bs_write_te, port CABAC context selection optimization to CAVLC.

15 years agoFaster CABAC RDO
Fiona Glaser [Sun, 5 Apr 2009 20:01:42 +0000 (13:01 -0700)]
Faster CABAC RDO
Since the bypass case is quite unlikely, especially when doing merged sigmap/level coding,
it's faster to use a branch than a cmov.

15 years agoActivate intra_sad_x3_8x8c in lookahead
Fiona Glaser [Tue, 31 Mar 2009 17:36:57 +0000 (10:36 -0700)]
Activate intra_sad_x3_8x8c in lookahead

15 years agoMBAFF interlaced coding is not allowed in baseline profile
Fiona Glaser [Tue, 31 Mar 2009 17:34:35 +0000 (10:34 -0700)]
MBAFF interlaced coding is not allowed in baseline profile

15 years agointra_sad_x3_8x8 assembly
Fiona Glaser [Tue, 31 Mar 2009 02:30:59 +0000 (19:30 -0700)]
intra_sad_x3_8x8 assembly

15 years agointra_sad_x3_4x4 assembly
Fiona Glaser [Mon, 30 Mar 2009 23:37:46 +0000 (16:37 -0700)]
intra_sad_x3_4x4 assembly

15 years agointra_sad_x3_8x8c assembly
Fiona Glaser [Mon, 30 Mar 2009 11:07:50 +0000 (04:07 -0700)]
intra_sad_x3_8x8c assembly
Also fix intra_sad_x3_16x16's use of "n" as a loop variable (broke SWAP)

15 years agoShave one instruction off CABAC encode_decision
Fiona Glaser [Mon, 30 Mar 2009 01:27:32 +0000 (18:27 -0700)]
Shave one instruction off CABAC encode_decision
range_lps>>6 ranges from 4-7, so (range_lps>>6)-4 == (range_lps>>6) & 3
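The identity can be checked exhaustively. The sketch below assumes only what the message states, namely that the shifted value lies in 4..7 (inputs 256..511); the function names are hypothetical and are not taken from the x264 source.

```c
#include <assert.h>

/* Index computed the original way: subtract the lower bound. */
static int index_by_subtract(int range_lps)
{
    assert((range_lps >> 6) >= 4 && (range_lps >> 6) <= 7);
    return (range_lps >> 6) - 4;
}

/* Index computed by masking: 4..7 is 100b..111b, so the low two bits
 * already hold the value minus 4. */
static int index_by_mask(int range_lps)
{
    return (range_lps >> 6) & 3;
}

/* Count inputs in 256..511 for which the two forms disagree. */
static int count_mismatches(void)
{
    int r, bad = 0;
    for (r = 256; r <= 511; r++)
        bad += index_by_subtract(r) != index_by_mask(r);
    return bad;
}
```

The mask form is useful when the AND can be folded into an addressing or table-lookup step that is needed anyway.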

15 years agoFaster probe_skip
Fiona Glaser [Fri, 27 Mar 2009 05:22:23 +0000 (22:22 -0700)]
Faster probe_skip
Add a second chroma threshold after the DC transform.

15 years agoAdd missing "static" qualifier to two arrays
Fiona Glaser [Thu, 19 Mar 2009 19:28:21 +0000 (12:28 -0700)]
Add missing "static" qualifier to two arrays
Should slightly improve performance.
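A small illustration of why the qualifier can matter. Without "static", a local const array has automatic storage, so a compiler may rebuild it on the stack at every call; with "static" it lives in read-only data and is initialized once. The table contents and function names below are made up for the example.

```c
/* Automatic storage: the initializer may be copied to the stack per call. */
static int scale_auto(int i)
{
    const int table[4] = { 10, 13, 16, 18 };
    return table[i & 3];
}

/* Static storage: a single read-only copy shared by all calls. */
static int scale_static(int i)
{
    static const int table[4] = { 10, 13, 16, 18 };
    return table[i & 3];
}
```

Both functions return the same values; only the storage of the table differs, and an optimizing compiler may already remove the difference in simple cases like this one.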

15 years agoSSE2 zigzag_interleave
Fiona Glaser [Tue, 17 Mar 2009 18:01:57 +0000 (11:01 -0700)]
SSE2 zigzag_interleave
Replace PHADD with FastShuffle (more accurate naming).
This flag represents asm functions that rely on fast SSE2 shuffle units, and thus are only faster on Phenom, Nehalem, and Penryn CPUs.

15 years agoFaster integral_init
Fiona Glaser [Tue, 10 Mar 2009 06:37:53 +0000 (23:37 -0700)]
Faster integral_init
palignr to avoid unaligned loads is worth it in inith, but not initv.

15 years agoFaster SSSE3 hpel_filter_v
Holger Lubitz [Mon, 9 Mar 2009 21:05:16 +0000 (14:05 -0700)]
Faster SSSE3 hpel_filter_v
~10% faster hpel_filter on 64-bit Penryn.
32-bit version by Fiona Glaser.

15 years agoFaster SSE2 pixel_var
Fiona Glaser [Sun, 8 Mar 2009 00:43:09 +0000 (16:43 -0800)]
Faster SSE2 pixel_var
Optimized using the DEINTB method from r1122.  ~32% faster var_16x16 on Conroe.

15 years agoSSSE3 hpel_filter_v
Fiona Glaser [Sat, 7 Mar 2009 08:27:27 +0000 (00:27 -0800)]
SSSE3 hpel_filter_v
Optimized using the same method as in r1122.  Patch partially by Holger.
~8% faster hpel filter on 64-bit Nehalem

15 years agoUpdate some asm copyright headers
Fiona Glaser [Sat, 7 Mar 2009 02:57:15 +0000 (18:57 -0800)]
Update some asm copyright headers

15 years agoVastly faster SATD/SA8D/Hadamard_AC/SSD/DCT/IDCT
Holger Lubitz [Sat, 7 Mar 2009 02:16:30 +0000 (18:16 -0800)]
Vastly faster SATD/SA8D/Hadamard_AC/SSD/DCT/IDCT
Heavily optimized for Core 2 and Nehalem, but performance should improve on all modern x86 CPUs.
16x16 SATD: +18% speed on K8(64bit), +22% on K10(32bit), +42% on Penryn(64bit), +44% on Nehalem(64bit), +50% on P4(32bit), +98% on Conroe(64bit)
Similar performance boosts in SATD-like functions (SA8D, hadamard_ac) and somewhat less in DCT/IDCT/SSD.
Overall performance boost is up to ~15% on 64-bit Conroe.

15 years agoUpdate x264 copyright date
Fiona Glaser [Fri, 6 Mar 2009 23:28:47 +0000 (15:28 -0800)]
Update x264 copyright date

15 years agoRemove pre-scenecut from fprofile commands as well
Fiona Glaser [Wed, 4 Mar 2009 11:16:06 +0000 (03:16 -0800)]
Remove pre-scenecut from fprofile commands as well
Also add psy-trellis to fprofile

15 years agoSlightly faster 8x16 SAD on Penryn Core 2
Fiona Glaser [Wed, 4 Mar 2009 00:21:52 +0000 (16:21 -0800)]
Slightly faster 8x16 SAD on Penryn Core 2
Same as MMX 8x16 cacheline SAD, but calls SSE2 8x16 SAD in non-cacheline case.
Only Nehalem benefits from sizes smaller than 8x16, and Nehalem doesn't use cacheline functions, so no smaller versions are included.

15 years agoFix scenecut and VBV with videos of width/height <= 32
Fiona Glaser [Fri, 27 Feb 2009 03:50:09 +0000 (19:50 -0800)]
Fix scenecut and VBV with videos of width/height <= 32
Also remove an unused variable

15 years agoRemove non-pre scenecut
Fiona Glaser [Thu, 26 Feb 2009 22:29:50 +0000 (14:29 -0800)]
Remove non-pre scenecut
Add support for no-b-adapt + pre-scenecut (patch by BugMaster)
Pre-scenecut was generally better than regular scenecut in terms of accuracy and regular scenecut didn't work in threaded mode anyways.
Add no-scenecut option (scenecut=0 is now no scenecut; previously it was -1)
Fix an incorrect bias towards P-frames near scenecuts with B-adapt 2.
Simplify pre-scenecut code.

15 years agoAdd AltiVec version of hadamard_ac. 2.4x faster than the C version.
Guillaume Poirier [Tue, 3 Mar 2009 15:44:18 +0000 (07:44 -0800)]
Add AltiVec version of hadamard_ac. 2.4x faster than the C version.
Note that this implementation is pretty naive and should be improved
by implementing what's discussed in this ML thread:
date: Mon, Feb 2, 2009 at 6:58 PM
subject: Re: [x264-devel] [PATCH] AltiVec implementation of hadamard_ac routines

15 years agoFix regression in r1085
Fiona Glaser [Thu, 26 Feb 2009 20:07:56 +0000 (12:07 -0800)]
Fix regression in r1085
Deblocking was very slightly incorrect with partitions=all.
Bug found by BugMaster.

15 years agoOptimize neighbor CBP calculation and fix related regression
Fiona Glaser [Mon, 16 Feb 2009 13:56:12 +0000 (05:56 -0800)]
Optimize neighbor CBP calculation and fix related regression
r1105 introduced array overflow in cbp handling

15 years agoShow FPS when importing a raw YUV file
Tal Aloni [Sat, 14 Feb 2009 00:30:14 +0000 (16:30 -0800)]
Show FPS when importing a raw YUV file

15 years agoWindows 64-bit support
Anton Mitrofanov [Wed, 11 Feb 2009 18:38:56 +0000 (10:38 -0800)]
Windows 64-bit support
A "make distclean" is probably required after updating to this revision.

15 years agoMinor fixes and cosmetics
Fiona Glaser [Wed, 11 Feb 2009 18:35:56 +0000 (10:35 -0800)]
Minor fixes and cosmetics
Suppress a GCC warning, fix a non-problematic array overflow, one REP->REP_RET.

15 years agofix 10l in 75b495f2723fcb77f
Manuel Rommel [Tue, 10 Feb 2009 20:06:47 +0000 (12:06 -0800)]
fix 10l in 75b495f2723fcb77f
Original thread:
date: Mon, Feb 9, 2009 at 9:37 PM
subject: [x264-devel] commit: Spare a vec_perm and a vec_mergeh though using a LUT of permutation vectors . (Guillaume Poirier )

15 years agoSpare a vec_perm and a vec_mergeh through using a LUT of permutation vectors.
Guillaume Poirier [Mon, 9 Feb 2009 20:17:33 +0000 (21:17 +0100)]
Spare a vec_perm and a vec_mergeh through using a LUT of permutation vectors.

15 years agoPromote chroma planes to 16 byte alignment.
Guillaume Poirier [Mon, 9 Feb 2009 20:12:23 +0000 (21:12 +0100)]
Promote chroma planes to 16 byte alignment.
This will allow simplifying vector loads on architectures that can only load
16-byte-aligned data (such as AltiVec).

15 years agoFix 10L in intra pred
Fiona Glaser [Mon, 9 Feb 2009 19:30:54 +0000 (11:30 -0800)]
Fix 10L in intra pred
Forgetting a %define resulted in SIGILL on 32-bit systems without SSE (e.g. Athlon XP).

15 years agoAdd decimation in i16x16 blocks
Fiona Glaser [Mon, 9 Feb 2009 07:36:40 +0000 (23:36 -0800)]
Add decimation in i16x16 blocks
Up to +0.04db with CAVLC, generally a lot less with CABAC.

15 years agoMuch faster CABAC residual context selection
Fiona Glaser [Sat, 7 Feb 2009 10:27:16 +0000 (02:27 -0800)]
Much faster CABAC residual context selection
Up to ~17% faster CABAC RDO, ~36% faster intra-only CABAC RDO.
Up to 7% faster overall in extreme cases.

15 years agoFaster coeff_last64 on 32-bit
Fiona Glaser [Sat, 7 Feb 2009 09:57:43 +0000 (01:57 -0800)]
Faster coeff_last64 on 32-bit

15 years agoMore intra pred asm optimizations
Fiona Glaser [Fri, 6 Feb 2009 10:59:36 +0000 (02:59 -0800)]
More intra pred asm optimizations
SSSE3 version of predict_8x8_hu
SSE2 version of predict_8x8c_p
SSSE3 versions of both planar prediction functions
Optimizations to predict_16x16_p_sse2
Some unnecessary REP_RETs -> RETs.
SSE2 version of predict_8x8_vr by Holger.
SSE2 version of predict_8x8_hd.
Don't compile MMX versions of some of the pred functions on x86_64.
Remove now-useless x86_64 C versions of 4x4 pred functions.
Rewrite some of the x86_64-only C functions in asm.

15 years agoSpeed-up mc_chroma_altivec by using vec_mladd cleverly, and unrolling.
Manuel Rommel [Sun, 8 Feb 2009 20:35:51 +0000 (21:35 +0100)]
Speed-up mc_chroma_altivec by using vec_mladd cleverly, and unrolling.
Also put width == 2 variant in its own scalar function because it's faster
than a vectorized one.

15 years agoMerging Holger's GSOC branch part 2: intra prediction
Holger Lubitz [Wed, 4 Feb 2009 20:46:17 +0000 (12:46 -0800)]
Merging Holger's GSOC branch part 2: intra prediction
Assembly versions of most remaining 4x4 and 8x8 intra pred functions.
Assembly version of predict_8x8_filter.
A few other optimizations.
Primarily Core 2-optimized.

15 years ago10l: fix compilation with GCC 4.3+
Guillaume Poirier [Wed, 4 Feb 2009 10:04:55 +0000 (10:04 +0000)]
10l: fix compilation with GCC 4.3+

15 years agoFaster 8x8dct+CAVLC interleave
Fiona Glaser [Sat, 31 Jan 2009 13:00:39 +0000 (05:00 -0800)]
Faster 8x8dct+CAVLC interleave
Integrate array_non_zero with the CAVLC 8x8dct interleave function.
Roughly 1.5-2x faster than the original separate array_non_zero method.

15 years agoMeasure CBP cost in i8x8 RD refinement
Fiona Glaser [Sat, 31 Jan 2009 09:00:26 +0000 (01:00 -0800)]
Measure CBP cost in i8x8 RD refinement
~0.02-0.05db PSNR gain at high quants in intra-only encoding, pretty small otherwise.
Allows a small optimization in i8x8 encoding.

15 years agoTake advantage of saturated signed horizontal sum instructions in
Guillaume Poirier [Sun, 1 Feb 2009 19:58:00 +0000 (20:58 +0100)]
Take advantage of saturated signed horizontal sum instructions in
the variance computation epilogue, since the sums there cannot overflow
and so saturation never triggers.
Suggested by Loren Merritt

15 years agoMassive overhaul of nnz/cbp calculation
Fiona Glaser [Fri, 30 Jan 2009 11:40:54 +0000 (03:40 -0800)]
Massive overhaul of nnz/cbp calculation
Modify quantization to also calculate array_non_zero.
PPC assembly changes by gpoirier.
New quant asm includes some small tweaks to quant and SSE4 versions using ptest for the array_non_zero.
Use this new feature of quant to merge nnz/cbp calculation directly with encoding and avoid many unnecessary calls to dequant/zigzag/decimate/etc.
Also add new i16x16 DC-only iDCT with asm.
Since intra encoding now directly calculates nnz, skip_intra now backs up nnz/cbp as well.
Output should be equivalent except when using p4x4+RDO because of a subtlety involving old nnz values lying around.
Performance increase in macroblock_encode: ~18% with dct-decimate, 30% without at CRF 25.
Overall performance increase 0-6% depending on encoding settings.
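A scalar sketch of the idea behind "modify quantization to also calculate array_non_zero": the quantizer ORs every output coefficient into an accumulator and returns whether anything survived, so the caller needs no separate scan of the block. The function name, signature, and 16.16 fixed-point layout are assumptions for illustration and are not copied from the x264 source; mf and bias are assumed small enough that the 32-bit product does not wrap.

```c
#include <stdint.h>

/* Quantize 16 coefficients in place.
 * Returns 1 if any quantized coefficient is nonzero, 0 otherwise. */
static int quant_4x4_nz(int16_t dct[16], const uint16_t mf[16],
                        const uint16_t bias[16])
{
    int i, nz = 0;
    for (i = 0; i < 16; i++) {
        int coef = dct[i];
        if (coef > 0)
            coef = (int)(((uint32_t)(bias[i] + coef) * mf[i]) >> 16);
        else
            coef = -(int)(((uint32_t)(bias[i] - coef) * mf[i]) >> 16);
        dct[i] = (int16_t)coef;
        nz |= coef;             /* accumulate "anything nonzero?" for free */
    }
    return !!nz;
}
```

A caller can then skip zigzag, dequant and the inverse transform for a block as soon as the return value is 0.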

15 years agoAdd PowerPC support for "checkasm --bench", reading the time base register.
Guillaume Poirier [Thu, 29 Jan 2009 09:28:12 +0000 (01:28 -0800)]
Add PowerPC support for "checkasm --bench", reading the time base register.
This isn't ideal since the `time base' register is running at a fraction
of the processor cycle speed, so the measurement isn't as precise as x86's
rdtsc.
It's better than nothing though...

15 years agofix detection of pthread and isfinite on OpenBSD
Brad Smith [Thu, 29 Jan 2009 04:35:34 +0000 (04:35 +0000)]
fix detection of pthread and isfinite on OpenBSD

15 years agoremove $ECHON kludge, which broke on SunOS. bring back `gcc -MT`.
Loren Merritt [Tue, 27 Jan 2009 05:42:51 +0000 (05:42 +0000)]
remove $ECHON kludge, which broke on SunOS. bring back `gcc -MT`.
remove auto-reconfigure on svn update, which has done nothing since we stopped using svn.
fix $AS on sparc (was disabled by mmx check).
fix --extra-asflags (was ignored).
mark bash scripts as bash, not sh

patch partly by Greg Robinson and Jugdish.

15 years ago1.6x faster satd_c (and sa8d and hadamard_ac) with pseudo-simd.
Loren Merritt [Mon, 26 Jan 2009 14:28:48 +0000 (14:28 +0000)]
1.6x faster satd_c (and sa8d and hadamard_ac) with pseudo-simd.
60KB smaller binary.
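"Pseudo-SIMD" here presumably means packing two narrow values into one ordinary integer so that a single add or subtract processes both. The sketch below shows that general technique with two 16-bit lanes in a 32-bit word and one 4-point Hadamard butterfly; all names are hypothetical, it is not the code from this commit, and it assumes every intermediate value fits in a signed 16-bit lane.

```c
#include <stdint.h>

/* Pack two small signed values: lo in bits 0-15, hi in bits 16-31.
 * Packing is linear modulo 2^32, so adds and subtracts act per lane. */
static uint32_t pack2(int lo, int hi)
{
    return (uint32_t)lo + ((uint32_t)hi << 16);
}

/* Sign-extend the low 16 bits of x. */
static int sext16(uint32_t x)
{
    x &= 0xffff;
    return x >= 0x8000 ? (int)x - 0x10000 : (int)x;
}

static int unpack_lo(uint32_t v)
{
    return sext16(v);
}

/* Remove the low lane first so its borrow does not leak into the high lane. */
static int unpack_hi(uint32_t v)
{
    return sext16((v - (uint32_t)unpack_lo(v)) >> 16);
}

/* One 4-point Hadamard butterfly applied to both lanes at once. */
static void hadamard4_x2(uint32_t d[4])
{
    uint32_t t0 = d[0] + d[1], t1 = d[0] - d[1];
    uint32_t t2 = d[2] + d[3], t3 = d[2] - d[3];
    d[0] = t0 + t2;
    d[1] = t1 + t3;
    d[2] = t0 - t2;
    d[3] = t1 - t3;
}
```

Each arithmetic operation handles two columns, which roughly halves the work of the scalar transform without any vector instructions.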

15 years agoHack around a potential failure point in VBV
Fiona Glaser [Wed, 28 Jan 2009 07:27:56 +0000 (23:27 -0800)]
Hack around a potential failure point in VBV
pred_b_from_p can become absurdly large in static scenes, leading to rare collapses of quality with VBV+B-frames+threads.
This isn't a final fix, but should resolve the problem in most cases in the meantime.

15 years agoMuch faster chroma encoding and other opts
Fiona Glaser [Tue, 27 Jan 2009 07:43:25 +0000 (23:43 -0800)]
Much faster chroma encoding and other opts
~15% faster chroma encode by reorganizing CBP calculation and adding special-case idct_dc function, since most coded chroma blocks are DC-only.
Small optimization in cache_save (skip_bp)
Fix array_non_zero to not violate strict aliasing (should eliminate miscompilation issues in the future)
Add in automatic substitutions for some asm instructions that have an equivalent smaller representation.
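On the strict-aliasing point: reading an int16_t array through a uint64_t pointer is undefined behaviour in C, which is how a compiler can legally miscompile such a check. One portable way to keep the wide comparison is to go through memcpy, as sketched below. The function name is hypothetical, and the actual fix in this commit may use a different mechanism (for example a union or a compiler attribute).

```c
#include <stdint.h>
#include <string.h>

/* Return 1 if any of the 16 coefficients is nonzero. */
static int array_non_zero16(const int16_t coef[16])
{
    uint64_t w[4];
    /* Well-defined, unlike *(const uint64_t *)coef; compilers typically
     * turn this fixed-size copy into plain 64-bit loads. */
    memcpy(w, coef, sizeof w);
    return (w[0] | w[1] | w[2] | w[3]) != 0;
}
```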

15 years agoadd AltiVec implementation of x264_mc_copy_w16_aligned
Guillaume Poirier [Mon, 26 Jan 2009 14:28:23 +0000 (06:28 -0800)]
add AltiVec implementation of x264_mc_copy_w16_aligned

15 years agoadd AltiVec implementation of x264_pixel_var_16x16 and x264_pixel_var_8x8
Guillaume Poirier [Fri, 23 Jan 2009 21:53:06 +0000 (13:53 -0800)]
add AltiVec implementation of x264_pixel_var_16x16 and x264_pixel_var_8x8

15 years agoadd AltiVec 16 <-> 32 bits conversions macros
Guillaume Poirier [Fri, 23 Jan 2009 09:11:20 +0000 (01:11 -0800)]
add AltiVec 16 <-> 32 bits conversions macros

16 years agoReplace 16x16=>32 mul + pack + add by a simple 16x16=>16 multiply-add.
Guillaume Poirier [Mon, 19 Jan 2009 20:29:27 +0000 (21:29 +0100)]
Replace 16x16=>32 mul + pack + add by a simple 16x16=>16 multiply-add.
Suggested by Loren.

16 years agoEliminate support for direct_8x8_inference=0
Fiona Glaser [Mon, 19 Jan 2009 23:17:53 +0000 (15:17 -0800)]
Eliminate support for direct_8x8_inference=0
The benefit in the most extreme contrived situation was at most 0.001db PSNR, at the cost of slower decoding.
As this option was basically useless, it was a waste of code and prevented some other useful optimizations.
Remove some unused mc code related to sub-8x8 partitions.
Small deblocking speedup when p4x4 is used.
Also remove unused x264_nal_decode prototype from x264.h.

16 years agoAdd AltiVec and CPU numbers detection on OpenBSD.
Brad Smith [Mon, 19 Jan 2009 13:14:53 +0000 (05:14 -0800)]
Add AltiVec and CPU numbers detection on OpenBSD.

16 years agoAdd AltiVec implementation of predict_8x8c_p. 2.6x faster than scalar C.
Guillaume Poirier [Sun, 18 Jan 2009 21:44:14 +0000 (22:44 +0100)]
Add AltiVec implementation of predict_8x8c_p. 2.6x faster than scalar C.

16 years agoWarn if direct auto wasn't set on the first pass
Fiona Glaser [Sat, 17 Jan 2009 20:16:37 +0000 (15:16 -0500)]
Warn if direct auto wasn't set on the first pass
And, if it wasn't, run direct auto as if it was the first pass, rather than simply forcing temporal direct mode on all frames.
Also a small tweak to coeff_level_run asm.

16 years agoChanges the PowerPC ppccommon.h header so it no longer checks for a particular
Brad Smith [Sat, 17 Jan 2009 12:52:28 +0000 (12:52 +0000)]
Changes the PowerPC ppccommon.h header so it no longer checks for a particular
OS such as Linux but instead looks for HAVE_ALTIVEC_H being set.
Fixes all *BSD/PowerPC builds.

16 years agoupdate x264_hpel_filter_altivec's prototype to match the one of the C version.
Guillaume Poirier [Wed, 14 Jan 2009 20:56:31 +0000 (21:56 +0100)]
update x264_hpel_filter_altivec's prototype to match the one of the C version.
It changed in commit 045ae4045a1827555b3eaab4fbf3c9809e98c58f (factorization of mallocs)
(NB: Altivec implementation wasn't allocating and writing to any scratch memory.)

16 years agorename vector+array unions to closer match the vector typedefs names.
Guillaume Poirier [Wed, 14 Jan 2009 20:49:42 +0000 (21:49 +0100)]
rename vector+array unions to closer match the vector typedefs names.

16 years agoAdd Altivec implementation of all the remaining 16x16 predict routines.
Guillaume Poirier [Wed, 14 Jan 2009 20:13:58 +0000 (21:13 +0100)]
Add Altivec implementation of all the remaining 16x16 predict routines.

16 years agoCache ref costs and use more accurate MV costs
Fiona Glaser [Wed, 14 Jan 2009 02:11:50 +0000 (21:11 -0500)]
Cache ref costs and use more accurate MV costs
New MV costs should improve quality slightly by improving the smoothness of the field of MV costs (and they're closer to CABAC's actual costs).
Despite being optimized for CABAC, they still help under CAVLC, albeit less.
MV cost change by Loren Merritt

16 years agoSupport forced frametypes with scenecut/b-adapt
Fiona Glaser [Wed, 14 Jan 2009 01:22:36 +0000 (20:22 -0500)]
Support forced frametypes with scenecut/b-adapt
This allows an input qpfile to be used to force I-frames, for example.
The same can be done through the library interface.
Document the format of the qpfile in --longhelp and the forcing of frametypes in x264.h
Note that forcing B-frames and B-refs may not always have the intended result.
Patch partially by Steven Walters <kemuri9@gmail.com>.

16 years agoRemove an IDIV from i8x8 analysis
Fiona Glaser [Wed, 14 Jan 2009 00:58:44 +0000 (19:58 -0500)]
Remove an IDIV from i8x8 analysis
Only one IDIV is left in macroblock level code (transform_rd)

16 years agoFix regression in r1066
Fiona Glaser [Thu, 8 Jan 2009 20:07:16 +0000 (15:07 -0500)]
Fix regression in r1066
With some combinations of video width and other settings, the scratch buffer was slightly too small.
This caused heap corruption on some systems.
Also prevent merange from being raised during encoding with esa/tesa through encoder_reconfig, as this no longer works.

16 years agoDisable B-frames in lossless mode
Fiona Glaser [Tue, 6 Jan 2009 21:55:44 +0000 (16:55 -0500)]
Disable B-frames in lossless mode
They hurt compression anyways, and direct auto was bugged with lossless.

16 years agoFactorize in ppccommon.h the conditional inclusion of altivec.h on Linux systems.
Brad Smith [Mon, 5 Jan 2009 22:53:11 +0000 (22:53 +0000)]
Factorize in ppccommon.h the conditional inclusion of altivec.h on Linux systems.

16 years agoDisable __builtin_clz() intrinsic on gcc versions prior to 3.4.
Brad Smith [Mon, 5 Jan 2009 20:58:32 +0000 (15:58 -0500)]
Disable __builtin_clz() intrinsic on gcc versions prior to 3.4.
The function did not exist before that version.

16 years agoSmall tweaks to coeff asm
Fiona Glaser [Fri, 2 Jan 2009 02:44:00 +0000 (21:44 -0500)]
Small tweaks to coeff asm
Factor out a few redundant pxors
Related cosmetics

16 years agoUse the correct strtok under MSVC
Steven Walters [Wed, 31 Dec 2008 03:20:37 +0000 (22:20 -0500)]
Use the correct strtok under MSVC
Also change one malloc -> x264_malloc

16 years agoAdd stack alignment for lookahead functions
Fiona Glaser [Wed, 31 Dec 2008 03:14:45 +0000 (22:14 -0500)]
Add stack alignment for lookahead functions
Should allow libx264 to be called from non-gcc-compiled applications without adding force_align_arg_pointer.

16 years agoAdd support for SSE4a (Phenom) LZCNT instruction
Fiona Glaser [Wed, 31 Dec 2008 01:47:45 +0000 (20:47 -0500)]
Add support for SSE4a (Phenom) LZCNT instruction
Significantly speeds up coeff_last and coeff_level_run on Phenom CPUs for faster CAVLC and CABAC.
Also a small tweak to coeff_level_run asm.

16 years agofactor mallocs out of hpel, ssim, and esa.
Steven Walters [Mon, 29 Dec 2008 05:14:26 +0000 (05:14 +0000)]
factor mallocs out of hpel, ssim, and esa.
there should now be no memory allocation outside of init-time.

16 years agoMuch faster CAVLC RDO and bitstream writing
Fiona Glaser [Tue, 30 Dec 2008 03:12:17 +0000 (03:12 +0000)]
Much faster CAVLC RDO and bitstream writing
Pure asm version of level/run coding.  Over 2x faster than C.
Up to 40% faster CAVLC RDO.  Overall benefit up to ~7.5% with RDO or ~5% with fast encoding settings.

16 years agoCosmetics: cleaner syntax for defining temporary registers in asm
Loren Merritt [Tue, 30 Dec 2008 02:52:25 +0000 (21:52 -0500)]
Cosmetics: cleaner syntax for defining temporary registers in asm
Globally define t#[qdwb], so that only t# needs to be locally defined when reorganizing registers

16 years agoMuch faster CABAC RDO
Fiona Glaser [Sun, 28 Dec 2008 02:36:14 +0000 (21:36 -0500)]
Much faster CABAC RDO
Since RDO doesn't care about what order bit costs are calculated, merge sigmap and level coding into the same loop in RDO.
This is bit-exact for 4x4dct but slightly incorrect for 8x8dct due to the sigmap containing duplicated contexts.
However, the PSNR penalty of this is extremely small (~0.001db).
Speed benefit is about 15% in 4x4dct and 30% in 8x8dct residual bit cost calculation at QP20.
Overall encoding speed benefit is up to 5%, depending on encoding settings.
Also remove an old unnecessary CABAC table that hasn't been used for years.

16 years agoVLC table optimizations
Fiona Glaser [Fri, 26 Dec 2008 12:35:49 +0000 (07:35 -0500)]
VLC table optimizations
Slightly reorganize VLC tables for ~2% faster block_residual_write_cavlc.
Also a small optimization in p8x8 CAVLC.

16 years agoFix crash in --me esa/tesa introduced in r1058
Loren Merritt [Thu, 25 Dec 2008 03:58:17 +0000 (22:58 -0500)]
Fix crash in --me esa/tesa introduced in r1058
Also suppress the last mingw warning message

16 years agoOptimize variance asm + minor changes
Fiona Glaser [Wed, 24 Dec 2008 03:33:28 +0000 (22:33 -0500)]
Optimize variance asm + minor changes
Remove SAD argument from var, not needed anymore.
Speed up var asm a bit by eliminating psadbw and instead HADDWing at end.
Eliminate all remaining warnings on gcc 3.4 on cygwin
Port another minor optimization from lavc (pskip)

16 years agoMinor CABAC cleanups and related optimizations
Fiona Glaser [Tue, 23 Dec 2008 23:31:48 +0000 (18:31 -0500)]
Minor CABAC cleanups and related optimizations
Merge the two list tables to allow cleaner MC/CABAC/CAVLC code
Remove lots of unnecessary {s
Port some very minor opts from lavc

16 years agofaster ESA init
Loren Merritt [Thu, 11 Dec 2008 19:47:17 +0000 (19:47 +0000)]
faster ESA init
reduce memory if using ESA and not p4x4

16 years agoMore macroblock_cache optimizations
Fiona Glaser [Tue, 16 Dec 2008 07:02:49 +0000 (23:02 -0800)]
More macroblock_cache optimizations
Patch partially by Loren Merritt

16 years agoFaster macroblock_cache_rect
Fiona Glaser [Mon, 15 Dec 2008 21:15:29 +0000 (13:15 -0800)]
Faster macroblock_cache_rect
Explicit loop unrolling

16 years agoOptimizations in predict_mv_direct
Fiona Glaser [Mon, 15 Dec 2008 02:30:51 +0000 (18:30 -0800)]
Optimizations in predict_mv_direct
Add some early terminations and minor optimizations
This change may also fix the extremely rare direct+threading MV bug.

16 years agoFix visual corruption when picture width was not mod 32.
David Wolstencroft [Sun, 14 Dec 2008 10:47:28 +0000 (10:47 +0000)]
Fix visual corruption when picture width was not mod 32.
The previous Altivec implementation of mc_chroma assumed that i_src_stride was always mod 16.

16 years agoAdd support for FSF GCC version >= 4.3 on OSX.
Guillaume Poirier [Mon, 8 Dec 2008 20:11:45 +0000 (21:11 +0100)]
Add support for FSF GCC version >= 4.3 on OSX.
So far, only Apple GCC version was supported.

16 years agoMore accurate refcost for p8x8 CAVLC
Fiona Glaser [Fri, 12 Dec 2008 01:31:52 +0000 (17:31 -0800)]
More accurate refcost for p8x8 CAVLC
Slightly better quality, especially in non-RD mode, with CAVLC.

16 years agouse lookup tables instead of actual exp/pow for AQ
Loren Merritt [Thu, 11 Dec 2008 04:54:17 +0000 (20:54 -0800)]
use lookup tables instead of actual exp/pow for AQ
Significant speed boost, especially on CPUs with atrociously slow floating point units (e.g. Pentium 4 saves 800 clocks per MB with this change).
Add x264_clz function as part of the LUT system: this may be useful later.
Note this changes output somewhat as the numbers from the lookup table are not exact.
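A count-leading-zeros helper is a common first step in table-driven log or exp approximations: it yields the integer part of log2, and a small lookup table can then supply the fractional part. The sketch below covers only that first step. The names are hypothetical, the fallback loop is an assumption about what a portable version could look like, and the GCC version guard mirrors the "prior to 3.4" restriction noted in another entry of this log; it assumes a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zero bits of a nonzero 32-bit value. */
static int clz32(uint32_t x)
{
    assert(x != 0);   /* the result is undefined for 0 */
#if defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
    return __builtin_clz(x);
#else
    {
        int n = 0;
        while (!(x & 0x80000000u)) {
            x <<= 1;
            n++;
        }
        return n;
    }
#endif
}

/* floor(log2(x)) for x > 0. */
static int ilog2(uint32_t x)
{
    return 31 - clz32(x);
}
```

Because a table only approximates the fractional part, results differ slightly from exp/pow, which is consistent with the note that output changes somewhat.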