granicus.if.org Git - libx264/log

Manuel Rommel [Tue, 10 Feb 2009 20:06:47 +0000 (12:06 -0800)]

fix 10l in 75b495f2723fcb77f
Original thread:
date: Mon, Feb 9, 2009 at 9:37 PM
subject: [x264-devel] commit: Spare a vec_perm and a vec_mergeh though using a LUT of permutation vectors . (Guillaume Poirier )

commit | commitdiff | tree

Guillaume Poirier [Mon, 9 Feb 2009 20:17:33 +0000 (21:17 +0100)]

Spare a vec_perm and a vec_mergeh by using a LUT of permutation vectors.

commit | commitdiff | tree

Guillaume Poirier [Mon, 9 Feb 2009 20:12:23 +0000 (21:12 +0100)]

Promote chroma planes to 16 byte alignment.
This will simplify vector loads that can only load 16-byte-aligned
data (such as AltiVec).

commit | commitdiff | tree

Fiona Glaser [Mon, 9 Feb 2009 19:30:54 +0000 (11:30 -0800)]

Fix 10L in intra pred
Forgetting a %define resulted in SIGILL on 32-bit systems without SSE (e.g. Athlon XP).

commit | commitdiff | tree

Fiona Glaser [Mon, 9 Feb 2009 07:36:40 +0000 (23:36 -0800)]

Add decimation in i16x16 blocks
Up to +0.04db with CAVLC, generally a lot less with CABAC.

commit | commitdiff | tree

Fiona Glaser [Sat, 7 Feb 2009 10:27:16 +0000 (02:27 -0800)]

Much faster CABAC residual context selection
Up to ~17% faster CABAC RDO, ~36% faster intra-only CABAC RDO.
Up to 7% faster overall in extreme cases.

commit | commitdiff | tree

Fiona Glaser [Sat, 7 Feb 2009 09:57:43 +0000 (01:57 -0800)]

Faster coeff_last64 on 32-bit

commit | commitdiff | tree

Fiona Glaser [Fri, 6 Feb 2009 10:59:36 +0000 (02:59 -0800)]

More intra pred asm optimizations
SSSE3 version of predict_8x8_hu
SSE2 version of predict_8x8c_p
SSSE3 versions of both planar prediction functions
Optimizations to predict_16x16_p_sse2
Some unnecessary REP_RETs -> RETs.
SSE2 version of predict_8x8_vr by Holger.
SSE2 version of predict_8x8_hd.
Don't compile MMX versions of some of the pred functions on x86_64.
Remove now-useless x86_64 C versions of 4x4 pred functions.
Rewrite some of the x86_64-only C functions in asm.

commit | commitdiff | tree

Manuel Rommel [Sun, 8 Feb 2009 20:35:51 +0000 (21:35 +0100)]

Speed-up mc_chroma_altivec by using vec_mladd cleverly, and unrolling.
Also put width == 2 variant in its own scalar function because it's faster
than a vectorized one.

commit | commitdiff | tree

Holger Lubitz [Wed, 4 Feb 2009 20:46:17 +0000 (12:46 -0800)]

Merging Holger's GSOC branch part 2: intra prediction
Assembly versions of most remaining 4x4 and 8x8 intra pred functions.
Assembly version of predict_8x8_filter.
A few other optimizations.
Primarily Core 2-optimized.

commit | commitdiff | tree

Guillaume Poirier [Wed, 4 Feb 2009 10:04:55 +0000 (10:04 +0000)]

10l: fix compilation with GCC 4.3+

commit | commitdiff | tree

Fiona Glaser [Sat, 31 Jan 2009 13:00:39 +0000 (05:00 -0800)]

Faster 8x8dct+CAVLC interleave
Integrate array_non_zero with the CAVLC 8x8dct interleave function.
Roughly 1.5-2x faster than the original separate array_non_zero method.
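
A rough C sketch of the idea, for illustration only: the interleave that splits 64 zigzag-ordered 8x8 coefficients into four 16-coefficient blocks (coefficient k goes to block k mod 4) can accumulate a nonzero flag per block in the same pass, so no separate array_non_zero scan is needed. The function name, argument layout, and `nnz` output array are assumptions made for this sketch, not the signature used in the commit, whose real speedup comes from assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: interleave 64 coefficients into 4 blocks of 16 and
 * record, per block, whether any coefficient is nonzero. */
static void interleave_8x8_cavlc(int16_t dst[64], const int16_t src[64],
                                 uint8_t nnz[4])
{
    assert(dst != src); /* the interleave below is not safe in place */
    for (int i = 0; i < 4; i++) {
        int any = 0; /* OR of all coefficients routed to block i */
        for (int j = 0; j < 16; j++) {
            int16_t c = src[i + 4 * j];
            dst[i * 16 + j] = c;
            any |= c;
        }
        nnz[i] = (uint8_t)(any != 0);
    }
}
```

The flag costs one OR per coefficient inside a loop that already touches every coefficient, which is why merging the two steps is cheaper than a second pass.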

commit | commitdiff | tree

Fiona Glaser [Sat, 31 Jan 2009 09:00:26 +0000 (01:00 -0800)]

Measure CBP cost in i8x8 RD refinement
~0.02-0.05db PSNR gain at high quants in intra-only encoding, pretty small otherwise.
Allows a small optimization in i8x8 encoding.

commit | commitdiff | tree

Guillaume Poirier [Sun, 1 Feb 2009 19:58:00 +0000 (20:58 +0100)]

Take advantage of saturated signed horizontal sum instructions in
the variance computation epilogue, since no overflow can occur there
to trigger saturation.
Suggested by Loren Merritt

commit | commitdiff | tree

Fiona Glaser [Fri, 30 Jan 2009 11:40:54 +0000 (03:40 -0800)]

Massive overhaul of nnz/cbp calculation
Modify quantization to also calculate array_non_zero.
PPC assembly changes by gpoirier.
New quant asm includes some small tweaks to quant and SSE4 versions using ptest for the array_non_zero.
Use this new feature of quant to merge nnz/cbp calculation directly with encoding and avoid many unnecessary calls to dequant/zigzag/decimate/etc.
Also add new i16x16 DC-only iDCT with asm.
Since intra encoding now directly calculates nnz, skip_intra now backs up nnz/cbp as well.
Output should be equivalent except when using p4x4+RDO because of a subtlety involving old nnz values lying around.
Performance increase in macroblock_encode: ~18% with dct-decimate, 30% without at CRF 25.
Overall performance increase 0-6% depending on encoding settings.

commit | commitdiff | tree

Guillaume Poirier [Thu, 29 Jan 2009 09:28:12 +0000 (01:28 -0800)]

Add PowerPC support for "checkasm --bench", reading the time base register.
This isn't ideal since the `time base' register is running at a fraction
of the processor cycle speed, so the measurement isn't as precise as x86's
rdtsc.
It's better than nothing though...

commit | commitdiff | tree

Brad Smith [Thu, 29 Jan 2009 04:35:34 +0000 (04:35 +0000)]

fix detection of pthread and isfinite on OpenBSD

commit | commitdiff | tree

Loren Merritt [Tue, 27 Jan 2009 05:42:51 +0000 (05:42 +0000)]

remove $ECHON kludge, which broke on SunOS. bring back `gcc -MT`.
remove auto-reconfigure on svn update, which has done nothing since we stopped using svn.
fix $AS on sparc (was disabled by mmx check).
fix --extra-asflags (was ignored).
mark bash scripts as bash, not sh

patch partly by Greg Robinson and Jugdish.

commit | commitdiff | tree

Loren Merritt [Mon, 26 Jan 2009 14:28:48 +0000 (14:28 +0000)]

1.6x faster satd_c (and sa8d and hadamard_ac) with pseudo-simd.
60KB smaller binary.
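
For readers unfamiliar with the term, "pseudo-simd" here presumably means packing two narrow sums into one wider integer so that a single add or subtract updates both lanes. The sketch below shows that general trick with two 16-bit lanes in a 32-bit word. All names are hypothetical and this is not the code from the commit. It assumes every lane result still fits in 16 bits and relies on the wrapping narrowing conversion that mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word. The low lane is
 * sign-extended, so a negative low value borrows from the high lane;
 * unpack_hi() undoes that borrow. */
static uint32_t pack2(int lo, int hi)
{
    assert(lo >= -32768 && lo <= 32767);
    assert(hi >= -32768 && hi <= 32767);
    return (uint32_t)(int32_t)lo + ((uint32_t)(int32_t)hi << 16);
}

/* Out-of-range conversion to int16_t is implementation-defined in C;
 * common compilers wrap modulo 2^16, which is what this sketch assumes. */
static int16_t unpack_lo(uint32_t p)
{
    return (int16_t)(uint16_t)p;
}

static int16_t unpack_hi(uint32_t p)
{
    uint32_t no_borrow = p - (uint32_t)(int32_t)unpack_lo(p);
    return (int16_t)(uint16_t)(no_borrow >> 16);
}

/* One butterfly step on both lanes at once: a single add and a single
 * subtract replace two of each. */
static void butterfly2(uint32_t a, uint32_t b, uint32_t *sum, uint32_t *diff)
{
    *sum = a + b;
    *diff = a - b;
}
```

A Hadamard transform is built from such butterflies, so halving the number of scalar additions in each one is a plausible source of the reported speedup.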

commit | commitdiff | tree

Fiona Glaser [Wed, 28 Jan 2009 07:27:56 +0000 (23:27 -0800)]

Hack around a potential failure point in VBV
pred_b_from_p can become absurdly large in static scenes, leading to rare collapses of quality with VBV+B-frames+threads.
This isn't a final fix, but should resolve the problem in most cases in the meantime.

commit | commitdiff | tree

Fiona Glaser [Tue, 27 Jan 2009 07:43:25 +0000 (23:43 -0800)]

Much faster chroma encoding and other opts
~15% faster chroma encode by reorganizing CBP calculation and adding special-case idct_dc function, since most coded chroma blocks are DC-only.
Small optimization in cache_save (skip_bp)
Fix array_non_zero to not violate strict aliasing (should eliminate miscompilation issues in the future)
Add in automatic substitutions for some asm instructions that have an equivalent smaller representation.
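
As an illustration of the strict-aliasing point above: a nonzero test that reads 16-bit coefficients through a wider pointer type is undefined behaviour under C's aliasing rules, while copying the bytes with `memcpy` is well defined and usually compiles to the same loads. The sketch below is one common way to write it. The function name and the fixed 16-coefficient size are assumptions, and the commit may have used a different mechanism (for example unions).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: returns 1 if any of 16 coefficients is nonzero.
 * memcpy avoids casting an int16_t pointer to a uint64_t pointer. */
static int array_non_zero_16(const int16_t *coeffs)
{
    uint64_t chunk[4];
    assert(coeffs != NULL);
    memcpy(chunk, coeffs, sizeof(chunk)); /* 16 * 2 bytes = 32 bytes */
    return (chunk[0] | chunk[1] | chunk[2] | chunk[3]) != 0;
}
```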

commit | commitdiff | tree

Guillaume Poirier [Mon, 26 Jan 2009 14:28:23 +0000 (06:28 -0800)]

add AltiVec implementation of x264_mc_copy_w16_aligned

commit | commitdiff | tree

Guillaume Poirier [Fri, 23 Jan 2009 21:53:06 +0000 (13:53 -0800)]

add AltiVec implementation of x264_pixel_var_16x16 and x264_pixel_var_8x8

commit | commitdiff | tree

Guillaume Poirier [Fri, 23 Jan 2009 09:11:20 +0000 (01:11 -0800)]

add AltiVec 16 <-> 32 bit conversion macros

commit | commitdiff | tree

Guillaume Poirier [Mon, 19 Jan 2009 20:29:27 +0000 (21:29 +0100)]

Replace 16x16=>32 mul + pack + add by a simple 16x16=>16 multiply-add.
Suggested by Loren.

commit | commitdiff | tree

Fiona Glaser [Mon, 19 Jan 2009 23:17:53 +0000 (15:17 -0800)]

Eliminate support for direct_8x8_inference=0
The benefit in the most extreme contrived situation was at most 0.001db PSNR, at the cost of slower decoding.
As this option was basically useless, it was a waste of code and prevented some other useful optimizations.
Remove some unused mc code related to sub-8x8 partitions.
Small deblocking speedup when p4x4 is used.
Also remove unused x264_nal_decode prototype from x264.h.

commit | commitdiff | tree

Brad Smith [Mon, 19 Jan 2009 13:14:53 +0000 (05:14 -0800)]

Add AltiVec and CPU count detection on OpenBSD.

commit | commitdiff | tree

Guillaume Poirier [Sun, 18 Jan 2009 21:44:14 +0000 (22:44 +0100)]

Add AltiVec implementation of predict_8x8c_p. 2.6x faster than scalar C.

commit | commitdiff | tree

Fiona Glaser [Sat, 17 Jan 2009 20:16:37 +0000 (15:16 -0500)]

Warn if direct auto wasn't set on the first pass
And, if it wasn't, run direct auto as if it was the first pass, rather than simply forcing temporal direct mode on all frames.
Also a small tweak to coeff_level_run asm.

commit | commitdiff | tree

Brad Smith [Sat, 17 Jan 2009 12:52:28 +0000 (12:52 +0000)]

Changes the PowerPC ppccommon.h header so it no longer checks for a particular
OS such as Linux but instead looks for HAVE_ALTIVEC_H being set.
Fixes all *BSD/PowerPC builds.

commit | commitdiff | tree

Guillaume Poirier [Wed, 14 Jan 2009 20:56:31 +0000 (21:56 +0100)]

update x264_hpel_filter_altivec's prototype to match that of the C version.
It changed in commit 045ae4045a1827555b3eaab4fbf3c9809e98c58f (factorization of mallocs)
(NB: Altivec implementation wasn't allocating and writing to any scratch memory.)

commit | commitdiff | tree

Guillaume Poirier [Wed, 14 Jan 2009 20:49:42 +0000 (21:49 +0100)]

rename vector+array unions to more closely match the vector typedef names.

commit | commitdiff | tree

Guillaume Poirier [Wed, 14 Jan 2009 20:13:58 +0000 (21:13 +0100)]

Add Altivec implementation of all the remaining 16x16 predict routines.

commit | commitdiff | tree

Fiona Glaser [Wed, 14 Jan 2009 02:11:50 +0000 (21:11 -0500)]

Cache ref costs and use more accurate MV costs
New MV costs should improve quality slightly by improving the smoothness of the field of MV costs (and they're closer to CABAC's actual costs).
Despite being optimized for CABAC, they still help under CAVLC, albeit less.
MV cost change by Loren Merritt

commit | commitdiff | tree

Fiona Glaser [Wed, 14 Jan 2009 01:22:36 +0000 (20:22 -0500)]

Support forced frametypes with scenecut/b-adapt
This allows an input qpfile to be used to force I-frames, for example.
The same can be done through the library interface.
Document the format of the qpfile in --longhelp and the forcing of frametypes in x264.h
Note that forcing B-frames and B-refs may not always have the intended result.
Patch partially by Steven Walters <kemuri9@gmail.com>.

commit | commitdiff | tree

Fiona Glaser [Wed, 14 Jan 2009 00:58:44 +0000 (19:58 -0500)]

Remove an IDIV from i8x8 analysis
Only one IDIV is left in macroblock level code (transform_rd)

commit | commitdiff | tree

Fiona Glaser [Thu, 8 Jan 2009 20:07:16 +0000 (15:07 -0500)]

Fix regression in r1066
With some combinations of video width and other settings, the scratch buffer was slightly too small.
This caused heap corruption on some systems.
Also prevent merange from being raised during encoding with esa/tesa through encoder_reconfig, as this no longer works.

commit | commitdiff | tree

Fiona Glaser [Tue, 6 Jan 2009 21:55:44 +0000 (16:55 -0500)]

Disable B-frames in lossless mode
They hurt compression anyways, and direct auto was bugged with lossless.

commit | commitdiff | tree

Brad Smith [Mon, 5 Jan 2009 22:53:11 +0000 (22:53 +0000)]

Factorize in ppccommon.h the conditional inclusion of altivec.h on Linux systems.

commit | commitdiff | tree

Brad Smith [Mon, 5 Jan 2009 20:58:32 +0000 (15:58 -0500)]

Disable __builtin_clz() intrinsic on gcc versions prior to 3.4.
The function did not exist before that version.
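
A minimal sketch of this kind of guard, assuming a hypothetical `HAVE_BUILTIN_CLZ` macro and helper names that are not taken from the x264 source: the builtin is used only when the compiler reports GCC 3.4 or later, and a portable binary-search fallback is used otherwise.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guard macro: true only for GCC-compatible compilers
 * reporting version 3.4 or later. */
#if defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
#define HAVE_BUILTIN_CLZ 1
#else
#define HAVE_BUILTIN_CLZ 0
#endif

/* Portable count of leading zero bits in a nonzero 32-bit value. */
static int clz32_portable(uint32_t x)
{
    int n = 0;
    assert(x != 0);
    if (!(x & 0xFFFF0000u)) { n += 16; x <<= 16; }
    if (!(x & 0xFF000000u)) { n += 8;  x <<= 8; }
    if (!(x & 0xF0000000u)) { n += 4;  x <<= 4; }
    if (!(x & 0xC0000000u)) { n += 2;  x <<= 2; }
    if (!(x & 0x80000000u)) { n += 1; }
    return n;
}

/* Uses the compiler builtin when available, the fallback otherwise.
 * The result is undefined for x == 0 with the builtin, hence the assert. */
static int clz32(uint32_t x)
{
#if HAVE_BUILTIN_CLZ
    assert(x != 0);
    return __builtin_clz(x);
#else
    return clz32_portable(x);
#endif
}
```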

commit | commitdiff | tree

Fiona Glaser [Fri, 2 Jan 2009 02:44:00 +0000 (21:44 -0500)]

Small tweaks to coeff asm
Factor out a few redundant pxors
Related cosmetics

commit | commitdiff | tree

Steven Walters [Wed, 31 Dec 2008 03:20:37 +0000 (22:20 -0500)]

Use the correct strtok under MSVC
Also change one malloc -> x264_malloc

commit | commitdiff | tree

Fiona Glaser [Wed, 31 Dec 2008 03:14:45 +0000 (22:14 -0500)]

Add stack alignment for lookahead functions
Should allow libx264 to be called from non-gcc-compiled applications without adding force_align_arg_pointer.

commit | commitdiff | tree

Fiona Glaser [Wed, 31 Dec 2008 01:47:45 +0000 (20:47 -0500)]

Add support for SSE4a (Phenom) LZCNT instruction
Significantly speeds up coeff_last and coeff_level_run on Phenom CPUs for faster CAVLC and CABAC.
Also a small tweak to coeff_level_run asm.

commit | commitdiff | tree

Steven Walters [Mon, 29 Dec 2008 05:14:26 +0000 (05:14 +0000)]

factor mallocs out of hpel, ssim, and esa.
there should now be no memory allocation outside of init-time.

commit | commitdiff | tree

Fiona Glaser [Tue, 30 Dec 2008 03:12:17 +0000 (03:12 +0000)]

Much faster CAVLC RDO and bitstream writing
Pure asm version of level/run coding. Over 2x faster than C.
Up to 40% faster CAVLC RDO. Overall benefit up to ~7.5% with RDO or ~5% with fast encoding settings.

commit | commitdiff | tree

Loren Merritt [Tue, 30 Dec 2008 02:52:25 +0000 (21:52 -0500)]

Cosmetics: cleaner syntax for defining temporary registers in asm
Globally define t#[qdwb], so that only t# needs to be locally defined when reorganizing registers

commit | commitdiff | tree

Fiona Glaser [Sun, 28 Dec 2008 02:36:14 +0000 (21:36 -0500)]

Much faster CABAC RDO
Since RDO doesn't care about what order bit costs are calculated, merge sigmap and level coding into the same loop in RDO.
This is bit-exact for 4x4dct but slightly incorrect for 8x8dct due to the sigmap containing duplicated contexts.
However, the PSNR penalty of this is extremely small (~0.001db).
Speed benefit is about 15% in 4x4dct and 30% in 8x8dct residual bit cost calculation at QP20.
Overall encoding speed benefit is up to 5%, depending on encoding settings.
Also remove an old unnecessary CABAC table that hasn't been used for years.

commit | commitdiff | tree

Fiona Glaser [Fri, 26 Dec 2008 12:35:49 +0000 (07:35 -0500)]

VLC table optimizations
Slightly reorganize VLC tables for ~2% faster block_residual_write_cavlc.
Also a small optimization in p8x8 CAVLC.

commit | commitdiff | tree

Loren Merritt [Thu, 25 Dec 2008 03:58:17 +0000 (22:58 -0500)]

Fix crash in --me esa/tesa introduced in r1058
Also suppress the last mingw warning message

commit | commitdiff | tree

Fiona Glaser [Wed, 24 Dec 2008 03:33:28 +0000 (22:33 -0500)]

Optimize variance asm + minor changes
Remove SAD argument from var, not needed anymore.
Speed up var asm a bit by eliminating psadbw and instead HADDWing at end.
Eliminate all remaining warnings on gcc 3.4 on cygwin
Port another minor optimization from lavc (pskip)

commit | commitdiff | tree

Fiona Glaser [Tue, 23 Dec 2008 23:31:48 +0000 (18:31 -0500)]

Minor CABAC cleanups and related optimizations
Merge the two list tables to allow cleaner MC/CABAC/CAVLC code
Remove lots of unnecessary {s
Port some very minor opts from lavc

commit | commitdiff | tree

Loren Merritt [Thu, 11 Dec 2008 19:47:17 +0000 (19:47 +0000)]

faster ESA init
reduce memory if using ESA and not p4x4

commit | commitdiff | tree

Fiona Glaser [Tue, 16 Dec 2008 07:02:49 +0000 (23:02 -0800)]

More macroblock_cache optimizations
Patch partially by Loren Merritt

commit | commitdiff | tree

Fiona Glaser [Mon, 15 Dec 2008 21:15:29 +0000 (13:15 -0800)]

Faster macroblock_cache_rect
Explicit loop unrolling

commit | commitdiff | tree

Fiona Glaser [Mon, 15 Dec 2008 02:30:51 +0000 (18:30 -0800)]

Optimizations in predict_mv_direct
Add some early terminations and minor optimizations
This change may also fix the extremely rare direct+threading MV bug.

commit | commitdiff | tree

David Wolstencroft [Sun, 14 Dec 2008 10:47:28 +0000 (10:47 +0000)]

Fix visual corruption when picture width was not mod 32.
The previous Altivec implementation of mc_chroma assumed that i_src_stride was always mod 16.

commit | commitdiff | tree

Guillaume Poirier [Mon, 8 Dec 2008 20:11:45 +0000 (21:11 +0100)]

Add support for FSF GCC versions >= 4.3 on OSX.
Until now, only Apple's GCC was supported.

commit | commitdiff | tree

Fiona Glaser [Fri, 12 Dec 2008 01:31:52 +0000 (17:31 -0800)]

More accurate refcost for p8x8 CAVLC
Slightly better quality, especially in non-RD mode, with CAVLC.

commit | commitdiff | tree

Loren Merritt [Thu, 11 Dec 2008 04:54:17 +0000 (20:54 -0800)]

use lookup tables instead of actual exp/pow for AQ
Significant speed boost, especially on CPUs with atrociously slow floating point units (e.g. Pentium 4 saves 800 clocks per MB with this change).
Add x264_clz function as part of the LUT system: this may be useful later.
Note this changes output somewhat as the numbers from the lookup table are not exact.
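
To illustrate the general technique (not the tables from this commit): a power of two can be approximated by splitting the exponent into an integer part, handled as a scale by a power of two, and a fractional part, looked up in a small precomputed table. The table size, names, and input range below are assumptions for the sketch. As the commit notes for its own tables, the result is inexact, because the fractional part is truncated to a table index.

```c
#include <assert.h>

#define LUT_BITS 6
#define LUT_SIZE (1 << LUT_BITS)

/* exp2_frac_lut[i] holds approximately 2^(i/64). */
static float exp2_frac_lut[LUT_SIZE];

/* Fill the table once at init time by repeated multiplication, so the
 * sketch needs no libm call. */
static void exp2_lut_init(void)
{
    const double step = 1.0108892860517005; /* approximately 2^(1/64) */
    double v = 1.0;
    for (int i = 0; i < LUT_SIZE; i++) {
        exp2_frac_lut[i] = (float)v;
        v *= step;
    }
}

/* Approximates 2^x for 0 <= x < 31 using one multiply and one lookup. */
static float exp2_lut(float x)
{
    assert(x >= 0.0f && x < 31.0f);
    int fixed = (int)(x * LUT_SIZE);     /* exponent in 1/64 steps, truncated */
    int ipart = fixed >> LUT_BITS;       /* integer part of the exponent */
    int fpart = fixed & (LUT_SIZE - 1);  /* fractional part as table index */
    return exp2_frac_lut[fpart] * (float)(1u << ipart);
}
```

Call `exp2_lut_init()` once before the first `exp2_lut()` call. A matching log2 approximation is where a count-leading-zeros helper such as the x264_clz mentioned above would typically come in, since the position of the highest set bit gives the integer part of the logarithm.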

commit | commitdiff | tree

Fiona Glaser [Thu, 11 Dec 2008 04:53:13 +0000 (20:53 -0800)]

Suppress saveptr warnings on Windows GCC

commit | commitdiff | tree

Fiona Glaser [Thu, 11 Dec 2008 04:52:06 +0000 (20:52 -0800)]

More small speed tweaks to macroblock.c

commit | commitdiff | tree

Fiona Glaser [Mon, 8 Dec 2008 21:44:23 +0000 (13:44 -0800)]

Much faster CAVLC residual coding
Use a VLC table for common levelcodes instead of constructing them on-the-spot
Branchless version of i_trailing calculation (2x faster on Nehalem)
Completely remove array_non_zero_count and instead use the count calculated in level/run coding. Note: this slightly changes output with subme > 7 due to different nonzero counts being stored during qpel RD.

commit | commitdiff | tree

Guillaume Poirier [Fri, 5 Dec 2008 21:26:55 +0000 (22:26 +0100)]

fix compilation with GCC-4.3+

commit | commitdiff | tree

Fiona Glaser [Sun, 30 Nov 2008 07:13:58 +0000 (23:13 -0800)]

High Profile allows 25% higher maxbitrate/cpb
Correct level detection to take this into account.
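
The arithmetic behind that correction is small enough to sketch. Assuming a hypothetical helper that receives a level's base limit (bitrate or CPB size, in whatever unit the level table uses), the High Profile limit is the base value scaled by 5/4:

```c
#include <assert.h>

/* Hypothetical helper: scale a level's base bitrate or CPB limit by 1.25
 * for High Profile, using integer arithmetic (x * 5 / 4). */
static int high_profile_limit(int base_limit)
{
    assert(base_limit >= 0);
    return base_limit * 5 / 4;
}
```

For example, a base limit of 20000 becomes 25000, so a stream at 24000 that would exceed the base limit still fits the same level in High Profile.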

commit | commitdiff | tree

BugMaster [Sat, 29 Nov 2008 22:04:29 +0000 (14:04 -0800)]

s/nasm/yasm in VS project file

commit | commitdiff | tree

Fiona Glaser [Sat, 29 Nov 2008 12:49:18 +0000 (04:49 -0800)]

Cosmetic: update various file headers.

commit | commitdiff | tree

Loren Merritt [Sat, 29 Nov 2008 11:54:02 +0000 (11:54 +0000)]

add date and compiler to `x264 --version`

commit | commitdiff | tree

Fiona Glaser [Fri, 28 Nov 2008 22:32:11 +0000 (14:32 -0800)]

10L in r1041

commit | commitdiff | tree

Fiona Glaser [Fri, 28 Nov 2008 03:37:56 +0000 (19:37 -0800)]

Significantly faster CABAC and CAVLC residual coding and bit cost calculation
Early-terminate in residual writing using stored nnz counts
To allow the above, store nnz counts for luma and chroma DC
Add assembly functions to find the last nonzero coefficient in a block
Overall ~1.9% faster at subme9+8x8dct+qp25 with CAVLC, ~0.7% faster with CABAC
Note this changes output slightly with CABAC RDO because it requires always storing correct nnz values during RDO, which wasn't done before in cases it wasn't useful.
CAVLC output should be equivalent.
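
For reference, the "last nonzero coefficient" search that the new assembly functions accelerate can be written in plain C as below. This is an illustrative sketch with an assumed name and signature, not the code from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: index of the last nonzero coefficient among the
 * first `count` entries, or -1 if all of them are zero. */
static int coeff_last(const int16_t *coeffs, int count)
{
    assert(count >= 0);
    int i = count - 1;
    while (i >= 0 && coeffs[i] == 0)
        i--;
    return i;
}
```

With a stored nonzero count per block, a caller can skip this search, and the residual writing that follows it, whenever the count is zero, which is the early termination described above.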

commit | commitdiff | tree

Fiona Glaser [Thu, 27 Nov 2008 07:42:55 +0000 (23:42 -0800)]

dequant_4x4_dc assembly
About 3.5x faster DC dequant on Conroe

commit | commitdiff | tree

Loren Merritt [Thu, 27 Nov 2008 02:37:46 +0000 (02:37 +0000)]

fix an overflow in dct4x4dc_mmx
(unlikely to have occurred in any real video)

commit | commitdiff | tree

Fiona Glaser [Wed, 26 Nov 2008 00:30:39 +0000 (16:30 -0800)]

Remove nasm support
Nasm won't correctly parse the SSE4 code introduced a few revisions ago, so we're removing support.
Users should upgrade to yasm 0.6.1 or later.

commit | commitdiff | tree

BugMaster [Tue, 25 Nov 2008 23:11:24 +0000 (15:11 -0800)]

Fix rare warning messages in ratecontrol due to r1020

commit | commitdiff | tree

BugMaster [Tue, 25 Nov 2008 23:10:43 +0000 (15:10 -0800)]

Fix MSVC compilation and clean up MSVC build file
Remove Release64 which never worked anyways.

commit | commitdiff | tree

Fiona Glaser [Tue, 25 Nov 2008 09:04:26 +0000 (01:04 -0800)]

Faster width4 SSD+SATD, SSE4 optimizations
Do satd 4x8 by transposing the two blocks' positions and running satd 8x4.
Use pinsrd (SSE4) for faster width4 SSD
Globally replace movlhps with punpcklqdq (it seems to be faster on Conroe)
Move mask_misalign declaration to cpu.h to avoid warning in encoder.c.
These optimizations help on Nehalem, Phenom, and Penryn CPUs.

commit | commitdiff | tree

Guillaume Poirier [Tue, 25 Nov 2008 16:27:27 +0000 (17:27 +0100)]

fix indentation, whitespace cleanup, more consistent indentation of macro backslashes

commit | commitdiff | tree

David Wolstencroft [Sat, 22 Nov 2008 16:54:38 +0000 (17:54 +0100)]

Change some macros to be more sensitive to memory alignment, thus avoiding
useless loads/stores and calculations of permutation vectors.
Affected functions are all of mc_luma, mc_chroma, 'get_ref', SATD, SA8D and deblock.
Overall gains vary from ~5% to 15% depending on settings, measured on a 1.42 GHz G4.

commit | commitdiff | tree

Loren Merritt [Fri, 7 Nov 2008 05:31:24 +0000 (05:31 +0000)]

refactor satd. 20KB smaller binary.
refactor sa8d. slightly faster.
more checkasm for hadamard.

commit | commitdiff | tree

Fiona Glaser [Tue, 25 Nov 2008 05:56:24 +0000 (21:56 -0800)]

Fix crash with threads and SSEMisalign on Phenom
Misalign mask needed to be set separately for each encoding thread.

commit | commitdiff | tree

Fiona Glaser [Fri, 21 Nov 2008 11:39:11 +0000 (03:39 -0800)]

Phenom CPU optimizations
Faster hpel_filter by using unaligned loads instead of emulated PALIGNR
Faster hpel_filter on 64-bit by using the 32-bit version (the cost of emulated PALIGNR is high enough that the savings from caching intermediate values is not worth it).
Add support for misaligned_mask on Phenom: ~2% faster hpel_filter, ~4% faster width16 multisad, 7% faster width20 get_ref.
Replace width12 mmx with width16 sse on Phenom and Nehalem: 32% faster width12 get_ref on Phenom.
Merge cpu-32.asm and cpu-64.asm
Thanks to Easy123 for contributing a Phenom box for a weekend so I could write these optimizations.

commit | commitdiff | tree

Fiona Glaser [Fri, 21 Nov 2008 04:11:14 +0000 (20:11 -0800)]

A few tweaks to decimate asm
A little bit faster on both 32-bit and 64-bit

commit | commitdiff | tree

Fiona Glaser [Thu, 13 Nov 2008 00:50:31 +0000 (16:50 -0800)]

Nehalem optimization part 2: SSE2 width-8 SAD
Helps a bit on Phenom as well
~25% faster width8 multiSAD on Nehalem

commit | commitdiff | tree

Fiona Glaser [Tue, 11 Nov 2008 07:34:02 +0000 (23:34 -0800)]

Add subme=0 (fullpel motion estimation only)
Only for experimental purposes and ultra-fast encoding. Probably not a good idea for firstpass.

commit | commitdiff | tree

Fiona Glaser [Mon, 10 Nov 2008 23:34:48 +0000 (15:34 -0800)]

Fix minor memory leak in r1022

commit | commitdiff | tree

Fiona Glaser [Mon, 10 Nov 2008 23:32:06 +0000 (15:32 -0800)]

r1024 borked checkasm
Remove idct/dct2x2 from checkasm as they are no longer in dctf

commit | commitdiff | tree

Fiona Glaser [Mon, 10 Nov 2008 01:39:21 +0000 (17:39 -0800)]

Faster chroma encoding
9-12% faster chroma encode.
Move all functions for handling chroma DC that don't have assembly versions to macroblock.c and inline them, along with a few other tweaks.

commit | commitdiff | tree

Fiona Glaser [Mon, 10 Nov 2008 01:34:31 +0000 (17:34 -0800)]

Various cosmetics and minor fixes
Disable hadamard_ac sse2/ssse3 under stack_mod4
Fix one MSVC compilation warning
Fix compilation in debug mode in certain cases on x64
Remove eval.c from MSVC project
Fix crash when VBV is used in CQP mode
Patches by MasterNobody

commit | commitdiff | tree

Fiona Glaser [Sun, 9 Nov 2008 04:16:17 +0000 (20:16 -0800)]

Faster b-adapt + adaptive quantization
Factor out pow to be only called once per macroblock. Speeds up b-adapt, especially b-adapt 2, considerably.
Speed boost is as high as 24% with b-adapt 2 + b-frames 16.

commit | commitdiff | tree

Fiona Glaser [Fri, 7 Nov 2008 19:39:43 +0000 (11:39 -0800)]

Faster CABAC residual encoding
6% faster block_residual_write_cabac in RD mode.

commit | commitdiff | tree

Fiona Glaser [Thu, 6 Nov 2008 03:51:59 +0000 (19:51 -0800)]

Fix potential crash in the case that the input statsfile is too short
Also resolve various other potential weirdness (such as multiple copies of the same error message in threaded mode).

commit | commitdiff | tree

Fiona Glaser [Wed, 5 Nov 2008 11:11:45 +0000 (03:11 -0800)]

Initial Nehalem CPU optimizations
movaps/movups are no longer equivalent to their integer equivalents on the Nehalem, so that substitution is removed.
Nehalem has a much lower cacheline split penalty than previous Intel CPUs, so cacheline workarounds are no longer necessary.
Thanks to Intel for providing Avail Media with the pre-release Nehalem CPU needed to prepare these (and other not-yet-committed) optimizations.
Overall speed improvement with Nehalem vs Penryn at the same clock speed is around 40%.

commit | commitdiff | tree

Gabriel Bouvigne [Tue, 4 Nov 2008 17:56:03 +0000 (09:56 -0800)]

Fix potential infinite loop in VBV under GCC 4.2

commit | commitdiff | tree

Fiona Glaser [Tue, 4 Nov 2008 06:59:49 +0000 (22:59 -0800)]

Encoder_reconfig: esa/tesa can only be enabled if they were on to begin with
Bug report by kemuri-_9.

commit | commitdiff | tree

Loren Merritt [Thu, 30 Oct 2008 07:47:09 +0000 (00:47 -0700)]

Fix bug in hadamard_ac SSE assembly
Some extreme inputs could cause overflows.

commit | commitdiff | tree

Fiona Glaser [Wed, 29 Oct 2008 03:35:15 +0000 (20:35 -0700)]

Full sub8x8 RD mode decision
Small speed penalty with p4x4 enabled, but significant quality gain at subme >= 6
As before, gain is proportional to the amount of p4x4 actually useful in a given input at the given bitrate.

commit | commitdiff | tree

Fiona Glaser [Sat, 25 Oct 2008 08:50:08 +0000 (01:50 -0700)]

Optimize CABAC bit cost calculation
Speed up cabac mvd and add new precalculated transition/entropy table.
Add "noup" function for cabac operations to not update the state table when it isn't necessary.
1-3% faster macroblock_size_cabac.
Cosmetics

commit | commitdiff | tree

Anders Ossowicki [Fri, 24 Oct 2008 05:36:11 +0000 (22:36 -0700)]

Replace "git-command" with "git command" in version.sh for git 1.6 support

commit | commitdiff | tree

Loren Merritt [Thu, 23 Oct 2008 20:45:04 +0000 (13:45 -0700)]

Add assembly version of CAVLC 8x8dct interleave
Faster CAVLC encoding and RDO with 8x8dct

commit | commitdiff | tree

Alexander Strange [Wed, 22 Oct 2008 22:55:30 +0000 (15:55 -0700)]

Add support for psy-rd/trellis to encoder_reconfig
