Deb Mukherjee [Mon, 22 Jul 2013 21:47:57 +0000 (14:47 -0700)]
Flexible support for various pattern searches
Adds a few pattern searches to achieve various tradeoffs
between motion estimation complexity and performance.
The search framework is unified across these searches so that a
common pattern search function is used for all. This will also make
it easier to experiment with various patterns, or combinations
thereof, at different scales in the future.
The new pattern search is multi-scale and is capable of using
different patterns at different scales.
The new hex search uses 8 points at the smallest scale
and 6 points at other scales.
Two other pattern searches, big-diamond and square, are
also added. Big-diamond uses 4 points at the smallest scale and
8 points in a diamond shape at the larger scales.
Square is very similar conceptually to the default n-step search
but is somewhat faster since it keeps only one survivor across
all scales.
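The multi-scale search described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the names MotionVec, CostFn, SearchPattern, HEX_PATTERN, pattern_search and squared_distance_cost are hypothetical, the point offsets are assumed, and the toy cost function stands in for a real SAD-plus-rate metric. It keeps a single survivor and moves from the largest scale to the smallest, using 6 points at the larger scales and 8 at the smallest.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int row, col; } MotionVec;

/* Hypothetical cost callback: lower is better (for example SAD plus MV rate). */
typedef int (*CostFn)(MotionVec mv, void *ctx);

#define MAX_POINTS 8

typedef struct {
  int num_small;                     /* points used at the smallest scale */
  MotionVec small_pts[MAX_POINTS];
  int num_large;                     /* points used at every larger scale */
  MotionVec large_pts[MAX_POINTS];
} SearchPattern;

/* Hex pattern: 8 points at scale 0, 6 points above it.
 * The exact offsets are an assumption made for this sketch. */
static const SearchPattern HEX_PATTERN = {
  8, { {-1, -1}, {-1, 0}, {-1, 1}, {0, -1}, {0, 1}, {1, -1}, {1, 0}, {1, 1} },
  6, { {-1, -2}, {1, -2}, {2, 0}, {1, 2}, {-1, 2}, {-2, 0} }
};

/* Toy cost for demonstration: squared distance to a target vector in ctx. */
static int squared_distance_cost(MotionVec mv, void *ctx) {
  const MotionVec *target = (const MotionVec *)ctx;
  const int dr = mv.row - target->row;
  const int dc = mv.col - target->col;
  return dr * dr + dc * dc;
}

/* Coarse-to-fine search that keeps one survivor across all scales. */
static MotionVec pattern_search(MotionVec start, int start_scale,
                                const SearchPattern *p, CostFn cost,
                                void *ctx) {
  assert(p != NULL && cost != NULL && start_scale >= 0);
  MotionVec best = start;
  int best_cost = cost(best, ctx);
  for (int s = start_scale; s >= 0; --s) {
    const MotionVec *pts = (s == 0) ? p->small_pts : p->large_pts;
    const int n = (s == 0) ? p->num_small : p->num_large;
    const int step = (s == 0) ? 1 : 1 << (s - 1);  /* pattern size at scale s */
    int improved;
    do {
      /* Evaluate every pattern point around the current center. */
      const MotionVec center = best;
      improved = 0;
      for (int i = 0; i < n; ++i) {
        const MotionVec cand = { center.row + pts[i].row * step,
                                 center.col + pts[i].col * step };
        const int c = cost(cand, ctx);
        if (c < best_cost) {
          best_cost = c;
          best = cand;
          improved = 1;
        }
      }
    } while (improved);  /* re-center until this scale stops improving */
  }
  return best;
}
```

With the toy cost and a target of (5, -3), calling pattern_search from (0, 0) at scale 2 converges on the target. A real encoder would also clamp candidates to the allowed motion vector range, which this sketch omits.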
Move the fdct32x32 SSE2 implementation into a separate file.
This is in preparation for the SSE2 version of the high-precision
32x32 forward DCT, which will share a lot of code with the existing
low-precision version used for rate-distortion search.
Deb Mukherjee [Wed, 31 Jul 2013 16:33:58 +0000 (09:33 -0700)]
Add variance based mode/skipping
Adds a speed feature to skip all intra modes other than
DC_PRED if the source variance is small. This feature is
made part of speed 1 and up.
Results on derf300: psnr -0.07%, speedup about 1-2%
Also uses the source variance to fine-tune the early
termination criteria when FLAG_EARLY_TERMINATE is on.
This feature is made part of speed 2 and up.
Results on derf300: psnr -0.52%, speedup about 5-7%
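The intra mode pruning rule can be illustrated with a small predicate. This is only a sketch: the enum IntraMode, the function should_skip_intra_mode and the threshold parameter are hypothetical names, and the commit message does not state the actual threshold or how the early termination criteria are adjusted.

```c
#include <assert.h>

/* Hypothetical subset of intra prediction modes, for illustration only. */
typedef enum { DC_PRED, V_PRED, H_PRED, TM_PRED, INTRA_MODE_COUNT } IntraMode;

/* Returns 1 when the mode search should skip this intra mode.
 * Low-variance (flat) source blocks only try DC_PRED; the threshold
 * value is an assumption supplied by the caller. */
static int should_skip_intra_mode(IntraMode mode, unsigned int source_variance,
                                  unsigned int threshold) {
  assert(mode >= DC_PRED && mode < INTRA_MODE_COUNT);
  return mode != DC_PRED && source_variance < threshold;
}
```

A mode loop would call this predicate before evaluating the rate-distortion cost of each candidate and continue to the next mode when it returns 1.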
Dmitry Kovalev [Mon, 5 Aug 2013 19:26:15 +0000 (12:26 -0700)]
Changing the order of switchable filter enum constants.
This changeset allows us to remove the vp9_switchable_interp and
vp9_switchable_interp_map arrays and makes the code much clearer. We
still have to use these mappings, but only inside the read_interp_filter_type
and write_interp_filter_type functions.
Jim Bankoski [Mon, 5 Aug 2013 19:07:30 +0000 (12:07 -0700)]
Begin to restrict x86inc.asm usage
Chromium does not support 32-bit builds for Mac that use x86inc.asm.
Make the files which include it build only when targeting 64-bit or
when PIC is not enabled, starting with vp9_copy_sse2.asm.
Mans Rullgard [Tue, 30 Jul 2013 17:11:06 +0000 (18:11 +0100)]
vp9: neon: optimise loads in horiz convolve functions
Loading to single lanes in multiple registers is expensive since
it requires a read and write of each register which saturates
the register file access. Loading to single registers followed
by a separate transpose reduces this pressure.
Deb Mukherjee [Thu, 1 Aug 2013 19:56:12 +0000 (12:56 -0700)]
Adds a source variance computation function
Adds a function to compute source variance for various
sb_types to be used for pruning mode and partition searches.
[The existing activity measure function is currently specialized
for only 16x16 MBs and needs to be updated].
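A source variance function for arbitrary block sizes might look like the sketch below. The name block_source_variance and its signature are hypothetical, and the commit may normalize differently; this version returns the per-pixel variance of an 8-bit block using integer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Per-pixel variance of a width x height block of 8-bit samples.
 * stride is the distance in bytes between the starts of adjacent rows. */
static unsigned int block_source_variance(const uint8_t *src, int stride,
                                          int width, int height) {
  assert(src && width > 0 && height > 0 && stride >= width);
  uint64_t sum = 0;  /* sum of samples */
  uint64_t sse = 0;  /* sum of squared samples */
  for (int r = 0; r < height; ++r) {
    for (int c = 0; c < width; ++c) {
      const uint64_t v = src[r * stride + c];
      sum += v;
      sse += v * v;
    }
  }
  const uint64_t n = (uint64_t)width * (uint64_t)height;
  /* Var = E[x^2] - E[x]^2, kept in integers: (sse - sum^2 / n) / n. */
  return (unsigned int)((sse - (sum * sum) / n) / n);
}
```

Because the block dimensions are parameters, the same routine covers every block size rather than only 16x16.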
Jingning Han [Thu, 1 Aug 2013 19:45:16 +0000 (12:45 -0700)]
Remove unused vp9_short_idct10_32x32_add
The inverse 32x32 transform detects all zero entries and skips the
computations accordingly per 8 rows in the first 1-D operation. The
function vp9_short_idct10_32x32_add performs differently and is not
used anywhere, hence removed.
Adrian Grange [Wed, 31 Jul 2013 19:58:19 +0000 (12:58 -0700)]
Changed name of rd_pick_intra4x4mby_modes
The function name rd_pick_intra4x4mby_modes is confusing, so
I changed it to rd_pick_intra_sub_8x8_y_modes to better
reflect what the function does. Also added const qualifiers
to some of the input parameters and removed camel-case.
Jingning Han [Wed, 31 Jul 2013 23:50:34 +0000 (16:50 -0700)]
Optimize 32x32 2D inverse DCT for speed-up
This commit exploits the sparsity of the quantized coefficient matrix.
It checks each 32x8 array and skips the corresponding inverse
transformation if all entries are zero.
For ped1080p at 8000 kbps, this on average reduces the runtime of the
32x32 inverse 2D-DCT SSE2 function from 6256 cycles to 5200
cycles. It makes the overall encoding process about 2% faster at
speed 0. The speed-up is more pronounced for the decoding process.
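The zero-detection step can be sketched in plain C as follows. This is not the SSE2 code from the commit: the function nonzero_group_mask and the constants are hypothetical, and treating each 32x8 array as 8 consecutive rows of a row-major 32x32 block is an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum { IDCT_SIZE = 32, ROWS_PER_GROUP = 8,
       NUM_GROUPS = IDCT_SIZE / ROWS_PER_GROUP };

/* Returns a bitmask with bit g set when group g (8 consecutive rows of a
 * row-major 32x32 coefficient block) contains at least one nonzero entry. */
static unsigned int nonzero_group_mask(const int16_t *coeffs) {
  assert(coeffs);
  unsigned int mask = 0;
  for (int g = 0; g < NUM_GROUPS; ++g) {
    const int16_t *p = coeffs + g * ROWS_PER_GROUP * IDCT_SIZE;
    int any = 0;
    /* OR all entries together: the result is zero only if every entry is. */
    for (int i = 0; i < ROWS_PER_GROUP * IDCT_SIZE; ++i) any |= p[i];
    if (any) mask |= 1u << g;
  }
  return mask;
}
```

Since the 1-D transform is linear, a group whose mask bit is clear produces all-zero output in the first pass, so the caller can write zeros for it and skip the arithmetic.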
Jingning Han [Thu, 1 Aug 2013 00:02:06 +0000 (17:02 -0700)]
Remove unnecessary arguments in rd_pick_ref_frame
This commit removes redundant arguments passed to
rd_pick_reference_frame. This resolves the clang warnings about
potential use of uninitialized values.
Jingning Han [Tue, 30 Jul 2013 22:47:12 +0000 (15:47 -0700)]
Make the use of ref_frame index consistent
Refactor the frame buffer referencing in choose_partition and make
it consistent with other places. This is meant to prevent potential
issues when we extend the reference frame buffer.
Jingning Han [Mon, 29 Jul 2013 21:54:31 +0000 (14:54 -0700)]
Skip redundant tokenization in rd loop
This commit makes the encoder skip the redundant tokenization process
in the rate-distortion optimization search loop, while updating the
entropy contexts accordingly. It makes the speed 0 encoding process
about 0.5% faster with no change in compression performance.