Ronald S. Bultje [Tue, 25 Jun 2013 18:26:49 +0000 (11:26 -0700)]
Add averaging-SAD functions for 8-point comp-inter motion search.
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.
Jingning Han [Fri, 21 Jun 2013 18:45:47 +0000 (11:45 -0700)]
Add 8x8 dct/adst unit tests
This commit enables 8x8 DCT and hybrid transform unit tests. It
also tunes the forward hybrid transform rounding opertions for
more precise round-trip performance.
John Koleszar [Sat, 22 Jun 2013 00:06:43 +0000 (17:06 -0700)]
Fix loopfilter of leftmost 4x4 edges in SB
For cases where there's no transform set in bit 0 (the left edge of
the SB) but bit 0 of mask_4x4_int is set (the edge 4 pixels from the
left edge needs filtering), it was incorrectly being skipped before.
This situation only happens on the leftmost edge of the image, as
the edge at column 0 is intentionally skipped since there aren't
pixels to the left to read.
Ronald S. Bultje [Fri, 21 Jun 2013 19:54:52 +0000 (12:54 -0700)]
Implement SSE2 block_error.
Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.
Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.
Ronald S. Bultje [Thu, 20 Jun 2013 22:59:48 +0000 (15:59 -0700)]
SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance().
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.
Deb Mukherjee [Wed, 19 Jun 2013 23:23:21 +0000 (16:23 -0700)]
Improving model rd with variance and quant step
Improves the rd modeling function and implements them using interpolation
from a table which is a little faster. Also uses sse as input to the
modeling function rather than var - since there is no dc prediction
used and as a result the sse works a little better.
Ronald S. Bultje [Thu, 20 Jun 2013 16:34:25 +0000 (09:34 -0700)]
Implement sse2 and ssse3 versions for all sub_pixel_variance sizes.
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
3min58). Specific changes to timings for each function compared to
original assembly-optimized versions (or just new version timings if
no previous assembly-optimized version was available):
James Zern [Wed, 19 Jun 2013 02:19:20 +0000 (19:19 -0700)]
test_libvpx: disable pthreads in gtest
currently threading is internal to libvpx so thread safety is unneeded
in libgtest -- visual studio builds already operate in this way as they
do not have pthread.h available by default.
this removes an unconditional link to libpthread using $(extralibs)
should libvpx require it.
Yunqing Wang [Wed, 19 Jun 2013 00:09:50 +0000 (17:09 -0700)]
Add two-pass quantization
Optimized the quantization function by making it a two-pass
process. The first pass does a quick checking of the transform
coefficients against the base ZBIN, and only keep the good
enough set of coefficients for quantization. A skipping
check is added. If all coefficients are within the base ZBIN, no
quantization is needed. The second pass is the actual quantization
pass, which only processes the coefficient subset determined
in first pass. This reduces the computation. Furthermore, an
alternitive method is used for large transform size, which often
has sparse nonzero quantized coefficients.
Overall, the encoder speedup is about 4%. The quantization function
itself gets 20% faster.
Jingning Han [Fri, 14 Jun 2013 18:28:56 +0000 (11:28 -0700)]
Make fdct32 computation flow within 16bit range
This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.
This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.
Ronald S. Bultje [Mon, 17 Jun 2013 23:54:09 +0000 (16:54 -0700)]
Move subpixel variance function from common/ to encoder/.
This seems to only be used in the encoder. Also remove an empty wrapper
file that contained forward declarations for this function, but didn't
actually define any actual functions.
Dmitry Kovalev [Mon, 17 Jun 2013 20:50:22 +0000 (13:50 -0700)]
Fixing compilation error on Mac OS.
The error happened because of vp8_decrypt_cb typedef redefinition in both
treereader.h and vp8dx.h. Removing typedef from vp8dx.h in favor of raw
function pointer declaration.