Jingning Han [Thu, 22 May 2014 16:44:40 +0000 (09:44 -0700)]
Inverse 16x16 2D-DCT SSSE3 implementation
This commit enables the SSSE3 implementation of full inverse 16x16
2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles,
about 7% speed-up.
Yaowu Xu [Fri, 23 May 2014 19:23:29 +0000 (12:23 -0700)]
Use extreme values for input in convovle tests
The intepolation filter functions can be better tested withe extreme
values, especially given the optimization functions are prone to
overflow signed 16 bit intermediate value when operation order is
wrong.
Paul Wilkins [Wed, 21 May 2014 12:17:00 +0000 (13:17 +0100)]
Further first pass allocation changes.
Further changes to first pass allocation for gf/arf groups.
Three variables removed from TWO_PASS structure as only
now used locally. Dont adjust gf_group_bits in the post
encode update as this will no longer have any effect.
Yunqing Wang [Fri, 23 May 2014 18:52:20 +0000 (11:52 -0700)]
Fix decoder mismatch in sub-pixel SSSE3 intrinsic filters
In 8-tap filtering, to guarantee the intermediate results fit in
16 bits, the order of accumulating the products needs to be done
correctly, and the largest product should be added last. This
patch fixed the problem using the method in commit "Correct ssse3
8/16-pixel wide sub-pixel filter calculation".
Deb Mukherjee [Fri, 23 May 2014 08:26:01 +0000 (01:26 -0700)]
Fixes a bug for uninitialized frame buffers
Fixes a bug introduced in
https://gerrit.chromium.org/gerrit/#/c/69779/13, where
uninitialized frame buffers due to corrupt and short
buffer sizes, may cause a crash.
This patch fixes the currently failing
video/processing/static_image/vp8_convert_test
Yaowu Xu [Wed, 21 May 2014 16:52:23 +0000 (09:52 -0700)]
change to use assembly version of ssse3 filter code
As mismatchs were found between the intrinsic version and c only. The
commit temporarily revert to use the matching assembly version to
allow further investigation.
Alex Converse [Thu, 22 May 2014 23:17:59 +0000 (16:17 -0700)]
Use offset mode info when filling pc tree.
Use the appropriate subblock offset mode info rather than the parent
block base, when filling mbmi in the pc tree in nonrd_use_partition.
This mimics what is done in the vertical case and what is done for
both cases in nonrd_pick_partition.
This change has little practical effect at the moment since in speed 5
rt horizontal and vertical partitions are currently only used unpaired
at edges of the picture.
Deb Mukherjee [Thu, 22 May 2014 15:11:37 +0000 (08:11 -0700)]
Adjust cq_level in constrained quality mode
If we are already saving a lot in bits from the target (maximum)
bitrate in the constrained quality mode, allow the quantizer
to go lower than the cq level. This hopefully will solve issues
with getting too low a bitrate and consequently poor quality for
certain videos in cq mode.
Deb Mukherjee [Wed, 21 May 2014 20:55:56 +0000 (13:55 -0700)]
Renames x86_64 specific asm files
Renames all x86_64 specific assembly files to consistently
end in _x86_64.asm. This will be useful for build systems to
handle these files differently.
All new 64-bit specific assembly files should use the new
naming convention.
Yaowu Xu [Tue, 20 May 2014 18:36:44 +0000 (11:36 -0700)]
Enable various thresholds of motion detection
This commit changed to enable the encoder to adjust motion dection
speed threshold based on picture size. In addition, cpu-used 1 now
does a partition search every other frame instead of every third
frame for low resolution inputs.
The change has no quality/speed impact for 720p and above. Test
showed the change increase encoding time by between 3% to 6% for
cpu-used 2 encodiong of 360p sequences. It also has a compression
gain about .3%.
For cpu-used 2, the change resolved some very disturbing visual
artifacts in certain sequences when large block partitionings and
transforms are used as a result of copying the partition from a
previous frame.
For example, if a tile video has one row and four cols, decode_tiles
will decode the Tile1, then Tile2, then Tile3, then Tile4.
And during decode each tile, decode_tile will decode row by row in
each tile.
For frame parallel decoding, decode_tiles will decode video in row order
across the tiles. So the order will be:
"Decode 1st row of Tile1" -> "Decode 1st row of Tile2"
-> "Decode 1st row of Tile3" -> "Decode 1st row of Tile4"
-> "Decode 2nd row of Tile1" -> "Decode 2nd row of Tile2"
-> "Decode 2nd row of Tile3" -> "Decode 2nd row of Tile4"-> "loopfilter 1st row"
Jingning Han [Mon, 19 May 2014 19:33:40 +0000 (12:33 -0700)]
Adjust the forward 16x16 DCT computation steps
This commit adjusts the forward 16x16 DCT computation steps to
simplify the register level operations. It fixes the corresponding
sse2 version accordingly.
Yunqing Wang [Wed, 7 May 2014 17:39:00 +0000 (10:39 -0700)]
Add static-threshold skipping in non-rd mode
Added a skipping test in non-rd inter-mode. After interpolation
prediction step, the residuals are tested to see if they will be
quantized to 0 based on modeling between spatial domain and
frequency domain.
Set static-thresh to 800 for >=720p and 300 for <720p, rtc set
tests showed
1. Speed 5, psnr: -0.514%; ssim: -1.748%;
speedup on related clips: 5% -11%
2. Speed 6, psbr: -0.628%; ssim: -1.637%;
speedup on related clips: 4% - 9%