Jingning Han [Thu, 4 Dec 2014 22:53:36 +0000 (14:53 -0800)]
Enable conditional skip path in rd_pick_intra_sby_mode
These speed-up features for key frame coding are only turned on
in the settings of hybrid non-RD and RD mode decision. It provides
about 20% speed-up to the hybrid key frame coding at the expense
of certain compression performance loss. For vidyo1, the key frame
coding statistics are changed
9838F, 35.020 dB, 61677 us -> 9920F, 34.834 dB, 47556 us
Overall rtc set compression performance is down by -0.257%.
Jingning Han [Thu, 4 Dec 2014 17:29:24 +0000 (09:29 -0800)]
Use hybrid RD and non-RD coding flow for key frame coding
When block size is below 16x16, the encoder swap from non-RD to
RD mode for key frame coding. This largely brough back the key
frame compression performance. For vidyo1 at 1000 kbps, the key
frame coding statistics are changed
9978F, 34.183 dB, 36807 us -> 9838F, 35.020 dB, 61677 us
As compared to the full RD case
7187F, 34.930 dB, 214470 us
The overall rtc set coding performance (single key frame setting)
is improved by 1.5%.
Peter de Rivaz [Thu, 4 Dec 2014 10:51:10 +0000 (10:51 +0000)]
Use the RTC optimizations when in high bitdepth mode.
Change 72193 made the encoder behave differently
when configured with and without high bitdepth.
This change means the same algorithm is used for both.
Yunqing Wang [Tue, 2 Dec 2014 23:47:41 +0000 (15:47 -0800)]
vp9_ethread: the tile-based multi-threaded encoder
Currently, VP9 supports column-tile encoding, which allows a frame
to be encoded in multiple column tiles independently. The number of
column tiles are set by encoder option "--tile-columns". This
provides a way to encode a frame in parallel.
Based on previous set of patches, this patch implemented the tile-
based multi-threaded encoder. Each thread processes one or more
tiles.
Usage:
For HD clips:
--tile-columns=2 --threads=1/2/3/4
While using 4 threads, tests showed that the encoder achieved
2.3X - 2.5X speedup at good-quality speed 3, and 2X speedup at
realtime speed 5.
Marco [Wed, 12 Nov 2014 22:51:49 +0000 (14:51 -0800)]
Enable non-rd mode coding on key frame, for speed 6.
For key frame at speed 6: enable the non-rd mode selection in speed setting
and use the (non-rd) variance_based partition.
Adjust some logic/thresholds in variance partition selection for key frame only (no change to delta frames),
mainly to bias to selecting smaller prediction blocks, and also set max tx size of 16x16.
Loss in key frame quality (~0.6-0.7dB) compared to rd coding,
but speeds up key frame encoding by at least 6x.
Average PNSR/SSIM metrics over RTC clips go down by ~1-2% for speed 6.
Jingning Han [Wed, 3 Dec 2014 02:16:06 +0000 (18:16 -0800)]
Rework coeff probability model update for rtc coding
This commit reworks the ONE_LOOP_REDUCED coefficient probability
model update process. It allows model update for every coefficient
across the spectrum at a coarser resolution, instead of performing
precise update only for certain subset of probability models.
The overall runtime remains nearly same (<1% change) for speed -6.
The compression performance is improved by 7.5% in PSNR for speed
-5 and 4.57% for speed -6, respectively.
Peter de Rivaz [Tue, 2 Dec 2014 12:14:52 +0000 (12:14 +0000)]
Reinsert macro to fix issue 884.
Change 72056 unfolded some macro definitions,
but lost some alternative behaviour required for
high bitdepth encodes.
This causes the encoder to crash, see issue 884.
Jingning Han [Tue, 2 Dec 2014 18:50:39 +0000 (10:50 -0800)]
Enforce error resilient mode on in temporal svc real-time mode
This commit makes the codec automatically turn on error resilient
mode when using real-time mode for temporal scalable coding. It
fixes an enc/dec mismatch issue and re-enables the corresponding
unit test.
Jingning Han [Mon, 1 Dec 2014 01:40:58 +0000 (17:40 -0800)]
Turn off temporal svc unit test in RTC setting
A hidden enc/dec mismatch bug was accidentally triggered by
https://gerrit.chromium.org/gerrit/#/c/72247/
Adaptively adjust mode test kick-off thresholds in RTC coding
This commit temporarily turns off the broken unit tests to avoid
blocking other CLs while fixing.
Yunqing Wang [Wed, 26 Nov 2014 00:53:47 +0000 (16:53 -0800)]
vp9_ethread: calculate and save the tok starting address for tiles
Each tile's tok starting address is calculated before the encoding
process. These addresses are stored so that the same calculation
won't be done again in packing bit stream.
Yaowu Xu [Fri, 21 Nov 2014 19:55:27 +0000 (11:55 -0800)]
Separate rate_correction_factor for boosted GFs
When the golden frame is boosted, the rate correction factor is not
correlated well with other inter frames even in CBR mode. This commit
changes to use GF specific rate_correction_factor when gf_cbr_boost
is greater than 20%.
Jingning Han [Mon, 24 Nov 2014 22:40:42 +0000 (14:40 -0800)]
Adaptively adjust mode test kick-off thresholds in RTC coding
This commit allows the encoder to increase the mode test kick-off
thresholds if the previous best mode renders all zero quantized
coefficients, thereby saving motion search runs when possible.
The compression performance of speed -5 and -6 is down by -0.446%
and 0.591%, respectively. The runtime of speed -6 is improved by
10% for many test clips.
vidyo1, 1000 kbps
16578 b/f, 40.316 dB, 7873 ms -> 16575 b/f, 40.262 dB, 7126 ms
nik720p, 1000 kbps
33311 b/f, 38.651 dB, 7263 ms -> 33304 b/f, 38.629 dB, 6865 ms
dark720p, 1000 kbps
33331 b/f, 39.718 dB, 13596 ms -> 33324 b/f, 39.651 dB, 12000 ms
mmoving, 1000 kbps
33263 b/f, 40.983 dB, 7566 ms -> 33259 b/f, 40.978 dB, 7531 ms
Yunqing Wang [Fri, 21 Nov 2014 19:11:06 +0000 (11:11 -0800)]
vp9_ethread: modify VP9_COMP structure
This patch modified struct VP9_COMP. Created a struct ThreadData
to include data that need to be copied for each thread. In
multiple thread case, one thread processes one tile. all threads
share one copy of VP9_COMP,
(refer to VP9_COMP *cpi in the code)
but each thread has its own copy of ThreadData,
(refer to ThreadData *td in the code).
Therefore, within the scope of encode_tiles(), both cpi and td
need to be passed as function parameters.
In single thread case, the FRAME_COUNTS pointer in ThreadData
points to "counts" in VP9_COMMON.
Jingning Han [Mon, 24 Nov 2014 17:31:10 +0000 (09:31 -0800)]
Remove redundant intra mode penalty from vp9_pick_inter_mode
The intra mode penalty is covered by intra_cost_penalty. This
commit removes the other intra cost threshold, provided that the
constant 50 is negligible in normal rate-distortion cost.
Peter de Rivaz [Fri, 24 Oct 2014 07:37:39 +0000 (08:37 +0100)]
Refactored idct routines and headers
This change is made in preparation for a
subsequent patch which adds acceleration
for the highbitdepth transform functions.
The highbitdepth transform functions attempt
to use 16/32bit sse instructions where possible,
but fallback to using the C implementations if
potential overflow is detected. For this reason
the dct routines are made global so they can be
called from the acceleration functions in the
subsequent patch.
Jingning Han [Fri, 21 Nov 2014 20:18:53 +0000 (12:18 -0800)]
Rework forward txfm/quantization skip system in RTC coding mode
This commit allows more aggressive decision to skip forward
transform and quantization for luma component in RTC coding mode.
The chroma components remains going through the normal coding
routine, since they are not included in the non-RD mode search
process.
It reduces the runtime cost by 2% - 10%. In speed -6,
vidyo1 1000 kbps
16576 b/f, 40.281 dB, 8402 ms -> 16576 b/f, 40.323 dB, 7764 ms
nik720p 1000 kbps
33337 b/f, 38.622 dB, 7473 ms -> 33299 b/f, 38.660 dB, 7314 ms
dark720p 1000 kbps
33330 b/f, 39.785 dB, 13505 ms -> 33325 b/f, 39.714 dB, 13105 ms
The compression performance of speed -6 is improved by 0.44% in
PSNR and 1.31% in SSIM.
Paul Wilkins [Fri, 21 Nov 2014 02:32:44 +0000 (18:32 -0800)]
Remove rate component adjustment for AQ1
In AQ1 a rate adjustment was applied for blocks coded with a
deltaq. This tends to skew the partition selection and cause
rate overshoot.
For example, consider a 64x64 super block where some but not all
sub blocks are in a low q segment and some are in a high q segment.
The choice of Q when considering large partition and transform sizes
is defined by the lowest sub block segment id (currently this implies the
lowest Q). If some parts of the larger partition are very hard this will
cause a high rate component.
The correct behavior here is for the rd code to discard the large partition
choice and break down to sub blocks where some have low and some
have high Q. However the rate correction factor above mask the high
cost of coding at a larger partition size.
Johann [Thu, 20 Nov 2014 21:24:55 +0000 (13:24 -0800)]
Correctly initialize "ones" value in neon quantize
By using 0xff for a short it was not setting the high bits. When
comparing the output with vtst to find non-zero elements it was skipping
vaules which had no low bits set such as -512 / 0xFE00.
Using -8191 as the first element of coeff will generate this condition.