Alpha Lam [Tue, 9 Aug 2011 19:59:45 +0000 (20:59 +0100)]
Copy less when active map is in use
When an active map is specified and the current frame is not a key frame,
golden frame, or altref frame, copy only the active regions.
This reduces encoding time by as much as 19% on the test
system when real-time encoding is used. This is particularly useful
when the frame size is large (e.g. 2560x1600) and there are only a few
active macroblocks.
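A minimal sketch of copying only active macroblocks, assuming a row-major map with one byte per 16x16 luma macroblock (1 = active). The function name and map layout are hypothetical illustrations, not the encoder's actual code, and chroma planes are left out for brevity.

```c
#include <string.h>

#define MB_SIZE 16 /* luma macroblock size in pixels */

/* Hypothetical helper: copy only the 16x16 luma blocks flagged active in
 * active_map (row-major, mb_rows x mb_cols, 1 = active, 0 = inactive).
 * Inactive blocks in dst keep whatever pixels they already held. */
void copy_active_mbs(unsigned char *dst, int dst_stride,
                     const unsigned char *src, int src_stride,
                     const unsigned char *active_map,
                     int mb_rows, int mb_cols) {
  for (int mb_row = 0; mb_row < mb_rows; ++mb_row) {
    for (int mb_col = 0; mb_col < mb_cols; ++mb_col) {
      if (!active_map[mb_row * mb_cols + mb_col])
        continue; /* skip the copy for inactive macroblocks */
      const unsigned char *s =
          src + mb_row * MB_SIZE * src_stride + mb_col * MB_SIZE;
      unsigned char *d =
          dst + mb_row * MB_SIZE * dst_stride + mb_col * MB_SIZE;
      for (int y = 0; y < MB_SIZE; ++y)
        memcpy(d + y * dst_stride, s + y * src_stride, MB_SIZE);
    }
  }
}
```

With a large frame and few active macroblocks, the number of memcpy calls scales with the active count rather than with the frame area.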
John Koleszar [Wed, 3 Aug 2011 20:12:12 +0000 (16:12 -0400)]
Fix source buffer selection
This patch fixes a bug in the interaction between the recode loop and
spatial resampling. If the codec was in a spatial resampling state,
and a subsequent iteration of the recode loop disables resampling,
then the source buffer must be reset to the unscaled source.
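A minimal sketch of the fix's idea, assuming the encoder keeps both an unscaled and a scaled source: the source is re-selected from the current resampling decision on every recode iteration, instead of carrying over the previous iteration's choice. The type and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the encoder's frame buffer type. */
typedef struct {
  const unsigned char *data;
  int width, height;
} frame_buf;

/* Hypothetical helper: pick the source for the current recode iteration.
 * Called every iteration, so disabling resampling falls back to the
 * unscaled source rather than leaving the scaled one in place. */
const frame_buf *select_source(int resampling_enabled,
                               const frame_buf *unscaled,
                               const frame_buf *scaled) {
  return resampling_enabled ? scaled : unscaled;
}
```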
Yunqing Wang [Wed, 3 Aug 2011 15:51:07 +0000 (11:51 -0400)]
Adjust half-pixel only search
Changed motion search in vp8_find_best_half_pixel_step() to match
vp8_find_best_sub_pixel_step(), which checks 5 points
instead of 8. This only affects real-time mode with
cpu-used >= 9. Tests showed a 2% encoding speedup with
a quality (PSNR) loss of up to 0.5%.
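A minimal sketch of a 5-point refinement step: the four horizontal/vertical neighbours, then one diagonal chosen from the cheaper horizontal and the cheaper vertical neighbour. This is an assumption about how the fifth point is picked; the cost callback, types, and example cost are hypothetical, and the real functions work on 1/8-pel motion vectors with SAD/variance costs.

```c
/* Cost of a candidate position; ctx carries caller data. */
typedef unsigned int (*cost_fn)(int row, int col, void *ctx);

typedef struct {
  int row, col;
  unsigned int cost;
} search_result;

/* Keep the candidate if it is strictly cheaper than the current best. */
static void consider(search_result *best, int row, int col, unsigned int c) {
  if (c < best->cost) {
    best->row = row;
    best->col = col;
    best->cost = c;
  }
}

/* Hypothetical 5-point step around (row, col) instead of all 8 neighbours. */
search_result five_point_step(int row, int col, int step,
                              cost_fn cost, void *ctx) {
  search_result best = { row, col, cost(row, col, ctx) };
  unsigned int left = cost(row, col - step, ctx);
  unsigned int right = cost(row, col + step, ctx);
  unsigned int up = cost(row - step, col, ctx);
  unsigned int down = cost(row + step, col, ctx);

  consider(&best, row, col - step, left);
  consider(&best, row, col + step, right);
  consider(&best, row - step, col, up);
  consider(&best, row + step, col, down);

  /* Fifth point: the diagonal in the quadrant of the cheaper neighbours. */
  int dc = (left < right) ? -step : step;
  int dr = (up < down) ? -step : step;
  consider(&best, row + dr, col + dc, cost(row + dr, col + dc, ctx));
  return best;
}

/* Example cost: squared distance to a target; ctx points to {row, col}. */
unsigned int squared_distance_cost(int row, int col, void *ctx) {
  const int *target = (const int *)ctx;
  int dr = row - target[0], dc = col - target[1];
  return (unsigned int)(dr * dr + dc * dc);
}
```

Skipping three of the four diagonals is where the speedup comes from; the cost is that the best point may occasionally lie on a diagonal that is never evaluated.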
John Koleszar [Wed, 3 Aug 2011 13:20:37 +0000 (09:20 -0400)]
Fix building of static libs on universal-darwin
The static libs should not be built from sources during the top level
of a universal build. This regression was introduced in commit 495b241fa6b03345baf2b2f39aa8c06c735fccc2, which made the static
libs selectable under CONFIG_STATIC.
John Koleszar [Thu, 28 Jul 2011 13:17:32 +0000 (09:17 -0400)]
Convert rc_max_intra_bitrate_pct to control
Since this is the only ABI incompatible change since the last release,
convert it to use the control interface instead. The member of the
configuration struct is replaced with the VP8E_SET_MAX_INTRA_BITRATE_PCT
control.
More significant API changes were expected to be forthcoming when this
control was first introduced, and while they continue to be expected,
it's not worth breaking compatibility for only this change.
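With libvpx, an application sets this limit through the control interface, e.g. `vpx_codec_control(&codec, VP8E_SET_MAX_INTRA_BITRATE_PCT, pct)`. The self-contained sketch below mimics that pattern with hypothetical names to show why a control avoids the ABI break: new settings become new IDs on a variadic call, and the public configuration struct's layout does not change.

```c
#include <stdarg.h>

/* Hypothetical encoder context; the real libvpx types differ. */
typedef struct {
  unsigned int max_intra_bitrate_pct;
} enc_ctx;

/* Hypothetical control IDs. Appending IDs does not alter any struct layout
 * that applications compile against. */
enum enc_ctrl_id {
  CTRL_SET_MAX_INTRA_BITRATE_PCT = 1
};

/* Returns 0 on success, -1 for an unknown control ID. */
int enc_control(enc_ctx *ctx, int id, ...) {
  va_list args;
  int rc = 0;
  va_start(args, id);
  switch (id) {
    case CTRL_SET_MAX_INTRA_BITRATE_PCT:
      ctx->max_intra_bitrate_pct = va_arg(args, unsigned int);
      break;
    default:
      rc = -1; /* unknown controls are rejected, not silently ignored */
      break;
  }
  va_end(args);
  return rc;
}
```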
Yunqing Wang [Fri, 22 Jul 2011 20:01:11 +0000 (16:01 -0400)]
Preload reference area in sub-pixel motion search (real-time mode)
This change applies the same idea as the change "Preload reference area
to an intermediate buffer in sub-pixel motion search." The changes
were made to the vp8_find_best_sub_pixel_step() and
vp8_find_best_half_pixel_step() functions, which are called when
speed >= 5. Test result (using tulip clip):
Yunqing Wang [Tue, 28 Jun 2011 13:14:13 +0000 (09:14 -0400)]
Preload reference area to an intermediate buffer in sub-pixel motion search
In sub-pixel motion search, the search range is small (+/- 3 pixels).
Preload the whole search area from the reference buffer into a 32-byte
aligned buffer, then load reference data from this buffer during the
search. This keeps data in cache and reduces the cache-line-crossing
penalty. For the tulip clip, tests on an Intel Core2 Quad machine (Linux)
showed these encoder speed improvements:
3.4% at --rt --cpu-used=-4
2.8% at --rt --cpu-used=-3
2.3% at --rt --cpu-used=-2
2.2% at --rt --cpu-used=-1
A test on an Atom notebook showed only a 1.1% speed improvement (speed=-4).
A test on a Xeon machine also showed less improvement, since unaligned
data access latency is greatly reduced in newer cores.
Next, I will apply a similar idea to the other 2 sub-pixel search
functions for encoding speed > 4.
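A minimal sketch of the preload, assuming a 16x16 block, a +/- 3 pixel range, and a reference frame with enough border around the block. Names and sizes are hypothetical; a real sub-pixel search also needs extra rows and columns for the interpolation filter taps, which this sketch ignores. `_Alignas` requires C11.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define SEARCH_RANGE 3 /* +/- 3 pixels */
#define PRELOAD_DIM (BLOCK_SIZE + 2 * SEARCH_RANGE) /* 22 */
#define PRELOAD_STRIDE 32 /* every row starts 32-byte aligned */

/* Hypothetical intermediate buffer holding the whole search area. */
typedef struct {
  _Alignas(32) uint8_t data[PRELOAD_DIM * PRELOAD_STRIDE];
} preload_buf;

/* ref points at the block's top-left pixel in the reference frame, which
 * must have at least SEARCH_RANGE pixels available on every side.
 * Returns the block's top-left pixel inside buf; candidates are then read
 * at offsets dy * PRELOAD_STRIDE + dx for dy, dx in [-3, 3]. */
const uint8_t *preload_search_area(preload_buf *buf, const uint8_t *ref,
                                   int ref_stride) {
  const uint8_t *src = ref - SEARCH_RANGE * ref_stride - SEARCH_RANGE;
  for (int y = 0; y < PRELOAD_DIM; ++y)
    memcpy(buf->data + y * PRELOAD_STRIDE, src + y * ref_stride, PRELOAD_DIM);
  return buf->data + SEARCH_RANGE * PRELOAD_STRIDE + SEARCH_RANGE;
}
```

The copy is paid once per block, after which every candidate read comes from a small, aligned, cache-resident buffer instead of the full-size reference frame.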
Mark ARM asm objects as allowing a non-executable stack.
This adds the magic .note.GNU-stack section at the end of each ARM
asm file (when built with gas), indicating that a non-executable
stack is allowed.
Without this section, the linker will assume the object requires an
executable stack by default, forcing an executable stack for the
entire program.
This is done by expanding the luma row to 32-byte alignment, since
there is currently a bunch of code that assumes that
uv_stride == y_stride/2 (see, for example, vp8/common/postproc.c,
common/reconinter.c, common/arm/neon/recon16x16mb_neon.asm,
encoder/temporal_filter.c, and possibly others; I haven't done a
full audit).
It also replaces the hardcoded border of 16 in a number of
encoder buffers with VP8BORDERINPIXELS (currently 32), as the
chroma rows start at an offset of border/2.
Together, these two changes have the nice advantage that simply
dumping the frame memory as a contiguous blob produces a valid,
if padded, image.
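A minimal sketch of the resulting stride arithmetic, assuming the luma stride is the width plus both borders rounded up to 32 bytes. If the luma stride is a multiple of 32, then y_stride / 2 is a multiple of 16, so the `uv_stride == y_stride/2` assumption holds while chroma rows stay 16-byte aligned. The struct and helper names are hypothetical; only VP8BORDERINPIXELS and its current value come from the text.

```c
#define VP8BORDERINPIXELS 32 /* "currently 32", per the text */

/* Round x up to a multiple of 32. */
static int align32(int x) { return (x + 31) & ~31; }

/* Hypothetical summary of one frame's plane layout. */
typedef struct {
  int y_stride, uv_stride;
  int y_border, uv_border;
} plane_layout;

plane_layout compute_layout(int width) {
  plane_layout l;
  l.y_border = VP8BORDERINPIXELS;
  l.uv_border = VP8BORDERINPIXELS / 2; /* chroma rows start at border/2 */
  l.y_stride = align32(width + 2 * l.y_border);
  l.uv_stride = l.y_stride / 2; /* exact, since y_stride is a multiple of 32 */
  return l;
}
```

For a 176-pixel-wide frame, 176 + 64 = 240 rounds up to a luma stride of 256, which gives a chroma stride of 128.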
This version of the check doesn't work with generic-gnu, and figuring
out the correct symbol version at configure time is probably more work
than this is worth. May revisit in the future.
John Koleszar [Wed, 29 Jun 2011 15:41:50 +0000 (11:41 -0400)]
Improved 1-pass CBR rate control
This patch attempts to improve the handling of CBR streams with
respect to the short term buffering requirements. The "buffer level"
is changed to be an average over the rc buffer, rather than a long
running average. Overshoot is also tracked over the same interval
and the golden frame targets suppressed accordingly to correct for
overly aggressive boosting.
Testing shows that this is fairly consistently positive in one
metric or another -- some clips that show significant decreases
in quality have better buffering characteristics, others show
improvements in both.
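A minimal sketch of a windowed average, assuming the rc buffer interval can be expressed as a number of frames. The same helper could track overshoot over that interval. All names are hypothetical, and this is not the rate controller's actual code.

```c
#define RC_WINDOW_MAX 256

/* Hypothetical ring buffer averaging the last `window` per-frame samples,
 * instead of a long-running average over the whole stream. */
typedef struct {
  long long samples[RC_WINDOW_MAX];
  int window; /* frames spanned by the rc buffer, 1..RC_WINDOW_MAX */
  int count;  /* samples stored so far, capped at window */
  int next;   /* ring-buffer write position */
  long long sum;
} windowed_avg;

void windowed_avg_init(windowed_avg *w, int window) {
  w->window = window;
  w->count = 0;
  w->next = 0;
  w->sum = 0;
}

/* Adds one per-frame sample and returns the average over the window. */
long long windowed_avg_push(windowed_avg *w, long long sample) {
  if (w->count == w->window)
    w->sum -= w->samples[w->next]; /* window full: drop the oldest sample */
  else
    ++w->count;
  w->samples[w->next] = sample;
  w->sum += sample;
  w->next = (w->next + 1) % w->window;
  return w->sum / w->count;
}
```

Because old frames fall out of the window, a burst of overshoot affects the buffer level only for the length of the rc buffer rather than for the rest of the stream.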
Optimized the C code of the following functions:
- vp8_tokenize_mb
- tokenize1st_order_b
- tokenize2nd_order_b
Gives ~1-5% speed-up for RT encoding on Cortex-A8/A9
depending on encoding parameters.
John Koleszar [Mon, 11 Jul 2011 15:25:25 +0000 (11:25 -0400)]
Disable __longjmp_chk protection
glibc implements some checking on longjmp() calls by replacing it with
an internal function __longjmp_chk(), when FORTIFY_SOURCE is defined.
This can be problematic when compiling the library under one version of
glibc and running it under another. Work around this issue for the one
symbol affected for now, before taking out the undef hammer.
Yunqing Wang [Wed, 13 Jul 2011 18:51:02 +0000 (14:51 -0400)]
Add improvements made in good-quality mode to real-time mode
Several improvements made in good-quality mode can be carried over
to real-time mode to speed up encoding at speeds 1, 2, and 3
with a small quality loss. Tests using the tulip clip showed:
Yunqing Wang [Thu, 7 Jul 2011 15:21:41 +0000 (11:21 -0400)]
Adjust full-pixel clamping and motion vector limit calculation
Do mvp clamping in full-pixel precision instead of 1/8-pixel
precision to avoid errors caused by the right-shift operation.
This also further fixes the motion vector limit calculation from change
b7480454706a6b15bf091e659cd6227ab373c1a6.
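A minimal sketch of clamping a predicted vector in full-pixel units, assuming the prediction is stored in 1/8-pel units and the limits are already in full pixels. The conversion uses an explicit floor division so the result does not depend on how `>>` treats negative values. The names are hypothetical and this is not the encoder's exact code.

```c
typedef struct {
  int row, col;
} mv_t;

/* Portable floor division by 8 (rounds toward minus infinity). */
static int floor_div8(int v) {
  return (v >= 0) ? v / 8 : -((-v + 7) / 8);
}

/* Hypothetical helper: convert a 1/8-pel prediction to full pixels first,
 * then clamp against full-pixel limits, so the result always lies within
 * [min, max] in the units the full-pixel search uses. */
mv_t clamp_mvp_fullpel(mv_t pred_q3, int row_min, int row_max,
                       int col_min, int col_max) {
  mv_t fp = { floor_div8(pred_q3.row), floor_div8(pred_q3.col) };
  if (fp.row < row_min) fp.row = row_min;
  if (fp.row > row_max) fp.row = row_max;
  if (fp.col < col_min) fp.col = col_min;
  if (fp.col > col_max) fp.col = col_max;
  return fp;
}
```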
Attila Nagy [Fri, 10 Jun 2011 11:10:21 +0000 (14:10 +0300)]
New loop filter interface
Separate the simple filter, with a reduced number of parameters.
MB filter level picking is based on a precalculated table; the level
table is updated for each frame. Inside and edge limits are
precalculated and updated only when sharpness changes. The HEV
threshold is constant.
ARM targets use scalars; other targets use vectors.
This change works only with --target=generic-gnu.
All other targets have to be updated!
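A minimal sketch of a per-level limit table that is rebuilt only when sharpness changes. The interior and edge limit formulas follow the loop filter description in RFC 6386 (the VP8 data format guide). The struct and function names are hypothetical, and the HEV threshold is omitted because the text says it is constant.

```c
#define MAX_LOOP_FILTER 63

/* Hypothetical precalculated limits, indexed by filter level. */
typedef struct {
  int sharpness; /* sharpness the table was built for; -1 = not built yet */
  unsigned char interior_limit[MAX_LOOP_FILTER + 1];
  unsigned char mbedge_limit[MAX_LOOP_FILTER + 1];
  unsigned char subblock_edge_limit[MAX_LOOP_FILTER + 1];
} lf_limits;

/* Rebuilds the table only if sharpness differs from the cached value. */
void lf_update_limits(lf_limits *t, int sharpness) {
  if (t->sharpness == sharpness)
    return; /* unchanged sharpness: reuse the cached table */
  for (int level = 0; level <= MAX_LOOP_FILTER; ++level) {
    int interior = level;
    if (sharpness) {
      interior >>= (sharpness > 4) ? 2 : 1;
      if (interior > 9 - sharpness)
        interior = 9 - sharpness;
    }
    if (!interior)
      interior = 1;
    t->interior_limit[level] = (unsigned char)interior;
    t->mbedge_limit[level] = (unsigned char)((level + 2) * 2 + interior);
    t->subblock_edge_limit[level] = (unsigned char)(level * 2 + interior);
  }
  t->sharpness = sharpness;
}
```

Each frame then only needs a table lookup by filter level. The loop over all 64 levels runs only on the rare frames where sharpness changes.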
Yunqing Wang [Thu, 30 Jun 2011 15:20:13 +0000 (11:20 -0400)]
Bug fix in motion vector limit calculation
Motion vector limits are calculated using right shifts, which
could give wrong results for negative numbers. James Berry's
test on one clip showed that the encoder produced some artifacts.
This change fixes that.
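A small illustration of why a right shift and a division can disagree on negative values. In C, `>>` on a negative signed operand is implementation-defined; GCC and Clang sign-extend, which rounds toward minus infinity, while `/` truncates toward zero. The helper names are hypothetical. The text does not say which rounding the fix adopted, so this only demonstrates the pitfall.

```c
/* 1/8-pel to full-pel via shift: floors on GCC/Clang for negative input. */
int to_fullpel_shift(int v_q3) { return v_q3 >> 3; }

/* 1/8-pel to full-pel via division: truncates toward zero (C99 and later). */
int to_fullpel_div(int v_q3) { return v_q3 / 8; }
```

For -17 (in 1/8 pel), the shift gives -3 while the division gives -2. A limit computed one way and compared against values computed the other way can be off by one full pixel.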