For the curious, here's a description of how mp4 timestamps work and
what is going wrong.
Terminology:
pts = presentation timestamp, when a frame is displayed
dts = decode timestamp, when a frame is decoded
cts = composition offset, pts - dts
empty edit = defines the pts of the first frame in an mp4 track
mp4 timestamps are computed from 3 primary values stored in the mp4
stream:
An "empty edit" in the track edit list
per frame duration
per frame cts
Here's where things get messy. How do you compute pts(N) and dts(N) for
some frame N from only the above 3 values in the mp4 file?
empty edit == pts(0) and is read from the mp4 file (ELST edit list in the EDTS box)
duration(N) is read from the mp4 file (STTS table)
cts(N) is read from the mp4 file (CTTS table)
We know cts(0) = pts(0) - dts(0) by definition of cts.
Both cts(0) and pts(0) are known, since they can be read from the mp4 file.
Therefore we can compute dts(0) = pts(0) - cts(0).
This is the step libav gets wrong!
libav computes dts(0) = pts(0), which shifts all frames by cts(0).
After that dts(N) = dts(0) + duration(0) + ... + duration(N-1)
And finally pts(N) = dts(N) + cts(N)
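The steps above can be sketched in a few lines of Python. This is only an
illustration of the arithmetic, not libav code: the function name, the
parameter names and the sample values are made up, and the
dts0_equals_pts0 flag just mimics the behaviour described above so the
shift is visible.

```python
from itertools import accumulate

def mp4_timestamps(empty_edit, durations, cts_offsets, dts0_equals_pts0=False):
    """Return (dts, pts) lists in track timescale units.

    empty_edit  -- pts(0), from the empty edit in the edit list
    durations   -- per frame duration, in decode order (STTS)
    cts_offsets -- per frame cts, in decode order (CTTS)
    dts0_equals_pts0 -- hypothetical switch: start from dts(0) = pts(0)
                        to reproduce the shift described in the text
    """
    if len(durations) != len(cts_offsets):
        raise ValueError("need one duration and one cts per frame")
    if not durations:
        return [], []

    # dts(0) = pts(0) - cts(0), unless the incorrect variant is requested
    dts0 = empty_edit if dts0_equals_pts0 else empty_edit - cts_offsets[0]

    # dts(N) = dts(0) + duration(0) + ... + duration(N-1)
    dts = list(accumulate(durations[:-1], initial=dts0))

    # pts(N) = dts(N) + cts(N)
    pts = [d + c for d, c in zip(dts, cts_offsets)]
    return dts, pts

# Invented sample: 4 frames in decode order I P B B, each 1000 ticks long
good_dts, good_pts = mp4_timestamps(1000, [1000] * 4, [1000, 3000, 0, 0])
bad_dts, bad_pts = mp4_timestamps(1000, [1000] * 4, [1000, 3000, 0, 0],
                                  dts0_equals_pts0=True)
print(good_dts, good_pts)
print(bad_dts, bad_pts)
```

With these sample values the first call gives pts(0) equal to the empty
edit, while the second gives every dts and pts exactly cts(0) = 1000 ticks
later.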