Re: [MSE] Establishing the Presentation Start Timestamp from Mark Watson on 2012-07-18 (public-html-media@w3.org from July 2012)

From: Mark Watson <watsonm@netflix.com>
Date: Wed, 18 Jul 2012 18:55:43 +0000
To: Aaron Colwell <acolwell@google.com>
CC: "<public-html-media@w3.org>" <public-html-media@w3.org>
Message-ID: <CF04452D-CD47-41F2-BF02-7DB510E1F419@netflix.com>

On Jul 18, 2012, at 11:03 AM, Aaron Colwell wrote:

Hi Mark,

Comments inline...

On Thu, Jul 12, 2012 at 2:32 PM, Mark Watson <watsonm@netflix.com> wrote:

4. How close do the starting timestamps on the first media segments from each SourceBuffer need to be?
- In this example I've shown them to be only 30 milliseconds apart, but would 0.5 seconds be acceptable? Would 2 seconds?
- How much time do we allow here before we consider there to be missing data and playback can't start?
- What happens if the gap is too large?

I think this is roughly the same question as 'what happens if I append a video segment which starts X ms after the end of the last video segment?'

If X <= one frame interval, this is definitely not a 'gap' and playback continues smoothly. If X > 1 second, this is definitely a gap and playback should stall (in the same way as it does today on a network outage).

For X values in between, I am not sure: implementations have to draw a line somewhere. A gap of multiple frame intervals could occur when switching frame rate. You might also get a gap of a couple of frame intervals when switching if you do wacky things with frame reordering around segment boundaries.

When looking at differences between audio and video, we need to tolerate differences of up to the larger of the audio frame size and the video frame interval.

If the gap is too large, the element just stays in the same state. Perhaps I append video from 0s and audio from 2s because my network requests got re-ordered, and any millisecond now I am going to append the 0-2s audio. Playback should start when that 0-2s audio is appended.
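
A rough sketch of the thresholds described above, in Python with integer milliseconds. The function and constant names are hypothetical and nothing here is spec text; the only numbers taken from this thread are "one frame interval" and "1 second", and the middle range is deliberately left undecided.

```python
from enum import Enum


class GapVerdict(Enum):
    CONTIGUOUS = "contiguous"              # playback continues smoothly
    IMPLEMENTATION_DEFINED = "undecided"   # somewhere a line has to be drawn
    GAP = "gap"                            # playback should stall


# Assumption: anything above 1 second is definitely a gap, as stated in the thread.
DEFINITE_GAP_MS = 1000


def classify_gap(gap_ms: int, frame_interval_ms: int) -> GapVerdict:
    """Classify a gap of gap_ms between the end of one segment and the start of the next."""
    if gap_ms <= frame_interval_ms:
        return GapVerdict.CONTIGUOUS
    if gap_ms > DEFINITE_GAP_MS:
        return GapVerdict.GAP
    return GapVerdict.IMPLEMENTATION_DEFINED


def av_start_tolerance_ms(audio_frame_ms: int, video_frame_interval_ms: int) -> int:
    """Tolerated audio/video start difference: the larger of the two frame durations."""
    return max(audio_frame_ms, video_frame_interval_ms)
```

With a video frame interval of about 33 ms, the 30 ms difference in the original example would count as contiguous, a 2 s difference as a gap, and 0.5 s would fall in the undecided range.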

[acolwell] I agree. We need to come up with some spec text for this and then we can debate the merits of these various magic numbers. Care to volunteer for this? :)

Ok, assign me a bug.

Any insights or suggestions would be greatly appreciated.

We have the same problem with push/popTimeOffset. Suppose I want your media above to appear at offset 200s in both audio and video source buffers. What I really want is for the audio to start at 200s and the video at 200.030s.

In this case the application knows better than the media what the internal media times are. I know that the video segment has all the video from time 0s, even though the first frame is at 30ms. I really want to provide the actual offset to be applied to the internal timestamps, rather than providing the source buffer time that the next segment should start at.
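
To make the difference concrete, here is a small Python sketch using integer milliseconds. It assumes the "start time" style maps each buffer's first internal timestamp onto the requested time, which is my reading of the proposal rather than agreed behaviour; the helper names are made up for illustration.

```python
def offset_from_start_time(target_start_ms: int, first_timestamp_ms: int) -> int:
    """Offset implied when the app names the buffer time the next segment should start at."""
    return target_start_ms - first_timestamp_ms


def place(first_timestamp_ms: int, offset_ms: int) -> int:
    """Position of the first sample on the source buffer timeline after applying an offset."""
    return first_timestamp_ms + offset_ms


AUDIO_FIRST_MS = 0      # first audio sample at 0 s
VIDEO_FIRST_MS = 30     # first video frame at 30 ms
TARGET_MS = 200_000     # media should appear at 200 s

# Style A: each buffer is told "next segment starts at 200 s".
audio_a = place(AUDIO_FIRST_MS, offset_from_start_time(TARGET_MS, AUDIO_FIRST_MS))
video_a = place(VIDEO_FIRST_MS, offset_from_start_time(TARGET_MS, VIDEO_FIRST_MS))

# Style B: the app supplies the actual offset, identical for both buffers.
audio_b = place(AUDIO_FIRST_MS, TARGET_MS)
video_b = place(VIDEO_FIRST_MS, TARGET_MS)
```

In style A both tracks land at exactly 200 s, so the 30 ms relationship between audio and video is lost. In style B the audio lands at 200 s and the video at 200.030 s, which is what I want.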

[acolwell] One way I think we could get around this is to mandate that the media segments actually have a start time of 0. In WebM there is a Cluster timestamp and then all blocks are relative to this timestamp. If the Cluster timestamp is 0 and the first frame in the cluster is at 30ms then there is enough information for the UA to "do the right thing". I'm not sure if a similar mechanism exists in ISO.

Not really - the rather complex combination of decode times, composition offsets and edit lists results in a presentation timestamp for each sample on a global timeline (shared across all bitrates etc.). But if the timestamp of the first sample is X there is nothing to say, for example, "there are no other samples between time Y (< X) and X".

The application that creates the demuxed files just needs to make sure the separate files both have the same segment start time.

That's not always possible because of skew caused by audio frame durations being different from video frame intervals.

I think we need to say that all source buffers share a common global timeline and that timestamps in the media segments must be mapped to that timeline in a way that is common across source buffers. This means any offset applied to media internal timestamps needs to be the same across source buffers. It also means that establishing such offsets needs to be done explicitly by the application or, if they are derived from timestamps in the media, it needs to be done in a consistent way (in terms of which of audio and video the time offset is taken from).

I think this has implications for the push/pop time offset as well. They should be global methods which establish a global offset based on the next-appended segment(s).

We do also need a way to handle the user starting content in the middle. If I have a 30 min content item and the user wants to start at minute 10 (because of a bookmark, say) then I should be able to start appending data at position 10min in the source buffer timeline. The seek bar needs to show the playback starting at minute 10 and if the user seeks backwards this should be ok.

pushOffset isn't right for this case because the media internal timestamps are correct: the first segment appended really does start at timestamp 10min.

I wonder whether we should just say that the source buffer timestamp starts at zero and not derive a start point from the appended media. If the media internal timestamp corresponding to the start of the content is not zero, you need to explicitly handle this with a pushOffset call?

Applications could also just append the segments to a "scratch" SourceBuffer to see what the initial timestamp is and then use that information to compute the proper offset to apply. It's not the greatest solution, but it does provide a way for people to handle this if they aren't as careful about how they create their demuxed content.
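
As a sketch of that computation (Python, integer milliseconds): given the initial timestamps observed per track, derive one offset and apply it to every source buffer. Taking the earliest track as the reference when none is named is an assumption on my part; the only requirement raised in this thread is that the choice is consistent. The function name and the example timestamps are hypothetical.

```python
from typing import Dict, Optional


def shared_offset_ms(
    initial_timestamps_ms: Dict[str, int],
    desired_start_ms: int,
    reference_track: Optional[str] = None,
) -> int:
    """Return a single offset to apply to every track's internal timestamps.

    If reference_track is None, the earliest observed timestamp is used as the
    reference (an assumption, not something the thread settles).
    """
    if reference_track is None:
        reference = min(initial_timestamps_ms.values())
    else:
        reference = initial_timestamps_ms[reference_track]
    return desired_start_ms - reference


# Initial timestamps as they might be observed via a "scratch" SourceBuffer.
observed = {"audio": 10_000, "video": 10_030}
offset = shared_offset_ms(observed, desired_start_ms=0)
placed = {track: ts + offset for track, ts in observed.items()}
```

Because the same offset is applied to both tracks, the audio ends up at 0 ms and the video at 30 ms, so the skew between them is preserved.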

Aaron

Hmm - no clear answer here - I'll think about this some more.

…Mark

Received on Wednesday, 18 July 2012 18:56:12 UTC