Re: [MSE] Establishing the Presentation Start Timestamp

On Jul 12, 2012, at 10:27 AM, Aaron Colwell wrote:

Hi,

While doing some testing with demultiplexed content that uses separate SourceBuffers for the audio & video streams, we ran into some issues around establishing the presentation start timestamp that I don't think are covered well in the existing spec text.

Section 6.1.3<http://dvcs.w3.org/hg/html-media/raw-file/tip/media-source/media-source.html#webm-start-timestamp> for WebM states:
The timestamp in the first block of the first media segment appended establishes the starting timestamp for the presentation timeline. All media segments appended after this first segment are expected to have timestamps greater than or equal to this timestamp.

Section 6.2.3<http://dvcs.w3.org/hg/html-media/raw-file/tip/media-source/media-source.html#iso-start-timestamp> has similar text for ISO.

This language is pretty straightforward if we are only dealing with a single SourceBuffer. When more than one SourceBuffer is involved, things get trickier if the first media segments appended to each SourceBuffer don't start with the same timestamp.

Say I have an audio stream that starts at timestamp 0, and a video stream that starts at 30 milliseconds. If I follow the existing language strictly, then whichever stream appends a media segment first establishes the presentation start time. This means the start time could be either 0 or 30 milliseconds. This raises several questions that I think need to be discussed.
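Concretely, the setup I'm testing looks something like this (fetchSegment is a made-up helper standing in for however the app retrieves segments; addSourceBuffer/appendBuffer are the names from the spec):

  declare function fetchSegment(url: string): Promise<ArrayBuffer>; // hypothetical helper

  const mediaSource = new MediaSource();

  mediaSource.addEventListener('sourceopen', async () => {
    // Demultiplexed content: one SourceBuffer per stream.
    const audioBuffer = mediaSource.addSourceBuffer('audio/webm; codecs="vorbis"');
    const videoBuffer = mediaSource.addSourceBuffer('video/webm; codecs="vp8"');

    // The first audio segment starts at timestamp 0; the first
    // video segment starts at 30 milliseconds. Under a strict
    // reading of the spec, whichever append happens first
    // establishes the presentation start timestamp.
    audioBuffer.appendBuffer(await fetchSegment('audio-0.webm')); // starts at 0
    videoBuffer.appendBuffer(await fetchSegment('video-0.webm')); // starts at 0.030
  });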

1. Should we expect the web application to be aware of this situation and always ensure that the earliest segment gets appended first?

No. We should require that all tracks share the same global timeline. In this case, that means the audio should start playing and there should be 30ms of blank screen before the first video frame is displayed. The metadata available to the JS app probably says both segments start at zero (both contain all the media from time zero onwards), so the JS is unlikely to know which to append first.


2. Should we wait until the first media segments are appended to all SourceBuffers in MediaSource.activeSourceBuffers before determining the start time and then simply take the earliest timestamp?

I forget how 'activeSourceBuffers' works exactly. Is it possible the app wants to set up separate SourceBuffers for the English and French audio tracks, but only the French is enabled and only media for the French is being appended?
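My assumption (which may be wrong) is that sourceBuffers lists everything the app created, while activeSourceBuffers lists only those with an enabled track:

  // Two alternate audio tracks; only French is enabled.
  const english = mediaSource.addSourceBuffer('audio/webm; codecs="vorbis"');
  const french = mediaSource.addSourceBuffer('audio/webm; codecs="vorbis"');

  mediaSource.sourceBuffers.length;        // 2 -- everything created
  mediaSource.activeSourceBuffers.length;  // 1 -- only the enabled track (my assumption)

If that's right, waiting on all activeSourceBuffers at least wouldn't block on the idle English buffer.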


3. If a media segment is appended that starts before the established presentation start time and continues past it, how should we handle that?
  - Should this trigger an error?
  - Should it be treated like an end overlap<http://dvcs.w3.org/hg/html-media/raw-file/tip/media-source/media-source.html#source-buffer-overlap-end> where the presentation start time acts like the end of a range already in the buffer? This would essentially keep everything after the first random access point that has a timestamp >= the presentation start timestamp.

It seems to me like this should be an error, because I can't think of a use-case where this wouldn't be a mistake on the part of the application.
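For what it's worth, here is a rough sketch of the two options, with all the types invented for illustration:

  interface Frame { timestamp: number; isRandomAccessPoint: boolean; }
  interface MediaSegment { start: number; end: number; frames: Frame[]; }

  function handleEarlySegment(segment: MediaSegment, presentationStart: number): Frame[] {
    if (segment.start < presentationStart && segment.end > presentationStart) {
      // Option A (my preference): reject the append as an error.
      throw new Error('segment starts before the presentation start time');

      // Option B (end-overlap style): keep only frames from the first
      // random access point at or after the presentation start time.
      // const rap = segment.frames.find(
      //   f => f.isRandomAccessPoint && f.timestamp >= presentationStart);
      // return rap ? segment.frames.filter(f => f.timestamp >= rap.timestamp) : [];
    }
    return segment.frames;
  }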


4. How close do the starting timestamps on the first media segments from each SourceBuffer need to be?
  - In this example I've shown them to be only 30 milliseconds apart, but would 0.5 seconds be acceptable? Would 2 seconds?
  - How much time do we allow here before we consider there to be missing data and playback can't start?
  - What happens if the gap is too large?

I think this is roughly the same question as 'what happens if I append a video segment which starts X ms after the end of the last video segment'?

If X <= one frame interval, this is definitely not a 'gap' and playback continues smoothly. If X > 1 second, this is definitely a gap and playback should stall (in the same way as it does today on a network outage).

For X values in between, I am not sure: implementations have to draw a line somewhere. A gap of multiple frame intervals could occur when switching frame rates. You might also get a gap of a couple of frame intervals when switching if you do wacky things with frame reordering around segment boundaries.

When looking at differences between audio and video, we need to tolerate differences up to the larger of the audio frame duration and the video frame interval.
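To make those thresholds concrete, a sketch of the decision I have in mind (the names and the one-second bound are illustrative, not proposed normative values):

  // Gap between consecutive segments appended to the same SourceBuffer.
  function classifySameStreamGap(gapSeconds: number, frameInterval: number):
      'continuous' | 'implementation-defined' | 'stall' {
    if (gapSeconds <= frameInterval) return 'continuous'; // definitely not a gap
    if (gapSeconds > 1.0) return 'stall';                 // definitely a gap; stall as on a network outage
    return 'implementation-defined';                      // e.g. frame-rate switches, reordering
  }

  // Allowed skew between the audio and video start times.
  function crossStreamTolerance(audioFrameDuration: number, videoFrameInterval: number): number {
    return Math.max(audioFrameDuration, videoFrameInterval);
  }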

If the gap is too large, the element just stays in the same state. Perhaps I appended video from 0s and audio from 2s because my network requests got re-ordered, and any millisecond now I am going to append the 0-2s audio. Playback should start when that 0-2s of audio is appended.
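In other words, the element would just keep waiting until every active buffer covers the start position, something like (helper invented for illustration):

  // Playback can begin once every active SourceBuffer has a
  // buffered range covering the presentation start time.
  function canStartPlayback(mediaSource: MediaSource, startTime: number): boolean {
    for (let i = 0; i < mediaSource.activeSourceBuffers.length; i++) {
      const ranges = mediaSource.activeSourceBuffers[i].buffered;
      let covered = false;
      for (let j = 0; j < ranges.length; j++) {
        if (ranges.start(j) <= startTime && ranges.end(j) > startTime) {
          covered = true;
          break;
        }
      }
      if (!covered) return false; // e.g. still waiting for the 0-2s audio
    }
    return true;
  }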


Any insights or suggestions would be greatly appreciated.

We have the same problem with push/popTimeOffset. Suppose I want your media above to appear at offset 200s in both audio and video source buffers. What I really want is for the audio to start at 200s and the video at 200.030s.

In this case the application knows more about the media's internal timeline than the media itself conveys: I know the video segment contains all the video from time 0s onwards, even though its first frame has a timestamp of 30ms. So I really want to provide the actual offset to be applied to the internal timestamps, rather than the source buffer time at which the next segment should start.
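To illustrate with a timestampOffset-style attribute (as in MSE drafts, standing in here for push/popTimeOffset), using the audioBuffer/videoBuffer from the earlier sketch: the app sets the same offset on both buffers and lets the 30ms skew come from the media itself.

  // One offset applied to the internal timestamps of both streams:
  // audio (internal time 0) lands at 200s, video (internal time
  // 30ms) lands at 200.030s, preserving their relative alignment.
  audioBuffer.timestampOffset = 200;
  videoBuffer.timestampOffset = 200;

  // Whereas saying "the next segment should start at 200s" on each
  // buffer separately would wrongly shift the video back by 30ms.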

Hmm - no clear answer here - I'll think about this some more.

…Mark


Aaron

Received on Thursday, 12 July 2012 21:32:44 UTC