Re: Streaming of WebVTT

Hi Cyril,

On Fri, Jul 27, 2012 at 7:30 PM, Cyril Concolato
<cyril.concolato@telecom-paristech.fr> wrote:
>
> I don't *want* to discard the captions. The client has just not received
> them.

For the client not to receive them, the server has to not send them.
This means the server - which has all the cues in the WebVTT file -
has to create (i.e. encode) a new WebVTT file from the current one,
one that starts only with the cues from the random access point
onwards (i.e. it discards the earlier, no longer active cues).
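Just to make that concrete, here is a minimal sketch of that
server-side step in Python. It assumes the cues have already been
parsed into (start, end, text) tuples and leaves out cue identifiers,
settings and header metadata - an illustration, not a spec:

  # Build a WebVTT document for a client joining at time join_t.
  def webvtt_from(cues, join_t):
      def ts(t):  # seconds -> hh:mm:ss.mmm
          h, rem = divmod(t, 3600)
          m, s = divmod(rem, 60)
          return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
      # keep every cue that is still active at join_t or starts later
      kept = [c for c in cues if c[1] > join_t]
      lines = ["WEBVTT", ""]  # signature line + blank line
      for start, end, text in kept:
          lines.append(f"{ts(start)} --> {ts(end)}")
          lines.append(text)
          lines.append("")
      return "\n".join(lines)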

> I agree to pass the signature, header, unmodified cues, but that is
> not sufficient to produce the same result as if the client had joined from
> the beginning. The client should know how long it has to wait until the
> result becomes as if it had joined from the beginning. That's called the
> roll distance in MP4 but that does not exist in other transport formats.
> That's why you should be able to create specific RAP, if needed.
>
> Take the following example:
> cue 1 with a start=10, dur=10, text on line 1:
> |---------------------------|
> cue 2 with a start=15, dur=10, text on line 2:
>               |---------------------------|
> cue 3 with a start=21, dur=10, text on line 1:
>                                |---------------------------|
>
> If the client connects at T=12, it will receive the content of cue 2, and
> not having received the content of cue 1, some clients could display
> something between T=15 and T=20.

That's not what the server should do. It would have a list of all the
active cues at T=12, which includes cue 1, and thus it would send from
cue 1 onwards.
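
Using the example cues above with the webvtt_from() sketch from
earlier (again, purely illustrative):

  cues = [
      (10, 20, "text on line 1"),  # cue 1
      (15, 25, "text on line 2"),  # cue 2
      (21, 31, "text on line 1"),  # cue 3
  ]
  print(webvtt_from(cues, 12))
  # emits cue 1 first (it is still active at T=12), then cues 2 and 3,
  # so the client renders the same thing it would have rendered had it
  # joined from the beginning and played through to T=12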

> This will be partial text and maybe
> incorrect. Some other clients could decide to wait until cue 3 is received.
> But these clients have no guarantee that another line of text has not been
> set by a previous cue.

Indeed: the clients can't do anything about what they receive. It's
the server that has to send the correct data. Only the server will
know at what time a client connects and what video frames, audio
frames, and cues are active at that time and need to be sent out.

> That's what the signaling of RAP or roll distance is
> for.
>
> It would be good, as it is the case in most codecs, to be able to prepare
> the content, such that from time to time there is a RAP. One could prepare
> the content in the following way (that might not be the only option):
> cue 1 with a start=10, dur= 5, text on line 1:
> |---------------------------| /* duration could be reduced in time */
> cue 2 with a start=15, dur=10, text on line 1 & 2, RAP:
>               |------------|
> cue 2 with a start=20, dur= 5, text on line 2:
>                             |------------|
> cue 3 with a start=21, dur=10, text on line 1:
>                                |---------------------------|

WebVTT is a text format, so creating a RAP into it is trivial: start a
new WebVTT file at the desired point and include the cues that are
still active there.

If your problem is with files that have WebVTT encapsulated together
with audio and video packets, then indeed you may need to find a way
to multiplex the file such that e.g. WebVTT cues that last for "a long
time" get repeated (maybe in sync with the video's I frames) or for
simple a way to locate the currently still active WebVTT cues from the
encoding information. That's a problem that we had to solve for text
tracks in Ogg (see granulerate in
http://svn.annodex.net/standards/draft-pfeiffer-cmml-current.txt).
It's also something that had to be considered for WebVTT in WebM
(http://wiki.webmproject.org/webm-metadata/temporal-metadata/webvtt-in-webm
- IIUC cues are placed on the same cluster as the video frames).
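
One (again purely illustrative) way to think about the "repeat long
cues at I frames" idea: at every video keyframe, re-emit the cues that
are still active at that point, so a client that starts decoding at
that keyframe also picks up the overlapping cues. The keyframe times
below are assumed inputs, not anything mandated by WebVTT or WebM:

  # Pair each video keyframe with the cues still active at that time,
  # so a muxer can repeat them alongside the random access point.
  def cues_per_keyframe(cues, keyframe_times):
      plan = []
      for kf in keyframe_times:
          active = [c for c in cues if c[0] <= kf < c[1]]
          plan.append((kf, active))
      return plan

  # e.g. cues_per_keyframe(cues, [10, 15, 20, 25]) pairs the keyframe
  # at T=15 with both cue 1 and cue 2, so both get repeated there.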

> In my first email, I was suggesting a possible way (allowing cue settings as
> CSS properties and using top-level spans) but this might be problematic, I
> don't know. I don't care about the solution, but I think the requirement is
> valid.

The requirement is valid. The solution is, however, trivial for a
text file. It's not trivial when multiplexed with media data, but that
is a different problem.

HTH.

Cheers,
Silvia.