Re: Streaming of WebVTT from Cyril Concolato on 2012-07-27 (public-texttracks@w3.org from July 2012)

From: Cyril Concolato <cyril.concolato@telecom-paristech.fr>
Date: Fri, 27 Jul 2012 11:30:59 +0200
To: Glenn Maynard <glenn@zewt.org>
CC: Silvia Pfeiffer <silviapfeiffer1@gmail.com>, public-texttracks@w3.org
Message-ID: <50125FD3.7030004@telecom-paristech.fr>
Hi Glenn,

On 7/27/2012 1:19 AM, Glenn Maynard wrote:
> On Wed, Jul 25, 2012 at 9:18 AM, Cyril Concolato 
> <cyril.concolato@telecom-paristech.fr> wrote:
>
>     Aside from the problem of random access points that I mentioned
>     in my previous email [1]
>     [1]
>     http://lists.w3.org/Archives/Public/public-texttracks/2012Jul/0000.html
>
>
> By the way, this mail doesn't seem to talk about random access.
Sorry, it was my answer to Silvia's answer to this mail. This mail was 
jumping to a possible solution. The correct link is: 
http://lists.w3.org/Archives/Public/public-texttracks/2012Jul/0002.html.

>
>
> On Thu, Jul 26, 2012 at 5:59 PM, Silvia Pfeiffer 
> <silviapfeiffer1@gmail.com> wrote:
>
>     That's the Syntax definition. In step 11 in the parser, though, it
>     says:
>     http://dev.w3.org/html5/webvtt/#parsing
>
>     "Header: Collect a sequence of characters that are not U+000A LINE
>     FEED (LF) characters. Let line be those characters, if any."
>
>     The way I read this is that while we haven't defined in the Syntax
>     that there should be a header, the parser allows putting extra
>     characters beyond the identifier line and skips everything until it
>     finds the two line terminators it requires (i.e. the empty line).
>
>
> Right.  The "WEBVTT" line is the signature, not the header. The file 
> format doesn't yet define a header format, but the parser is defined 
> in a way that allows it to be added later without breaking 
> compatibility (as long as people follow the spec, of course).
>
> When writing something that reads WebVTT files, be very sure to parse 
> it as specified by the parser--*not* by reading the syntax and coming 
> up with your own parsing algorithm.
It's a bit unusual for a standard to specify the parsing algorithm, but 
I can understand why. However, I'm not sure it's good practice to make 
future versions of the files syntactically invalid with respect to a 
previous version of the standard. The syntax should be aligned with the 
parsing algorithm. If you plan extension points, fine, but add them to 
the syntax. In this particular case, what does it cost you to add it? 
The only risk is that the spec becomes clearer.
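
To make the signature/header distinction concrete, here is a minimal 
sketch in Python of the behaviour Silvia describes: accept the "WEBVTT" 
signature line, then collect everything up to the first empty line as 
header lines without rejecting unknown ones. This is a simplification 
for illustration only, not the parsing algorithm from the spec, and the 
function name split_webvtt and the "X-Future" header line are made up 
for the example.

```python
def split_webvtt(text):
    """Split a WebVTT file into (signature, header_lines, body).

    Simplified sketch: NOT the spec's parser. The signature is the first
    line, header lines run up to the first empty line, the body is the rest.
    """
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    signature = lines[0]
    if signature.startswith("\ufeff"):
        signature = signature[1:]  # tolerate a leading byte order mark
    if not (signature == "WEBVTT"
            or signature.startswith(("WEBVTT ", "WEBVTT\t"))):
        raise ValueError("not a WebVTT file")
    header = []
    i = 1
    while i < len(lines) and lines[i] != "":
        header.append(lines[i])  # unknown header lines are kept, not rejected
        i += 1
    body = "\n".join(lines[i + 1:])
    return signature, header, body


# "X-Future: value" is a hypothetical header line, not defined anywhere.
sample = "WEBVTT\nX-Future: value\n\n00:10.000 --> 00:20.000\nhello\n"
signature, header, body = split_webvtt(sample)
print(header)
```

A reader written this way tolerates a file that carries header lines, 
even though the syntax section does not yet allow them, which is exactly 
the mismatch between syntax and parser discussed above.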

>
>     > Right. So what about the Random Access Point problem that I
>     > mentioned in my previous email? Don't you think it should be
>     > possible to rewrite any WebVTT file such that any (or some) cue
>     > doesn't need information from previous cues to be processed? Just
>     > like you can re-encode a video to have all (or some) frames be
>     > I-frames?
>
>
> If you want to send a stream of WebVTT data, but discard captions 
> which were shown before the user joined (eg. the user joins an hour 
> into a stream), simply discard all cues with an end time less than the 
> current time.  Retain all other data, including headers (pass these 
> through verbatim) and the signature, and don't modify the cues or cue 
> timings (eg. do not adjust them to a new zero point--that's the 
> timeline offset's job).
I don't *want* to discard the captions. The client simply has not 
received them. I agree with passing the signature, header, and 
unmodified cues, but that is not sufficient to produce the same result 
as if the client had joined from the beginning. The client should know 
how long it has to wait until the result is the same as if it had joined 
from the beginning. That's called the roll distance in MP4, but it does 
not exist in other transport formats. That's why you should be able to 
create specific RAPs, if needed.
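
For reference, the filtering rule Glenn proposes can be sketched in a 
few lines of Python. The Cue type and the function name are 
hypothetical, and times are plain seconds on the stream timeline. Note 
that this only helps when the sender still has the earlier cues 
available to pass along; it does not cover a client that never received 
them.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):  # hypothetical, minimal cue model
    start: int  # seconds on the stream timeline
    end: int
    text: str


def cues_for_late_joiner(cues: List[Cue], now: int) -> List[Cue]:
    """Keep every cue that has not already ended; timings stay untouched."""
    return [c for c in cues if c.end >= now]


cues = [Cue(10, 20, "cue 1"), Cue(15, 25, "cue 2"), Cue(21, 31, "cue 3")]
print([c.text for c in cues_for_late_joiner(cues, 22)])  # ['cue 2', 'cue 3']
```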

Take the following example:
cue 1 with a start=10, dur=10, text on line 1:
  |---------|
cue 2 with a start=15, dur=10, text on line 2:
       |---------|
cue 3 with a start=21, dur=10, text on line 1:
             |---------|

If the client connects at T=12, it will receive the content of cue 2, 
and not having received the content of cue 1, some clients could display 
something between T=15 and T=20. This will be partial text and maybe 
incorrect. Some other clients could decide to wait until cue 3 is 
received. But these clients have no guarantee that another line of text 
has not been set by a previous cue. That's what the signaling of RAP or 
roll distance is for.
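
To illustrate what would have to be signaled, here is a rough Python 
sketch of the roll distance a sender could compute for a given join 
time. It assumes each cue is transmitted once, at its start time, so a 
client joining at T has missed every cue that started before T. The Cue 
type and the function name are hypothetical, and times are plain 
seconds.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):  # hypothetical, minimal cue model
    start: int  # seconds
    end: int
    line: int
    text: str


def roll_distance(cues: List[Cue], join_time: int) -> int:
    """Seconds a late joiner must wait before its display matches that of
    a client which joined from the beginning.

    Assumption: a cue is sent once, at its start time, so cues that
    started before join_time and are still active have been missed.
    """
    missed_ends = [c.end for c in cues if c.start < join_time < c.end]
    return max(missed_ends, default=join_time) - join_time


cues = [
    Cue(10, 20, 1, "cue 1"),
    Cue(15, 25, 2, "cue 2"),
    Cue(21, 31, 1, "cue 3"),
]
print(roll_distance(cues, 12))  # 8: cue 1 was missed and ends at T=20
```

The client cannot compute this value itself, since it does not know 
which cues it missed. That is why it needs to be carried in the stream 
or made unnecessary by a RAP.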

It would be good, as is the case in most codecs, to be able to 
prepare the content such that from time to time there is a RAP. One 
could prepare the content in the following way (which might not be the 
only option):
cue 1 with a start=10, dur= 5, text on line 1: /* duration could be reduced in time */
  |----|
cue 2 with a start=15, dur= 5, text on line 1 & 2, RAP:
       |----|
cue 2 with a start=20, dur= 5, text on line 2:
            |----|
cue 3 with a start=21, dur=10, text on line 1:
             |---------|
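
A rough Python sketch of this kind of preparation follows. It is one 
possible approach under stated assumptions, not a proposal for the 
format: WebVTT has no RAP flag, so the rap field, the Cue type with its 
lines mapping, and the function insert_rap are all hypothetical. The 
sketch also assumes the RAP cue lasts until the earliest end time among 
the cues active at the RAP, and it does not handle two active cues 
writing to the same line.

```python
from typing import Dict, List, NamedTuple


class Cue(NamedTuple):  # hypothetical cue model, times in seconds
    start: int
    end: int
    lines: Dict[int, str]  # line number -> text shown on that line
    rap: bool = False      # hypothetical flag; WebVTT defines no such thing


def insert_rap(cues: List[Cue], t: int) -> List[Cue]:
    """Rewrite cues so that the cue starting at time t is self-contained."""
    active = [c for c in cues if c.start <= t < c.end]
    if not active:
        return list(cues)
    rap_end = min(c.end for c in active)
    merged: Dict[int, str] = {}
    for c in active:
        merged.update(c.lines)  # the RAP cue carries every visible line
    out = []
    for c in cues:
        if not (c.start <= t < c.end):
            out.append(c)                         # untouched by the RAP
            continue
        if c.start < t:
            out.append(c._replace(end=t))         # part shown before the RAP
        if c.end > rap_end:
            out.append(c._replace(start=rap_end))  # part left after the RAP
    out.append(Cue(t, rap_end, merged, rap=True))
    return sorted(out, key=lambda c: (c.start, c.end))


cues = [
    Cue(10, 20, {1: "cue 1"}),
    Cue(15, 25, {2: "cue 2"}),
    Cue(21, 31, {1: "cue 3"}),
]
for c in insert_rap(cues, 15):
    print(c.start, c.end, sorted(c.lines), "RAP" if c.rap else "")
```

With the three cues of the example and a RAP at T=15, this yields cues 
covering 10-15, 15-20 (RAP, lines 1 and 2), 20-25 and 21-31. A client 
joining at or before T=15 then needs nothing earlier than the RAP cue.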

In my first email, I was suggesting a possible way (allowing cue 
settings as CSS properties and using top-level spans) but this might be 
problematic, I don't know. I don't care about the solution, but I think 
the requirement is valid.

Regards,
Cyril

-- 
Cyril Concolato
Maître de Conférences/Associate Professor
Groupe Multimedia/Multimedia Group
Telecom ParisTech
46 rue Barrault
75 013 Paris, France
http://concolato.wp.mines-telecom.fr/
Received on Friday, 27 July 2012 09:31:42 UTC