Re: Inband styling (was Re: Evidence of 'Wide Review' needed for VTT) from Cyril Concolato on 2015-10-09 (public-texttracks@w3.org from October 2015)

From: Cyril Concolato <cyril.concolato@telecom-paristech.fr>
Date: Fri, 9 Oct 2015 18:21:57 +0200
To: public-texttracks@w3.org
Message-ID: <5617E9A5.60001@telecom-paristech.fr>
On 09/10/2015 16:09, Nigel Megitt wrote:
> Hi Cyril,
>
> On 09/10/2015 14:55, "Cyril Concolato"
> <cyril.concolato@telecom-paristech.fr> wrote:
>
>> Hi Philip, all,
>>
>> On 26/02/2015 03:18, Philip Jägenstedt wrote:
>>> On Thu, Feb 26, 2015 at 12:13 AM, David Singer <singer@apple.com> wrote:
>>>>> On Feb 24, 2015, at 18:57 , Philip Jägenstedt <philipj@opera.com>
>>>>> wrote:
>>>>>
>>>>> I think I agree with Silvia here, a STYLE block seems more natural
>>>>> than putting it in the header. Note that we could still, if there are
>>>>> strong reasons, drop any such blocks that come after any cue. It gives
>>>>> us some flexibility with the streaming case, even if we don't use it
>>>>> now.
>>>> I don’t mind if it’s a block or part of the header, as long as it has
>>>> to occur before the first cue. The point is that at the moment one can
>>>> random access into a VTT file (not load it all from the beginning),
>>>> once one has the ‘header’.  I don’t want to lose that.  In text, one
>>>> might lose cues that have an end time that overlaps where you random
>>>> access to, but in MP4 packing we even deal with that.
>>> What does this mean? The parser consumes all data from beginning to
>>> end as a stream. Perhaps it could be proven that if you seek to a
>>> random point and put the tokenizer+parser in a particular state then
>>> the cues that it will output will be a subset of the cues output for a
>>> sequential parse, but this isn't a property of WebVTT files I've ever
>>> even considered.
>>>
>>> I think it would be fine to require style blocks to precede any cues,
>>> but I think I'm maybe missing the actual rationale...
>> When storing a WebVTT file in an MP4 track, the WebVTT file is parsed,
>> the header is stored in a place that is not timed and the cues are
>> stored in timed places. This storage simplifies file editing (as timed
>> cues may be removed, including the first one, or added even before the
>> first one, and without caring about the header). This helps also
>> playback from non-0 time because the MP4 demux will conceptually create
>> a WebVTT file by concatenating the header followed by the cues starting
>> from the requested time. When doing DASH streaming, the header is
>> also provided upfront, in the initialization segment, which means that
>> all WebVTT-in-MP4 media segments are randomly accessible, which is
>> simple and easy to handle. Inserting untimed styles between cues (i.e.
>> styles that are also valid for cues located earlier in the file) would
>> require changes to this storage and to the associated implementations.
> I may be repeating your point here - I'm not sure
Actually, you're not repeating, but making an interesting point. Let me 
clarify (see below).
>   - but if you have a
> scheme that requires styles to be in a header and doesn't facilitate those
> styles being augmented on the fly e.g. by adding new styles, then that
> scheme doesn't work for live subtitles in the general case. It would work
> in the specific case that the style set can somehow be constrained so it
> is predefined and never changes during a presentation. From a broadcaster
> perspective I wouldn't accept that as a constraint.
You're right. A scheme that allows updating styles is useful for live
streaming/broadcasting, especially when you don't know in advance which
styles you will use later on.

It can be done with what I would call timed styles, i.e. styles that
have a time range of validity, like cues today. Actually, I proposed
some time ago to put cue styles directly in the cue settings, but it can
also be done by defining a new type of cue: a style-only cue, with a
time range overlapping the cues it applies to.
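
As a rough illustration of the style-only cue idea, here is a minimal
Python sketch. It is not part of any WebVTT spec: the names Cue,
StyleCue and styles_for are hypothetical, and the overlap rule (a style
applies to every cue whose time range intersects its own) is only one
possible reading of "overlapping".

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class StyleCue:
    # Hypothetical "style-only" cue: carries CSS instead of caption text.
    start: float  # seconds
    end: float    # seconds
    css: str


def styles_for(cue, style_cues):
    """Return the CSS of every style cue whose time range overlaps the cue's."""
    return [s.css for s in style_cues if s.start < cue.end and cue.start < s.end]


style_cues = [
    StyleCue(0.0, 10.0, "::cue(.a) { color: red }"),
    StyleCue(10.0, 20.0, "::cue(.a) { color: blue }"),
]
cue = Cue(12.0, 14.0, "<c.a>Hello</c>")
print(styles_for(cue, style_cues))  # only the 10-20 s style overlaps this cue
```

With such a model, a live encoder could emit a new style cue whenever a
new style is needed, without touching the header.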

It can also be done with untimed styles, valid for the whole file, when
using segment files, as in the current HLS approach. In that approach, I
imagine that the styles carried in each separate WebVTT file would
replace the existing styles. Each WebVTT file would be considered a
random access segment. One issue with that approach, though, is that
concatenating the segment files may not produce the desired result if
you don't take care of selector clashes.
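
The selector clash issue can be sketched in Python as follows. This is
only an illustration under simplifying assumptions: each segment's
styles are modelled as a dict from selector to declarations, and the
function names find_selector_clashes and concatenate_styles are
hypothetical. Concatenation is modelled as "later rule wins", which is
what the CSS cascade does for rules of equal specificity.

```python
def find_selector_clashes(segments):
    """Return selectors that are defined differently in more than one segment.

    segments: list of dicts (selector -> declarations), in playback order.
    """
    seen = {}
    clashes = []
    for styles in segments:
        for selector, declarations in styles.items():
            previous = seen.get(selector)
            if previous is not None and previous != declarations:
                if selector not in clashes:
                    clashes.append(selector)
            seen[selector] = declarations
    return clashes


def concatenate_styles(segments):
    """Model naive concatenation: for a repeated selector, the last rule wins."""
    merged = {}
    for styles in segments:
        merged.update(styles)
    return merged


segment_1 = {"::cue(.speaker)": "color: yellow"}
segment_2 = {"::cue(.speaker)": "color: cyan", "::cue(.note)": "font-style: italic"}

print(find_selector_clashes([segment_1, segment_2]))
print(concatenate_styles([segment_1, segment_2]))
```

Played segment by segment, cues in the first segment are yellow; after
naive concatenation they would become cyan, which is the undesired
result mentioned above.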

Regarding the MP4 storage, the current spec does not need any
modification to store untimed styles, but it is indeed not ideal for
live streaming of styles. For the storage of timed styles, an update to
the MP4 spec would probably be needed.
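
To make the untimed-header model concrete, here is a small Python sketch
of what the demux conceptually does when playback starts at a non-zero
time: it emits the untimed header, then the cues from the requested
time. It does no real MP4 parsing; rebuild_webvtt and its (start, end,
payload) tuple layout are hypothetical, and keeping cues that are still
active at the seek time is an assumption.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def rebuild_webvtt(header, cues, seek_time=0.0):
    """Concatenate the untimed header with the cues ending after seek_time.

    header: text stored in the untimed place ("WEBVTT" plus any untimed styles).
    cues: list of (start, end, payload) tuples, times in seconds.
    """
    blocks = [header.strip()]
    for start, end, payload in cues:
        if end > seek_time:  # assumption: keep cues still active at seek_time
            timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
            blocks.append(f"{timing}\n{payload}")
    return "\n\n".join(blocks) + "\n"


cues = [(0.0, 2.0, "first"), (5.0, 7.0, "second")]
print(rebuild_webvtt("WEBVTT", cues, seek_time=3.0))
```

Because the header is always emitted first, any untimed style stored in
it applies regardless of the seek time, while a style placed between
cues would have no natural place in this model.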

Hope I'm clearer ...

Cyril

-- 
Cyril Concolato
Multimedia Group / Telecom ParisTech
http://concolato.wp.mines-telecom.fr/
@cconcolato
Received on Friday, 9 October 2015 16:22:27 UTC