Re: Inband styling (was Re: Evidence of 'Wide Review' needed for VTT) from Philip Jägenstedt on 2015-10-21 (public-texttracks@w3.org from October 2015)

From: Philip Jägenstedt <philipj@opera.com>
Date: Wed, 21 Oct 2015 13:35:39 +0200
To: David Singer <singer@apple.com>
Cc: Cyril Concolato <cyril.concolato@telecom-paristech.fr>, "public-texttracks@w3.org" <public-texttracks@w3.org>
Message-ID: <CAMQvoCmTKON8ibumVJVMtUD8dBiPiQa+up-8C3QC63R3Krm=2Q@mail.gmail.com>

On Wed, Oct 21, 2015 at 12:50 PM, David Singer <singer@apple.com> wrote:
>
>> On Oct 20, 2015, at 13:40 , Philip Jägenstedt <philipj@opera.com> wrote:
>>
>> On Fri, Oct 9, 2015 at 3:55 PM, Cyril Concolato <cyril.concolato@telecom-paristech.fr> wrote:
>> Hi Philipp, all,
>>
>> Le 26/02/2015 03:18, Philip Jägenstedt a écrit :
>> On Thu, Feb 26, 2015 at 12:13 AM, David Singer <singer@apple.com> wrote:
>>
>> On Feb 24, 2015, at 18:57 , Philip Jägenstedt <philipj@opera.com> wrote:
>>
>> I think I agree with Silvia here, a STYLE block seems more natural
>> than putting it in the header. Note that we could still, if there are
>> strong reasons, drop any such blocks that come after any cue. It gives
>> us some flexibility with the streaming case, even if we don't use it
>> now.
>> I don’t mind if it’s a block or part of the header, as long as it has to occur before the first cue. The point is that at the moment one can random access into a VTT file (not load it all from the beginning), once one has the ‘header’.  I don’t want to lose that.  In text, one might lose cues that have an end time that overlaps where you random access to, but in MP4 packing we even deal with that.
>> What does this mean? The parser consumes all data from beginning to
>> end as a stream. Perhaps it could be proven that if you seek to a
>> random point and put the tokenizer+parser in a particular state then
>> the cues that it will output will be a subset of the cues output for a
>> sequential parse, but this isn't a property of WebVTT files I've ever
>> even considered.
>>
>> I think it would be fine to require style blocks to precede any cues,
>> but I think I'm maybe missing the actual rationale...
>> When storing a WebVTT file in an MP4 track, the WebVTT file is parsed, the header is stored in a place that is not timed and the cues are stored in timed places. This storage simplifies file editing (as timed cues may be removed, including the first one, or added even before the first one, and without caring about the header). This helps also playback from non-0 time because the MP4 demux will conceptually create a WebVTT file by concatenating the header followed by the cues starting from the requested time. When doing DASH streaming, the header is provided upfront also, in the initialization segment, which means that all WebVTT-in-MP4 media segments are random accessible, which is simple and easy to handle. Inserting non-timed styles between cues (i.e. even valid for cues located before in the file) would require changes in this storage and modification to associated implementations.
>>
>> Isn't this just a matter of parsing the whole WebVTT file into memory before trying to mux it into MP4? If you just collect all the style blocks and put them in the header, is there still a problem?
>
> People dynamically generate the files (both VTT and MP4) on the fly, so the ‘just’ in this sentence then becomes hard.

If one is generating both a standalone WebVTT file and an MP4 file at
the same time, then the input could presumably be any format at all.
If it is another standalone WebVTT file, is it actually hard to
collect the style blocks and put them together in the MP4 header? It
just seems to be a matter of parsing input up front, which is
generally speaking easier than creating a streaming parser and
handling the output as it comes.
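
As a rough sketch of the "parse the input up front" approach, assuming the
whole WebVTT input is available as one string and that style blocks are
blank-line-separated blocks whose first line starts with STYLE (the syntax
under discussion in this thread), something like the following would collect
every style block into the header regardless of where it appears. The
function name hoist_style_blocks is hypothetical, and other block types such
as NOTE are simply dropped here to keep the example short.

```python
import re


def hoist_style_blocks(vtt_text):
    """Return (header, cues) with every STYLE block moved into the header.

    Hypothetical helper for illustration only; not taken from any muxer.
    """
    # Normalize line endings, then split the file into blank-line-separated blocks.
    text = vtt_text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = [b for b in re.split(r"\n{2,}", text.strip("\n")) if b]
    if not blocks or not blocks[0].startswith("WEBVTT"):
        raise ValueError("input does not start with a WEBVTT header")

    header_parts = [blocks[0]]  # untimed data, e.g. for the MP4 sample entry
    cues = []                   # timed data, e.g. for the MP4 samples
    for block in blocks[1:]:
        lines = block.split("\n")
        # A cue has its timing line first, or second after an optional identifier.
        if any("-->" in line for line in lines[:2]):
            cues.append(block)
        elif lines[0].startswith("STYLE"):
            header_parts.append(block)
        # Anything else (NOTE comments and so on) is ignored in this sketch.
    return "\n\n".join(header_parts), cues
```

The cost is that nothing can be written out until the whole input has been
read, which is the trade-off against on-the-fly generation raised above.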

Philip

Received on Wednesday, 21 October 2015 11:36:07 UTC