Re: Inband styling (was Re: Evidence of 'Wide Review' needed for VTT)

On Wed, Oct 21, 2015 at 2:17 PM, David Singer <singer@apple.com> wrote:
> Hi
>
>
>> On Oct 21, 2015, at 13:35 , Philip Jägenstedt <philipj@opera.com> wrote:
>>
>> On Wed, Oct 21, 2015 at 12:50 PM, David Singer <singer@apple.com> wrote:
>>>
>>> People dynamically generate the files (both VTT and MP4) on the fly, so the ‘just’ in this sentence then becomes hard.
>>
>> If one is generating both a standalone WebVTT file and an MP4 file at
>> the same time, then the input could presumably be any format at all.
>> If it is another standalone WebVTT file, is it actually hard to
>> collect the style blocks and put them together in the MP4 header? It
>> just seems to be a matter of parsing input up front, which is
>> generally speaking easier than creating a streaming parser and
>> handling the output as it comes.
>
> I wasn’t very clear.
>
> If one is, for example, live captioning, and then making chunks of a VTT file, or of an MP4 file encapsulating that VTT data, available as they are ready, then it’s obviously not possible to ‘go back’ and adjust the file header if new styles come along.
>
> People tune-in to such live streams, so we try very hard to make it true that one can get the needed information from (a) the stream setup information and (b) the stream itself, possibly starting at a random access point (e.g. a video I-frame).
>
> Style blocks interleaved in the stream force one to roll through a possibly long presentation, including the requirement to load it all, just to get the styling right. Obviously one could mark places that (re-)establish all the styles as sync points, but even that is hard: do I keep re-asserting all the styles I have seen, in case they are used again?
>
> Yes, the static transcoding case is easier.  It is, alas, not the only one.

What we are talking about here is the conformance requirements of
standalone WebVTT files and what the WebVTT parser will do if it
encounters style blocks after a cue. In this context, static
resources are really all that exist, as live captioning with
<track>+WebVTT [1] hasn't been spec'd. If there are other contexts
that use the WebVTT syntax and parser in a streaming mode, that
would be interesting to know. AFAICT, only a situation like that
could run into a problem, and if it's only a hypothetical at this
point I don't think it should affect how WebVTT works in the
context of <track>.

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=18029
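For concreteness, here is a minimal sketch of the kind of file under
discussion; the timings, the ".late" class, and the colours are made
up for illustration, and whether a parser honours or drops the second
STYLE block is exactly the conformance question above:

    WEBVTT

    STYLE
    ::cue { color: yellow; }

    00:00:00.000 --> 00:00:04.000
    First caption, styled by the header STYLE block.

    STYLE
    ::cue(.late) { color: red; }

    00:00:05.000 --> 00:00:09.000
    <c.late>Later caption, relying on a style block that appeared mid-stream.</c>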

Philip

Received on Wednesday, 21 October 2015 12:36:53 UTC