Re: metadata in the VTT file header, re-starting the conversation from Silvia Pfeiffer on 2012-05-13 (public-texttracks@w3.org from May 2012)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Mon, 14 May 2012 08:36:27 +1000
To: Glenn Maynard <glenn@zewt.org>
Cc: Philip Jägenstedt <philipj@opera.com>, public-texttracks@w3.org
Message-ID: <CAHp8n2khpKm7GBF7HJqf4QTH7i_e6kv=yt_a3+3456hkby5jRw@mail.gmail.com>
On Mon, May 14, 2012 at 2:44 AM, Glenn Maynard <glenn@zewt.org> wrote:
> On Wed, May 9, 2012 at 7:27 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com>
> wrote:
>>
>> >  If the
>> > information has to be loaded out of each .VTT file, it could require a
>> > lot
>> > of seeking around the file to load it; slow on optical media, even if it
>> > happens to be stored in the same file.
>>
>> We're only talking about header-style metadata. There is no seeking
>> around required: it comes straight after the WEBVTT magic string.
>
>
> Reading data at the beginning of the file is exactly what requires seeking
> around.  If you're loading metadata for ten VTT files embedded in a WebM
> file, you have to seek to the location of each embedded file to read it.
> That's why formats like WebM store metadata like that in a single index that
> can be loaded all at once.

There is a fixed location in a WebVTT file where the data is and there
is a fixed location in WebM files where the data is. I don't follow
your argument. For 10 tracks, of course you have to seek to 10
different locations - each one has different metadata. They can't be
mingled together.
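
As a rough illustration of why header-style metadata needs no seeking
within a single WebVTT file, here is a minimal sketch of a reader that
stops at the end of the header block. It assumes a "Name: value" header
syntax placed directly after the WEBVTT line; the field names "Kind" and
"Language" and the function name are hypothetical, since no syntax had
been agreed on in this thread.

```python
import io


def read_vtt_header(stream):
    """Read hypothetical "Name: value" header lines that follow the
    WEBVTT magic string, stopping before any cue is touched."""
    first = stream.readline().rstrip("\r\n").lstrip("\ufeff")
    if not (first == "WEBVTT" or first[:7] in ("WEBVTT ", "WEBVTT\t")):
        raise ValueError("not a WebVTT file")

    metadata = {}
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            break  # a blank line ends the header block
        if "-->" in line:
            break  # a cue timing line: the header is over
        name, sep, value = line.partition(":")
        if sep:
            metadata[name.strip()] = value.strip()
    return metadata


sample = (
    "WEBVTT\n"
    "Kind: captions\n"   # assumed header field, not an agreed syntax
    "Language: en\n"     # assumed header field, not an agreed syntax
    "\n"
    "00:00.000 --> 00:02.000\n"
    "Hello\n"
)
header = read_vtt_header(io.StringIO(sample))
print(header)
```

The reader consumes only the first few lines of the stream, which is
the point being made about the fixed location of the data.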


>> > Mirroring that information only seems important for #3.  That case is
>> > uncommon, but it does happen.  I can't decide if the problem I mention
>> > below
>> > is worth the relative infrequency of this use case...
>>
>> I don't know where you get your statistics, but almost all usage of
>> SRT files on a desktop work the #3 way and they all fall short of the
>> metadata problem, which is something we don't want to repeat with
>> WebVTT. The rest of the desktop use cases (in particular MPEG-4 and
>> QuickTime files) have it muxed in-band, i.e. the #2 case. We're
>> introducing #1 because it's the Web way, but it's a new way and by far
>> not the most common way yet.
>
>
> Almost all uses of SRT and SSA I've seen in many years are embedded in MKV
> files (#2); media players get subtitle metadata from the MKV structure.
> Loose SRT files are rare these days.

In the world of MKV users you might be right. I doubt that's the
majority use case for captions though.

In either case, we are preparing for an online world where captions
come from WebVTT files and are synchronized with MP4, WebM and Ogg
files by the browser. So, we are expecting people to author WebVTT
files - and most won't encapsulate them into their media files to
avoid duplication of content. Also, text files are easier to edit than
binary files, so the typical Web Dev will likely want to keep it as
text. (Large content providers may not since it's one more way to
obfuscate and 'protect' your content.)

Creation of caption files will most certainly be done by companies
that create text files (be that WebVTT or TTML). This is where the
metadata is created and should be delivered as part of the file. Why
remove it later for Web publishing? Instead it can be used as an author
hint, but overridden by the Web publisher if they think that's more
appropriate.


>> I don't regard that as a problem, but as an opportunity. The file
>> itself has one set of metadata. That's data that the Web Dev can
>> decide to use. Or instead they can decide to overrule it with specific
>> directions in the <track> element.
>
>
> That'll break if anyone tries to use the data in anything that doesn't parse
> HTML and gets metadata from VTT files.

No it won't. The VTT file is being used in the way it is supposed to
be used as prepared by the VTT author. The Web publisher has overruled
those hints for *their* Website. That doesn't mean that now the hints
in the VTT file are incorrect.


> My prediction is that putting metadata like "kind" and "language" in the VTT
> file will never be done consistently anyway; most people will put it in the
> HTML, see that it works in browsers, and not bother to put the data in the
> VTT file too.  That's fine for the Web, but it means standalone players
> won't be able to rely on it.  Handing the information to muxers seems like
> the main use case this would actually work reliably for.

I really don't see why a VTT author would get the metadata value of
@kind wrong - they are the ones who create the files and know exactly
what they create them for: captions, subtitles, descriptions, chapters
or even metadata.

I see the process as follows (feel free to replace "Company" with "Person"):

* Company A authors VTT files
  They write the metadata into the VTT header as a production hint.

* Company B creates media files
  They get their text tracks made by Company A in VTT.
  They encapsulate (encode) the VTT files into WebM, MP4, etc., making
use of the metadata hints contained in the VTT file.

* Company C publishes video on their Website
  They have media files without text tracks and use raw VTT files
produced by Company A to publish text tracks.
  Their Web Server reads the metadata out of the VTT files, including @kind.
  They have their own way of rendering text tracks and thus use
<track> elements with @kind=metadata to avoid any native rendering.

* Company D publishes video on their Website
  They get their media files with text tracks from Company B.
  They don't need to use <track> to get the tracks rendered - the
@kind is automatically interpreted since it comes from the media file.

* Company E publishes video on their Website
  They have media files without text tracks and use raw VTT files
produced by Company A to publish text tracks, just like Company C.
  In contrast to Company C, they take the hints from the VTT file
directly to also fill in @kind and make use of native rendering.
  This includes caption files, subtitle files, description files, and
chapter files.
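
The difference between Company C and Company E can be sketched as
follows. This is only an illustration under assumptions: the hint names
"Kind" and "Language" and the helper track_element are hypothetical,
and the hints are taken as an already-parsed dictionary.

```python
from html import escape


def track_element(src, hints, overrides=None):
    """Build a <track> tag from VTT header hints; the publisher's
    overrides, if any, win over the author's hints."""
    attrs = {"src": src}
    if "Kind" in hints:        # assumed header field name
        attrs["kind"] = hints["Kind"]
    if "Language" in hints:    # assumed header field name
        attrs["srclang"] = hints["Language"]
    attrs.update(overrides or {})
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<track {rendered}>"


hints = {"Kind": "captions", "Language": "en"}

# Company E: pass the author's hints straight through.
company_e = track_element("talk.vtt", hints)

# Company C: override @kind to avoid native rendering.
company_c = track_element("talk.vtt", hints, {"kind": "metadata"})

print(company_e)
print(company_c)
```

In both cases the hints in the VTT file stay untouched; only the
publisher's page differs.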


>> I would most likely create the attributes of a <track> element by
>> analysing the content of the WebVTT files that I am serving and just
>> hand that data through. In this way the browser gets all the
>> information that it needs out of the WebVTT file without actually
>> having to download and parse anything from the WebVTT file. It's
>> proxied information, not redundant information.
>
>
> It's easiest to think of it as a caching mechanism.  The only reason to put
> the data in both the HTML file *and* the VTT file is because it's faster to
> read it all at once out of the HTML file; the HTML data effectively becomes
> a cache of the metadata stored in the VTT files.
>
> So long as the cache is consistent, everything's fine.  It's just
> unfortunate when the cache gets out of sync (eg. people update one and not
> the other).

No, not a cache: a hint provided by the author of the VTT file.


>> That's like saying you can't trust any information given to you in files.
>>
>> In the end, you have to be able to rely on some data: either you rely
>> on the Web dev doing the correct thing or you rely on the WebVTT
>> author doing the right thing. Who can you rely on more? If done
>> properly, the Web dev will just use what's in the file, and the WebVTT
>> author will be the one making sure the file is correct.
>
>
> When the duplication doesn't exist to begin with, you don't have to worry
> about either author doing the right thing; there's only one thing they *can*
> do, since the data is in only one place.

That duplication will always exist, because VTT files are not
typically created by Web developers who author the Web page at the
same time as the VTT file. Thus, there will be a note somewhere about
what the VTT file is - most often I have seen it done in the name of
the VTT file both for @kind and @lang. When a Web dev has to rely on
external information to identify the type of file he/she is dealing
with, the risk of getting out of sync is indeed high. Explicit
metadata in the VTT file that has been defined consistently and that
will be used and parsed consistently is a much less error-prone way to
go.


> I'm not calling this a fatal problem, but if we're consciously introducing a
> new category of problem to solve other problems, we should be aware of the
> tradeoff.

Agreed.


>> We haven't figured out how to deal with external CSS and WebVTT for
>> non-browser apps either. The WebM mechanism will simply rely on
>> whatever we come up with. If it's independent files that have to be
>> delivered with the media and the WebVTT file (maybe in a zip file),
>> then that works for WebM. I'm wary of putting a file name into WebVTT
>> - I'd much rather leave it informally to be delivered in zip files
>> with same names. In-line css in WebVTT headers would also work for
>> WebM.
>
>
> WebM already lets you store files with associated filenames (used for
> fonts), so it seems natural to just say eg:
>
> Stylesheet: file.css
>
> and depending on how the VTT file was loaded, it'll either treat it as a
> relative URL and fetch the file (if it was loaded via HTTP), look in the
> same directory (if it was loaded as a loose file), or look for a WebM
> attachment with that name (the common case for standalone files).

Ah nice. I wasn't aware of that. Is that a WebM feature or an MKV feature?
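
To make sure I read the suggested lookup correctly, here is a minimal
sketch of the three cases. Everything in it is an assumption for
illustration: the function name, the origin labels "http", "file" and
"webm", and the idea that attachments are available as a set of names.

```python
import posixpath
from urllib.parse import urljoin


def resolve_stylesheet(name, origin, location=None, attachments=()):
    """Decide where a 'Stylesheet: <name>' reference would be looked up,
    depending on how the VTT file itself was loaded."""
    if origin == "http":
        # Treat the name as a URL relative to the VTT file's URL.
        return ("fetch", urljoin(location, name))
    if origin == "file":
        # Loose file: look in the same directory as the VTT file.
        return ("open", posixpath.join(posixpath.dirname(location), name))
    if origin == "webm":
        # Embedded track: look for a container attachment of that name.
        if name in attachments:
            return ("attachment", name)
        return ("missing", name)
    raise ValueError(f"unknown origin: {origin}")


print(resolve_stylesheet("file.css", "http", "http://example.com/subs/talk.vtt"))
print(resolve_stylesheet("file.css", "file", "/media/talk.vtt"))
print(resolve_stylesheet("file.css", "webm", attachments={"file.css", "font.ttf"}))
```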


>> > It also says: "This is how roll-up captions work: multiple cues are
>> > rendered
>> > simultaneously, and when the top cue expires, the other cues move up and
>> > a
>> > new cue appears at the bottom."  I don't know why it says that, since
>> > WebVTT
>> > doesn't do roll-up captions.
>>
>> Such a shame, isn't it! Just look at:
>> http://www.youtube.com/watch?v=oxkZTF-7Lgw - how will we do that with
>> WebVTT?
>
>
> I hope we won't.  Word-at-a-time is the worst possible presentation mode for
> captions.  Watching that, I never get to look at the video; I have to stare
> at the captions the whole time--I may as well be reading a transcript.
> After the video was no longer realtime, the incremental captions should have
> been flattened into individual captions that can be shown all at once.  Ian
> suggested the same thing:
> http://lists.w3.org/Archives/Public/public-texttracks/2011Dec/0033.html
> ("for rebroadcast...").

They should have been, but they hardly ever will be. Also, how are we
going to do the live broadcast in the first instance without rollup?

Anyway - that's for a different thread.


>> >> metadata is stored in CodecPrivate etc.
>> >
>> > (It doesn't look like that's what it's currently suggesting, FYI: "no
>> > WebVTT
>> > data is stored in the CodecPrivate element of the WebM Track header".
>> >  It's
>> > a wiki, so maybe it changed since you read it last.)
>>
>> You're mis-reading. This refers to storing no payload data (i.e. no
>> CUES) into the CodecPrivate header.
>
>
> WebVTT headers will also be WebVTT data.  The above text should say "WebVTT
> cues".

Possibly. I'm just saying that the author of that wiki page didn't use
the term in that manner - they regard "WebVTT data" as "WebVTT payload
data", which refers to cues - just their own terminology. I'm not sure
we have clearly defined what "WebVTT data" is. We have a "WebVTT data
state" in the parser, but that's somewhat unrelated.

Cheers,
Silvia.
Received on Sunday, 13 May 2012 22:37:17 UTC