Re: metadata in the VTT file header, re-starting the conversation

On Tue, May 29, 2012 at 3:32 AM, Glenn Maynard <glenn@zewt.org> wrote:
> On Sun, May 13, 2012 at 5:36 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com>
> wrote:
>>
>> There is a fixed location in a WebVTT file where the data is and there
>> is a fixed location in WebM files where the data is. I don't follow
>> your argument. For 10 tracks, of course you have to seek to 10
>> different locations - each one has different metadata. They can't be
>> mingled together.
>
>
> Sure they can.  Extract the headers and group them in the same region at the
> beginning of the WebM file, so you can read all of them in a single burst.

Physically they would indeed all be in the same region. But logically
they still have to be separate. That's all I was referring to.


> This is standard practice for any file meant to be read from media with
> costly seeks (eg. optical media and networks).  I'm pretty sure WebM already
> does this, at least for certain bits of WebVTT metadata.
>
> This is exactly the same as putting VTT metadata in <video>.  They're all
> "mingled together", grouped together into a single DOM element, so you don't
> have to load every VTT file to find out what each one's for.

Right. I still don't understand what we are disputing.

I think we agree that when VTT files are embedded in a video file,
there are efficiencies. I think we can also agree that there is a use
case where video files and VTT files are handled as separate resources
- in particular by a video player that is not a Web browser and does
not combine the presentation through an HTML page with its track
elements. My argument is simply that the most appropriate place to
keep metadata about VTT files is at the beginning of a VTT file in
this latter use case.


>> >> I don't know where you get your statistics, but almost all usage of
>> >> SRT files on a desktop work the #3 way and they all fall short of the
>> >> metadata problem, which is something we don't want to repeat with
>> >> WebVTT. The rest of the desktop use cases (in particular MPEG-4 and
>> >> QuickTime files) have it muxed in-band, i.e. the #2 case. We're
>> >> introducing #1 because it's the Web way, but it's a new way and by far
>> >> not the most common way yet.
>> >
>> > Almost all uses of SRT and SSA I've seen in many years are embedded in
>> > MKV
>> > files (#2); media players get subtitle metadata from the MKV structure.
>> > Loose SRT files are rare these days.
>>
>> In the world of MKV users you might be right. I doubt that's the
>> majority use case for captions though.
>
>
> I'm not talking about *all* uses of captions, just delivering savable static
> video files with subtitles (that's what "usage of SRT files on a desktop"
> sounded like--not sure what you meant if not that).

I'm talking in particular about two use cases for desktop players: the
ones where they get video files with embedded VTT files, but also the
ones where they just get a directory with video files and a bunch of
VTT files. That latter one is the one that has a particular lack of
metadata (as described above).


>> No it won't. The VTT file is being used in the way it is supposed to
>> be used as prepared by the VTT author. The Web publisher has overruled
>> those hints for *their* Website. That doesn't mean that now the hints
>> in the VTT file are incorrect.
>
>
> If the VTT says it's in French but the captions are actually in German, and
> it's been overridden by HTML, then of course the VTT file is incorrect.

If the VTT file says it's in French and the captions are in French,
but the HTML author has overridden them to be German, then the Web page
is incorrect.

I don't think you can say that, where they disagree, it's always
the VTT author's mistake.


>> I really don't see why a VTT author would get the metadata value of
>> @kind wrong - they are the ones who create the files and know exactly
>> what they create them for: captions, subtitles, descriptions, chapters
>> or even metadata.
>
>
> All most people care about is if the file does what they expect it to do.

Indeed. VTT authors will want to check their VTT files offline, too,
not just online. So they will use a player such as VLC. If they get
the language and/or the kind wrong, VLC would show them the mistake
just as well as a Web page shows a Web author what they got wrong in
their HTML markup. I don't really see the difference.


> If a captions VTT file is being treated as captions, they won't notice (or
> care) if there's a "Kind: subtitles" line in the VTT file (being overridden
> by the HTML).
>
> Anyway, we're getting a bit afield.  Let's back up a bit and reexamine the
> use cases.
>
> 1: Allowing WebM muxing software to automatically detect the metadata, so
> users don't have to do it manually (like they have to today for MKV/SRT
> muxing).
> 2: Allowing software to automatically generate complete HTML <video>
> snippets.  (This is basically the same as #1.)
> 3: Allowing video players to display captions without needing to know how to
> parse HTML.
>
> This works fine for #1 and #2.  It really doesn't help #3 at all--even if
> you assume VTT metadata will never be out of date (a tough assumption), a
> huge number of VTT files will simply not have it.  That means no player can
> ever really depend on the metadata being in each VTT file.

So what? A Web page cannot rely on @kind and @srclang being available
for a text track either. There are default settings that players &
browsers will use. I don't see how this is different.
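
A minimal sketch of that fallback, assuming a hypothetical header made
of "Name: value" lines between the WEBVTT line and the first blank line.
The header syntax is still only a proposal, and the function name and
the "Language" key are made up for illustration. The "subtitles" default
mirrors the missing-value default of @kind in HTML:

```python
# Hypothetical defaults a player might apply when the header is silent.
DEFAULTS = {"kind": "subtitles", "language": ""}


def read_header_metadata(vtt_text, defaults=DEFAULTS):
    """Return header metadata from a VTT file, falling back to defaults.

    Assumes a proposed (not standardised) "Name: value" header syntax.
    """
    lines = vtt_text.lstrip("\ufeff").splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file")

    metadata = dict(defaults)
    for line in lines[1:]:
        # A blank line or a cue timing line ends the header.
        if not line.strip() or "-->" in line:
            break
        name, sep, value = line.partition(":")
        if sep:
            metadata[name.strip().lower()] = value.strip()
    return metadata
```

A file without any header lines simply yields the defaults, which is
the same situation as a <track> element without @kind and @srclang.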


> But I don't think #3 is a real use case anyway.  If a site like YouTube (a
> very hypothetical example :) wants to allow saving videos to disk with
> captions, it should mux them into a WebM or MKV file and present *that* for
> download, not dump a dozen separate VTT files on the user and expect him to
> keep them together.  That is, standalone video players probably don't need
> to support reading loose VTT files *or* HTML parsing--their MKV/WebM support
> is enough.

This is wishful thinking. I've tried for over 10 years to get
stand-alone caption files included in binary media resources. While
it sometimes happens, it doesn't in the majority of cases. The reasons
are simple: text files are editable - video files not so much. Keeping
them separate gives you more control. The problem of shipping around
multiple related files has been solved by zip - it is a smaller
problem than encapsulating/extracting text files into/from binary
files. The only place where it will happen is where publishers want to
take away the control over the caption files from their users and thus
enforce the distribution of the video files with encapsulated text
tracks (mainly as an obfuscation mechanism to make it harder for the
ordinary user to change them).

As for YouTube: right now you can download the text track files and
the video files separately (some hacking involved), but there is no
muxed download. That seems to contradict your example.


>> > WebM already lets you store files with associated filenames (used for
>> > fonts), so it seems natural to just say eg:
>> >
>> > Stylesheet: file.css
>> >
>> > and depending on how the VTT file was loaded, it'll either treat it as a
>> > relative URL and fetch the file (if it was loaded via HTTP), look in the
>> > same directory (if it was loaded as a loose file), or look for a WebM
>> > attachment with that name (the common case for standalone files).
>>
>> Ah nice. I wasn't aware of that. Is that a WebM feature or a MKV feature?
>
>
> I don't know--if it's in MKV but not WebM, I'd argue that it should be added
> to WebM.  Putting aside CSS, this is how embedded fonts are handled in MKV,
> and WebM should be consistent with that.

Agreed.
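
For reference, the three-way lookup you describe for a "Stylesheet:"
line could be sketched roughly as below. This is only an illustration
under assumptions: the function name, the "source" labels and the
attachment list are hypothetical, and nothing here is defined by WebVTT,
WebM or MKV today.

```python
import posixpath
from urllib.parse import urljoin


def resolve_stylesheet(name, source, location=None, attachments=()):
    """Decide where to look for a stylesheet named in a VTT header.

    source is a hypothetical label for how the VTT file was loaded:
    "http" (fetched from a URL), "file" (loose file on disk) or
    "webm" (track extracted from a container with attachments).
    """
    if source == "http":
        # Treat the name as a URL relative to the VTT file's URL.
        return ("url", urljoin(location, name))
    if source == "file":
        # Look in the same directory as the loose VTT file.
        return ("path", posixpath.join(posixpath.dirname(location), name))
    if source == "webm":
        # Look for a container attachment with that filename.
        return ("attachment", name) if name in attachments else None
    raise ValueError("unknown source: %r" % (source,))


print(resolve_stylesheet("file.css", "http",
                         "http://example.com/video/captions.vtt"))
print(resolve_stylesheet("file.css", "file", "/media/clip/captions.vtt"))
print(resolve_stylesheet("file.css", "webm",
                         attachments=["font.ttf", "file.css"]))
```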

Cheers,
Silvia.

Received on Friday, 8 June 2012 00:55:33 UTC