Re: metadata in the VTT file header, re-starting the conversation from Glenn Maynard on 2012-06-08 (public-texttracks@w3.org from June 2012)

From: Glenn Maynard <glenn@zewt.org>
Date: Thu, 7 Jun 2012 21:10:22 -0500
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: Philip Jägenstedt <philipj@opera.com>, public-texttracks@w3.org
Message-ID: <CABirCh-K-+YaJimDdkNViv0ggBbfDjjht+vcjDLXCa3WkL1Ypw@mail.gmail.com>
On Thu, Jun 7, 2012 at 7:54 PM, Silvia Pfeiffer
<silviapfeiffer1@gmail.com> wrote:

> >> There is a fixed location in a WebVTT file where the data is and there
> >> is a fixed location in WebM files where the data is. I don't follow
> >> your argument. For 10 tracks, of course you have to seek to 10
> >> different locations - each one has different metadata. They can't be
> >> mingled together.
> >
> >
> > Sure they can.  Extract the headers and group them in the same region at
> > the beginning of the WebM file, so you can read all of them in a single
> > burst.
>
> Physically they would indeed all be in the same region. But logically
> they still have to be separate. That's all I was referring to.
>

(You said that 10 tracks would require 10 seeks.  I'm just saying that it
doesn't: group the metadata together on the media, and you can read all 10
tracks' metadata with 1 seek.)

> I think we can also agree that there is a use
> case where video files and VTT files are handled as separate resources
> - in particular by a video player that is not a Web browser and does
> not combine the presentation through a HTML page with its track
> elements. My argument is simply that the most appropriate place to
> keep metadata about VTT files is at the beginning of a VTT file in
> this latter use case.
>

Okay, what I disagree with is the idea of distributing files to
end-users as loose files.  While it's probably harmless to allow players to
use the VTT headers for that, they couldn't depend on it, since lots of
files won't have them.

I agree that putting <track> metadata inside the VTT is useful for the
editing/authoring phase.  That is, to allow implementing WebM/MKV muxers
and autogenerating <track> text, without the user having to supply this
metadata (as you have to today, for example, when muxing an MKV file from a
video file and an .SRT, or hand-typing <video>).  I do think that these use
cases are enough to justify the feature.  (That means the rest of this is
mostly tangential.)
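
As a rough illustration of those authoring use cases, the sketch below reads
"Key: value" lines from the top of a VTT file and emits a <track> element.
The header syntax is an assumption (modelled on the "Kind: subtitles" line
mentioned later in this thread, plus a hypothetical "Language" key), not a
settled WebVTT format, and the function names are made up for the example.

```python
from html import escape


def read_vtt_header(text):
    """Collect hypothetical "Key: value" lines between WEBVTT and the first blank line."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file")
    meta = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip().lower()] = value.strip()
    return meta


def track_element(src, meta):
    """Build a <track> tag, including only the attributes the header supplied."""
    attrs = [("src", src)]
    if "kind" in meta:
        attrs.append(("kind", meta["kind"]))
    if "language" in meta:
        attrs.append(("srclang", meta["language"]))
    return "<track " + " ".join(f'{k}="{escape(v, quote=True)}"' for k, v in attrs) + ">"


sample = "WEBVTT\nKind: captions\nLanguage: fr\n\n00:00:01.000 --> 00:00:02.000\nBonjour\n"
print(track_element("captions.fr.vtt", read_vtt_header(sample)))
```

A file with no header lines yields an empty dict, so the generated <track>
carries only src; that is the case where the user still has to supply the
metadata by hand.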

> If the VTT file says it's in French and the captions are in French,
> but the HTML author has overridden them to be German, then the Web page
> is incorrect.
>
> I don't think you can say that where they disagree that it's always
> the VTT author's mistake.
>

I didn't say that.  I said that if an author fixes a wrong VTT header by
overriding it in <track>, then if you load the VTT file in something that
uses only the VTT headers, it'll break and show the old, wrong data.
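
To make that failure concrete, here is a small sketch. It assumes a
hypothetical resolution rule in which <track> attributes win over VTT header
values; the dict keys and the function name are illustrative only.

```python
def effective_metadata(vtt_header, track_attrs=None):
    """Merge metadata; values from <track> (if any) override the VTT header."""
    merged = dict(vtt_header)
    merged.update(track_attrs or {})
    return merged


# The VTT header is wrong (it says German); the page author corrects it in <track>.
vtt_header = {"kind": "subtitles", "language": "de"}
track_attrs = {"language": "fr"}

browser_view = effective_metadata(vtt_header, track_attrs)
standalone_view = effective_metadata(vtt_header)  # a player that only sees the file

print(browser_view["language"])     # fr
print(standalone_view["language"])  # de (the stale, wrong value)
```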

> Indeed. VTT authors will want to check their VTT files offline, too,
> not just online. So they will use a player such as VLC. If they get
> the language and/or the kind wrong, VLC would show just as well as a
> Web page shows a Web author what they got wrong in their HTML markup.
> I don't really see the difference.
>

Some will, of course, but you have a much higher opinion of how much people
test content than I do if you think this will be common.  Even if they
notice it, many will probably ignore it, knowing that it'll be fixed later
when they set up the <track> (which is what happens with the SRT/MKV
process).

> > If a captions VTT file is being treated as captions, they won't notice (or
> > care) if there's a "Kind: subtitles" line in the VTT file (being
> > overridden by the HTML).
> >
> > Anyway, we're getting a bit afield.  Let's back up a bit and reexamine
> > the use cases.
> >
> > 1: Allowing WebM muxing software to automatically detect the metadata, so
> > users don't have to do it manually (like they have to today for MKV/SRT
> > muxing).
> > 2: Allowing software to automatically generate complete HTML <video>
> > snippets.  (This is basically the same as #1.)
> > 3: Allowing video players to display captions without needing to know
> > how to parse HTML.
> >
> > This works fine for #1 and #2.  It really doesn't help #3 at all--even if
> > you assume VTT metadata will never be out of date (a tough assumption), a
> > huge number of VTT files will simply not have it.  That means no player
> > can ever really depend on the metadata being in each VTT file.
>
> So what? A Web page cannot rely on @kind and @srclang being available
> for a text track either. There are default settings that players &
> browsers will use. I don't see how this is different.
>

It's completely different.  People will be naturally encouraged to supply
@srclang, because the problems that happen if they don't ("Language:
unknown") are obvious and will show up immediately.  People won't supply
VTT headers most of the time, because it won't cause any evident problems.
They'll have no idea that they've omitted anything.

> > But I don't think #3 is a real use case anyway.  If a site like YouTube
> > (a very hypothetical example :) wants to allow saving videos to disk with
> > captions, it should mux them into a WebM or MKV file and present *that*
> > for download, not dump a dozen separate VTT files on the user and expect
> > him to keep them together.  That is, standalone video players probably
> > don't need to support reading loose VTT files *or* HTML parsing--their
> > MKV/WebM support is enough.
>
> This is wishful thinking. I've tried for over 10 years to get
> stand-alone caption files included into binary media resources. While
> it sometimes happens, it doesn't in the majority of cases.


It's not wishful thinking.  It's experience from watching subtitled video
online for as long as video has been online.  Standalone caption files are
rare, while I've seen thousands of video files with embedded captions.
(Before MKV became popular, standalone SRT files were a little more common,
but baking subtitles into the video was the common practice.)

> The reasons are simple: text files are editable - video files not so much.
> Keeping them separate gives you more control. The problem of shipping around
> multiple related files has been solved by zip - it is a smaller
> problem than encapsulating/extracting text files into/from binary
> files.


I've never once seen anyone distributing video for end-user consumption in
a ZIP.  (I'd fight tooth and nail against anybody encouraging that, but
I've elided it for the sake of not starting another tangent...)

> The only place where it will happen is where publishers want to
> take away the control over the caption files from their users and thus
> enforce the distribution of the video files with encapsulated text
> tracks (mainly as an obfuscation mechanism to make it harder for the
> ordinary user to change them).
>

It happens all the time, and not to try to take away control.  Muxing isn't
DRM.

> As for YouTube: right now you can download the text track files and
> the video files separately (some hacking involved), but there is no
> muxed download. That seems to contradict your example.
>

I was talking about presenting videos to the user for download, not
manually downloading the underlying resources that the Flash player
accesses.  That's no different than downloading the individual resources
pointed to by <track> elements.  YouTube doesn't present downloads at all,
unless this has changed recently or is very well hidden.

-- 
Glenn Maynard
Received on Friday, 8 June 2012 02:10:52 UTC