Re: Metadata in the VTT file header (bug 15851), use cases (and a need to close this) from Ian Hickson on 2012-08-30 (public-texttracks@w3.org from August 2012)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 30 Aug 2012 00:46:38 +0000 (UTC)
To: David Singer <singer@apple.com>
cc: public-texttracks <public-texttracks@w3.org>
Message-ID: <Pine.LNX.4.64.1208300016270.614@ps20323.dreamhostps.com>
On Wed, 29 Aug 2012, David Singer wrote:
> On Aug 29, 2012, at 16:53 , Ian Hickson <ian@hixie.ch> wrote:
> > On Wed, 29 Aug 2012, David Singer wrote:
> >> 
> >> 1) Authoring.  Quite often caption files are authored/written in a 
> >> different workflow from the media, and must be re-united later. We'd 
> >> like to keep track of attributes of the files in-band, so that they 
> >> don't get lost (e.g. the language of the captions), and indeed, of 
> >> the proposed values for the <track> element attributes when the file 
> >> is referenced from HTML. It can also be useful to include a link-back 
> >> to the content that was captioned, using an identifier (e.g. URL).
> > 
> > This would be entirely addressed by in-file comments, and doesn't need 
> > name-value pairs.
> 
> Really?  I thought comments were free-form, and HTML5 attributes had a 
> name and a value.  Perhaps you could indicate how software could parse 
> the 'comments' to form the initial/suggested attribute values?

However they want to. It's a one-vendor problem, after all.


> > In fact name-value pairs wouldn't address the problem sufficiently, 
> > since some people have data that isn't name-value pairs (e.g. an 
> > author might want to include name,language,value tuples, or binary 
> > data, or structured data). In addition, author-specific workflow data 
> > doesn't need to follow a standard, since it only needs to be 
> > interoperable within the application the user uses.
> 
> Caption houses are often separate from authoring houses, that are 
> separate from the content distribution house;  this is what 
> interoperability is about. :-)  Indeed, caption houses often have the 
> rights to database captions but not the content, and content houses the 
> opposite (I know, it seems bizarre that Disney doesn't retain the rights 
> to captions they paid to have made, but, it's often the case).

If there are other use cases then I'm certainly eager to hear about them, 
but the use case you gave in #1 above was just that within a workflow, 
data needed to be tracked. There are many ways of solving this. You can use 
WebVTT+MyData, e.g., a proprietary format you make up to include your data 
during your workflow. You can have parallel files. You can have databases 
with the data in one cell and the WebVTT in another. You can put the data 
in a custom format in a comment that you read/write during production and 
strip during publication. There are all kinds of ways to do it.
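
The "strip during publication" step can be sketched as follows. This is 
only a minimal sketch: it assumes comments are written as NOTE blocks (the 
comment syntax WebVTT ended up with), and the function names and the 
"source-id" / "reviewed" workflow fields in the usage example are 
hypothetical, made up for illustration.

```python
import re


def is_comment(block):
    # Assumption: a comment block starts with a line that is exactly "NOTE",
    # or "NOTE" followed by a space or tab, and runs until the next blank line.
    first_line = block.split("\n", 1)[0]
    return first_line == "NOTE" or first_line.startswith(("NOTE ", "NOTE\t"))


def strip_comments(text):
    """Return the WebVTT text with every comment block removed."""
    # Normalise line endings so blocks can be split on blank lines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    kept = [block for block in blocks if not is_comment(block)]
    return "\n\n".join(kept) + "\n"


# Hypothetical production file carrying workflow data in a comment.
production = (
    "WEBVTT\n"
    "\n"
    "NOTE source-id: 1234\n"
    "reviewed: yes\n"
    "\n"
    "00:11.000 --> 00:13.000\n"
    "Hello\n"
)
published = strip_comments(production)
```

The production tool is free to read and write whatever it likes inside 
the comment; nothing downstream has to understand it.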

Maybe more concrete descriptions of the use cases here would help explain 
it and show it as being valid. You suggest there are multiple vendors 
involved, for example, but your original use case description didn't 
mention this. Can you give a concrete example of what you mean here?


> >> 2) Use in other embeddings.  MPEG has started work on specifying MP4 
> >> carriage of WebVTT in a track of the MP4 file. In this context, we 
> >> need some of the attributes that are carried in the HTML layer.  
> >> Some are already covered or partially covered (e.g. all tracks can 
> >> carry a language in MP4) but not all.  WebM embedding is also under 
> >> way.
> > 
> > This should be at the container level, not in VTT, IMHO. It is trivial 
> > for a container format to define how to include such information; even 
> > in the case of a format that can only embed data directly, the payload 
> > format can always be defined as being the data from the <track> 
> > element followed by the WebVTT data itself.
> > 
> > So this is not a use case for a name-value pair metadata header within 
> > WebVTT itself either.
> 
> It's true that most containers have some provision for some of this.  
> However, material that is specific to VTT is best placed within it, 
> IMHO.

It's not specific to WebVTT, though. It's specific to 
WebVTT-in-this-container.

What data are we talking about here, anyway? <track> only has four 
relevant attributes as far as I can tell; srclang="" will be dealt with 
inline via this bug:

   https://www.w3.org/Bugs/Public/show_bug.cgi?id=15922

kind="" and label="" are needed in the container format, since without 
them it's not clear how you would even know what to do with _any_ text 
track; they're not WebVTT-specific in the least. That leaves default="", 
which is not really necessary, but could trivially be supported in any 
container format if truly necessary just by prefixing the WebVTT payload 
with "DEFAULT" or some such. No need to add a big name-value pair 
infrastructure to WebVTT in order to pass 1 bit of data.


> >> 3) Side-band use in other contexts. In some delivery scenarios, it 
> >> makes sense for WebVTT caption files not to be embedded but carried in a 
> >> 'side-band' (e.g. in HTTP streaming systems), that is, loaded as a 
> >> side-file. In this case, we need the ability to carry attributes that 
> >> the referencing file does not carry.
> > 
> > Can you elaborate on this use case? What attributes? Why?
> 
> The attributes that the VTT file would have had if it had been embedded 
> in HTML, for a start;  language, kind, to name two explicitly.

Language will be handled by the bug cited above. I have no idea what you 
mean in the case of kind. Can you give a concrete example of "side-band 
delivery" of WebVTT where the rest of the system does not already support 
giving the kind of text track being delivered?


> >> 4) Style-sheets.  Maybe it's satisfactory to define that WebVTT 
> >> inherits styling from its container (e.g. HTML5), but in the case 
> >> where the container doesn't carry styling (e.g. HTTP streaming, MP4), 
> >> or in the case where specific styling is needed for the WebVTT, we 
> >> need to be able to reference or include style sheets in the WebVTT 
> >> layer itself. As an example, a style-sheet giving 608/708 appearance 
> >> is being worked on as part of the 608/708 conversion.
> > 
> > This is handled by the proposal(s) in:
> >   https://www.w3.org/Bugs/Public/show_bug.cgi?id=15023
> > 
> > This is not a use case for a name-value pair metadata header.
> 
> Except that in that very bug, Glenn helpfully formats the example into 
> exactly this general syntax.

So? He's wrong to do so. :-)

Sure, you can turn everything into a name-value syntax if you push hard 
enough. It doesn't mean that that's the right solution for every problem.

WebVTT already has a way to extend it to support new data blocks like 
style, as is also discussed in that bug. I don't see why we'd want to use 
a complicated name-value pair syntax for embedding CSS.


> >> 5) Time alignment. When WebVTT is used as the caption source for a 
> >> system where timestamps are from an arbitrary origin (e.g. a 
> >> continuous MPEG-2 Transport stream) we need a way to say that 
> >> 'timestamp X in this VTT file aligns with Timestamp Y in the media 
> >> stream' so as to get synchronization.  This is naturally put into the 
> >> header.
> > 
> > If there's a WebVTT file with fixed timestamps and a media stream with 
> > arbitrary timestamps, then the only place where it makes sense to put 
> > the synchronisation information is in the media stream. Putting it in 
> > the WebVTT stream makes no sense; if you are able to adjust that 
> > stream then why not just adjust the timestamps?
> 
> Pardon?  You're suggesting completely re-writing the timestamps in the 
> mpeg-2 transport stream so as to … do exactly what?  What we need is a 
> mapping, not a need to re-write whole streams.

I meant the WebVTT stream, in my last sentence.


> > Or even better, don't have arbitrary time stamps, and have both the 
> > media stream and the captions use the same timeline.
> 
> Great, are you ready for a new timestamp syntax in VTT, and a need to 
> re-write the entire file, when all that is needed is a mapping?

?

No new syntax, just change the timestamps by the offset.


> Also, what about the case (not unusual) when a pre-authored piece of 
> content is used as part of a broadcast?  In the broadcast, the media 
> will have a continuous timestamp flow that, effectively, has an 
> arbitrary origin.  When the pre-authored piece is transmitted, we want 
> to align the pre-authored caption file with its media.  That is much 
> more natural than re-writing the entire caption file, or (dream on) 
> re-writing the entire transport stream.

Re-writing the caption file is trivial. It's really just a few lines of 
code. In a larger broadcast you'll presumably want to embed the captions 
into the captions for the rest of the broadcast, so you'd want to do that 
anyway.
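
A minimal sketch of what "a few lines of code" could look like. It assumes 
timestamps of the form mm:ss.ttt or hh:mm:ss.ttt and rewrites only the cue 
timing lines (the ones containing "-->"); timestamps inside cue text are 
left alone. The function names are illustrative, not part of any existing 
tool.

```python
import re

# A WebVTT timestamp: optional hours, then mm:ss.ttt.
TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")


def to_ms(match):
    """Convert a TIMESTAMP match to milliseconds."""
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def from_ms(total_ms):
    """Format milliseconds as a timestamp, omitting hours when they are zero."""
    if total_ms < 0:
        raise ValueError("offset moves a timestamp before zero")
    total_s, millis = divmod(total_ms, 1000)
    total_m, seconds = divmod(total_s, 60)
    hours, minutes = divmod(total_m, 60)
    if hours:
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift_vtt(text, offset_ms):
    """Add offset_ms to both timestamps of every cue timing line."""
    out = []
    for line in text.splitlines():
        if "-->" in line:  # only timing lines may contain "-->"
            line = TIMESTAMP.sub(lambda m: from_ms(to_ms(m) + offset_ms), line)
        out.append(line)
    return "\n".join(out)


shifted = shift_vtt("WEBVTT\n\n00:11.000 --> 00:13.000\nHello", 60_000)
```

Here `shifted` has the timing line `01:11.000 --> 01:13.000`; everything 
else is passed through unchanged.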

We _could_, if it was for some reason deemed prohibitively expensive to 
add numbers together (which is really all you have to do to adjust the 
timestamps in the captions), introduce a syntax in WebVTT that offsets all 
timestamps by a certain amount, but name-value pairs at the top of the 
file would be a terrible way to do that. A much more sensible way would be 
to have a block syntax that introduced an offset, since then you could 
concatenate multiple files' cues without rewriting them, prefixing each 
block with the new offset.

As in:

   WEBVTT

   00:11.000 --> 00:13.000
   <v Roger Bingham>We are in New York City

   OFFSET -01:00.000

   01:13.000 --> 01:16.000
   <v Roger Bingham>We're actually at the Lucern Hotel, just down the 

...or some such.
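
To make the idea concrete, here is a rough sketch of how a consumer might 
resolve such blocks into plain timestamps. The OFFSET block is purely the 
hypothetical syntax from the example above, and the semantics are an 
assumption: the offset is added to every timestamp that follows it, and a 
later OFFSET replaces the earlier one rather than accumulating. All names 
are illustrative.

```python
import re

TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")
# Hypothetical block syntax: "OFFSET", whitespace, optional sign, timestamp.
OFFSET_LINE = re.compile(r"OFFSET\s+([+-]?)(\S+)$")


def parse_ms(stamp):
    match = TIMESTAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def format_ms(total_ms):
    if total_ms < 0:
        raise ValueError("offset moves a timestamp before zero")
    total_s, millis = divmod(total_ms, 1000)
    total_m, seconds = divmod(total_s, 60)
    hours, minutes = divmod(total_m, 60)
    prefix = f"{hours:02d}:" if hours else ""
    return f"{prefix}{minutes:02d}:{seconds:02d}.{millis:03d}"


def resolve_offsets(text):
    """Apply each OFFSET block to the timing lines after it and drop the block."""
    offset = 0
    prev_blank = True  # an OFFSET only counts at the start of a block
    out = []
    for line in text.splitlines():
        found = OFFSET_LINE.match(line.strip()) if prev_blank else None
        prev_blank = line.strip() == ""
        if found:
            sign = -1 if found.group(1) == "-" else 1
            offset = sign * parse_ms(found.group(2))  # assumed: replaces, not adds
            continue
        if "-->" in line:
            line = TIMESTAMP.sub(
                lambda m: format_ms(parse_ms(m.group(0)) + offset), line
            )
        out.append(line)
    return "\n".join(out)


resolved = resolve_offsets(
    "WEBVTT\n\n"
    "00:11.000 --> 00:13.000\n<v Roger Bingham>We are in New York City\n\n"
    "OFFSET -01:00.000\n\n"
    "01:13.000 --> 01:16.000\n<v Roger Bingham>We're actually at the Lucern Hotel"
)
```

Under those assumptions the second cue resolves to 00:13.000 --> 00:16.000, 
and concatenating files only requires writing one OFFSET line per file.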

But if we did this (support WebVTT concatenation, essentially) we'd also 
presumably want block-level language selection, e.g. for when an American 
TV channel broadcasts a Spanish-language or British show, or when a UK TV 
channel switches between Welsh and English programmes.


> Can you explain why you want to resist what many of us see as a natural 
> direction to go?

Two reasons.

First, there have really not been any compelling use cases. All the use 
cases presented are either better handled in other ways in WebVTT (e.g. 
how to embed styles, offsets), or are already handled sufficiently by 
WebVTT now or WebVTT with other additions like the block comment syntax 
(e.g. anything involving proprietary workflow additions only needed during 
production). Adding a feature that doesn't have compelling use cases is a 
recipe for disaster.

Second, what we have seen with HTML is that providing arbitrary name-value 
pair syntax that anyone can plug into tends to lead authors down this 
massive rabbit hole of timewasting. People see name-value pair metadata 
syntax and they go crazy adding all kinds of metadata in random syntaxes 
to it, with no common vocabulary, no common processing model, and with 
absolutely no idea what is ever going to consume it. And then: nothing 
consumes it. It is a _huge_ waste of time. Nothing _can_ consume it, 
because the data is of such poor quality (having never been tested) and is 
of so many different formats (there being either no standard or so many 
standards for how to expose it). Formats that have no general name-value 
pair syntax, e.g. CSS and JavaScript, have not suffered the _slightest_ 
for it. People still put their proprietary data in those formats (e.g. 
"javadoc"-style documentation in JavaScript), but they do so _when they 
need it_, with testing, with consumers. They include their copyrights in 
comments, and are none the worse for it. You don't get week-long threads 
on forums of people asking what syntax their copyright metadata in CSS 
should be, because the answer is trivial: put it in a comment.


> You even proposed a syntax for it, yet you seem to be reaching for 
> reasons not to do it.

Honestly I feel like it is you who is reaching for reasons to do it.

But having said that: we _should_ _always_ be looking for reasons _not_ to 
do something: every time we add a feature to the Web platform, it has 
massive long-term costs. We should be hugely reluctant to do so. It is our 
responsibility as language designers to keep everything out of our 
languages unless the cost is justified by the massive gains. The default 
answer to every proposal should be "no" followed only then by "why?". If 
we can't find a _strong_ justification, we should not include it.

(I often examine the solution space while looking at use cases in 
parallel. It is no commentary on the validity of use cases.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 30 August 2012 00:47:02 UTC