
Re: meta-data in the VTT file header, a strawman proposal

From: David Singer <singer@apple.com>
Date: Mon, 27 Feb 2012 09:39:40 -0800
Cc: public-texttracks@w3.org
Message-id: <C9BE9454-FA78-41C9-B120-E1335929A539@apple.com>
To: Ralph Giles <giles@mozilla.com>

Hi Ralph

thanks for the comments.


On Feb 24, 2012, at 19:36 , Ralph Giles wrote:

>> The 'meta-data' consists of a series of keyword-value pairs,
>> separated by an equals ('=') sign, with white space around the
>> equals sign ignored.
> 
> As I mentioned on the bug, I think using a colon (':') as the
> key-value separator is easier to read in such a text-oriented format.
> I am inspired by RFC 822 and HTTP message headers, which unlike HTML
> attributes have an established multi-line extension.

hm, key=value seems pretty natural, but maybe it's not worth bikeshedding on this character.  However, there is a snag: as we say below, it might be nice to use HTML5 attribute name syntax, and ":" is a valid part of a name, whereas = is not. In addition, it seems natural to use the HTML attribute=value syntax.

"8.1.2.3 Attributes

Attributes for an element are expressed inside the element's start tag.

Attributes have a name and a value. Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name."


> 
>> 1) plain keywords, consisting of the upper and lower-case 'ASCII'
>> letters, and numerals, only;
> 
> I think we should match the attribute character set. Webvtt documents
> themselves are utf-8, so I don't see the value of this restriction.

Agreed.

> 
> I would however recommend that any keywords this group defines respect
> this restriction, or something like it.

Right, I'll revise to say "it's an HTML5 attribute name, with all names NOT [containing the '-' character | starting with 'X-'] reserved to this specification."

> 
>> Case is not significant in keywords.
> 
> We can likewise borrow the 'ASCII case-insensitive' matching used by
> HTML attributes to extend this to utf-8 keywords. That is, ASCII
> characters are case insensitive, but other unicode characters are
> compared exactly as codepoints.

Agreed again.

> 
>> 2) qualified keywords, that contain one or more dash ('-')
>> characters
> 
> I don't think vendor prefixes have been a success in css. I'd rather
> see us agree on things quickly and/or reuse the _attr or x-vendor-attr
> extensions from the HTML spec.

I am pretty sure that if we don't define what's experimental/vendor and what's reserved, then people will simply make it up.  I'd rather have a defined sandbox.  You use X- as that prefix, below.  I've written this email as a choice between 'containing -' and 'starting with X-'.

> 
>> [multiline quotes with [[..]] and '\' to escape empty lines] Note
>> that it is an error to encounter a blank or white-space only line,
>> while accumulating lines before the closing ']]' line; such lines
>> are required to have been escaped.
> 
> Do you propose that the whole file be rejected in that case?

How you recover from an error is a matter for debate, for sure.  For the moment, I just said it is an error.

> A parser
> ignoring metadata is just looking for a repeated eol sequence, so
> whitespace-only lines aren't really a problem.

Agreed, but it's visually confusing, so I thought it safer to require the escape.

> If there's an unescaped
> newline, the parser will try to interpret the remaining metadata as a
> cue, which will fail unless it happens to contain a timestamp.

Yes.

> 
> Message-Header style gives us another way to escape continued lines,
> which is to indent them with whitespace. This isn't any harder
> programmatically, but is easier to read. (You did it anyway in your
> example!) Might be more confusing to debug blank line escapes though.

Yes. I also toyed with SMTP-like multi-line (no opening bracket, and the end is a period on a line by itself), but I don't think that's so easy to read, visually, and seems more error-prone.  Few people read SMTP input; people do expect to read VTT files.  I am guessing that for many cases, no escaping at all will be needed (']]' on a line by itself is pretty unlikely in CSS, as are lines starting with \, and stylesheets don't need to have blank lines).

> 
>> Examples:
>> 
>> kind=captions
>> examplecompany-test = for steve
>> initialTStimestamp=162642774
>> stylesheetURL=http://www.example.com/vtt-plain.css
>> stylesheet=     [[
>> p { font-size: 100px; }
>> \
>> p::first-line {
>>   background: url(http://www.w3.org/StyleSheets/TR/logo-REC) no-repeat;
>>   font-size: 10px;
>>   span { border-left: solid 1em black; }
>> }
>> ]]
>> srclang=en-US
>> label=Zeroes for King!
> 
> My suggestion would look like:
> 
> Kind: captions
> X-examplecompany-test: for steve
> Timestamp-offset: 162642774
> StylesheetURL: http://www.example.com/vtt-plain.css
> Style:
> p { font-size: 100px; }
> 
> p::first-line {
>   background: url(http://www.w3.org/StyleSheets/TR/logo-REC) no-repeat;
>   font-size: 10px;
>   span { border-left: solid 1em black; }
> }
> srclang: en_US
> Label: 𝟎s for 王!
> 

what terminates the block, in your syntax?

> There is one argument for using '=' instead of ':' as the separator,
> which is that namespaced xhtml attributes contain colons, but cannot
> contain an equals sign. If that's a concern (a la dc:creator) we can
> make the separator ':' plus whitespace.
> 
> We could avoid the stylesheetURL keyword by using "style: @import('url')".

True, but then we introduce CSS @ rules, which might be a can of worms, or might be useful.




My inclination is to stay with the '=' sign, as it's reserved by the attribute syntax, and go with something like this, but we need input from others, please.

* * * * * * *

A VTT file may have non-cue data after the VTT header line, and before the blank line that precedes the first cue.  That non-cue data consists of meta-data, and maybe comments.

(I don't discuss comment syntax here)

The 'meta-data' consists of a series of keyword-value pairs, separated by an equals ('=') sign, with white space around the equals sign ignored.

Keyword names are attribute names as defined in 8.1.2.3 of HTML5, with name matching using the same rules. The keyword namespace is divided into two:

1) plain keywords, that [do not contain the '-' character | do not start with 'X-']
2) qualified keywords, that [contain one or more dash ('-') characters | start with 'X-']


All plain keywords are reserved for the specification.  Qualified keywords may be used for experimental or vendor-specific attributes; qualified keywords will not be specified by this specification.  It is recommended that vendor-specific keywords use the vendor name [before the first dash | immediately after the 'X-'].
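
As an illustration only, here is a minimal Python sketch of the keyword rules above. The function names and the `rule` parameter are hypothetical, not part of the proposal; `rule` simply selects between the two bracketed alternatives. Matching is ASCII case-insensitive, so only A-Z are folded and other code points are compared exactly.

```python
# Hypothetical helpers; names are illustrative, not from the proposal.

# Map only 'A'..'Z' to 'a'..'z'; every other code point is left untouched,
# which is what "ASCII case-insensitive" matching requires.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def ascii_lower(name: str) -> str:
    return name.translate(_ASCII_LOWER)


def same_keyword(a: str, b: str) -> bool:
    """True if two keywords match ASCII case-insensitively."""
    return ascii_lower(a) == ascii_lower(b)


def is_qualified(name: str, rule: str = "dash") -> bool:
    """Classify a keyword as qualified (experimental/vendor) or plain.

    rule="dash":     qualified keywords contain one or more '-' characters.
    rule="x-prefix": qualified keywords start with 'X-' (case-insensitively).
    """
    if rule == "dash":
        return "-" in name
    if rule == "x-prefix":
        return ascii_lower(name).startswith("x-")
    raise ValueError(f"unknown rule: {rule!r}")
```

Under the 'dash' alternative, 'examplecompany-test' is qualified; under the 'X-' alternative it would be a plain (reserved) keyword and would have to be written 'X-examplecompany-test' instead.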

The value may be a single-line value, or a multi-line value. In either case, the line-break at the end of the value does not form part of the value.

A single-line value extends from the first non-white-space character after the '=', to the end of line.

A multi-line value has the character pair '[[' as the only non-white-space characters on the keyword line; the line break after the '[[' does not form part of the value.  The value is the set of lines starting at the next line, and ending at a line containing only the characters ']]' (not including the line break preceding the ']]').  Within that set of lines (the 'body'), escaping is performed by adding a '\' character at the start of the line to all lines that are one of:
* lines that start with a '\'
* lines that are empty or consist solely of white-space
* lines that consist of only ']]'

On reception, the client removes the leading '\' character from each line of the body that starts with one.  Note that it is an error to encounter a blank or white-space-only line while accumulating lines before the closing ']]' line; such lines are required to have been escaped. (Error recovery is TBD.)
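
The escaping rules can be sketched in Python as follows. This is a rough illustration under two assumptions that the text above does not settle: the client strips exactly one leading '\' per line (which is what makes the escaping reversible), and a ']]' line is matched exactly, with no surrounding white space. The function names are hypothetical, and raising an exception is only a placeholder for the undecided error recovery.

```python
# Hypothetical helpers; names are illustrative, not from the proposal.

def needs_escape(line: str) -> bool:
    """A body line must be escaped if it starts with '\\', is empty or
    white-space only, or consists of only ']]'."""
    return line.startswith("\\") or line.strip() == "" or line == "]]"


def escape_body(lines):
    """Sender side: prefix a '\\' to every line that needs escaping."""
    return ["\\" + line if needs_escape(line) else line for line in lines]


def unescape_body(lines):
    """Receiver side: strip one leading '\\' from each line that has one.

    A blank or white-space-only line here was not escaped, which the
    proposal calls an error; raising is an assumption for this sketch.
    """
    out = []
    for line in lines:
        if line.strip() == "":
            raise ValueError("unescaped blank line in multi-line value")
        out.append(line[1:] if line.startswith("\\") else line)
    return out
```

With these definitions, unescape_body(escape_body(lines)) returns the original lines for any list of lines.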


Examples:

kind=captions
examplecompany-test = for steve
initialTStimestamp=162642774
stylesheetURL=http://www.example.com/vtt-plain.css
stylesheet=     [[
p { font-size: 100px; }
\
p::first-line {
  background: url(http://www.w3.org/StyleSheets/TR/logo-REC) no-repeat;
  font-size: 10px;
  span { border-left: solid 1em black; }
}
]]
srclang=en-US
label=Zeroes for King!
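
To show how a client might read the example above, here is a short Python sketch of a parser for the meta-data lines. It is not a definitive implementation: `parse_metadata` and `EXAMPLE` are hypothetical names, the input is assumed to be already split into lines with the header line and comments removed, one leading '\' is stripped per body line, and errors raise an exception because recovery is still undecided.

```python
# Hypothetical sketch; names and error handling are assumptions.

def parse_metadata(lines):
    """Return a list of (keyword, value) pairs from meta-data lines."""
    pairs = []
    it = iter(lines)
    for line in it:
        keyword, sep, rest = line.partition("=")
        if not sep:
            raise ValueError(f"no '=' in meta-data line: {line!r}")
        keyword = keyword.strip()
        if rest.strip() != "[[":
            # Single-line value: from the first non-white-space character
            # after '=' to the end of the line.
            pairs.append((keyword, rest.lstrip()))
            continue
        # Multi-line value: accumulate lines until a line that is only ']]'.
        body = []
        for body_line in it:
            if body_line == "]]":
                break
            if body_line.strip() == "":
                raise ValueError("unescaped blank line in multi-line value")
            body.append(body_line[1:] if body_line.startswith("\\") else body_line)
        else:
            raise ValueError("missing closing ']]'")
        pairs.append((keyword, "\n".join(body)))
    return pairs


EXAMPLE = [
    "kind=captions",
    "examplecompany-test = for steve",
    "initialTStimestamp=162642774",
    "stylesheetURL=http://www.example.com/vtt-plain.css",
    "stylesheet=     [[",
    "p { font-size: 100px; }",
    "\\",
    "p::first-line {",
    "  background: url(http://www.w3.org/StyleSheets/TR/logo-REC) no-repeat;",
    "  font-size: 10px;",
    "  span { border-left: solid 1em black; }",
    "}",
    "]]",
    "srclang=en-US",
    "label=Zeroes for King!",
]

for keyword, value in parse_metadata(EXAMPLE):
    print(f"{keyword!r}: {value!r}")
```

A list of pairs is returned rather than a dictionary because some keywords (stylesheet, stylesheetURL) may occur more than once.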


Specified keywords:

The attributes of the HTML5 track element are all valid keywords; each may occur only once.

In addition, the following are specified:
stylesheet - the value is a style sheet in CSS syntax; may occur more than once
stylesheetURL - the value is a URL for a CSS style sheet; may occur more than once. If the URL is relative, it is relative to the VTT file's URL
…
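
As a small illustration of the relative-URL rule for stylesheetURL, the Python sketch below resolves the value against the VTT file's own URL using the standard library's `urljoin`. The function name and the VTT file URL are made up for the example.

```python
from urllib.parse import urljoin


def resolve_stylesheet_url(vtt_url: str, value: str) -> str:
    """Resolve a stylesheetURL value relative to the VTT file's URL.

    Absolute values are returned unchanged by urljoin.
    """
    return urljoin(vtt_url, value.strip())


# Hypothetical location of the VTT file, for illustration only.
VTT_URL = "http://www.example.com/media/captions.vtt"

print(resolve_stylesheet_url(VTT_URL, "vtt-plain.css"))
print(resolve_stylesheet_url(VTT_URL, "http://www.example.com/vtt-plain.css"))
```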


David Singer
Multimedia and Software Standards, Apple Inc.
Received on Monday, 27 February 2012 17:40:42 UTC
