Re: WebVTT feedback from Ian Hickson on 2011-12-06 (public-texttracks@w3.org from December 2011)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 6 Dec 2011 00:38:14 +0000 (UTC)
To: "public-texttracks@w3.org" <public-texttracks@w3.org>
Message-ID: <Pine.LNX.4.64.1112052346250.9078@ps20323.dreamhostps.com>

On Thu, 1 Dec 2011, Glenn Maynard wrote:
> On Thu, Dec 1, 2011 at 7:34 PM, Ian Hickson <ian@hixie.ch> wrote:
> > > 
> > > But it doesn't have to, since HTML does this with @lang.
> >
> > HTML doesn't do any font selection or word wrapping.
> >
> > Per the HTML and CSS specs, lang="" has no effect on rendering.
> 
> Huh?  I'm confused: 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-lang-and-xml:lang-attributes
> 
> "User agents may use the element's language to determine proper 
> processing or rendering (e.g. in the selection of appropriate fonts or 
> pronunciations, or for dictionary selection)."

Hm, good point.


> If wrapping to similar line lengths is what's wanted, then the VTT 
> renderer (and so CSS) should support that.

Possibly. It isn't supported currently; it would require a new 
'white-space' value. Given that, it would be easy to support in VTT: just 
have authors not use line breaks at all, and let the renderer do it.


> > What's the use case, though? If it's notes to a translator, or notes 
> > about uncertain captioning, presumably you would want to strip those 
> > out before publishing the captions.
> 
> Commercial translations would probably strip comments, but we just left 
> them in--if people want to poke at the comments to see why we translated 
> something one way or another, that's cool.
> 
> I'm fine with leaving this to styling (removing them if necessary for 
> publishing); this isn't important enough to do more than that for.  
> (Note that this may lead to content in the wild that breaks if CSS is 
> disabled, especially if tools use this method; people won't always strip 
> comments. I'm not worried about that--I don't personally consider it 
> reasonable to expect all content to render sanely with stylesheets 
> disabled.)

I guess I'd have to see evidence that anyone actually wants to include 
inline comments like this. I don't recall seeing any in the files I 
examined when researching captions.


On Sat, 3 Dec 2011, Philip Jägenstedt wrote:
> 
> We're going to be doing the same script detection heuristics that we do 
> on web pages. Differentiating between simplified Chinese, traditional 
> Chinese and Japanese isn't particularly hard.

Can we define these for interoperability, or are they proprietary? (I 
don't imagine people writing their own small WebVTT implementations are 
going to know how to do this if we don't have a spec.)


On Fri, 2 Dec 2011, Simon Pieters wrote:
> > 
> > > 00889: seconds too few '(^|\s|>)\d+[:\.,]\d+[:\.,]\d([:.,-]|\s|$)'
> > 
> > We could zero pad by default, but that seems a bit dodgy -- what if the
> > missing digit is not zero? Given how rare this is, I wonder what causes
> > it. Is it hand-authoring mistake? Were the seconds always "0" in these
> > cases, or were they non-zero seconds?
> 
> They were (line: count):
> 
> 00:00:0,300 --> 00:00:12,00: 15
> 00:00:0,500 --> 00:00:10,00: 101
> 00:00:0,500 --> 00:00:13,00: 58
> 00:00:0,500 --> 00:00:15,00: 1
> 00:00:0,500 --> 00:00:2,00: 672
> 00:00:0,500 --> 00:00:9,00: 1
> 00:00:0,779 --> 00:00:05,500: 1
> 00:00:05,540 --> 00:00:7,000: 1
> 00:00:07,840 --> 00:00:9,440: 3
> 00:00:30,000 --> 00:01:0,000: 1
> 00:00:5,501 --> 00:0:09,870: 1
> 00:00:6,449 --> 00:00:9,449: 1
> 00:04:9,183 --> 00:04:12,867: 1

These all look like missing zero-padding.

> 00:02:23,256 --> 00:02:2: 1
> 00:03:15,737 --> 00:03:7,781: 1
> 00:08:45,300 --> 00:08,0: 1

These do not; they look like missing data.

I didn't look at the rest.

The conclusion, it seems to me, is that we're better off not fixing this 
mistake.
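
As a rough illustration of what "not fixing" means in practice, here is a 
minimal sketch of a strict timestamp check. The function name and the 
choice to accept "," as well as "." are assumptions made for this example 
(the surveyed lines use SRT-style commas); it is not the parser algorithm 
from the spec.

```python
import re

# Strict form: two-or-more-digit hours, two-digit minutes and seconds,
# and exactly three fractional digits. Accepting "," as well as "." is
# an assumption made for this sketch.
STRICT_TIMESTAMP = re.compile(r"(\d{2,}):([0-5]\d):([0-5]\d)[.,](\d{3})")


def parse_strict(ts):
    """Return the time in milliseconds, or None if ts is malformed."""
    m = STRICT_TIMESTAMP.fullmatch(ts)
    if m is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in m.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```

With this check, a well-formed time such as "00:00:05,540" parses, while 
"00:00:0,500" (one-digit seconds) and "00:02:2" (missing data) are both 
rejected rather than guessed at.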


> > > 00922: spaces in timestamp '(\d[\d\s]*[:\.,]\s*){2,3}\d[\d\s]*' and not
> > > '(\d+[:\.,]){2,3}\d+'
> > 
> > Odd. Anecdotally, any idea what was going on with these?
> > 
> > How many files did this affect?
> 
> It looks like most of it is from four files having all their timestamps as:
> 
> 01: 15: 06, 493 --> 01: 15: 07, 776

Ok. Not worth supporting then.


> > > 02085: decimals too few
> > > '(^|\s|>)\d+[:\.,]\d+[:\.,]\d+[:\.,]\d{1,2}(\s|$|-)'
> > 
> > I wonder if people saying 0.1s mean 0.001s or 0.100s. If the latter, we
> > can probably support this without too much trouble. Do you have any
> > insight into this? e.g. what were the other times around such shortened
> > times? Were they also short? e.g. did it ever go ...:0.9 --> ...:0.800
> > (meaning the time was intended to be milliseconds), or was 0.9 always
> > followed by a time greater than a second later (meaning the time was
> > intended to be a fraction)?
> 
> The cases with more than one occurrence are (line: count):
> 
> 00:00:0,300 --> 00:00:12,00: 15
> 00:00:0,500 --> 00:00:10,00: 101
> 00:00:0,500 --> 00:00:13,00: 58
> 00:00:0,500 --> 00:00:2,00: 672
> 00:17:32,831 --> 00:17:34,38: 2
> 00:18:08,854 --> 00:18:11,04: 2
> 00:34:02,065 --> 00:34:05,8: 3
> 00:45:10,083 --> 00:45:13,22: 2
> 01:05:39,669 --> 01:05:41,83: 2
> 01:06:45.00 --> 01:06:49.75: 2
> 
> Of those, it's not obvious to me what was intended (except the "00" cases
> which are likely intended to be "000", though most of them also have too few
> seconds in the same line).

Yeah, that's weird. I think we should skip supporting these, too.
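
To make the ambiguity concrete, here is a small sketch of the two readings 
of a shortened fraction. Both helper names are hypothetical and exist only 
for this illustration; neither is proposed as parser behaviour.

```python
def fraction_as_decimal(digits):
    """Read "8" as 0.8 s: right-pad to three digits, giving 800 ms."""
    if not (digits.isdigit() and 1 <= len(digits) <= 3):
        raise ValueError("expected one to three digits")
    return int(digits.ljust(3, "0"))


def fraction_as_millis(digits):
    """Read "8" as a bare millisecond count, giving 8 ms."""
    if not (digits.isdigit() and 1 <= len(digits) <= 3):
        raise ValueError("expected one to three digits")
    return int(digits)
```

For the end time "00:34:05,8" the two readings differ by 792 ms (800 versus 
8), and for "00:18:11,04" by 36 ms (40 versus 4). Nothing in the data says 
which one the author meant.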


On Fri, 2 Dec 2011, Simon Pieters wrote:
> > 
> >   WEBVTT
> >   language: fr
> 
> Experience with <script language> suggests that people have trouble 
> spelling it. If we add it, I suggest it be called "lang".

Good point.

One other problem with the syntax above is that it conflicts with our 
metadata space -- people might end up doing things like:

   WEBVTT
   author: foo
   lang: en
   copyright: bar

...and somehow think that "author" and "lang" are equally meaningful. This 
would also mean that we would end up with a very polluted namespace which 
would hurt us if we ever want to add another field here.

One option is to just have it cue-level, and then rely on the "DEFAULTS" 
idea to make it file-wide (essentially block-level), and not bother with a 
system that is only file-wide.


On Fri, 2 Dec 2011, Simon Pieters wrote:
>
> IANA considerations says:
> 
> Magic number(s):
> WebVTT files all begin with one of the following byte sequences:
> 
> EF BB BF 57 45 42 56 54 54 0A
> EF BB BF 57 45 42 56 54 54 0D
> EF BB BF 57 45 42 56 54 54 20
> EF BB BF 57 45 42 56 54 54 09
> 57 45 42 56 54 54 0A
> 57 45 42 56 54 54 0D
> 57 45 42 56 54 54 20
> 57 45 42 56 54 54 09
> (An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab,
> or line break.)
> 
> However, the parser accepts files that contain only "WEBVTT" (or BOM 
> followed by "WEBVTT"), without a newline. The syntax requires two 
> newlines, though.
> 
> Is the above intended to match what the parser accepts? Or what the 
> syntax allows?

To rephrase your question, you're asking if it's intentional that a WebVTT 
file that is empty (containing only the WEBVTT signature, without even a 
newline after it) would not be sniffed as WebVTT, yet would not fire an 
'error' event if parsed as WebVTT?

I've fixed it.
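
For reference, the mismatch can be sketched as a signature check with a 
switch for the bare-signature case. The function name and the 
accept_bare_signature parameter are assumptions made for this 
illustration, and the sketch does not claim to show which way the spec 
text was changed.

```python
BOM = b"\xef\xbb\xbf"
SIGNATURE = b"WEBVTT"
# Line feed, carriage return, space, tab: the bytes listed in the
# quoted IANA magic-number section.
TERMINATORS = (0x0A, 0x0D, 0x20, 0x09)


def looks_like_webvtt(data, accept_bare_signature=False):
    """Check the leading bytes of a file against the WEBVTT signature."""
    if data.startswith(BOM):
        data = data[len(BOM):]
    if not data.startswith(SIGNATURE):
        return False
    rest = data[len(SIGNATURE):]
    if not rest:
        # File ends right after "WEBVTT": the case under discussion.
        return accept_bare_signature
    return rest[0] in TERMINATORS
```

With accept_bare_signature left False, this mirrors the quoted byte 
sequences, so a file consisting only of "WEBVTT" is not recognised; 
setting it to True mirrors what the parser was described as accepting.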


On Fri, 2 Dec 2011, Simon Pieters wrote:
>
> IANA considerations says:
> 
> [[
> Interoperability considerations:
> Rules for processing both conforming and non-conforming content are defined in
> this specification.
> 
> Some legacy files violate the requirement to use UTF-8.
> ]]
> 
> I think the statement about UTF-8 is about SRT files. However, since 
> WebVTT is intentionally incompatible, that legacy is not relevant to 
> WebVTT interop, so it is probably good to drop that paragraph.

Fixed.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 6 December 2011 00:38:48 UTC