RE: WebVTT (was RE: TTML Agenda for 15/05/13 - Proposed updates to charter)

OK. I got a bit carried away there with my hyperbole, and I apologize unreservedly for that. No personal attack intended; Ian is a very smart guy and has obviously put a lot of time into this, as have many other smart people, and I respect that. Don’t get me wrong, I am genuinely trying to get my head around this, and am attempting not only to understand what this spec actually says but also, if possible, why it says it. As Glenn once quipped, doing spec work is the art of reading carefully, which is what I am attempting to do. I'm prepared for the fact that the spec might be buggy, and that’s to be expected, but since the spec offers almost no redundancy through motivational text, pictures or examples, and this model is so unlike anything else, it's really hard to spot what is intentional and what is a bug.

My major concern at this point is how stable this spec is. If we (Microsoft) put an implementation out in the world, it can be very hard to retract if it's wrong, and as you are no doubt aware, if the spec subsequently changes this is often used by detractors to berate those efforts. If there are parts of the spec that are known to be wrong or under dispute (i.e. more than normally unstable, as obviously it's all under development), can you please, as a matter of some urgency, mark them in some way? There is no need to have a fix at this point, but some highlight and/or a link to a relevant entry in the Bugzilla database would be very helpful. I have read through the extant bugs, and I think I have a handle on the things that you are trying to fix, but I cannot be 100% sure I got everything, and many of the threads there peter out inconclusively, so it's hard to know what their current status is.

My secondary concern, however, is whether, even when corrected, it specifies something that is genuinely useful, and at this point I am struggling with that. One part of that concern stems of course from whether this could in practice be used as a delivery format for content stored in TTML or other formats, either as a conversion in the browser or in a server somewhere. My feeling at this point is that VTT does not offer a sufficiently general positioning mechanism to allow that to happen.

I cannot look at your source code for legal reasons, so I'll have to take your word on how close that implementation is to the written spec. I can't know what value you chose for the bottom margin, but it does seem to me that the spec requires you to put text in that margin. And if the margin is small (I arbitrarily chose 1% of the video height for the top and bottom, since apparently 0 is not allowed), then I believe the spec requires you to place text off-screen.

>> Where do you get that from? I don't think that's correct - why would it drop half the line?
As to an underlying rationale, I can offer no opinion, but I base my reasoning on the following:

>If the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is set
   y-position -> 0
...
>14.Adjust the positions of boxes according to the appropriate steps from the following list:
  >If cue's text track cue snap-to-lines flag is set (it is)
  > margin = 1vh
             - In the absence of overscan, this value should be picked for aesthetics (to avoid text being aligned precisely
               on the bottom edge of the video, which can be ugly).
                Notwithstanding that 'aesthetics' is a remarkably subjective term to use in such an algorithm-oriented specification,
                this important value is UA dependent, which means vertical position is actually not predictable for authoring.

  > Let full dimension be the height of video's rendering area  = 100vh
  > 3.Let max dimension be full dimension - (2 × margin).    (margin = 1  max = 98)
 > 4. Let step be the height of the first line box in boxes.  (step -> 6 based on line-height:normal =>  1.2 and a font height of 5vh from section 5.2.2)
 > 6.Let line position be the text track cue computed line position.
                >> 7. Let n be the number of text tracks whose text track mode is showing and that are in the media element's list of text tracks before track.
                          n -> 0
                >> 8. Increment n by one.
                          n -> 1
                 >> 9. Negate n.
                          n -> -1
                 >> 10. Return n.
     line position -> -1
> 8.Let position be the result of multiplying step and line position
         Position -> -6
> 10.If line position is less than zero then increase position by max dimension, and negate step
         Position = 92
         Step = -6
> Move all the boxes in boxes down by the distance given by position.
   
So where top was 0, it is now 92vh, which leaves room for one line and a bit. Since the spec requires partial lines to be removed, only the first line shows up (or the text gets clipped, depending on the outcome of that bug issue). Given that the example text is 50 characters long, and with a font advance of ~3vw, it's probable that this text needs to break into at least two lines, at least for the purposes of determining the container height; it may subsequently be restyled to a different size, and thus reflow, but I believe that does not change the layout box.
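
A minimal Python sketch of the arithmetic above, following only the steps quoted in this message. The function names and the default values (1vh margin, 6vh step, 100vh rendering area) are assumptions taken from the working above, not values fixed by the spec.

```python
def computed_line_position(showing_tracks_before):
    """Quoted steps 7-10: count earlier showing tracks, add one, negate."""
    return -(showing_tracks_before + 1)


def snap_to_lines_position(line_position, margin_vh=1.0, step_vh=6.0, full_vh=100.0):
    """Apply the quoted snap-to-lines steps; all lengths are in vh (assumed values)."""
    max_dimension = full_vh - 2 * margin_vh   # quoted step 3
    position = step_vh * line_position        # quoted step 8
    if line_position < 0:                     # quoted step 10
        position += max_dimension
        step_vh = -step_vh
    return position, step_vh


line = computed_line_position(0)              # a single showing track
position, step = snap_to_lines_position(line)
print(line, position, step)                   # -1 92.0 -6.0
```

With a different UA-chosen margin the same cue lands somewhere else, which is the predictability concern raised above.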

> Firstly, "line:60%" has no influence on it disappearing (it's 60% down from the top of the video).  
True, it's not the reason it disappears, but it is not 60% down from the top of the video, at least not according to the written text. Although, as you say, that appears to be buggy, the text requires the first line to be at 42% down, because in the repositioning section:

> Let x be a percentage given by the text track cue text position, and let y be a percentage given by the text track cue computed line position
> 2.Position the boxes in boxes such that ... the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area....

Thus line=60% causes the 60% point of the box to be placed at the 60% point of the height of the video. The height of the bounding box is determined by the font advance (not specified, so again the final value is in fact UA dependent), but probably around 3vw * width, and the amount of text.
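
To make the 42% figure concrete, here is a small Python sketch of the quoted repositioning rule for one axis. The function name is hypothetical, and the 30vh box height is an assumption carried over from the working in the earlier message quoted further down; a different font or line count would give a different result.

```python
def start_edge_after_reposition(percent, box_extent, video_extent=100.0):
    """Place the point `percent`% along the box at `percent`% along the video.

    Returns the resulting start edge (top or left) of the box, in the same
    units as video_extent (vh or vw here).
    """
    target = video_extent * percent / 100.0      # where the anchor point must land
    offset_in_box = box_extent * percent / 100.0  # anchor point measured from the box edge
    return target - offset_in_box


top = start_edge_after_reposition(60, 30)  # line:60%, assumed 30vh-high box
print(top)                                 # 42.0
```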

> However, you're telling the browser to position a cue of 50% width middle aligned at the left side of the video. At the left there is no space for a middle aligned cue, so as much as the browser is trying to squeeze this cue  in, it wasn't given any space and thus the cue disappears (1 char long is a bug in an implementation). So it disappears. 

Well, my expectation was, possibly naively, that the cue would subsequently be moved to have its left edge at 0vw, so in actuality there is plenty of room for it. The size constraint appears to happen at the wrong time, and is IMO actually unnecessary. Just define the video viewport to clip all cues, and let the author be responsible for keeping their content visible. That is the CSS way of things.
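
The size constraint under discussion can be sketched as follows. This is a minimal Python illustration of the behaviour as it is described in this thread for middle-aligned cues (a box centred on the position value and clamped to fit the video), not a quotation of the spec; the function names are hypothetical.

```python
def middle_aligned_max_size(position):
    """Widest box (in % of video width) that fits when centred on `position`."""
    return 2 * min(position, 100 - position)


def effective_size(position, requested_size):
    """Clamp the requested cue size to what fits around `position`."""
    return min(requested_size, middle_aligned_max_size(position))


print(effective_size(0, 50))    # 0  -> the zero-width box at the left edge
print(effective_size(10, 50))   # 20 -> the 20% wide cue centred on the 10% mark
```

Because the clamp is applied before any later move of the box, the requested 50% is already lost by the time the box could be shifted back into view.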

Whether it shows 1 char or not is an ambiguity in the spec, since as far as I can tell (and I have looked pretty hard) the special breaking rules do not define what happens when the width is less than 1 character, and CSS would, I believe, allow overflow to occur in this case. Also, I'd point out that those special breaking rules don’t seem to work particularly well for internationalization. Japanese, for example, is pretty strict about where you can break lines, and in languages like Arabic, where words are essentially continuous marks, it doesn't work very well if you aren’t at least a little selective in where you break.

Even in English you need to be somewhat careful about where you break so as not to damage the meaning of the text. For example,

                His clean but
                  toned coat

Reads quite differently from

                   His clean
              buttoned coat.

Even though the former has a better 'delta'. 

Now, if the spec truly requires an absolute minimum delta, the proper breaking would be something like the truly awful
  
              His clean bu
              ttoned coat

The relative weighting of the two rules is not clear, so there is definitely a reading which would imply that the spec would actually require the first layout over the second.
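
The deltas of the three layouts above can be checked with a few lines of Python. This sketch assumes 'delta' means the difference in character count between the longest and shortest line, which is one reading of the balancing rule and not a definition quoted from the spec; the trailing full stop is left out of the count.

```python
def line_delta(lines):
    """Difference in character count between the longest and shortest line."""
    lengths = [len(line) for line in lines]
    return max(lengths) - min(lengths)


layouts = {
    "but / toned": ["His clean but", "toned coat"],
    "clean / buttoned": ["His clean", "buttoned coat"],
    "bu / ttoned": ["His clean bu", "ttoned coat"],
}

for name, lines in layouts.items():
    print(name, line_delta(lines))
# but / toned 3
# clean / buttoned 4
# bu / ttoned 1
```

On this measure alone the mid-word break wins, which is why the weighting between the delta rule and the break-opportunity rule matters.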

There is a very good reason CSS is not overly prescriptive in this area and they have been thinking about it for 20 years or so. Far better I think to leave it up to the author in the first instance, and CSS in the second.

>Try adding an "align:start" and you will be fine.
Yes, but I don’t want a left-aligned caption; I want a center-aligned caption which stretches from 0vw to 50vw, i.e. centered around 25vw. Let's say because I am modeling a two-speaker dialog and I want each speaker to have their own half of the screen. There is plenty of room for that. Can you tell me what values I should use to achieve it?

I think one significant problem is that the cue properties are doing partial double duty here and interact in subtle ways. This leads to some problems, not least of which (although I haven’t done a full analysis yet) is that there seem to be many quite valid layouts that VTT is unable to express, like the fairly common one I just gave. And really I'm not sure why these controls are even necessary: CSS provides all the controls you could possibly need to position a box in the video rectangle. As VTT is supposed to be a simple browser-optimized format, why even have these controls at all?

VTT seems to be heavily optimized to solve a corner case of a corner case, namely automatically moving captions to avoid overlap, which almost never happens in practice and could quite easily be taken care of as an authoring constraint, as is done in SDP and indeed in VTT for time order. It wouldn’t matter so much, but this seems to be largely at the expense of predictability and expressiveness, which are two key requirements if VTT is going to be the target for the world's caption corpus. The position of a caption is part of its semantics: captions are often placed to indicate which speaker is speaking, to avoid specific areas in the video itself, and for aesthetics. If the basic controls don’t offer the author the ability to place captions at least semi-accurately, IMO the format is going to be very hard to use, if not a complete non-starter; hence my admittedly unfortunate comment.

> If you don't author accurate cues, you are bound to get rubbish. That's the case with any format.
While I agree with your sentiment, the problem is that in VTT it is very hard to author accurate cues. Even apparently simple cases are fiendishly difficult to arrange, some common arrangements seem actually impossible, and there are a number of UA-specific values that can move my content in unknown ways. This is my point.

Hoping you take this in the spirit I intend it, which is to make captioning the web the best experience it can be for users, achieved with the minimum of expense and effort on the part of providers.

Regards,
Sean.


-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com] 
Sent: 11 June 2013 07:35
To: Sean Hayes
Cc: John Birch; public-tt@w3.org
Subject: Re: WebVTT (was RE: TTML Agenda for 15/05/13 - Proposed updates to charter)

On Mon, Jun 10, 2013 at 6:11 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> Hi John,
>
>>>It may be the case that taking the specification as written and 
>>>codifying it into an appropriate programming language does result in a compliant implementation, although I can't speak to that since I haven't attempted it...
>
> Well unfortunately, once you get through the impenetrable language, you'll find the core of the spec is in fact really quite odd. If you get time you should definitely try coding it up, it's extremely illuminating, I don’t think it's possible to really understand all the interplay of position, line and size otherwise. I haven’t spent any time investigating other VTT implementations to see how they behave, but I can only surmise that either no one is actually writing much with this format, or the other implementations have not stuck very closely to what's in the spec.


There are a few small bugs that are all registered.
Here is an implementation that has followed the spec to the line (except for the few bugs):
https://github.com/silviapfeiffer/WebVTT-with-regions .
You can test it at
http://html5videoguide.net/test/WebVTT-with-regions/player.html .

> For example, take the most basic of cues:
>
> 00:11.000 --> 00:13.000
> Sed ut perspiciatis unde omnis iste natus error sit
>
> This gets positioned so that only the top line " Sed ut perspiciatis unde " ends up showing at the bottom of the video (assuming there is only one track playing for the video).

Where do you get that from? I don't think that's correct - why would it drop half the line?


> Well, OK we can maybe forgive that, since we didn’t offer any 
> direction, so another example
>
> 00:11.000 --> 00:13.000 position:0% size:50% line:60% Sed ut 
> perspiciatis unde omnis iste natus error sit
>
> We now get a zero-width box down the left edge of the video which, depending on how you read the special wrap/overflow rules, either produces nothing or a single character column down the left edge.


Firstly, "line:60%" has no influence on it disappearing (it's 60% down from the top of the video). However, you're telling the browser to position a cue of 50% width middle aligned at the left side of the video. At the left there is no space for a middle aligned cue, so as much as the browser is trying to squeeze this cue  in, it wasn't given any space and thus the cue disappears (1 char long is a bug in an implementation). So it disappears. Try adding an "align:start" and you will be fine.


> For the next example the result is a little less weird, but still not exactly intuitive:
>
> 00:11.000 --> 00:13.000 position:10% size:50% line:30% Sed ut 
> perspiciatis unde omnis iste natus error sit
>
> Here we do get output, but it's only 20% of the horizontal width, not 50% as specified, and it shows up with its outer box 5% from the top, not 30%, and 8% in y not 10. Remember this is before we start factoring in second, potentially overlapping cues that get moved all over the place, the use case for which, I agree with you, is highly dubious.


You're again trying to squeeze a 50% width cue, middle aligned, into position 10% from the left. All you are going to get is a cue of 20% width, centered around the 10% mark and broken into multiple lines.
Turn off your default middle alignment and you will be fine. 8% sounds like an implementation bug.


> I could go on, but I think you'll get my point.

Not at all. If you don't author accurate cues, you are bound to get rubbish. That's the case with any format.


> I suspect that in reality no one can actually be following the spec, especially since in the recently issued 608 conversion doc:  https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html it claims that the preamble codes for row 7 column 9 should be represented by  line: 42% position: 30% (based on the idea of 15 rows and 32 columns  in an 80% safe area). Notwithstanding that these are two of the few values given there that wouldn't be actually ignored by the parser, they don’t in fact produce what the author thinks they should.
>
> 00:11.000 --> 00:13.000 align:left line:42% position:30%

That should be align:start .

> Sed ut perspiciatis unde omnis iste natus error sit
>
> According to http://dev.w3.org/html5/webvt this should show up at x=9%, which is not even in the title safe area, and y = 30%, so it seems the author is basing these rules on something other than the spec I am reading.

I think you may be reading the spec (the authoring part) differently from how I am reading it. Since this is start aligned text, the first character is actually positioned at x=30% and y=42% . After debugging your calculations on the rendering algorithm below, I come to the conclusion that you have correctly calculated x and y. You then stumbled across a bug, which has been pointed out before. It might be worthwhile keeping the bugs in mind.
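
For reference, the line:42% position:30% values quoted from the 608 conversion document are consistent with a simple grid mapping. The Python sketch below is an assumption that happens to reproduce those two numbers (a centred 80% safe area, 1-based rows and columns, leading edge of the cell); the conversion document may define the mapping differently, and the function name is hypothetical.

```python
def grid_to_percent(index, cells, safe_area=80.0):
    """Map a 1-based row or column index to a percentage of the video dimension.

    Assumes the safe area is centred in the video and that the leading edge
    of the cell is the value wanted.
    """
    margin = (100.0 - safe_area) / 2
    return margin + (index - 1) * safe_area / cells


print(grid_to_percent(7, 15))   # row 7 of 15    -> 42.0
print(grid_to_percent(9, 32))   # column 9 of 32 -> 30.0
```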


> Unless this weird behavior is in fact what is in the extant implementations, which I can hardly believe, I'm growing increasingly concerned that this text is nowhere near ready to be on a rec track. The good news here though is that I think what people actually seem to be expecting to happen is what you would get applying a pure CSS semantics, and removing Mr. Hickson's flight of fancy, which frankly IMO doesn't pass the giggle test.


Ian authored the spec and several people have been able to implement it successfully within browsers and outside (bar a few bugs), so I think your personal attack here is really unwarranted. The spec may be hard to read for anyone who has not previously had experience with that style of spec (in fact, it took me a while to grasp it), but that doesn't make it fundamentally wrong as you seem to imply.

TTML has had 10 years of specification and is still finding bugs and fixing them - please give WebVTT a bit more leeway on bugs.


Best Regards,
Silvia.

>
> All the best,
>
> Sean.
>
> ---- detailed working for the last example,  in case someone wants to 
> check my math ---
>
>>> the text track cue writing direction is horizontal, and the text 
>>> track cue alignment is left
>                                             Maximum Size = 100 - text track cue text position;  // therefore max size -> 70.
>
>>> the text track cue size is less than maximum size, then let size be text track cue size. Otherwise, let size be maximum size.
>                         Therefore Size -> 70
>
>>> the text track cue writing direction is horizontal, then let width be 'size vw' and height be 'auto'.
>         Width = 70 vw. Height = auto
>
>>> the text track cue writing direction is horizontal, the text track cue alignment is left, and direction is 'ltr'
>         x-position = text track cue text position;   therefore x-position -> 30
>
>>> the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is not set.
>>> If the text track cue line position is numeric,  return the value of the text track cue line position and abort these steps.
>         y-position = compute line position     ->    42
>
>>> Let left be 'x-position vw' and top be 'y-position vh'.  ->
>         Therefore left: 30vw  top: 42vh


That's correct: x=30% of the video width, y=42% of the video height is the start position of the text.
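
The quoted working up to this point can be sketched in Python as follows. It covers only the left-aligned, horizontal, ltr case walked through above; the function name and dictionary keys are hypothetical, and the default cue size of 100% is an assumption for a cue that specifies no size.

```python
def left_aligned_box(text_position, line_position, cue_size=100.0):
    """Initial box for a left-aligned horizontal cue, in vw/vh units."""
    maximum_size = 100.0 - text_position
    # Use the cue's own size only if it is smaller than the space available.
    size = cue_size if cue_size < maximum_size else maximum_size
    return {"left_vw": text_position, "top_vh": line_position, "width_vw": size}


print(left_aligned_box(30.0, 42.0))
# {'left_vw': 30.0, 'top_vh': 42.0, 'width_vw': 70.0}
```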



> So then we invoke CSS to find the height, which is a little funky, because it's not 100% clear how styles actually get applied, and therefore  what font to use, but I work it based on the text in 5.2.2 where a '5vh sans serif font' is called out, and so a reasonable height is 30vh for my sample text.

There is no need to go further into CSS - you have already found the start position of the text relative to the video viewport.

> So the absolutely positioned box for the cue is at top:  42vh  left:  
> 30vw; width: 70vw height:30vh

Your height calculation is wrong - the text is one line high, so 5vh high, unless somebody changes the font or the text does not fit within the 70% video width space that the cue is given to grow into.

> Then we get into the repositioning:
>
>>> Adjust the positions of boxes according to the appropriate steps from the following list:
>>> the cue's text track cue snap-to-lines flag is not set (it's not since both line and position are % values):
>         >> the text track cue writing direction is horizontal, and direction is 'ltr'
>                >> Let x be a percentage given by the text track cue text position,
>                                 x = textTrackCueTextPosition -> 30
>                >>let y be a percentage given by the text track cue computed line position.
>                                 y = computelinePosition  ->  42.

OK.


>>> Position the boxes in boxes such that the point 30% along the width of the bounding box of the boxes in boxes is 30% of the way across the width of the video's rendering area:
>
>      Box is currently lying at x = 30vw., its width is 70% of the video, so the point 30% along that is currently lying at 51vw ( i.e 30% + (30% * 70%))
>      We need it to lie at 30vw, so we need to subtract 21vw from its x position, so the box is now lying at 30-21 = 9vw. And since its left aligned, the 1st character is not in the title safe part.

Yup, this is a bug : https://www.w3.org/Bugs/Public/show_bug.cgi?id=20037 .


>>> and the point y% along the height of the bounding box of the boxes 
>>> in boxes is y% of the way across the height of the video's rendering area, while maintaining the relative positions of the boxes in boxes to each other.
>
> 42% of 30vh -> 12.6vh + top = 54.6.
> Target = 42vh  ->  delta = -12.6  -> new y pos ~= 30vh.

This is still the same bug. We're aware of this issue. There is no need to change anything further when there is no overlap. I'll work on this.




> -----Original Message-----
> From: John Birch [mailto:John.Birch@screensystems.tv]
> Sent: 07 June 2013 11:03
> To: Silvia Pfeiffer; Sean Hayes
> Cc: Michael Jordan; public-tt@w3.org
> Subject: RE: TTML Agenda for 15/05/13 - Proposed updates to charter
>
> Hi Silvia, Sean,
>
> I've been tracking this conversation with interest...
>
> Some comments:
>
> RE: the WebVTT specification and the 'rendering algorithm'... I too find the current specification ambiguous and complex. It may be the case that taking the specification as written and codifying it into an appropriate programming language does result in a compliant implementation, although I can't speak to that since I haven't attempted it... but using such an algorithmic approach in a specification does, for me at least, make it far less personally (and IMHO generally) comprehensible. (Even though my background is in software development).
>
> RE: caption overlap.... this for me is talking about an extreme corner case. It is extremely unusual to have simultaneous independent captions, and when this does occur they would normally be carefully placed by a captioner to avoid overlap. If the intention is to facilitate non-human mediated caption creation workflows, (e.g. automated speech to caption) then again IMHO there are other significant aspects that would also need to be supported in the WebVTT specification.
>
> RE: there is a quality captions requirement about balancing multiline captions... see my previous comment. Multi line balancing should be human mediated - algorithms do not perform this task well.
>
> RE: I'd leave it to the market to create lossless conversion tools and support them... In my experience this is unlikely. IMHO conversion between distribution formats is undesirable, since the contents of a subtitle or caption stream in a distribution format typically represents the result of a series of compromises to accommodate the limitations of the distribution format and the associated video presentation (e.g. resolution, screen size, line length limitations, assumptions about target audience etc.) Caption and Subtitle conversions currently often occur as a production process, and ideally should proceed from an abstracted authoring or master format, not a distribution format. I do not believe WebVTT is currently a suitable ideal candidate as an archive caption (subtitle) file format, though annotated (extended) TTML may be.
>
> I welcome the effort to normalise the functionality or feature sets between TTML and WebVTT, one of the greatest impediments to simple conversion between UK captions and US captions is the seemingly trivial matter of 32 characters versus 40 character line length limits.
>
> Best regards,
> John
>
> John Birch | Strategic Partnerships Manager | Screen Main Line : +44 
> 1473 831700 | Ext : 270 | Direct Dial : +44 1473 834532 Mobile : +44 
> 7919 558380 | Fax : +44 1473 830078 John.Birch@screensystems.tv | 
> www.screensystems.tv | https://twitter.com/screensystems

Received on Tuesday, 11 June 2013 14:28:38 UTC