WebVTT (was RE: TTML Agenda for 15/05/13 - Proposed updates to charter)

Hi John,

>>It may be the case that taking the specification as written and codifying it into an appropriate programming language does result in a compliant implementation, 
>>although I can't speak to that since I haven't attempted it...

Well, unfortunately, once you get through the impenetrable language, you'll find the core of the spec is in fact really quite odd. If you get time you should definitely try coding it up; it's extremely illuminating, and I don't think it's possible to really understand all the interplay of position, line and size otherwise. I haven't spent any time investigating other VTT implementations to see how they behave, but I can only surmise that either no one is actually writing much with this format, or the other implementations have not stuck very closely to what's in the spec.

For example, take the most basic of cues:

00:11.000 --> 00:13.000
Sed ut perspiciatis unde omnis iste natus error sit

This gets positioned so that only the top line, "Sed ut perspiciatis unde", ends up showing at the bottom of the video (assuming there is only one track playing for the video).

Well, OK, we can maybe forgive that, since we didn't offer any direction, so here is another example:

00:11.000 --> 00:13.000 position:0% size:50% line:60%
Sed ut perspiciatis unde omnis iste natus error sit 

We now get a zero-width box down the left edge of the video which, depending on how you read the special wrap/overflow rules, either produces nothing or a single-character column down the left edge.

For the next example the result is a little less weird, but still not exactly intuitive:

00:11.000 --> 00:13.000 position:10% size:50% line:30%
Sed ut perspiciatis unde omnis iste natus error sit 

Here we do get output, but it's only 20% of the horizontal width, not 50% as specified, and it shows up with its outer box 5% from the top, not 30%, and 8% in x, not 10%. Remember this is before we start factoring in second, potentially overlapping cues that get moved all over the place, a use case which, I agree with you, is highly dubious.
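
As a rough sketch of where the 20% width and the 8% horizontal offset can come from: the cue above has no align setting, so this assumes middle alignment, and it assumes the middle-alignment rules are "maximum size is twice the distance to the nearer edge" and "x-position is position minus half the size". Those two rules are my reconstruction, chosen because they reproduce the figures quoted here (and the zero-width box in the previous example); the function name is hypothetical.

```python
def middle_aligned_box(position, size=100.0):
    """Width and final left edge (both in vw) for a horizontal, middle-aligned cue.

    Assumed rules (a reconstruction, not quoted from the spec):
      - maximum size is twice the distance from `position` to the nearer edge
      - the box starts centred on `position`
      - repositioning then moves the point `position`% along the box
        to `position`% across the video
    """
    if position <= 50:
        maximum_size = position * 2
    else:
        maximum_size = (100 - position) * 2
    width = min(size, maximum_size)

    left = position - width / 2               # initial x-position
    anchor = left + width * position / 100    # where the position% point sits now
    left += position - anchor                 # shift so it sits at position%
    return width, left


print(middle_aligned_box(10.0, 50.0))  # position:10% size:50%
print(middle_aligned_box(0.0, 50.0))   # position:0%  size:50%
```

Under these assumptions the first call gives a 20vw box at 8vw, and the second gives a zero-width box at the left edge.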

I could go on, but I think you'll get my point. I suspect that in reality no one can actually be following the spec, especially since the recently issued 608 conversion doc, https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html, claims that the preamble codes for row 7 column 9 should be represented by line: 42% position: 30% (based on the idea of 15 rows and 32 columns in an 80% safe area). Notwithstanding that these are two of the few values given there that wouldn't actually be ignored by the parser, they don't in fact produce what the author thinks they should:

00:11.000 --> 00:13.000 align:left line:42% position:30%
Sed ut perspiciatis unde omnis iste natus error sit 

According to http://dev.w3.org/html5/webvt this should show up at x = 9%, which is not even in the title safe area, and y = 30%, so it seems the author is basing these rules on something other than the spec I am reading.
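
For reference, the 42% and 30% figures in the 608 conversion doc are consistent with a simple grid calculation. This is only a sketch of one reading that reproduces them: it assumes rows and columns are counted from 1, that the value names the top-left corner of the character cell, and that the 80% safe area is centred (10% margin on each side). The function name is hypothetical and is not taken from the conversion doc.

```python
def cea608_to_percent(row, column, rows=15, columns=32, safe_area=80.0):
    """Map a 1-based 608 row/column to (line %, position %).

    Assumption: the safe area is centred in the video and each value
    refers to the top-left corner of the character cell.
    """
    margin = (100.0 - safe_area) / 2.0
    line = margin + (row - 1) * safe_area / rows
    position = margin + (column - 1) * safe_area / columns
    return line, position


print(cea608_to_percent(7, 9))  # row 7, column 9
```

With these assumptions row 7, column 9 comes out as line 42%, position 30%, which is what you would expect to see on screen if the percentages were applied as plain offsets.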

Unless this weird behavior is in fact what is in the extant implementations, which I can hardly believe, I'm growing increasingly concerned that this text is nowhere near ready to be on a rec track. The good news here, though, is that I think what people actually seem to be expecting to happen is what you would get by applying pure CSS semantics and removing Mr. Hickson's flight of fancy, which frankly IMO doesn't pass the giggle test.

All the best,

Sean.

---- detailed working for the last example, in case someone wants to check my math ----

>> the text track cue writing direction is horizontal, and the text track cue alignment is left
   Maximum size = 100 - text track cue text position; therefore maximum size -> 70

>> the text track cue size is less than maximum size, then let size be text track cue size. Otherwise, let size be maximum size.
   The cue gives no size, so the cue size is not less than maximum size; therefore size -> 70

>> the text track cue writing direction is horizontal, then let width be 'size vw' and height be 'auto'.
   Width = 70vw, height = auto

>> the text track cue writing direction is horizontal, the text track cue alignment is left, and direction is 'ltr'
   x-position = text track cue text position; therefore x-position -> 30

>> the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is not set.
>> If the text track cue line position is numeric, return the value of the text track cue line position and abort these steps.
   y-position = computed line position -> 42

>> Let left be 'x-position vw' and top be 'y-position vh'.
   Therefore left: 30vw, top: 42vh
  
So then we invoke CSS to find the height, which is a little funky, because it's not 100% clear how styles actually get applied, and therefore what font to use. I worked it out based on the text in 5.2.2, where a '5vh sans serif font' is called out, so a reasonable height is 30vh for my sample text.

So the absolutely positioned box for the cue is at top: 42vh; left: 30vw; width: 70vw; height: 30vh.
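
The steps so far, for this one case only (horizontal writing direction, left alignment, 'ltr', snap-to-lines not set), can be sketched as below. The function name and the dictionary keys are hypothetical, and the default of 100 for an unspecified size is an assumption of the sketch. The height is left out because it comes from CSS layout, not from these steps.

```python
def initial_box_left_aligned(position, line, size=100.0):
    """Initial box for a horizontal, left-aligned, ltr cue with a % line.

    Covers only the branch worked through above; other alignments and
    writing directions follow different rules.
    """
    maximum_size = 100.0 - position
    # "If size is less than maximum size, use it. Otherwise use maximum size."
    width = size if size < maximum_size else maximum_size
    x_position = position   # left-aligned, ltr
    y_position = line       # snap-to-lines not set, numeric line position
    return {"left_vw": x_position, "top_vh": y_position, "width_vw": width}


print(initial_box_left_aligned(position=30.0, line=42.0))
```

For line:42% position:30% this gives left 30vw, top 42vh, width 70vw, matching the box above before any repositioning.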

Then we get into the repositioning:

>> Adjust the positions of boxes according to the appropriate steps from the following list:
>> the cue's text track cue snap-to-lines flag is not set (it's not, since both line and position are % values):
>> the text track cue writing direction is horizontal, and direction is 'ltr'
>> Let x be a percentage given by the text track cue text position,
   x = text track cue text position -> 30
>> let y be a percentage given by the text track cue computed line position.
   y = computed line position -> 42

>> Position the boxes in boxes such that the point 30% along the width of the bounding box of the boxes in boxes is 30% of the way across the width of the video's rendering area:

   The box is currently lying at x = 30vw and its width is 70% of the video, so the point 30% along it is currently lying at 51vw (i.e. 30 + (30% * 70)).
   We need it to lie at 30vw, so we need to subtract 21vw from its x position, so the box is now lying at 30 - 21 = 9vw. And since it's left aligned, the first character is not in the title safe part.
                                     
>> and the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area, 
>> while maintaining the relative positions of the boxes in boxes to each other.

   42% of 30vh -> 12.6vh; adding top (42vh) gives 54.6vh.
   Target = 42vh -> delta = -12.6vh -> new y position = 29.4vh, i.e. ~30vh.
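
The repositioning arithmetic is the same on both axes, so it can be checked with one small helper. This is just my reading of the quoted step expressed as code; the function name is hypothetical, and the 30vh height is the estimate used above, not something the spec gives.

```python
def reposition(start, extent, percent):
    """Shift a box so the point `percent`% along its extent lies at
    `percent`% of the way across the video (all values in vw or vh)."""
    anchor = start + extent * percent / 100.0  # where that point is now
    delta = percent - anchor                   # how far it has to move
    return start + delta


new_left = reposition(start=30.0, extent=70.0, percent=30.0)  # x axis
new_top = reposition(start=42.0, extent=30.0, percent=42.0)   # y axis
print(new_left, round(new_top, 1))
```

For the last example this lands the box at roughly 9vw across and 29.4vh down, rather than the 30% and 42% written in the cue settings.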


-----Original Message-----
From: John Birch [mailto:John.Birch@screensystems.tv] 
Sent: 07 June 2013 11:03
To: Silvia Pfeiffer; Sean Hayes
Cc: Michael Jordan; public-tt@w3.org
Subject: RE: TTML Agenda for 15/05/13 - Proposed updates to charter

Hi Silvia, Sean,

I've been tracking this conversation with interest...

Some comments:

RE: the WebVTT specification and the 'rendering algorithm'... I too find the current specification ambiguous and complex. It may be the case that taking the specification as written and codifying it into an appropriate programming language does result in a compliant implementation, although I can't speak to that since I haven't attempted it... but using such an algorithmic approach in a specification does, for me at least, make it far less personally (and IMHO generally) comprehensible. (Even though my background is in software development).

RE: caption overlap.... this for me is talking about an extreme corner case. It is extremely unusual to have simultaneous independent captions, and when this does occur they would normally be carefully placed by a captioner to avoid overlap. If the intention is to facilitate non-human mediated caption creation workflows, (e.g. automated speech to caption) then again IMHO there are other significant aspects that would also need to be supported in the WebVTT specification.

RE: there is a quality captions requirement about balancing multiline captions... see my previous comment. Multi line balancing should be human mediated - algorithms do not perform this task well.

RE: I'd leave it to the market to create lossless conversion tools and support them... In my experience this is unlikely. IMHO conversion between distribution formats is undesirable, since the contents of a subtitle or caption stream in a distribution format typically represent the result of a series of compromises to accommodate the limitations of the distribution format and the associated video presentation (e.g. resolution, screen size, line length limitations, assumptions about target audience etc.). Caption and subtitle conversions currently often occur as a production process, and ideally should proceed from an abstracted authoring or master format, not a distribution format. I do not believe WebVTT is currently a suitable candidate as an archive caption (subtitle) file format, though annotated (extended) TTML may be.

I welcome the effort to normalise the functionality or feature sets between TTML and WebVTT; one of the greatest impediments to simple conversion between UK captions and US captions is the seemingly trivial matter of 32-character versus 40-character line length limits.

Best regards,
John

John Birch | Strategic Partnerships Manager | Screen Main Line : +44 1473 831700 | Ext : 270 | Direct Dial : +44 1473 834532 Mobile : +44 7919 558380 | Fax : +44 1473 830078 John.Birch@screensystems.tv | www.screensystems.tv | https://twitter.com/screensystems



Received on Sunday, 9 June 2013 20:12:22 UTC