Re: WebVTT (was RE: TTML Agenda for 15/05/13 - Proposed updates to charter) from Silvia Pfeiffer on 2013-06-11 (public-tt@w3.org from June 2013)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Tue, 11 Jun 2013 16:34:34 +1000
To: Sean Hayes <Sean.Hayes@microsoft.com>
Cc: John Birch <John.Birch@screensystems.tv>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <CAHp8n2kaUwNrUUi8BbuSWtpsGu9RtG-4OQkpsSSjJiAq6ETDtQ@mail.gmail.com>
On Mon, Jun 10, 2013 at 6:11 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> Hi John,
>
>>>It may be the case that taking the specification as written and codifying it into an appropriate programming language does result in a compliant implementation,
>>>although I can't speak to that since I haven't attempted it...
>
> Well, unfortunately, once you get through the impenetrable language, you'll find the core of the spec is in fact really quite odd. If you get time you should definitely try coding it up; it's extremely illuminating, and I don't think it's possible to really understand all the interplay of position, line and size otherwise. I haven't spent any time investigating other VTT implementations to see how they behave, but I can only surmise that either no one is actually writing much with this format, or the other implementations have not stuck very closely to what's in the spec.


There are a few small bugs that are all registered.
Here is an implementation that has followed the spec to the letter
(except for the few bugs):
https://github.com/silviapfeiffer/WebVTT-with-regions .
You can test it at
http://html5videoguide.net/test/WebVTT-with-regions/player.html .

> For example, take the most basic of cues:
>
> 00:11.000 --> 00:13.000
> Sed ut perspiciatis unde omnis iste natus error sit
>
> This gets positioned so that only the top line "Sed ut perspiciatis unde" ends up showing at the bottom of the video (assuming there is only one track playing for the video).

Where do you get that from? I don't think that's correct - why would
it drop half the line?


> Well, OK we can maybe forgive that, since we didn’t offer any direction, so another example
>
> 00:11.000 --> 00:13.000 position:0% size:50% line:60%
> Sed ut perspiciatis unde omnis iste natus error sit
>
> We now get a zero-width box down the left edge of the video which, depending on how you read the special wrap/overflow rules, either produces nothing or a single character column down the left edge.


Firstly, "line:60%" has no influence on the cue disappearing (it only
places the cue 60% down from the top of the video). However, you're
telling the browser to position a middle-aligned cue of 50% width at
the left edge of the video. At the left edge there is no space for a
middle-aligned cue, so however hard the browser tries to squeeze this
cue in, it has been given no space and thus the cue disappears (a
one-character column is a bug in an implementation). Try adding
"align:start" and you will be fine.


> For the next example the result is a little less weird, but still not exactly intuitive:
>
> 00:11.000 --> 00:13.000 position:10% size:50% line:30%
> Sed ut perspiciatis unde omnis iste natus error sit
>
> Here we do get output, but it's only 20% of the horizontal width, not 50% as specified, and it shows up with its outer box 5% from the top, not 30%, and 8% in y, not 10. Remember this is before we start factoring in second, potentially overlapping cues that get moved all over the place. The use case for which I agree with you is highly dubious.


You're again trying to squeeze a 50% width cue, middle aligned, into
position 10% from the left. All you are going to get is a cue of 20%
width, centered around the 10% mark and broken into multiple lines.
Turn off your default middle alignment and you will be fine. 8% sounds
like an implementation bug.
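
A minimal sketch of the size clamping behind both examples, assuming the
rule as described in this thread: a middle-aligned cue can grow to at most
twice the distance from its position to the nearer video edge, while a
start-aligned cue can grow up to 100 minus its position. The name
effective_cue_width is hypothetical, and the sketch covers horizontal,
left-to-right cues only.

```python
def effective_cue_width(position: float, size: float, align: str = "middle") -> float:
    """Return the usable cue width as a percentage of the video width.

    Assumption: the clamping rule is taken from the discussion in this
    thread, not copied from the spec text.
    """
    if align == "middle":
        # A centered box can only extend as far as the nearer edge allows.
        maximum = 2 * min(position, 100 - position)
    elif align == "start":
        # A start-aligned box grows from its position to the right edge.
        maximum = 100 - position
    else:
        raise ValueError(f"alignment not covered by this sketch: {align}")
    return min(size, maximum)


print(effective_cue_width(0, 50))           # 0: the cue disappears
print(effective_cue_width(10, 50))          # 20: centered around the 10% mark
print(effective_cue_width(0, 50, "start"))  # 50: align:start keeps the full size
```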


> I could go on, but I think you'll get my point.

Not at all. If you don't author accurate cues, you are bound to get
rubbish. That's the case with any format.


> I suspect that in reality no one can actually be following the spec, especially since in the recently issued 608 conversion doc:  https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html it claims that the preamble codes for row 7 column 9 should be represented by  line: 42% position: 30% (based on the idea of 15 rows and 32 columns  in an 80% safe area). Notwithstanding that these are two of the few values given there that wouldn't be actually ignored by the parser, they don’t in fact produce what the author thinks they should.
>
> 00:11.000 --> 00:13.000 align:left line:42% position:30%

That should be align:start .

> Sed ut perspiciatis unde omnis iste natus error sit
>
> According to http://dev.w3.org/html5/webvt this should show up at x=9%, which is not even in the title safe area, and y = 30%, so it seems the author is basing these rules on something other than the spec I am reading.

I think you may be reading the spec (the authoring part) differently
from how I am reading it. Since this is start aligned text, the first
character is actually positioned at x=30% and y=42% . After debugging
your calculations on the rendering algorithm below, I come to the
conclusion that you have correctly calculated x and y. You then
stumbled across a bug, which has been pointed out before. It might be
worthwhile keeping the bugs in mind.
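
The 608 mapping quoted above (15 rows and 32 columns inside an 80% safe
area) can be reproduced with the sketch below. The formula is an
assumption that happens to match the quoted values of line:42% and
position:30%; it is not taken from the conversion document, and
grid_to_percent is a hypothetical name.

```python
def grid_to_percent(index: int, cells: int, safe_area: float = 80.0) -> float:
    """Map a 1-based row or column index to a percentage of the video.

    Assumption: the safe area is centered, so the margin on each side is
    (100 - safe_area) / 2, and cell 1 starts exactly at that margin.
    """
    margin = (100.0 - safe_area) / 2
    return margin + (index - 1) * safe_area / cells


line = grid_to_percent(7, 15)      # row 7 of 15 rows
position = grid_to_percent(9, 32)  # column 9 of 32 columns
print(f"line:{line:.0f}% position:{position:.0f}%")
```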


> Unless this weird behavior is in fact what is in the extant implementations, which I can hardly believe, I'm growing increasingly concerned that this text is nowhere near ready to be on a rec track. The good news here though is that I think what people actually seem to be expecting to happen is what you would get applying a pure CSS semantics, and removing Mr. Hickson's flight of fancy, which frankly IMO doesn't pass the giggle test.


Ian authored the spec and several people have been able to implement
it successfully within browsers and outside (bar a few bugs), so I
think your personal attack here is really unwarranted. The spec may be
hard to read for anyone who has not previously had experience with
that style of spec (in fact, it took me a while to grasp it), but that
doesn't make it fundamentally wrong as you seem to imply.

TTML has had 10 years of specification and is still finding bugs and
fixing them - please give WebVTT a bit more leeway on bugs.


Best Regards,
Silvia.

>
> All the best,
>
> Sean.
>
> ---- detailed working for the last example,  in case someone wants to check my math ---
>
>>> the text track cue writing direction is horizontal, and the text track cue alignment is left
>     Maximum Size = 100 - text track cue text position;  // therefore max size -> 70.
>
>>> the text track cue size is less than maximum size, then let size be text track cue size. Otherwise, let size be maximum size.
>     Therefore Size -> 70
>
>>> the text track cue writing direction is horizontal, then let width be 'size vw' and height be 'auto'.
>     Width = 70 vw. Height = auto
>
>>> the text track cue writing direction is horizontal, the text track cue alignment is left, and direction is 'ltr'
>     x-position = text track cue text position;  therefore x-position -> 30
>
>>> the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is not set.
>>> If the text track cue line position is numeric, return the value of the text track cue line position and abort these steps.
>     y-position = compute line position  ->  42
>
>>> Let left be 'x-position vw' and top be 'y-position vh'.  ->
>     Therefore left: 30vw  top: 42vh


That's correct: x=30% of the video width, y=42% of the video height is
the start position of the text.



> So then we invoke CSS to find the height, which is a little funky, because it's not 100% clear how styles actually get applied, and therefore  what font to use, but I work it based on the text in 5.2.2 where a '5vh sans serif font' is called out, and so a reasonable height is 30vh for my sample text.

There is no need to go further into CSS - you have already found the
start position of the text relative to the video viewport.

> So the absolutely positioned box for the cue is at top: 42vh; left: 30vw; width: 70vw; height: 30vh

Your height calculation is wrong - the text is one line high, so 5vh
high, unless somebody changes the font or the text does not fit within
the 70% video width space that the cue is given to grow into.

> Then we get into the repositioning:
>
>>> Adjust the positions of boxes according to the appropriate steps from the following list:
>>> the cue's text track cue snap-to-lines flag is not set (it's not since both line and position are % values):
>     >> the text track cue writing direction is horizontal, and direction is 'ltr'
>     >> Let x be a percentage given by the text track cue text position,
>         x = textTrackCueTextPosition -> 30
>     >> let y be a percentage given by the text track cue computed line position.
>         y = computelinePosition -> 42.

OK.


>>> Position the boxes in boxes such that the point 30% along the width of the bounding box of the boxes in boxes is 30% of the way across the width of the video's rendering area:
>
>      Box is currently lying at x = 30vw; its width is 70% of the video, so the point 30% along that is currently lying at 51vw (i.e. 30% + (30% * 70%)).
>      We need it to lie at 30vw, so we need to subtract 21vw from its x position, so the box is now lying at 30-21 = 9vw. And since it's left aligned, the 1st character is not in the title safe part.

Yup, this is a bug: https://www.w3.org/Bugs/Public/show_bug.cgi?id=20037 .


>>> and the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area,
>>> while maintaining the relative positions of the boxes in boxes to each other.
>
> 42% of 30vh -> 12.6vh + top = 54.6.
> Target = 42vh  ->  delta = -12.6  -> new y pos ~= 30vh.

This is still the same bug. We're aware of this issue. There is no
need to change anything further when there is no overlap. I'll work on
this.
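
Sean's arithmetic for the repositioning step, which triggers the bug
discussed above, can be replayed as follows. This is a sketch only:
reposition is a hypothetical helper that mirrors the quoted wording for a
single axis, and the 30vh box height is Sean's estimate, which is
disputed above.

```python
def reposition(start: float, extent: float, percent: float) -> float:
    """Shift a box so the point `percent`% along it lies at `percent`% of the video.

    `start` and `extent` are the box offset and length on one axis, in
    percent of the video dimension (vw for x, vh for y).
    """
    anchor = start + extent * percent / 100  # where that point currently lies
    delta = anchor - percent                 # how far it is from its target
    return start - delta


left = reposition(30, 70, 30)  # anchor at 51vw, shifted by 21vw
top = reposition(42, 30, 42)   # anchor at 54.6vh, shifted by 12.6vh
print(round(left, 1), round(top, 1))  # 9.0 29.4
```

With a start-aligned cue and no overlapping cues, the expected result is
that the box stays at left 30vw and top 42vh, which is why the shifted
values are treated as a bug.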




> -----Original Message-----
> From: John Birch [mailto:John.Birch@screensystems.tv]
> Sent: 07 June 2013 11:03
> To: Silvia Pfeiffer; Sean Hayes
> Cc: Michael Jordan; public-tt@w3.org
> Subject: RE: TTML Agenda for 15/05/13 - Proposed updates to charter
>
> Hi Silvia, Sean,
>
> I've been tracking this conversation with interest...
>
> Some comments:
>
> RE: the WebVTT specification and the 'rendering algorithm'... I too find the current specification ambiguous and complex. It may be the case that taking the specification as written and codifying it into an appropriate programming language does result in a compliant implementation, although I can't speak to that since I haven't attempted it... but using such an algorithmic approach in a specification does, for me at least, make it far less personally (and IMHO generally) comprehensible. (Even though my background is in software development).
>
> RE: caption overlap.... this for me is talking about an extreme corner case. It is extremely unusual to have simultaneous independent captions, and when this does occur they would normally be carefully placed by a captioner to avoid overlap. If the intention is to facilitate non-human mediated caption creation workflows, (e.g. automated speech to caption) then again IMHO there are other significant aspects that would also need to be supported in the WebVTT specification.
>
> RE: there is a quality captions requirement about balancing multiline captions... see my previous comment. Multi line balancing should be human mediated - algorithms do not perform this task well.
>
> RE: I'd leave it to the market to create lossless conversion tools and support them... In my experience this is unlikely. IMHO conversion between distribution formats is undesirable, since the contents of a subtitle or caption stream in a distribution format typically represents the result of a series of compromises to accommodate the limitations of the distribution format and the associated video presentation (e.g. resolution, screen size, line length limitations, assumptions about target audience etc.) Caption and Subtitle conversions currently often occur as a production process, and ideally should proceed from an abstracted authoring or master format, not a distribution format. I do not believe WebVTT is currently a suitable ideal candidate as an archive caption (subtitle) file format, though annotated (extended) TTML may be.
>
> I welcome the effort to normalise the functionality or feature sets between TTML and WebVTT; one of the greatest impediments to simple conversion between UK captions and US captions is the seemingly trivial matter of 32-character versus 40-character line length limits.
>
> Best regards,
> John
>
> John Birch | Strategic Partnerships Manager | Screen Main Line : +44 1473 831700 | Ext : 270 | Direct Dial : +44 1473 834532 Mobile : +44 7919 558380 | Fax : +44 1473 830078 John.Birch@screensystems.tv | www.screensystems.tv | https://twitter.com/screensystems
Received on Tuesday, 11 June 2013 06:35:23 UTC