RE: TTML Agenda for 15/05/13 - Proposed updates to charter from Sean Hayes on 2013-06-07 (public-tt@w3.org from June 2013)

From: Sean Hayes <Sean.Hayes@microsoft.com>
Date: Fri, 7 Jun 2013 16:37:48 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
CC: Michael Jordan <mijordan@adobe.com>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <E9A92BD0A4FC934EB7935470A46D15241F691260@DB3EX14MBXC324.europe.corp.microsoft.c>
This is getting to be a big thread, so just to deal with one specific point for now:
>I don't see the relevance of the implementation of the WebVTT rendering algorithm.

It's very relevant to seeing how the layout mapping might work. This is getting a little specific, but here's an example of one aspect of the specification that I'd need to understand a lot better before I could do a complete mapping.

Step 10 seems to have some problems:
- Dealing with margins appears to set size, but this seems too late to affect the width property, which is set in step 7.
- The boxes aren’t repositioned/resized if they fit in the left or right margins.
- Margin right is used where margin bottom is probably intended.

Leaving that aside, let's say for the sake of argument that I have a cue with the settings line:10% position:30% size:25%

Its default alignment is middle, so the following clause matches:
If the text track cue writing direction is horizontal, the text track cue alignment is middle, and direction is 'ltr'
         The x position is 30% - 12.5%   =  17.5%.

So the CSS absolute position traits for this will be:
left: 17.5vw, top: 10vh, width: 25vw, height: auto
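
As a rough illustration of the arithmetic above, the sketch below computes the initial CSS box for a horizontal, left-to-right cue with middle alignment. The function name and dictionary keys are made up for this example and are not taken from the WebVTT spec.

```python
def initial_css_box(line_pct, position_pct, size_pct):
    """Initial CSS box for a horizontal, ltr, middle-aligned cue.

    Hypothetical helper: all values are percentages of the video
    rendering area (vw horizontally, vh vertically).
    """
    # Middle alignment centres the box on the position, so the left
    # edge sits half the cue width before it.
    left = position_pct - size_pct / 2
    return {
        "left_vw": left,
        "top_vh": line_pct,
        "width_vw": size_pct,
        "height": "auto",
    }


box = initial_css_box(line_pct=10, position_pct=30, size_pct=25)
print(box)
```

For the cue above this gives left 17.5, top 10 and width 25.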

Then in the reposition stage (in the non-snap case) we have:
"2.Position the boxes in boxes such that the point x% along the width of the bounding box of the boxes in boxes is x% of the way across the width of the video's rendering area, and the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area, while maintaining the relative positions of the boxes in boxes to each other."

Where x here is 30% and y is 10%.

We have already computed top and left from the line: and position: settings, but it appears we must now move the cue, even if there are no other boxes in the output.

So let's say, again for argument's sake, that the video is 1000px wide and 700px high.

The outer box for the cue will have been positioned by CSS at 175px, 70px (according to the specified algorithm, which ignores margins as discussed above). The outer box will also be its bounding box, assuming no overflow, so according to this step we now need to move it so that the point 30% along its width is at 30% of the video width.

Its width is 25% of the video width = 250px. So 30% of its width is 75px.
So the point 75px along the cue (currently lying at 175 + 75 = 250px) needs to be placed at 300px; therefore the cue is moved 50px to the right.

Then, let's say that it turns out after wrapping and font style application that there are 3 lines in the caption, which for argument's sake makes it 100px high. We now need to move it so that the point 10% down its height is at 10% of the video height.

Therefore the point 10px into the cue, which was at 80px, now needs to be at 70px, so the cue is moved 10px up.
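
The two moves above can be sketched as follows. This only models the reading of the reposition step given in this message (non-snap case, a single cue, no overflow); the function name and parameters are hypothetical.

```python
def reposition(left, top, width, height, x_pct, y_pct, video_w, video_h):
    """Shift a box so the point x% / y% along it lands at x% / y% of the video.

    Hypothetical helper; all lengths are in pixels.
    Returns the new (left, top) of the box.
    """
    # Distance between where the anchor point must end up and where it is now.
    dx = x_pct * video_w / 100 - (left + x_pct * width / 100)
    dy = y_pct * video_h / 100 - (top + y_pct * height / 100)
    return left + dx, top + dy


# 250px x 100px cue initially at 175px, 70px in a 1000px x 700px video.
new_left, new_top = reposition(175, 70, 250, 100, 30, 10, 1000, 700)
print(new_left, new_top)
```

With these numbers the box moves 50px right and 10px up, ending at 225px, 60px.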

Is this correct? If so, what is the rationale for this? It seems highly inconvenient from an authoring point of view to try to predict where my captions are going to show up, and this could easily end up putting the caption over something important in the video. Why isn’t the CSS position sufficient here?

So the impact on the mapping is that where TTML wants to place the top left corner of the caption at 30%, 10%, it turns out that in WebVTT it's actually at 22.5%, roughly 9%, so now I have to figure out the reverse of this process to find out what values to plug in for line, position and size to get 30%, 10%.
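
A minimal sketch of that reverse step, under the same reading of the reposition stage. It assumes a single cue, the non-snap case, and a rendered cue height that is somehow known in advance, which is exactly what an author cannot rely on. The function name and parameters are hypothetical.

```python
def settings_for_top_left(target_left_pct, target_top_pct, size_pct, cue_h, video_h):
    """Solve for position and line so the repositioned cue's top-left
    corner lands on the target.

    Hypothetical helper; cue_h and video_h are in pixels, the rest are
    percentages of the video rendering area.
    """
    # After repositioning, left = position * (1 - size), as fractions of width.
    position_pct = target_left_pct / (1 - size_pct / 100)
    # Likewise top = line * (1 - cue_h / video_h), as fractions of height.
    line_pct = target_top_pct / (1 - cue_h / video_h)
    return position_pct, line_pct


position, line = settings_for_top_left(30, 10, 25, cue_h=100, video_h=700)
print(f"line:{line:.2f}% position:{position:.2f}% size:25%")
```

For a 100px-high cue in a 700px-high video this suggests position:40% and line of about 11.67%; a different wrapped height changes the line value.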

This is even before we start considering second captions in the output, or dealing with the even more complicated line-based positioning, which shuffles captions about vertically based on their width and font size.


-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com] 
Sent: 07 June 2013 00:40
To: Sean Hayes
Cc: Michael Jordan; public-tt@w3.org
Subject: Re: TTML Agenda for 15/05/13 - Proposed updates to charter

On Thu, Jun 6, 2013 at 9:24 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>>I may not fully understand what you are trying to achieve, so bear with me. What I read (and I may be wrong) is that you want WebVTT to map to WebVTT objects ("WebVTT Node objects", see http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules), and TTML to map to TTML objects, then these objects to map to some abstract object model before mapping that abstract object model to HTML objects for rendering?
>
> No. What I am suggesting is modifying the specifications to define WebVTT to map to TBDO objects and define TTML to map to TBDO objects, where TBDO is the to-be-decided object model. As I pointed out, the internal object models of both formats are simple enough that designing TBDO is pretty trivial, although it does require a willingness on both sides to change their specs. If that basic spirit of cooperation is not present then we might as well forget the entire enterprise.
>
>> I would keep this exercise separated from the WebVTT, the TTML, and 
>> the HTML spec and not require implementation. It's mostly interesting 
>> for conversions
>
> I believe this is in fact a perfectly viable approach for implementation for reasons I can't discuss on a public mailing list.
>
>>BTW: have you thought about that you could just define one of the two to be the abstract object model and map the other one and any other format to it?
>
> Yes I believe the XML Infoset would be the better more established choice, however I realize that this would set off the anti-XML knee-jerk reaction, so I'm not necessarily wedded to that idea.
>
>>All browsers that implemented more than the basic text support for 
>>WebVTT implemented creation of WebVTT Node objects as specified in the 
>>WebVTT spec, see 
>>http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules . Those 
>>node objects are being mapped to HTML DOM nodes in 
>>http://dev.w3.org/html5/webvtt/#webvtt-cue-text-dom-construction-rules

>
> And that is fine, TBDO is mostly just a naming exercise, since the objects in WebVTT are not really much more than names anyway, any  implementations wouldn't have to change necessarily.


I'd be curious to understand what "renaming" would entail. Your reference to XML Infoset doesn't seem like a mere renaming exercise.
But it seems like I still don't follow what you are trying to achieve.


>>Does TTML provide an explicit rendering algorithm? As I understand it, TTML relied on XSL-FO for rendering... yes, I just found this quote:
>>"For each resulting document instance F, if processing requires 
>>presentation on a visual medium, then apply formatting and rendering semantics consistent with that prescribed by [XSL 1.1]."
>
> The term *consistent with* here means that you are free to implement as you will, provided you produce visible results that look like those produced by the reference implementation. And in point of fact CSS, for the requirements of TTML, is indeed consistent with XSL-FO in that sense (since XSL-FO references CSS pretty much for the parts we rely on, except for a few details caused by CSS3 not remaining stable which we are cleaning up). The HTML5/CSS mapping will therefore define the reference rendering for CSS.
>
>>The rendering section of the WebVTT spec is quite complicated and uses 
>>many of the specifics of WebVTT cue settings and custom  algorithms to avoid cue overlap etc.
>
> Yes, I believe this is the biggest impediment to progress. I think not only are these rules complicated, they are in fact ambiguous to the point of non-interoperability, and possibly containing circular dependencies.

The algorithm is clearly stated and if there are ambiguities, then they are either bugs in the spec or misunderstandings by the implementer. Since every step of the algorithm is provided, there should be no non-interoperable implementations.


> The proposed region additions also seem to not fit well with them at all.

What makes you think so? The spec for regions has been implemented in Blink (and I believe in WebKit) with few issues.


> Personally I think it would be much better if the non-overlap constraint was moved into the document conformance, like the timing constraints are and simply rely on CSS with no alterations.

CSS does not do overlap avoidance for explicitly placed blocks of text.


> CSS is at this point a sufficiently general rendering technology that cue settings should be capable of being mapped into un-transformed CSS.

Captions have some specific requirements that CSS is not satisfying yet. In particular there is a quality captions requirement about balancing multiline captions for which CSS has no answer. There are discussions in the CSS WG to come up with a solution, but until then WebVTT needed to define its own.


> I do find the definition of :past and :future troubling however, given the implications of how often they could cause the CSS engine to run. I would like to see if these could be mapped to CSS animation.

That's an implementation quality issue - the fundamental issue of changing the format of sections of text is the same, so should be able to be dealt with in a browser in the same way, no matter if it comes through animations or pseudo-selectors.


>> I'd leave it to the market to create lossless conversion tools and support them. I wouldn't expect authors to do this by hand.
>
> Given the above, while a good approximation is feasible, I don't think truly lossless is actually possible. Certainly not without a better reference implementation of the WebVTT rendering algorithms.

Have you got proof for that? I thought part of the activity as in the new charter is actually about identifying how good a conversion can be. Also, I don't see the relevance of the implementation of the WebVTT rendering algorithm.


>>Well, I would not want to restrict the development of one format by the feature set available to other formats, or to the object model.
>>You wouldn't want to stop adding features to TTML just because these 
>>features are not available in VTT yet and therefore not specified in 
>>the common object model.
>
> Actually I would. The caption-using public has suffered for decades because of the continual need to translate from one format to another, which leads to increased costs, delays and errors, which ultimately adds up to a great deal of non-captioned content. We had a moment in time where it might have been possible to fix that; however, for reasons I'm not particularly interested in rehashing, we failed to do so. However, we may have another opportunity to at least mitigate it now.

This fails to recognize that both TTML and VTT may not just be used for captions, but for other things, too. We can't realistically restrict new features in either format to those that are available in the "common object model" (whatever that may be).


> I believe that what the caption and subtitle industry, and more importantly the users that are Deaf or hard of hearing, most urgently need is a single lingua-franca; and we are not serving them well if we don't at least try to merge these efforts. To the extent that we have two formats at all, VTT and TTML should be effectively two syntaxes for the same thing, where inter-conversion is a trivial rewrite. If new features are desirable, then they should be desirable, and usable for all formats.

OK, this requires us to start with analysing what the differences are.
I believe you've started that effort and I'm curious to see what you have found out.

Best Regards,
Silvia.
Received on Friday, 7 June 2013 16:38:33 UTC