RE: WebVTT from Sean Hayes on 2013-06-12 (public-tt@w3.org from June 2013)

From: Sean Hayes <Sean.Hayes@microsoft.com>
Date: Wed, 12 Jun 2013 13:05:43 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
CC: John Birch <John.Birch@screensystems.tv>, "public-tt@w3.org" <public-tt@w3.org>
Message-ID: <E9A92BD0A4FC934EB7935470A46D15241F69577B@DB3EX14MBXC324.europe.corp.microsoft.c>


>I'd be happy to make any fixes necessary.

OK - glad to hear that.

>> My major concern at this point is how stable this spec is. If we (Microsoft) put an implementation out in the world, it can be very hard to retract if it's wrong, and as you are no doubt aware, if the spec subsequently changes this is often used by detractors to berate those efforts. If there are parts of the spec that are known to be wrong/under dispute (i.e. more than normally unstable - as obviously it's all under development), can you please, as a matter of some urgency, mark them in some way - no need to have a fix at this point, but some highlight and/or a link to a relevant entry in the Bugzilla database would be very helpful. I have read through the extant bugs, and I think I have a handle on the things that you are trying to fix, but I cannot be 100% sure I got everything, and many of the threads there peter out inconclusively so it's hard to know what their current status is.

>I'd rather spend my limited time fixing them.

Well I understand you have limited time, as we all do; however it's very hard to have any kind of debate on this if we don't know what's going on. I don’t see a lot of discussion happening on the CG mailing list or on the bug threads. How do I stay in touch with this activity?

>> My secondary concern however is whether,  even when corrected, it specifies something that is genuinely useful, and at this point I am struggling with that. One part of that concern stems of course from whether this could in practice be used as a delivery format for content stored in TTML or other formats, either as a conversion in the browser or in a server somewhere. My feeling at this point is that VTT does not offer a sufficiently general positioning mechanism to allow that to happen.

>What's your exact concern about the positioning mechanism? If you can clarify that and raise it on list or in a bug, we can discuss  and fix it.

My concern is that one cannot pick an arbitrary rectangle on the screen, for example 0vw 80vh to 60vw 94vh with centered text, using the controls in the current spec text.
Essentially, in order to translate TTML into this, you need to be able to set the CSS values for top, left, width and height directly on a cue, and not have them subsequently altered (the cue may of course be clipped by the video viewport).
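
A minimal sketch of the mapping asked for here, assuming the rectangle is given by its top-left and bottom-right corners in viewport units. The function name rect_to_css is hypothetical and is not part of WebVTT, TTML or CSS; it only shows how the example rectangle would translate into the four CSS values.

```python
def rect_to_css(x0_vw, y0_vh, x1_vw, y1_vh):
    """Convert corner coordinates (viewport units) into CSS box properties.

    Hypothetical helper for illustration only.
    """
    if x1_vw < x0_vw or y1_vh < y0_vh:
        raise ValueError("bottom-right corner must not precede top-left corner")
    return {
        "position": "absolute",
        "left": f"{x0_vw}vw",
        "top": f"{y0_vh}vh",
        "width": f"{x1_vw - x0_vw}vw",
        "height": f"{y1_vh - y0_vh}vh",
        "text-align": "center",
    }


# The rectangle from the example: 0vw 80vh to 60vw 94vh, centered text.
css = rect_to_css(0, 80, 60, 94)
```

For the example rectangle this yields a box 60vw wide and 14vh high whose top edge sits at 80vh.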

>> I cannot look at your source code for legal reasons, so I'll have to take your word on how close to the written spec that implementation is.
>It's all under a creative commons license which puts it in the public domain so your legal reasons likely don't apply: http://creativecommons.org/publicdomain/zero/1.0/


It's an internal ruling by my legal department so that I am not influenced by your code should I need to create my own implementation, which I am doing.
It's not impossible for me to use FOSS licensed code, but it requires me to make a lot more of a case to legal than I am prepared to make at this point.

>>And I can't know what value you chose for the bottom margin, but it does seem to me that the spec does require you to put text in that margin. And if the margin is small (I arbitrarily chose 1% of the video height for the top and bottom, since apparently 0 is not allowed), then I believe the spec does require you to place text off-screen.

>The spec is written to avoid placing anything off-screen at all cost.

Well good to know the intention. Obviously it needs fixing, because it fails to do that as written.

> Text will rather disappear than be placed outside the video viewport.

Yes I understand that.


>>>> Where do you get that from? I don't think that's correct - why would it drop half the line?
>> As to an underlying rationale, I can offer no opinion, but I base my reasoning on the following:
>>
>>>If the text track cue writing direction is horizontal, and the text 
>>>track cue snap-to-lines flag is set
>>    y-position -> 0
>> ...
>>>14.Adjust the positions of boxes according to the appropriate steps from the following list:
>>   >If cue's text track cue snap-to-lines flag is set (it is)
>>   > margin = 1vh
>>              - In the absence of overscan, this value should be picked for aesthetics  (to avoid text being aligned precisely on the bottom edge of
>>               - the video, which can be ugly).
>>                 Notwithstanding that 'aesthetics' is a remarkably subjective term to use in such an algorithmic oriented specification,
>>                 this  important value is UA dependent, which means vertical position is actually not predictable for authoring.

> Well, there's a strong recommendation of 1%.

Where is this recommendation documented, and more importantly how does an author work around it since it has a bearing on what to write in the file.

>>
>>   > Let full dimension be the height of video's rendering area  = 100vh
>>   > 3.Let max dimension be full dimension - (2 × margin).    (margin = 1  max = 98)
>>  > 4. Let step be the height of the first line box in boxes.  (step -> 6 based on line-height:normal =>  1.2 and a font height of 5vh from section 5.2.2)
>>  > 6.Let line position be the text track cue computed line position.
>>                 >> 7. Let n be the number of text tracks whose text track mode is showing and that are in the media element's list of text tracks before track.
>>                           n -> 0
>>                 >> 8. Increment n by one.
>>                           n -> 1
>>                  >> 9. Negate n.
>>                           n -> -1
>>                  >> 10. Return n.
>>      line position -> -1
>>> 8.Let position be the result of multiplying step and line position
>>          Position -> -6
>>> 10.If line position is less than zero then increase position by max 
>>> dimension, and negate step
>>          Position = 92
>>          Step = -6
>>> Move all the boxes in boxes down by the distance given by position.
>>
>> So where top was 0, it is now 92vh; sufficient for one line and a bit
>> - since the spec requires partial lines to be removed,
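
The arithmetic in the quoted steps can be sketched as follows. This is a minimal sketch that follows the quoted step text literally, with all lengths in vh; the function name is hypothetical, and the inputs (100vh video height, 1vh margin, 6vh step, line position -1) are the values worked out in the quoted walk-through.

```python
def snap_to_lines_position(full_dimension, margin, step, line_position):
    """Follow the quoted steps literally; all lengths are in vh.

    Hypothetical helper for illustration, not spec text.
    """
    max_dimension = full_dimension - 2 * margin   # quoted step 3
    position = step * line_position               # quoted step 8
    if line_position < 0:                         # quoted step 10
        position += max_dimension
        step = -step
    return position, step


# step = 6vh, i.e. line-height 1.2 times a 5vh font, as in the walk-through.
position, step_after = snap_to_lines_position(100, 1, 6, -1)

# Room between the new top edge and the bottom of the video, in whole lines.
room_vh = 100 - position
whole_lines = room_vh // 6
```

With these inputs the boxes move down by 92vh, which leaves 8vh below the top edge: room for one 6vh line and a bit.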

>I don't see where that conclusion comes from. There are no partial lines to be removed - line breaks are introduced when the text is too long.

Yes, as I explain immediately below, the text is too long because it's 50 characters of 3vh, which adds up to 150vh.
If the width of my text is 150% of the height of the video and the width of my video is 133% of its height (in a 4x3 aspect ratio), then as 150 > 133 clearly it needs to be broken into more than one line.
Now your implementation may be using some very narrow characters, or wider video, in which case you might not have seen this; but 3vh for a 5vh font seems to be typical. 
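
The width comparison above can be sketched as a short calculation. It assumes a uniform 3vh advance per character and a 4:3 video whose width is expressed in vh; it ignores word boundaries and any narrowing of the cue box, so it gives only a lower bound on the number of lines. The function name is hypothetical.

```python
import math


def lines_needed(char_count, advance_vh, aspect_ratio):
    """Estimate a lower bound on line count from a uniform character advance."""
    text_width_vh = char_count * advance_vh
    video_width_vh = 100 * aspect_ratio  # video width expressed in vh
    lines = math.ceil(text_width_vh / video_width_vh)
    return text_width_vh, video_width_vh, lines


# 50 characters at 3vh each on a 4:3 video.
text_w, video_w, lines = lines_needed(50, 3, 4 / 3)
```

Here the text is 150vh wide against a video roughly 133vh wide, so at least two lines are needed.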

>>only the first line shows up. Or gets clipped depending on the outcome of that bug issue. Given that the example text is 50 characters long, and with a font advance of ~3vw, it's probable that this text needs to break into at least two lines, at least for the purposes of determining the container height; 

>>it may subsequently be restyled to a different size, and thus reflow, but I believe that does not change the layout box.
>Yes it does.

If you say so, but that's not following the spec as written, as it contradicts the following text from the spec, which is part of "The rules for interpreting WebVTT cue text":

"No style sheets are associated with nodes. (The nodes are subsequently restyled using style sheets after their boxes are generated, as described below.)"
And later - in 5.2.2 Applying CSS properties to WebVTT Node Objects
"On the (root) list of WebVTT Node Objects, the 'position' property must be set to 'absolute' ...  the 'top' property must be set to top, the 'left' property must be set to left, the 'width' property must be set to width, and the 'height' property must be set to height, where direction, writing-mode, top, left, width, and height are the values with those names determined by the rules for updating the display of WebVTT text tracks for the text track cue from whose text the list of WebVTT Node Objects was constructed."
Therefore the crucial layout traits are fixed here, based on the values we computed above.
Although style can be applied at any point, only the properties listed in 5.3.3 have any effect:
"The following properties apply to the '::cue' pseudo-element with no argument; other properties set on the pseudo-element must be ignored
•'color'
•'opacity'
•'visibility'
•'text-decoration'
•'text-outline'
•'text-shadow'
•the properties corresponding to the 'background' shorthand
•the properties corresponding to the 'outline' shorthand
•the properties corresponding to the 'font' shorthand, including 'line-height'
•'white-space'
"
Since none of these will affect the absolute positioning traits of top, left, width or height, the layout box remains fixed, and the inner nodes are flowed with respect to that containing block.
Therefore if the font size changes, then the text needs to flow inside that box.
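
The filtering effect of that list can be sketched as below. This is an approximation for illustration: the quoted list names three shorthands without enumerating their longhands, so the sketch assumes that any property named after a shorthand or prefixed with it (for example font-size) corresponds to it, which is looser than the real CSS definitions. The function names are hypothetical.

```python
# Properties named individually in the quoted list.
ALLOWED_EXACT = {
    "color", "opacity", "visibility", "text-decoration",
    "text-outline", "text-shadow", "line-height", "white-space",
}
# Shorthands named in the quoted list; prefix matching is an assumption.
ALLOWED_SHORTHANDS = ("background", "outline", "font")


def applies_to_cue(prop):
    """Return True if the property would be honoured on '::cue'."""
    if prop in ALLOWED_EXACT:
        return True
    return any(prop == s or prop.startswith(s + "-") for s in ALLOWED_SHORTHANDS)


def filter_cue_style(declarations):
    """Drop every declaration that the quoted list says must be ignored."""
    return {p: v for p, v in declarations.items() if applies_to_cue(p)}


styled = filter_cue_style(
    {"color": "yellow", "font-size": "8vh", "top": "10vh", "width": "90vw"}
)
```

The font size change survives, while top and width are dropped, so the box computed earlier stays as it was and the larger text has to flow inside it.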

>>> Firstly, "line:60%" has no influence on it disappearing (it's 60% down from the top of the video).
> >True it's not the reason it disappears, but it is not 60% down from 
> >the top of the video, at least not according to the written text,
> It is. If you're testing this in Chrome and haven't adjusted the align property, then I believe Chrome makes space for 1 letter and uses as many lines as it can without overlapping another cue, so it likely uses the whole video height. However, "line:60%" originally positioned it 60% down the video height.

I am not testing in Chrome. I am basing this purely on the specification text; what Chrome does is irrelevant to this discussion. Although if what you say is true, it appears Chrome is not following the spec-as-written.

>>  Although as you say that appears to be buggy, it requires the first line to be at 42% down, because in the repositioning section:
>>
>>> Let x be a percentage given by the text track cue text position, and 
>>> let y be a percentage given by the text track cue computed line position 2.Position the boxes in boxes such that ... the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area....
>>
>> Thus line=60% causes the 60% point of box to be placed at the 60% point of the height of the video.  The height of the bounding box is determined by the font advance (not specified - so again the final value is in fact UA dependent) but probably around 3 vw * width  and the amount of text.
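
The quoted repositioning rule can be sketched as a one-line calculation: if the point y% down the box must coincide with the point y% down the video, the top edge of the box lands at y% of the space left over. The 30vh box height used below is an assumption, chosen because it reproduces the 42% figure; the thread does not state the box height. The function name is hypothetical.

```python
def top_for_percentage_line(y_percent, box_height_vh, video_height_vh=100):
    """Top edge (vh) such that the y% point of the box is at y% of the video."""
    # y% of video height minus y% of box height, kept in one division.
    return y_percent * (video_height_vh - box_height_vh) / 100


# Assumed box height of 30vh (for example five lines of 6vh).
top = top_for_percentage_line(60, 30)
```

Under that assumption the first line starts 42vh down rather than 60vh, and the result changes with the box height, which in turn depends on the font advance.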

>As I said: this part needs adjusting.

To do what?


>>> However, you're telling the browser to position a cue of 50% width middle aligned at the left side of the video. At the left there is no space for a middle aligned cue, so as much as the browser is trying to squeeze this cue  in, it wasn't given any space and thus the cue disappears (1 char long is a bug in an implementation). So it disappears.
>>
>> Well my expectation was, possibly naively, that the cue will be subsequently moved, to have its left edge  at 0vw, so, in actuality there is plenty of room for it.

>Why should the browser be allowed to freely move a cue when you specified a middle aligned cue that is positioned at the left edge of the video with its middle position there? That would be breaking the author's intention.

Well precisely. It shouldn't. It shouldn't resize it either. In my opinion the whole notion of the browser "fixing up" a layout to suit itself is misguided. However that is what the spec-as-written requires it to do. 
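
The resizing under discussion can be sketched as follows. This is a sketch under a stated assumption, not quoted spec text: it models a middle-aligned cue as centered on its position and limited to the symmetric space around that point, which matches the behavior described in this thread (no room at position 0%, a box from 0vw to 50vw at position 25%). The function name is hypothetical.

```python
def middle_aligned_box(position_pct, requested_size_pct):
    """Return (left, width) in percent of video width for a centered cue.

    Assumption: the width is capped at the symmetric space around the position.
    """
    max_size = 2 * min(position_pct, 100 - position_pct)
    size = min(requested_size_pct, max_size)
    left = position_pct - size / 2
    return left, size


at_left_edge = middle_aligned_box(0, 50)    # no symmetric room: width collapses
at_quarter = middle_aligned_box(25, 50)     # spans 0% to 50%
```

In this model the requested 50% width is silently reduced to zero at position 0%, which is the kind of fix-up being objected to.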


>> The size constraint appears to happen at the wrong time, and is IMO actually unnecessary. Just define the video viewport to clip all cues, and let the author  be responsible for keeping their content visible. That is the CSS way of things.

>As I said: WebVTT tries really hard to keep all cues within the viewport, including avoiding clipping text. We may change that constraint, though, as a consequence of the bugs.

OK, well, as I said, I think the principle should be: trust the author, and rely on clipping to the viewport where the author abuses that trust.
"Fixup" behavior may be fine for a user whipping up some captions for a 5-minute YouTube video, but it's not really appropriate if VTT is intended as the delivery vehicle for the world's caption corpus, which, unless I am very much mistaken, Ian has explicitly stated it was never intended to be.

>> Whether it shows 1 char or not is an ambiguity in the spec.
>No, it's an implementation bug. It should not be showing anything.

Only if there is clear guidance in the spec. I don't see that text; can you point me at a reference that backs this statement up?

>> Since at least as far as I can tell (and I have looked pretty hard) the special breaking rules do not define what happens when the width is less than 1 character, and CSS would I believe allow overflow to occur in this case.

>No, only if you allow CSS to have overflow.
It is the default, and the spec doesn't override it. If you want that, I suggest that overflow:hidden be set as part of the default properties in 5.2.2.
As per CSS: "If an inline box cannot be split (e.g., if the inline box contains a single character, or language specific word breaking rules disallow a break within the inline box, or if the inline box is affected by a white-space value of nowrap or pre), then the inline box overflows the line box."

>> Also I'd point out that those special breaking rules don’t seem to work particularly well for internationalization. Japanese for example is pretty strict about where you can break lines, and in languages like Arabic where words are essentially continuous marks, it doesn't work very well if you aren’t at least a little selective in where you break.
>Yup, CSS takes care of that.

So don't override it then.

>> Even in English you need to be somewhat careful about where you break so as not to damage the meaning of text (e.g.
>>
>>                 His clean but
>>                   toned coat
>>
>> Reads quite differently from
>>
>>                    His clean
>>               buttoned coat.

>Words are not broken in the middle unless there is no other way.

That is not what the text says.

A: ◾Regardless of the value of the 'white-space' property, lines must be wrapped at the edge of their containing blocks, even if doing so requires splitting a word where there is no line breaking opportunity
B: ◾Regardless of the value of the 'white-space' property, any line breaks inserted by the user agent for the purposes of line wrapping must be placed so as to minimize Δ across each run of consecutive lines between preserved newlines in the source

The only way that I can see that these two phrases can operate simultaneously, and I see no indication that one is subordinate to the other, is if the line breaks can be placed arbitrarily.
If you want to preserve word structure then you need more nuanced wording here (or better still remove it altogether and rely on CSS).


>> Even though the former has a better 'delta'.
>>
>> Now depending on whether the spec truly requires an absolute minimum 
>> delta (in which case the proper breaking would be something like the 
>> truly awful,
>>
>>               His clean bu
>>               ttoned coat
>>
>> The relative weighting of the two rules is not clear, so there is definitely a reading which would imply that the spec would actually require the first layout, over the second.

>You assume that words are broken in the middle, which they aren't.

I don't assume anything, it's the logical conclusion of what the spec says. It explicitly says a) 'even if doing so requires splitting a word' and b) minimize delta.
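
The three layouts in this exchange can be compared with a small sketch. It assumes Δ is the difference between the longest and the shortest line, and it measures lines in characters as a stand-in for rendered width; both are simplifications, and the function names are hypothetical.

```python
def delta(lines):
    """Difference between longest and shortest line, in characters."""
    lengths = [len(line) for line in lines]
    return max(lengths) - min(lengths)


def two_line_splits(text, words_only):
    """Yield every way to break text into two lines.

    With words_only=True, breaks are allowed only at spaces.
    """
    for i in range(1, len(text)):
        if words_only and text[i] != " ":
            continue
        left, right = text[:i].rstrip(), text[i:].lstrip()
        if left and right:
            yield (left, right)


text = "His clean buttoned coat"
best_anywhere = min(two_line_splits(text, words_only=False), key=delta)
best_at_words = min(two_line_splits(text, words_only=True), key=delta)

d_but_toned = delta(("His clean but", "toned coat"))
d_buttoned = delta(("His clean", "buttoned coat"))
d_bu_ttoned = delta(("His clean bu", "ttoned coat"))
```

Under these assumptions the word-respecting layout has the largest Δ of the three, and the smallest Δ is reached only by splitting "buttoned" mid-word, so applying both rules with no ordering between them favors a mid-word break.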


>> There is a very good reason CSS is not overly prescriptive in this area and they have been thinking about it for 20 years or so. Far better I think to leave it up to the author in the first instance, and CSS in the second.

>Sure, the author can always influence this. CSS does not provide for balanced text at the moment, though the feature is in development and WebVTT will then rely on the CSS spec for it. We have heard from many caption experts that captions require balancing - it makes for faster reading of captions.

Yes captions need balance, and caption authors do that. Don't try and make browsers do it,  because they aren't very good at it.

If and when CSS comes up with a solution, then let's by all means look at it, but I can 100% guarantee it won't be the solution you have here.

>>>Try adding an "align:start" and you will be fine.
>> Yes, but I don’t want a left aligned caption, I want a centered aligned caption which stretches from 0vw to 50vw, i.e. centered around  25vw.
>>
>>  Let's say because I am modeling a two speaker dialog and I want each speaker to have their own half of the screen. There is plenty of room for that.  Can you tell me what values I should use to achieve it?
>If you want it centered around 25vw, then you have to write "position:25%"  and not "position:0%".

That doesn't work in the spec-as-written.
It appears you intend to remove the text in section 15.2.1.10.14.else.2 is that correct?
But even after removing the reposition text I still cannot have a centered box that is left or right aligned in the video and is wider than 50% of the video width.


>> I think one significant problem is that the cue properties are doing partial double duty here and interact in subtle ways and this leads to some problems, not least of which, although I haven’t done a full analysis yet, is that it seems there are many quite valid layouts that VTT is unable to express, like the fairly common one I just gave. And really I'm not sure why these controls are even necessary, CSS provides all the controls you could possibly need to position a box in the video rectangle; As VTT is supposed to be a simple browser optimized format, why even have these controls at all?
>VTT relies on CSS to render them, but VTT makes it easier for VTT authors to write rules that make sense in the limited viewport-related positioning use case that VTT targets rather than forcing authors to understand the full CSS specification.

There is nothing hard to understand about setting top, left, width and height, and having the renderer respect those values. It's a good deal simpler than trying to get your head around the current spec, as this thread amply demonstrates.

>> VTT seems to be heavily optimized to solve a corner case of a corner case, that is automatically moving captions to avoid overlap; which almost never happens in practice, and could quite easily be taken care of as an authoring constraint as is done in SDP and indeed in VTT for time order. It wouldn’t matter so much, but this seems to be largely at the expense of predictability and expressiveness, which are two key requirements if VTT is going to be the target for the world's caption corpus. The position of a caption is part of its semantics, as they are often placed to indicate which speaker is speaking; to avoid specific areas in the video itself; and for aesthetics. If the basic controls don’t offer the author the ability to at least semi accurately place captions, IMO the format is going to be very hard to use, if not a complete non-starter - hence my admittedly unfortunate comment.
>
>Caption overlap only comes into play when there is more than 1 caption displayed at the same time. As you say, it's a very unlikely case.
>Therefore, what you seem to be focusing in on isn't even executed 99% of the time. If that's a case that you're not interested in, just ignore that part of the spec (unless of course you are implementing it in IE and need to be compatible with the other browsers when this corner case actually happens).

No the parts of the spec I'm having issues with happen well before considering any overlap, and need to work properly before this document can go to rec. 
What I am implementing and where that implementation might end up isn't relevant to this discussion.

>>> If you don't author accurate cues, you are bound to get rubbish. That's the case with any format.
>> While I agree with your sentiment, the problem is in VTT it is very hard to author accurate cues.
>
>I partially agree with you here - but it's all captured in the bugs.

I think it may only be partially captured in the bugs. But let's see when you have finished correcting them.


>> Even apparently simple cases are fiendishly difficult to arrange, some common arrangements seem actually impossible, and there are a number of UA specific values that can move my content in unknown ways. This is my point.
>>
>> Hoping you take this in the spirit I intend it, which is to make captioning the web the best experience it can be for users, achieved with the minimum of expense and effort on the part of providers.

>Agreed, that's indeed the intention of WebVTT. I'll go now and make some spec changes. ;-)

Interesting. I would have thought you'd need to resolve the issues with the community group first?
Anyway, it seems we may need to table this discussion until the text is somewhat more mature.

Regards,
Sean
Received on Wednesday, 12 June 2013 13:06:55 UTC