Re: WebVTT

On Wed, Jun 12, 2013 at 11:05 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>
>
>>> My major concern at this point is how stable this spec is. If we (Microsoft) put an implementation out in the world, it can be very hard to retract if it's wrong, and as you are no doubt aware, if the spec subsequently changes this is often used by detractors to berate those efforts. If there are parts of the spec that are known to be wrong/under dispute (i.e. more than normally unstable - as obviously it's all under development), can you please, as a matter of some urgency, mark them in some way - no need to have a fix at this point, but some highlight and/or a link to a relevant entry in the Bugzilla database would be very helpful. I have read through the extant bugs, and I think I have a handle on the things that you are trying to fix, but I cannot be 100% sure I got everything, and many of the threads there peter out inconclusively so it's hard to know what their current status is.
>
>>I'd rather spend my limited time fixing them.
>
> Well I understand you have limited time, as we all do; however it's very hard to have any kind of debate on this if we don't know what's going on. I don’t see a lot of discussion happening on the CG mailing list or on the bug threads. How do I stay in touch with this activity?


That's all there is - the mailing list and the bugs. Just add your
name to all the CG bugs (all you need to do is click the "Add me to CC
list" toggle button at the top right and hit "Save Changes" when you
are on the page of a bug). There aren't that many.


>>> My secondary concern however is whether,  even when corrected, it specifies something that is genuinely useful, and at this point I am struggling with that. One part of that concern stems of course from whether this could in practice be used as a delivery format for content stored in TTML or other formats, either as a conversion in the browser or in a server somewhere. My feeling at this point is that VTT does not offer a sufficiently general positioning mechanism to allow that to happen.
>
>>What's your exact concern about the positioning mechanism? If you can clarify that and raise it on list or in a bug, we can discuss  and fix it.
>
> My concern is that one cannot pick an arbitrary rectangle on the screen for example 0vw 80vh to 60vw 94vh with centered text using the controls in the current spec text.

The theory is that because you have centered text (i.e.
align:middle), you need to specify the middle of your cue as the cue
position (i.e. line:60%, position:47%). Then you specify the cue's
width appropriately (i.e. size:94%). That should do it.
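As a rough illustration of that rule (a sketch only, not text from the
spec): for a middle-aligned cue, the position is the horizontal middle of
the box and the size is its width. The helper name and its parameters are
made up for this example; only the setting names line, position, size and
align come from the discussion above.

```python
def centered_cue_settings(left_pct, width_pct, line_pct):
    """Build a cue settings string for a middle-aligned box.

    left_pct and width_pct describe the box horizontally, as percentages
    of the video width; line_pct is passed through unchanged.
    This helper is hypothetical and only mirrors the rule described above.
    """
    # With align:middle the cue is anchored at its horizontal middle.
    position = left_pct + width_pct / 2
    return (
        f"line:{line_pct:g}% position:{position:g}% "
        f"size:{width_pct:g}% align:middle"
    )


# The numbers from the reply above: a box spanning 0% to 94% of the width.
print(centered_cue_settings(0, 94, 60))
# A box covering the left half of the video is centered around 25%.
print(centered_cue_settings(0, 50, 90))
```
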


> Essentially, in order to be able to translate TTML into this you need to be able to set the CSS values for top, left, width and height directly on a cue, and not have them subsequently altered (the cue may of course be clipped by the video viewport).

Subsequent altering will only happen if the text doesn't fit into this
box or overlaps other cues.


>>> I cannot look at your source code for legal reasons, so I'll have to take your word on how close to the written spec that implementation is.
>>It's all under a creative commons license which puts it in the public domain so your legal reasons likely don't apply: http://creativecommons.org/publicdomain/zero/1.0/
>
> It's an internal ruling by my legal department so that I am not influenced by your code should I need to create my own implementation, which I am doing.
> It's not impossible for me to use FOSS licensed code, but it needs me to make a lot more of a case to legal than I am prepared to make at this point.

Fair enough. :-)

>>>>> Where do you get that from? I don't think that's correct - why would it drop half the line?
>>> As to an underlying rationale, I can offer no opinion, but I base my reasoning on the following:
>>>
>>>>If the text track cue writing direction is horizontal, and the text
>>>>track cue snap-to-lines flag is set
>>>    y-position -> 0
>>> ...
>>>>14.Adjust the positions of boxes according to the appropriate steps from the following list:
>>>   >If cue's text track cue snap-to-lines flag is set (it is)
>>>   > margin = 1vh
>>>              - In the absence of overscan, this value should be picked for aesthetics  (to avoid text being aligned precisely on the bottom edge of
>>>               - the video, which can be ugly).
>>>                 Notwithstanding that 'aesthetics' is a remarkably subjective term to use in such an algorithmic oriented specification,
>>>                 this  important value is UA dependent, which means vertical position is actually not predictable for authoring.
>
>> Well, there's a strong recommendation of 1%.

You're right. It's missing - I thought we had made a recommendation.
I've registered a bug:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=22332 .


> Where is this recommendation documented, and more importantly, how does an author work around it, since it has a bearing on what to write in the file?

It's mainly created to work around overscan, so when the browser knows
that a video has overscan, it can automatically deduct that from the
caption rendering area.


>>>   > Let full dimension be the height of video's rendering area  = 100vh
>>>   > 3.Let max dimension be full dimension - (2 × margin).    (margin = 1  max = 98)
>>>  > 4. Let step be the height of the first line box in boxes.  (step -> 6 based on line-height:normal => 1.2 and a font height of 5vh from section 5.2.2)
>>>  > 6. Let line position be the text track cue computed line position.
>>>                 >> 7. Let n be the number of text tracks whose text track mode is showing and that are in the media element's list of text tracks before track.
>>>                           n -> 0
>>>                 >> 8. Increment n by one.
>>>                           n -> 1
>>>                  >> 9. Negate n.
>>>                           n -> -1
>>>                  >> 10. Return n.
>>>      line position -> -1
>>>> 8.Let position be the result of multiplying step and line position
>>>          Position -> -6
>>>> 10.If line position is less than zero then increase position by max
>>>> dimension, and negate step
>>>          Position = 92
>>>          Step = -6
>>>> Move all the boxes in boxes down by the distance given by position.
>>>
>>> So where top was 0, it is now 92vh; sufficient for one line and a bit
>>> - since the spec requires partial lines to be removed,
>
>>I don't see where that conclusion comes from. There are no partial lines to be removed - line breaks are introduced when the text is too long.
>
> Yes, as I explain immediately below, the text is too long because it's 50 characters of 3vh, which adds up to 150vh.
> If the width of my text is 150% of the height of the video and the width of my video is 133% of its height (in a 4x3 aspect ratio), then as 150 > 133 clearly it needs to be broken into more than one line.
> Now your implementation may be using some very narrow characters, or wider video, in which case you might not have seen this; but 3vh for a 5vh font seems to be typical.

OK, I still don't follow. But if you have discovered a bug, please register it.
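
For reference, the arithmetic in the quoted walkthrough can be sketched as
follows. This only mirrors the steps as they are quoted in this thread; the
function names are invented for the example and all values are in vh. Note
that a literal reading of "negate step" leaves step at -6.

```python
def default_line_position(showing_tracks_before):
    """Quoted steps 7-10: count showing tracks before this one, add one, negate."""
    return -(showing_tracks_before + 1)


def snap_to_lines_position(full_dimension, margin, step, line_position):
    """Quoted steps 3, 8 and 10 for a horizontal snap-to-lines cue."""
    max_dimension = full_dimension - 2 * margin   # 100 - 2 * 1 = 98
    position = step * line_position               # 6 * -1 = -6
    if line_position < 0:
        position += max_dimension                 # -6 + 98 = 92
        step = -step                              # 6 becomes -6
    return position, step


line_position = default_line_position(0)
print(line_position)
print(snap_to_lines_position(100, 1, 6, line_position))
```
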


>>> only the first line shows up. Or gets clipped depending on the outcome of that bug issue. Given that the example text is 50 characters long, and with a font advance of ~3vw, it's probable that this text needs to break into at least two lines, at least for the purposes of determining the container height;
>
>>>it may subsequently be restyled to a different size, and thus reflow, but I believe that does not change the layout box.
>>Yes it does.
>
> If you say so, but that's not following the spec as written, as it contradicts the following text from the spec, which is part of "The rules for interpreting WebVTT cue text":
>
> "No style sheets are associated with nodes. (The nodes are subsequently restyled using style sheets after their boxes are generated, as described below.)"
> And later - in 5.2.2 Applying CSS properties to WebVTT Node Objects
> "On the (root) list of WebVTT Node Objects, the 'position' property must be set to 'absolute' ...  the 'top' property must be set to top, the 'left' property must be set to left, the 'width' property must be set to width, and the 'height' property must be set to height, where direction, writing-mode, top, left, width, and height are the values with those names determined by the rules for updating the display of WebVTT text tracks for the text track cue from whose text the list of WebVTT Node Objects was constructed."
> Therefore the crucial layout traits are fixed here based on the values we computed as above.

Yes, external CSS is applied after the WebVTT cue rendering algorithm
has executed and the basic CSS parameters have been set. However, during
the rendering algorithm, the width of the video is taken into account
and lines are broken, creating new CSS boxes that become part of what
is being rendered. That's what I was referring to.
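
To make the line-breaking estimate from the quoted example concrete, here
is a back-of-the-envelope sketch. It assumes every character advances by
the same amount and that the cue may use the full video width; it ignores
word boundaries and the cue's size setting. The function name is made up.

```python
import math


def estimated_line_count(char_count, advance_vh, aspect_ratio):
    """Estimate how many lines a run of text needs.

    advance_vh is the assumed per-character advance in vh units;
    aspect_ratio is video width divided by video height.
    """
    text_width_vh = char_count * advance_vh    # 50 * 3 = 150vh
    video_width_vh = 100 * aspect_ratio        # about 133vh for 4:3
    return math.ceil(text_width_vh / video_width_vh)


print(estimated_line_count(50, 3, 4 / 3))    # 4:3 video
print(estimated_line_count(50, 3, 16 / 9))   # 16:9 video
```

Under these assumptions the 50-character example needs two lines on a 4:3
video but fits on one line at 16:9.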


> Although style can be applied at any point only the traits explicitly set by the properties set in 5.3.3 have any effect:
> "The following properties apply to the '::cue' pseudo-element with no argument; other properties set on the pseudo-element must be ignored
> •'color'
> •'opacity'
> •'visibility'
> •'text-decoration'
> •'text-outline'
> •'text-shadow'
> •the properties corresponding to the 'background' shorthand
> •the properties corresponding to the 'outline' shorthand
> •the properties corresponding to the 'font' shorthand, including 'line-height'
> •'white-space'
> "
> Since none of these will affect the absolute positioning traits of top, left, width or height, the layout box remains fixed, and the inner nodes are flowed with respect to that containing block.
> Therefore if the font size changes, then the text needs to flow inside that box.


No, the rendering algorithm is executed again:
"User agents that support the pseudo-element described below must
dynamically update renderings accordingly. When either 'white-space'
or one of the properties corresponding to the 'font' shorthand
(including 'line-height') changes value, then the text track cue's
text track cue display state must be emptied and the text track's
rules for updating the text track rendering must be immediately
rerun."

The re-run is then using the new CSS settings for those properties and
thus ends up creating different boxes.


>>>> However, you're telling the browser to position a cue of 50% width middle aligned at the left side of the video. At the left there is no space for a middle aligned cue, so as much as the browser is trying to squeeze this cue  in, it wasn't given any space and thus the cue disappears (1 char long is a bug in an implementation). So it disappears.
>>>
>>> Well my expectation was, possibly naively, that the cue will be subsequently moved, to have its left edge  at 0vw, so, in actuality there is plenty of room for it.
>
>>Why should the browser be allowed to freely move a cue when you specified a middle aligned cue that is positioned at the left edge of the video with its middle position there? That would be breaking the author's intention.
>
> Well precisely. It shouldn't. It shouldn't resize it either. In my opinion the whole notion of the browser "fixing up" a layout to suit itself is misguided. However that is what the spec-as-written requires it to do.

Only when it has no other choice.


>>> The size constraint appears to happen at the wrong time, and is IMO actually unnecessary. Just define the video viewport to clip all cues, and let the author  be responsible for keeping their content visible. That is the CSS way of things.

The "size" defines the width of the cue. It's an important part of
defining the box as you outlined above.

BTW: you might be interested in
https://www.w3.org/Bugs/Public/show_bug.cgi?id=20146


>>As I said: WebVTT tries really hard to keep all cues within the viewport, including avoiding clipping text. We may change that constraint, though, as a consequence of the bugs.
>
> OK well as I said I think the principle should be trust the author, and rely on clipping to the viewport where the author abuses that trust.

Yes, that may be how we fix it. I've got to look into that bug
in more detail.

> "Fixup" behavior may be fine for a user whipping up some captions for a 5m YouTube video, but it's not really appropriate if VTT is intended as the delivery vehicle for the world's caption corpus, which, unless I am very much mistaken, Ian has explicitly stated it was never intended to be.

I don't remember such a statement. Anyway - WebVTT should be good for
both use cases. With the new region spec, you may find you can achieve
a bit more control, because cues that are painted into regions will
not try to be adjusted, but simply overlap when not authored
carefully: https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/region.html
.



>>Words are not broken in the middle unless there is no other way.
>
> That is not what the text says.
>
> A: ◾Regardless of the value of the 'white-space' property, lines must be wrapped at the edge of their containing blocks, even if doing so requires splitting a word where there is no line breaking opportunity
> B: ◾Regardless of the value of the 'white-space' property, any line breaks inserted by the user agent for the purposes of line wrapping must be placed so as to minimize Δ across each run of consecutive lines between preserved newlines in the source
>
> The only way that I can see that these two phrases can operate simultaneously, and I see no indication that one is subordinate to the other, is if the line breaks can be placed arbitrarily.
> If you want to preserve word structure then you need more nuanced wording here (or better still remove it altogether and rely on CSS).

"even if doing so requires.." is careful wording. It says that only if
necessary words will be split in the middle.

> I don't assume anything, it's the logical conclusion of what the spec says. It explicitly says a) 'even if doing so requires splitting a word' and b) minimize delta.

It clearly says "Text runs must be wrapped according to the CSS
line-wrapping rules", but in addition the wrapping should minimize the
delta and if necessary split words where there is no line breaking
opportunity. I don't see how that could be interpreted differently.


>>> There is a very good reason CSS is not overly prescriptive in this area and they have been thinking about it for 20 years or so. Far better I think to leave it up to the author in the first instance, and CSS in the second.
>
>>Sure, the author can always influence this. CSS does not provide for balanced text at the moment, though the feature is in development and WebVTT will then rely on the CSS spec for it. We have heard from many caption experts that captions require balancing - it makes for faster reading of captions.
>
> Yes captions need balance, and caption authors do that. Don't try and make browsers do it,  because they aren't very good at it.

The experience from YouTube has shown that many captions/subtitles are
provided in a single long line. These lines usually end up having
to be broken (because they are wider than the video or wider than the
available caption width). And they usually end up being broken with a
massive imbalance of words (only one or two words ending up in the
second line, the rest in the first). That's the only problem that this
approach is trying to solve. A professional captioner will naturally
provide balanced captions that the browser does not have to reflow.
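
A small sketch of the imbalance problem and one way of evening it out. This
is not the algorithm from the spec: it assumes widths are measured in
characters, scores a layout by the difference between its longest and
shortest line (a stand-in for the spec's delta), and never splits words.
All names are invented for the example.

```python
from itertools import combinations


def greedy_wrap(words, max_width):
    """Fill each line as far as possible before starting the next one."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def spread(lines):
    """Assumed imbalance score: longest line minus shortest line."""
    lengths = [len(line) for line in lines]
    return max(lengths) - min(lengths)


def balanced_wrap(text, max_width):
    """Keep the greedy line count, but pick the most even break points."""
    words = text.split()
    best = greedy_wrap(words, max_width)
    if len(best) <= 1:
        return best
    best_score = spread(best)
    # Brute force over break positions; fine for caption-sized text.
    for breaks in combinations(range(1, len(words)), len(best) - 1):
        bounds = (0, *breaks, len(words))
        lines = [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]
        if any(len(line) > max_width for line in lines):
            continue
        if spread(lines) < best_score:
            best, best_score = lines, spread(lines)
    return best


text = "aaaa bbbb cccc dddd e"
print(greedy_wrap(text.split(), 20))   # one word stranded on line two
print(balanced_wrap(text, 20))         # same line count, evened out
```
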


> If and when CSS comes up with a solution, then let's by all means look at it, but I can 100% guarantee it won't be the solution you have here.

The CSS work is being influenced by the needs of the WebVTT reflow
algorithm, so I expect it to satisfy them.


>>>>Try adding an "align:start" and you will be fine.
>>> Yes, but I don’t want a left aligned caption, I want a centered aligned caption which stretches from 0vw to 50vw, i.e. centered around  25vw.
>>>
>>>  Let's say because I am modeling a two speaker dialog and I want each speaker to have their own half of the screen. There is plenty of room for that.  Can you tell me what values I should use to achieve it?
>>If you want it centered around 25vw, then you have to write "position:25%"  and not "position:0%".
>
> That doesn't work in the spec-as-written.
> It appears you intend to remove the text in section 15.2.1.10.14.else.2; is that correct?

There is no section 15 .. looking ... - I assume you mean section 5.
Yes, that part needs to be rewritten.

> But even after removing the reposition text I still cannot have a centered box that is left or right aligned in the video and is wider than 50% of the video width.

You will be able to, once it's changed. And also, right now, the
region spec allows you to do this.


>>> VTT seems to be heavily optimized to solve a corner case of a corner case, that is automatically moving captions to avoid overlap; which almost never happens in practice, and could quite easily be taken care of as an authoring constraint as is done in SDP and indeed in VTT for time order. It wouldn’t matter so much, but this seems to be largely at the expense of predictability and expressiveness which are two key requirements if VTT is going to be the target for the world's caption corpus. The position of a caption is part of its semantics, as they are often placed to indicate which speaker is speaking; to avoid specific areas in the video itself; and for aesthetics. If the basic controls don’t offer the author the ability to at least semi accurately place captions, IMO the format is going to be very hard to use, if not a complete non-starter - hence my admittedly unfortunate comment.
>>
>>Caption overlap only comes into play when there is more than 1 caption displayed at the same time. As you say, it's a very unlikely case.
>>Therefore, what you seem to be focusing in on isn't even executed 99% of the time. If that's a case that you're not interested in, just ignore that part of the spec (unless of course you are implementing it in IE and need to be compatible with the other browsers when this corner case actually happens).
>
> No the parts of the spec I'm having issues with happen well before considering any overlap, and need to work properly before this document can go to rec.
> What I am implementing and where that implementation might end up isn't relevant to this discussion.

Yes, there are bugs to fix.


>>>> If you don't author accurate cues, you are bound to get rubbish. That's the case with any format.
>>> While I agree with your sentiment, the problem is in VTT it is very hard to author accurate cues.
>>
>>I partially agree with you here - but it's all captured in the bugs.
>
> I think it may only be partially captured in the bugs. But let's see when you have finished correcting them.

Feel free to register any further bugs that you come across.


>>> Hoping you take this in the spirit I intend it, which is to make captioning the web the best experience it can be for users, achieved with the minimum of expense and effort on the part of providers.
>
>>Agreed, that's indeed the intention of WebVTT. I'll go now and make some spec changes. ;-)
>
> Interesting. I would have thought you'd need to resolve the issues with the community group first?

What do you mean? Of course I have to fix bugs that have been
registered. I can't just go and change the spec randomly.


> Anyway, it seems we may need to table this discussion until the text is somewhat more mature.

I think it's one particular section that you have the most trouble
with. But feel free to wait until that bug is fixed.

Regards,
Silvia.

Received on Wednesday, 12 June 2013 14:11:13 UTC