- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Wed, 12 Jun 2013 19:30:49 +1000
- To: Sean Hayes <Sean.Hayes@microsoft.com>
- Cc: John Birch <John.Birch@screensystems.tv>, "public-tt@w3.org" <public-tt@w3.org>
On Wed, Jun 12, 2013 at 12:27 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> OK. I got a bit carried away there with my hyperbole, and I apologize unreservedly for that. No personal attack intended; Ian is a very smart guy and has obviously put a lot of time into this, as have many other smart people, and I respect that. Don’t get me wrong, I am genuinely trying to get my head around this and am attempting to not only understand what this spec actually says, but also if possible why it says it. As Glenn once quipped, doing spec work is the art of reading carefully. Which is what I am attempting to do. I'm prepared for the fact that the spec might be buggy, and that’s to be expected, but since the spec offers almost no redundancy through motivational text, pictures or examples, and this model is so unlike anything else, it's really hard to spot what is intentional and what is a bug.

That's a fair statement and I'd be happy to make any fixes necessary.

> My major concern at this point is how stable this spec is. If we (Microsoft) put an implementation out in the world, it can be very hard to retract if it's wrong, and as you are no doubt aware, if the spec subsequently changes this is often used by detractors to berate those efforts. If there are parts of the spec that are known to be wrong/under dispute (i.e. more than normally unstable - as obviously it's all under development), can you please, as a matter of some urgency, mark them in some way - no need to have a fix at this point, but some highlight and/or a link to a relevant entry in the Bugzilla database would be very helpful. I have read through the extant bugs, and I think I have a handle on the things that you are trying to fix, but I cannot be 100% sure I got everything, and many of the threads there peter out inconclusively so it's hard to know what their current status is.

I'd rather spend my limited time fixing them.
> My secondary concern however is whether, even when corrected, it specifies something that is genuinely useful, and at this point I am struggling with that. One part of that concern stems of course from whether this could in practice be used as a delivery format for content stored in TTML or other formats, either as a conversion in the browser or in a server somewhere. My feeling at this point is that VTT does not offer a sufficiently general positioning mechanism to allow that to happen.

What's your exact concern about the positioning mechanism? If you can clarify that and raise it on list or in a bug, we can discuss and fix it.

> I cannot look at your source code for legal reasons, so I'll have to take your word on how close to the written spec that implementation is.

It's all under a Creative Commons license which puts it in the public domain, so your legal reasons likely don't apply: http://creativecommons.org/publicdomain/zero/1.0/

> And I can't know what value you chose for the bottom margin, but it does seem to me that the spec does require you to put text in that margin. And if the margin is small; (and I arbitrarily chose 1% of the video height for the top and bottom, since apparently 0 is not allowed). Then I believe the spec does require you to place text off-screen.

The spec is written to avoid placing anything off-screen at all costs. Text will disappear rather than be placed outside the video viewport.

>>> Where do you get that from? I don't think that's correct - why would it drop half the line?
> As to an underlying rationale, I can offer no opinion, but I base my reasoning on the following:
>
>>If the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is set
> y-position -> 0
> ...
>>14. Adjust the positions of boxes according to the appropriate steps from the following list:
> >If cue's text track cue snap-to-lines flag is set (it is)
>
> margin = 1vh
> - In the absence of overscan, this value should be picked for aesthetics (to avoid text being aligned precisely on the bottom edge of
> - the video, which can be ugly).
> Notwithstanding that 'aesthetics' is a remarkably subjective term to use in such an algorithmic oriented specification,
> this important value is UA dependent, which means vertical position is actually not predictable for authoring.

Well, there's a strong recommendation of 1%.

>
> > Let full dimension be the height of video's rendering area = 100vh
> > 3. Let max dimension be full dimension - (2 × margin). (margin = 1 max = 98)
> > 4. Let step be the height of the first line box in boxes. (step -> 6 based on line-height:normal => 1.2 and a font height of 5vh from section 5.2.2)
> > 6. Let line position be the text track cue computed line position.
> >> 7. Let n be the number of text tracks whose text track mode is showing and that are in the media element's list of text tracks before track.
> n -> 0
> >> 8. Increment n by one.
> n -> 1
> >> 9. Negate n.
> n -> -1
> >> 10. Return n.
> line position -> -1
>> 8. Let position be the result of multiplying step and line position
> Position -> -6
>> 10. If line position is less than zero then increase position by max dimension, and negate step
> Position = 92
> Step = 6
>> Move all the boxes in boxes down by the distance given by position.
>
> So where top was 0, it is now 92vh; sufficient for one line and a bit
> - since the spec requires partial lines to be removed,

I don't see where that conclusion comes from. There are no partial lines to be removed - line breaks are introduced when the text is too long.

> only the first line shows up. Or gets clipped depending on the outcome of that bug issue.
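The arithmetic traced in the quoted steps above can be sketched in a few lines of Python. This is only an illustration of the numbers used in this thread (a 1vh margin, a 5vh font height, a line-height factor of 1.2, and no other showing tracks), not the normative algorithm; the function and parameter names are hypothetical.

```python
def snap_to_lines_position(video_height_vh=100.0, margin_vh=1.0,
                           font_height_vh=5.0, line_height_factor=1.2,
                           showing_tracks_before=0):
    """Return (position, step) in vh for a cue with an automatic line position.

    All defaults are the example values from the discussion, not values
    mandated by the spec.
    """
    full_dimension = video_height_vh
    # Step 3: max dimension = full dimension - (2 x margin)
    max_dimension = full_dimension - 2 * margin_vh
    # Step 4: height of the first line box (assumed: font height x line-height)
    step = font_height_vh * line_height_factor
    # Computed line position: count showing tracks before this one,
    # increment by one, then negate.
    line_position = -(showing_tracks_before + 1)
    # Step 8: position = step x line position
    position = step * line_position
    # Step 10: a negative line position counts up from the bottom margin
    if line_position < 0:
        position += max_dimension
        step = -step
    return position, step


position, step = snap_to_lines_position()
print(position, step)
```

With the defaults the boxes are moved down by 92vh, matching the figure quoted above. In this sketch the step is negated to -6 after the adjustment, since that is what "negate step" implies.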
> Given that the example text is 50 characters long, and with a font advance of ~3vw, it's probable that this text needs to break into at least two lines, at least for the purposes of determining the container height; it may subsequently be restyled to a different size, and thus reflow, but I believe that does not change the layout box.

Yes it does.

>> Firstly, "line:60%" has no influence on it disappearing (it's 60% down from the top of the video).
> True it's not the reason it disappears, but it is not 60% down from the top of the video, at least not according to the written text,

It is. If you're testing this in Chrome and haven't adjusted the align property, then I believe Chrome makes space for 1 letter and uses as many lines as it can without overlapping another cue, so it likely uses the whole video height. However, "line:60%" originally positioned it 60% down the video height.

> Although as you say that appears to be buggy, it requires the first line to be at 42% down, because in the repositioning section:
> >> Let x be a percentage given by the text track cue text position, and let y be a percentage given by the text track cue computed line position
>> 2. Position the boxes in boxes such that ... the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area....
>
> Thus line=60% causes the 60% point of box to be placed at the 60% point of the height of the video. The height of the bounding box is determined by the font advance (not specified - so again the final value is in fact UA dependent) but probably around 3vw * width and the amount of text.

As I said: this part needs adjusting.

>> However, you're telling the browser to position a cue of 50% width middle aligned at the left side of the video.
>> At the left there is no space for a middle aligned cue, so as much as the browser is trying to squeeze this cue in, it wasn't given any space and thus the cue disappears (1 char long is a bug in an implementation). So it disappears.
>
> Well my expectation was, possibly naively, that the cue will be subsequently moved, to have its left edge at 0vw, so, in actuality there is plenty of room for it.

Why should the browser be allowed to freely move a cue when you specified a middle aligned cue that is positioned at the left edge of the video with its middle position there? That would be breaking the author's intention.

> The size constraint appears to happen at the wrong time, and is IMO actually unnecessary. Just define the video viewport to clip all cues, and let the author be responsible for keeping their content visible. That is the CSS way of things.

As I said: WebVTT tries really hard to keep all cues within the viewport, including avoiding clipping text. We may change that constraint, though, as a consequence of the bugs.

> Whether it shows 1 char or not is an ambiguity in the spec.

No, it's an implementation bug. It should not be showing anything.

> Since at least as far as I can tell (and I have looked pretty hard) the special breaking rules do not define what happens when the width is less than 1 character, and CSS would I believe allow overflow to occur in this case.

No, only if you allow CSS to have overflow.

> Also I'd point out that those special breaking rules don’t seem to work particularly well for internationalization. Japanese for example is pretty strict about where you can break lines, and in languages like Arabic where words are essentially continuous marks, it doesn't work very well if you aren’t at least a little selective in where you break.

Yup, CSS takes care of that.

> Even in English you need to be somewhat careful about where you break so as not to damage the meaning of text (e.g.
>
> His clean but
> toned coat
>
> Reads quite differently from
>
> His clean
> buttoned coat.

Words are not broken in the middle unless there is no other way.

> Even though the former has a better 'delta'.
>
> Now depending on whether the spec truly requires an absolute minimum delta (in which case the proper breaking would be something like the truly awful,
>
> His clean bu
> ttoned coat
>
> The relative weighting of the two rules is not clear, so there is definitely a reading which would imply that the spec would actually require the first layout, over the second.

You assume that words are broken in the middle, which they aren't.

> There is a very good reason CSS is not overly prescriptive in this area and they have been thinking about it for 20 years or so. Far better I think to leave it up to the author in the first instance, and CSS in the second.

Sure, the author can always influence this. CSS does not provide for balanced text at the moment, though the feature is in development and WebVTT will then rely on the CSS spec for it. We have heard from many caption experts that captions require balancing - it makes for faster reading of captions.

>>Try adding an "align:start" and you will be fine.
> Yes, but I don’t want a left aligned caption, I want a centered aligned caption which stretches from 0vw to 50vw, i.e. centered around 25vw.
>
> Let's say because I am modeling a two speaker dialog and I want each speaker to have their own half of the screen. There is plenty of room for that. Can you tell me what values I should use to achieve it?

If you want it centered around 25vw, then you have to write "position:25%" and not "position:0%".
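To make the position/alignment interaction concrete, here is a minimal Python sketch of how a cue's horizontal extent could be derived from its position, size and alignment. It assumes left-to-right horizontal text and the sizing rule as described in this thread (the available width is limited by the distance from the anchor point to the nearest video edge). The function name and return shape are hypothetical, and this is not taken from any implementation.

```python
def cue_box_extent(position, size, align):
    """Return (left, width) in percent of the video width.

    position, size: percentages in the range 0..100
    align: "start", "middle" or "end" (left-to-right text assumed)
    """
    if align == "start":
        # The box grows rightwards from the anchor point.
        max_size = 100 - position
    elif align == "end":
        # The box grows leftwards from the anchor point.
        max_size = position
    elif align == "middle":
        # The box grows equally in both directions, so the nearer
        # edge limits it.
        max_size = 2 * min(position, 100 - position)
    else:
        raise ValueError(f"unknown alignment: {align}")

    width = min(size, max_size)
    if align == "start":
        left = position
    elif align == "end":
        left = position - width
    else:
        left = position - width / 2
    return left, width


# Middle-aligned at the left edge: no room at all, so the cue vanishes.
print(cue_box_extent(0, 50, "middle"))
# Middle-aligned at 25%: the box spans 0% to 50% of the video width.
print(cue_box_extent(25, 50, "middle"))
```

Under these assumptions, a 50% wide middle-aligned cue at "position:0%" ends up with zero width, while the same cue at "position:25%" occupies the left half of the video.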
> I think one significant problem is that the cue properties are doing partial double duty here and interact in subtle ways and this leads to some problems, not least of which, although I haven’t done a full analysis yet, is that it seems there are many quite valid layouts that VTT is unable to express, like the fairly common one I just gave. And really I'm not sure why these controls are even necessary, CSS provides all the controls you could possibly need to position a box in the video rectangle; As VTT is supposed to be a simple browser optimized format, why even have these controls at all?

VTT relies on CSS to render them, but VTT makes it easier for VTT authors to write rules that make sense in the limited viewport-related positioning use case that VTT targets rather than forcing authors to understand the full CSS specification.

> VTT seems to be heavily optimized to solve a corner case of a corner case, that is automatically moving captions to avoid overlap; which almost never happens in practice, and could quite easily be taken care of as an authoring constraint as is done in SDP and indeed in VTT for time order. It wouldn’t matter so much, but this seems to be largely at the expense of predictability and expressiveness which are two key requirements if VTT is going to be the target for the world's caption corpus. The position of a caption is part of its semantics, as they are often placed to indicate which speaker is speaking; to avoid specific areas in the video itself; and for aesthetics. If the basic controls don’t offer the author the ability to at least semi accurately place captions, IMO the format is going to be very hard to use, if not a complete non-starter - hence my admittedly unfortunate comment.

Caption overlap only comes into play when there is more than 1 caption displayed at the same time. As you say, it's a very unlikely case. Therefore, what you seem to be focusing in on isn't even executed 99% of the time.
If that's a case that you're not interested in, just ignore that part of the spec (unless of course you are implementing it in IE and need to be compatible with the other browsers when this corner case actually happens).

>> If you don't author accurate cues, you are bound to get rubbish. That's the case with any format.
> While I agree with your sentiment, the problem is in VTT it is very hard to author accurate cues.

I partially agree with you here - but it's all captured in the bugs.

> Even apparently simple cases are fiendishly difficult to arrange, some common arrangements seem actually impossible, and there are a number of UA specific values that can move my content in unknown ways. This is my point.
>
> Hoping you take this in the spirit I intend it, which is to make captioning the web the best experience it can be for users, achieved with the minimum of expense and effort on the part of providers.

Agreed, that's indeed the intention of WebVTT.

I'll go now and make some spec changes. ;-)

Cheers,
Silvia.
Received on Wednesday, 12 June 2013 09:31:37 UTC