- From: David Singer <singer@mac.com>
- Date: Thu, 28 Sep 2017 13:01:09 -0700
- To: "public-texttracks@w3.org" <public-texttracks@w3.org>
Thank you, Silvia. It would be good to get rapid resolution of these issues, so comments that lead towards closure would be warmly received.

> On Sep 28, 2017, at 8:00, Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote:
>
> Hi fellow WebVTT devs,
>
> As you will know, WebVTT is in the process of moving towards standardisation in the TTWG. Since releasing the updated Working Draft, a number of W3C members have sent their feedback.
>
> I'd like to involve this group in the discussions around the feedback, so am sending the emails on.
>
> Note that most of the feedback is going straight into GitHub, so keep an eye on https://github.com/w3c/webvtt/issues/ .
>
> Below is my first forward of some really in-depth feedback from Nigel at the BBC. He has also registered the issues in GitHub, so if you'd like to provide specific feedback, please do it there.
>
> Cheers,
> Silvia.
>
>
> ---------- Forwarded message ----------
> From: Nigel Megitt <nigel.megitt@bbc.co.uk>
> Date: Fri, Sep 22, 2017 at 7:46 AM
> Subject: WebVTT wide review
> To: Timed Text Working Group <public-tt@w3.org>
>
>
> Dear TTWG,
>
> In my BBC role with responsibility for engineering strategy for access services including subtitles I've reviewed WebVTT and have the following feedback.
>
> The first thing to note is that this review is based on the question "would/could BBC use WebVTT?" or if not, "what would it take for BBC to be able to use WebVTT?" and is therefore affected by our existing infrastructure, guidelines and approach and is somewhat specific to BBC (and possibly other organisations with similar requirements). It is not meant to suggest that the same response would apply for all other organisations and individuals considering WebVTT. Nor is it intended to start a "flame war" between standards or factions: note that it is not a comparative review, and it may be that other formats suffer similar issues or worse!
> I make this point explicitly because my overall view with WebVTT at present is that it has architectural and design features that make it costly or difficult for the BBC to use across the whole chain from authoring to presentation.
>
> Secondly, in reviewing, I am considering it both from an architectural perspective and in some cases looking at the detail – apologies in advance for switching between the two without warning! As a consequence, it seems inappropriate to raise GitHub issues on the repo for all the comments. Where it is appropriate I intend to do so if nobody else does it, but I'm going to write first and then raise the issues later. Nevertheless I think architectural comments are worth raising as part of the wide review, even if there is no easy resolution. Those comments are probably best thought of more as an explanation than a call for specific actions for which I might expect a 'won't fix' response. I would still like a disposition response on those comments though, even if that only is an explanation about why the decision was taken.
>
> Thirdly, though my comments are all about changes that I'd like to see, that is, issues I have identified, it is worth pointing out that as a simple way to serialise subtitle or caption text and associate it with times and presentational styles, WebVTT mostly works, within a particular scope.
>
> WebVTT serialisation and parsing
>
> The WebVTT syntax is similar to (but incompatible with) SRT but otherwise distinct from all other syntaxes, and includes a subsection that is effectively CSS syntax. I consider the serialisation and parsing of a document format to be an architectural layer in its own right, ideally with tests, tools and support for the format. In the case of WebVTT the fact that it has a unique format means that the benefits of referencing an independent serialisation and parsing layer are absent.
> For internal business to business transactions this creates some hurdles: it is costlier to develop a syntax checker for example to validate that received files are well formed, or to quality check the content; writing custom parser code becomes a security risk since issues like buffer overflow are more commonly, though not uniquely, found in less mature code. The tool support for e.g. JSON, HTML or XML serialisation is much more mature and less likely to suffer from these problems.
>
> It is unclear what action could resolve this with WebVTT in its current form, without taking seemingly extreme steps. For example if WebVTT were a semantic model plus an API, and alternative representations were defined, and at least one of those alternative representations were a more commonly used one, that would help, though at the expense of adding an initial step for every WebVTT import or export, which is to work out which representation to use.
>
> From this perspective, the syntax of WebVTT seems better suited to direct writing and editing in text editors by humans than by software, though obviously it is ultimately feasible to use either. For an organisation like the BBC authoring and distributing subtitle documents at scale it would be better to optimise for machine reading and writing instead of human reading and writing, since we expect subtitle authors and editors to use specialist software rather than tweaking files directly.
>
> Content signalling
>
> WebVTT is able to carry data intended for different purposes – captions, chapters, metadata etc. However the lack of signalling within documents creates an asset management and validation problem. If we have 3 WebVTT files associated with a single video asset, how do we know which is which? How would we validate them?
> The same issue arises with languages: it is impossible to signal the primary language for a WebVTT file within the file; requiring it to be set by the HTML srclang attribute is, frankly, completely useless within the scope of the WebVTT specification. How can an HTML page be generated to specify it correctly, and how can the value be validated for any given WebVTT file unless the language identifier is present within it? What if the WebVTT file is intended for use without being embedded in an HTML document?
>
> The three types of WebVTT file indicated in §4.6 appear to be indistinguishable by inspecting the file.
>
> Timing
>
> WebVTT timing does not specify the time base that applies, except by reference to HTML5.1 media timelines. In general it is assumed to be media time, i.e. the first frame of the video is time zero. That makes sense in relation to a distributed piece of media such as would be present in an HTML video track. However there are broadcast usages of subtitle files that need to relate times to embedded timecode in video. It may be that such practices are phased out over time; until that occurs however, there is no defined data structure in WebVTT (even as metadata) that allows the processor to understand how to relate timestamps to the media with which the file is somehow associated, or indeed how the timestamps were generated.
>
> For example, if a WebVTT file were authored against embedded timecode timestamps, but then that file were accidentally re-used against video media stripped of such timestamps, the content would display at the wrong time, but there would be no information available within the WebVTT file to identify that such a scenario had occurred.
>
> The syntax of WebVTT timestamps requires that all timestamps contain a full stop character followed by three digits.
> This is a likely source of errors in two ways: Firstly, if a conformant processor sees a timestamp with a different number of characters (even a greater number) after the full stop, rather than returning a valid alternative time, it returns an error. This is therefore not robust to authoring errors. Secondly, a non-conformant processor that does not return an error must make a decision about how to deal with the strange number of digits. Since it is clearly defined as a "number of milliseconds" most interpretations would/should treat a timestamp such as "00:00:01.2" as having 2 milliseconds, but this is contrary to the normal mathematical representation of a fractional value, which would consider it to be 200 milliseconds. I would propose that the parser should a) append '0's after the value4 string to pad it to 3 characters if it has fewer than 3 and b) truncate the value4 string to 3 characters if it has more than 3. (rounding seems pointlessly complex when the significance is +/- 0.5ms and the user agent probably doesn't meet that level of precision anyway)
>
> Styling
>
> The default font size is 5vh, however the default line spacing for addressing line positions is only 5.33vh, representing a line spacing of 106.6% - that would represent an unusually small line height and may result in strange effects. I cannot see anywhere that overrides the line position height based on the computed value of the line-height property. (comment repeated below under positioning, since it's relevant there too)
>
> The default background colour's opacity of 0.8 may cause accessibility problems for some users, especially for bright patterned video behind the text. This can easily be fixed by setting it to 1.
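
[Illustrative note on the timestamp proposal under "Timing" above: a minimal Python sketch of the suggested pad-or-truncate handling of the fractional part. The function name and the lenient pattern are assumptions made for illustration only; neither comes from the WebVTT specification, which requires exactly three fractional digits.]

```python
import re

# Hypothetical lenient pattern: accepts one or more fractional digits,
# whereas the WebVTT syntax requires exactly three. Hours are optional.
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d+)")


def parse_lenient_timestamp(text: str) -> int:
    """Return the timestamp in milliseconds, normalising the fraction."""
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a timestamp: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # a) pad with trailing zeros up to three digits;
    # b) truncate (not round) anything beyond three digits.
    millis = int(fraction.ljust(3, "0")[:3])
    total_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + millis
```

[Under this sketch "00:00:01.2" is read as 1200 ms rather than 1002 ms, matching the usual reading of a decimal fraction.]
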
>
> The list of ::cue CSS properties in §7.2.1 is a useful minimal set, especially for a non-CSS-based implementation; however, the requirement to ignore any properties not in the list seems overly restrictive for a CSS based implementation. What is the mechanism for adding further styling options in the future, given that WebVTT files do not indicate any specification version?
>
> There is no similar list for ::cue-region which seems like a strange inconsistency – I am unsure which properties are permitted on a region. For example, is border permitted? What about padding?
>
> Positioning
>
> In some places sizes and positions are defined relative to the video viewport; in others the video itself. This is likely to cause some confusion or mis-alignment when the two are not the same (e.g. a 16:9 aspect ratio video is displayed in a 14:9 viewport) and creates an authoring problem. For example the cue box size is relative to the video but the cue box line is relative to the video viewport (both defined within section 3.1).
>
> Two positioning and sizing constructs are defined: the cue box and the region. It is not clear in the data model section why both are needed and how a cue box is supposed to interact with a region. Indeed it appears that they are independent constructs given that only the cue's position (and none of the other attributes) relates to or depends on the region. The cue box size is relative to the video viewport, not the region. This means that the concept of a region is not a "bounding box" within which cue text is rendered, but is only an alternative positioning reference. That semantic definition differs from other specifications that define regions, including TTWG's other specification covering this use case, TTML.
> (as Chair, I would prefer that there be semantic alignment here between the working group's specifications, and given the history and current state, that would mean modifying the treatment of regions in WebVTT.)
>
> Reading section 3 data model, it appears that the region width and height are unused. This may be a bug, since §6.1 rule 14.3.3 says to interpret the cue's computed position as a percentage of the region width. It is in general hard to work out if the data model section aligns with the rendering section. I'm not sure how to fix that, but until it is fixed, document authors and implementers may be at loggerheads, each apparently doing the right thing but finding unexpected results for the other.
>
> It is not clear how the use of lines as a positioning construct is defined. Whereas section 4.4 "line cue setting" says that the height of a line is based on the first line, in §6.1, step 12.1 bullet two, the line height is initialised to a very specific value 5.33vh, regardless of the height of the line, and I can't see anywhere that sets another value in its place (I could not see any reference to the line-height property in the rendering section, but please point me to it if it is there).
>
> The user-agent defined edge margins used in section 6 Rendering effectively mean that the author of the file can not predict positions or line breaks accurately since the size and position of text cannot be known at authoring time. This prevents the author from being able to position captions in a location that avoids overlapping parts of the video that are important for the viewer to see in order to understand the content. This makes it effectively impossible to meet the BBC's editorial subtitle guidelines using WebVTT.
> Note that this comment must logically only apply when a single track is visible since WebVTT defines rules for multiple tracks being displayed simultaneously, in which case positioning is presumably outside the realm of the subtitle author, in the general case.
>
> As an alternative to edge margins, why not allow something like the padding property to be set on whatever viewport is used as the positioning reference area? (cf comment above about positioning relative to video or to video viewport) By the way, from a CSS box model perspective, calling them "margins" is confusing – aren't they "padding" values since they are inset from the video viewport area?
>
> Metadata
>
> The lack of delimiters on metadata is restrictive – it means that data structures cannot be placed in metadata blocks without first checking that they do not contain text that would be interpreted by a processor as cue text. Then if they do contain such text, there is no defined escaping mechanism to work around it. We capture metadata about the production process of subtitle and caption files within those files, for internal re-use – that appears to be impossible with WebVTT in a general way now, though could be addressed by means of delimiters, escaping mechanisms etc.
>
> Terminology
>
> It is extremely difficult to read and understand the WebVTT specification because of the terminology used, especially in relation to positioning. Many concepts have overlapping use of words when they relate to different things. For example section 4.4 "WebVTT cue settings" includes the "WebVTT line cue setting" which actually configures the "cue box" position.
>
> A detail point about the text in that section: looking at that line cue setting, it "configures the offset of the cue box from the video viewport's edge in the direction opposite to the writing direction".
> I think that means "orthogonal to the writing direction" but it could be interpreted as the negative of the writing direction, i.e. if the writing direction is top-to-bottom then does this mean left-to-right or bottom-to-top? Then the WebVTT position cue setting "configures the indent position of the cue box in the direction orthogonal to the WebVTT line cue setting". Why doesn't it simply say "in the writing direction"?
>
> It would be easier for me to understand and follow if there were a clearer box model, in which there is a region box, a cue box within that region, and then a set of cue lines within that cue box. Then the positioning units of, say, the cue box, can be in terms of lines, relative to the region edges, and the overall rendering area (the default region?) can be positioned relative to the video or the video viewport in a consistent manner.
>
> Algorithmic specifications
>
> The algorithmic nature of the processing model is helpful for initial implementations, but it is very difficult to understand and follow and perhaps more importantly looks brittle in case of changes. If a change is needed to the algorithm, it may be challenging for implementers to work out where in their implementations the equivalent change is needed, especially if the implementation has been optimised and does not follow the specification's structure exactly.
>
> Formatting
>
> There seem to be some formatting oddities when I view the specification on Firefox or Chrome – in section 3.1 there's an algorithm for calculating the computed line, whose outline box might be helpful but overlaps the bullet numbers on the left.
>
> The diagram at the bottom of section 3.2 seems to truncate the word "Video viewport".
>
> Some text has a yellow highlight colour without explanation, e.g. in Example 19.
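
[Illustrative note on the box model suggested under "Terminology" above: a rough Python sketch of a region box that contains cue boxes, which in turn contain cue lines, with the cue box positioned in lines relative to the region edge. Every class and field name here is hypothetical; none of this is defined by the WebVTT specification.]

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CueBox:
    """A cue box holding cue lines (hypothetical model)."""
    line_offset: int  # offset from the region's leading edge, in lines
    lines: List[str] = field(default_factory=list)


@dataclass
class Region:
    """A bounding box for cue boxes, sized in lines (hypothetical model)."""
    height_lines: int
    cue_boxes: List[CueBox] = field(default_factory=list)

    def contains(self, box: CueBox) -> bool:
        # The region acts as a true bounding box: every line of the
        # cue box must fall inside the region's height.
        end = box.line_offset + len(box.lines)
        return box.line_offset >= 0 and end <= self.height_lines
```

[With a model like this, an authoring tool could reject a cue box that would overflow its region before the file is ever rendered.]
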
>
> Conclusion
>
> As mentioned at the top, as this specification is at present I could not recommend to the BBC that it should adopt WebVTT across the whole chain from authoring to presentation for both broadcast and online use, for prepared and live subtitles. That would be my preferred scope for any subtitle/closed caption format, though may go beyond the intended requirements set for WebVTT (which by the way are unstated in the specification). Restricting the scope, to consider solely the last stage and limiting the devices, i.e. thinking about direct distribution to browsers, the main presentational issue is that the positioning semantics are hard to understand and seem to work against the BBC's subtitle guidelines (developed with over 30 years of experience with closed captions). This may be something that can be resolved by improvements to the specification. There appears to be no support at present for live contribution, though this could be addressed by other specifications. The issues concerning content identification, language etc could probably be resolved quite easily to assist with the asset management challenges I have identified above. The issue about serialisation and parsing probably goes beyond any reasonable expectation to address at this stage, but could be a continuing content management problem.
>
> Kind regards,
>
> Nigel
>
> --
> Nigel Megitt
> Executive Product Manager, BBC Design & Engineering
>

Dave Singer
singer@mac.com
Received on Thursday, 28 September 2017 20:01:42 UTC