WebVTT wide review

Dear TTWG,

In my BBC role, with responsibility for engineering strategy for access services including subtitles, I've reviewed WebVTT and have the following feedback.

The first thing to note is that this review is based on the question "would/could BBC use WebVTT?" or, if not, "what would it take for BBC to be able to use WebVTT?" and is therefore affected by our existing infrastructure, guidelines and approach, and is somewhat specific to BBC (and possibly other organisations with similar requirements). It is not meant to suggest that the same response would apply for all other organisations and individuals considering WebVTT. Nor is it intended to start a "flame war" between standards or factions: note that it is not a comparative review, and it may be that other formats suffer similar issues or worse! I make this point explicitly because my overall view of WebVTT at present is that it has architectural and design features that make it costly or difficult for the BBC to use across the whole chain from authoring to presentation.

Secondly, in reviewing, I am considering it both from an architectural perspective and in some cases looking at the detail – apologies in advance for switching between the two without warning! As a consequence, it seems inappropriate to raise GitHub issues on the repo for all the comments. Where it is appropriate I intend to do so if nobody else does, but I'm going to write first and raise the issues later. Nevertheless I think architectural comments are worth raising as part of the wide review, even if there is no easy resolution. Those comments are probably best thought of more as explanation than as a call for specific actions, for which I might expect a 'won't fix' response. I would still like a disposition response on those comments though, even if it is only an explanation of why the decision was taken.

Thirdly, though my comments are all about changes that I'd like to see, that is, issues I have identified, it is worth pointing out that as a simple way to serialise subtitle or caption text and associate it with times and presentational styles, WebVTT mostly works, within a particular scope.

WebVTT serialisation and parsing

The WebVTT syntax is similar to (but incompatible with) SRT but otherwise distinct from all other syntaxes, and includes a subsection that is effectively CSS syntax. I consider the serialisation and parsing of a document format to be an architectural layer in its own right, ideally with tests, tools and support for the format. In the case of WebVTT, the fact that it has a unique format means that the benefits of referencing an independent serialisation and parsing layer are absent. For internal business-to-business transactions this creates some hurdles: it is costlier to develop a syntax checker, for example, to validate that received files are well formed or to quality check the content; and writing custom parser code becomes a security risk, since issues like buffer overflow are more commonly, though not uniquely, found in less mature code. The tool support for e.g. JSON, HTML or XML serialisation is much more mature and less likely to suffer from these problems.

It is unclear what action could resolve this with WebVTT in its current form, without taking seemingly extreme steps. For example if WebVTT were a semantic model plus an API, and alternative representations were defined, and at least one of those alternative representations were a more commonly used one, that would help, though at the expense of adding an initial step for every WebVTT import or export, which is to work out which representation to use.

From this perspective, the syntax of WebVTT seems better suited to direct writing and editing in text editors by humans than by software, though obviously it is ultimately feasible to use either. For an organisation like the BBC authoring and distributing subtitle documents at scale it would be better to optimise for machine reading and writing instead of human reading and writing, since we expect subtitle authors and editors to use specialist software rather than tweaking files directly.

Content signalling

WebVTT is able to carry data intended for different purposes – captions, chapters, metadata etc. However the lack of signalling within documents creates an asset management and validation problem. If we have 3 WebVTT files associated with a single video asset, how do we know which is which? How would we validate them? The same issue arises with languages: it is impossible to signal the primary language for a WebVTT file within the file; requiring it to be set by the HTML srclang attribute is, frankly, completely useless within the scope of the WebVTT specification. How can an HTML page be generated to specify it correctly, and how can the value be validated for any given WebVTT file unless the language identifier is present within it? What if the WebVTT file is intended for use without being embedded in an HTML document?

The three types of WebVTT file indicated in §4.6 appear to be indistinguishable by inspecting the file.

Timing

WebVTT timing does not specify the time base that applies, except by reference to HTML5.1 media timelines. In general it is assumed to be media time, i.e. the first frame of the video is time zero. That makes sense in relation to a distributed piece of media such as would be present in an HTML video track. However there are broadcast usages of subtitle files that need to relate times to embedded timecode in video. It may be that such practices are phased out over time; until that occurs, however, there is no defined data structure in WebVTT (even as metadata) that allows the processor to understand how to relate timestamps to the media with which the file is somehow associated, or indeed how the timestamps were generated.

For example, if a WebVTT file were authored against embedded timecode timestamps, but then that file were accidentally re-used against video media stripped of such timestamps, the content would display at the wrong time, but there would be no information available within the WebVTT file to identify that such a scenario had occurred.
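
As a rough illustration of that scenario, the sketch below assumes a programme whose embedded timecode starts at 10:00:00 (a purely hypothetical value, as are the names used) and a cue authored against that timecode. Nothing in the cue itself records the offset, so a processor applying media time has no way to recover the intended display time.

```python
def to_seconds(timestamp: str) -> float:
    """Convert an hh:mm:ss.ttt timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Assumption for illustration only: the embedded timecode of the first frame.
START_TIMECODE = "10:00:00.000"

# A cue start time authored against the embedded timecode.
cue_start = "10:00:05.000"

# What the author intended: five seconds after the first frame.
intended_offset = to_seconds(cue_start) - to_seconds(START_TIMECODE)

# What happens when the same value is read as media time (first frame = zero).
media_time_offset = to_seconds(cue_start)

print(f"intended: {intended_offset}s, interpreted as media time: {media_time_offset}s")
```

The two values differ by the start timecode, which is exactly the piece of information that has no defined home in the file.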

The syntax of WebVTT timestamps requires that all timestamps contain a full stop character followed by three digits. This is a likely source of errors in two ways. Firstly, if a conformant processor sees a timestamp with a different number of digits after the full stop (even a greater number), it returns an error rather than a valid alternative time, so it is not robust to authoring errors. Secondly, a non-conformant processor that does not return an error must decide how to deal with the unexpected number of digits. Since the value is clearly defined as a "number of milliseconds", most interpretations would/should treat a timestamp such as "00:00:01.2" as having 2 milliseconds, but this is contrary to the normal mathematical representation of a fractional value, which would consider it to be 200 milliseconds. I would propose that the parser should a) append '0's to the value4 string to pad it to 3 characters if it has fewer than 3, and b) truncate the value4 string to 3 characters if it has more than 3. (Rounding seems pointlessly complex when the significance is +/- 0.5ms and the user agent probably doesn't meet that level of precision anyway.)
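
The padding and truncation proposed above could be sketched as follows. This is a minimal illustration, not the specification's parsing algorithm: the function name and the regular expression are assumptions, and only the fractional part is treated leniently.

```python
import re

# Hypothetical pattern: optional hours (two or more digits), then mm:ss,
# a full stop, and one or more fraction digits (the spec requires three).
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d+)")

def parse_timestamp_lenient(text: str) -> int:
    """Return the timestamp in milliseconds, normalising the fraction to 3 digits."""
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a timestamp: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # a) pad with '0' if shorter than 3 digits; b) truncate if longer than 3.
    millis = int(fraction[:3].ljust(3, "0"))
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + millis
```

With this approach "00:00:01.2", "00:00:01.200" and "00:00:01.2009" all resolve to the same time, matching the usual reading of a decimal fraction.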

Styling

The default font size is 5vh; however, the default line spacing for addressing line positions is only 5.33vh, representing a line spacing of 106.6%. That would be an unusually small line height and may result in strange effects. I cannot see anywhere that overrides the line position height based on the computed value of the line-height property. (Comment repeated below under positioning, since it's relevant there too.)
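
The arithmetic behind the 106.6% figure, taking the two values quoted above at face value, can be checked as below. The 1080px viewport height is an arbitrary assumption used only to make the numbers concrete.

```python
# Figures quoted in the paragraph above, both in vh (1vh = 1% of viewport height).
FONT_SIZE_VH = 5.0
LINE_STEP_VH = 5.33

# Line step expressed as a proportion of the font size.
ratio = LINE_STEP_VH / FONT_SIZE_VH
print(f"line step / font size = {ratio:.1%}")

# Hypothetical viewport height, for illustration only.
viewport_height_px = 1080
font_px = FONT_SIZE_VH * viewport_height_px / 100
step_px = LINE_STEP_VH * viewport_height_px / 100
leading_px = step_px - font_px
print(f"font {font_px:.1f}px, line step {step_px:.1f}px, leading {leading_px:.1f}px")
```

On that assumed viewport the space left between lines is only a few pixels.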

The default background colour's opacity of 0.8 may cause accessibility problems for some users, especially for bright patterned video behind the text. This can easily be fixed by setting it to 1.

The list of ::cue CSS properties in §7.2.1 is a useful minimal set, especially for a non-CSS-based implementation; however, the requirement to ignore any properties not in the list seems overly restrictive for a CSS-based implementation. What is the mechanism for adding further styling options in the future, given that WebVTT files do not indicate any specification version?

There is no similar list for ::cue-region which seems like a strange inconsistency – I am unsure which properties are permitted on a region. For example, is border permitted? What about padding?

Positioning

In some places sizes and positions are defined relative to the video viewport; in others the video itself. This is likely to cause some confusion or mis-alignment when the two are not the same (e.g. a 16:9 aspect ratio video is displayed in a 14:9 viewport) and creates an authoring problem. For example the cue box size is relative to the video but the cue box line is relative to the video viewport (both defined within section 3.1).

Two positioning and sizing constructs are defined: the cue box and the region. It is not clear in the data model section why both are needed and how a cue box is supposed to interact with a region. Indeed it appears that they are independent constructs, given that only the cue's position (and none of the other attributes) relates to or depends on the region. The cue box size is relative to the video viewport, not the region. This means that the concept of a region is not a "bounding box" within which cue text is rendered, but is only an alternative positioning reference. That semantic definition differs from other specifications that define regions, including TTWG's other specification covering this use case, TTML. (As Chair, I would prefer that there be semantic alignment here between the working group's specifications, and given the history and current state, that would mean modifying the treatment of regions in WebVTT.)

Reading section 3 data model, it appears that the region width and height are unused. This may be a bug, since §6.1 rule 14.3.3 says to interpret the cue's computed position as a percentage of the region width. It is in general hard to work out if the data model section aligns with the rendering section. I'm not sure how to fix that, but until it is fixed, document authors and implementers may be at loggerheads, each apparently doing the right thing but finding unexpected results for the other.

It is not clear how the use of lines as a positioning construct is defined. Whereas section 4.4 "line cue setting" says that the height of a line is based on the first line, in §6.1, step 12.1 bullet two, the line height is initialised to a very specific value 5.33vh, regardless of the height of the line, and I can't see anywhere that sets another value in its place (I could not see any reference to the line-height property in the rendering section, but please point me to it if it is there).

The user-agent defined edge margins used in section 6 Rendering effectively mean that the author of the file cannot predict positions or line breaks accurately, since the size and position of text cannot be known at authoring time. This prevents the author from being able to position captions in a location that avoids overlapping parts of the video that are important for the viewer to see in order to understand the content. This makes it effectively impossible to meet the BBC's editorial subtitle guidelines <http://bbc.github.io/subtitle-guidelines/#Positioning> using WebVTT. Note that this comment must logically only apply when a single track is visible since WebVTT defines rules for multiple tracks being displayed simultaneously, in which case positioning is presumably outside the realm of the subtitle author, in the general case.

As an alternative to edge margins, why not allow something like the padding property to be set on whatever viewport is used as the positioning reference area? (cf. the comment above about positioning relative to video or to video viewport.) By the way, from a CSS box model perspective, calling them "margins" is confusing – aren't they "padding" values since they are inset from the video viewport area?

Metadata

The lack of delimiters on metadata is restrictive – it means that data structures cannot be placed in metadata blocks without first checking that they do not contain text that would be interpreted by a processor as cue text. Then if they do contain such text, there is no defined escaping mechanism to work around it. We capture metadata about the production process of subtitle and caption files within those files, for internal re-use – that appears to be impossible with WebVTT in a general way now, though could be addressed by means of delimiters, escaping mechanisms etc.
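
The check that a producer would need to run before embedding a data structure could look something like the sketch below. It assumes two constraints on a payload, taken from my reading of the syntax rather than quoted from it: it must not contain a blank line (which would end the block) and must not contain the "-->" substring (which would be read as a timing line). The function name is hypothetical.

```python
def is_safe_metadata_payload(payload: str) -> bool:
    """Return True if the payload avoids the two assumed problem patterns."""
    # Normalise line terminators so CR, LF and CRLF are treated alike.
    lines = payload.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    # An empty line anywhere (including at the start or end) would split the block.
    has_blank_line = any(line == "" for line in lines)
    return not has_blank_line and "-->" not in payload

print(is_safe_metadata_payload('{"editor": "example"}'))    # no problem patterns
print(is_safe_metadata_payload('{"note": "a --> b"}'))      # contains "-->"
print(is_safe_metadata_payload("line one\n\nline two"))     # contains a blank line
```

A payload that fails the check has no defined way of being escaped, so the producer must either alter the data or choose a different encoding for it.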

Terminology

It is extremely difficult to read and understand the WebVTT specification because of the terminology used, especially in relation to positioning. Many concepts have overlapping use of words when they relate to different things. For example section 4.4 "WebVTT cue settings" includes the "WebVTT line cue setting" which actually configures the "cue box" position.

A detail point about the text in that section: looking at that line cue setting, it "configures the offset of the cue box from the video viewport's edge in the direction opposite to the writing direction". I think that means "orthogonal to the writing direction" but it could be interpreted as the negative of the writing direction, i.e. if the writing direction is top-to-bottom then does this mean left-to-right or bottom-to-top? Then the WebVTT position cue setting "configures the indent position of the cue box in the direction orthogonal to the WebVTT line cue setting". Why doesn't it simply say "in the writing direction"?

It would be easier for me to understand and follow if there were a clearer box model, in which there is a region box, a cue box within that region, and then a set of cue lines within that cue box. Then the positioning units of, say, the cue box, can be in terms of lines, relative to the region edges, and the overall rendering area (the default region?) can be positioned relative to the video or the video viewport in a consistent manner.

Algorithmic specifications

The algorithmic nature of the processing model is helpful for initial implementations, but it is very difficult to understand and follow and perhaps more importantly looks brittle in case of changes. If a change is needed to the algorithm, it may be challenging for implementers to work out where in their implementations the equivalent change is needed, especially if the implementation has been optimised and does not follow the specification's structure exactly.

Formatting

There seem to be some formatting oddities when I view the specification on Firefox or Chrome – in section 3.1 there's an algorithm for calculating the computed line, whose outline box might be helpful but overlaps the bullet numbers on the left.

The diagram at the bottom of section 3.2 seems to truncate the word "Video viewport".

Some text has a yellow highlight colour without explanation, e.g. in Example 19.

Conclusion

As mentioned at the top, as this specification is at present I could not recommend to the BBC that it should adopt WebVTT across the whole chain from authoring to presentation for both broadcast and online use, for prepared and live subtitles. That would be my preferred scope for any subtitle/closed caption format, though may go beyond the intended requirements set for WebVTT (which by the way are unstated in the specification). Restricting the scope, to consider solely the last stage and limiting the devices, i.e. thinking about direct distribution to browsers, the main presentational issue is that the positioning semantics are hard to understand and seem to work against the BBC's subtitle guidelines (developed with over 30 years of experience with closed captions). This may be something that can be resolved by improvements to the specification. There appears to be no support at present for live contribution, though this could be addressed by other specifications. The issues concerning content identification, language etc could probably be resolved quite easily to assist with the asset management challenges I have identified above. The issue about serialisation and parsing probably goes beyond any reasonable expectation to address at this stage, but could be a continuing content management problem.

Kind regards,

Nigel

--
Nigel Megitt
Executive Product Manager, BBC Design & Engineering

Received on Friday, 22 September 2017 14:47:02 UTC