Fwd: WebVTT wide review from Silvia Pfeiffer on 2017-09-28 (public-texttracks@w3.org from September 2017)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Thu, 28 Sep 2017 08:00:24 -0700
To: "public-texttracks@w3.org" <public-texttracks@w3.org>
Message-ID: <CAHp8n2=t0L2w69ovPzfisEVgNn6KL1grNXGVD7K96RLXpqR4-w@mail.gmail.com>

Hi fellow WebVTT devs,

As you will know, WebVTT is in the process of moving towards
standardisation in the TTWG. Since releasing the updated Working
Draft, a number of W3C members have sent their feedback.

I'd like to involve this group in the discussions around the
feedback, so am sending the emails on.

Note that most of the feedback is going straight into GitHub, so keep
an eye on https://github.com/w3c/webvtt/issues/ .

Below is my first forward of some really in-depth feedback from Nigel at
the BBC. He has also registered the issues in GitHub, so if you'd like
to provide specific feedback, please do it there.

Cheers,
Silvia.


---------- Forwarded message ----------
From: Nigel Megitt <nigel.megitt@bbc.co.uk>
Date: Fri, Sep 22, 2017 at 7:46 AM
Subject: WebVTT wide review
To: Timed Text Working Group <public-tt@w3.org>


Dear TTWG,

In my BBC role with responsibility for engineering strategy for access
services including subtitles I've reviewed WebVTT and have the
following feedback.

The first thing to note is that this review is based on the question
"would/could BBC use WebVTT?" or if not, "what would it take for BBC
to be able to use WebVTT?" and is therefore affected by our existing
infrastructure, guidelines and approach and is somewhat specific to
BBC (and possibly other organisations with similar requirements). It
is not meant to suggest that the same response would apply for all
other organisations and individuals considering WebVTT. Nor is it
intended to start a "flame war" between standards or factions: note
that it is not a comparative review, and it may be that other formats
suffer similar issues or worse! I make this point explicitly because
my overall view with WebVTT at present is that it has architectural
and design features that make it costly or difficult for the BBC to
use across the whole chain from authoring to presentation.

Secondly, in reviewing, I am considering it both from an architectural
perspective and in some cases looking at the detail – apologies in
advance for switching between the two without warning! As a
consequence, it seems inappropriate to raise GitHub issues on the repo
for all the comments. Where it is appropriate I intend to do so if
nobody else does it, but I'm going to write first and then raise the
issues later. Nevertheless I think architectural comments are worth
raising as part of the wide review, even if there is no easy
resolution. Those comments are probably best thought of more as an
explanation than a call for specific actions for which I might expect
a 'won't fix' response. I would still like a disposition response on
those comments though, even if that only is an explanation about why
the decision was taken.

Thirdly, though my comments are all about changes that I'd like to
see, that is, issues I have identified, it is worth pointing out that
as a simple way to serialise subtitle or caption text and associate it
with times and presentational styles, WebVTT mostly works, within a
particular scope.

WebVTT serialisation and parsing

The WebVTT syntax is similar to (but incompatible with) SRT but
otherwise distinct from all other syntaxes, and includes a subsection
that is effectively CSS syntax. I consider the serialisation and
parsing of a document format to be an architectural layer in its own
right, ideally with tests, tools and support for the format. In the
case of WebVTT the fact that it has a unique format means that the
benefits of referencing an independent serialisation and parsing layer
are absent. For internal business to business transactions this
creates some hurdles: it is costlier to develop a syntax checker for
example to validate that received files are well formed, or to quality
check the content; writing custom parser code becomes a security risk
since issues like buffer overflow are more commonly, though not
uniquely, found in less mature code. The tool support for e.g. JSON,
HTML or XML serialisation is much more mature and less likely to
suffer from these problems.

It is unclear what action could resolve this with WebVTT in its
current form, without taking seemingly extreme steps. For example if
WebVTT were a semantic model plus an API, and alternative
representations were defined, and at least one of those alternative
representations were a more commonly used one, that would help, though
at the expense of adding an initial step for every WebVTT import or
export, which is to work out which representation to use.

From this perspective, the syntax of WebVTT seems better suited to
direct writing and editing in text editors by humans than by software,
though obviously it is ultimately feasible to use either. For an
organisation like the BBC authoring and distributing subtitle
documents at scale it would be better to optimise for machine reading
and writing instead of human reading and writing, since we expect
subtitle authors and editors to use specialist software rather than
tweaking files directly.

Content signalling

WebVTT is able to carry data intended for different purposes –
captions, chapters, metadata etc. However the lack of signalling
within documents creates an asset management and validation problem.
If we have 3 WebVTT files associated with a single video asset, how do
we know which is which? How would we validate them? The same issue
arises with languages: it is impossible to signal the primary language
for a WebVTT file within the file; requiring it to be set by the HTML
srclang attribute is, frankly, completely useless within the scope of
the WebVTT specification. How can an HTML page be generated to specify
it correctly, and how can the value be validated for any given WebVTT
file unless the language identifier is present within it? What if the
WebVTT file is intended for use without being embedded in an HTML
document?

The three types of WebVTT file indicated in §4.6 appear to be
indistinguishable by inspecting the file.
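A minimal sketch of the asset management problem, assuming the only in-file signal is the "WEBVTT" signature on the first line. The two sample documents and the sniff_kind helper are hypothetical and exist only to illustrate the point; they are not taken from the specification.

```python
# Two hypothetical files attached to the same video asset.
CAPTIONS = """WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, and welcome.
"""

CHAPTERS = """WEBVTT

00:00:00.000 --> 00:05:00.000
Introduction
"""


def sniff_kind(document: str) -> str:
    """Guess the purpose of a file from its content alone (hypothetical helper)."""
    lines = document.splitlines()
    first_line = lines[0] if lines else ""
    if not first_line.startswith("WEBVTT"):
        return "not-webvtt"
    # Nothing in the header says captions, chapters or metadata,
    # and nothing states the language either.
    return "unknown"


print(sniff_kind(CAPTIONS))
print(sniff_kind(CHAPTERS))
```

Both calls give the same answer, so the kind and language have to be tracked outside the file, for example in the HTML track element or in an asset database.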

Timing

WebVTT timing does not specify the time base that applies, except by
reference to HTML5.1 media timelines. In general it is assumed to be
media time, i.e. the first frame of the video is time zero. That makes
sense in relation to a distributed piece of media such as would be
present in an HTML video track. However there are broadcast usages of
subtitle files that need to relate times to embedded timecode in
video. It may be that such practices are phased out over time; until
that occurs however, there is no defined data structure in WebVTT
(even as metadata) that allows the processor to understand how to
relate timestamps to the media with which the file is somehow
associated, or indeed how the timestamps were generated.

For example, if a WebVTT file were authored against embedded timecode
timestamps, but then that file were accidentally re-used against video
media stripped of such timestamps, the content would display at the
wrong time, but there would be no information available within the
WebVTT file to identify that such a scenario had occurred.
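The scenario can be sketched with some arithmetic. The 10:00:00 timecode start used here is purely an illustrative assumption, and both helper names are hypothetical; the point is that the offset has to be supplied from outside the file.

```python
def to_ms(hours: int, minutes: int, seconds: int, millis: int = 0) -> int:
    """Convert a clock value to milliseconds."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


# Assumption for illustration only: the embedded timecode starts at 10:00:00.
ASSUMED_TIMECODE_START_MS = to_ms(10, 0, 0)


def rebase_to_media_time(cue_ms: int, timecode_start_ms: int) -> int:
    """Shift a timecode-based cue time so that the first frame is time zero."""
    return cue_ms - timecode_start_ms


cue_start = to_ms(10, 0, 5)  # authored against embedded timecode
print(cue_start)  # 36005000: ten hours in if played against media time
print(rebase_to_media_time(cue_start, ASSUMED_TIMECODE_START_MS))  # 5000
```

Without a field recording the time base or the start offset, a processor cannot tell which of the two interpretations the author intended.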

The syntax of WebVTT timestamps requires that all timestamps contain a
full stop character followed by three digits. This is a likely source
of errors in two ways: Firstly, if a conformant processor sees a
timestamp with a different number of characters (even a greater
number) after the full stop, rather than returning a valid alternative
time, it returns an error. This is therefore not robust to authoring
errors. Secondly, a non-conformant processor that does not return an
error must make a decision about how to deal with the strange number
of digits. Since it is clearly defined as a "number of milliseconds"
most interpretations would/should treat a timestamp such as
"00:00:01.2" as having 2 milliseconds, but this is contrary to the
normal mathematical representation of a fractional value, which would
consider it to be 200 milliseconds. I would propose that the parser
should a) append '0's after the value4 string to pad it to 3
characters if it has fewer than 3 and b) truncate the value4 string to
3 characters if it has more than 3. (Rounding seems pointlessly
complex when the significance is +/- 0.5ms and the user agent probably
doesn't meet that level of precision anyway.)
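The pad-or-truncate proposal could be sketched as follows. This is a hypothetical lenient helper, not the specification's timestamp algorithm: the function name and regular expression are assumptions, and the fractional digits stand in for the value4 string.

```python
import re

# Optional hours, then mm:ss, a full stop and one or more fractional digits.
_TIMESTAMP = re.compile(r"(?:([0-9]{2,}):)?([0-5][0-9]):([0-5][0-9])\.([0-9]+)")


def parse_lenient_timestamp(text: str) -> int:
    """Return the time in milliseconds, tolerating a wrong number of digits."""
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a timestamp: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # a) pad with trailing zeros if short, b) truncate if long.
    millis = int(fraction.ljust(3, "0")[:3])
    total_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + millis


print(parse_lenient_timestamp("00:00:01.2"))     # 1200
print(parse_lenient_timestamp("00:00:01.2345"))  # 1234
```

With this approach "00:00:01.2" is read as 200 milliseconds past the second, matching the usual reading of a decimal fraction, and conformant three-digit timestamps are unaffected.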

Styling

The default font size is 5vh, however the default line spacing for
addressing line positions is only 5.33vh, representing a line spacing
of 106.6% - that would represent an unusually small line height and
may result in strange effects. I cannot see anywhere that overrides
the line position height based on the computed value of the
line-height property. (Comment repeated below under positioning, since
it's relevant there too.)
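The 106.6% figure follows directly from the two defaults quoted above. A short check, with constant and function names that are assumptions made for illustration:

```python
DEFAULT_FONT_SIZE_VH = 5.0   # default font size cited above
DEFAULT_LINE_STEP_VH = 5.33  # default line spacing cited above


def line_spacing_ratio(font_size_vh: float, line_step_vh: float) -> float:
    """Line spacing expressed as a multiple of the font size."""
    return line_step_vh / font_size_vh


ratio = line_spacing_ratio(DEFAULT_FONT_SIZE_VH, DEFAULT_LINE_STEP_VH)
print(f"{ratio:.1%}")  # 106.6%
```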

The default background colour's opacity of 0.8 may cause accessibility
problems for some users, especially for bright patterned video behind
the text. This can easily be fixed by setting it to 1.

The list of ::cue CSS properties in §7.2.1 is a useful minimal set,
especially for a non-CSS-based implementation; however, the requirement
to ignore any properties not in the list seems overly restrictive for
a CSS based implementation. What is the mechanism for adding further
styling options in the future, given that WebVTT files do not indicate
any specification version?

There is no similar list for ::cue-region which seems like a strange
inconsistency – I am unsure which properties are permitted on a
region. For example, is border permitted? What about padding?

Positioning

In some places sizes and positions are defined relative to the video
viewport; in others the video itself. This is likely to cause some
confusion or mis-alignment when the two are not the same (e.g. a 16:9
aspect ratio video is displayed in a 14:9 viewport) and creates an
authoring problem. For example the cue box size is relative to the
video but the cue box line is relative to the video viewport (both
defined within section 3.1).

Two positioning and sizing constructs are defined: the cue box and the
region. It is not clear in the data model section why both are needed
and how a cue box is supposed to interact with a region. Indeed it
appears that they are independent constructs given that only the cue's
position (and none of the other attributes) relates to or depends on
the region. The cue box size is relative to the video viewport, not
the region. This means that the concept of a region is not a "bounding
box" within which cue text is rendered, but is only an alternative
positioning reference. That semantic definition differs from other
specifications that define regions, including TTWG's other
specification covering this use case, TTML. (As Chair, I would prefer
that there be semantic alignment here between the working group's
specifications, and given the history and current state, that would
mean modifying the treatment of regions in WebVTT.)

Reading section 3 data model, it appears that the region width and
height are unused. This may be a bug, since §6.1 rule 14.3.3 says to
interpret the cue's computed position as a percentage of the region
width. It is in general hard to work out if the data model section
aligns with the rendering section. I'm not sure how to fix that, but
until it is fixed, document authors and implementers may be at
loggerheads, each apparently doing the right thing but finding
unexpected results for the other.

It is not clear how the use of lines as a positioning construct is
defined. Whereas section 4.4 "line cue setting" says that the height
of a line is based on the first line, in §6.1, step 12.1 bullet two,
the line height is initialised to a very specific value 5.33vh,
regardless of the height of the line, and I can't see anywhere that
sets another value in its place (I could not see any reference to the
line-height property in the rendering section, but please point me to
it if it is there).

The user-agent defined edge margins used in section 6 Rendering
effectively mean that the author of the file can not predict positions
or line breaks accurately since the size and position of text cannot
be known at authoring time. This prevents the author from being able
to position captions in a location that avoids overlapping parts of
the video that are important for the viewer to see in order to
understand the content. This makes it effectively impossible to meet
the BBC's editorial subtitle guidelines using WebVTT. Note that this
comment must logically only apply when a single track is visible since
WebVTT defines rules for multiple tracks being displayed
simultaneously, in which case positioning is presumably outside the
realm of the subtitle author, in the general case.

As an alternative to edge margins, why not allow something like
the padding property to be set on whatever viewport is used as the
positioning reference area? (Cf. the comment above about positioning
relative to video or to video viewport.) By the way, from a CSS box
model perspective, calling them "margins" is confusing – aren't they
"padding" values since they are inset from the video viewport area?

Metadata

The lack of delimiters on metadata is restrictive – it means that data
structures cannot be placed in metadata blocks without first checking
that they do not contain text that would be interpreted by a processor
as cue text. Then if they do contain such text, there is no defined
escaping mechanism to work around it. We capture metadata about the
production process of subtitle and caption files within those files,
for internal re-use – that appears to be impossible with WebVTT in a
general way now, though could be addressed by means of delimiters,
escaping mechanisms etc.
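A rough sketch of the pre-check this forces on a producer, assuming that cue text ends at an empty line or at a line containing "-->" (an assumption that should be confirmed against the parsing rules). The helper name and sample payloads are hypothetical.

```python
def survives_as_cue_text(payload: str) -> bool:
    """Return True if the payload avoids both patterns assumed to end cue text."""
    lines = payload.split("\n")
    has_empty_line = any(line == "" for line in lines)
    has_arrow = "-->" in payload
    return not (has_empty_line or has_arrow)


print(survives_as_cue_text('{"workflow": "live"}'))           # True
print(survives_as_cue_text('{"note": "shot 1 --> shot 2"}'))  # False
print(survives_as_cue_text("title: News\n\nauthor: example")) # False
```

The second and third payloads are ordinary data, yet with no delimiter or escaping mechanism the only options are to reject them or to re-encode them by private convention.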

Terminology

It is extremely difficult to read and understand the WebVTT
specification because of the terminology used, especially in relation
to positioning. Many concepts have overlapping use of words when they
relate to different things. For example section 4.4 "WebVTT cue
settings" includes the "WebVTT line cue setting" which actually
configures the "cue box" position.

A detail point about the text in that section: looking at that line
cue setting, it "configures the offset of the cue box from the video
viewport's edge in the direction opposite to the writing direction". I
think that means "orthogonal to the writing direction" but it could be
interpreted as the negative of the writing direction, i.e. if the
writing direction is top-to-bottom then does this mean left-to-right
or bottom-to-top? Then the WebVTT position cue setting "configures the
indent position of the cue box in the direction orthogonal to the
WebVTT line cue setting". Why doesn't it simply say "in the writing
direction"?

It would be easier for me to understand and follow if there were a
clearer box model, in which there is a region box, a cue box within
that region, and then a set of cue lines within that cue box. Then the
positioning units of, say, the cue box, can be in terms of lines,
relative to the region edges, and the overall rendering area (the
default region?) can be positioned relative to the video or the video
viewport in a consistent manner.

Algorithmic specifications

The algorithmic nature of the processing model is helpful for initial
implementations, but it is very difficult to understand and follow and
perhaps more importantly looks brittle in case of changes. If a change
is needed to the algorithm, it may be challenging for implementers to
work out where in their implementations the equivalent change is
needed, especially if the implementation has been optimised and does
not follow the specification's structure exactly.

Formatting

There seem to be some formatting oddities when I view the
specification on Firefox or Chrome – in section 3.1 there's an
algorithm for calculating the computed line, whose outline box might
be helpful but overlaps the bullet numbers on the left.

The diagram at the bottom of section 3.2 seems to truncate the word
"Video viewport".

Some text has a yellow highlight colour without explanation, e.g. in Example 19.

Conclusion

As mentioned at the top, as this specification is at present I could
not recommend to the BBC that it should adopt WebVTT across the whole
chain from authoring to presentation for both broadcast and online
use, for prepared and live subtitles. That would be my preferred scope
for any subtitle/closed caption format, though may go beyond the
intended requirements set for WebVTT (which by the way are unstated in
the specification). Restricting the scope, to consider solely the last
stage and limiting the devices, i.e. thinking about direct
distribution to browsers, the main presentational issue is that the
positioning semantics are hard to understand and seem to work against
the BBC's subtitle guidelines (developed with over 30 years of
experience with closed captions). This may be something that can be
resolved by improvements to the specification. There appears to be no
support at present for live contribution, though this could be
addressed by other specifications. The issues concerning content
identification, language etc could probably be resolved quite easily
to assist with the asset management challenges I have identified
above. The issue about serialisation and parsing probably goes beyond
any reasonable expectation to address at this stage, but could be a
continuing content management problem.

Kind regards,

Nigel

-- 
Nigel Megitt
Executive Product Manager, BBC Design & Engineering
Received on Thursday, 28 September 2017 15:01:21 UTC