Re: WebVTT wide review

Hi Nigel,

Thanks for the extensive feedback. I'll also forward it to the CG
mailing list, and I'm grateful that you have filed issues for
discussion on GitHub. I'm sure we can find solutions for most of
these issues.


I will, however, have to raise concerns about one issue that you
mentioned: the algorithmic form of the specification. Modern
specifications at the W3C are all formulated in algorithmic form and I
don't think that WebVTT should become an exception. The goal of most
W3C specifications is to get them implemented in browsers, and to make
that work the specifications need an algorithmic approach, as has been
proven over the years. This makes it possible to distinguish between
what is standardised syntax and what real-world input a browser should
nevertheless be lenient with and parse. The algorithmic style of
specification works best for this and has also been found to be much
more tolerant to change than a mere file format specification. It may
seem messy, but it creates more interoperable implementations, which
is the goal of a standard.

We shall address all issues in the bug tracker, but I wanted to make
sure you understand why it's highly unlikely that the fundamental way
in which the WebVTT specification has been formulated will change.


I am curious about another statement of yours, namely your question
"what would it take for the BBC to be able to use WebVTT?". Have you
connected with the business units across the BBC to find out whether
anyone is already using WebVTT and, if they are, what issues they are
having? I'd be very surprised if it weren't in use, e.g. on mobile
devices for displaying captions in browsers. I can understand your
reservations about using WebVTT across the whole chain from authoring
to presentation for both broadcast and online use. The whole chain
takes much more than what we have focused on so far, and some of the
authoring-related issues that you brought up have existed in the bug
tracker for a long time - they have merely been deferred to a later
version. I would, however, be surprised if that implied that the BBC
is not making use of WebVTT anywhere in its publication chain. Could
you find out and share?

Thanks for your extensive efforts on this.

Kind Regards,
Silvia.



On Fri, Sep 22, 2017 at 7:46 AM, Nigel Megitt <nigel.megitt@bbc.co.uk> wrote:
> Dear TTWG,
>
> In my BBC role with responsibility for engineering strategy for access
> services including subtitles I've reviewed WebVTT and have the following
> feedback.
>
> The first thing to note is that this review is based on the question
> "would/could BBC use WebVTT?" or if not, "what would it take for BBC to be
> able to use WebVTT?" and is therefore affected by our existing
> infrastructure, guidelines and approach and is somewhat specific to BBC (and
> possibly other organisations with similar requirements). It is not meant to
> suggest that the same response would apply for all other organisations and
> individuals considering WebVTT. Nor is it intended to start a "flame war"
> between standards or factions: note that it is not a comparative review, and
> it may be that other formats suffer similar issues or worse! I make this
> point explicitly because my overall view with WebVTT at present is that it
> has architectural and design features that make it costly or difficult for
> the BBC to use across the whole chain from authoring to presentation.
>
> Secondly, in reviewing, I am considering it both from an architectural
> perspective and in some cases looking at the detail – apologies in advance
> for switching between the two without warning! As a consequence, it seems
> inappropriate to raise GitHub issues on the repo for all the comments. Where
> it is appropriate I intend to do so if nobody else does it, but I'm going to
> write first and then raise the issues later. Nevertheless I think
> architectural comments are worth raising as part of the wide review, even if
> there is no easy resolution. Those comments are probably best thought of
> more as an explanation than a call for specific actions, for which I might
> expect a 'won't fix' response. I would still like a disposition response on
> those comments though, even if that is only an explanation of why the
> decision was taken.
>
> Thirdly, though my comments are all about changes that I'd like to see, that
> is, issues I have identified, it is worth pointing out that as a simple way
> to serialise subtitle or caption text and associate it with times and
> presentational styles, WebVTT mostly works, within a particular scope.
>
> WebVTT serialisation and parsing
>
> The WebVTT syntax is similar to (but incompatible with) SRT but otherwise
> distinct from all other syntaxes, and includes a subsection that is
> effectively CSS syntax. I consider the serialisation and parsing of a
> document format to be an architectural layer in its own right, ideally with
> tests, tools and support for the format. In the case of WebVTT the fact that
> it has a unique format means that the benefits of referencing an independent
> serialisation and parsing layer are absent. For internal business to
> business transactions this creates some hurdles: it is costlier to develop a
> syntax checker for example to validate that received files are well formed,
> or to quality check the content; writing custom parser code becomes a
> security risk since issues like buffer overflow are more commonly, though
> not uniquely, found in less mature code. The tool support for e.g. JSON,
> HTML or XML serialisation is much more mature and less likely to suffer from
> these problems.
>
> It is unclear what action could resolve this with WebVTT in its current
> form, without taking seemingly extreme steps. For example if WebVTT were a
> semantic model plus an API, and alternative representations were defined,
> and at least one of those alternative representations were a more commonly
> used one, that would help, though at the expense of adding an initial step
> for every WebVTT import or export, which is to work out which representation
> to use.
>
> From this perspective, the syntax of WebVTT seems better suited to direct
> writing and editing in text editors by humans than by software, though
> obviously it is ultimately feasible to use either. For an organisation like
> the BBC authoring and distributing subtitle documents at scale it would be
> better to optimise for machine reading and writing instead of human reading
> and writing, since we expect subtitle authors and editors to use specialist
> software rather than tweaking files directly.
>
> Content signalling
>
> WebVTT is able to carry data intended for different purposes – captions,
> chapters, metadata etc. However the lack of signalling within documents
> creates an asset management and validation problem. If we have 3 WebVTT
> files associated with a single video asset, how do we know which is which?
> How would we validate them? The same issue arises with languages: it is
> impossible to signal the primary language for a WebVTT file within the file;
> requiring it to be set by the HTML srclang attribute is, frankly, completely
> useless within the scope of the WebVTT specification. How can an HTML page
> be generated to specify it correctly, and how can the value be validated for
> any given WebVTT file unless the language identifier is present within it?
> What if the WebVTT file is intended for use without being embedded in an
> HTML document?
>
> The three types of WebVTT file indicated in §4.6 appear to be
> indistinguishable by inspecting the file.
>
> Timing
>
> WebVTT timing does not specify the time base that applies, except by
> reference to HTML5.1 media timelines. In general it is assumed to be media
> time, i.e. the first frame of the video is time zero. That makes sense in
> relation to a distributed piece of media such as would be present in an HTML
> video track. However there are broadcast usages of subtitle files that need
> to relate times to embedded timecode in video. It may be that such practices
> are phased out over time; until that occurs however, there is no defined
> data structure in WebVTT (even as metadata) that allows the processor to
> understand how to relate timestamps to the media with which the file is
> somehow associated, or indeed how the timestamps were generated.
>
> For example, if a WebVTT file were authored against embedded timecode
> timestamps, but then that file were accidentally re-used against video media
> stripped of such timestamps, the content would display at the wrong time,
> but there would be no information available within the WebVTT file to
> identify that such a scenario had occurred.
>
> The syntax of WebVTT timestamps requires that all timestamps contain a full
> stop character followed by three digits. This is a likely source of errors
> in two ways: Firstly, if a conformant processor sees a timestamp with a
> different number of characters (even a greater number) after the full stop,
> rather than returning a valid alternative time, it returns an error. This is
> therefore not robust to authoring errors. Secondly, a non-conformant
> processor that does not return an error must make a decision about how to
> deal with the strange number of digits. Since it is clearly defined as a
> "number of milliseconds" most interpretations would/should treat a timestamp
> such as "00:00:01.2" as having 2 milliseconds, but this is contrary to the
> normal mathematical representation of a fractional value, which would
> consider it to be 200 milliseconds. I would propose that the parser should
> a) append '0's after the value4 string to pad it to 3 characters if it has
> fewer than 3 and b) truncate the value4 string to 3 characters if it has
> more than 3. (rounding seems pointlessly complex when the significance is
> +/- 0.5ms and the user agent probably doesn't meet that level of precision
> anyway)
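
The padding and truncation rule proposed above can be illustrated with a
minimal sketch. It is not the parser defined in the WebVTT specification,
which returns an error unless the fractional part has exactly three digits;
the function names and the lenient pattern below are hypothetical and exist
only for illustration.

```python
import re

# Hypothetical lenient pattern: hours are optional and the fractional part
# may have any number of digits (the specification requires exactly three).
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d+)")


def normalise_fraction(digits: str) -> int:
    """Truncate to 3 digits, pad with trailing zeros, return milliseconds."""
    return int(digits[:3].ljust(3, "0"))


def parse_timestamp_lenient(text: str) -> int:
    """Return the timestamp in milliseconds, or raise ValueError."""
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a timestamp: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    total_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + normalise_fraction(fraction)
```

Under this rule "00:00:01.2" is read as 1200 ms rather than 1002 ms, and
"00:00:01.2345" is truncated to 1234 ms instead of being rejected.
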
>
> Styling
>
> The default font size is 5vh, however the default line spacing for
> addressing line positions is only 5.33vh, representing a line spacing of
> 106.6% - that would represent an unusually small line height and may result
> in strange effects. I cannot see anywhere that overrides the line position
> height based on the computed value of the line-height property. (comment
> repeated below under positioning, since it's relevant there too)
>
> The default background colour's opacity of 0.8 may cause accessibility
> problems for some users, especially for bright patterned video behind the
> text. This can easily be fixed by setting it to 1.
>
> The list of ::cue CSS properties in §7.2.1 is a useful minimal set,
> especially for a non-CSS-based implementation; however, the requirement to
> ignore any properties not in the list seems overly restrictive for a CSS
> based implementation. What is the mechanism for adding further styling
> options in the future, given that WebVTT files do not indicate any
> specification version?
>
> There is no similar list for ::cue-region which seems like a strange
> inconsistency – I am unsure which properties are permitted on a region. For
> example, is border permitted? What about padding?
>
> Positioning
>
> In some places sizes and positions are defined relative to the video
> viewport; in others the video itself. This is likely to cause some confusion
> or mis-alignment when the two are not the same (e.g. a 16:9 aspect ratio
> video is displayed in a 14:9 viewport) and creates an authoring problem. For
> example the cue box size is relative to the video but the cue box line is
> relative to the video viewport (both defined within section 3.1).
>
> Two positioning and sizing constructs are defined: the cue box and the
> region. It is not clear in the data model section why both are needed and
> how a cue box is supposed to interact with a region. Indeed it appears that
> they are independent constructs given that only the cue's position (and none of
> the other attributes) relates to or depends on the region. The cue box size
> is relative to the video viewport, not the region. This means that the
> concept of a region is not a "bounding box" within which cue text is
> rendered, but is only an alternative positioning reference. That semantic
> definition differs from other specifications that define regions, including
> TTWG's other specification covering this use case, TTML. (as Chair, I would
> prefer that there be semantic alignment here between the working group's
> specifications, and given the history and current state, that would mean
> modifying the treatment of regions in WebVTT.)
>
> Reading section 3 data model, it appears that the region width and height
> are unused. This may be a bug, since §6.1 rule 14.3.3 says to interpret the
> cue's computed position as a percentage of the region width. It is in
> general hard to work out if the data model section aligns with the rendering
> section. I'm not sure how to fix that, but until it is fixed, document
> authors and implementers may be at loggerheads, each apparently doing the
> right thing but finding unexpected results for the other.
>
> It is not clear how the use of lines as a positioning construct is defined.
> Whereas section 4.4 "line cue setting" says that the height of a line is
> based on the first line, in §6.1, step 12.1 bullet two, the line height is
> initialised to a very specific value 5.33vh, regardless of the height of the
> line, and I can't see anywhere that sets another value in its place (I could
> not see any reference to the line-height property in the rendering section,
> but please point me to it if it is there).
>
> The user-agent defined edge margins used in section 6 Rendering effectively
> mean that the author of the file can not predict positions or line breaks
> accurately since the size and position of text cannot be known at authoring
> time. This prevents the author from being able to position captions in a
> location that avoids overlapping parts of the video that are important for
> the viewer to see in order to understand the content. This makes it
> effectively impossible to meet the BBC's editorial subtitle guidelines using
> WebVTT. Note that this comment must logically only apply when a single track
> is visible since WebVTT defines rules for multiple tracks being displayed
> simultaneously, in which case positioning is presumably outside the realm of
> the subtitle author, in the general case.
>
> As an alternative to edge margins, why not allow something like the
> padding property to be set on whatever viewport is used as the positioning
> reference area? (cf comment above about positioning relative to video or to
> video viewport) By the way, from a CSS box model perspective, calling them
> "margins" is confusing – aren't they "padding" values since they are inset
> from the video viewport area?
>
> Metadata
>
> The lack of delimiters on metadata is restrictive – it means that data
> structures cannot be placed in metadata blocks without first checking that
> they do not contain text that would be interpreted by a processor as cue
> text. Then if they do contain such text, there is no defined escaping
> mechanism to work around it. We capture metadata about the production
> process of subtitle and caption files within those files, for internal
> re-use – that appears to be impossible with WebVTT in a general way now,
> though could be addressed by means of delimiters, escaping mechanisms etc.
>
> Terminology
>
> It is extremely difficult to read and understand the WebVTT specification
> because of the terminology used, especially in relation to positioning. Many
> concepts have overlapping use of words when they relate to different things.
> For example section 4.4 "WebVTT cue settings" includes the "WebVTT line cue
> setting" which actually configures the "cue box" position.
>
> A detail point about the text in that section: looking at that line cue
> setting, it "configures the offset of the cue box from the video viewport's
> edge in the direction opposite to the writing direction". I think that means
> "orthogonal to the writing direction" but it could be interpreted as the
> negative of the writing direction, i.e. if the writing direction is
> top-to-bottom then does this mean left-to-right or bottom-to-top? Then the
> WebVTT position cue setting "configures the indent position of the cue box
> in the direction orthogonal to the WebVTT line cue setting". Why doesn't it
> simply say "in the writing direction"?
>
> It would be easier for me to understand and follow if there were a clearer
> box model, in which there is a region box, a cue box within that region, and
> then a set of cue lines within that cue box. Then the positioning units of,
> say, the cue box, can be in terms of lines, relative to the region edges,
> and the overall rendering area (the default region?) can be positioned
> relative to the video or the video viewport in a consistent manner.
>
> Algorithmic specifications
>
> The algorithmic nature of the processing model is helpful for initial
> implementations, but it is very difficult to understand and follow and
> perhaps more importantly looks brittle in case of changes. If a change is
> needed to the algorithm, it may be challenging for implementers to work out
> where in their implementations the equivalent change is needed, especially
> if the implementation has been optimised and does not follow the
> specification's structure exactly.
>
> Formatting
>
> There seem to be some formatting oddities when I view the specification on
> Firefox or Chrome – in section 3.1 there's an algorithm for calculating the
> computed line, whose outline box might be helpful but overlaps the bullet
> numbers on the left.
>
> The diagram at the bottom of section 3.2 seems to truncate the words "Video
> viewport".
>
> Some text has a yellow highlight colour without explanation, e.g. in Example
> 19.
>
> Conclusion
>
> As mentioned at the top, as this specification is at present I could not
> recommend to the BBC that it should adopt WebVTT across the whole chain from
> authoring to presentation for both broadcast and online use, for prepared
> and live subtitles. That would be my preferred scope for any subtitle/closed
> caption format, though it may go beyond the intended requirements set for
> WebVTT (which by the way are unstated in the specification). Restricting the
> scope, to consider solely the last stage and limiting the devices, i.e.
> thinking about direct distribution to browsers, the main presentational
> issue is that the positioning semantics are hard to understand and seem to
> work against the BBC's subtitle guidelines (developed with over 30 years of
> experience with closed captions). This may be something that can be resolved
> by improvements to the specification. There appears to be no support at
> present for live contribution, though this could be addressed by other
> specifications. The issues concerning content identification, language etc
> could probably be resolved quite easily to assist with the asset management
> challenges I have identified above. The issue about serialisation and
> parsing probably goes beyond any reasonable expectation to address at this
> stage, but could be a continuing content management problem.
>
> Kind regards,
>
> Nigel
>
> --
> Nigel Megitt
> Executive Product Manager, BBC Design & Engineering
>

Received on Wednesday, 27 September 2017 20:06:44 UTC