Re: WebVTT wide review

Thank you, Silvia.

It would be good to get rapid resolution of these issues, so comments that lead towards closure would be warmly received.

> On Sep 28, 2017, at 8:00 , Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote:
> 
> Hi fellow WebVTT devs,
> 
> As you will know, WebVTT is in the process of moving towards
> standardisation in the TTWG. Since releasing the updated Working
> Draft, a number of W3C members have sent their feedback.
> 
> I'd like to involve this group in the discussions around the
> feedback, so am sending the emails on.
> 
> Note that most of the feedback is going straight into GitHub, so keep
> an eye on https://github.com/w3c/webvtt/issues/ .
> 
> Below is my first forward of some really in-depth feedback from Nigel at
> the BBC. He has also registered the issues in GitHub, so if you'd like
> to provide specific feedback, please do it there.
> 
> Cheers,
> Silvia.
> 
> 
> ---------- Forwarded message ----------
> From: Nigel Megitt <nigel.megitt@bbc.co.uk>
> Date: Fri, Sep 22, 2017 at 7:46 AM
> Subject: WebVTT wide review
> To: Timed Text Working Group <public-tt@w3.org>
> 
> 
> Dear TTWG,
> 
> In my BBC role with responsibility for engineering strategy for access
> services including subtitles I've reviewed WebVTT and have the
> following feedback.
> 
> The first thing to note is that this review is based on the question
> "would/could BBC use WebVTT?" or if not, "what would it take for BBC
> to be able to use WebVTT?" and is therefore affected by our existing
> infrastructure, guidelines and approach and is somewhat specific to
> BBC (and possibly other organisations with similar requirements). It
> is not meant to suggest that the same response would apply for all
> other organisations and individuals considering WebVTT. Nor is it
> intended to start a "flame war" between standards or factions: note
> that it is not a comparative review, and it may be that other formats
> suffer similar issues or worse! I make this point explicitly because
> my overall view with WebVTT at present is that it has architectural
> and design features that make it costly or difficult for the BBC to
> use across the whole chain from authoring to presentation.
> 
> Secondly, in reviewing, I am considering it both from an architectural
> perspective and in some cases looking at the detail – apologies in
> advance for switching between the two without warning! As a
> consequence, it seems inappropriate to raise GitHub issues on the repo
> for all the comments. Where it is appropriate I intend to do so if
> nobody else does it, but I'm going to write first and then raise the
> issues later. Nevertheless I think architectural comments are worth
> raising as part of the wide review, even if there is no easy
> resolution. Those comments are probably best thought of more as an
> explanation than a call for specific actions for which I might expect
> a 'won't fix' response. I would still like a disposition response on
> those comments though, even if that only is an explanation about why
> the decision was taken.
> 
> Thirdly, though my comments are all about changes that I'd like to
> see, that is, issues I have identified, it is worth pointing out that
> as a simple way to serialise subtitle or caption text and associate it
> with times and presentational styles, WebVTT mostly works, within a
> particular scope.
> 
> WebVTT serialisation and parsing
> 
> The WebVTT syntax is similar to (but incompatible with) SRT but
> otherwise distinct from all other syntaxes, and includes a subsection
> that is effectively CSS syntax. I consider the serialisation and
> parsing of a document format to be an architectural layer in its own
> right, ideally with tests, tools and support for the format. In the
> case of WebVTT the fact that it has a unique format means that the
> benefits of referencing an independent serialisation and parsing layer
> are absent. For internal business to business transactions this
> creates some hurdles: it is costlier to develop a syntax checker for
> example to validate that received files are well formed, or to quality
> check the content; writing custom parser code becomes a security risk
> since issues like buffer overflow are more commonly, though not
> uniquely, found in less mature code. The tool support for e.g. JSON,
> HTML or XML serialisation is much more mature and less likely to
> suffer from these problems.
> 
> It is unclear what action could resolve this with WebVTT in its
> current form, without taking seemingly extreme steps. For example if
> WebVTT were a semantic model plus an API, and alternative
> representations were defined, and at least one of those alternative
> representations were a more commonly used one, that would help, though
> at the expense of adding an initial step for every WebVTT import or
> export, which is to work out which representation to use.
> 
> From this perspective, the syntax of WebVTT seems better suited to
> direct writing and editing in text editors by humans than by software,
> though obviously it is ultimately feasible to use either. For an
> organisation like the BBC authoring and distributing subtitle
> documents at scale it would be better to optimise for machine reading
> and writing instead of human reading and writing, since we expect
> subtitle authors and editors to use specialist software rather than
> tweaking files directly.
> 
> Content signalling
> 
> WebVTT is able to carry data intended for different purposes –
> captions, chapters, metadata etc. However the lack of signalling
> within documents creates an asset management and validation problem.
> If we have 3 WebVTT files associated with a single video asset, how do
> we know which is which? How would we validate them? The same issue
> arises with languages: it is impossible to signal the primary language
> for a WebVTT file within the file; requiring it to be set by the HTML
> srclang attribute is, frankly, completely useless within the scope of
> the WebVTT specification. How can an HTML page be generated to specify
> it correctly, and how can the value be validated for any given WebVTT
> file unless the language identifier is present within it? What if the
> WebVTT file is intended for use without being embedded in an HTML
> document?
> 
> The three types of WebVTT file indicated in §4.6 appear to be
> indistinguishable by inspecting the file.
> 
> Timing
> 
> WebVTT timing does not specify the time base that applies, except by
> reference to HTML5.1 media timelines. In general it is assumed to be
> media time, i.e. the first frame of the video is time zero. That makes
> sense in relation to a distributed piece of media such as would be
> present in an HTML video track. However there are broadcast usages of
> subtitle files that need to relate times to embedded timecode in
> video. It may be that such practices are phased out over time; until
> that occurs however, there is no defined data structure in WebVTT
> (even as metadata) that allows the processor to understand how to
> relate timestamps to the media with which the file is somehow
> associated, or indeed how the timestamps were generated.
> 
> For example, if a WebVTT file were authored against embedded timecode
> timestamps, but then that file were accidentally re-used against video
> media stripped of such timestamps, the content would display at the
> wrong time, but there would be no information available within the
> WebVTT file to identify that such a scenario had occurred.
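
To make the scenario above concrete, here is a small Python sketch of the arithmetic. The 10:00:00 start-of-programme timecode and the function name are purely hypothetical illustrations; nothing in WebVTT carries such an offset, which is the point being made.

```python
def timecode_to_media_seconds(cue_seconds: float, media_start_timecode: float) -> float:
    """Convert a cue time authored against embedded timecode to media time.

    media_start_timecode is the timecode (in seconds) of the first frame.
    WebVTT has no field for this value, so it must come from elsewhere.
    """
    return cue_seconds - media_start_timecode


# Hypothetical example: the programme's first frame carries timecode 10:00:00.
MEDIA_START = 10 * 3600
# A cue authored at timecode 10:00:05.
cue_start = 10 * 3600 + 5

# Intended presentation time, 5 seconds into the media.
intended = timecode_to_media_seconds(cue_start, MEDIA_START)
# What a player would use if the offset is silently lost: 36005 seconds.
mistaken = cue_start
```

Without the offset recorded anywhere in the file, a processor cannot tell the intended value from the mistaken one.
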
> 
> The syntax of WebVTT timestamps requires that all timestamps contain a
> full stop character followed by three digits. This is a likely source
> of errors in two ways: Firstly, if a conformant processor sees a
> timestamp with a different number of characters (even a greater
> number) after the full stop, rather than returning a valid alternative
> time, it returns an error. This is therefore not robust to authoring
> errors. Secondly, a non-conformant processor that does not return an
> error must make a decision about how to deal with the strange number
> of digits. Since it is clearly defined as a "number of milliseconds"
> most interpretations would/should treat a timestamp such as
> "00:00:01.2" as having 2 milliseconds, but this is contrary to the
> normal mathematical representation of a fractional value, which would
> consider it to be 200 milliseconds. I would propose that the parser
> should a) append '0's after the value4 string to pad it to 3
> characters if it has fewer than 3 and b) truncate the value4 string to
> 3 characters if it has more than 3. (rounding seems pointlessly
> complex when the significance is +/- 0.5ms and the user agent probably
> doesn't meet that level of precision anyway)
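
A minimal Python sketch of the padding and truncation proposed in a) and b) above. The function name is hypothetical, and this covers only the digits after the full stop, not the specification's full timestamp collection algorithm.

```python
def fraction_to_milliseconds(digits: str) -> int:
    """Interpret the digits after the full stop as a decimal fraction of a second."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected one or more ASCII digits, got {digits!r}")
    # a) pad with trailing zeros up to 3 characters if there are fewer than 3;
    # b) keep only the first 3 characters if there are more (no rounding).
    return int(digits.ljust(3, "0")[:3])
```

Under this reading the fractional part of "00:00:01.2" is 200 ms rather than 2 ms, and any digits beyond the third are dropped.
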
> 
> Styling
> 
> The default font size is 5vh, however the default line spacing for
> addressing line positions is only 5.33vh, representing a line spacing
> of 106.6% - that would represent an unusually small line height and
> may result in strange effects. I cannot see anywhere that overrides
> the line position height based on the computed value of the
> line-height property. (comment repeated below under positioning, since
> it's relevant there too)
> 
> The default background colour's opacity of 0.8 may cause accessibility
> problems for some users, especially for bright patterned video behind
> the text. This can easily be fixed by setting it to 1.
> 
> The list of ::cue CSS properties in §7.2.1 is a useful minimal set,
> especially for a non-CSS-based implementation; however, the requirement
> to ignore any properties not in the list seems overly restrictive for
> a CSS based implementation. What is the mechanism for adding further
> styling options in the future, given that WebVTT files do not indicate
> any specification version?
> 
> There is no similar list for ::cue-region which seems like a strange
> inconsistency – I am unsure which properties are permitted on a
> region. For example, is border permitted? What about padding?
> 
> Positioning
> 
> In some places sizes and positions are defined relative to the video
> viewport; in others the video itself. This is likely to cause some
> confusion or mis-alignment when the two are not the same (e.g. a 16:9
> aspect ratio video is displayed in a 14:9 viewport) and creates an
> authoring problem. For example the cue box size is relative to the
> video but the cue box line is relative to the video viewport (both
> defined within section 3.1).
> 
> Two positioning and sizing constructs are defined: the cue box and the
> region. It is not clear in the data model section why both are needed
> and how a cue box is supposed to interact with a region. Indeed it
> appears that they are independent constructs given that only the cue's
> position (and none of the other attributes) relates to or depends on
> the region. The cue box size is relative to the video viewport, not
> the region. This means that the concept of a region is not a "bounding
> box" within which cue text is rendered, but is only an alternative
> positioning reference. That semantic definition differs from other
> specifications that define regions, including TTWG's other
> specification covering this use case, TTML. (as Chair, I would prefer
> that there be semantic alignment here between the working group's
> specifications, and given the history and current state, that would
> mean modifying the treatment of regions in WebVTT.)
> 
> Reading section 3 data model, it appears that the region width and
> height are unused. This may be a bug, since §6.1 rule 14.3.3 says to
> interpret the cue's computed position as a percentage of the region
> width. It is in general hard to work out if the data model section
> aligns with the rendering section. I'm not sure how to fix that, but
> until it is fixed, document authors and implementers may be at
> loggerheads, each apparently doing the right thing but finding
> unexpected results for the other.
> 
> It is not clear how the use of lines as a positioning construct is
> defined. Whereas section 4.4 "line cue setting" says that the height
> of a line is based on the first line, in §6.1, step 12.1 bullet two,
> the line height is initialised to a very specific value 5.33vh,
> regardless of the height of the line, and I can't see anywhere that
> sets another value in its place (I could not see any reference to the
> line-height property in the rendering section, but please point me to
> it if it is there).
> 
> The user-agent defined edge margins used in section 6 Rendering
> effectively mean that the author of the file can not predict positions
> or line breaks accurately since the size and position of text cannot
> be known at authoring time. This prevents the author from being able
> to position captions in a location that avoids overlapping parts of
> the video that are important for the viewer to see in order to
> understand the content. This makes it effectively impossible to meet
> the BBC's editorial subtitle guidelines using WebVTT. Note that this
> comment must logically only apply when a single track is visible since
> WebVTT defines rules for multiple tracks being displayed
> simultaneously, in which case positioning is presumably outside the
> realm of the subtitle author, in the general case.
> 
> As an alternative to edge margins, why not allow something like
> the padding property to be set on whatever viewport is used as the
> positioning reference area? (cf comment above about positioning
> relative to video or to video viewport) By the way, from a CSS box
> model perspective, calling them "margins" is confusing – aren't they
> "padding" values since they are inset from the video viewport area?
> 
> Metadata
> 
> The lack of delimiters on metadata is restrictive – it means that data
> structures cannot be placed in metadata blocks without first checking
> that they do not contain text that would be interpreted by a processor
> as cue text. Then if they do contain such text, there is no defined
> escaping mechanism to work around it. We capture metadata about the
> production process of subtitle and caption files within those files,
> for internal re-use – that appears to be impossible with WebVTT in a
> general way now, though could be addressed by means of delimiters,
> escaping mechanisms etc.
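
As an illustration of the pre-check described above, a rough Python sketch follows. It assumes, as I read the specification, that a cue payload must not contain the substring "-->" and that a blank line ends the cue; the function name is hypothetical and the payload is assumed to be passed without a trailing line terminator.

```python
import re
from typing import List


def metadata_payload_problems(payload: str) -> List[str]:
    """List the reasons a string cannot be carried verbatim as a metadata cue payload."""
    problems = []
    if "-->" in payload:
        problems.append('contains "-->"')
    # Split on any of the line terminators CRLF, CR or LF.
    lines = re.split(r"\r\n|\r|\n", payload)
    if "" in lines:
        problems.append("contains a blank line")
    return problems
```

A compact JSON object on one line passes, whereas pretty-printed data containing an empty line, or any value containing "-->", is flagged and there is no defined way to escape it.
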
> 
> Terminology
> 
> It is extremely difficult to read and understand the WebVTT
> specification because of the terminology used, especially in relation
> to positioning. Many concepts have overlapping use of words when they
> relate to different things. For example section 4.4 "WebVTT cue
> settings" includes the "WebVTT line cue setting" which actually
> configures the "cue box" position.
> 
> A detail point about the text in that section: looking at that line
> cue setting, it "configures the offset of the cue box from the video
> viewport's edge in the direction opposite to the writing direction". I
> think that means "orthogonal to the writing direction" but it could be
> interpreted as the negative of the writing direction, i.e. if the
> writing direction is top-to-bottom then does this mean left-to-right
> or bottom-to-top? Then the WebVTT position cue setting "configures the
> indent position of the cue box in the direction orthogonal to the
> WebVTT line cue setting". Why doesn't it simply say "in the writing
> direction"?
> 
> It would be easier for me to understand and follow if there were a
> clearer box model, in which there is a region box, a cue box within
> that region, and then a set of cue lines within that cue box. Then the
> positioning units of, say, the cue box, can be in terms of lines,
> relative to the region edges, and the overall rendering area (the
> default region?) can be positioned relative to the video or the video
> viewport in a consistent manner.
> 
> Algorithmic specifications
> 
> The algorithmic nature of the processing model is helpful for initial
> implementations, but it is very difficult to understand and follow and
> perhaps more importantly looks brittle in case of changes. If a change
> is needed to the algorithm, it may be challenging for implementers to
> work out where in their implementations the equivalent change is
> needed, especially if the implementation has been optimised and does
> not follow the specification's structure exactly.
> 
> Formatting
> 
> There seem to be some formatting oddities when I view the
> specification on Firefox or Chrome – in section 3.1 there's an
> algorithm for calculating the computed line, whose outline box might
> be helpful but overlaps the bullet numbers on the left.
> 
> The diagram at the bottom of section 3.2 seems to truncate the word
> "Video viewport".
> 
> Some text has a yellow highlight colour without explanation, e.g. in Example 19.
> 
> Conclusion
> 
> As mentioned at the top, as this specification is at present I could
> not recommend to the BBC that it should adopt WebVTT across the whole
> chain from authoring to presentation for both broadcast and online
> use, for prepared and live subtitles. That would be my preferred scope
> for any subtitle/closed caption format, though may go beyond the
> intended requirements set for WebVTT (which by the way are unstated in
> the specification). Restricting the scope, to consider solely the last
> stage and limiting the devices, i.e. thinking about direct
> distribution to browsers, the main presentational issue is that the
> positioning semantics are hard to understand and seem to work against
> the BBC's subtitle guidelines (developed with over 30 years of
> experience with closed captions). This may be something that can be
> resolved by improvements to the specification. There appears to be no
> support at present for live contribution, though this could be
> addressed by other specifications. The issues concerning content
> identification, language etc could probably be resolved quite easily
> to assist with the asset management challenges I have identified
> above. The issue about serialisation and parsing probably goes beyond
> any reasonable expectation to address at this stage, but could be a
> continuing content management problem.
> 
> Kind regards,
> 
> Nigel
> 
> -- 
> Nigel Megitt
> Executive Product Manager, BBC Design & Engineering
> 

Dave Singer

singer@mac.com

Received on Thursday, 28 September 2017 20:01:42 UTC