Re: WebVTT wide review from Glenn Adams on 2017-09-27 (public-tt@w3.org from September 2017)

From: Glenn Adams <glenn@skynav.com>
Date: Wed, 27 Sep 2017 16:24:40 -0600
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: Nigel Megitt <nigel.megitt@bbc.co.uk>, Timed Text Working Group <public-tt@w3.org>
Message-ID: <CACQ=j+f_vdBTWDddkZsbCw75V_FFZY3gfxB_gmGAgu+vfLYZZg@mail.gmail.com>
On Wed, Sep 27, 2017 at 2:05 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com>
wrote:

> Hi Nigel,
>
> Thanks for the extensive feedback. I'll be forwarding it to the CG
> mailing list also and am grateful that you have submitted issues for
> discussion into GitHub. I'm sure we can find solutions for most of
> these issues.
>
>
> I will, however, have to raise concerns about one issue that you
> mentioned: the algorithmic form of the specification. Modern
> specifications at the W3C are all formulated in algorithmic form and I
> don't think that WebVTT should be changed to become an exception.


I think this claim is not true in general. Indeed, I believe the opposite
holds. Nonetheless, it is unlikely that we can change this in VTT at this
juncture.


> The
> goal of most of the specifications at the W3C is to get them
> implemented in browsers and to make that work, the specifications need
> an algorithmic approach as has been proven over the years. This allows
> for distinguishing between what is standardised syntax and what
> content, as actually provided, a browser should be more lenient with and
> parse. The algorithmic sort of specification works best for this and
> has also been deemed to be much more tolerant to change than a mere
> file format specification. It may seem messy, but it creates more
> interoperable implementations, which is the goal of a standard.
>
> We shall address all issues in the bug tracker, but I wanted to make
> sure you understand why it's highly unlikely that the fundamental way
> in which the WebVTT specification has been formulated will change.
>
>
> I am curious about another statement of yours, namely your question
> "what would it take for the BBC to be able to use WebVTT?". Have you
> connected with the business units across
> the BBC to find out if anyone is already using WebVTT and if they are
> what issues they are having? I'd be very surprised if it wasn't in
> use, e.g. on mobile devices for displaying captions in browsers. I can
> understand your reservations about using WebVTT across the whole chain
> from authoring to presentation for both broadcast and online use. The
> whole chain takes much more than what we have focused on so far and
> some of the issues related to authoring that you brought up have
> existed in the bug tracker for a long time - they have merely been
> delayed for a later version. I would, however, be surprised if that
> would imply that the BBC is not making use of WebVTT anywhere in their
> publication chain. Could you find out and share?
>
> Thanks for your extensive efforts on this.
>
> Kind Regards,
> Silvia.
>
>
>
> On Fri, Sep 22, 2017 at 7:46 AM, Nigel Megitt <nigel.megitt@bbc.co.uk>
> wrote:
> > Dear TTWG,
> >
> > In my BBC role with responsibility for engineering strategy for access
> > services including subtitles I've reviewed WebVTT and have the following
> > feedback.
> >
> > The first thing to note is that this review is based on the question
> > "would/could BBC use WebVTT?" or if not, "what would it take for BBC to
> > be able to use WebVTT?" and is therefore affected by our existing
> > infrastructure, guidelines and approach and is somewhat specific to BBC
> > (and possibly other organisations with similar requirements). It is not
> > meant to suggest that the same response would apply for all other
> > organisations and individuals considering WebVTT. Nor is it intended to
> > start a "flame war" between standards or factions: note that it is not a
> > comparative review, and it may be that other formats suffer similar
> > issues or worse! I make this point explicitly because my overall view
> > with WebVTT at present is that it has architectural and design features
> > that make it costly or difficult for the BBC to use across the whole
> > chain from authoring to presentation.
> >
> > Secondly, in reviewing, I am considering it both from an architectural
> > perspective and in some cases looking at the detail – apologies in
> > advance for switching between the two without warning! As a consequence,
> > it seems inappropriate to raise GitHub issues on the repo for all the
> > comments. Where it is appropriate I intend to do so if nobody else does
> > it, but I'm going to write first and then raise the issues later.
> > Nevertheless I think architectural comments are worth raising as part of
> > the wide review, even if there is no easy resolution. Those comments are
> > probably best thought of more as an explanation than a call for specific
> > actions for which I might expect a 'won't fix' response. I would still
> > like a disposition response on those comments though, even if that only
> > is an explanation about why the decision was taken.
> >
> > Thirdly, though my comments are all about changes that I'd like to see,
> > that is, issues I have identified, it is worth pointing out that as a
> > simple way to serialise subtitle or caption text and associate it with
> > times and presentational styles, WebVTT mostly works, within a
> > particular scope.
> >
> > WebVTT serialisation and parsing
> >
> > The WebVTT syntax is similar to (but incompatible with) SRT but otherwise
> > distinct from all other syntaxes, and includes a subsection that is
> > effectively CSS syntax. I consider the serialisation and parsing of a
> > document format to be an architectural layer in its own right, ideally
> > with tests, tools and support for the format. In the case of WebVTT the
> > fact that it has a unique format means that the benefits of referencing
> > an independent serialisation and parsing layer are absent. For internal
> > business to business transactions this creates some hurdles: it is
> > costlier to develop a syntax checker for example to validate that
> > received files are well formed, or to quality check the content; writing
> > custom parser code becomes a security risk since issues like buffer
> > overflow are more commonly, though not uniquely, found in less mature
> > code. The tool support for e.g. JSON, HTML or XML serialisation is much
> > more mature and less likely to suffer from these problems.
> >
> > It is unclear what action could resolve this with WebVTT in its current
> > form, without taking seemingly extreme steps. For example if WebVTT were
> > a semantic model plus an API, and alternative representations were
> > defined, and at least one of those alternative representations were a
> > more commonly used one, that would help, though at the expense of adding
> > an initial step for every WebVTT import or export, which is to work out
> > which representation to use.
> >
> > From this perspective, the syntax of WebVTT seems better suited to direct
> > writing and editing in text editors by humans than by software, though
> > obviously it is ultimately feasible to use either. For an organisation
> > like the BBC authoring and distributing subtitle documents at scale it
> > would be better to optimise for machine reading and writing instead of
> > human reading and writing, since we expect subtitle authors and editors
> > to use specialist software rather than tweaking files directly.
> >
> > Content signalling
> >
> > WebVTT is able to carry data intended for different purposes – captions,
> > chapters, metadata etc. However the lack of signalling within documents
> > creates an asset management and validation problem. If we have 3 WebVTT
> > files associated with a single video asset, how do we know which is
> > which? How would we validate them? The same issue arises with languages:
> > it is impossible to signal the primary language for a WebVTT file within
> > the file; requiring it to be set by the HTML srclang attribute is,
> > frankly, completely useless within the scope of the WebVTT
> > specification. How can an HTML page be generated to specify it
> > correctly, and how can the value be validated for any given WebVTT file
> > unless the language identifier is present within it? What if the WebVTT
> > file is intended for use without being embedded in an HTML document?
> >
> > The three types of WebVTT file indicated in §4.6 appear to be
> > indistinguishable by inspecting the file.
> >
> > Timing
> >
> > WebVTT timing does not specify the time base that applies, except by
> > reference to HTML5.1 media timelines. In general it is assumed to be
> > media time, i.e. the first frame of the video is time zero. That makes
> > sense in relation to a distributed piece of media such as would be
> > present in an HTML video track. However there are broadcast usages of
> > subtitle files that need to relate times to embedded timecode in video.
> > It may be that such practices are phased out over time; until that
> > occurs however, there is no defined data structure in WebVTT (even as
> > metadata) that allows the processor to understand how to relate
> > timestamps to the media with which the file is somehow associated, or
> > indeed how the timestamps were generated.
> >
> > For example, if a WebVTT file were authored against embedded timecode
> > timestamps, but then that file were accidentally re-used against video
> > media stripped of such timestamps, the content would display at the
> > wrong time, but there would be no information available within the
> > WebVTT file to identify that such a scenario had occurred.
> >
> > The syntax of WebVTT timestamps requires that all timestamps contain a
> > full stop character followed by three digits. This is a likely source of
> > errors in two ways: Firstly, if a conformant processor sees a timestamp
> > with a different number of characters (even a greater number) after the
> > full stop, rather than returning a valid alternative time, it returns an
> > error. This is therefore not robust to authoring errors. Secondly, a
> > non-conformant processor that does not return an error must make a
> > decision about how to deal with the strange number of digits. Since it
> > is clearly defined as a "number of milliseconds" most interpretations
> > would/should treat a timestamp such as "00:00:01.2" as having 2
> > milliseconds, but this is contrary to the normal mathematical
> > representation of a fractional value, which would consider it to be 200
> > milliseconds. I would propose that the parser should a) append '0's
> > after the value4 string to pad it to 3 characters if it has fewer than 3
> > and b) truncate the value4 string to 3 characters if it has more than 3.
> > (rounding seems pointlessly complex when the significance is +/- 0.5ms
> > and the user agent probably doesn't meet that level of precision anyway)
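
The pad-or-truncate rule proposed in the quoted paragraph can be sketched in
Python as below. This is only an illustration under stated assumptions: the
function name lenient_timestamp_ms and the regular expression are
hypothetical and are not taken from the WebVTT specification, which as
described above treats any fraction that is not exactly three digits as an
error.

```python
import re

# Hypothetical lenient pattern: optional hours, then mm:ss, then a full stop
# and ANY number of fraction digits (the specification requires exactly 3).
_LENIENT_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d+)")


def lenient_timestamp_ms(text: str) -> int:
    """Return the timestamp in milliseconds using the proposed rule."""
    match = _LENIENT_TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a timestamp: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # b) truncate to 3 characters, then a) pad with '0' up to 3 characters.
    millis = int(fraction[:3].ljust(3, "0"))
    whole_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return whole_seconds * 1000 + millis
```

With this rule "00:00:01.2" is read as 1200 ms (a fraction of 200 ms, not
2 ms), and "00:00:01.2349" is truncated to 1234 ms without rounding.
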
> >
> > Styling
> >
> > The default font size is 5vh, however the default line spacing for
> > addressing line positions is only 5.33vh, representing a line spacing of
> > 106.6% - that would represent an unusually small line height and may
> > result in strange effects. I cannot see anywhere that overrides the line
> > position height based on the computed value of the line-height property.
> > (comment repeated below under positioning, since it's relevant there
> > too)
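
The 106.6% figure in the quoted paragraph is the line step divided by the
font size. A small Python check of that arithmetic follows; the constant and
function names are made up for illustration, and the two values are the ones
quoted in the review rather than values verified against the specification.

```python
# Values as quoted in the review above, both in vh units.
DEFAULT_FONT_SIZE_VH = 5.0
DEFAULT_LINE_STEP_VH = 5.33


def line_spacing_percent(line_step_vh: float, font_size_vh: float) -> float:
    """Return the line step expressed as a percentage of the font size."""
    return line_step_vh / font_size_vh * 100.0


ratio = line_spacing_percent(DEFAULT_LINE_STEP_VH, DEFAULT_FONT_SIZE_VH)
print(f"{ratio:.1f}%")  # prints 106.6%
```
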
> >
> > The default background colour's opacity of 0.8 may cause accessibility
> > problems for some users, especially for bright patterned video behind the
> > text. This can easily be fixed by setting it to 1.
> >
> > The list of ::cue CSS properties in §7.2.1 is a useful minimal set,
> > especially for a non-CSS-based implementation however the requirement to
> > ignore any properties not in the list seems overly restrictive for a CSS
> > based implementation. What is the mechanism for adding further styling
> > options in the future, given that WebVTT files do not indicate any
> > specification version?
> >
> > There is no similar list for ::cue-region which seems like a strange
> > inconsistency – I am unsure which properties are permitted on a region.
> > For example, is border permitted? What about padding?
> >
> > Positioning
> >
> > In some places sizes and positions are defined relative to the video
> > viewport; in others the video itself. This is likely to cause some
> > confusion or mis-alignment when the two are not the same (e.g. a 16:9
> > aspect ratio video is displayed in a 14:9 viewport) and creates an
> > authoring problem. For example the cue box size is relative to the video
> > but the cue box line is relative to the video viewport (both defined
> > within section 3.1).
> >
> > Two positioning and sizing constructs are defined: the cue box and the
> > region. It is not clear in the data model section why both are needed and
> > how a cue box is supposed to interact with a region. Indeed it appears
> > that they are independent constructs given that only the cue's position
> > (and none of the other attributes) relates to or depends on the region.
> > The cue box size is relative to the video viewport, not the region. This
> > means that the concept of a region is not a "bounding box" within which
> > cue text is rendered, but is only an alternative positioning reference.
> > That semantic definition differs from other specifications that define
> > regions, including TTWG's other specification covering this use case,
> > TTML. (as Chair, I would prefer that there be semantic alignment here
> > between the working group's specifications, and given the history and
> > current state, that would mean modifying the treatment of regions in
> > WebVTT.)
> >
> > Reading section 3 data model, it appears that the region width and height
> > are unused. This may be a bug, since §6.1 rule 14.3.3 says to interpret
> > the cue's computed position as a percentage of the region width. It is
> > in general hard to work out if the data model section aligns with the
> > rendering section. I'm not sure how to fix that, but until it is fixed,
> > document authors and implementers may be at loggerheads, each apparently
> > doing the right thing but finding unexpected results for the other.
> >
> > It is not clear how the use of lines as a positioning construct is
> > defined. Whereas section 4.4 "line cue setting" says that the height of
> > a line is based on the first line, in §6.1, step 12.1 bullet two, the
> > line height is initialised to a very specific value 5.33vh, regardless
> > of the height of the line, and I can't see anywhere that sets another
> > value in its place (I could not see any reference to the line-height
> > property in the rendering section, but please point me to it if it is
> > there).
> >
> > The user-agent defined edge margins used in section 6 Rendering
> > effectively mean that the author of the file can not predict positions
> > or line breaks accurately since the size and position of text cannot be
> > known at authoring time. This prevents the author from being able to
> > position captions in a location that avoids overlapping parts of the
> > video that are important for the viewer to see in order to understand
> > the content. This makes it effectively impossible to meet the BBC's
> > editorial subtitle guidelines using WebVTT. Note that this comment must
> > logically only apply when a single track is visible since WebVTT defines
> > rules for multiple tracks being displayed simultaneously, in which case
> > positioning is presumably outside the realm of the subtitle author, in
> > the general case.
> >
> > As an alternative to edge margins, why not allow something like the
> > padding property to be set on whatever viewport is used as the
> > positioning reference area? (cf comment above about positioning relative
> > to video or to video viewport) By the way, from a CSS box model
> > perspective, calling them "margins" is confusing – aren't they "padding"
> > values since they are inset from the video viewport area?
> >
> > Metadata
> >
> > The lack of delimiters on metadata is restrictive – it means that data
> > structures cannot be placed in metadata blocks without first checking
> > that they do not contain text that would be interpreted by a processor
> > as cue text. Then if they do contain such text, there is no defined
> > escaping mechanism to work around it. We capture metadata about the
> > production process of subtitle and caption files within those files, for
> > internal re-use – that appears to be impossible with WebVTT in a general
> > way now, though could be addressed by means of delimiters, escaping
> > mechanisms etc.
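
The pre-check described in the quoted paragraph might look like the Python
sketch below. It assumes that the two things a processor would misread in a
metadata payload are the "-->" arrow sequence and a blank line (which would
end the block); that assumption, and the function name payload_problems, are
illustrative and not quoted from the specification.

```python
import json


def payload_problems(payload: str) -> list[str]:
    """List reasons the payload could not be embedded verbatim in a cue."""
    problems = []
    # Assumption: the arrow sequence would be mistaken for a cue timing line.
    if "-->" in payload:
        problems.append('contains "-->"')
    # Assumption: a blank line (including one left by a trailing newline)
    # would terminate the block early.
    lines = payload.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if any(line == "" for line in lines):
        problems.append("contains a blank line")
    return problems


# A compact JSON record passes; one whose data contains an arrow does not,
# and there is no defined escape to repair it.
print(payload_problems(json.dumps({"editor": "subtitler-7"})))
print(payload_problems(json.dumps({"note": "shot A --> shot B"})))
```

Because no escaping mechanism is defined, a caller that gets a non-empty
list back can only reject or re-encode the data (for example as base64),
which is the restriction the review describes.
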
> >
> > Terminology
> >
> > It is extremely difficult to read and understand the WebVTT specification
> > because of the terminology used, especially in relation to positioning.
> > Many concepts have overlapping use of words when they relate to
> > different things. For example section 4.4 "WebVTT cue settings" includes
> > the "WebVTT line cue setting" which actually configures the "cue box"
> > position.
> >
> > A detail point about the text in that section: looking at that line cue
> > setting, it "configures the offset of the cue box from the video
> > viewport's edge in the direction opposite to the writing direction". I
> > think that means "orthogonal to the writing direction" but it could be
> > interpreted as the negative of the writing direction, i.e. if the
> > writing direction is top-to-bottom then does this mean left-to-right or
> > bottom-to-top? Then the WebVTT position cue setting "configures the
> > indent position of the cue box in the direction orthogonal to the WebVTT
> > line cue setting". Why doesn't it simply say "in the writing direction"?
> >
> > It would be easier for me to understand and follow if there were a
> > clearer box model, in which there is a region box, a cue box within that
> > region, and then a set of cue lines within that cue box. Then the
> > positioning units of, say, the cue box, can be in terms of lines,
> > relative to the region edges, and the overall rendering area (the
> > default region?) can be positioned relative to the video or the video
> > viewport in a consistent manner.
> >
> > Algorithmic specifications
> >
> > The algorithmic nature of the processing model is helpful for initial
> > implementations, but it is very difficult to understand and follow and
> > perhaps more importantly looks brittle in case of changes. If a change is
> > needed to the algorithm, it may be challenging for implementers to work
> > out where in their implementations the equivalent change is needed,
> > especially if the implementation has been optimised and does not follow
> > the specification's structure exactly.
> >
> > Formatting
> >
> > There seem to be some formatting oddities when I view the specification
> > on Firefox or Chrome – in section 3.1 there's an algorithm for
> > calculating the computed line, whose outline box might be helpful but
> > overlaps the bullet numbers on the left.
> >
> > The diagram at the bottom of section 3.2 seems to truncate the word
> > "Video viewport".
> >
> > Some text has a yellow highlight colour without explanation, e.g. in
> > Example 19.
> >
> > Conclusion
> >
> > As mentioned at the top, as this specification is at present I could not
> > recommend to the BBC that it should adopt WebVTT across the whole chain
> > from authoring to presentation for both broadcast and online use, for
> > prepared and live subtitles. That would be my preferred scope for any
> > subtitle/closed caption format, though may go beyond the intended
> > requirements set for WebVTT (which by the way are unstated in the
> > specification). Restricting the scope, to consider solely the last stage
> > and limiting the devices, i.e. thinking about direct distribution to
> > browsers, the main presentational issue is that the positioning
> > semantics are hard to understand and seem to work against the BBC's
> > subtitle guidelines (developed with over 30 years of experience with
> > closed captions). This may be something that can be resolved by
> > improvements to the specification. There appears to be no support at
> > present for live contribution, though this could be addressed by other
> > specifications. The issues concerning content identification, language
> > etc could probably be resolved quite easily to assist with the asset
> > management challenges I have identified above. The issue about
> > serialisation and parsing probably goes beyond any reasonable
> > expectation to address at this stage, but could be a continuing content
> > management problem.
> >
> > Kind regards,
> >
> > Nigel
> >
> > --
> > Nigel Megitt
> > Executive Product Manager, BBC Design & Engineering
> >
>
>
Received on Wednesday, 27 September 2017 22:25:31 UTC