- From: Glenn Adams <glenn@skynav.com>
- Date: Wed, 27 Sep 2017 16:24:40 -0600
- To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Cc: Nigel Megitt <nigel.megitt@bbc.co.uk>, Timed Text Working Group <public-tt@w3.org>
- Message-ID: <CACQ=j+f_vdBTWDddkZsbCw75V_FFZY3gfxB_gmGAgu+vfLYZZg@mail.gmail.com>
On Wed, Sep 27, 2017 at 2:05 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote:

> Hi Nigel,
>
> Thanks for the extensive feedback. I'll be forwarding it to the CG mailing list also and am grateful that you have submitted issues for discussion into GitHub. I'm sure we can find solutions for most of these issues.
>
> I will, however, have to raise concerns about one issue that you mentioned: the algorithmic form of the specification. Modern specifications at the W3C are all formulated in algorithmic form and I don't think that WebVTT should be changed to become an exception.

I think this claim is not true in general. Indeed, I believe the opposite holds. Nonetheless, it is unlikely that we can change this in VTT at this juncture.

> The goal of most of the specifications at the W3C is to get them implemented in browsers and to make that work, the specifications need an algorithmic approach as has been proven over the years. This allows for making a difference between what is standardised syntax and what provided actual formats a browser should be more lenient with and parse. The algorithmic sort of specification works best for this and has also been deemed to be much more tolerant to change than a mere file format specification. It may seem messy, but it creates more interoperable implementations, which is the goal of a standard.
>
> We shall address all issues in the bug tracker, but I wanted to make sure you understand why it's highly unlikely that the fundamental way in which the WebVTT specification has been formulated will change.
>
> I am curious about another statement of yours: your question about "what would it take for the BBC to be able to use WebVTT?" is what makes me curious. Have you connected with the business units across the BBC to find out if anyone is already using WebVTT and, if they are, what issues they are having? I'd be very surprised if it wasn't in use, e.g. on mobile devices for displaying captions in browsers. I can understand your reservations about using WebVTT across the whole chain from authoring to presentation for both broadcast and online use. The whole chain takes much more than what we have focused on so far and some of the issues related to authoring that you brought up have existed in the bug tracker for a long time - they have merely been delayed for a later version. I would, however, be surprised if that would imply that the BBC is not making use of WebVTT anywhere in their publication chain. Could you find out and share?
>
> Thanks for your extensive efforts on this.
>
> Kind Regards,
> Silvia.
>
> On Fri, Sep 22, 2017 at 7:46 AM, Nigel Megitt <nigel.megitt@bbc.co.uk> wrote:
> > Dear TTWG,
> >
> > In my BBC role with responsibility for engineering strategy for access services including subtitles I've reviewed WebVTT and have the following feedback.
> >
> > The first thing to note is that this review is based on the question "would/could BBC use WebVTT?" or if not, "what would it take for BBC to be able to use WebVTT?" and is therefore affected by our existing infrastructure, guidelines and approach and is somewhat specific to BBC (and possibly other organisations with similar requirements). It is not meant to suggest that the same response would apply for all other organisations and individuals considering WebVTT. Nor is it intended to start a "flame war" between standards or factions: note that it is not a comparative review, and it may be that other formats suffer similar issues or worse! I make this point explicitly because my overall view with WebVTT at present is that it has architectural and design features that make it costly or difficult for the BBC to use across the whole chain from authoring to presentation.
> >
> > Secondly, in reviewing, I am considering it both from an architectural perspective and in some cases looking at the detail – apologies in advance for switching between the two without warning! As a consequence, it seems inappropriate to raise GitHub issues on the repo for all the comments. Where it is appropriate I intend to do so if nobody else does it, but I'm going to write first and then raise the issues later. Nevertheless I think architectural comments are worth raising as part of the wide review, even if there is no easy resolution. Those comments are probably best thought of more as an explanation than a call for specific actions for which I might expect a 'won't fix' response. I would still like a disposition response on those comments though, even if that only is an explanation about why the decision was taken.
> >
> > Thirdly, though my comments are all about changes that I'd like to see, that is, issues I have identified, it is worth pointing out that as a simple way to serialise subtitle or caption text and associate it with times and presentational styles, WebVTT mostly works, within a particular scope.
> >
> > WebVTT serialisation and parsing
> >
> > The WebVTT syntax is similar to (but incompatible with) SRT but otherwise distinct from all other syntaxes, and includes a subsection that is effectively CSS syntax. I consider the serialisation and parsing of a document format to be an architectural layer in its own right, ideally with tests, tools and support for the format. In the case of WebVTT the fact that it has a unique format means that the benefits of referencing an independent serialisation and parsing layer are absent. For internal business to business transactions this creates some hurdles: it is costlier to develop a syntax checker for example to validate that received files are well formed, or to quality check the content; writing custom parser code becomes a security risk since issues like buffer overflow are more commonly, though not uniquely, found in less mature code. The tool support for e.g. JSON, HTML or XML serialisation is much more mature and less likely to suffer from these problems.
> >
> > It is unclear what action could resolve this with WebVTT in its current form, without taking seemingly extreme steps. For example if WebVTT were a semantic model plus an API, and alternative representations were defined, and at least one of those alternative representations were a more commonly used one, that would help, though at the expense of adding an initial step for every WebVTT import or export, which is to work out which representation to use.
> >
> > From this perspective, the syntax of WebVTT seems better suited to direct writing and editing in text editors by humans than by software, though obviously it is ultimately feasible to use either. For an organisation like the BBC authoring and distributing subtitle documents at scale it would be better to optimise for machine reading and writing instead of human reading and writing, since we expect subtitle authors and editors to use specialist software rather than tweaking files directly.
> >
> > Content signalling
> >
> > WebVTT is able to carry data intended for different purposes – captions, chapters, metadata etc. However the lack of signalling within documents creates an asset management and validation problem. If we have 3 WebVTT files associated with a single video asset, how do we know which is which? How would we validate them? The same issue arises with languages: it is impossible to signal the primary language for a WebVTT file within the file; requiring it to be set by the HTML srclang attribute is, frankly, completely useless within the scope of the WebVTT specification. How can an HTML page be generated to specify it correctly, and how can the value be validated for any given WebVTT file unless the language identifier is present within it? What if the WebVTT file is intended for use without being embedded in an HTML document?
> >
> > The three types of WebVTT file indicated in §4.6 appear to be indistinguishable by inspecting the file.
> >
> > Timing
> >
> > WebVTT timing does not specify the time base that applies, except by reference to HTML5.1 media timelines. In general it is assumed to be media time, i.e. the first frame of the video is time zero. That makes sense in relation to a distributed piece of media such as would be present in an HTML video track. However there are broadcast usages of subtitle files that need to relate times to embedded timecode in video. It may be that such practices are phased out over time; until that occurs however, there is no defined data structure in WebVTT (even as metadata) that allows the processor to understand how to relate timestamps to the media with which the file is somehow associated, or indeed how the timestamps were generated.
> >
> > For example, if a WebVTT file were authored against embedded timecode timestamps, but then that file were accidentally re-used against video media stripped of such timestamps, the content would display at the wrong time, but there would be no information available within the WebVTT file to identify that such a scenario had occurred.
> >
> > The syntax of WebVTT timestamps requires that all timestamps contain a full stop character followed by three digits. This is a likely source of errors in two ways. Firstly, if a conformant processor sees a timestamp with a different number of characters (even a greater number) after the full stop, rather than returning a valid alternative time, it returns an error. This is therefore not robust to authoring errors. Secondly, a non-conformant processor that does not return an error must make a decision about how to deal with the strange number of digits. Since it is clearly defined as a "number of milliseconds" most interpretations would/should treat a timestamp such as "00:00:01.2" as having 2 milliseconds, but this is contrary to the normal mathematical representation of a fractional value, which would consider it to be 200 milliseconds. I would propose that the parser should a) append '0's after the value4 string to pad it to 3 characters if it has fewer than 3 and b) truncate the value4 string to 3 characters if it has more than 3. (Rounding seems pointlessly complex when the significance is +/- 0.5ms and the user agent probably doesn't meet that level of precision anyway.)
> >
> > Styling
> >
> > The default font size is 5vh, however the default line spacing for addressing line positions is only 5.33vh, representing a line spacing of 106.6% - that would represent an unusually small line height and may result in strange effects. I cannot see anywhere that overrides the line position height based on the computed value of the line-height property. (Comment repeated below under positioning, since it's relevant there too.)
> >
> > The default background colour's opacity of 0.8 may cause accessibility problems for some users, especially for bright patterned video behind the text. This can easily be fixed by setting it to 1.
> >
> > The list of ::cue CSS properties in §7.2.1 is a useful minimal set, especially for a non-CSS-based implementation; however, the requirement to ignore any properties not in the list seems overly restrictive for a CSS based implementation. What is the mechanism for adding further styling options in the future, given that WebVTT files do not indicate any specification version?
> >
> > There is no similar list for ::cue-region which seems like a strange inconsistency – I am unsure which properties are permitted on a region. For example, is border permitted? What about padding?
> >
> > Positioning
> >
> > In some places sizes and positions are defined relative to the video viewport; in others the video itself. This is likely to cause some confusion or mis-alignment when the two are not the same (e.g. a 16:9 aspect ratio video is displayed in a 14:9 viewport) and creates an authoring problem. For example the cue box size is relative to the video but the cue box line is relative to the video viewport (both defined within section 3.1).
> >
> > Two positioning and sizing constructs are defined: the cue box and the region. It is not clear in the data model section why both are needed and how a cue box is supposed to interact with a region. Indeed it appears that they are independent constructs given that only the cue's position (and none of the other attributes) relates to or depends on the region. The cue box size is relative to the video viewport, not the region. This means that the concept of a region is not a "bounding box" within which cue text is rendered, but is only an alternative positioning reference. That semantic definition differs from other specifications that define regions, including TTWG's other specification covering this use case, TTML. (As Chair, I would prefer that there be semantic alignment here between the working group's specifications, and given the history and current state, that would mean modifying the treatment of regions in WebVTT.)
> >
> > Reading section 3 data model, it appears that the region width and height are unused. This may be a bug, since §6.1 rule 14.3.3 says to interpret the cue's computed position as a percentage of the region width. It is in general hard to work out if the data model section aligns with the rendering section. I'm not sure how to fix that, but until it is fixed, document authors and implementers may be at loggerheads, each apparently doing the right thing but finding unexpected results for the other.
> >
> > It is not clear how the use of lines as a positioning construct is defined. Whereas section 4.4 "line cue setting" says that the height of a line is based on the first line, in §6.1, step 12.1 bullet two, the line height is initialised to a very specific value 5.33vh, regardless of the height of the line, and I can't see anywhere that sets another value in its place (I could not see any reference to the line-height property in the rendering section, but please point me to it if it is there).
> >
> > The user-agent defined edge margins used in section 6 Rendering effectively mean that the author of the file cannot predict positions or line breaks accurately since the size and position of text cannot be known at authoring time. This prevents the author from being able to position captions in a location that avoids overlapping parts of the video that are important for the viewer to see in order to understand the content. This makes it effectively impossible to meet the BBC's editorial subtitle guidelines using WebVTT. Note that this comment must logically only apply when a single track is visible since WebVTT defines rules for multiple tracks being displayed simultaneously, in which case positioning is presumably outside the realm of the subtitle author, in the general case.
> >
> > As an alternative to edge margins, why not allow something like the padding property to be set on whatever viewport is used as the positioning reference area? (cf comment above about positioning relative to video or to video viewport) By the way, from a CSS box model perspective, calling them "margins" is confusing – aren't they "padding" values since they are inset from the video viewport area?
> >
> > Metadata
> >
> > The lack of delimiters on metadata is restrictive – it means that data structures cannot be placed in metadata blocks without first checking that they do not contain text that would be interpreted by a processor as cue text. Then if they do contain such text, there is no defined escaping mechanism to work around it. We capture metadata about the production process of subtitle and caption files within those files, for internal re-use – that appears to be impossible with WebVTT in a general way now, though could be addressed by means of delimiters, escaping mechanisms etc.
> >
> > Terminology
> >
> > It is extremely difficult to read and understand the WebVTT specification because of the terminology used, especially in relation to positioning. Many concepts have overlapping use of words when they relate to different things. For example section 4.4 "WebVTT cue settings" includes the "WebVTT line cue setting" which actually configures the "cue box" position.
> >
> > A detail point about the text in that section: looking at that line cue setting, it "configures the offset of the cue box from the video viewport's edge in the direction opposite to the writing direction". I think that means "orthogonal to the writing direction" but it could be interpreted as the negative of the writing direction, i.e. if the writing direction is top-to-bottom then does this mean left-to-right or bottom-to-top? Then the WebVTT position cue setting "configures the indent position of the cue box in the direction orthogonal to the WebVTT line cue setting". Why doesn't it simply say "in the writing direction"?
> >
> > It would be easier for me to understand and follow if there were a clearer box model, in which there is a region box, a cue box within that region, and then a set of cue lines within that cue box. Then the positioning units of, say, the cue box, can be in terms of lines, relative to the region edges, and the overall rendering area (the default region?) can be positioned relative to the video or the video viewport in a consistent manner.
> >
> > Algorithmic specifications
> >
> > The algorithmic nature of the processing model is helpful for initial implementations, but it is very difficult to understand and follow and perhaps more importantly looks brittle in case of changes. If a change is needed to the algorithm, it may be challenging for implementers to work out where in their implementations the equivalent change is needed, especially if the implementation has been optimised and does not follow the specification's structure exactly.
> >
> > Formatting
> >
> > There seem to be some formatting oddities when I view the specification on Firefox or Chrome – in section 3.1 there's an algorithm for calculating the computed line, whose outline box might be helpful but overlaps the bullet numbers on the left.
> >
> > The diagram at the bottom of section 3.2 seems to truncate the word "Video viewport".
> >
> > Some text has a yellow highlight colour without explanation, e.g. in Example 19.
> >
> > Conclusion
> >
> > As mentioned at the top, as this specification is at present I could not recommend to the BBC that it should adopt WebVTT across the whole chain from authoring to presentation for both broadcast and online use, for prepared and live subtitles. That would be my preferred scope for any subtitle/closed caption format, though may go beyond the intended requirements set for WebVTT (which by the way are unstated in the specification). Restricting the scope, to consider solely the last stage and limiting the devices, i.e. thinking about direct distribution to browsers, the main presentational issue is that the positioning semantics are hard to understand and seem to work against the BBC's subtitle guidelines (developed with over 30 years of experience with closed captions). This may be something that can be resolved by improvements to the specification. There appears to be no support at present for live contribution, though this could be addressed by other specifications. The issues concerning content identification, language etc could probably be resolved quite easily to assist with the asset management challenges I have identified above. The issue about serialisation and parsing probably goes beyond any reasonable expectation to address at this stage, but could be a continuing content management problem.
> >
> > Kind regards,
> >
> > Nigel
> >
> > --
> > Nigel Megitt
> > Executive Product Manager, BBC Design & Engineering
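
For illustration only, the fractional-seconds handling proposed under "Timing" in the quoted review (pad the fraction to three digits if it is shorter, truncate it if it is longer) might be sketched in Python as follows. The function name parse_lenient_timestamp and the regular expression are assumptions made for this example; they are not taken from the WebVTT specification, which, per the review, treats such timestamps as errors.

```python
import re

# Hypothetical pattern: optional hours, then mm:ss, a full stop, and one or
# more fraction digits (the specification itself requires exactly three).
_TIMESTAMP = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d+)$")


def parse_lenient_timestamp(text: str) -> int:
    """Return the timestamp in milliseconds, tolerating any fraction length."""
    match = _TIMESTAMP.match(text.strip())
    if match is None:
        raise ValueError(f"not a timestamp: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # (a) pad with trailing zeros up to three digits,
    # (b) truncate, without rounding, anything beyond three digits.
    millis = int(fraction.ljust(3, "0")[:3])
    total_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + millis
```

Under this sketch "00:00:01.2" is read as 1200 ms (200 ms past the second) rather than as 2 ms, which matches the ordinary reading of a decimal fraction described in the review.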
Received on Wednesday, 27 September 2017 22:25:31 UTC