RE: A new proposal for how to deal with text track cues

Hi Silvia,

With respect to the specific points in your email:
RE:  ...the browser has to do something when a line of text has to be wrapped because the video's width is too small to render the text.
Why should the browser 'do something'? The text should remain relative in size to the video; if that has readability implications, then a) one would question the validity of captioning in the first place, and b) the captions should be **specifically purposed** to that reduced screen size. The concept of a one-size-fits-all caption file is invalid. A reduction in the amount of text might be appropriate, or a faster repetition rate of shorter captions might be an alternative strategy. An algorithm cannot even get close to an acceptable presentation automagically without considerably more metadata than is available at this point in the chain.

RE: ... Also, you might want to talk with the people at YouTube that have to deal with a lot of garbage captions that they are getting as input...
I would not regard most YouTube captioning as 'setting a high bar' in caption / subtitling quality (only half joking ;-).

RE: ...When defining a markup language, but not defining the means of rendering, you allow rendering devices the freedom to interpret the markup differently...
TTML (and by extension all of the derived formats) DOES precisely define the intended rendering result, but NOT the mechanism used to achieve that result.

RE: That's how all standards are written...
Most standards do not have a **public** incremental creation... publication is 'staged'. There are (closed) phases of discussion and editing, and then publication and review... not all phases progressing in parallel. A constantly evolving public document does not IMHO encourage adoption by implementers.

RE: You might want to check back with the beginnings of TTML to an email about "Iterating toward a solution":...
And we certainly did iterate towards a solution... but there was no published document that we were calling a standard.

RE: ...The fractional percentage is simply the outcome of converting the CEA608 columns to exact percentages...
You don't see the contradiction in this statement? Fractional percentages require the specification of a rounding algorithm; otherwise you can get 'pixel bouncing' when different rows are addressed.
e.g. the second line of a two-line caption may appear at a different pixel position from a single-line caption positioned on the bottom row... even a very slight registration error between successive captions is noticeable and disruptive to readability.
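
To make the registration concern concrete, here is a minimal sketch in Python (my own illustration, not taken from either specification; the 365-pixel player height and the rounding choices are purely hypothetical) of how rounding recurring fractional percentages independently per cue can put the same bottom row at two different pixel positions:

    # Hypothetical illustration: CEA-608 rows mapped to recurring fractional
    # percentages, then rounded to pixels independently for each cue.
    ROWS = 15                  # CEA-608 caption rows
    VIDEO_HEIGHT_PX = 365      # arbitrary player height (not divisible by 15)

    def row_to_percent(row):
        # 0-based row index -> top offset as a (recurring) percentage
        return row * 100.0 / ROWS

    def percent_to_pixel(pct):
        # naive per-cue rounding with no shared rounding rule
        return round(pct / 100.0 * VIDEO_HEIGHT_PX)

    # Single-line caption placed directly on the bottom row (row 14):
    single = percent_to_pixel(row_to_percent(14))

    # Second line of a two-line caption: derived from row 13 plus one
    # separately rounded line height, as a renderer might do internally.
    line_height = percent_to_pixel(100.0 / ROWS)
    second_of_pair = percent_to_pixel(row_to_percent(13)) + line_height

    print(single, second_of_pair)   # 341 vs 340 -- a one-pixel 'bounce'

A specified, shared rounding rule (or cell-based positioning of the kind TTML provides) removes that ambiguity.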


On a more positive note, I'm hoping that we are now approaching some closure on this thread and on the WebVTT threads discussing the interoperability questions... I'm really pleased to see how quickly those are being resolved.
I hope also that you can appreciate the 'angle' from which I make my comments; they are not intended to be critical of either the abilities or the motivations of the WebVTT community.

Rather, they are driven by the frustrations of over a decade of working on captioning and subtitling within both traditional broadcast video distribution and, more recently, web distribution.
Captioning / subtitling has been plagued by a history of a) second-class status and b) a misunderstanding of the difference between an output distribution form and the authoring, archive and mezzanine forms.
(I suspect this second problem arises as a result of the first.)

Captions and subtitles **must** be regarded in the same light as video or audio material... they should go through the same (often siloed) workflow steps: [Author, Edit / Purpose, Distribute].
The early implementations of caption technology for traditional broadcast encouraged a merger of these steps, such that the same file technology has been used at all stages (particularly in the USA)... although in Europe and Asia, the increased complexity of workflows supporting multiple languages eventually resulted in the use of multiple formats for caption / subtitle storage files (e.g. STL and several proprietary formats).

Recently, the proliferation of new distribution mechanisms for video content, starting with HD and Digital Cinema and exploding with web distribution, has greatly increased the need to consider properly how captions and subtitles should be handled so as to maximise the efficiency of production and distribution. Most captioning and subtitling workflows are still human mediated, and human-mediated workflows have resulted in the absence, **inside the caption / subtitle file or stream**, of the metadata that is necessary for automatic, computer-mediated conversion.

For me, the ideal state of captioning / subtitling is a world where content is authored in a metadata-rich environment (including context, prosody, verbatim and summarised text, author identification, proxy media characteristics [what was used to caption from] and rights management). This authored format might then be archived as-is, or stripped of some of its metadata to form a mezzanine format. The mezzanine form would then be similarly transformed into specific output forms of captions or subtitles (e.g. Teletext, CEA608, DVB, Digital Cinema, Blu-ray/DVD, TTML, WebVTT). Thus there is a progression from an abstract form of caption to a target-specific form of caption.
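
Purely as an illustration of that progression (the class, field and function names below are my own invention, not any published schema or API), the idea might be sketched in Python as:

    from dataclasses import dataclass, field

    @dataclass
    class AuthoredCue:
        # metadata-rich authoring / archive form
        start: float                                   # seconds
        end: float
        text: str
        speaker: str = ""                              # author identification
        prosody: dict = field(default_factory=dict)    # context / delivery metadata
        rights: str = ""                               # rights management

    def to_mezzanine(cue):
        # strip the metadata not needed downstream; keep what conversion requires
        return {"start": cue.start, "end": cue.end,
                "text": cue.text, "speaker": cue.speaker}

    def to_webvtt(cues):
        # one target-specific output form among many (Teletext, CEA608, DVB, ...)
        def ts(t):
            h, rem = divmod(t, 3600)
            m, s = divmod(rem, 60)
            return "%02d:%02d:%06.3f" % (h, m, s)
        body = "\n\n".join("%s --> %s\n%s" % (ts(c["start"]), ts(c["end"]), c["text"])
                           for c in cues)
        return "WEBVTT\n\n" + body

The particular code is unimportant; the point is that each target form is a conversion from the richer form, not the place where the content is first authored.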

One should not expect that reverse transforms would produce an ideal output... too much is lost when you convert from the intent of the captions (who said what, when, where and why) to a target presentation where convention and format define how that intent is conveyed. What is lost is the metadata. It has been argued by proponents of target output formats that this metadata can be incorporated into those formats (typically using 'comments'), but this misses an important criterion for efficient repurposing: a **standardised** and public mechanism (or ontology) for conveying the metadata... since metadata is clearly only useful if it can be extracted and if it consistently represents the same concepts.

So, in closing, please understand that comments about 'authoring' target distribution formats are 'trigger' statements for me. In my world it is far preferable to speak of conversion to a distribution format, rather than of authoring directly in it ;-)
In truth, even those tools that directly produce an output distribution format (e.g. 608 caption streams) do so from a more abstract representation that is often purely internal.

It is this abstraction, made public and normalised across all caption and subtitle workflows, that I believe is essential to the new world of captioning and subtitling.
Whilst the current situation is not ideal, I believe this ambition is best served by basing the archive / authoring forms of captions / subtitles on TTML, since extension and validation are key, primary concepts underpinning the XML foundation of TTML.

I hope this further clarifies my position on WebVTT and TTML.

Finally, please don't focus on improving WebVTT to meet my requirements, as they are closer to being met by other standards. I would rather urge you to focus on making WebVTT an effective *presentation* format for conversions from other caption formats (including TTML / SMPTE-TT :-).

With best regards,
John

John Birch | Strategic Partnerships Manager | Screen
Main Line : +44 1473 831700 | Ext : 270 | Direct Dial : +44 1473 834532
Mobile : +44 7919 558380 | Fax : +44 1473 830078
John.Birch@screensystems.tv | www.screensystems.tv | https://twitter.com/screensystems

Visit us at
Broadcast Asia 2013, 18 - 21 June 2013, Booths 5E4-01 & 5E4-02, UK Pavilion, Marina Bay Sands, Singapore


Before printing, think about the environment

-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
Sent: 15 June 2013 11:18
To: John Birch
Cc: Glenn Adams; public-tt
Subject: Re: A new proposal for how to deal with text track cues

On Fri, Jun 14, 2013 at 9:37 PM, John Birch <John.Birch@screensystems.tv> wrote:
> Hi Silvia,
>
> Thanks for your email... I've commented in-line below. (>>)
>
> As I state below, please do not misunderstand: I am not against the implementation of another subtitle / caption *output* format. I am concerned, however, about an output format that seeks to 'gloss over' the potential inadequacies of the caption / subtitle authoring. Captions and subtitles should never be considered 'second-rate' ancillary content that can be fixed up by a 'clever' browser. For accessibility there is a clear ethical desire to have the best authored content. For translation (where as much as 90% of the audience may need a quality translation experience), the commercial driver for high-quality subtitles is even more important. Garbage in, garbage out. My primary concern with WebVTT is that far too much attention is being paid to supporting a 'garbage in' mentality.


I think you're still misunderstanding what WebVTT does. If a file is of high quality and captions/subtitles are authored to a high standard (as I would expect from commercial entities), and the video is being displayed at a sufficient size to display the authored content as intended, the rendering algorithm will not do any, as you call it, 'fix up'.

However, the browser has to do something when a line of text has to be wrapped because the video's width is too small to render the text.
Also, the current spec will - in the unlikely event that several cues are rendered at the same time and have been poorly authored to overlap each other - try to move the cues slightly to make the text not overlap. These are the only two situations in which the WebVTT rendering algorithm will make any changes to the positioning of the text.

Also, you might want to talk with the people at YouTube that have to deal with a lot of garbage captions that they are getting as input, but they still manage to extract a lot of good quality captions out of them, so your "garbage in - garbage out" argument wouldn't hold for YouTube. Note, however, that YouTube does a lot more than what we have codified into the WebVTT rendering algorithm.


>>>TTML is a **markup** language. It is intended to contain the necessary structure to convey the intention of an author as to how text should appear timed against external content. It does NOT define a specific rendering implementation, the referenced rendering aspect is illustrative of the specification, and any rendering implementation is permitted.

When defining a markup language, but not defining the means of rendering, you allow rendering devices the freedom to interpret the markup differently, thus leading to different visual experiences.
Surely that is not a good thing.


>>>This has been the case since inception (over 10 years). It has been unequivocal how TTML should be interpreted (barring a few corner cases that are well documented and will be resolved in the next edition).

WebVTT is in the same position - we're also sorting out some corner cases.


>>> BTW, SMPTE-TT has more to say about practical rendering implementations in the captioning sense than TTML. For many of the use cases for which TTML was intended, it is much further along than WebVTT.

Can you point out which use cases TTML is ahead of WebVTT? I'd like to understand what shortcomings there are so we can make sure to cover all use cases, or clarify any misunderstandings.


>>> I stand by my ("half-finished strawman") statement. I have followed the public **incremental** development of the WebVTT standard.

That's how all standards are written.


>>> I have had no inclination to attempt implementation against a moving target. Not all formats evolve to support more features. The better the requirements analysis and scoping phase is, the less radical evolution is required in the specification. Writing the spec should be the easy part - working out what to put in it is the difficult trick. By comparison to WebVTT, TTML had a long gestation, but the published standard was IMHO clearer and has certainly not evolved so much since publication.

 You might want to check back with the beginnings of TTML to an email about "Iterating toward a solution":
http://lists.w3.org/Archives/Public/public-tt/2003Feb/0039.html . That was in February 2003 - and TTML is still fixing bugs. That's continuous incremental improvement and it's the norm with all specifications that continue to be in active use and adapt to reality, which is a good thing.



>>> I don't disagree. But my comment was more about why it seemed necessary to develop a standard that effectively contests some of the same space as TTML, especially when TTML was already well formed and published at the time that WebVTT was conceived. If WebVTT had been positioned and defined as a rendering environment for TTML (which is now being discussed) we would not be having this discussion.

I'm not going there - that was a decision that the browsers made after looking at TTML. It's history and we can't change it any more.


> You may have missed that there is an actual spec for this:
> https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html
> Other conversions are planned, but have not been required yet.
>
>>> I must have missed the announcement last week! ;-) BTW, from an admittedly cursory look I have reservations about mapping 608 row positions to (recurring) fractional percentages. The potential problems this can create are among the reasons why the TTML standard includes a cell positioning concept.

It has been around for at least a year and I've been pointing people toward it. The fractional percentage is simply the outcome of converting the CEA608 columns to exact percentages on the video.


>> There does not seem to be a huge awareness of the role of a professional captioner or subtitler. Or of the role of commercial subtitling and captioning organisations, or of the existence of (internal) quality standards for caption / subtitling services that are adopted (insisted upon) by those organisations.
>> The professional captioning and subtitling profession is largely ignorant of WebVTT.
>
> If this statement implies that professional captioning and subtitling organisations are ignoring WebVTT, then you may have overlooked that some are already supporting it and others are keeping a close eye.
> They don't seem to be making a big fuss about it though. For example:
>
> http://www.cpcweb.com/webcasts/webcast_samples.htm#WebVTT
> http://www.automaticsync.com/captionsync/captionsync-delivers-webvtt-output/
> http://www.synchrimedia.com/
> http://www.longtailvideo.com/support/jw-player/29360/basic-vtt-captions/
> http://www.wowza.com/forums/content.php?498-How-to-stream-WebVTT-subtitles-to-iOS-for-closed-captioning
>
>>> Most of the organisations you mention are not captioning or subtitling companies operating in the TV / Film / Content creation marketplace. They are mostly organisations involved in the **redistribution** of media (excluding CPC). Captioning and subtitling (as creative activities) take place at (or on behalf of) content owners / creators as well as at re-distributors. It is this former (professional-level) authoring community that I do not believe WebVTT is connected with.

CPC is captioning for the TV market AFAIK. TV & Film companies don't create captions themselves but get them made by captioning companies.
They buy the formats that they need and as long as they publish to TV & Film - not the Web - they don't need WebVTT.



>>>Captions should be positioned, styled and timed using a concise, structured and partitioned framework. It should not be necessary to have an in-depth knowledge of an arcane set of rules in order to achieve these requirements.

Right. WebVTT has a very clear approach to how to position, style and time captions - I don't see the problem.


>>> My biggest reservations about WebVTT are that it appears that it is being promoted as a container for subtitling and caption content at the **authoring and archive** level.

WebVTT is a captioning format for the Web - that's all. Nobody is promoting it for anything else. If companies see a need to archive content in this format, I wouldn't have any problem with that. Why would that be a problem for you?


>>>> In truth I have no problem with WebVTT as a delivery format to be interpreted by a browser or agent, although clearly I would prefer that there was only one such format. However, WebVTT is late in the game, and it does not IMHO address the requirements of authoring and archive. This may be due to a lack of appreciation of the number of phases that subtitle and caption content goes through in a 'professional' broadcast environment. Like video, subtitles and captions exist in different 'silos' and are transformed (often repeatedly) depending on the final target application. The US captioning model (that of captioning near output or creating caption master tapes) is NOT representative of captioning globally, nor is it at all representative of subtitling (multiple language translation) workflows.


WebVTT was built for an international market and has taken such requirements on board - it even supported <ruby> before TTML introduced it. It had to do so because the Web is a global phenomenon.
Some features were driven later by US law, but the initial target was always an international use.


>>> I hope that clarifies my reservations about WebVTT.

Yes, thanks. I don't believe I will be able to change your mind about WebVTT, so I'll just have to focus on improving it to meet your bar.
:-)

Cheers,
Silvia.

This message may contain confidential and/or privileged information. If you are not the intended recipient you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by reply e-mail and delete this message. Thank you for your cooperation. Screen Subtitling Systems Ltd. Registered in England No. 2596832. Registered Office: The Old Rectory, Claydon Church Lane, Claydon, Ipswich, Suffolk, IP6 0EQ

Received on Wednesday, 19 June 2013 03:17:14 UTC