RE: A new proposal for how to deal with text track cues

Hi Silvia,

Let's have one last go around the loop ;-)

But before I make specific comments I'd like to set up some background.

Captions != Subtitles.

Captions
<10% of viewers need them.
Viewers may have associated vision problems (captions are largely used by an ageing population that has age-related eyesight problems).
Broadcasters don't want to pay, especially for a small proportion of their audience.

Subtitles
>90% of viewers need them - program is useless without translation.
Viewers don't generally have eyesight problems - they don't want intrusive text - subtitles should be read but not seen.
Broadcasters will pay to ensure chosen target foreign audience stay watching their content.

BTW, subtitling for the deaf community (i.e. translation) also happens. Most broadcasts of non-local material where translation is occurring will have two tracks to choose from: one subtitles, and one captions (translation plus sound effects).

In the broadcast world ONLY IN COUNTRIES USING THE CEA 608 / 708 CAPTION STANDARD do viewers have the choice of font size, colour and position of text.
That's basically USA / Canada and some South American countries.

ONLY in the USA is there a government mandate that captions on the Internet look the same as captions on TV - and offer the same facilities to the viewer.

Comments inline >>

Best regards,
John

John Birch | Strategic Partnerships Manager | Screen
Main Line : +44 1473 831700 | Ext : 270 | Direct Dial : +44 1473 834532
Mobile : +44 7919 558380 | Fax : +44 1473 830078
John.Birch@screensystems.tv | www.screensystems.tv | https://twitter.com/screensystems

Visit us at
Broadcast Asia 2013, 18 - 21 June 2013, Booths 5E4-01 & 5E4-02, UK Pavilion, Marina Bay Sands, Singapore


Before printing, think about the environment.

-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
Sent: 23 June 2013 09:31
To: John Birch
Cc: Glenn Adams; public-tt
Subject: Re: A new proposal for how to deal with text track cues

Hi John,

I don't even know if I want to reply to this, because there are some fundamental ways in which the Web and Web browsers work that you don't seem to subscribe to. As long as we disagree about these approaches and about whether Web features are necessary, no agreement will be possible on how a caption format for the Web should work.
I'd rather agree to disagree at that stage. But let's see if we have reached that point yet.

>> Yes, the web is different to TV: for a start there is 'something' (page space) outside the video, and the video is not necessarily 'full screen' as it is on TV. And I agree that the web world view of allowing the page viewer to change text size etc. is interesting, and that it could have an impact on captioning - and maybe even subtitling. But here is the 'rub': these user-driven changes are related to access issues, which relate generally to captions (translation or same language), i.e. to text for the Hard of Hearing... NOT to all viewers of translation subtitles, where the broadcasters and the viewers are generally both looking for an intrinsically aesthetically pleasing experience.

On Wed, Jun 19, 2013 at 1:16 PM, John Birch <John.Birch@screensystems.tv> wrote:
> Hi Silvia,
>
> With respect to the specific points in your email:
> RE:  ...the browser has to do something when a line of text has to be wrapped because the video's width is too small to render the text.
> Why should the browser 'do something'? The text should remain relative in size to the video; if that has readability implications, then a) one would question the validity of captioning at that size in the first place, and b) the captions should be **specifically purposed** for that reduced screen size. The concept of a one-size-fits-all caption file is invalid. Reduction in the amount of text might be appropriate, or a faster repetition rate of shorter captions might be an alternative strategy. An algorithm cannot get close to an acceptable presentation automagically without considerably more metadata than is available at this point in the chain.

There are situations in which everything was authored well and the browser still has to wrap lines.
For example: suppose the browser is given a line of text to render onto a video of a given size, and that the captions were specifically purposed for that viewport size. Browsers allow users to interact with the content, for example to increase the font size to something they find comfortable to read. Suppose that font size increase makes the text so big that, unless the lines are wrapped, words will move off the viewport and disappear. Wrapping lines in such situations is just the lesser of two evils.
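The scenario above can be sketched as a toy greedy word-wrapper. This is a simplification for illustration only: it assumes a fixed-width font where every character occupies `char_px` pixels, whereas real browsers measure actual glyphs.

```python
# Toy greedy word-wrapper. Assumes a fixed-width font (char_px pixels
# per character); real browsers use glyph metrics. Illustrates why a
# larger font forces the renderer to re-wrap a well-authored cue.
def wrap_cue(text: str, viewport_px: int, char_px: int) -> list[str]:
    max_chars = max(1, viewport_px // char_px)
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word  # a word longer than one line overflows as-is
    if current:
        lines.append(current)
    return lines

cue = "Captions should be read but not seen"
print(wrap_cue(cue, 640, 16))  # fits on one line at the smaller font
print(wrap_cue(cue, 640, 32))  # doubling the glyph width forces a wrap
```

Same cue, same viewport, only the font size differs, yet the output goes from one line to two; that forced re-wrap is exactly the behaviour a browser has to define somehow.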

>> This might make sense for a caption user (i.e. for accessibility reasons). But frankly I do not believe that the US mechanism of allowing caption size to be changed by the viewer has generally benefitted the US TV users given the impacts it has on word wrapping and readability. US captions are the worst in the world when it comes to accuracy, and consistent speed of presentation (which you'd admit is useful if you have difficulty reading). US captions often vanish within frames of being displayed, giving no time to read them. Increasing the font size generally makes this problem even worse. I agree with the intent of what you are saying, and also agree that it would be great to support larger text presentations for some users, but replicating the IMHO poor US 608 decoder experience on the Internet does not get my vote.

> RE: ... Also, you might want to talk with the people at YouTube that have to deal with a lot of garbage captions that they are getting as input...
> I would not regard most YouTube captioning as 'setting a high bar' in caption / subtitling quality 1/2 ;-).

I disagree with your analysis because I have seen much worse caption display on digital TV or other forms of broadcasting - compared to those, YouTube is setting a very high bar on how it renders captions.
But let's not get into any "he said, she said" here. ;-)
>> You must have only watched US TV! 1/2 ;-) Most YouTube captions spill single words to the second line... I mean - really! That forces the viewer's eyes to re-acquire the text, a process which has been measured as taking up to 1/2 second for some viewers. That's a MAJOR issue. There is considerable research on captioning, and several conferences a year where this sort of research is presented. This is a professional activity, and a seriously academic one too! The science of captioning and subtitling is well understood; it's more that many implementations ignore the research on what should be done.
> RE: You might want to check back with the beginnings of TTML to an email about "Iterating toward a solution":...
> And we certainly did iterate towards a solution... but there was no published document that we were calling a standard.

You're calling TTML a standard now, even though TTML is still evolving.
>> TTML *is* a published standard. TTML is also still evolving, under a phased process, although I would humbly suggest that the rate of evolution has slowed considerably. My comments reflected the origins of WebVTT and its early and very public development, not so much the current state of play.

In any case: the reason why David is proposing to take the WebVTT spec from the CG into this WG is to take it onto the official path of standardisation of the W3C. If that's the only way to make you feel comfortable about a document, there is no issue here: it's happening.
>> Yes, and I think this is great.

> RE: ...The fractional percentage is simply the outcome of converting the CEA608 columns to exact percentages...
> You don't see the contradiction inside this statement? Fractional percentages require the specification of a rounding algorithm, otherwise you can get 'pixel bouncing' when different rows are addressed.
> e.g. the second line of a 2 line caption may appear at a different pixel position to a single line caption positioned on the bottom row.... even a very slight registration error between successive captions is noticeable and disruptive to readability.

CEA608 columns are defined based on a limited set of viewport resolutions. The Web knows no such thing: potentially any resolution is possible. We have to work within the restrictions of what is given, and the percentage is simply the consequence.
>> Yes, I understand your 'mapping' problem, but be advised... inconsistent positioning of caption text (even single pixel bounce) has been shown by research to have a significant impact on viewer fatigue and readability.
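For illustration only (this is neither spec's actual algorithm), here is a toy numeric example of the one-pixel bounce under discussion: the numbers assume a 15-row CEA-608-style grid mapped onto a 400px-high video area.

```python
# Assumed numbers: a 15-row caption grid on a 400px-high video area.
# Shows how rounding intermediate values can place the second line of a
# two-line caption one pixel away from a single-line caption on the
# same bottom row - and how rounding once, from the exact value, fixes it.
ROWS, H = 15, 400
line_h = H / ROWS            # exact line height: 26.666... px

# Single-line caption anchored on the bottom row:
single_top = round(H - line_h)                    # 373

# Second line of a two-line caption, if an implementation rounds the
# block top and the line height separately before adding them:
block_top = round(H - 2 * line_h)                 # 347
second_top_stepwise = block_top + round(line_h)   # 374 -- one-pixel bounce

# Specifying the algorithm (round once, from the exact value) fixes it:
second_top_direct = round(H - 2 * line_h + line_h)  # 373, matches single_top

print(single_top, second_top_stepwise, second_top_direct)
```

This is why a spec that spells out the rounding step gets consistent registration across implementations, while one that leaves it implicit invites exactly the bounce described above.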

BTW: the SMPTE-TT conversion document simply says that a region must be created and put into the correct position, where percentages are used in the example. Thus, SMPTE-TT's implication is similar to the one that the WebVTT conversion proposes, except it doesn't spell it out. That basically assumes that the pixel error will also happen, but because it's not spelled out in the spec, the spec gets away with an "implementation quality problem" (otherwise called: "blame the engineers"). This can result in different rendering results by different implementations. It's such results that the WebVTT spec with its exact spelling out of the algorithms seeks to avoid.
>> Yes, TTML has the same issue. Actually the issue is that the web (or, in fact, computers more generally) does not support what is required for caption / subtitle text layout. Captions and subtitles should not fit into a specific space; instead, the captions and subtitles should define the size of the space they need. I.e. the text should dominate the region, not the other way around. It is the readability of the text that is paramount, not the space or size of the box it fits into.

> On a more positive side, I'm hoping that we are now approaching some closure with this thread and the WebVTT threads discussing the interoperability questions... I'm really pleased to see how quickly that is being resolved.
> I hope also that you can appreciate the 'angle' from which I make my comments, they are not intended to be critical of either the abilities or motivations of the WebVTT community.
>
> Rather they are driven by frustrations of over a decade of working on captioning and subtitling within both traditional broadcast video distribution and more recently web distribution.
> Captioning / subtitling has been plagued by a history of a) second class status b) a misunderstanding of the difference between an output distribution form and authoring, archive and mezzanine forms.
> (I suspect this second problem arises as a result of the first).
>
> Captions and subtitles **must** be regarded in the same light as video
> or audio material... they should go through the same (often silo'd) workflow steps. [Author, Edit / Purpose, Distribute] The early implementations of caption technology for traditional broadcast encouraged a merger of these steps, such that the same file technology has been used at all stages (particularly in the USA)... although in Europe and Asia, the increased complexities of workflows to support multiple languages eventually resulted in the use of multiple formats for the captions / subtitle storage files (e.g. STL, and several proprietary formats).
>
> Recently the proliferation of new distribution mechanisms for video content, starting with HD and Digital Cinema and exploding with web distribution, has greatly increased the need to properly consider how captions and subtitles should be handled to maximise the efficiency of production and distribution. Most captioning and subtitling workflows are still human mediated. Human mediated workflows have resulted in the absence of the metadata **inside the caption / subtitle file or stream** that is necessary for automatic computer mediated conversion.
>
> For me, the ideal state of captioning / subtitling is a world where content is authored in a metadata rich environment (including context, prosody, verbatim and summarisation, author identification, proxy media characteristics[what was used to caption from] and rights management). This authored format might then be archived as is, or stripped of some of the metadata to form a mezzanine format. The mezzanine form would then be similarly transformed to form specific output forms of captions or subtitles (e.g. Teletext, CEA608, DVB, Digital Cinema, Blu-ray/DVD, TTML, WebVTT). Thus there is a progression from an abstract form of caption to a target specific form of caption.
>
> One should not expect that reverse transforms would result in an ideal output... too much is lost when you convert from the intent of the captions (who said what, when, where and why) to a target presentation where convention and format define the presentation used to convey that intent. What is lost is the metadata. Now it has been argued by proponents of target output formats that this metadata can be incorporated in these target output formats (using 'comments' typically), but this misses an important criterion for efficient repurposing... i.e. a **standardised** and public mechanism (or ontology) for conveying this metadata... since clearly metadata is only useful if it can be extracted and consistently represents the same concepts.
>
> So, in closing, please understand that comments about 'authoring'
> target distribution formats are 'trigger' statements for me. In my world it is far preferable to speak of conversion to a distribution format, rather than directly authoring in it ;-) In truth, even those tools that directly produce an output distribution format (e.g. 608 caption streams) will do so from an (often only) internal more abstract representation.
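As a sketch only, with invented field and function names (no real tool's API), the progression described above - a metadata-rich authored form, stripped to a mezzanine form, then converted to a target-specific output - might look like:

```python
from dataclasses import dataclass

@dataclass
class AuthoredCaption:
    # Metadata-rich authored form; the fields here are illustrative
    text: str
    start: float                 # seconds
    end: float
    speaker: str = ""            # author identification
    prosody: str = ""            # e.g. 'shouted', 'whispered'
    rights: str = ""             # rights-management note

def to_mezzanine(cap: AuthoredCaption) -> dict:
    """Strip authoring-only metadata; keep what distribution needs."""
    return {"text": cap.text, "start": cap.start,
            "end": cap.end, "speaker": cap.speaker}

def to_webvtt_cue(mezz: dict) -> str:
    """One target-specific output form: a simplified WebVTT cue."""
    def ts(t: float) -> str:
        m, s = divmod(t, 60)
        return f"{int(m):02d}:{s:06.3f}"
    return f"{ts(mezz['start'])} --> {ts(mezz['end'])}\n{mezz['text']}"

cap = AuthoredCaption("Hello there.", 1.0, 2.5, speaker="ANNA",
                      prosody="neutral", rights="broadcast-only")
print(to_webvtt_cue(to_mezzanine(cap)))
```

Each step discards information (prosody and rights never reach the cue), which is the one-way loss the paragraph above argues makes reverse transforms unreliable.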

I understand this need for continued conversion and the pain that it brings - conversion loss for captions can sometimes be worse than the quality loss in image or audio conversion. I believe, however, that for 90% of files created we should be able to convert between TTML and WebVTT without loss. You might have noticed that Sean and I are really trying hard to make that work.
>> Yes, and again I am supportive of this.

> It is this abstraction that I believe is essential to the new world of captioning and subtitling, but made public and normalised across all caption and subtitle workflows.
> Whilst not currently ideal, I believe this ambition is best served by basing archive / authoring forms of captions / subtitles on TTML, since extension and validation are key and primary concepts beneath the XML foundation of TTML.
>
> I hope this further clarifies my position on WebVTT and TTML.
>
> Finally, please don't focus on improving WebVTT to meet my requirements, as they are closer to being met by other standards. I would rather urge you to focus on making WebVTT an effective *presentation* format for conversions from other caption formats (including TTML / SMPTE-TT :-).

If there is a requirement that can't be represented in WebVTT, then it can't be rendered from WebVTT either. So, don't hesitate to raise use cases.
>> The requirement is more fundamental. Stop thinking about presentation. Start thinking about why the text is there... and how that impacts how it should be presented. It's not about making random choices for wrapping and text sizes... it's about including potential wrapping points in the text in the first place (i.e. about annotating the phrases in the text so they can be preserved [again, research shows this has a big impact on comprehension]). It's about indicating in the caption file (or the video file) information about where text should NOT be presented, e.g. not over the speaker's lips, because that impacts viewers who combine caption reading with lip reading. Etc., etc., etc.
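As a purely hypothetical sketch of the phrase-annotation idea (the '|' marker is invented here for illustration and is not WebVTT or TTML syntax), a renderer that breaks only at author-annotated phrase boundaries might look like:

```python
# '|' marks author-annotated phrase boundaries (an invented convention).
# The renderer may break ONLY at these points, so no phrase is ever
# split mid-thought - unlike arbitrary word wrapping.
def wrap_at_phrases(annotated: str, max_chars: int) -> list[str]:
    phrases = [p.strip() for p in annotated.split("|")]
    lines, current = [], ""
    for phrase in phrases:
        candidate = f"{current} {phrase}".strip()
        if current and len(candidate) > max_chars:
            lines.append(current)   # break at a phrase boundary only
            current = phrase
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

cue = "I told you | not to open that door | until morning"
for line in wrap_at_phrases(cue, max_chars=25):
    print(line)
```

Every emitted line is a whole phrase, which is the preservation-of-phrases property the research John cites cares about; a plain word-wrapper has no way to know where those boundaries are.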

>> THIS IS THE BIG POINT... captioning meets the web... and we have an opportunity to rethink how it should be done. Instead all we focus on is reproducing mechanisms and conventions that were developed with the constraints of a 50+ year old technology for a different medium. EPIC FAIL!

However, I agree that we should likely conclude this thread now, in particular if as you say other formats meet your needs better.
>> Happy to continue this 'off group' if you wish... I can evangelise for years on this topic!

Best Regards,
Silvia.
>> best regards,
John


This message may contain confidential and/or privileged information. If you are not the intended recipient you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by reply e-mail and delete this message. Thank you for your cooperation. Screen Subtitling Systems Ltd. Registered in England No. 2596832. Registered Office: The Old Rectory, Claydon Church Lane, Claydon, Ipswich, Suffolk, IP6 0EQ

Received on Sunday, 23 June 2013 19:06:37 UTC