Re: A new proposal for how to deal with text track cues

Hi John,

I'm not even sure I want to reply to this, because there are some
fundamental ways in which the Web and Web browsers work that you
don't seem to subscribe to. As long as we disagree about these
approaches and about which Web features are necessary, no agreement
will be possible on how a caption format for the Web should work.
I'd rather agree to disagree at that point. But let's see whether we
have reached that point yet.


On Wed, Jun 19, 2013 at 1:16 PM, John Birch <John.Birch@screensystems.tv> wrote:
> Hi Silvia,
>
> With respect to the specific points in your email:
> RE:  ...the browser has to do something when a line of text has to be wrapped because the video's width is too small to render the text.
> Why should the browser 'do something'? The text should remain relative in size to the video, or if that has readability implications a) one would question the validity of captioning in the first place, and b) the captions should be **specifically purposed** to that reduced screen size. The concept of a one size fits all caption file is invalid. Reduction in the amount of text might be appropriate, or a faster repetition rate of shorter captions might be an alternative strategy. An algorithm cannot even get close to an acceptable presentation automagically without considerably more metadata than is available at this point in the chain.

There are situations in which everything was authored well and the
browser still has to wrap lines.
For example: assume the browser is given a line of text to render
onto a video of a given size, and assume further that the captions
were specifically purposed for that viewport size. Browsers allow
users to interact with the content, for example to increase the font
size to something they find comfortable to read. If that font size
increase makes the text so big that words would move off the
viewport and disappear unless the lines are wrapped, then wrapping
the lines is simply the lesser of two evils.
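
To make that concrete, here is a rough sketch (in TypeScript, purely
for illustration - this is not the actual WebVTT line-breaking
algorithm, and the average glyph width factor is an assumption) of
the kind of greedy word wrap a renderer ends up doing once the user
has increased the font size:

    // Minimal sketch, not the normative WebVTT algorithm: greedy word
    // wrap so that no rendered line exceeds the viewport width after a
    // user font-size increase. The 0.6 * fontSize glyph width is an
    // assumed average for illustration; a real renderer measures the
    // text it lays out.
    function wrapCueText(
      text: string,
      viewportWidthPx: number,
      fontSizePx: number
    ): string[] {
      const approxGlyphWidth = 0.6 * fontSizePx; // assumed average width
      const maxChars = Math.max(
        1,
        Math.floor(viewportWidthPx / approxGlyphWidth)
      );
      const lines: string[] = [];
      let current = "";
      for (const word of text.split(/\s+/)) {
        const candidate = current === "" ? word : current + " " + word;
        if (candidate.length <= maxChars) {
          current = candidate; // word still fits on this line
        } else {
          if (current !== "") lines.push(current); // start a new line
          current = word;
        }
      }
      if (current !== "") lines.push(current);
      return lines;
    }

    // e.g. wrapCueText("the quick brown fox jumps over it", 320, 32)
    // yields several short lines instead of letting words run off a
    // 320px-wide video.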


> RE: ... Also, you might want to talk with the people at YouTube that have to deal with a lot of garbage captions that they are getting as input...
> I would not regard most YouTube captioning as 'setting a high bar' in caption / subtitling quality ;-).

I disagree with your analysis: I have seen much worse caption
display on digital TV and in other forms of broadcasting - compared
to those, YouTube sets a very high bar on how it renders captions.
But let's not get into a "he said, she said" here. ;-)


> RE: You might want to check back with the beginnings of TTML to an email about "Iterating toward a solution":...
> And we certainly did iterate towards a solution... but there was no published document that we were calling a standard.

You're calling TTML a standard now, even though TTML is still evolving.

In any case: the reason David is proposing to move the WebVTT spec
from the CG into this WG is to take it onto the W3C's official
standardisation track. If that's the only way to make you feel
comfortable about a document, there is no issue here: it's
happening.


> RE: ...The fractional percentage is simply the outcome of converting the CEA608 columns to exact percentages...
> You don't see the contradiction inside this statement? Fractional percentages require the specification of a rounding algorithm, otherwise you can get 'pixel bouncing' when different rows are addressed.
> e.g. the second line of a 2 line caption may appear at a different pixel position to a single line caption positioned on the bottom row.... even a very slight registration error between successive captions is noticeable and disruptive to readability.

CEA608 columns are defined for a limited set of viewport
resolutions. The Web knows no such thing: potentially any resolution
is possible. We have to work within the restrictions of what we are
given, and the percentages are simply the consequence.

BTW: the SMPTE-TT conversion document simply says that a region must
be created and put into the correct position, and percentages are
used in its example. SMPTE-TT's implication is thus similar to what
the WebVTT conversion proposes, except that it doesn't spell it out.
The same pixel error will therefore also occur, but because it's not
spelled out in the spec, the spec gets away with calling it an
"implementation quality problem" (otherwise known as "blame the
engineers"). That can lead to different implementations producing
different renderings - exactly the kind of outcome the WebVTT spec,
with its exact spelling out of the algorithms, seeks to avoid.
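
To illustrate (and this is only a sketch in TypeScript, not the
normative WebVTT or SMPTE-TT mapping): CEA608 addresses a grid of 32
columns by 15 rows, so converting a column index to a
viewport-relative position naturally yields fractional percentages,
e.g. column 1 -> 3.125%. If every renderer applies one and the same
rounding rule, in one place, the same column always lands on the
same pixel no matter how many lines the caption has - which is what
avoids the registration jumps ("pixel bouncing") you describe:

    // Illustrative sketch only, not the normative mapping.
    const CEA608_COLUMNS = 32;
    const CEA608_ROWS = 15;

    function columnToPercent(column: number): number {
      // fractional by construction, e.g. 1 / 32 * 100 = 3.125
      return (column / CEA608_COLUMNS) * 100;
    }

    function rowToPercent(row: number): number {
      return (row / CEA608_ROWS) * 100;
    }

    // A single, shared rounding step from percentage to device pixels.
    function percentToPixel(percent: number, viewportSizePx: number): number {
      return Math.round((percent / 100) * viewportSizePx);
    }

    // e.g. percentToPixel(columnToPercent(1), 1280) === 40 for every
    // caption that addresses column 1, whether it is a one-line or a
    // two-line caption.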


> On a more positive side, I'm hoping that we are now approaching some closure with this thread and the WebVTT threads discussing the interoperability questions... I'm really pleased to see how quickly that is being resolved.
> I hope also that you can appreciate the 'angle' from which I make my comments, they are not intended to be critical of either the abilities or motivations of the WebVTT community.
>
> Rather they are driven by frustrations of over a decade of working on captioning and subtitling within both traditional broadcast video distribution and more recently web distribution.
> Captioning / subtitling has been plagued by a history of a) second class status b) a misunderstanding of the difference between an output distribution form and authoring, archive and mezzanine forms.
> (I suspect this second problem arises as a result of the first).
>
> Captions and subtitles **must** be regarded in the same light as video or audio material... they should go through the same (often silo'd) workflow steps. [Author, Edit / Purpose, Distribute]
> The early implementations of caption technology for traditional broadcast encouraged a merger of these steps, such that the same file technology has been used at all stages (particularly in the USA)... although in Europe and Asia, the increased complexities of workflows to support multiple languages eventually resulted in the use of multiple formats for the captions / subtitle storage files (e.g. STL, and several proprietary formats).
>
> Recently the proliferation of new distribution mechanisms for video content, starting with HD and Digital Cinema and exploding with web distribution, has greatly increased the need to properly consider how captions and subtitles should be handled to maximise the efficiency of production and distribution. Most captioning and subtitling workflows are still human mediated. Human mediated workflows have resulted in the absence of the metadata **inside the caption / subtitle file or stream** that is necessary  for automatic computer mediated conversion.
>
> For me, the ideal state of captioning / subtitling is a world where content is authored in a metadata rich environment (including context, prosody, verbatim and summarisation, author identification, proxy media characteristics[what was used to caption from] and rights management). This authored format might then be archived as is, or stripped of some of the metadata to form a mezzanine format. The mezzanine form would then be similarly transformed to form specific output forms of captions or subtitles (e.g. Teletext, CEA608, DVB, Digital Cinema, Blu-ray/DVD, TTML, WebVTT). Thus there is a progression from an abstract form of caption to a target specific form of caption.
>
> One should not expect that reverse transforms would result in an ideal output... too much is lost when you convert from the intent of the captions (who said what, when, where and why) to a target presentation where convention and format define the presentation used to convey that intent. What is lost is the metadata. Now it has been argued by proponents of target output formats that this metadata can be incorporated in these target output formats (using 'comments' typically), but this misses an important criteria for efficient repurposing...i.e. a **standardised** and public mechanism (or ontology) for conveying this metadata... since clearly metadata is only useful if it can be extracted and consistently represents the same concepts.
>
> So, in closing, please understand that comments about 'authoring' target distribution formats are 'trigger' statements for me. In my world it is far preferable to speak of conversion to a distribution format, rather than directly authoring in it ;-)
> In truth, even those tools that directly produce an output distribution format (e.g. 608 caption streams) will do so from an (often only) internal more abstract representation.

I understand this need for continued conversion and the pain that it
brings - conversion loss for captions can sometimes be worse than the
quality loss in image or audio conversion. I believe, however, that
for 90% of files created we should be able to convert between TTML and
WebVTT without loss. You might have noticed that Sean and I are really
trying hard to make that work.
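
To give a feel for why the simple case round-trips, here is a
deliberately minimal sketch (in TypeScript; the TtmlParagraph shape
and the helper are assumptions for illustration only - real
converters also have to cover styling, regions and namespaces): a
TTML <p> with begin/end attributes and plain text maps directly onto
a WebVTT cue:

    interface TtmlParagraph {
      beginSeconds: number;
      endSeconds: number;
      text: string;
    }

    function formatTimestamp(seconds: number): string {
      const h = Math.floor(seconds / 3600);
      const m = Math.floor((seconds % 3600) / 60);
      const s = (seconds % 60).toFixed(3);
      const pad = (n: number) => String(n).padStart(2, "0");
      return `${pad(h)}:${pad(m)}:${s.padStart(6, "0")}`;
    }

    function ttmlToWebVtt(paragraphs: TtmlParagraph[]): string {
      const cues = paragraphs.map(
        (p) =>
          `${formatTimestamp(p.beginSeconds)} --> ` +
          `${formatTimestamp(p.endSeconds)}\n${p.text}`
      );
      return "WEBVTT\n\n" + cues.join("\n\n") + "\n";
    }

    // e.g. ttmlToWebVtt([{ beginSeconds: 1, endSeconds: 3.5,
    //                      text: "Hello" }]) produces:
    //
    //   WEBVTT
    //
    //   00:00:01.000 --> 00:00:03.500
    //   Hello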


> It is this abstraction that I believe is essential to the new world of captioning and subtitling, but made public and normalised across all caption and subtitle workflows.
> Whilst not currently ideal, I believe this ambition is best served by basing archive / authoring forms of captions / subtitles on TTML, since extension and validation are key and primary concepts beneath the XML foundation of TTML.
>
> I hope this further clarifies my position on WebVTT and TTML.
>
> Finally, please don't focus on improving webVTT to meet my requirements, as they are closer to being met by other standards. I would rather urge you to focus on making WebVTT an effective *presentation* format for conversions from other caption formats (including TTML / SMPTE-TT :-).

If there is a requirement that can't be represented in WebVTT, then it
can't be rendered from WebVTT either. So, don't hesitate to raise use
cases.

However, I agree that we should probably conclude this thread now,
in particular if, as you say, other formats meet your needs better.

Best Regards,
Silvia.

Received on Sunday, 23 June 2013 08:31:49 UTC