Using TTML for caption delivery: discussion

I've been following the discussion of WebVTT and W3C/SMPTE TTML, and I have some concerns.

TTML is a large specification, and SMPTE-TT extends it.  It was designed primarily as an interchange format, as the spec says, and as such it contains a rich set of features and expressive capability.  I was on the committee in the early days, and, as I recall, we looked at traditional captioning delivery on TV, multimedia text formats and the support for them, and other industry practices used in caption authoring.  If you want to capture every nuance of the text flow, so that you can keep that in a database and use that source to deliver captions in a variety of formats (some of which may not support every nuance), it's a good format.  In some sense, it represents a union of the formats and use cases we looked at.

However, if you want to deliver captions, temporally, over the web, TTML (W3C or SMPTE) may be less suitable.  One obvious example problem: as far as I can tell, there is no requirement that the document be in time order, and there doesn't even seem to be a feature designator that would allow a profile to require that conforming documents be in time order.  That means that you have to receive and parse an entire document before you can display anything.
SVG has a progressive rendering feature; does TTML, or should it?
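
To illustrate (a hand-made sketch, not taken from any real document), the following is, as far as I can tell, perfectly conforming TTML, even though the paragraphs appear in reverse time order:

  <tt xmlns="http://www.w3.org/ns/ttml">
    <body>
      <div>
        <!-- the cue timed later appears first in the document... -->
        <p begin="00:00:20.000" end="00:00:25.000">Second caption shown</p>
        <!-- ...so nothing short of end-of-body tells you it is safe to render -->
        <p begin="00:00:05.000" end="00:00:10.000">First caption shown</p>
      </div>
    </body>
  </tt>

A receiver that has parsed only the first paragraph cannot know whether an earlier-timed cue is still to come.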

I think SMPTE has partially addressed this in SMPTE-TT by suggesting that the author divide the timeline into sections, where the timed text for each section is described by a complete TTML document.  But I don't think this is a good general solution.  To take a specific example, in HTTP-based streaming the media is segmented for fetching, and those segments can be quite short (a few seconds).  It may not be possible to find clean 'division points' in a TTML document at that granularity, and even if it is, repeating all the 'static' information in every segment may represent significant overhead.
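
As a rough sketch of that overhead (the styling here is invented purely for illustration), each few-second segment must be a complete document, repeating the static head verbatim:

  <!-- segment 1, covering roughly 0s-4s -->
  <tt xmlns="http://www.w3.org/ns/ttml"
      xmlns:tts="http://www.w3.org/ns/ttml#styling">
    <head>
      <styling>
        <style xml:id="s1" tts:color="white" tts:fontFamily="proportionalSansSerif"/>
      </styling>
    </head>
    <body><div>
      <p style="s1" begin="1s" end="3s">First cue</p>
    </div></body>
  </tt>

  <!-- segment 2, covering roughly 4s-8s: the same head, all over again -->
  <tt xmlns="http://www.w3.org/ns/ttml"
      xmlns:tts="http://www.w3.org/ns/ttml#styling">
    <head>
      <styling>
        <style xml:id="s1" tts:color="white" tts:fontFamily="proportionalSansSerif"/>
      </styling>
    </head>
    <body><div>
      <p style="s1" begin="5s" end="7s">Second cue</p>
    </div></body>
  </tt>

With a realistic styling and layout section, the repeated head can easily dwarf the one or two cues each segment actually carries.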

It's worth noting that WebVTT is easily delivered incrementally, as the body is simply a list of cues in start-time order.
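
For instance (an illustrative fragment, not from any real file), a client can render each cue as soon as it arrives, since nothing later in the file affects cues already seen:

  WEBVTT

  00:00:05.000 --> 00:00:10.000
  This cue can be displayed the moment it is received,

  00:00:20.000 --> 00:00:25.000
  without waiting for anything that follows.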

While it's useful and important that it be possible to deliver the captions as a side-car file to the main media file, it's a nuisance if it's not possible to make an integrated media file.  If the captions are truly 'part of the content' -- for example, textual material supplied as part of a lecture -- one really needs them to be in the content.  And our experience of handling multi-file presentations (movies split over several files) is that for distribution, databasing, and so on, it's much easier if the content is in one file.  Finally, having the captions inside means that, for example, selecting a time range of a presentation (e.g. using a W3C Media Fragment identifier, or using an editor) should 'just work'.  We need caption formats that can be carried inside MP4, Matroska, MPEG-2 transport streams, etc.



I think if we are to proceed with TTML as part of the workflow, we ought to look at:
a) the mapping of TTML to WebVTT and perhaps other caption formats, letting that inform a TTML profile which describes what can be done in simpler formats like WebVTT;
b) defining a TTML profile that is more suited to delivery (for example, time ordering) and to the web (e.g. one that uses CSS);
c) asking MPEG to consider how TTML can be carried in MP4 files and MPEG-2 Transport Streams -- something that would make it more like 3GPP Timed Text, where the individual cues sit in time order in the file, for example;
d) asking the same question of others who own file formats (such as Matroska).




David Singer
Multimedia and Software Standards, Apple Inc.
