RE: using TTML for caption delivery, discussion from Sean Hayes on 2011-02-13 (public-html-a11y@w3.org from February 2011)

From: Sean Hayes <Sean.Hayes@microsoft.com>
Date: Sun, 13 Feb 2011 00:23:23 +0000
To: David Singer <singer@apple.com>, HTML Accessibility Task Force <public-html-a11y@w3.org>
Message-ID: <8DEFC0D8B72E054E97DC307774FE4B913FA469FC@DB3EX14MBXC303.europe.corp.microsoft.c>
David, thanks for your thoughts. I'll address them mostly from my point of view as an implementer (multiple times) of TTML and also as chair of TTWG. From the point of view of the HTML5 specification, I don't believe we are in a position to say very much at this point. Having the <track> hook and leaving the format choice open seems to me the best we can do (I'll cover why I think this is sufficient later). But since TTML is in daily use, and has been for multiple years, captioning thousands of hours of mainstream video at the BBC, Foxtel and elsewhere, I think you can be reassured that TTML is indeed a practical delivery system for the web.

Point 1 - size of spec. 
I'll grant that the specification looks big, but that is in large part because it describes the intended authoring styling effects; I suspect that if you counted the CSS specifications that WebVTT relies on, it too would turn out similarly large. OTOH it's a fair deal smaller than, say, SVG, so I wouldn't say it's unreasonable. It does, as you say, cover essentially most use cases, which may be needed in a professional workflow. From an implementation point of view, especially if one can rely on the presence of libs for an XML DOM and a CSS formatter, as you can in a browser context, the complexity is reasonable. An implementation of TTML/SMPTE-TT can be written in around 500 lines of JavaScript.

Point 2 - temporal order.
In the vast majority of cases it's true that content can be written so that the text order and the temporal order are equivalent; however, there are cases where it is convenient to be able to separate these temporal threads. For example, if you have two regions of essentially independent captioning, it's easier to deal with them in independent sections of a document than to have them interleaved arbitrarily. And there may be no easy way to organize the document so that there is a single order.
Applying the time interval allocation over the DOM tree is actually fairly straightforward in practice, so it is fairly easy to extract the linearized sequence of events if you need to.
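
As a rough illustration of that last point, here is a minimal Python sketch of flattening nested timing into a time-ordered event list. It is not taken from any shipping TTML implementation, and it assumes a great deal: only `begin` and `end` attributes, only offset times written in seconds (such as `1.5s`), parallel time containment where a child's times are relative to its parent's begin, and captions carried only in `p` elements. The names `parse_seconds` and `linearize` and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

TT_NS = "{http://www.w3.org/ns/ttml}"


def parse_seconds(value):
    """Parse an offset time in seconds, e.g. "12.5s" (assumed simplification)."""
    if not value.endswith("s"):
        raise ValueError("this sketch only handles times like '12.5s'")
    return float(value[:-1])


def linearize(element, parent_begin=0.0, parent_end=float("inf"), out=None):
    """Walk the tree and collect (begin, end, text) for every <p> element.

    Child times are treated as offsets from the parent's begin, and a
    child's interval is clipped to its parent's end.
    """
    if out is None:
        out = []
    begin = parent_begin + parse_seconds(element.get("begin", "0s"))
    end_attr = element.get("end")
    end = parent_begin + parse_seconds(end_attr) if end_attr else parent_end
    end = min(end, parent_end)
    if element.tag == TT_NS + "p":
        out.append((begin, end, "".join(element.itertext()).strip()))
    else:
        for child in element:
            linearize(child, begin, end, out)
    return out


# Two independent caption threads kept in separate sections of the document.
DOC = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div begin="0s" end="20s">
      <p begin="1s" end="4s">Top region, first caption</p>
      <p begin="10s" end="13s">Top region, second caption</p>
    </div>
    <div begin="5s" end="20s">
      <p begin="0s" end="3s">Bottom region, first caption</p>
    </div>
  </body>
</tt>"""

# Sorting by begin time interleaves the two threads into one sequence.
events = sorted(linearize(ET.fromstring(DOC)))
for begin, end, text in events:
    print(begin, end, text)
```

The document keeps the two threads apart, and the single sorted pass at the end recovers the interleaved order a renderer would need.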

Point 3. Single or chunked delivery
Since the typical size of a caption file is only on the order of tens of KB, maybe 100 KB or so for a long-form movie, actually receiving it all up front and parsing it in one go isn't much of a problem, and is generally an advantage. The only time it would be an issue is in delivery of live content where you don't know the captions in advance, and as far as I can tell that's not a use case that is supported by the <video> tag today.
However, to support live production in an adaptive streaming context, as you say, it is possible to deliver individual files in HTTP responses generated on the fly that each cover smaller sections of the overall timeline, in the limit down to single captions if necessary. This actually works fairly well in practice, and there is server-side software and hardware that can do this with TTML today.
NB: Having the file be effectively closed is also a good redundancy mechanism. One of the downsides of an open-ended format is that you may never notice if the file gets truncated; I've noticed this quite a bit when looking at SRT caption files.
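
To make the truncation point concrete, here is a small Python sketch under stated assumptions. It uses the standard library XML parser as a stand-in for whatever parser a player would use, and a deliberately naive blank-line splitter (`parse_srt_blocks`, a made-up helper) as a stand-in for an SRT reader. The sample captions are invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def parse_srt_blocks(srt_text):
    """Naive SRT reader (assumption): cues are blocks separated by blank lines."""
    return [block for block in srt_text.strip().split("\n\n") if block.strip()]


TTML = (
    '<tt xmlns="http://www.w3.org/ns/ttml"><body><div>'
    '<p begin="1s" end="4s">Hello</p>'
    '<p begin="5s" end="8s">World</p>'
    "</div></body></tt>"
)
SRT = (
    "1\n00:00:01,000 --> 00:00:04,000\nHello\n\n"
    "2\n00:00:05,000 --> 00:00:08,000\nWorld\n"
)

# Simulate a transfer that was cut short.
truncated_ttml = TTML[: len(TTML) // 2]
truncated_srt = SRT.split("\n\n")[0] + "\n"  # second cue lost entirely

print(is_well_formed(TTML), is_well_formed(truncated_ttml))
print(len(parse_srt_blocks(SRT)), len(parse_srt_blocks(truncated_srt)))
```

The truncated XML fails to parse because its root element is never closed, while the truncated SRT still reads as a valid, shorter file, so the loss goes unnoticed unless something else checks for it.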

Integrating TTML into MPEG-4 is, again, fairly easy due to the small size: it can simply all fit in one XML box, or be delivered as multiple segments in a trak. This has been defined for DECE and could be adopted into MPEG.

Point 4. Profiles.
There is a fairly comprehensive profiling mechanism built into TTML; however, paraphrasing a saying from ad land, 50% of the spec may well be redundant, the trouble is you don't know which 50%, and the answer is different depending on who you ask. I suspect (as I'm observing with WebVTT) that the discussions over a smaller profile will play out in a very similar way to the way they did in the TTWG. I think you'll find that TTML actually does cover more or less exactly what is needed and, as I said, can be implemented in its entirety for not much more effort than a subset.

General.
It is possible for a content author to just use the track mechanism and JavaScript APIs to implement TTML fairly effectively today (and by the same token WebVTT when it's done, SRT, SAMI or pretty much anything else). For at least 5 years we will have to deal with a legacy situation of browsers that don't implement <video> at all, and those that implement <video> but not <track>, so for a very long time it will be up to the content author to paper over the cracks anyway. If they are writing fallback content with plugins like Flash and Silverlight, and JavaScript, then it makes sense to reuse the same TTML files that those plugin players use today.

So I don't believe we actually achieve very much in the real world by trying to make a decision now. In a few years we may be able to see which format is gaining most ground in practice and make a decision then. The thing to do today is to ship HTML5 so that captioning is not precluded, allow pioneering content authors to write caption content in any format they choose, and wait and see how the browser vendors do on implementing <track> natively over time.

Cheers,
Sean.

-----Original Message-----
From: public-html-a11y-request@w3.org [mailto:public-html-a11y-request@w3.org] On Behalf Of David Singer
Sent: 11 February 2011 23:22
To: HTML Accessibility Task Force
Subject: using TTML for caption delivery, discussion

I've been following the discussion of WebVTT and W3C/SMPTE TTML, and I have some concerns.

TTML is a large specification, and SMPTE-TT extends it.  It was designed primarily as an interchange format, as the spec. says, and as such, it contains a rich set of features and capability of expression. I was on the committee in the early days, and, as I recall, we looked at traditional captioning delivery on TV, multimedia text formats and support for them, and other industry practices used in caption authoring.  If you want to capture every nuance of the text flow, so that you can keep that in a database, and use that source to deliver captions in a variety of formats (some of which may not support every nuance) it's a good format.  In some sense, it represents a union of the formats and use cases looked at.

However, if you want to deliver captions, temporally, over the web, TTML (W3C or SMPTE) may be less suitable.  One obvious example problem -- as far as I can tell, there is no requirement that the document be in time-order, and there doesn't even seem to be a feature designator to allow a profile to say that documents that conform to a profile requiring that feature must be in time order.  That means that you have to receive an entire document and parse it before you can display anything.
SVG has a progressive rendering feature; does TTML, or should it?

I think SMPTE has partially addressed this in SMPTE-TT by suggesting that the author divide his timeline into sections, where the timed text for each section is described by a complete TTML document. But I don't think this is a good general solution.  To take a specific example, in HTTP-based streaming, the media is segmented for fetching, and those segments can be quite short (a few seconds).  It may not be possible to find clean 'division points' in a TTML document at that granularity, and even if it is, repeating all the 'static' information may represent significant overhead.

It's worth noting that WebVTT is easily delivered incrementally, as the body is simply a list of cues in start-time order.
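
A minimal Python sketch of what incremental delivery could look like, offered only as an illustration: it assumes a simplified WebVTT-like syntax (a `WEBVTT` header block, then cues separated by blank lines, each with a `start --> end` timing line and an optional identifier line before it) and ignores cue settings, comments and other details of the real format. The function name `iter_cues` and the sample data are invented.

```python
def iter_cues(chunks):
    """Yield (start, end, text) as soon as each cue is complete.

    `chunks` is any iterable of text fragments, split at arbitrary points
    as they might arrive over a network connection.
    """
    buffer = ""
    header_seen = False
    for chunk in chunks:
        buffer += chunk
        # A blank line terminates a block, so everything before it is final.
        while "\n\n" in buffer:
            block, buffer = buffer.split("\n\n", 1)
            block = block.strip()
            if not block:
                continue
            if not header_seen:
                header_seen = True
                if block.startswith("WEBVTT"):
                    continue
            lines = block.split("\n")
            timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
            if timing is None:
                continue  # not a cue in this simplified grammar
            start, rest = lines[timing].split("-->", 1)
            end = rest.split()[0]  # drop anything after the end time
            yield start.strip(), end, "\n".join(lines[timing + 1:])
    # Note: a final cue with no trailing blank line is not emitted here.


# The stream is deliberately cut in the middle of timing lines.
CHUNKS = [
    "WEBVTT\n\n00:00:01.000 --> 00:0",
    "0:04.000\nHello\n\n00:00:05.000",
    " --> 00:00:08.000\nWorld\n\n",
]

for cue in iter_cues(CHUNKS):
    print(cue)
```

Because nothing later in the stream can change a cue that has already been terminated, the first cue can be shown after the second fragment arrives, without waiting for the rest.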

While it's useful and important that it be possible to deliver the captions as a side-car file to the main media file, it's a nuisance if it's not possible to make an integrated media file.  If the captions are truly 'part of the content' -- for example, they are textual material supplied as part of a lecture -- one really needs them to be in the content.  And our experience of handling multi-file presentations (movies split over several files) is that for distribution, databasing, and so on, it's much easier if the content is in one file.  Finally, having the captions inside means that, for example, selecting a time-range of a presentation (e.g. using a W3C media fragment identifier, or by using an editor) should 'just work'.  We need caption formats that can be inside MP4, Matroska, transport streams, etc.



I think if we are to proceed with TTML as part of the workflow, we ought to look at 
a) the mapping of TTML to WebVTT and maybe other caption formats; allow that to inform a TTML profile which describes what can be done in simpler formats like WebVTT;
b) defining a TTML profile that is more suited to delivery (for example, time-ordering) and the web (e.g. that uses CSS);
c) asking MPEG to consider how it can be carried in MP4 files and MPEG-2 Transport Streams; something that would make it more like 3GPP timed text, where the individual cues are in time order in the file, for example,
d) asking the same question of others who own file formats (such as Matroska).




David Singer
Multimedia and Software Standards, Apple Inc.
Received on Sunday, 13 February 2011 00:23:59 UTC