Member Confidential!

Timed Text Format Requirements list

From Geoff/NCAM: date 4/16/2001

Here's a list of requirements for a timed-text format, compiled from SYMM and WAI/PF comments. It isn't intended to be complete yet, so if you see things missing let me know and I'll add them. Opinions differ on a few topics, so some requirements may contradict others.

I DISPLAY

A timed-text format must or should...

Provide a means of giving richness or style to text (but *not* via a <font> element).
Be useable in all character sets.
Have a default UNICODE font.
Permit transparent overlay.
Permit text highlighting.
Allow for different display options (pop-on, roll-up, paint-on, etc.).
Allow user override of display.
Be able to display more than one speaker's captions simultaneously (for example, when more than one person is speaking at once).
Allow text to be positioned anywhere. This could be accomplished via simple placement commands (as it is currently), or something more complex like SVG.

II TIMING

A timed-text format must or should...

Allow for text to appear and disappear over time.
Permit the display of no text; that is, allow for erasure of text when it is not necessary.
Keep text and timing information together.
Keep text and timing separate, perhaps via two separate modules.

III ARCHITECTURE

A timed-text format must or should...

Be simple to author and easy to learn.
Be valid XML.
Be streamable.
Be cross-platform.
Allow hyperlinks via the HTML "a" tag.
Allow authors to protect their text from being intercepted or misused, if desired.
Be searchable.
Have a method for distinguishing one speaker from another. This could be accomplished by a) using simple placement commands (<center>, <left>, <right>, etc.); or b) creating a persona for the text spoken by each speaker, using a speaker="IDREF" attribute.
Allow the creation of collated transcripts which contain, and differentiate via markup, captions and audio descriptions.
Allow motion through the use of the SMIL animate element or other method.
Use SVG for complex font displays.
Allow the user to navigate through discrete timed media via SMIL interaction constructs.
Allow for the inclusion of multiple languages. These languages could be stored as separate text files, referenced from a SMIL file.
Allow for long-form presentation (e.g., it should support captions or subtitles for full-length movies or other long presentations).
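
As a rough illustration of option (b) in the speaker requirement above, the sketch below parses a made-up timed-text fragment in which each caption points at a persona through a speaker attribute. No syntax has been agreed; the element and attribute names (tt, persona, caption, begin, speaker) and the function name are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element and attribute name here is illustrative.
SAMPLE = """
<tt>
  <head>
    <persona id="anna" name="Anna"/>
    <persona id="ben" name="Ben"/>
  </head>
  <body>
    <caption begin="0s" speaker="anna">Where are you going?</caption>
    <caption begin="2s" speaker="ben">To the match.</caption>
    <caption begin="2s" speaker="anna">Wait for me!</caption>
  </body>
</tt>
"""


def captions_by_speaker(xml_text):
    """Return (begin, speaker name, text) for each caption, in document order."""
    root = ET.fromstring(xml_text)
    # Build the persona table once, keyed by the id that captions refer to.
    names = {p.get("id"): p.get("name") for p in root.iter("persona")}
    rows = []
    for cap in root.iter("caption"):
        ref = cap.get("speaker")
        if ref not in names:
            raise ValueError(f"unknown speaker reference: {ref!r}")
        rows.append((cap.get("begin"), names[ref], cap.text))
    return rows
```

Because the speaker is data rather than a screen position, a player could still place, color, or label each persona as the user prefers, and two captions with the same begin time remain distinguishable.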

IV OTHER

A timed-text format must or should...

Allow other ways to display text; for example, via text balloons.

Below are the comments received so far regarding a timed-text format.

From Geoff:

Here are a few ideas for the list of requirements for a standard timed text format.

1. Full placement options within the text region. Captions need to move around the text region to indicate who is speaking. I'm not talking about animation, but just that the author must be able to stick the text in a specific spot: <left>, <center>, <right>, <top>, <XY>, for example.
2. Transparent overlay. Authors must be able to put text in a transparent box over video, similar to subtitles on film.
3. Text highlighting. The ability to synchronize highlighting of text with words as they are spoken is a feature many people are asking for: word by word, sentence by sentence, phrase by phrase, etc. Highlighting should not be limited to one specific text-display option (see #7).
4. Search. Having a searchable text track is invaluable for indexing purposes and also teaching purposes. You can find an example of this already in QuickTime's QText.
5. Styles. Color, font weight, size, face, etc., should all be available to the author, which leads me to...
6. User override of display. Again, perhaps this is not a text-format issue, but users should be able to change the presentation of text to suit their needs, e.g., increasing font size or changing/adding colors and backgrounds.
7. Various text-display options. In addition to your garden-variety pop-on captions, other display styles are necessary: scrolling text (top down or bottom up) and crawling text (left to right or right to left).
8. Collated transcripts containing captions and text-audio descriptions. This is a big issue for the WAI: if a multimedia clip contains both captions and audio descriptions, the user should be able to access a transcript containing the caption text plus a text version of the audio descriptions and of the program audio. See the last-call version of the WAI User Agent Guidelines for more info: http://www.w3.org/TR/2000/WD-UAAG10-20001023/#gl-content-access
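
To make the collated-transcript idea concrete, here is a minimal sketch that merges a caption track and an audio-description track into one time-ordered transcript while keeping the two kinds distinguishable. The tuple layout (seconds, text), the "caption"/"description" labels, and the function name are assumptions for illustration; they are not part of any proposed format.

```python
import heapq


def collate(captions, descriptions):
    """Merge two time-sorted lists of (seconds, text) into one transcript.

    Each output entry is (seconds, kind, text), where kind is "caption"
    or "description" so the two sources stay differentiated.
    """
    tagged_caps = ((t, "caption", text) for t, text in captions)
    tagged_desc = ((t, "description", text) for t, text in descriptions)
    # Both inputs are already sorted by time, so a merge is enough;
    # only the time (first field) is used for ordering.
    return list(heapq.merge(tagged_caps, tagged_desc, key=lambda e: e[0]))
```

A renderer could then mark the two kinds up differently, for example by prefixing descriptions or setting them in a different style.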

From Warner:

I think we have "one" requirement, although it might hide a few of them:

- The mechanism for synchronizing a timed-text document by a SMIL document must be generic, in the sense that other forms of similar timed media can also be controlled and interacted with. The problem is the synchronization with a stream of discrete media, where the discrete media have been marked up with timing/synchronization information.
An example could be a series of (timed) images. The user should be able to navigate through the timed discrete media through the SMIL document, using SMIL interaction constructs (events or animation; also transitions?).
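
One way to picture this generic navigation is a small model in which each discrete item (a text block or an image) has only a begin time, and the player asks which item is active and where the next or previous one starts. This is a sketch under that assumption; the class and method names are invented here and do not correspond to any SMIL construct.

```python
from bisect import bisect_right


class TimedSequence:
    """Discrete items (text blocks, images, ...) with begin times in seconds."""

    def __init__(self, items):
        # items: list of (begin_seconds, payload), sorted by begin time.
        self.begins = [begin for begin, _ in items]
        self.payloads = [payload for _, payload in items]

    def index_at(self, t):
        """Index of the item active at time t, or None before the first item."""
        i = bisect_right(self.begins, t) - 1
        return i if i >= 0 else None

    def next_begin(self, t):
        """Begin time of the first item starting after t, or None at the end."""
        i = bisect_right(self.begins, t)
        return self.begins[i] if i < len(self.begins) else None

    def previous_begin(self, t):
        """Begin time of the item before the active one, or None if there is none."""
        i = self.index_at(t)
        if i is None or i == 0:
            return None
        return self.begins[i - 1]
```

Nothing in the model depends on the payload being text, which is the point of the requirement: the same "next" and "previous" controls would work for a timed image series.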

From Erik:

Use cases:

Case 1: A closed-caption text stream coming from a live source should be possible. End users should be able to "tune in" to the text presentation at any time after it has begun.

Case 2: A long-format subtitles presentation should be possible. I've seen a 1MB+ text file that contains all CC information from a four-hour speech. Keep in mind that this 1MB file (in RealText format) sits on a server on the Internet, not on a player-side CD ROM or hard drive.
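
To put the size in Case 2 in perspective, the short calculation below estimates how long the whole file would take to transfer before playback could start if nothing were streamed. It assumes 1 MB means 1,000,000 bytes and a nominal 56,000 bit/s modem link with no protocol overhead; real figures would be somewhat worse.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized transfer time: total bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


# 1 MB over a nominal 56 kbit/s modem: 8,000,000 bits / 56,000 bit/s
wait = transfer_seconds(1_000_000, 56_000)
print(f"about {wait:.0f} seconds")
```

Under these assumptions the wait is well over two minutes, which is why delivery in pieces matters for long presentations.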

Requirements:

The following requirements are STT "musts". STT must:

1. be extremely simple to author, and must be easy to learn. Current Web authors should be able to see->do->teach STT in a short period of time; otherwise I believe it will never be widely adopted. The purpose of STT is to provide a standardized means of presenting text over time. If we get too ambitious, we'll be reinventing SVG. Also, the simpler it is, the faster we can agree on what its syntax and semantics are.
2. be valid XML. However, cases 1 and 2 above mean that an application may not be able to know in advance whether the entire presentation validates, because playback must be able to begin before the entire presentation has been parsed and/or created. How XML errors that are encountered on-the-fly are handled is something for which I don't have a good answer. That might be best left to the implementation to decide.
3. be "streamable". Users should be able to seek from point A forward to point B within an STT presentation and not have to wait for the entire contents between A and B to be sent before playback can resume. (Otherwise, imagine how long you would have to wait if you were to seek to the end of the 1MB presentation I mentioned above on a 56K modem.)
4. be cross-platform; it should not rely on any technology particular to a specific operating system (e.g., should not use "Arial" or "Helvetica" fonts).
5. be useable in all character sets. The default font must be a UNICODE font and must be shipped with any application that displays STT presentations. This font must be free of royalties and free of any other fees for its use in any STT application.
6. contain an STT version in each STT file and live stream. An STT-compliant application must not attempt to handle an STT presentation with a version that it does not recognize. In my opinion (shared by many others), many problems encountered when trying to author HTML for the Web arise because an author has to write for the least-capable browser in order to be sure that all users will see the page as that author intended. (Note: we're solving this in SMIL 2.0 by requiring a default namespace.) The SMIL mechanism for switching and for skipping unknown content should be incorporated as well. Of course, this is not an issue if we keep STT simple, get it right the first time, and leave it that way.
7. allow text to appear and disappear over time. An author should not have to give an end time to a block of text; it must be possible for new text to interrupt (end) prior text. The SMIL 2.0 "excl" time container is a (soon-to-be) standard way to enable this. Imagine use case 1 above. Someone is typing in the score of a World Cup match, which stands at 0 0, when a goal is scored. The person creating the live stream can't know how long to display the text "0 0" until after "0 0" is sent over the wire. The application then displays "0 0" until it receives new instructions telling it to now display "1 0", followed by "1 1", etc. By giving an STT file a body element and making it an excl time container, an author can interrupt old text with new by making each such block of text a child of the parent exclusive time container. Note that this is how CC and subtitles are presented in most cases on TV and in films.
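
The interrupting behaviour described in the World Cup example can be sketched as follows: each block of text carries only a begin time, and its end is implied by the arrival of the next block. This models the idea only, not the SMIL 2.0 excl element itself; the function name and the (begin, text) layout are assumptions for illustration.

```python
def resolve_exclusive(blocks, stream_end=None):
    """Derive display intervals when new text interrupts old text.

    blocks: list of (begin_seconds, text) in the order sent, with
            non-decreasing begin times.
    stream_end: time the stream closed, or None while it is still live.
    Returns a list of (begin, end, text); the last end is stream_end.
    """
    resolved = []
    for i, (begin, text) in enumerate(blocks):
        if i + 1 < len(blocks):
            end = blocks[i + 1][0]  # interrupted by the next block
        else:
            end = stream_end        # still showing until told otherwise
        resolved.append((begin, end, text))
    return resolved


# The score stays on screen until the next update arrives.
score = resolve_exclusive([(0, "0 0"), (1260, "1 0"), (3000, "1 1")])
```

Here the author never states how long "0 0" is shown; the interval from 0 to 1260 seconds falls out of the next update, and the final block has no end while the stream is live.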

These requirements are not a "must", but a strong *should*:
provide a means of giving richness to text (without overdoing it; requirement (1), above, is extremely important). In my opinion, the following XHTML elements should be allowed: em, strong, h, br (with closing '/'), p, pre, hr, ol, ul, li. Note: I did not look at XHTML modules to see whether the above only supports partial modules (which I know is the case); the above choices are based on my four years of experience with streaming text and my desire to keep STT as simple as possible.
allow hyperlinks via the HTML "a" tag. However, note that use of "name" and links with fragment identifiers to jump to that named spot in the presentation can conflict with requirement (3). Perhaps a required table of named anchors must exist in the header of the STT presentation in order for the client application to process requests to jump to such a spot. Live ones would be ignored.
take into consideration the probability that authors will want to be able to protect their text from being intercepted and/or used for purposes other than intended.
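
The table of named anchors suggested in the hyperlink requirement above might work along these lines: the header maps each anchor name to a known time, so a client can seek straight to a named spot without receiving everything in between. This sketch reads "live ones would be ignored" as meaning that anchors whose position is not yet known are left out of the table; that reading, the function names, and the data layout are all assumptions.

```python
def build_anchor_table(anchors):
    """Map anchor names to begin times, skipping anchors with no known time.

    anchors: iterable of (name, begin_seconds or None); None stands for a
             position that cannot be known in advance (e.g. live text).
    """
    return {name: begin for name, begin in anchors if begin is not None}


def resolve_fragment(table, href):
    """Return the seek time for 'file#name' or '#name', or None to ignore it."""
    _, separator, fragment = href.partition("#")
    if not separator:
        return None  # no fragment identifier, nothing to seek to
    return table.get(fragment)
```

A client holding the table can answer a jump request immediately, which keeps named links compatible with the streaming requirement.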

The following are in RealText but should *NOT* be in STT:

STT should not contain a means for giving motion to text. This is a very complicated thing to handle when having to calculate where text is over time considering word wraps, seeking, live-text streaming, variable-sized characters, hyperlink locations, ...etc. Motion should be done by use of the animate element in SMIL (or via the DOM). Also, some screen readers may not do well reading moving text.
STT should not contain a "font" element. SVG or other technologies should be used for complex font displays. (See requirements (4) and (5) above.)

I once read that experts are simply people who have made all the mistakes in their field. Given that definition, I guess that makes me an "expert" at timed text. Please include me in any further discussions of STT as I will be more than happy to provide input and feedback.

The WAI Protocols and Formats group (WAI/PF) held another discussion on the timed-text format at yesterday's teleconference. We've also been discussing it on the list for a couple of weeks. Below are notes collated from the telecon and the list. AG = Al Gilman; GF = Geoff Freed; CMN = Charles McCathieNevile.

Last revised $Date: 2001/04/20 15:22:50 $ by $Author: tmichel $