Member Confidential!
Timed Text Format Requirements list
From Geoff/NCAM: date 4/16/2001
Here's a list of requirements for a timed-text format, compiled from SYMM
and WAI/PF comments. It isn't intended to be complete yet, so if you see
things missing let me know and I'll add them. Opinions differ on a few topics,
so some requirements may contradict others.
I DISPLAY
A timed-text format must or should...
-
Provide a means of giving richness or style to text (but *not* via a
<font> element).
-
Be useable in all character sets.
-
Have a default UNICODE font.
-
Permit transparent overlay.
-
Permit text highlighting.
-
Allow for different display options (pop-on, roll-up, paint-on, etc.).
-
Allow user override of display.
-
Be able to display more than one speaker's captions simultaneously (for example,
when more than one person is speaking at once).
-
Allow text to be positioned anywhere. This could be accomplished via simple
placement commands (as it is currently), or something more complex like SVG.
II TIMING
A timed-text format must or should...
-
Allow for text to appear and disappear over time.
-
Permit the display of no text; that is, allow for erasure of text when it
is not necessary.
-
Keep text and timing information together.
-
Keep text and timing separate, perhaps via two separate modules.
III ARCHITECTURE
A timed-text format must or should...
-
Be simple to author and easy to learn.
-
Be valid XML.
-
Be streamable.
-
Be cross-platform.
-
Allow hyperlinks via the HTML "a" tag.
-
Allow authors to protect their text from being intercepted or misused, if
desired.
-
Be searchable.
-
Have a method for distinguishing one speaker from another. This could be
accomplished by a) using simple placement commands (<center>,
<left>, <right>, etc.); or b) creating a persona for text which
is spoken by each speaker using a speaker="IDREF" attribute.
-
Allow the creation of collated transcripts which contain, and differentiate
via markup, captions and audio descriptions.
-
Allow motion through the use of the SMIL animate element or another method.
-
Use SVG for complex font displays.
-
Allow the user to navigate through discrete timed media via SMIL interaction
constructs.
-
Allow for the inclusion of multiple languages. These languages could be stored
as separate text files, referenced from a SMIL file.
-
Allow for long-form presentation (e.g., it should support captions or subtitles
for full-length movies or other long presentations).
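As a rough illustration of option (b) in the speaker requirement above, the Python sketch below parses a made-up timed-text fragment in which each caption points to a declared persona through a speaker attribute. The element names (tt, persona, caption), the begin attribute, and the sample text are all hypothetical; nothing here is proposed syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only.
SAMPLE = """
<tt>
  <head>
    <persona id="s1" name="Interviewer"/>
    <persona id="s2" name="Guest"/>
  </head>
  <body>
    <caption begin="0s" speaker="s1">Welcome to the show.</caption>
    <caption begin="3s" speaker="s2">Glad to be here.</caption>
    <caption begin="5s" speaker="s1">Let's get started.</caption>
  </body>
</tt>
"""

def captions_by_speaker(xml_text):
    """Return a list of (begin, speaker name, text) tuples."""
    root = ET.fromstring(xml_text)
    # Build the persona table first so each IDREF can be resolved.
    personas = {p.get("id"): p.get("name") for p in root.iter("persona")}
    result = []
    for cap in root.iter("caption"):
        ref = cap.get("speaker")
        if ref not in personas:
            raise ValueError(f"caption refers to undeclared persona {ref!r}")
        result.append((cap.get("begin"), personas[ref], cap.text))
    return result

for begin, name, text in captions_by_speaker(SAMPLE):
    print(f"{begin:>3}  {name}: {text}")
```

Because the speaker is carried as data rather than as screen position, a player could still choose to place, color, or label each persona differently, or let the user override that choice.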
IV OTHER
A timed-text format must or should...
-
Allow other ways to display text; for example, via text balloons.
Below are the comments received so far regarding a timed-text format.
From Geoff:
Here are a few ideas for the list of requirements for a standard timed text
format.
-
Full placement options within the text region: Captions need to move around
the text region to indicate who is speaking. I'm not talking about animation,
but just that the author must be able to stick the text in a specific spot:
<left>, <center>, <right>, <top>, <XY>, for example.
-
Transparent overlay: Authors must be able to put text in a transparent box
over video, similar to subtitles on film.
-
Text highlighting: The ability to synchronize highlighting of text with words
as they are spoken is a feature many people are asking for: word by word,
sentence by sentence, phrase by phrase, etc. Highlighting should not be limited
to one specific text-display option (see item 7, "Various text-display options," below).
-
Search: Having a searchable text track is invaluable for indexing purposes
and also teaching purposes. You can find an example of this already in
QuickTime's QText.
-
Styles: Color, font weight, size, face, etc., should all be available to the
author, which leads me to...
-
User override of display: Again, perhaps this is not a text-format issue,
but users should be able to change the presentation of text to suit their
needs, e.g., increasing font size or changing/adding colors and backgrounds.
-
Various text-display options: In addition to your
garden-variety pop-on captions, other display styles are necessary: scrolling
text (top down or bottom up) and crawling text (left to right or right to
left).
-
Collated transcripts containing captions and text-audio descriptions: This
is a big issue for the WAI: if a multimedia clip contains both captions and
audio descriptions, the user should be able to access a transcript containing
the caption text plus a text version of the audio descriptions and of the
program audio.
See the last-call version of the WAI User Agent Guidelines for more info:
http://www.w3.org/TR/2000/WD-UAAG10-20001023/#gl-content-access
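To make the search item above concrete, here is a minimal Python sketch of a searchable text track: captions are held as (start time in seconds, text) pairs, and a case-insensitive search returns the times at which a word occurs. The Cue type, the find function, and the sample captions are assumptions for illustration; this is not how QText or any existing player does it.

```python
from typing import List, NamedTuple

class Cue(NamedTuple):
    start: float  # seconds from the beginning of the clip
    text: str

def find(cues: List[Cue], query: str) -> List[float]:
    """Return the start times of every cue whose text contains the query."""
    needle = query.casefold()
    return [cue.start for cue in cues if needle in cue.text.casefold()]

# Invented sample data.
track = [
    Cue(0.0, "Good evening."),
    Cue(2.5, "Tonight we look at captioning on the Web."),
    Cue(6.0, "Captioning helps everyone."),
]

print(find(track, "captioning"))
```

The returned times could serve either as an index for the clip or as seek points for a student looking for a particular passage.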
From Warner:
I think we have "one" requirement, although it might hide a few of them:
- The mechanism for synchronizing a timed-text document from a SMIL document
must be generic, in the sense that other forms of similar timed media
can also be controlled and interacted with. The problem is the synchronization
with a stream of discrete media, where the discrete media have been marked-up
with timing/synchronization information.
An example could be a series of (timed) images. The user should be able to
navigate through the timed discrete media by means of the SMIL document, using
SMIL interaction constructs (events or animation; also transitions?).
From Erik:
Use cases:
Case 1: A closed-caption text stream coming from a live source should be
possible. End users should be able to "tune in" to the text presentation
at any time after it has begun.
Case 2: A long-format subtitles presentation should be possible. I've seen
a 1MB+ text file that contains all CC information from a four-hour speech.
Keep in mind that this 1MB file (in RealText format) sits on a server on
the Internet, not on a player-side CD ROM or hard drive.
Requirements:
The following requirements are STT "Must":
-
be extremely simple to author, and must be easy to learn. Current Web authors
should be able to see->do->teach STT in a short period of time; otherwise,
I believe it will never be widely adopted. The purpose of STT is to provide
a standardized means of presenting text over time. If we get too ambitious,
we'll be reinventing SVG. Also, the simpler it is, the faster we can agree
on what its syntax and semantics are.
-
be valid XML. However, cases 1 and 2 above mean that an application may not
be able to know in advance if the entire presentation validates because playback
must be able to begin prior to the entire presentation being parsed and/or
created. How XML errors that are encountered on-the-fly are handled is something
for which I don't have a good answer. That might be best left to the
implementation to decide.
-
be "streamable". Users should be able to seek from point A forward to point
B within an STT presentation and not have to wait for the entire contents
between A and B to be sent before playback can resume. (Otherwise, imagine
how long you would have to wait if you were to seek to the end of the 1MB
presentation I mentioned above while on a 56K modem.)
-
be cross-platform; it should not rely on any technology particular to a specific
operating system (e.g., should not use "Arial" or "Helvetica" fonts).
-
be useable in all character sets. The default font must be a UNICODE font
and must be shipped with any application that displays STT presentations.
This font must be free of royalties and free of any other fees for its use
in any STT application.
-
contain an STT version in each STT file and live stream. An STT-compliant
application must not attempt to handle an STT presentation with a version
that it does not recognize. In my opinion (shared by many others), many problems
encountered when trying to author HTML for the Web arise due to the fact
that an author has to write for the least-capable browser in order to be
sure that all users will see the page as that author intended. (Note: we're
solving this in SMIL 2.0 by requiring a default namespace.) The SMIL mechanism
for switching and for skipping unknown content should be incorporated as
well. Of course, this is not an issue if we keep STT simple, get it right
the first time, and leave it that way.
-
allow text to appear and disappear over time. An author should not have to
give an end time to a block of text; it must be possible for new text to
interrupt (end) prior text. The SMIL 2.0 "excl" time container is a (soon-to-be)
standard way to enable this. Imagine use case 1 above. Someone is typing
in the score of a World Cup match, which stands at 0 0, when a goal is scored.
The person creating the live stream can't know how long to display the text
"0 0" until after "0 0" is sent over the wire. The application then displays
"0 0" until it receives new instructions telling it to now display "1 0",
followed by "1 1", ...etc. By giving an STT file a body element and making
it an excl time container, an author can interrupt old text with new by making
each such block of text a child of the parent exclusive time container. Note
that this is how CC and subtitles are presented in most cases on TV and in
films.
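The interruption behaviour described in the last item can be sketched in a few lines of Python. This is not the SMIL 2.0 excl algorithm, only an imitation of its effect under the assumption that each block carries a begin time and no end time: a block stays on screen until the next one begins. The function names and the score timings are invented for illustration.

```python
from typing import List, Optional, Tuple

Block = Tuple[float, str]  # (begin time in seconds, text); no end time given

def resolve_ends(blocks: List[Block]) -> List[Tuple[float, Optional[float], str]]:
    """Give each block an end time equal to the begin time of the next block.

    The last block gets None: in a live stream nobody knows yet when it
    will be interrupted.
    """
    ordered = sorted(blocks, key=lambda b: b[0])
    resolved = []
    for i, (begin, text) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else None
        resolved.append((begin, end, text))
    return resolved

def text_at(blocks: List[Block], t: float) -> Optional[str]:
    """Return the text visible at time t, or None before the first block."""
    for begin, end, text in resolve_ends(blocks):
        if begin <= t and (end is None or t < end):
            return text
    return None

# Invented timings for the World Cup score example.
score = [(0.0, "0 0"), (1380.0, "1 0"), (2700.0, "1 1")]

print(resolve_ends(score))
print(text_at(score, 1500.0))
```

The same lookup also covers the "tune in at any time" use case: a viewer joining at a given moment is shown whichever block most recently began.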
These requirements are not a "must", but a strong *should*:
-
provide a means of giving richness to text (without overdoing it; requirement
(1), above, is extremely important.) In my opinion, the following XHTML elements
should be allowed: em, strong, h, br (with closing '/'), p, pre, hr, ol,
ul, li. Note: I did not look at XHTML modules to see whether the above only
supports partial modules (which I know is the case); the above choices are
based on my four years of experience with streaming text and my desire to
keep STT as simple as possible.
-
allow hyperlinks via the HTML "a" tag. However, note that use of "name" and
links with fragment identifiers to jump to that named spot in the presentation
can conflict with requirement (3). Perhaps a required table of named anchors
must exist in the header of the STT presentation in order for the client
application to process requests to jump to such a spot. Live ones would be
ignored.
-
take into consideration the probability that authors will want to be able
to protect their text from being intercepted and/or used for purposes other
than intended.
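One way to picture the table of named anchors suggested in the hyperlink item above is as a simple name-to-time lookup that the client reads from the header before playback. The Python sketch below assumes such a table exists; the anchor names, the times, the ".stt" file name, and the seek_target function are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical anchor table, as it might be read from a presentation header:
# anchor name -> offset in seconds from the start of the presentation.
ANCHORS: Dict[str, float] = {
    "intro": 0.0,
    "keynote": 900.0,
    "questions": 12600.0,
}

def seek_target(href: str, anchors: Dict[str, float]) -> Optional[float]:
    """Resolve a link such as 'speech.stt#keynote' to a seek time.

    Returns None when the link has no fragment or names an unknown anchor,
    in which case the client can ignore the request.
    """
    _, sep, fragment = href.partition("#")
    if not sep:
        return None
    return anchors.get(fragment)

print(seek_target("speech.stt#keynote", ANCHORS))
print(seek_target("speech.stt#missing", ANCHORS))
```

With the table known in advance, a client could ask the server for data starting at the anchor's time instead of receiving everything that precedes it, which is what keeps named links compatible with the streamability requirement.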
The following are in RealText but should *NOT* be in STT:
-
STT should not contain a means for giving motion to text. This is a very
complicated thing to handle when having to calculate where text is over time
considering word wraps, seeking, live-text streaming, variable-sized characters,
hyperlink locations, ...etc. Motion should be done by use of the animate
element in SMIL (or via the DOM). Also, some screen readers may not do well
reading moving text.
-
STT should not contain a "font" element. SVG or other technologies should
be used for complex font displays. (See requirements (4) and (5) above.)
I once read that experts are simply people who have made all the mistakes
in their field. Given that definition, I guess that makes me an "expert"
at timed text. Please include me in any further discussions of STT as I will
be more than happy to provide input and feedback.
The WAI Protocols and Formats group (WAI/PF) held another discussion
on the timed-text format at yesterday's teleconference. We've also been
discussing it on the list for a couple of weeks. Below are notes collated from
the telecon and the list. AG = Al Gilman; GF = Geoff Freed; CMN = Charles
McCathieNevile.
Last revised $Date: 2001/04/20 15:22:50 $ by $Author: tmichel $