Re: SMIL: draft items for UA guidelines


Thanks very much for your comments. I'm bouncing this message to the User
Guidelines Working Group, as well as other people cc'd on the original
message, for additional comment, preferably by October 6.

Here are my brief comments; please excuse (or correct) my very rough
paraphrasing of your points. I've included your complete message below. 

The original message is at if any
one is looking for it.

#1. Reference to SMIL should be to SMIL 1.0 -- Agreed.

#2. Would be useful to have captioning conventions for how non-speech
sounds and additional audio information such as change-of-language are
represented; what exists? -- Agreed. Geoff, can you help us out there at
all; what exists to date?

#3. SMIL needs to be extended to be able to handle audio descriptions of
video images (see sample of suggested code below) -- Agreed, and the method
you propose looks good to me, although I would want to be sure that it is a
solution that is easy to implement. Could other people please review and
comment on this - Philipp, George, Geoff? ...I also think that we need a
mechanism for text descriptions of video images, which my original draft
didn't deal with adequately.

#8. Repositioning of captions could be handled by switch statements... --
Unclear here. Would captions that are overlaid on video media object be
controllable in this manner, or is the idea that user preferences should
require captions to be displayed elsewhere than on top of a video media

#9. Providing user control over presentation pace is complex and the
following issues need to be considered... -- Agreed. Need to consider to
what extent hard synchronization is required among other media objects that
have timing relationships with the audio (or perhaps video?) that is being
pace-controlled. Comments or suggestions from anyone else on this?



[Lloyd's message follows]

Subject: Re: SMIL: draft items for UA guidelines 
Date: Mon, 21 Sep 1998 18:15:12 +0200
From: Lloyd Rutledge <>

Hi Judy,

I'm just now back from my travels, which included our meeting at the
WOW breakfast meeting.  I was glad to see your UA guidelines email and
to see that the consideration UA with SMIL is moving forward.  Below
are some comments I have on the draft items.

> 1. User should be able to identify and switch text captions of audio
> objects on & off. [Priority one]
> - Rationale: Some users require captions anytime they are available, in
> which case users should be able to set a preference in the user preferences
> to view captions; other users require captions only in certain
> circumstances, in which case they need a mechanism to identify when
> captions are available, and to turn them on while a presentation is
> playing. Mechanism for identification of captions should also function
> non-visually as some users can neither hear audio files nor see captions
> but can still access captions through screen-reading software and
> refreshable Braille display.
> - Technique: Provide user interface to switch display of media objects with
> "system-caption" test attribute on and off. This must be possible both in
> the static user preferences, and dynamically while the presentation is
> playing.

This guideline applies to implementors of SMIL 1.0, which contrasts
with guidelines that apply to the working group for the next version
of SMIL.  It might help to distinguish between these guidelines so
that the different audiences reading the draft can readily identify
what applies most to them.

> 2. User should be able to control size, color, and background color of
> captions. [Priority one]
> - Rationale: Some users require specific font size, color, and contrast
> with caption background to be able to view captions.
> - Technique: Provide user interface to change size, color, and background
> color of media objects with "system-caption" test attribute. This must be
> possible both in the static user preferences, and dynamically while the
> presentation is playing.

This is an issue of style and specifying style for particular media
objects.  This issue is primarily in the domain of CSS.  SMIL includes
document portions of other media formats and integrates them into a
unified presentation.  If CSS or an equivalent does or is made to
apply any media type then we should ensure that the SMIL format and
the players for it are kept compatible.

Is there a captions format with markup/structure that identifies what
portions of the captioning are things such as

 - different speakers
 - music
 - various noises (telephone, doorbell, car horn, siren, whistle, etc.)
 - tones of voice
 - brief occurrences of different languages not necessarily intended to 
   be understood by the audience

Having a caption format that identifies such things abstractly rather
than just giving their text description is important because the
techniques for identifying them to the user can vary between
presentations.  Style specs can then say how these abstractions should
be presented to the user given the user's preferences.

> 3. User should be able to identify and turn on and off audio descriptions
> of video objects. [Priority one]
> - Rationale: Users who cannot see a video media object need a non-visual
> way to identify that an audio description is available.
> - Technique: Provide a standard mechanism for notifying third party
> assistive technologies (e.g. screen-reading software) of the existence of
> an audio description for a video object. Provide a mechanism (which can
> function non-visually) for turning on or off the audio description.

This applies both to implementors and the SMIL working group.  It
applies for the SMIL working group because the format currently does
not explicitly address audio descriptions.

We could introduce a "system-audio-desc" attribute that works just
like the system-captions attribute but for audio description rather
than captions.  This new attribute can be included in the next version
of SMIL.  In the meantime, it can be included uniformly by
implementors and specified as a private extension to SMIL in the
manner described in the SMIL appendix.

The current SMIL 1.0 time model handles the time issues related to
audio description well.  After seeing Geoff's demo of audio
description, which we both way at the breakfast meeting, I came up
with the example SMIL code below

    <video id="video" fill="freeze"
           src="video.mpeg" clip-begin="npt=0s" clip-end="npt=5s"/>
    <audio src="audio.wav"  clip-begin="npt=0s" clip-end="npt=5s"/>
    <audio begin="id(video)(6s) system-audio-desc="on" src="desc.wav"/>
     <video src="video.mpeg" clip-begin="npt=6s" clip-end="npt=10s">
     <audio src="audio.wav"  clip-begin="npt=6s" clip-end="npt=10s">

This presentation has a video file and an accompanying primary audio
file that is the main soundtrack for the video.  There is also one two
second long audio description file.  This describes a visual event
that takes place five seconds into the video.  There is no natural
silent period in the soundtrack into which an audio description could
fit.  The behavior of this presentation when audio description is on
is that the video and sound track play for five seconds, then the
sound track stops and the video freezes for five seconds while the
audio description plays and then the remaining five seconds of the
video and audio play.  This scenario enables a sighted viewer and a
non-sighted listener to watch the same presentation simultaneously,
with the pausing video keeping the viewer engaged.  If audio
description is turned off, the same presentation should play the full
10 seconds of video and soundtrack audio with no perceivable pauses or

This presentation is represented in the SMIL code above as a sequence
of five second sub-presentations.  In each, corresponding five second
segments of the video and soundtrack audio are played in parallel.
The first five second segment also includes a switch statement chooses
between playing an empty ref element, effectively doing nothing, and
playing the audio description file describing the event that happens
five seconds into the video.  This switch statement is timed to start
5 seconds into the playing of the video.  If the audio description is
not played, the switch statement does not do anything, thus ending
immediately.  This causes the end time of the par element to be the
same as the duration of the video clip.  In this case, the showing of
the first video clip should play seamlessly into the second.  However,
if audio description is on, the five second description file will
play, bringing the duration of the first par element to 10 seconds.
Because the video on this par is only 5 seconds long, when the video
clip ends the fill attribute value "freeze" takes effect, causes the
final frame of the video to stay on the screen until the audio
description ends.

This same technique can exploit the occurances of silent periods in
the soundtrack.  If there is a soundtrack silent period that coincides
with a visual event, then the audio description should play during
that silence.  For the duration of this silent period, it is not
necessary to freeze the video while the audio description is being
played because the audio description would not drown out any important
soundtrack sound.  However, if the audio description is longer than
the silent period, you may still want to freeze the video at the end
of the silent period.  The technique above can address this by
sychronizing not with the very end of the video but with the time in
the audio when the silent period begins, and clipping the audio so
that the end of the audio of the end of the silent period.

> 8. User should be able to reposition captions. [Priority two]
> - Rationale: Some multi-media presentations will include positioning
> conflicts between captions which can obscure key visual elements of video
> media objects.
> - Technique: Provide mechanisms to control caption display location
> dynamically and through user preferences.

This could be handled by switch statements that choose between
alternate screen layout specifications whose selection is based on
SMIL test attributes that refer to user characteristics, such as
system-captions and the system-audio-desc attribute proposed in this

> 9. User should be able to dynamically control rate for audio media objects.
> [Priority two]
> - Rationale: Users with aural-processing learning disabilities may require
> a slower pace of audio; users who are experienced users of synthesized
> speech may be able to tolerate a far faster pace of audio.
> - Tecnique: Provide mechanism for dynamic control of presentation pace.

If browsers have general presentation rate controls for the user, then
the whole presentation could be slowed in a synchronous fashion to the
point at which the user can comprehend the audio.  However, this may
not always be desired.  For example, if the presentation includes
video or non-speech audio that is not played simultaneously with
speech, then the speech processing impaired would have no need for
those parts of the presentation to be slowed.  Further, such rate
control for an entire presentation is very hard to implement.

This is another issue that style sheets apply to.  Auditory style
sheets could be written for a user that say how much to slow down
speech audio and at what rate to synthesize speech.

Some degree of rate-control can be specified with a switch element
choosing between alternative sound files whose choice is based on test
attributes.  Test attributes that apply here are not explicitly
defined in SMIL, but, as described earlier, there are extension
techniques described in the document that can be used to add test
attributes.  Perhaps an appropriate collection of test attributes for
this issue can be determined and included in a future version of SMIL.

Discussion of audio rate control should also address how to play other
media objects that have timing relationships with the audio.  If the
audio is slowed, what other media object should be slowed, and how?

One technique for addressing this with SMIL is to use code similar to
the example given for issue 3, above.  The audio and video media
objects accompanying the affected audio object could be divided into
clips, along with the affected audio object.  These clips would be
established at key event points whose synchronicity should be
maintained.  This would cause video clips to sometimes be frozen and
sound clips to pause at these key points in order to maintain overall
temporal consistency.

Note that relational timing in SMIL is in terms of the playing time of
the references media object, not in terms of ideal playing time or the
playing time of the overall presentation.  For example, if a video is
triggered to start five seconds into an audio clip, and the audio clip
is slowed to half speed, the vidoe will actually play after 10 seconds
real time.

Is more continuous synchronicity needed to properly address issue 9?
For example, if an audio file with speech is slowed down, should an
accompanying video be slowed down at exactly the same pace so that lip
synching is maintained?  If this is true, then hard synchronization
would need to be implemented in all SMIL players.  Hard
synchronization is discussed in the specification as an allowable
alternative for handling synchronization, but it is not required for
use in all cases.  A future version of SMIL could provide the author
with the ability to specify what parts of a presentation should have
hard synchronization and what should not.  If hard synchronization
were important for using a dynamic control rate for audio media
objects, then this need would motivate hard synchronizations inclusion
in the next version of SMIL.


Lloyd Rutledge                                         vox: +31 20 592 41 27
CWI (Centrum voor Wiskunde en Informatica)             fax: +31 20 592 41 99
Kruislaan 413,  NL-1098 SJ Amsterdam, The Netherlands  net:
P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands

Judy Brewer    +1.617.258.9741
Director, Web Accessibility Initiative (WAI) International Program Office
World Wide Web Consortium (W3C)
MIT/LCS Room NE3-355, 545 Technology Square, Cambridge, MA, 02139, USA

Received on Wednesday, 30 September 1998 15:19:23 UTC