
Re: [media] how to support extended text descriptions

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Mon, 6 Jun 2011 12:16:06 +1000
Message-ID: <BANLkTinF=fjxiu5FMNn7dpLtHfrX7uZCVQ@mail.gmail.com>
To: public-html-a11y@w3.org
Hi Janina,

Actually, your approach and mine are identical as far as I can tell. I'm
just focused on one particular technical issue, while you have
described the big picture. I apologize for not having expressed it
well: I simply assumed we were all on the same page already (which, in
fact, we are, but we weren't able to communicate it). Your email and
mine looked at different aspects of the "extended text description"
problem, which is good.

We both agree that there is no difference between an "ordinary" text
description and an "extended" text description - any text description
has to be regarded as potentially needing a time extension because of
the speed settings in the screen reader.

We also both agree that extended text descriptions require the video
to be paused to allow time for the remaining descriptive text to be
voiced.
As for your description of the two things that a "module" would need
to be able to do, I think that functionality already exists. In detail:

> 1.)     Track time--including how much time is available in between segments of spoken dialog.

Text descriptions are provided through the TextTrack mechanism as
currently specified in HTML5. A TextTrack consists of a sequence of
TextTrackCues, each of which has a startTime and an endTime attribute.
The author of a TextTrack will create a text description by specifying
TextTrackCues whose startTime and endTime fit in between segments of
spoken dialog. If there is no pause within a stretch of spoken dialog,
it's even possible to specify the startTime and the endTime to be the
same time.
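
To illustrate the authoring side, here is a minimal TypeScript sketch of
how cue times could be derived from the dialog gaps. The DialogSegment
and Cue shapes and the cuesForGaps helper are names I made up for this
example; only startTime and endTime correspond to the TextTrackCue
attributes mentioned above.

```typescript
// Hypothetical shapes for illustration only; startTime/endTime mirror the
// TextTrackCue attributes, everything else is an assumption of this sketch.
interface DialogSegment { start: number; end: number; } // seconds
interface Cue { startTime: number; endTime: number; text: string; }

// Put descriptions[i] into the gap that follows dialog[i].
// Assumes descriptions.length <= dialog.length and dialog sorted by time.
// When there is no gap, startTime and endTime end up identical.
function cuesForGaps(
  dialog: DialogSegment[],
  descriptions: string[],
  mediaEnd: number
): Cue[] {
  return descriptions.map((text, i) => {
    const startTime = dialog[i].end;
    const nextStart = i + 1 < dialog.length ? dialog[i + 1].start : mediaEnd;
    return { startTime, endTime: Math.max(startTime, nextStart), text };
  });
}
```

A cue whose startTime equals its endTime is the case where the video
always has to be paused for the description to be voiced.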

Note that this functionality is already available as currently
specified in HTML5.

> 2.)     Intelligently pipe text to a TTS engine. The "intelligence" relates to calculating how much time will be required to voice the text that precedes the onset of the next spoken dialog.

To achieve this, the description text of each TextTrackCue will need to
be handed over to the TTS engine at the time that the TextTrackCue
becomes active. In current implementations, this has been achieved in
JavaScript by using an aria-live attribute on <div> elements that are
created as a TextTrackCue becomes active. A TTS engine that supports
aria-live becomes aware of that text immediately and voices it.
Now to the problem that I wanted to solve and that you call
"intelligent piping". The time that is required to voice the text of a
TextTrackCue is unknown to the author of the TextTrackCue and to the
browser, because the TTS engine controls the speech rate and therefore
the length of time that it takes to voice the text. Therefore, neither
the author nor the browser will know for how long to pause the video to
allow the TTS engine to finish speaking the text.

The TextTrackCue author can certainly make a good guess as to how much
time is available in a dialog pause and try to fit the length of the
descriptive text to that pause, but they can never be 100% certain
that it's right, because every user's TTS engine may be set up
differently. The browser is in a similar position, because it doesn't
know how long the TTS engine will take to speak the text.

The only module that knows how long it will take to voice the
description text is the TTS engine itself.

Now, this is where we have to find a solution. Your suggestion of
introducing an application that would look after the synchronisation
between the TTS engine and the browser is one way to do it, but I
would prefer it if we could solve the extended text description
problem with existing browsers and existing screen readers and just
some minimal extra functionality.

Here's how I can see it working:

If the browser exposes to the TTS engine the current time that the
video is playing at, and the end time of the associated TextTrackCue
by which time the description text has to be finished reading, then
the TTS engine has all the information that it needs to work out if
the text can be read during the given time frame.
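
As a rough TypeScript sketch of that calculation: the function names
below are hypothetical, and the words-per-minute estimate is only my
stand-in for whatever the TTS engine really knows about its own
speaking rate.

```typescript
// Time left before the cue ends, in seconds (never negative).
function availableSeconds(currentTime: number, cueEndTime: number): number {
  return Math.max(0, cueEndTime - currentTime);
}

// Crude estimate of speaking time; a real TTS engine would use its own
// knowledge of the voice and rate instead of a word count (assumption).
function estimatedSpeechSeconds(text: string, wordsPerMinute: number): number {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
  return (words * 60) / wordsPerMinute;
}

// True if the text can be voiced before the cue's end time is reached.
function fitsInCue(
  text: string,
  wordsPerMinute: number,
  currentTime: number,
  cueEndTime: number
): boolean {
  return (
    estimatedSpeechSeconds(text, wordsPerMinute) <=
    availableSeconds(currentTime, cueEndTime)
  );
}
```

The same text can fit for one user and not for another, simply because
their speech rate settings differ.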

If the available time is 0, it would immediately pause the video and
un-pause only when the description text has been read out in full or
when the user indicates that they want to skip over this cue text.

If the available time is larger than 0, then the TTS engine can start
a count-down timer for the available time while it reads out the text,
pause the video if this timer reaches 0 before it has finished, and
then un-pause the video again when it has finished reading the
description text.
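
Here is a minimal TypeScript sketch of this count-down variant. The
VideoLike interface and the createCountdownDescriber helper are
hypothetical names for this example; the caller is assumed to schedule
the real timer (for instance with setTimeout) and to report when speech
has ended.

```typescript
// Stand-in for the parts of a video element used here.
interface VideoLike {
  readonly paused: boolean;
  pause(): void;
  play(): void;
}

// Hypothetical controller living on the TTS side.
function createCountdownDescriber(video: VideoLike) {
  let speaking = false;
  let pausedByUs = false;

  // Called when the count-down reaches 0.
  function timerExpired(): void {
    if (speaking && !video.paused) {
      video.pause();
      pausedByUs = true;
    }
  }

  return {
    // Called when the cue becomes active. Returns the delay in
    // milliseconds after which the caller should invoke timerExpired();
    // with no time available the video is paused right away.
    startSpeaking(availableSeconds: number): number {
      speaking = true;
      if (availableSeconds <= 0) {
        timerExpired();
        return 0;
      }
      return availableSeconds * 1000;
    },
    timerExpired,
    // Called when the TTS engine has finished reading the text.
    speechFinished(): void {
      speaking = false;
      if (pausedByUs) {
        pausedByUs = false;
        video.play();
      }
    },
  };
}
```

The sketch only resumes playback if it was the one that paused the
video, so that a pause requested by the user is left alone.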

Since this may be rather inaccurate because of the asynchronous nature
of TTS and video playback, we're probably better off if the TTS engine
registers an onexit event handler on the TextTrackCue with the browser
and pauses the video when the browser fires that onexit event, then
un-pauses the video again when it has finished reading the description
text.
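
The onexit variant could look roughly like the following TypeScript
sketch. VideoLike, CueLike and pauseOnCueExit are names invented for
this example; CueLike only models the onexit handler slot of a
TextTrackCue, and how the TTS engine actually gets hold of the cue is
left open.

```typescript
// Stand-ins for the video element and the cue (assumptions of the sketch).
interface VideoLike {
  readonly paused: boolean;
  pause(): void;
  play(): void;
}
interface CueLike {
  onexit: (() => void) | null;
}

// Register an onexit handler that pauses the video if the description is
// still being read when the cue ends. The returned speechFinished() is to
// be called by the TTS side once it is done reading.
function pauseOnCueExit(video: VideoLike, cue: CueLike) {
  let speaking = true;
  let pausedByUs = false;

  cue.onexit = () => {
    if (speaking && !video.paused) {
      video.pause();
      pausedByUs = true;
    }
  };

  return {
    speechFinished(): void {
      speaking = false;
      cue.onexit = null; // nothing left to wait for
      if (pausedByUs) {
        pausedByUs = false;
        video.play();
      }
    },
  };
}
```

Compared with the count-down timer, the browser decides here when the
end of the cue has been reached, so no second clock is involved.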

A third alternative is the use of the "pauseOnExit" attribute on the
TextTrackCue. A TTS could set the "pauseOnExit" for every
description-bearing TextTrackCue to "true" and when it has finished
reading out text, it only needs to set the "pauseOnExit" to false (in
case the video hadn't reached the end of the cue yet) and call the
"play()" function on the video to resume playback (in case the video
was indeed paused).
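
A minimal TypeScript sketch of the pauseOnExit variant: the
DescriptionCue and VideoLike interfaces and the two helper functions
are hypothetical names for this example, and only pauseOnExit, paused
and play() correspond to what is described above.

```typescript
// Stand-ins for the cue and video element (assumptions of this sketch).
interface DescriptionCue {
  pauseOnExit: boolean;
}
interface VideoLike {
  readonly paused: boolean;
  play(): void;
}

// Ask the browser to pause at the end of every description-bearing cue.
function armDescriptionCues(cues: DescriptionCue[]): void {
  for (const cue of cues) {
    cue.pauseOnExit = true;
  }
}

// To be called by the TTS side when it has finished reading a cue's text.
function finishedReading(video: VideoLike, cue: DescriptionCue): void {
  // If the video has not reached the end of the cue yet, this keeps it
  // from pausing there.
  cue.pauseOnExit = false;
  // If the video did pause at the end of the cue, resume playback.
  // Note: this sketch cannot tell that pause from one made by the user.
  if (video.paused) {
    video.play();
  }
}
```

In this variant the browser does the pausing itself, and the TTS side
only has to clear the flag and resume.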

Note that this is all in addition to the usual interactions that a
user has with a TTS engine: the user can control the skipping of
description text, the user can pause the video (and therefore pause
the reading of description text) and start reading other content on
the page, the user can come back to a previously paused video element
and continue playing from where they were (including the associated
description text), and the user can reset a video to play from the
beginning (including the associated description text).

The analysis above provides three different ways in which the TTS
engine can interact with the video player to provide extended text
descriptions. Have I overlooked any problems in this approach? The key
issue that I am trying to figure out is whether or not we already have
the required events and attributes available in the current HTML5
specification to satisfy this use case. The above analysis indicates
that we have everything, but I may have missed something.

Best Regards,

On Sun, Jun 5, 2011 at 10:29 AM, Janina Sajka <janina@rednote.net> wrote:
> Hi, Silvia:
> I would suggest relying on the screen reader is asking for unnecessary
> complications. They're not designed for interacting with anything that
> moves through time.
> I think there's a simpler way. Please bear with me for a moment and set
> the screen reader aside. The problem is to get the texted description
> voiced during such time as is available in between the spoken dialog
> which is in the media resource. Expressed this way, there's actually no
> functional difference between extended and "ordinary" texted
> descriptions. In other words, by definition we know that extended
> descriptions will require pausing the audio in order to allow time for
> remaining descriptive text to be voiced. However, if the TTS rate is set
> slow enough, this could well also be the functional result of trying to
> squeeze a descriptive string in between segments of recorded audio.
> Fortunately, I would propose both can be handled the same way.
> What we need is a module that can do two things:
> 1.)     Track time--including how much time is available in between
> segments of spoken dialog.
> 2.)     Intelligently pipe text to a TTS engine. The "intelligence"
> relates to calculating how much time will be required to voice the text
> that precedes the onset of the next spoken dialog.
> Thus, if the time required is longer than that available in
>        the media resource, pause the primary resource long enough for
>        voicing to complete.
> No screen reader does anything remotely like this. Certainly, some might
> want to add the capability, but the capability could just as readily
> come from an app that does only this task.
> Note that it would be inappropriate to vary rate of TTS speech in order
> to try and "squeeze" text into an available segment in between spoken
> dialog.
> Note also that the screen reader can be expected to have access to all
> text on screen at any point. If the user is satisfied to rely on the
> screen reader alone, pausing the media and using it to read what's on
> screen is always an option. At times, I would expect users would pause
> the app I've described in order to check the spelling of some term, for
> instance. This is fully in keeping with what a screen reader can do
> today. It's common to set a screen reader NOT to auto read text, yet
> it's still able to voice as the user interactively "reads" across the
> screen word by word, or char by char.
> Thus, the only remaining behavior to consider, is whether current
> position is reset by user initiated screen reading activity. May I
> suggest that typical screen reader functionality is again available to
> help us answer that. It's the user's choice. In some cases the user will
> want to resume from where playback was stopped, regardless of what the
> user may have "read" with the screen reader. In other cases, the user
> may choose to indicate "start from here," which is an option most modern
> screen readers support. As we should expect to navigate the media
> structure using the screen reader, this would be in keeping with
> expected functionality, so the plugin app I proposed above needs to be
> capable of receiving a new start point (earlier or later in the
> timeline).
> Janina
Received on Monday, 6 June 2011 02:16:54 UTC
