
Re: [media] how to support extended text descriptions

From: Janina Sajka <janina@rednote.net>
Date: Mon, 6 Jun 2011 09:33:02 -0400
To: public-html-a11y@w3.org
Message-ID: <20110606133302.GI6041@sonata.rednote.net>
Hi, Silvia:

Silvia Pfeiffer writes:
> Hi Janina,
> Actually, your approach and mine are identical for all I can tell. I'm
> just focused on one particular technical issue, while you have
> described the big picture. I apologize for not having expressed it
> well: I simply assumed we were all on the same page already (which, in
> fact, we are, but weren't able to communicate). Your email and mine
> looked at different aspects of the "extended text description"
> problem, which is good.
I agree we're pretty close.  Thanks for clarifying that we're not really
talking about a generalized application, by the bye. 

I do think, though, that we have a couple of misunderstandings about how
voicing extended descriptions needs to work:

1.)	Screen readers --- While reading out textual descriptions could
be loosely understood as "screen reading," I wonder if we introduce
confusion by using such terms for this particular feature. Managing a
time-sensitive text stream has not been a function provided by that
class of assistive technologies known as screen readers in their 30-year
history. To date the leading edge of anything similar is ARIA Live
Regions, which I suspect isn't a solution here because we'd be
overloading what Live Regions is designed to do. I mention it only to
set it aside explicitly.

	I think we want the option to voice texted descriptions using a
	voice different from the one the screen reader is using, very
	possibly output through a different device than the screen
	reader uses, or at least at a panning position distinct from the
	screen reader's. Independent volume control is also going to be
	important, though less so, as it's unlikely the screen reader
	will be voicing its ordinary functionality while the video
	resource is playing. I expect the user will continue to rely on
	the browser and screen reader to start, stop, pause, fast
	forward, rewind, navigate structurally, etc. However, while the
	video is actually playing, only the descriptions should be
	voiced--and these are a new kind of functionality.

	Now this is all my view of things, and we don't want to design
	based solely on my view. So, I will ask WAI-UAWG for a wider
	discussion on behavior.

2.)	The TTS engines are all over the map on how they're addressed.
So, I hold out no hope that the browser would talk directly to TTS,
unless we were to lock in a particular TTS, and I don't think that will
be acceptable to anyone. Fortunately, there are APIs for this, but I've
no good sense of how similar the various APIs are, or are not.

3.)	All the above to come to what authors should, and should not,
do. My suggestion is to not overwork this. Clearly, 250 words for a
10 sec. segment between dialog is inappropriate. But I don't think
there's much profit in going further than a general rule of X words per
minute per Y secs. of available time. Users will adjust rate of words
per minute anyway to a pace comfortable for them. This is an individual
thing. Some will shift to the slow side; most will set a faster rate
than most people would think workable. So, to go beyond a reasonable
rule about the number of words per explicit segment size is likely only
to frustrate users who expect these things to be under their control.
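
To make the "X words per minute per Y secs." rule concrete, here is a
minimal sketch of how an authoring tool might check a description against
the available gap. The function names and the idea of a single reference
rate are assumptions for illustration only; nothing here comes from a
spec, and users will still set their own rate.

```python
import math


def max_words(gap_seconds: float, words_per_minute: float) -> int:
    """Largest whole number of words that can be voiced in the gap.

    words_per_minute is an assumed reference rate chosen by the author;
    the listener's actual TTS rate may be faster or slower.
    """
    if gap_seconds < 0 or words_per_minute <= 0:
        raise ValueError("gap must be >= 0 and rate must be > 0")
    return math.floor(gap_seconds * words_per_minute / 60.0)


def fits(text: str, gap_seconds: float, words_per_minute: float) -> bool:
    """True if the description's word count is within the budget."""
    # Whitespace splitting is a crude word count, good enough for a rule of thumb.
    return len(text.split()) <= max_words(gap_seconds, words_per_minute)
```

For instance, at an assumed reference rate of 150 words per minute, a
10 sec. gap allows 25 words, so a 250 word description plainly does not
fit.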

So, I wonder whether there's a need here for anything in the specs?
Specifically, why not just allow the voicing app, whether independent as
I propose, or newly added to existing screen readers, why not simply
allow it to rely on the pause/resume functionality already provided?

> Now to the problem that I wanted to solve and that you call
> "intelligent piping". The time that is required to voice the text of a
> TextTrackCue is unknown to the author of the TextTrackCue and to the
> browser, because the TTS engine controls the speech rate and therefore
> the length of time that it takes to voice the text. Therefore, neither
> the author nor the browser will know for how long to pause the video to
> allow the TTS engine to finish speaking the text.

As I've suggested above, I think it's wrong to ask authors to worry
about this overmuch. I suggest control of pausing needs to rest with
whatever app is voicing the descriptive text.

> The TextTrackCue author can certainly make a good guess as to how much
> time is available in a dialog pause and try to accommodate the length
> of the descriptive text to that pause, but he can never be 100%
> certain that it's right, because the TTS of every user may be
> different. The browser is in a similar position, because it doesn't
> know how long the TTS engine is speaking.
But the app managing the voicing knows how long the available segment
is, and it knows the words per minute rate it has instructed the TTS to
use. In addition, newer TTS engines provide callbacks, so it's entirely
possible to manage this smoothly.
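
A rough sketch of what such a voicing app could do, under stated
assumptions: it estimates any overrun from the segment length and the
rate it has given the TTS, and it uses a completion callback to resume
playback. FakePlayer, VoicingController, and the method names are all
hypothetical stand-ins; no real TTS or browser API is being described.

```python
def estimated_overrun(word_count: int, words_per_minute: float,
                      segment_seconds: float) -> float:
    """Seconds of speech expected to spill past the available segment."""
    speaking_seconds = word_count * 60.0 / words_per_minute
    return max(0.0, speaking_seconds - segment_seconds)


class FakePlayer:
    """Stand-in for the media player; only pause/resume are assumed."""

    def __init__(self) -> None:
        self.paused = False
        self.log: list[str] = []

    def pause(self) -> None:
        self.paused = True
        self.log.append("pause")

    def play(self) -> None:
        self.paused = False
        self.log.append("play")


class VoicingController:
    """Hypothetical app sitting between the player and a TTS engine."""

    def __init__(self, player: FakePlayer) -> None:
        self.player = player
        self.speaking = False
        self.paused_by_us = False

    def start_description(self) -> None:
        # Called when description text is handed to the TTS engine.
        self.speaking = True

    def on_segment_end(self) -> None:
        # The gap in dialog is over: hold the media only if still speaking.
        if self.speaking and not self.player.paused:
            self.player.pause()
            self.paused_by_us = True

    def on_speech_done(self) -> None:
        # Assumed TTS completion callback: resume only what we paused.
        self.speaking = False
        if self.paused_by_us:
            self.player.play()
            self.paused_by_us = False
```

The estimate is only advisory here. The callback decides when playback
resumes, so a user who changes the rate mid-description is still handled
correctly, and a pause the user made is never undone by the app.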

> The only module that knows how long it will take to voice the
> description text is the TTS engine itself.
This is wrong. TTS rates are exposed and easily adjusted by other apps.
A screen reader control panel includes such things, and any app can do

For an idea of what's possible, take a look at:


This is an API for the IBM version of one of the most popular voices out
there--despite its glaring flaws.

> Now, this is where we have to find a solution. Your suggestion of
> introducing an application that would look after the synchronisation
> between the TTS engine and the browser is one way to do it, but I
> would prefer if we can solve the extended text description problem
> with existing browsers and existing screen readers and just some
> minimal extra functionality.
Again, my point is that there's nothing the current crop of screen
readers has to offer here. This is a different kind of job, not one
they're designed to do.

> Here's how I can see it working:
> If the browser exposes to the TTS engine the current time that the
> video is playing at, and the end time of the associated TextTrackCue
> by which time the description text has to be finished reading, then
> the TTS engine has all the information that it needs to work out if
> the text can be read during the given time frame.
But, TTS doesn't typically do this kind of calculation. It's just a TTS,
nothing more.


> If the available time is 0, it would immediately pause the video and
> un-pause only when the description text is finished reading or when
> the user indicated that he/she wanted to skip over this cue text.
> If the available time is larger than 0, then the TTS engine can start
> a count-down timer to measure how long its reading-out time takes and
> pause the video when this timer reaches 0 - then un-pause the video
> again when it has finished reading the description text.
> Since this may be rather inaccurate because of the asynchronous nature
> of TTS and video playback, we're probably better off if the TTS engine
> registers an onexit event handler on the TextTrackCue with the browser
> and pauses the video when that onexit event was reached by the browser
>  - then un-pauses the video again when it has finished reading the
> description text.
> A third alternative is the use of the "pauseOnExit" attribute on the
> TextTrackCue. A TTS could set the "pauseOnExit" for every
> description-bearing TextTrackCue to "true" and when it has finished
> reading out text, it only needs to set the "pauseOnExit" to false (in
> case the video hadn't reached the end of the cue yet) and call the
> "play()" function on the video to resume playback (in case the video
> was indeed paused).
> Note that this is all in addition to the usual interactions that a
> user has with a TTS engine: the user can control the skipping of
> description text, the user can pause the video (and therefore pause
> the reading of description text) and start reading other content on
> the page, the user can come back to a previously paused video element
> and continue playing from where they were (including the associated
> description text), and the user can reset a video to play from the
> beginning (including the associated description text).
> The analysis above provides three different ways in which the TTS
> engine can interact with the video player to provide extended text
> descriptions. Have I overlooked any problems in this approach? The key
> issue that I am trying to figure out is whether or not we already have
> the required events and attributes available in current HTML5
> specification to satisfy this use case. The above analysis indicates
> that we have everything, but I may have overlooked something.
> Best Regards,
> Silvia.
> On Sun, Jun 5, 2011 at 10:29 AM, Janina Sajka <janina@rednote.net> wrote:
> > Hi, Silvia:
> >
> >
> > I would suggest relying on the screen reader is asking for unnecessary
> > complications. They're not designed for interacting with anything that
> > moves through time.
> >
> > I think there's a simpler way. Please bear with me for a moment and set
> > the screen reader aside. The problem is to get the texted description
> > voiced during such time as is available in between the spoken dialog
> > which is in the media resource. Expressed this way, there's actually no
> > functional difference between extended and "ordinary" texted
> > descriptions. In other words, by definition we know that extended
> > descriptions will require pausing the audio in order to allow time for
> > remaining descriptive text to be voiced. However, if the TTS rate is set
> > slow enough, this could well also be the functional result of trying to
> > squeeze a descriptive string in between segments of recorded audio.
> > Fortunately, I would propose both can be handled the same way.
> >
> > What we need is a module that can do two things:
> >
> > 1.)     Track time--including how much time is available in between
> > segments of spoken dialog.
> >
> >
> > 2.)     Intelligently pipe text to a TTS engine. The "intelligence"
> > relates to calculating how much time will be required to voice the text
> > that precedes the onset of the next spoken dialog.
> >
> > Thus, if the time required is longer than that available in
> >        the media resource, pause the primary resource long enough for
> >        voicing to complete.
> >
> > No screen reader does anything remotely like this. Certainly, some might
> > want to add the capability, but the capability could just as readily
> > come from an app that does only this task.
> >
> > Note that it would be inappropriate to vary rate of TTS speech in order
> > to try and "squeeze" text into an available segment in between spoken
> > dialog.
> >
> > Note also that the screen reader can be expected to have access to all
> > text on screen at any point. If the user is satisfied to rely on the
> > screen reader alone, pausing the media and using it to read what's on
> > screen is always an option. At times, I would expect users would pause
> > the app I've described in order to check the spelling of some term, for
> > instance. This is fully in keeping with what a screen reader can do
> > today. It's common to set a screen reader NOT to auto read text, yet
> > it's still able to voice as the user interactively "reads" across the
> > screen word by word, or char by char.
> >
> > Thus, the only remaining behavior to consider, is whether current
> > position is reset by user initiated screen reading activity. May I
> > suggest that typical screen reader functionality is again available to
> > help us answer that. It's the user's choice. In some cases the user will
> > want to resume from where playback was stopped, regardless of what the
> > user may have "read" with the screen reader. In other cases, the user
> > may choose to indicate "start from here," which is an option most modern
> > screen readers support. As we should expect to navigate the media
> > structure using the screen reader, this would be in keeping with
> > expected functionality, so the plugin app I proposed above needs to be
> > capable of receiving a new start point (earlier or later in the
> > timeline).
> >
> > Janina
> >


Janina Sajka,	Phone:	+1.443.300.2200

Chair, Open Accessibility	janina@a11y.org	
Linux Foundation		http://a11y.org

Chair, Protocols & Formats
Web Accessibility Initiative	http://www.w3.org/wai/pf
World Wide Web Consortium (W3C)
Received on Monday, 6 June 2011 13:33:27 UTC
