RE: [media] how to support extended text descriptions

Well, it seems to me that checking that the paused flag is false, checking currentTime, and calling pause() is no more or less onerous than checking that the paused flag is true and calling play(); but that could depend on the specific implementations, I guess.

Since @pauseOnExit will always need to be set for descriptions to work properly with TTS, I think it might be better if it were assumed for the description track kind (or maybe implied by a non-zero @volume on a text track), with the mechanism left up to the implementation and not exposed to the author.

I would also note that if you want to display the text of a description visually while it is read out, for cognitive use, then pauseOnExit is precisely the wrong time to halt the video, since the cue will have disappeared from the screen by the time the pause happens. This means that the description track will need different timing when used visually and when used aurally.

We could just write something like:
User agents must interact with assistive technology in such a way as to allow sufficient time for cues marked as descriptions to be read out by AT before the marked end time of the cue occurs in the playing video (this may involve pausing and restarting the video for long descriptions). 
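
To make the timing concrete, here is a minimal sketch of the arithmetic such wording implies. It is only an illustration: `DescriptionCue`, `requiredPauseSeconds` and the `speechSeconds` estimate are hypothetical names, and how AT would actually report its speaking time to the user agent is not defined anywhere in this thread.

```typescript
// Hypothetical shape for illustration; not the HTML5 TextTrackCue interface.
interface DescriptionCue {
  startTime: number; // seconds on the media timeline
  endTime: number;   // seconds on the media timeline
  text: string;
}

// Seconds the video would have to be held paused so that speech lasting
// `speechSeconds`, started at `currentTime`, is finished by the cue's end.
// `speechSeconds` is assumed to come from the TTS engine, which is the only
// component that knows its own speaking rate.
function requiredPauseSeconds(
  cue: DescriptionCue,
  currentTime: number,
  speechSeconds: number
): number {
  // Playback time still left inside the cue (never negative).
  const available = Math.max(0, cue.endTime - currentTime);
  // Only the part of the speech that does not fit needs a pause.
  return Math.max(0, speechSeconds - available);
}
```

For a cue running from 10s to 14s, a 6-second reading started at 10s would need the video held for 2 seconds; a cue whose start and end times coincide would need it held for the whole reading.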

-----Original Message-----
From: Silvia Pfeiffer
Sent: 06 June 2011 10:39
To: Sean Hayes
Subject: Re: [media] how to support extended text descriptions

On Mon, Jun 6, 2011 at 7:31 PM, Sean Hayes wrote:
> If the TTS engine can know - presumably through the AT APIs - the video is paused, and call play() to restart the video, why can it not also know the video is still playing as it runs out of time and call pause(), obviating the need for pauseOnExit?

That's the first proposal with registering an event handler on the
"onexit" event. Either one should work (either pauseOnExit or the
onexit event). The use of pauseOnExit is probably cheaper and faster
than registering an event handler, since it doesn't require video and
TTS to synchronize. But I'd be curious to hear from the video
developers on this.
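
As a rough illustration of the two options, the toy model below mimics a player stepping past a cue's end time. `SimCue`, `SimPlayer` and `ttsFinished` are hypothetical stand-ins, not the HTML5 interfaces or any accessibility API; only the pauseOnExit flag, the exit event, paused, play() and pause() correspond to things discussed in this thread.

```typescript
// Toy model only: these classes are illustrative, not the real media API.
type ExitHandler = () => void;

class SimCue {
  startTime: number;
  endTime: number;
  pauseOnExit = false;
  onexit: ExitHandler | null = null;
  constructor(startTime: number, endTime: number) {
    this.startTime = startTime;
    this.endTime = endTime;
  }
}

class SimPlayer {
  currentTime = 0;
  paused = false;
  cues: SimCue[];
  constructor(cues: SimCue[]) {
    this.cues = cues;
  }
  play(): void { this.paused = false; }
  pause(): void { this.paused = true; }

  // Advance playback by `seconds`, stopping early if a cue exit pauses it.
  advance(seconds: number): void {
    if (this.paused) return;
    const target = this.currentTime + seconds;
    const exiting = this.cues
      .filter((c) => c.endTime > this.currentTime && c.endTime <= target)
      .sort((a, b) => a.endTime - b.endTime);
    for (const cue of exiting) {
      this.currentTime = cue.endTime;
      if (cue.pauseOnExit) this.pause(); // flag: the player pauses by itself
      if (cue.onexit) cue.onexit();      // handler: the listener decides
      if (this.paused) return;
    }
    this.currentTime = target;
  }
}

// What the TTS side would do once it has finished reading the cue text:
// clear the flag (in case the cue end is still ahead) and resume if paused.
function ttsFinished(player: SimPlayer, cue: SimCue): void {
  cue.pauseOnExit = false;
  if (player.paused) player.play();
}
```

With the flag, the player stops at the cue end without consulting anything else; with the handler, the pause happens only if the listener is still speaking when the exit event fires, which is where the synchronization cost would come in.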

> Or are you intending that the JS call the play(), in which case there would need to be an event (e.g. onTextReadingFinished or some such) on cues.

No, the screen reader has to do that. This should all work without any
JS in the middle.


> -----Original Message-----
> From: Silvia Pfeiffer
> Sent: 06 June 2011 09:26
> Subject: Re: [media] how to support extended text descriptions
> I've chatted with one of the Mozilla a11y developers and it seems he
> favors the third option with the "pauseOnExit" flag. He also pointed
> out that this requires additions to the accessibility APIs, in
> particular to IA2 and others. Do people know more about this and what
> other APIs would need to be adapted, too?
> Silvia.
> On Mon, Jun 6, 2011 at 5:55 PM, Silvia Pfeiffer wrote:
>> Incidentally, I have another question and something that we may need
>> to at least mention in the HTML spec:
>> right now I am assuming that captions should not be exposed to screen
>> readers. I think it would be really annoying if everything that is
>> being said in the video is being repeated by the screen reader over
>> the top of the video's audio. Does this also match the general opinion
>> here?
>> If so, what about subtitles - in particular subtitles that are
>> provided in a different language? Should a screen reader expose those?
>> Maybe the screen reader should provide the possibility to have one
>> audio track turned on or alternatively a subtitle track, plus one text
>> description track? Thoughts?
>> Silvia.
>> On Mon, Jun 6, 2011 at 12:16 PM, Silvia Pfeiffer wrote:
>>> Hi Janina,
>>> Actually, your approach and mine are identical as far as I can tell. I'm
>>> just focused on one particular technical issue, while you have
>>> described the big picture. I apologize for not having expressed it
>>> well: I simply assumed we were all on the same page already (which, in
>>> fact, we are, but weren't able to communicate). Your email and mine
>>> looked at different aspects of the "extended text description"
>>> problem, which is good.
>>> We both agree that there is no difference between an "ordinary" text
>>> description and an "extended" text description - any text description
>>> has to be regarded as potentially having a need for time extension
>>> because of the speed settings in the screen reader.
>>> We also both agree that extended text descriptions require the video
>>> to be paused to allow time for the remaining descriptive text to be
>>> voiced.
>>> As for your description of the two things that a "module" would need
>>> to be able to do, I think that functionality already exists. In
>>> detail:
>>>> 1.)     Track time--including how much time is available in between segments of spoken dialog.
>>> Text descriptions are specified through the TextTrack mechanism as
>>> currently specified in HTML5. A TextTrack consists of a sequence of
>>> TextTrackCues, each of which has a startTime and an endTime attribute.
>>> The author of a TextTrack will create a text description in such a way
>>> that they specify TextTrackCues with startTime and endTime that fit in
>>> between segments of spoken dialog. If there is no pause time within a
>>> spoken dialog, it's even possible to specify the startTime and the
>>> endTime to be the same time.
>>> Note that this functionality is already available as currently
>>> specified in HTML5.
>>>> 2.)     Intelligently pipe text to a TTS engine. The "intelligence" relates to calculating how much time will be required to voice the text that precedes the onset of the next spoken dialog.
>>> To achieve this, the description text of each TextTrackCue will need to
>>> be handed over to the TTS engine at the time that the TextTrackCue
>>> becomes active. In current implementations, this has been achieved in
>>> JavaScript by using an aria-live attribute on <div> elements that are
>>> created as a TextTrackCue becomes active. A TTS engine that supports
>>> aria-live becomes aware of that text immediately and voices it
>>> immediately.
>>> Now to the problem that I wanted to solve and that you call
>>> "intelligent piping". The time that is required to voice the text of a
>>> TextTrackCue is unknown to the author of the TextTrackCue and to the
>>> browser, because the TTS engine controls the speech rate and therefore
>>> the length of time that it takes to voice the text. Therefore, neither
>>> the author nor the browser will know for how long to pause the video to
>>> allow the TTS engine to finish speaking the text.
>>> The TextTrackCue author can certainly make a good guess as to how much
>>> time is available in a dialog pause and try to accommodate the length
>>> of the descriptive text to that pause, but he can never be 100%
>>> certain that it's right, because the TTS of every user may be
>>> different. The browser is in a similar position, because it doesn't
>>> know how long the TTS engine is speaking.
>>> The only module that knows how long it will take to voice the
>>> description text is the TTS engine itself.
>>> Now, this is where we have to find a solution. Your suggestion of
>>> introducing an application that would look after the synchronisation
>>> between the TTS engine and the browser is one way to do it, but I
>>> would prefer if we can solve the extended text description problem
>>> with existing browsers and existing screen readers and just some
>>> minimal extra functionality.
>>> Here's how I can see it working:
>>> If the browser exposes to the TTS engine the current time that the
>>> video is playing at, and the end time of the associated TextTrackCue
>>> by which time the description text has to be finished reading, then
>>> the TTS engine has all the information that it needs to work out if
>>> the text can be read during the given time frame.
>>> If the available time is 0, it would immediately pause the video and
>>> un-pause only when the description text is finished reading or when
>>> the user indicated that he/she wanted to skip over this cue text.
>>> If the available time is larger than 0, then the TTS engine can start
>>> a count-down timer to measure how long its reading-out time takes and
>>> pause the video when this timer reaches 0 - then un-pause the video
>>> again when it has finished reading the description text.
>>> Since this may be rather inaccurate because of the asynchronous nature
>>> of TTS and video playback, we're probably better off if the TTS engine
>>> registers an onexit event handler on the TextTrackCue with the browser
>>> and pauses the video when that onexit event is fired by the browser
>>>  - then un-pauses the video again when it has finished reading the
>>> description text.
>>> A third alternative is the use of the "pauseOnExit" attribute on the
>>> TextTrackCue. A TTS could set the "pauseOnExit" for every
>>> description-bearing TextTrackCue to "true" and when it has finished
>>> reading out text, it only needs to set the "pauseOnExit" to false (in
>>> case the video hadn't reached the end of the cue yet) and call the
>>> "play()" function on the video to resume playback (in case the video
>>> was indeed paused).
>>> Note that this is all in addition to the usual interactions that a
>>> user has with a TTS engine: the user can control the skipping of
>>> description text, the user can pause the video (and therefore pause
>>> the reading of description text) and start reading other content on
>>> the page, the user can come back to a previously paused video element
>>> and continue playing from where they were (including the associated
>>> description text), and the user can reset a video to play from the
>>> beginning (including the associated description text).
>>> The analysis above provides three different ways in which the TTS
>>> engine can interact with the video player to provide extended text
>>> descriptions. Have I overlooked any problems in this approach? The key
>>> issue that I am trying to figure out is whether or not we already have
>>> the required events and attributes available in the current HTML5
>>> specification to satisfy this use case. The above analysis indicates
>>> that we have everything, but I may have overlooked something.
>>> Best Regards,
>>> Silvia.
>>> On Sun, Jun 5, 2011 at 10:29 AM, Janina Sajka wrote:
>>>> Hi, Silvia:
>>>> I would suggest relying on the screen reader is asking for unnecessary
>>>> complications. They're not designed for interacting with anything that
>>>> moves through time.
>>>> I think there's a simpler way. Please bear with me for a moment and set
>>>> the screen reader aside. The problem is to get the texted description
>>>> voiced during such time as is available in between the spoken dialog
>>>> which is in the media resource. Expressed this way, there's actually no
>>>> functional difference between extended and "ordinary" texted
>>>> descriptions. In other words, by definition we know that extended
>>>> descriptions will require pausing the audio in order to allow time for
>>>> remaining descriptive text to be voiced. However, if the TTS rate is set
>>>> slow enough, this could well also be the functional result of trying to
>>>> squeeze a descriptive string in between segments of recorded audio.
>>>> Fortunately, I would propose both can be handled the same way.
>>>> What we need is a module that can do two things:
>>>> 1.)     Track time--including how much time is available in between
>>>> segments of spoken dialog.
>>>> 2.)     Intelligently pipe text to a TTS engine. The "intelligence"
>>>> relates to calculating how much time will be required to voice the text
>>>> that precedes the onset of the next spoken dialog.
>>>> Thus, if the time required is longer than that available in
>>>>        the media resource, pause the primary resource long enough for
>>>>        voicing to complete.
>>>> No screen reader does anything remotely like this. Certainly, some might
>>>> want to add the capability, but the capability could just as readily
>>>> come from an app that does only this task.
>>>> Note that it would be inappropriate to vary rate of TTS speech in order
>>>> to try and "squeeze" text into an available segment in between spoken
>>>> dialog.
>>>> Note also that the screen reader can be expected to have access to all
>>>> text on screen at any point. If the user is satisfied to rely on the
>>>> screen reader alone, pausing the media and using it to read what's on
>>>> screen is always an option. At times, I would expect users would pause
>>>> the app I've described in order to check the spelling of some term, for
>>>> instance. This is fully in keeping with what a screen reader can do
>>>> today. It's common to set a screen reader NOT to auto read text, yet
>>>> it's still able to voice as the user interactively "reads" across the
>>>> screen word by word, or char by char.
>>>> Thus, the only remaining behavior to consider is whether current
>>>> position is reset by user initiated screen reading activity. May I
>>>> suggest that typical screen reader functionality is again available to
>>>> help us answer that. It's the user's choice. In some cases the user will
>>>> want to resume from where playback was stopped, regardless of what the
>>>> user may have "read" with the screen reader. In other cases, the user
>>>> may choose to indicate "start from here," which is an option most modern
>>>> screen readers support. As we should expect to navigate the media
>>>> structure using the screen reader, this would be in keeping with
>>>> expected functionality, so the plugin app I proposed above needs to be
>>>> capable of receiving a new start point (earlier or later in the
>>>> timeline).
>>>> Janina

Received on Monday, 6 June 2011 11:09:10 UTC