RE: [media] how to support extended text descriptions from Sean Hayes on 2011-06-08 (public-html-a11y@w3.org from June 2011)

From: Sean Hayes <Sean.Hayes@microsoft.com>
Date: Wed, 8 Jun 2011 11:25:20 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
CC: "public-html-a11y@w3.org" <public-html-a11y@w3.org>
Message-ID: <8DEFC0D8B72E054E97DC307774FE4B91586B557D@DB3EX14MBXC313.europe.corp.microsoft.c>
I guess it's possible the TTS engine may get starved too; the browser could hang; any one of a thousand things could and will go wrong. But I don't think we are designing a Mars lander here. The web breaks, a lot. We should take reasonable precautions, yes, but we don't have to go mad.

You are assuming that the TTS engine needs to be called on the fly during media playback, in a just-in-time manner. A different approach would be to analyse the timed text before playing anything and generate a complete track list pointing to local audio files. You only need to generate the file locations in advance, plus enough of the files to cover TTS starvation; producing the later files can happen as a background process. You can also pre-determine where the pauses need to be in the main media and hand it all off to the media player as a playlist, and then the sync will all happen inside the media player.
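
A minimal sketch of that pre-planning step, in TypeScript. Everything named here is an assumption made for illustration: the cue shape, the `audioDuration` estimate, the file names, and the choice to pause at the end of the cue's slot. It only shows that the pause points can be computed from the timed text before anything plays.

```typescript
// Hypothetical cue shape: none of these names come from a spec.
interface DescriptionCue {
  start: number;         // main-media time (s) where the cue becomes active
  end: number;           // main-media time (s) where the cue stops being active
  audioSrc: string;      // where the rendered speech will live (file may not exist yet)
  audioDuration: number; // estimated length of the spoken audio (s)
}

interface PlannedDescription {
  audioSrc: string;
  startAt: number;  // main-media time at which the description audio starts
  pauseAt: number;  // main-media time at which the main media would pause
  pauseFor: number; // seconds of pause needed; 0 means the audio fits the cue
}

// Decide, before playback starts, where the main media has to pause.
function planDescriptions(cues: DescriptionCue[]): PlannedDescription[] {
  // Sort a copy by start time so the caller's array is not mutated.
  const sorted = cues.slice().sort(function (a, b) { return a.start - b.start; });
  return sorted.map(function (cue) {
    // Time available while the main media keeps playing (0 for zero-duration cues).
    const slot = Math.max(0, cue.end - cue.start);
    // Whatever speech does not fit in the slot has to be covered by a pause.
    const overflow = Math.max(0, cue.audioDuration - slot);
    return {
      audioSrc: cue.audioSrc,
      startAt: cue.start,
      // Pause at the end of the slot, i.e. no later than the end of the active interval.
      pauseAt: cue.start + slot,
      pauseFor: overflow,
    };
  });
}

const plan = planDescriptions([
  { start: 20, end: 20, audioSrc: "desc-2.ogg", audioDuration: 4 },
  { start: 10, end: 12, audioSrc: "desc-1.ogg", audioDuration: 5 },
  { start: 30, end: 36, audioSrc: "desc-3.ogg", audioDuration: 2.5 },
]);
```

With these sample cues, the first description needs a 3 second pause at 12s, the zero-duration cue pauses for its whole 4 seconds at 20s, and the third fits inside its cue and needs no pause.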

OTOH I think what this discussion is showing us is that having voicing go on outside the browser is problematic. There is an active WG for voice on the web; perhaps we ought to confer with them on this issue, as well as the ARIA folks. If we are going to solve this for descriptions, then it would make sense to do it as part of a general solution for voice on the web.

For showing descriptions as captions, I think you are over-analyzing it. All you need to do is additionally reference the markup file containing descriptions as a caption file and just have it displayed normally. We don't need to add any special modes for description tracks, except to ensure that if the media pauses for a description it does so during the active interval and not after it (this might also solve the zero-duration interval issue that Masatomo indicated).

I'm not trying to fool the user, I'm hoping we can design descriptions and extended descriptions so that expected things will work. Having a volume control for things that the user perceives as audio seems natural and expected to me; having it only for pre-recorded audio seems like a problem.
Similarly, a design that allows extended audio descriptions to also use pre-recorded audio if required would be more balanced.

Somewhere along the line we have picked up this notion that text descriptions are going to be handed off to a screen reader, but I think this idea is pretty flawed. Producing speech from text has quite a bit more to it than that, as the speech group will tell us. Yes, it would maybe require an as-yet-undefined interaction between the TTS and the media player, but maybe not, as I stated above. Until we know how to do it, and have evidence that APIs can accommodate it (and in a general way, for example having events on ARIA live regions, not as a special case for descriptions) and that the AT community is stepping up to it, I think it unwise to rely on it.

I suggest we either design speech production into the browser properly, e.g. provide a JS API onto the system speech API with the appropriate events (which seems preferable to me, but unlikely given the last call process), or remain silent on the issue until we get some implementation feedback on how it should all work. We can put an outcome-based clause in the spec and leave it up to the browser folks to figure it out.

Not defining the bridge to TTS in this version does not IMO leave us with nothing in the interim. In my experiments, embedding references to audio files in the timed text (whether pre-recorded as I did, or generated on a server using TTS as Masatomo did) worked reasonably well for extended descriptions, since the necessary JS APIs to play referenced audio are already in place. If in the future there is an API to hand the text off for client-side speech production (or the system can do this in the background), then the same approach could work, but possibly using references to SSML rather than audio.
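
A rough sketch of the control flow for one extended description played from a referenced audio file. This is not the code from the experiments mentioned above; the `MainMedia` and `PlayClip` adapters and the file name are hypothetical, chosen so the ordering can be shown without a browser.

```typescript
// Hypothetical adapters: the caller supplies these, e.g. wrapping a <video> element.
interface MainMedia {
  pause(): void;
  resume(): void;
}

// Starts the referenced clip and calls onDone once it has finished playing.
type PlayClip = (src: string, onDone: () => void) => void;

// Pause the main media, play one referenced description clip, then resume.
function playExtendedDescription(main: MainMedia, playClip: PlayClip, src: string): void {
  main.pause();
  playClip(src, function () {
    main.resume();
  });
}

// Fake adapters that only record what happened, so the ordering can be checked
// without a browser. In a page, playClip could create an audio element for src,
// listen for its "ended" event and call play(); resume could call video.play().
const trace: string[] = [];
const fakeMain: MainMedia = {
  pause: function () { trace.push("pause"); },
  resume: function () { trace.push("resume"); },
};
const fakePlayClip: PlayClip = function (src, onDone) {
  trace.push("clip:" + src);
  onDone(); // a real clip would call this later, from its "ended" handler
};

playExtendedDescription(fakeMain, fakePlayClip, "desc-1.ogg");
```

Keeping the clip player behind a callback means the same flow would apply if the clip were later produced by client-side speech instead of a pre-recorded file.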




-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com] 
Sent: 08 June 2011 07:46
To: Sean Hayes
Cc: public-html-a11y@w3.org
Subject: Re: [media] how to support extended text descriptions

On Wed, Jun 8, 2011 at 9:30 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> "Therefore, you cannot rely on the video making progress at the same time as the TTS engine ".
> Possibly not, however in the absence of trick play (which I think would have to cancel any descriptions), one can probably assume the video won't go *faster* than expected. Therefore if you set an internal handler for the assumed end time, then even if the video hasn't reached that point yet because it stalled, no real harm is done issuing a pause.


The TTS engine might go slower than expected (because it, too, may be
starved of CPU) and therefore the effect of the video going slower
than expected would still happen.


> " I do not know how to inform the browser or a JS when the screen reader has finished reading text in a cross-browser compatible way. "
> Do we need to is my point?


I think we do, since the TTS engine and the video player are two
processes that run asynchronously and therefore synchronisation is
necessary.


> " Descriptions delivered as audio do not come in the TextTrack. They come in the multitrack API. "
> That's arguing we shouldn't change the design because the design is wrong. To the end user they are both descriptions and serve the same purpose; the user doesn't care what markup tag caused them to come into existence.

You're assuming that the current design is wrong. Let's analyse that
before making such an assumption.

When we deal with text descriptions, they have to be voiced somehow.
This requires a TTS somewhere in the pipeline.
When we deal with audio descriptions, they come directly from the
video element and are thus a native part of the browser and not handed
through to a TTS.
I find it hard to see that it is possible to expose these two
fundamentally different types of content to the user in the same way.

In particular: audio descriptions will go in sync with the video and
there is no need to pause the video to display them, while text
descriptions create the need for extensions of the timeline and the
pausing behaviour.

I think they are inherently different and trying to fool the user into
thinking that they are identical will just lead to problems.


> "So you want them displayed as well as the captions? Always or only when they are also read out? What screen real estate are you expecting to use? Can you provide an example as a use case?"
> They would be presented as both captions and descriptions, so they are displayed when the user selects them in the caption menu and for their allotted duration. I'm expecting the author to determine the screen real estate exactly as they do for other captions. I demoed an example at the f2f if you recall. I'll check tomorrow whether it's still online.


Does selecting them in the captions menu automatically mean they have
to be shown on the screen? We have to be careful about the
consequences: we are just introducing two new states, making it 4 states
that an audio description track can be in: off, on and voiced, on and
visible, on and visible and voiced. A single entry in a menu will now
not suffice any longer to select an audio description track. This
single change creates heaps of new complexity.
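
To make the state count concrete, here is a small TypeScript sketch that treats "voiced" and "visible" as two independent switches. The type and function names are hypothetical and not part of any proposed API; the labels are the four states listed above.

```typescript
// Two independent switches for a description track (hypothetical names).
interface DescriptionTrackMode {
  voiced: boolean;  // read out by a speech engine
  visible: boolean; // rendered on screen like a caption
}

// Human-readable label for one combination of the two switches.
function describeMode(mode: DescriptionTrackMode): string {
  if (!mode.voiced && !mode.visible) { return "off"; }
  if (mode.voiced && mode.visible) { return "on and visible and voiced"; }
  return mode.voiced ? "on and voiced" : "on and visible";
}

// Enumerate every combination: 2 switches give 2 * 2 = 4 states.
function allModes(): string[] {
  const labels: string[] = [];
  const flags = [false, true];
  flags.forEach(function (voiced) {
    flags.forEach(function (visible) {
      labels.push(describeMode({ voiced: voiced, visible: visible }));
    });
  });
  return labels;
}

const modes = allModes();
```

A single on/off menu entry can only distinguish two of these four combinations, which is the extra complexity being pointed at.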

If an author really wants to display the text descriptions as text,
right now they would use some JavaScript to do so. Is that not
sufficient? Should we not wait and see how large the need for such a
feature is rather than jumping to conclusions on a feature that
doesn't exist anywhere else yet?


> "Screen readers provide the interface to the Braille devices."
> Screen readers are certainly the primary providers of text to a Braille device, but it's basically an output port; other processes, like the media subsystem, could potentially use it too. I don't think we should take it as a given that descriptions (which, as you say, aren't generally on the screen and aren't in the DOM) should actually be read by a screen reader.


They are in the shadow DOM and there is a JavaScript API for them.
They exist more in the page than other external content such as
picture, audio or video data.


> I am still not 100% on board with the idea that text track descriptions should be relying on the presence of a screen reader, since a SR is going to be doing a lot of other things related to navigation on the page. I'm not sure SR designers have even considered this use case.

Probably not yet. I am starting discussions on the IA2 mailing list to
see what people are thinking about it, since it would be there where
the most impact would be felt.

The issue is that SR and video playback have to interact
constructively. You can't just have them as completely separate
modules. The screen reader has control over an audio description track
of the video element - why should it not have control over a text
description track, too? Also, right now screen readers are the only
TTS engine we get for Web pages, so if we don't make use of them for
text descriptions, we can't do anything with text descriptions. What
alternative do we have?

Cheers,
Silvia.
Received on Wednesday, 8 June 2011 11:26:05 UTC