
Re: [media] how to support extended text descriptions

From: Janina Sajka <janina@rednote.net>
Date: Wed, 8 Jun 2011 11:01:31 -0400
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: Sean Hayes <Sean.Hayes@microsoft.com>, "public-html-a11y@w3.org" <public-html-a11y@w3.org>
Message-ID: <20110608150131.GO6041@sonata.rednote.net>
Silvia Pfeiffer writes:
> On Wed, Jun 8, 2011 at 9:30 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> > "Therefore, you cannot rely on the video making progress at the same time as the TTS engine".
> > Possibly not, however in the absence of trick play (which I think would have to cancel any descriptions), one can probably assume the video won't go *faster* than expected. Therefore if you set an internal handler for the assumed end time, then even if the video hasn't reached that point yet because it stalled, no real harm is done issuing a pause.
> The TTS engine might go slower than expected (because it, too, may be
> starved of CPU) and therefore the effect of the video going slower
> than expected would still happen.
> > "I do not know how to inform the browser or a JS when the screen reader has finished reading text in a cross-browser compatible way."
> > Do we need to is my point?
> I think we do, since the TTS engine and the video player are two
> processes that run asynchronously and therefore synchronisation is
> necessary.
> > "Descriptions delivered as audio do not come in the TextTrack. They come in the multitrack API."
> > That's arguing we shouldn't change the design because the design is wrong. To the end user they are both descriptions and serve the same purpose; the user doesn't care what markup tag caused them to come into existence.
> You're assuming that the current design is wrong. Let's analyse that
> before making such an assumption.
> When we deal with text descriptions, they have to be voiced somehow.
> This requires a TTS somewhere in the pipeline.
> When we deal with audio descriptions, they come directly from the
> video element and are thus a native part of the browser and not handed
> through to a TTS.
> I find it hard to see that it is possible to expose these two
> fundamentally different types of content to the user in the same way.
> In particular: audio descriptions will go in sync with the video and
> there is no need to pause the video to display them, while text
> descriptions create the need for extensions of the timeline and the
> pausing behaviour.
> I think they are inherently different and trying to fool the user into
> thinking that they are identical will just lead to problems.
I think the only inherent difference that users will care about is
whether or not the audio recording is of a real human reading the
description or of a TTS engine voicing it. That's a distinction we should be
sure to capture.

As has been noted, the TTS generated audio might be created realtime, or

If generated nonrealtime, it might be server generated and delivered as
recorded audio, in which instance we should not conflate it by equating
it with human narration in our tagging, even though it can effectively
"play" the same way.

We also need to add to our use cases the situation where the user has
opted for timescale-modified playback. Realtime TTS generation now needs to
compute the actual available time, which differs from what's
indicated for the default playback rate. Please consider that this use case
is not an edge case, as it will be frequently used by students in
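
The arithmetic behind this use case can be sketched as follows. This is
only an illustration in Python, not anything specified for HTML media:
the function names and the sample numbers are hypothetical, and it
assumes the speech duration of a description can be estimated in advance.

```python
def available_seconds(cue_start: float, cue_end: float, playback_rate: float) -> float:
    """Wall-clock time a description cue occupies at the given playback rate."""
    if playback_rate <= 0:
        raise ValueError("playback_rate must be positive")
    # Cue times are in media time; dividing by the rate gives real elapsed time.
    return (cue_end - cue_start) / playback_rate


def pause_needed(speech_seconds: float, cue_start: float, cue_end: float,
                 playback_rate: float = 1.0) -> float:
    """Seconds the video would have to pause so the speech can finish (0 if it fits)."""
    available = available_seconds(cue_start, cue_end, playback_rate)
    return max(0.0, speech_seconds - available)


# Made-up numbers: a 4-second gap in media time, speech estimated at 3 seconds.
print(pause_needed(3.0, 10.0, 14.0, playback_rate=1.0))  # speech fits in the gap
print(pause_needed(3.0, 10.0, 14.0, playback_rate=2.0))  # gap is only 2 s of real time
```

At double speed the same gap leaves half the real time, so a description
that fits at the default rate may need the video to pause.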

> > "So you want them displayed as well as the captions? Always or only when they are also read out?

As I've tried to suggest previously, we shouldn't assume the luxury of
limiting how audio plus video display of alternative content might need
to be combined. A couple of quick use cases:

*	Low vision people FREQUENTLY want to SEE content as well as HEAR
	it. Comprehension is enhanced this way. This kind of feature is
	common to magnification software like ZoomText, and users will
	expect the same from media playback.

*	Learning disabled users also benefit from seeing plus hearing.
	More elaborate support software for this population might
	highlight words as they're being spoken--a bit of a challenge,
	certainly.

As to displaying both captions and descriptions, I suspect the users who
will most want this aren't going to care about timelines or the actual
video, because they can't see the video--or hear the audio. Thus,
interleaved caption + description could well make sense.
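
One way to picture interleaving is a merge of two cue lists by start
time. The sketch below is a hypothetical Python model, not a browser
API: the Cue type, the sample cues, and the assumption that each list is
already sorted by start time are all made up for illustration.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Cue(NamedTuple):
    start: float  # media time in seconds
    kind: str     # "caption" or "description"
    text: str


def interleave(captions: Iterable[Cue], descriptions: Iterable[Cue]) -> Iterator[Cue]:
    """Merge two cue lists, each already sorted by start time, into one stream."""
    return heapq.merge(captions, descriptions, key=lambda cue: cue.start)


# Made-up sample cues.
captions = [Cue(1.0, "caption", "Hello."), Cue(6.0, "caption", "Look out!")]
descriptions = [
    Cue(0.0, "description", "A woman enters the room."),
    Cue(4.0, "description", "A vase falls."),
]

for cue in interleave(captions, descriptions):
    print(f"[{cue.kind}] {cue.text}")
```

The merged stream could then be sent as plain text to a Braille display
or similar output, with no dependence on the video timeline.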

Lastly, let me reiterate we need more feedback from the wider WAI
community on these use cases. It's good we're discussing this, but it's
a bit premature to try and specify API requirements at this point, imho.


>What screen real estate are you expecting to use? Can you provide an example as a use case?"
> > They would be presented as both captions and descriptions, so they are displayed when the user selects them in the caption menu and for their allotted duration. I'm expecting the author to determine the screen real estate exactly as they do for other captions. I demoed an example at the f2f if you recall. I'll check tomorrow whether it's still online.
> Does selecting them in the captions menu automatically mean they have
> to be shown on the screen? We have to be careful about the
> consequences: we are introducing two new states, making it four states
> that an audio description track can be in: off, on and voiced, on and
> visible, on and visible and voiced. A single entry in a menu will
> no longer suffice to select an audio description track. This
> single change creates heaps of new complexity.
> If an author really wants to display the text descriptions as text,
> right now they would use some javascript to do so. Is that not
> sufficient? Should we not wait and see how large the need for such a
> feature is rather than jumping to conclusions on a feature that
> doesn't exist anywhere else yet?
> > "Screen readers provide the interface to the Braille devices."
> > Screen readers are certainly the primary providers of text to a Braille device, but it's basically an output port; other processes, like the media subsystem, could potentially use it too. I don't think it's a given that we'd assume descriptions (which as you say aren't generally on the screen, and aren't in the DOM), should actually be read by a screen reader.
> They are in the shadow dom and there is a JavaScript API for them.
> They exist more in the page than other external content such as e.g.
> picture, audio or video data.
> > I am still not 100% on board with the idea that text track descriptions should be relying on the presence of a screen reader, since a SR is going to be doing a lot of other things related to navigation on the page. I'm not sure SR designers have even considered this use case.
> Probably not yet. I am starting discussions on the IA2 mailing list to
> see what people are thinking about it, since it would be there where
> the most impact would be felt.
> The issue is that SR and video playback have to interact
> constructively. You can't just have them as completely separate
> modules. The screen reader has control over an audio description track
> of the video element - why should it not have control over a text
> description track, too? Also, right now screen readers are the only
> TTS engine we get for Web pages, so if we don't make use of them for
> text descriptions, we can't do anything with text descriptions. What
> alternative do we have?
> Cheers,
> Silvia.


Janina Sajka,	Phone:	+1.443.300.2200

Chair, Open Accessibility	janina@a11y.org	
Linux Foundation		http://a11y.org

Chair, Protocols & Formats
Web Accessibility Initiative	http://www.w3.org/wai/pf
World Wide Web Consortium (W3C)
Received on Wednesday, 8 June 2011 15:02:02 UTC
