Re: [media] how to support extended text descriptions from Silvia Pfeiffer on 2011-06-06 (public-html-a11y@w3.org from June 2011)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Tue, 7 Jun 2011 00:04:34 +1000
To: public-html-a11y@w3.org
Message-ID: <BANLkTi=DvkRx_iAPHC3HasMkYTSzr3oWKQ@mail.gmail.com>
On Mon, Jun 6, 2011 at 11:33 PM, Janina Sajka <janina@rednote.net> wrote:
> Hi, Silvia:
>
> Silvia Pfeiffer writes:
>> Hi Janina,
>>
>> Actually, your approach and mine are identical as far as I can tell. I'm
>> just focused on one particular technical issue, while you have
>> described the big picture. I apologize for not having expressed it
>> well: I simply assumed we were all on the same page already (which, in
>> fact, we are, but weren't able to communicate). Your email and mine
>> looked at different aspects of the "extended text description"
>> problem, which is good.
>>
> I agree we're pretty close.  Thanks for clarifying that we're not really
> talking about a generalized application, by the bye.
>
> I do think, though, that we have a couple misunderstandings about how
> voicing extended descriptions needs to work:
>
>
> 1.)     Screen readers --- While reading out textual descriptions could
> be loosely understood as "screen reading," I wonder if we introduce
> confusion by using such terms for this particular feature. Managing a
> time-sensitive text stream has not been a function provided by that
> class of assistive technologies known as screen readers in their 30 year
> history. To date the leading edge of anything similar is ARIA Live
> Regions, which I suspect isn't a solution here because we'd be
> overloading what Live Regions is designed to do. I mention it only to
> set it aside explicitly.
>
>        I think we want the option to voice texted descriptions using a
>        voice different from the one the screen reader is using, very
>        possibly output through a different device than the screen
>        reader uses, or at least at a panning position distinct from the
>        screen reader's. Independent volume control is also going to be
>        important, but not so much as it's unlikely the screen reader
>        will be voicing its ordinary functionality while the video
>        resource is playing. I expect the user will continue to rely on
>        the browser and screen reader to start, stop, pause, fast
>        forward, rewind, navigate structurally, etc. However, while the
>        video is actually playing, only the descriptions should be
>        voiced--and these are a new kind of functionality.


OK. I think this needs to go into the design of new accessibility
APIs, such as the new IA2 version etc. None of this needs to go into
the HTML5 spec, IIUC.


>                Now this is all my view of things, and we don't want to
>                design based solely on my view. So, I will ask WAI-UAWG
>                for a wider discussion on behavior.
>
>                2.)     The TTS engines are all over the map on how
>                they're addressed. So, I hold out no hope that the
>                browser would talk directly to TTS, unless we were to
>                lock in a particular TTS, and I don't think that will be
>                acceptable to anyone. Fortunately, there are APIs for
>                this, but I've no good sense of how similar the various
>                APIs are, or are not.

The TTS engines interact with the browser through a11y APIs such as
IA2. These APIs are part of browsers, but not part of HTML5, IIUC.
That's how I understand things to work.


>                3.)     All the above to come to what authors should,
>                and should not do. My suggestion is to not overwork
>                this. Clearly, 250 words for a 10 sec. segment between
>                dialog is inappropriate. But I don't think there's much
>                profit in going further than a general rule of X words
>                per minute per Y secs. of available time. Users will
>                adjust rate of words per minute anyway to a pace
>                comfortable for them. This is an individual thing. Some
>                will shift to the slow side, most will set a faster rate
>                than most people would think workable. So, to go beyond
>                a reasonable rule about the number of words per explicit
>                segment size is likely only to frustrate users who
>                expect these things to be under their control.

Yup, agreed. I wouldn't even want to try to prescribe a rule,
certainly not within HTML5. This should be a best practice as provided
by a11y experts.


> So, I wonder whether there's a need here for anything in the specs?

Yup, agreed. I don't really think anything has to go into HTML5. It's
all in the a11y API and between TTS and that API.

> Specifically, why not just allow the voicing app, whether independent as
> I propose, or newly added to existing screen readers, why not simply
> allow it to rely on the pause/resume functionality already provided?

That's exactly what I was proposing. I was, however, trying to
identify which part of the pause/resume functionality would be most
appropriate. In particular, it's not possible to just use pause() and
wait for the duration of the text description and then assume that
the video has indeed reached the end of that duration, because the
video may have stalled in the middle or may have decoded more slowly
because it ran out of CPU. Thus, we cannot rely on the browser playing
back the video at an expected speed, but rather we have to interact
with the browser events and states for reaching the end time of the
text cue.
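
To illustrate the point, here is a minimal sketch of a voicing module
that is driven by the player's reported position and by the speech
engine's start/end callbacks rather than by a timer. Every name in it
(FakePlayer, DescriptionController, the on_* handlers) is hypothetical
and stands in for whatever events a real browser, a11y API or TTS engine
would provide; it is not any existing API.

```python
from dataclasses import dataclass


@dataclass
class FakePlayer:
    """Hypothetical stand-in for a video player: only a paused flag."""
    paused: bool = False

    def pause(self) -> None:
        self.paused = True

    def play(self) -> None:
        self.paused = False


class DescriptionController:
    """Holds the video at the cue's end time while speech is still running."""

    def __init__(self, player: FakePlayer, cue_end: float) -> None:
        self.player = player
        self.cue_end = cue_end        # end time of the text cue, in seconds
        self.speaking = False
        self.paused_by_us = False

    def on_speech_start(self) -> None:
        self.speaking = True

    def on_time_update(self, current_time: float) -> None:
        # Uses the position the player reports, not a wall-clock timer, so a
        # stalled or slowly decoding video cannot put us out of sync.
        if self.speaking and current_time >= self.cue_end and not self.player.paused:
            self.player.pause()
            self.paused_by_us = True

    def on_speech_end(self) -> None:
        self.speaking = False
        # Resume only if we caused the pause, never a pause the user asked for.
        if self.paused_by_us:
            self.paused_by_us = False
            self.player.play()
```

If the speech finishes before the reported position reaches cue_end,
on_time_update never pauses and playback is not interrupted at all.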


>> Now to the problem that I wanted to solve and that you call
>> "intelligent piping". The time that is required to voice the text of a
>> TextTrackCue is unknown to the author of the TextTrackCue and to the
>> browser, because the TTS engine controls the speech rate and therefore
>> the length of time that it takes to voice the text. Therefore, neither
>> the author nor the browser will know for how long to pause the video to
>> allow the TTS engine to finish speaking the text.
>>
>
> As I've suggested above, I think it's wrong to ask authors to worry
> about this overmuch. I suggest control of pausing needs to rest with
> whatever app is voicing the descriptive text.

We agree.


>> The TextTrackCue author can certainly make a good guess as to how much
>> time is available in a dialog pause and try to accommodate the length
>> of the descriptive text to that pause, but he can never be 100%
>> certain that it's right, because the TTS of every user may be
>> different. The browser is in a similar position, because it doesn't
>> know how long the TTS engine is speaking.
>>
> But the app managing the voicing knows how long the available segment
> is, and it knows the words per minute rate it has instructed the TTS to
> use. In addition, newer TTS engines provide callbacks, so it's quite
> possible to manage this quite smoothly.

As long as "the app managing the voicing" also reacts to events and/or
states of the browser's video player, we're set.


>> The only module that knows how long it will take to voice the
>> description text is the TTS engine itself.
>>
> This is wrong. TTS rates are exposed and easily adjusted by other apps.
> A screen reader control panel includes such things, and any app can do
> this.

The browser doesn't know the TTS rate and therefore cannot react to
it. Also, you wouldn't want the browser to change the TTS rate
automatically just to make sure a piece of text fits within the
available cue duration - in particular if that cue duration could
potentially be 0. The user should be in control of the TTS rate. And
because the user controls the TTS rate, the number of words that can
be spoken in the given gap cannot be predicted. Therefore the only
component that knows how long a certain piece of text is taking (or
rather: that knows if there will be more to read out) is the TTS engine
or the module that controls the TTS engine.
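
A small worked example of why the gap cannot be sized in advance. It
assumes, purely for illustration, that a TTS engine speaks at a constant
words-per-minute rate (real engines vary with punctuation and prosody);
the function names and the numbers are made up.

```python
def speech_seconds(word_count: int, words_per_minute: float) -> float:
    """Time needed to voice word_count words at a constant rate."""
    return word_count * 60.0 / words_per_minute


def overrun_seconds(word_count: int, words_per_minute: float,
                    gap_seconds: float) -> float:
    """How far the speech runs past the available gap (0.0 if it fits)."""
    return max(0.0, speech_seconds(word_count, words_per_minute) - gap_seconds)


# The same 30-word description in the same 10-second gap:
slow_user = overrun_seconds(30, 150, 10.0)   # 12 s of speech, 2 s too long
fast_user = overrun_seconds(30, 300, 10.0)   # 6 s of speech, fits
```

The author and the browser see the same cue in both cases; only the
component that knows the user's rate can tell them apart.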


> For an idea of what's possible, take a look at:
>
> http://sourceforge.net/projects/ibmtts-sdk/files/
>
>
> This is an API for the IBM version of one of the most popular voices out
> there--despite its glaring flaws.
>
>> Now, this is where we have to find a solution. Your suggestion of
>> introducing an application that would look after the synchronisation
>> between the TTS engine and the browser is one way to do it, but I
>> would prefer if we can solve the extended text description problem
>> with existing browsers and existing screen readers and just some
>> minimal extra functionality.
>>
> Again, my point is that there's nothing the current crop of screen
> readers has to offer here. This is a different kind of job, not one
> they're designed to do.

Yes, that is true. But in order to make this work, we have to change them.


>> Here's how I can see it working:
>>
>> If the browser exposes to the TTS engine the current time that the
>> video is playing at, and the end time of the associated TextTrackCue
>> by which time the description text has to be finished reading, then
>> the TTS engine has all the information that it needs to work out if
>> the text can be read during the given time frame.
>>
> But, TTS doesn't typically do this kind of calculation. It's just a TTS,
> nothing more.

It doesn't need to calculate anything. It reacts to states that it
knows about itself and that it is given about the video player.

Cheers,
Silvia.
Received on Monday, 6 June 2011 14:05:22 UTC