RE: [media] how to support extended text descriptions

"Therefore, you cannot rely on the video making progress at the same time as the TTS engine ".
Possibly not; however, in the absence of trick play (which I think would have to cancel any descriptions), one can probably assume the video won't go *faster* than expected. Therefore, if you set an internal handler for the assumed end time, then even if the video hasn't reached that point yet because it stalled, no real harm is done in issuing a pause.
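
To make that concrete, a minimal sketch (ttsSpeak and its end-of-utterance callback are placeholders for whatever the speech API actually provides; the cue and video objects are the standard HTML5 ones):

  // Pause at the cue's assumed end time whether or not playback got there,
  // then resume once the TTS signals it has finished the utterance.
  function speakDescriptionCue(video, cue, ttsSpeak /* hypothetical async TTS call */) {
    var budgetMs = Math.max(0, (cue.endTime - video.currentTime) * 1000);
    var done = false;

    // Internal handler for the assumed end time. If the TTS isn't done by
    // then, issue a pause - harmless even if the video stalled earlier.
    var timer = setTimeout(function () {
      if (!done && !video.paused) video.pause();
    }, budgetMs);

    // Assumes the cue exposes its plain text as .text.
    ttsSpeak(cue.text, function onUtteranceEnd() {
      done = true;
      clearTimeout(timer);
      if (video.paused) video.play();  // resume if we had to pause
    });
  }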

" I do not know how to inform the browser or a JS when the screen reader has finished reading text in a cross-browser compatible way. "
Do we need to, though? That's my point.

" Descriptions delivered as audio do not come in the TextTrack. They come in the multitrack API. "
That's arguing we shouldn't change the design by appealing to the design itself, when the design is what's wrong. To the end user they are both descriptions and serve the same purpose; the user doesn't care what markup tag caused them to come into existence.

"So you want them displayed as well as the captions? Always or only when they are also read out? What screen real estate are you expecting to use? Can you provide an example as a use case?"
They would be presented as both captions and descriptions, so they are displayed when the user selects them in the caption menu and for their allotted duration. I'm expecting the author to determine the screen real estate exactly as they do for other captions. I demoed an example at the f2f if you recall. I'll check tomorrow whether it's still online.

"Screen readers provide the interface to the Braille devices."
Screen readers are certainly the primary providers of text to a Braille device, but it's basically an output port; other processes, like the media subsystem, could potentially use it too. I don't think it's a given that descriptions (which, as you say, aren't generally on the screen and aren't in the DOM) should actually be read by a screen reader.
I am still not 100% on board with the idea that text track descriptions should rely on the presence of a screen reader, since a screen reader is going to be doing a lot of other things related to navigation on the page. I'm not sure screen reader designers have even considered this use case.

-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com] 
Sent: 07 June 2011 01:16
To: Sean Hayes
Cc: public-html-a11y@w3.org
Subject: Re: [media] how to support extended text descriptions

On Tue, Jun 7, 2011 at 12:28 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> In SAPI you can a) set an allotted time for the utterance, b) get callbacks within the TTS engine during and at the end of the utterance, and c) measure elapsed time and set callback timers. I assume other speech APIs are similar.
> So all you need to know from the browser is the latest time by which you must be finished - which you can derive from the cue times and currentTime - and have methods to pause and start.


That would work if you could rely on the browser progressing playback
in real time. However, it's video, and therefore there could be a
network stall or the CPU could be busy and stall the decoding of the
video. Therefore, you cannot rely on the video making progress at the
same time as the TTS engine.



> It would be much better if we could think of a solution that works with A11y APIs as they are than hypothesize new ones; they tend to change very slowly. Adding the ability to call JavaScript to the A11y API is in my mind a non-starter, but I'll leave that to the people more directly involved in that work to argue that point.

I do not know how to inform the browser or a JS when the screen reader
has finished reading text in a cross-browser compatible way. I think
we can only do best effort right now by careful authoring with average
word rates. A reliable solution is only possible, IIUC, with changes to
the a11y API. And since Janina says that there is work happening on
a11y APIs right now, we should get involved now rather than avoid the
issue and be satisfied with sub-optimal solutions.


> Descriptions delivered as audio have a volume. Sending them as text is really only a difference of encoding. Maybe the timed text should actually be referenced from an <audio> element here, but I'm not expecting elegance, just equivalence.

Descriptions delivered as audio do not come in the TextTrack. They
come in the multitrack API. I'm not talking about that here. I am only
referring to text descriptions that are expected to be voiced by a TTS
engine. They are like any other text on the page, and while we can give
them special attributes in the screen reader, I'd still expect screen
readers to voice them.


> "That is possible. You can have those tracks. But what would be the reason to have the text descriptions displayed on screen, when they are clearly designed to be read out?"
>
> Because comprehension can be vastly better if you have the redundancy of both reading and listening at the same time.

So you want them displayed as well as the captions? Always or only
when they are also read out? What screen real estate are you expecting
to use? Can you provide an example as a use case?


> Screen readers are not termed Braille devices, or vice versa, although they are often used together. So that's why I referred to it as AT, rather than specifically TTS.

Screen readers provide the interface to the Braille devices.
Therefore, we don't need to explicitly talk about Braille devices,
since they are given their text from the screen reader. In my
understanding AT is a much broader term incorporating more than just
TTS and Braille.


Cheers,
Silvia.



> -----Original Message-----
> From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
> Sent: 06 June 2011 14:39
> To: Sean Hayes
> Cc: public-html-a11y@w3.org
> Subject: Re: [media] how to support extended text descriptions
>
> On Mon, Jun 6, 2011 at 11:22 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>> I'm imagining it being fairly difficult to have the AT register an event handler on the media subsystem using the AT API; AFAIK none of the APIs today support that, but I'm open to being mistaken. The idea is that if the AT is exposing the right data and methods (something I'm confident they can do), the TTS engine can work with the browser more passively.
>
> How is that supposed to work? The AT will not know how long it will
> take to render a piece of text until it is finished rendering (also
> because there may be user interactions). So, I don't think it is
> possible for the AT to tell the UA how long a cue will take.
>
> Therefore, it is necessary to add to the a11y API a more active
> interface that can e.g. execute some JavaScript. The Mozilla a11y API
> developers seemed to think that it was possible to do that.
>
>
>> Text tracks don't have a volume today no, however clearly they will need something like that if they are producing audio; perhaps we can rely on the system audio mixer, but I'd guess authors would want to be able to control it. In a decent design, descriptions delivered as text and descriptions delivered as audio would have closer, if not identical, behavior and API.
>
> Text tracks don't have a volume because they are text. I suspect they
> will not have a volume in the future either. However, screen readers
> have a volume and that is already controllable by the user. I don't
> think that script needs a means to control that.
>
>
>> "Text descriptions as defined right now don't have a visual representation. Is there a use case for doing so?"
>> -- Yes: cognitive disabilities, e.g. the autism spectrum.
>> You should be able to point a caption track and a description track at the same timed text file and it should work.
>
> That is possible. You can have those tracks. But what would be the
> reason to have the text descriptions displayed on screen, when they
> are clearly designed to be read out?
>
>
>> Your text is a little more specific about what "interact with AT" implies, but other than that I think they would achieve the same thing. I don't have any preferences for wording, although I was trying to stay fairly generic. I would note that descriptions, for example, may also go to something other than a TTS engine (e.g. Braille) but still require pausing; in such a case it might require the device detecting that the user had finished reading the Braille.
>
> Yes, that's what screen readers already do IIUC.
>
> Cheers,
> Silvia.
>
>
>
>> -----Original Message-----
>> From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
>> Sent: 06 June 2011 13:42
>> To: Sean Hayes
>> Cc: public-html-a11y@w3.org
>> Subject: Re: [media] how to support extended text descriptions
>>
>> Hi Sean,
>>
>> On Mon, Jun 6, 2011 at 9:08 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>>> Well it seems to me that checking the paused flag for false and currentTime and calling pause is no more or less onerous than checking the paused flag for true and calling play; but that could depend on the specific implementations I guess.
>>
>> I don't understand this.
>>
>> In the case with the "onExit" event, we won't look at currentTime and
>> we won't look at the "pauseOnExit" flag. The screen reader would just
>> register an onExit event on the current cue, which calls pause() on
>> the video. The screen reader then, when it's finished reading out the
>> text, calls play() on the video again and removes the event handler.
>>
>> In the case of using the "pauseOnExit" flag of the cue, the screen
>> reader would first set the "pauseOnExit" flag of the cue, then start
>> reading out the text. When it's finished reading, it will set the
>> "pauseOnExit" to false and call play() again.
>>
>> In neither of these cases do we use currentTime or the paused flag.
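>>
>> In code, the two variants look roughly like this (just a sketch - how the
>> screen reader actually gets hold of the cue object is exactly the open
>> a11y API question):
>>
>>   // Variant 1: register an exit handler on the currently voiced cue.
>>   cue.onexit = function () { video.pause(); };
>>   // ...later, once the screen reader has finished reading the text:
>>   cue.onexit = null;
>>   video.play();
>>
>>   // Variant 2: use the pauseOnExit flag instead of an event handler.
>>   cue.pauseOnExit = true;
>>   // ...once the screen reader has finished reading the text:
>>   cue.pauseOnExit = false;
>>   video.play();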
>>
>>
>> Since @pauseOnExit will always need to be set for descriptions to work properly with TTS, I think it might be better if it were assumed for the description track kind (or maybe implied by a non-zero @volume on a text track), and the mechanism were left up to the implementation and not exposed to the author.
>>
>> TextTracks don't have @volume, but we can identify those TextTracks
>> that are active through mode=SHOWING.
>>
>> I agree that we don't need to expose anything to the author. I tried
>> to establish that we don't have a need for this. I am, however, trying
>> to describe what a screen reader would need to do to get text
>> descriptions to work in an automated fashion without the user having
>> to constantly pause the video and play it again.
>>
>> It is an idea to have @pauseOnExit be true by default, but I think it
>> makes sense only when we have a screen reader attached. We don't want
>> to create a situation where the video keeps getting paused just
>> because we have a text description track activated.
>>
>>
>>> I would also note that if you want to simultaneously display the text of a description visually while it is read out, for cognitive use; then pauseOnExit is precisely the wrong time to halt the video, since the cue will have disappeared from screen when the pause happens. This means that the description track will have to have different timing when used visually and when used aurally.
>>
>> Text descriptions as defined right now don't have a visual
>> representation. Is there a use case for doing so?
>>
>>
>>> We could just write something like:
>>> User agents must interact with assistive technology in such a way as to allow sufficient time for cues marked as descriptions to be read out by AT before the marked end time of the cue occurs in the playing video (this may involve pausing and restarting the video for long descriptions).
>>
>> I don't think there is anything that a Web browser can do to provide
>> for sufficient time, since it doesn't know how far along the screen
>> reader is. Thus, I don't think this text would make any difference.
>>
>> I think we could write something like the following:
>> Text tracks of kind "descriptions" are intended for TTS engines, e.g.
>> through a screen reader for blind users. The screen reader may
>> determine that the video needs to be paused for a cue to be rendered
>> in the available cue duration. The user agent needs to accept having
>> the video playback be controlled by the screen reader.
>>
>> Cheers,
>> Silvia.
>>
>>
>>> -----Original Message-----
>>> From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
>>> Sent: 06 June 2011 10:39
>>> To: Sean Hayes
>>> Cc: public-html-a11y@w3.org
>>> Subject: Re: [media] how to support extended text descriptions
>>>
>>> On Mon, Jun 6, 2011 at 7:31 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>>>> If the TTS engine can know - presumably through the AT APIs - the video is paused, and call play() to restart the video, why can it not also know the video is still playing as it runs out of time and call pause(), obviating the need for pauseOnExit?
>>>
>>> That's the first proposal with registering an event handler on the
>>> "onexit" event. Either one should work (either pauseOnExit or the
>>> onexit event). The use of pauseOnExit is probably cheaper and faster
>>> than registering an event handler, since it doesn't require video and
>>> TTS to synchronize. But I'd be curious to hear from the video
>>> developers on this.
>>>
>>> Or are you intending that the JS call play(), in which case there would need to be an event (e.g. onTextReadingFinished or some such) on cues?
>>>
>>> No, the screen reader has to do that. This should all work without any
>>> JS in the middle.
>>>
>>> Cheers,
>>> Silvia.
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: public-html-a11y-request@w3.org [mailto:public-html-a11y-request@w3.org] On Behalf Of Silvia Pfeiffer
>>>> Sent: 06 June 2011 09:26
>>>> To: public-html-a11y@w3.org
>>>> Subject: Re: [media] how to support extended text descriptions
>>>>
>>>> I've chatted with one of the Mozilla a11y developers and it seems he
>>>> favors the third option with the "pauseOnExit" flag. He also pointed
>>>> out that this requires additions to the accessibility APIs, in
>>>> particular to IA2 and others. Do people know more about this and what
>>>> other APIs would need to be adapted, too?
>>>>
>>>> Silvia.
>>>>
>>>> On Mon, Jun 6, 2011 at 5:55 PM, Silvia Pfeiffer
>>>> <silviapfeiffer1@gmail.com> wrote:
>>>>> Incidentally, I have another question and something that we may need
>>>>> to at least mention in the HTML spec:
>>>>> right now I am assuming that captions should not be exposed to screen
>>>>> readers. I think it would be really annoying if everything that is
>>>>> being said in the video is being repeated by the screen reader over
>>>>> the top of the video's audio. Does this also match the general opinion
>>>>> here?
>>>>>
>>>>> If so, what about subtitles - in particular subtitles that are
>>>>> provided in a different language? Should a screen reader expose those?
>>>>> Maybe the screen reader should provide the possibility to have one
>>>>> audio track turned on or alternatively a subtitle track, plus one text
>>>>> description track? Thoughts?
>>>>>
>>>>> Silvia.
>>>>>
>>>>> On Mon, Jun 6, 2011 at 12:16 PM, Silvia Pfeiffer
>>>>> <silviapfeiffer1@gmail.com> wrote:
>>>>>> Hi Janina,
>>>>>>
>>>>>> Actually, your approach and mine are identical as far as I can tell. I'm
>>>>>> just focused on one particular technical issue, while you have
>>>>>> described the big picture. I apologize for not having expressed it
>>>>>> well: I simply assumed we were all on the same page already (which, in
>>>>>> fact, we are, but weren't able to communicate). Your email and mine
>>>>>> looked at different aspects of the "extended text description"
>>>>>> problem, which is good.
>>>>>>
>>>>>> We both agree that there is no difference between an "ordinary" text
>>>>>> description and an "extended" text description - any text description
>>>>>> has to be regarded as potentially having a need for time extension
>>>>>> because of the speed settings in the screen reader.
>>>>>>
>>>>>> We also both agree that extended text descriptions require the video
>>>>>> to be paused to allow time for the remaining descriptive text to be
>>>>>> voiced.
>>>>>>
>>>>>> As for your description of the two things that a "module" would need
>>>>>> to be able to do, I think that functionality already exists. In
>>>>>> detail:
>>>>>>
>>>>>>> 1.)     Track time--including how much time is available in between segments of spoken dialog.
>>>>>>
>>>>>> Text descriptions are specified through the TextTrack mechanism as
>>>>>> currently specified in HTML5. A TextTrack consists of a sequence of
>>>>>> TextTrackCues which contain a startTime and an endTime attribute each.
>>>>>> The author of a TextTrack will create a text description in such a way
>>>>>> that they specify TextTrackCues with startTime and endTime that fit in
>>>>>> between segments of spoken dialog. If there is no pause time within a
>>>>>> spoken dialog, it's even possible to specify the startTime and the
>>>>>> endTime to be the same time.
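>>>>>>
>>>>>> For illustration, a sketch (the times are invented and the exact
>>>>>> constructor arguments are an assumption - the point is only the
>>>>>> startTime/endTime pairing):
>>>>>>
>>>>>>   // A 4-second gap in the dialog:
>>>>>>   var cue1 = new TextTrackCue(12.0, 16.0, "She opens the letter.");
>>>>>>   // No gap at all - startTime and endTime may be identical:
>>>>>>   var cue2 = new TextTrackCue(45.0, 45.0, "A storm gathers outside.");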
>>>>>>
>>>>>> Note that this functionality is already available as currently
>>>>>> specified in HTML5.
>>>>>>
>>>>>>
>>>>>>> 2.)     Intelligently pipe text to a TTS engine. The "intelligence" relates to calculating how much time will be required to voice the text that precedes the onset of the next spoken dialog.
>>>>>>
>>>>>> To achieve this, the description text of each TextTrackCue will need to
>>>>>> be handed over to the TTS engine at the time that the TextTrackCue
>>>>>> becomes active. In current implementations, this has been achieved in
>>>>>> JavaScript by using an aria-live attribute on <div> elements that are
>>>>>> created as a TextTrackCue becomes active. A TTS engine that supports
>>>>>> aria-live becomes aware of that text immediately and voices it
>>>>>> immediately.
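>>>>>>
>>>>>> Roughly, that JavaScript approach looks like this (a sketch - which
>>>>>> track holds the descriptions, the live-region element, and how the
>>>>>> cue's plain text is read out of the cue object are all assumptions):
>>>>>>
>>>>>>   // <div id="description" aria-live="assertive"></div> is in the page.
>>>>>>   var track = video.textTracks[0];   // assumed: the descriptions track
>>>>>>   track.oncuechange = function () {
>>>>>>     var cue = track.activeCues[0];
>>>>>>     if (cue) {
>>>>>>       // Updating the live region causes the screen reader to voice it.
>>>>>>       document.getElementById("description").textContent = cue.text;
>>>>>>     }
>>>>>>   };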
>>>>>>
>>>>>> Now to the problem that I wanted to solve and that you call
>>>>>> "inelligent piping". The time that is required to voice the text of a
>>>>>> TextTrackCue is unknown to the author of the TextTrackCue and to the
>>>>>> browser, because the TTS engine controls the speech rate and therefore
>>>>>> the length of time that it takes to voice the text. Therefore, neither
>>>>>> the author nor the browser will know for how long to pause the video to
>>>>>> allow the TTS engine to finish speaking the text.
>>>>>>
>>>>>> The TextTrackCue author can certainly make a good guess as to how much
>>>>>> time is available in a dialog pause and try to accommodate the length
>>>>>> of the descriptive text to that pause, but he can never be 100%
>>>>>> certain that it's right, because the TTS of every user may be
>>>>>> different. The browser is in a similar position, because it doesn't
>>>>>> know how long the TTS engine is speaking.
>>>>>>
>>>>>> The only module that knows how long it will take to voice the
>>>>>> description text is the TTS engine itself.
>>>>>>
>>>>>> Now, this is where we have to find a solution. Your suggestion of
>>>>>> introducing an application that would look after the synchronisation
>>>>>> between the TTS engine and the browser is one way to do it, but I
>>>>>> would prefer if we can solve the extended text description problem
>>>>>> with existing browsers and existing screen readers and just some
>>>>>> minimal extra functionality.
>>>>>>
>>>>>> Here's how I can see it working:
>>>>>>
>>>>>> If the browser exposes to the TTS engine the current time that the
>>>>>> video is playing at, and the end time of the associated TextTrackCue
>>>>>> by which time the description text has to be finished reading, then
>>>>>> the TTS engine has all the information that it needs to work out if
>>>>>> the text can be read during the given time frame.
>>>>>>
>>>>>> If the available time is 0, it would immediately pause the video and
>>>>>> un-pause only when the description text is finished reading or when
>>>>>> the user indicated that he/she wanted to skip over this cue text.
>>>>>>
>>>>>> If the available time is larger than 0, then the TTS engine can start
>>>>>> a count-down timer over the available time and pause the video if this
>>>>>> timer reaches 0 before it has finished reading - then un-pause the
>>>>>> video again when it has finished reading the description text.
>>>>>>
>>>>>> Since this may be rather inaccurate because of the asynchronous nature
>>>>>> of TTS and video playback, we're probably better off if the TTS engine
>>>>>> registers an onexit event handler on the TextTrackCue with the browser
>>>>>> and pauses the video when that onexit event was reached by the browser
>>>>>>  - then un-pauses the video again when it has finished reading the
>>>>>> description text.
>>>>>>
>>>>>> A third alternative is the use of the "pauseOnExit" attribute on the
>>>>>> TextTrackCue. A TTS could set the "pauseOnExit" for every
>>>>>> description-bearing TextTrackCue to "true" and when it has finished
>>>>>> reading out text, it only needs to set the "pauseOnExit" to false (in
>>>>>> case the video hadn't reached the end of the cue yet) and call the
>>>>>> "play()" function on the video to resume playback (in case the video
>>>>>> was indeed paused).
>>>>>>
>>>>>> Note that this is all in addition to the usual interactions that a
>>>>>> user has with a TTS engine: the user can control the skipping of
>>>>>> description text, the user can pause the video (and therefore pause
>>>>>> the reading of description text) and start reading other content on
>>>>>> the page, the user can come back to a previously paused video element
>>>>>> and continue playing from where they were (including the associated
>>>>>> description text), and the user can reset a video to play from the
>>>>>> beginning (including the associated description text).
>>>>>>
>>>>>> The analysis above provides three different ways in which the TTS
>>>>>> engine can interact with the video player to provide extended text
>>>>>> descriptions. Have I overlooked any problems in this approach? The key
>>>>>> issue that I am trying to figure out is whether or not we already have
>>>>>> the required events and attributes available in the current HTML5
>>>>>> specification to satisfy this use case. The above analysis indicates
>>>>>> that we have everything, but I may have overlooked something.
>>>>>>
>>>>>> Best Regards,
>>>>>> Silvia.
>>>>>>
>>>>>>
>>>>>> On Sun, Jun 5, 2011 at 10:29 AM, Janina Sajka <janina@rednote.net> wrote:
>>>>>>> Hi, Silvia:
>>>>>>>
>>>>>>>
>>>>>>> I would suggest relying on the screen reader is asking for unnecessary
>>>>>>> complications. They're not designed for interacting with anything that
>>>>>>> moves through time.
>>>>>>>
>>>>>>> I think there's a simpler way. Please bear with me for a moment and set
>>>>>>> the screen reader aside. The problem is to get the texted description
>>>>>>> voiced during such time as is available in between the spoken dialog
>>>>>>> which is in the media resource. Expressed this way, there's actually no
>>>>>>> functional difference between extended and "ordinary" texted
>>>>>>> descriptions. In other words, by definition we know that extended
>>>>>>> descriptions will require pausing the audio in order to allow time for
>>>>>>> remaining descriptive text to be voiced. However, if the TTS rate is set
>>>>>>> slow enough, this could well also be the functional result of trying to
>>>>>>> squeeze a descriptive string in between segments of recorded audio.
>>>>>>> Fortunately, I would propose both can be handled the same way.
>>>>>>>
>>>>>>> What we need is a module that can do two things:
>>>>>>>
>>>>>>> 1.)     Track time--including how much time is available in between
>>>>>>> segments of spoken dialog.
>>>>>>>
>>>>>>>
>>>>>>> 2.)     Intelligently pipe text to a TTS engine. The "intelligence"
>>>>>>> relates to calculating how much time will be required to voice the text
>>>>>>> that precedes the onset of the next spoken dialog.
>>>>>>>
>>>>>>> Thus, if the time required is longer than that available in
>>>>>>>        the media resource, pause the primary resource long enough for
>>>>>>>        voicing to complete.
>>>>>>>
>>>>>>> No screen reader does anything remotely like this. Certainly, some might
>>>>>>> want to add the capability, but the capability could just as readily
>>>>>>> come from an app that does only this task.
>>>>>>>
>>>>>>> Note that it would be inappropriate to vary rate of TTS speech in order
>>>>>>> to try and "squeeze" text into an available segment inbetween spoken
>>>>>>> dialog.
>>>>>>>
>>>>>>> Note also that the screen reader can be expected to have access to all
>>>>>>> text on screen at any point. If the user is satisfied to rely on the
>>>>>>> screen reader alone, pausing the media and using it to read what's on
>>>>>>> screen is always an option. At times, I would expect users would pause
>>>>>>> the app I've described in order to check the spelling of some term, for
>>>>>>> instance. This is fully in keeping with what a screen reader can do
>>>>>>> today. It's common to set a screen reader NOT to auto read text, yet
>>>>>>> it's still able to voice as the user interactively "reads" across the
>>>>>>> screen word by word, or char by char.
>>>>>>>
>>>>>>> Thus, the only remaining behavior to consider, is whether current
>>>>>>> position is reset by user initiated screen reading activity. May I
>>>>>>> suggest that typical screen reader functionality is again available to
>>>>>>> help us answer that. It's the user's choice. In some cases the user will
>>>>>>> want to resume from where playback was stopped, regardless of what the
>>>>>>> user may have "read" with the screen reader. In other cases, the user
>>>>>>> may choose to indicate "start from here," which is an option most modern
>>>>>>> screen readers support. As we should expect to navigate the media
>>>>>>> structure using the screen reader, this would be in keeping with
>>>>>>> expected functionality, so the plugin app I proposed above needs to be
>>>>>>> capable of receiving a new start point (earlier or later in the
>>>>>>> timeline).
>>>>>>>
>>>>>>> Janina
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
