- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Wed, 5 May 2010 09:46:56 +1000
- To: HTML Accessibility Task Force <public-html-a11y@w3.org>
Not sure if this came through to the list, but here is a copy. Silvia.

---------- Forwarded message ----------
From: Masatomo Kobayashi <MSTM@jp.ibm.com>
Date: 2010/5/5
Subject: Re: Format Requirements for Text Audio Descriptions (was Re: HTML5 TF from my team)
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: Hironobu Takagi <TAKAGIH@jp.ibm.com>, John Foliot <jfoliot@stanford.edu>, public-html-a11y@w3.org

Hi Silvia and John,

I am happy to join this discussion and sorry for the late response. I have just read the wikis you mentioned and the recent discussions on the a11y TF mailing list. I emailed Mike, but I think I am not on the TF members list yet, so please forward this reply to the list if necessary.

At this time, the additional requirements based on our research include:
a) behavior when overflowing
b) extended audio descriptions
c) support for SSML

If you have already discussed these topics, I would appreciate it if you could send me any related links.

a) Behavior when overflowing

The current proposals do not seem to explicitly cover the case in which the screen reader has not finished reading out a description sentence by the 'end time'. This can happen for at least three reasons:

- A typical author of textual audio descriptions does not have a screen reader, so s/he cannot check whether the sentence fits within the time frame. Even with a screen reader at hand, a different screen reader may take longer to read out the same sentence;
- Some screen reader users (e.g., elderly people and people with learning disabilities) may slow down the speech rate; or
- A visually complicated scene (e.g., figures on a blackboard in an online physics class) may not be describable in sufficient detail within any silent interval in the original audio track.

So the specification should allow authors to specify the behavior for this case. The options would include:

- none -- continue to read out the sentence even after the end time. This may talk over important information in the video.
- clip -- force the screen reader to stop reading out the sentence at the end time. This may cause the user to miss important information in the sentence.
- extend -- pause the video at the end time until the screen reader finishes reading out the sentence. This may require an additional mechanism beyond "aria-live: assertive", but at least our prototype aiBrowser can do it.

This option could be specified as an attribute:

  <track src="..." type="text/srt" role="textaudesc" overflow="extend"></track>

or in CSS, like the 'overflow' property for a visual element:

  track[role=textaudesc] { audio-overflow: extend; }

For now only 'textaudesc' tracks need this mechanism, but in the future other types of tracks, such as synthesized sign language, may need to be covered as well.

b) Extended audio descriptions

As mentioned above, a visually complicated scene may not be fully describable within a silent space in the original audio track. For that case, guidelines for audio descriptions recommend using extended descriptions, and WCAG 2.0 (Level AAA) also includes them.

In our experiments, the use of extended descriptions brought two important advantages. First, it was nearly impossible to sufficiently describe certain kinds of instructional video without extended descriptions. Second, it allowed a novice describer to effectively describe at least a short video, because it did not require the special skill of fitting an appropriate description into a very limited time frame. I think these advantages strongly encourage us to include extended descriptions in the proposal.

If we have the 'overflow=extend' attribute introduced above, we will be able to make a fully extended description simply by specifying the same time for both the 'begin' and 'end' times, as sketched below.
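To illustrate (the cue text here is made up, and whether a given srt parser accepts a zero-length cue is an open question), such an extended description in plain srt could look like this:

  1
  00:01:15,000 --> 00:01:15,000
  The teacher draws a right triangle on the blackboard
  and labels its sides a, b, and c.

Because the begin and end times are identical, the cue has no time slot of its own; with overflow=extend the video would pause at 00:01:15 until the screen reader has finished the sentence, while with 'none' or 'clip' it would never be heard. So this technique only makes sense in combination with the 'extend' behavior.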
c) Support for SSML

The SubRip srt format can produce minimal audio descriptions, and even minimal descriptions can greatly help people who are blind or have low vision. But just as captions sometimes need TTML instead of srt, textual audio descriptions may need SSML (and EmotionML) to produce richer descriptions. This covers the voice gender, speech rate, volume, prosody, emotions, pre-recorded audio files, etc.

Our experiments indicated that speech quality is more critical for audio descriptions than for usual screen reading, for two reasons. First, users often want to relax when watching a video. Second, they are forced to frequently switch their attention between synthesized descriptions and human conversations in the original audio track.

As common TTS engines such as Cepstral already support SSML, we can consider it in the proposal. I think all a Web browser needs to do is pass the SSML through to the TTS engine.
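As a concrete (hand-written, hypothetical) sketch of what the body of such a description cue might look like in SSML 1.0 -- the file name and text are made up:

  <?xml version="1.0"?>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    <voice gender="female">
      <prosody rate="slow" volume="soft">
        The teacher writes a formula on the blackboard.
      </prosody>
      <!-- prefer a pre-recorded reading if available; the element's
           text content is the synthesized fallback -->
      <audio src="formula-description.wav">
        a squared plus b squared equals c squared
      </audio>
    </voice>
  </speak>

This would express the voice gender, speech rate, volume, and a pre-recorded alternative without defining new fields in the timed text format itself.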
If you are interested in our prototype software that supports textual audio descriptions, see:

ScriptEditor, http://www.eclipse.org/actf/downloads/tools/ScriptEditor/
aiBrowser, http://www.eclipse.org/actf/downloads/tools/aiBrowser/

The XML format we are using includes, for each sentence, a boolean field indicating whether the description is extended, the speech rate, the voice gender, and an optional pointer to a pre-recorded alternative, in addition to the time and text. I think that "extended" descriptions should be supported at least in some way, as described above. The latter three additional fields will be covered by SSML; I am not sure they should be explicitly supported if SSML is not included in the specification.

Please let me know if you have any questions and comments on my thoughts.

Regards,
Masatomo

Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote on 2010/04/27 07:41:48:

> From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
> To: Hironobu Takagi/Japan/IBM@IBMJP, Masatomo Kobayashi/Japan/IBM@IBMJP,
>   John Foliot <jfoliot@stanford.edu>
> Cc: public-html-a11y@w3.org
> Date: 2010/04/27 07:42
> Subject: Format Requirements for Text Audio Descriptions (was Re: HTML5 TF
>   from my team)
>
> On Tue, Apr 27, 2010 at 4:52 AM, John Foliot <jfoliot@stanford.edu> wrote:
> > Hironobu Takagi wrote:
> >>
> >> Finally, one of my team members will officially
> >> join the HTML5 Accessibility Task Force.
> >> Masatomo has been working on the audio description
> >> project, especially the experiments in Japan
> >> and in the US with WGBH.
> >> He can be the bridge between the open source authoring
> >> tool work and aiBrowser on Eclipse.org.
> >> He is now checking resources on the Web.
> >> If you have any suggestions for our involvement (beyond
> >> the mailing list), please let us know.
> >> We are looking forward to working with you.
> >
> > Hiro, this is great news! Hello and welcome to Masatomo. (For those
> > unaware or who do not remember, Hiro presented IBM Research – Tokyo's
> > work on descriptive audio using synthesized voice at the Face-to-Face
> > here at Stanford last November, and he and his team were also at the
> > CSUN conference in March in San Diego. It is - IMHO - wicked cool! A
> > Word Doc can be found here:
> > http://www.letsgoexpo.com/utilities/File/viewfile.cfm?LCID=4091&eID=80000218
> > and perhaps Hiro you could point us to web-based [HTML] resources too?)
> >
> > Masatomo, you might want to start by reviewing the draft specifications
> > that are currently under discussion:
> >
> > http://www.w3.org/WAI/PF/HTML/wiki/Media_MultitrackAPI
> > and
> > http://www.w3.org/WAI/PF/HTML/wiki/Media_TextAssociations
> >
> > Silvia Pfeiffer recently wrote a blog post that is more easily readable
> > as an introduction, but as she notes it is not as technically accurate.
> > It can be found at
> > http://blog.gingertech.net/2010/04/11/introducing-media-accessibilit-into-html5-media/
> >
> > There has also been a fair bit of discussion recently about choosing one
> > or more appropriate time-stamp formats to be referenced in the HTML5
> > Specification/Standard - this discussion is very much up in the air at
> > this time.
> >
> > As well, while not officially 'W3C', the WHATWG has started collecting
> > examples of time-aligned text displays (captions, subtitles, chapter
> > markers etc.) and is extrapolating requirements from these in their
> > wiki at:
> >
> > http://wiki.whatwg.org/wiki/Timed_tracks
> > http://wiki.whatwg.org/wiki/Use_cases_for_timed_tracks_rendered_over_video_by_the_UA
> > http://wiki.whatwg.org/wiki/Use_cases_for_API-level_access_to_timed_tracks
> >
> > (I believe screen captures, etc. of your work with descriptive text
> > would be relevant here!)
>
> Let me chime in here, since it is right now particularly relevant to
> you with textual audio descriptions.
>
> You will find at http://wiki.whatwg.org/wiki/Timed_tracks several
> mentions of "text audio descriptions".
>
> The assumption of that page is that textual audio descriptions do not
> require more than the following information in a format:
> * start time
> * end time
> * text
> * possibly the voice to choose to read back
>
> Are there any other requirements that you have come across in your
> work with textual audio descriptions? What do the files that you are
> using as input to your speech synthesis system for audio descriptions
> look like? Do they have any special fields that would need to be taken
> care of in a standardised storage format for textual audio
> descriptions?
>
> Cheers,
> Silvia.
Received on Tuesday, 4 May 2010 23:47:51 UTC