This document collects some notes on how the Web Content Accessibility Guidelines 2.0 relate to VoiceXML and dialog systems. The intention is to find out which success criteria are hard or impossible to meet in VoiceXML, which success criteria should be modified to take dialog systems into account, and to describe techniques that VoiceXML developers can use to meet the success criteria. The techniques still need to be written; hopefully it will be possible to gather a sufficient number of techniques to create a W3C Note on “VoiceXML Techniques for WCAG 2.0”.
This document is a draft and refers to the 30 June 2005 Working Draft of WCAG 2.0.
Voice Extensible Markup Language (VoiceXML) Version 2.0 has been a W3C Recommendation since March 2004.
VoiceXML
is designed for creating audio dialogs that feature synthesized speech,
digitized audio, recognition of spoken and DTMF key input,
recording of spoken input, telephony, and mixed initiative conversations. Its
major goal is to bring the advantages of Web-based development and content
delivery to interactive voice response applications.
There are
two kinds of dialogs: forms and menus. Forms define an interaction
that collects values for a set of form item variables. Each field may specify
a grammar that defines the allowable inputs for that field. If a form-level
grammar is present, it can be used to fill several fields from one utterance.
A menu presents the user with a choice of options and then
transitions to another dialog based on that choice.
An example of a form that asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
Each
dialog has one or more speech and/or DTMF
grammars associated with it. In machine directed
applications, each dialog's grammars are active only when the user is in that
dialog. In mixed initiative applications, where the user and the
machine alternate in determining what to do next, some of the dialogs are
flagged to make their grammars active (i.e.,
listened for) even when the user is in another dialog in the same document,
or on another loaded document in the same application. In this situation, if
the user says something matching another dialog's active grammars, execution
transitions to that other dialog, with the user's utterance treated as if it
were said in that dialog. Mixed initiative adds flexibility and power to
voice applications.
A
link supports mixed initiative. It specifies a grammar that is
active whenever the user is in the scope of the link. If user input matches
the link's grammar, control transfers to the link's destination
URI. A link can be used to throw an event or go to a
destination URI.
The VoiceXML 2.0 specification contains an appendix on accessibility. The last part of this appendix contains some guidelines:
Additional guidelines for enabling persons with disabilities to access VoiceXML applications include the following:
- Reuse navigation structures that are highly usable and leverage learning across multiple applications, for example the navigational techniques of the ANSI/NISO Z39.86-2002 Digital Talking Book standard, http://www.loc.gov/nls/z3986/.
- Each element in which an event can occur should specify catch elements, including one with a fail-soft or recovery functionality.
- Enable users to control the length of time before timeout, the speaking rate of synthesized speech, and other such variables that provide a little extra time to respond or complete an input action, particularly when the VoiceXML interpreter Context has detected that the user is using an ancillary device instead of listening or speaking. These are especially useful for users with cognitive disabilities.
- Advertise alternative modes through which equivalent service is available, including transfer to a human operator, text telephone service, etc., or the availability of the same information via the World Wide Web.
It would be useful to have code examples for these guidelines, especially the third one.
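As a rough illustration of the third guideline, the following sketch (the form names, prompt wording and grammar file are invented for this example) first asks the user whether he or she needs extra time, and then transitions to a version of the dialog that sets a longer timeout with the property element:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="pace">
    <field name="moretime" type="boolean">
      <prompt>Would you like extra time to answer each question?</prompt>
      <filled>
        <if cond="moretime">
          <goto next="#drinks_slow"/>
        <else/>
          <goto next="#drinks"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="drinks">
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
  </form>
  <form id="drinks_slow">
    <!-- allow a longer silence interval after each prompt in this dialog -->
    <property name="timeout" value="20s"/>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>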
Applies to recorded prompts and any other recordings of spoken text, and to earcons. The busy tone and the network busy tone are outside the scope of VoiceXML.
Speech Synthesis Markup Language has a desc element that can be used inside audio elements:
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of xml:lang in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output. [code example]
The desc element can only contain descriptive text.
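For example (a minimal sketch; the audio file name is invented, and support for the SSML desc element inside VoiceXML prompts may vary between platforms):

<prompt>
  <audio src="doorbell.wav">
    <desc>door bell ringing</desc>
    The door bell rings.
  </audio>
</prompt>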
Text alternatives can be provided for each audio prompt within the
audio
elements, for example:
<audio src="greetings.wav">Greetings</audio>
Applies to flow control and scripts (ECMAScript) that can be used in VoiceXML. However, if flow control and scripts produce a result that should be presented to the user, this is always something that is covered by other success criteria. VoiceXML has no equivalent for graphical links or buttons.
This applies to music that does not include words (although telephones are not well suited for this because of their limited frequency range).
This may apply to telephone hold music. (Otherwise, non-text content that meets these criteria is probably not used a lot in voice applications: it would be a waste of time.) However, you cannot skip on-hold music.
Applies only to live audio-only content.
Not applicable. In the context of this success criterion, multimedia means combined audio and video. Streaming video through a voice application doesn't seem to make sense. Prerecorded audio can be used, but this is already covered by the success criteria at level 1.
Requiring a text alternative (GL 1.1 L1 SC 1-5) seems reasonable because the person whose voice is being used (in the recording) also needs this text, so it is definitely available.
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
The main reasons for the existence of this guideline do not apply to VoiceXML: VoiceXML does not describe documents in which structures such as headings, blockquotes, etcetera can be discerned, and there is no way to fake structure by means of presentational elements; even if it were possible, the resulting voice application would not be usable.
Presentation elements for voice applications are defined in the W3C Speech Synthesis Markup Language (a Recommendation since September 2004) and reused in VoiceXML 2.0. Most importantly for this guideline, there is no technique to separate speech markup from dialog markup.
Until I find evidence to the contrary, a VoiceXML document must meet this success criterion or the voice application will not work. However, if VoiceXML is used to present structured text (newspaper articles, any non-fiction or fiction), there are no elements to mark up headings, quotes, etcetera. There is not even an element for descriptive titles (see the discussion of Guideline 2.4 Level 2 SC 3 below).
Note: the assumption behind the above comment is that <emphasis>, <break> and most other SSML elements (with the exception of <meta>, <metadata>, <p> and <s>) are presentational elements, not structural elements. One can use SSML to mark up the types of (structured) text I listed, but that would be like abusing presentational HTML markup to fake structure (e.g. using big, bold text instead of <hx>).
To my understanding, VoiceXML elements do not represent
structure in the sense of document structure, but dialogue/interaction
"structure", and without the proper use of these structures (prompts, forms,
menus, nomatch, etcetera) a VoiceXML document will not
work properly. This is where HTML differs from
VoiceXML: if you throw out structure and rely purely on
presentational elements, an HTML document will still work
for a large percentage of the population.
Not applicable (color).
In voice applications, Speech Synthesis Markup Language can be used to control aspects of speech such as pronunciation, volume, pitch, rate etcetera.
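A minimal sketch of such speech markup inside a VoiceXML prompt (the wording is invented; the rate and volume attribute values come from SSML):

<prompt>
  <prosody rate="slow" volume="loud">
    Please listen carefully to the following options.
  </prosody>
  You can say <emphasis>balance</emphasis>, <emphasis>transfer</emphasis> or <emphasis>operator</emphasis>.
</prompt>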
Not applicable (color).
Unlike HTML, VoiceXML cannot use a style language that rearranges content.
Not applicable (background image, color or text).
Not applicable (background image, color or text).
Not applicable (background pattern of lines).
This is very relevant to voice applications, as some applications play background audio that makes the prompts hard to understand.
VoiceXML elements that fetch VoiceXML documents also support the fetchaudio attribute. This attribute specifies
[t]he URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.
The VoiceXML specification also adds that
[t]he fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing.
By default, VoiceXML applications do not play background audio while fetching a resource of any kind. Apparently, background audio can only be played when the application fetches a VoiceXML document, not when it fetches audio, grammar or script files. Also, the background audio does not loop. Queued prompts are played during the fetch and before the background audio is played. It is not clear how VoiceXML supports background audio that plays during dialogs. WCAG 2.0 does not define background audio, but if VoiceXML cannot play background audio during prompts or other blocks of synthesized speech, then its concept of background audio does not fit WCAG's (implicit) concept of background audio.
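A minimal sketch of the fetchaudio attribute on a submit element (the URIs are hypothetical); the audio clip plays only while the next VoiceXML document is being fetched:

<form id="order">
  <field name="drink">
    <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
    <grammar src="drink.grxml" type="application/srgs+xml"/>
  </field>
  <block>
    <submit next="http://www.example.com/confirm.vxml"
      fetchaudio="http://www.example.com/audio/hold.wav"/>
  </block>
</form>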
Not applicable (background image, color or text).
In VoiceXML, audio content can only contain background sounds if it is pre-recorded and must then meet Guideline 1.1 Level 1. However, for people who are not deaf, the difference in volume between foreground and background audio is very relevant.
If this forces voice applications to provide both speech grammars and DTMF grammars for every dialog, it leads to longer prompts, which may harm the usability of voice applications. On the other hand, this success criterion (combined with Guideline 1.1 and others) is necessary to make VoiceXML applications accessible to people whose voice may be hard to recognize or who cannot speak at all. A sketch of a field with both kinds of grammar follows below.
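A field that accepts both speech and DTMF input could look like the following sketch (the grammar files are hypothetical; the mode attribute distinguishes the voice grammar from the DTMF grammar):

<field name="drink">
  <prompt>
    Would you like coffee, tea, milk, or nothing?
    You can also press 1 for coffee, 2 for tea, 3 for milk, or 4 for nothing.
  </prompt>
  <grammar mode="voice" src="drink.grxml" type="application/srgs+xml"/>
  <grammar mode="dtmf" src="drink-dtmf.grxml" type="application/srgs+xml"/>
</field>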
See comment on previous success criterion.
@TODO: check the interdigittimeout property in the appendix on timing properties in the VoiceXML 2.0 specification.
Content is designed so that time-outs are not an essential part of interaction, or at least one of the following is true for each time-out that is a function of the content:
- the user is allowed to deactivate the time-out or;
- the user is allowed to adjust the time-out over a wide range which is at least ten times the length of the default setting or;
- the user is warned before time expires, allowed to extend the time-out with a simple action (for example, "hit any key") and given at least 20 seconds to respond or;
- the time-out is an important part of a real-time event (for example, an auction), and no alternative to the time-out is possible or;
- the time-out is part of an activity where timing is essential (for example, competitive gaming or time-based testing) and time limits can not be extended further without invalidating the activity.
Prompts in VoiceXML dialogs can specify a timeout
attribute:
The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.
The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.
When a noinput event is thrown, it can be caught with <catch event="noinput">, or with <noinput>, which is shorthand notation for <catch event="noinput">. The noinput element has an optional count attribute that allows the developer to emit a different prompt when noinput events are repeated. For example, on the first noinput event (<noinput count="1">), the application may simply reprompt, and on subsequent noinput events (<noinput count="2">), it may emit more directed prompts. (See the example in the VoiceXML 2.1 Development Guide.) This technique is called tapered prompting. The noinput element can contain a prompt element that defines a longer timeout than the first prompt or any other previous prompts. See the example below.
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form id="tapered">
    <field name="color">
      <prompt timeout="20s">
        Please pick a color for your Model T. You can have any color as long as it's black.
      </prompt>
      <grammar src="black.grxml" type="application/srgs+xml"/>
      <noinput count="1">
        <prompt timeout="40s">
          Pick a color for your new Model T.
        </prompt>
      </noinput>
      <noinput count="2">
        <prompt timeout="60s">
          Please choose the color of your new nineteen twenty four Ford Model T.
          Possible colors are black, black, or black. Please take your time.
        </prompt>
      </noinput>
    </field>
    <block>
      <submit next="http://www.ford-t.com/color.asp"/>
    </block>
  </form>
</vxml>
However, the use of this feature does not meet any of the bullet points of this success criterion:
The user is reprompted and given at least 20 seconds extra time: this comes close to the third bullet point in the success criterion. Maybe this should be added as an extra bullet point for “computer-directed interactions”.
I am not sure if it is possible to store the timeout value in a variable, then reference this variable in the timeout attribute and increment it in <noinput>, or to prompt the user for a new value.
It is not obvious how one can turn off timeout: section 6.3.4 states that the default value for timeout is platform-dependent. To my understanding, this means that some platforms (may) provide their own timeout value, while others don't provide a value, and that both categories of platforms conform to the specification. [If turning off timeout is possible, some example code would be useful.]
This success criterion does not only apply to the timeout attribute, but also to the maxtime and finalsilence attributes of the record element, and probably also to the maxspeechtimeout property (one of the generic speech recognizer properties).
Not applicable (blinking).
If a prompt does not specify a noinput timeout, the application uses a platform-specific default timeout. It is not obvious how content can be paused by the user. The mark element can be used to place a marker into the text/tag sequence; this location can then be referenced. The problem here is that mark is in SSML, but regarding its purpose in VoiceXML, the VoiceXML specification says that it is “ignored by VoiceXML platforms” (section 4.1.1: Speech Markup).
Since it is not obvious how timeouts in VoiceXML applications can be disabled, it is not clear how this success criterion can be met.
VoiceXML applications do not use automatic updates of content that is being presented to the user, because that would cause major usability problems.
In the context of web sites, this success criterion does not require that the user is led back to the form he/she was filling in, only that the data are not lost. This should work in VoiceXML. For example, there is a disconnect element that allows an application to perform some necessary final cleanup, such as submitting data to the server.
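A minimal sketch of such cleanup (the variable names and server URI are hypothetical): when the caller hangs up, the platform throws a connection.disconnect.hangup event, which can be caught to submit the data collected so far:

<catch event="connection.disconnect.hangup">
  <!-- the caller hung up: save what has been collected so far -->
  <submit next="http://www.example.com/save-partial.asp" namelist="drink size"/>
</catch>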
Not applicable.
In voice applications, users navigate through several layers of ‘menus’ by answering prompts. Unlike HTML, where an image or paragraph can be turned into a link by means of JavaScript, VoiceXML has no techniques to turn, say, a prompt into a link, choice (in a menu) or field (in a form). It appears that VoiceXML applications meet this criterion by default.
In voice applications, there is usually only one way. If the developer can think of several ways to reach a specific piece of content (i.e. several ways to structure the menus), he will probably only implement the most efficient one, because implementing the other ways would mean that prompts (containing lists of choices where to go next) would only become longer and therefore less usable.
Web applications where users are required to provide information in a specific sequence of forms also violate this criterion. The success criterion needs an exemption; see the editorial note:
There are some concerns about how this criterion would apply to web applications where the only way to expose certain types of content would require the user to follow a specific path.
The success criterion could be reworded as follows:
More than one way is available to locate content within a set of delivery units, except in computer-directed interactions.
This reformulation requires a definition:
The rewording and the definition are inspired by VoiceXML's concept of directed forms.
Not applicable, because perceivable units in voice applications cannot be broken down into anything smaller?
VoiceXML does not describe documents, so titles are useless to end users. A title can be provided in the meta element and/or metadata element, but the content of this element is not supposed to be played back to the user. This success criterion may need an exemption, for example as follows:
Delivery units have descriptive titles, except in interaction technologies where titles cannot be presented to the user.
VoiceXML applications always meet this success criterion if the items in a grammar are interpreted as “link text”, except where a programmatic reference is not matched by anything in a prompt or grammar (such a reference would be like a piece of unused code in a program).
Also, in voice applications, users do not notice when another delivery unit is fetched; there is no way for users to find out what the delivery units in a voice application are, with the exception of recorded sound.
In VoiceXML applications, users don't really “navigate” a delivery unit. For the content of a prompt or block, this success criterion is always met. For a VoiceXML document that contains multiple dialogs, the sequence depends on the user's input. Requiring that the dialogs should be navigated in source code order makes no sense.
Not applicable: location is not a useful concept in voice applications. However, it is useful and easily achievable to provide the user with the possibility to go back to the main menu (“home”), to quit the current dialog, to ask for help, etcetera.
It is not possible to detect errors, only that inputs don't match grammars, so VoiceXML applications don't need to do anything to meet this criterion (?).
(VoiceXML applications can also throw and catch other types of errors: for example, if a resource is not available, it is possible to warn the user. However, these are not input errors or “user errors”, as they were previously called. When a resource is not available, the user should be notified, but WCAG has no guideline on this: it is a usability issue rather than an accessibility issue.)
See previous comment. However, it is possible to tell the user that there was no match between his input and the grammar, and to provide more useful prompts. When there is no match, the VoiceXML interpreter throws a nomatch event, which can be caught with <catch event="nomatch"> or <nomatch>. As with timeouts, which cause a noinput event, it is possible to provide tapered prompts (see comments on Guideline 2.2 Level 1 SC 1).
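A minimal sketch of tapered prompts for nomatch events, analogous to the noinput example above (the grammar file is hypothetical):

<field name="drink">
  <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <nomatch count="1">
    Sorry, I did not understand you.
    <reprompt/>
  </nomatch>
  <nomatch count="2">
    Sorry, I still did not understand you.
    Please answer with a single word: coffee, tea, milk, or nothing.
  </nomatch>
</field>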
For forms that cause legal or financial transactions to occur, that modify or delete data in remote data storage systems, or that submit test responses, at least one of the following is true:
@TODO
This is easily achieved with tapered prompts and/or the help element (or <catch event="help">). However, this is a usability issue rather than an accessibility issue, except for people with cognitive disabilities.
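For example (a sketch with invented wording; the help element is shorthand for <catch event="help">):

<field name="drink">
  <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <help>
    This is the drink ordering service.
    Please answer with one of the words coffee, tea, milk, or nothing.
    <reprompt/>
  </help>
</field>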
Use the xml:lang attribute.
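For example (a minimal sketch; the French prompt merely illustrates a passage in another language):

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0" xml:lang="en-US">
  <form>
    <block>
      <prompt>Welcome.</prompt>
      <prompt xml:lang="fr-FR">Bienvenue.</prompt>
    </block>
  </form>
</vxml>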
Use the xml:lang attribute. @TODO: check the lowest level where xml:lang is available.
??? If such a mechanism is necessary, the content is not usable. ???
This is hardly feasible in any context because one would either need to
??? If such a mechanism is necessary, the content is not usable. ???
???
Not applicable, unless VoiceXML is used to read structured text to the user, but then there is no specific markup for section titles (see comment on Guideline 2.4 Level 2 SC 3).
When text requires reading ability at or above the upper secondary education level, one or more of the following supplements is available:
Voice applications don't require a certain level of reading skills, but they do require a certain level of concentration and memory.
Superficially, voice applications have no issues regarding placement and functionality of content. However, interface consistency is very important.
Interface consistency is a fundamental quality of a usable application. Speech interface consistency includes consistent use of terminology, personality, DTMF, etc. People will pick up implicit rules as they interact with the system. Overall consistency makes the learning process smoother, and helps users predict what will happen and what they need to do, which in turn helps build confidence in using the system.
Interfaces should be consistent in:
- Terminology: use the same words rather than synonyms to refer to an object or event. So, use “To delete the message, say delete” rather than “To remove the message, say delete”.
- Personality: To convey a particular personality to the user, use the same wording, attitude, and style in all dialogs in an application.
- Use of DTMF: assign the same key to the same word or action. For example, you may assign “1” to “yes” and “9” to “no” across the application.
System: “Would you like to make an appointment? If yes, press 1; otherwise, press 9.”
System: “Would you like to go back to the main menu? If yes, press 1; otherwise, press 9.” (Eve Astrid Andersson et al.: Early Adopter VoiceXML. Birmingham, UK: Wrox, 2002. P. 123.)
It is not clear how there can be changes of context that can not be programmatically determined.
In the context of voice applications, the distinction between delivery unit and perceivable unit is not important, because there is no way for users to specify the address (URL) of a specific delivery unit. As mentioned before, users are not aware of transitions between delivery units, unless there is a noticeable delay.
Because functions that need to be available in the whole application need only be specified once in the root document, there is no need to repeat them in the leaf documents (see Executing a Multi-Document Application). HTML has no similar concept, but the success criterion is based on this deficiency of HTML-based sites. If leaf documents in voice applications do not repeat components that are also available in other leaf documents, this success criterion is met, but if components are repeated, the order in the source code does not matter (assuming that the user interface is consistent).
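The following sketch illustrates this (the file names, grammar files and server URI are hypothetical): the root document defines a “main menu” link once, and the leaf document makes it available by pointing to the root document in its application attribute.

<?xml version="1.0" encoding="UTF-8"?>
<!-- root.vxml: application root document -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- this link is active in every document of the application -->
  <link next="root.vxml#home">
    <grammar src="mainmenu.grxml" type="application/srgs+xml"/>
  </link>
  <form id="home">
    <block>
      <prompt>Main menu.</prompt>
      <goto next="orders.vxml"/>
    </block>
  </form>
</vxml>

<?xml version="1.0" encoding="UTF-8"?>
<!-- orders.vxml: leaf document -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0" application="root.vxml">
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee or tea? You can say main menu at any time.</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.example.com/order.asp"/>
    </block>
  </form>
</vxml>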
VoiceXML does not use the concept of “focus”. Unlike HTML documents, a VoiceXML document forms a finite state machine.
A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
(Wikipedia has a definition of finite state machine.)
If a VoiceXML form has only one field, setting the input field may automatically submit the form; if it has multiple input fields, setting the last field may automatically submit the form, but form submission is only one of the things that can happen: variables may be set, fields may be cleared, or there may be transitions to subdialogs. Subdialogs may be in another delivery unit, especially if they are reused in multiple dialogs.
A VoiceXML menu is essentially a syntactic shorthand for a form containing a single anonymous field. Picking a choice from a menu automatically causes a transition to the dialog or VoiceXML document referenced by this choice.
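For example (a sketch with hypothetical document URIs; the enumerate element speaks the list of choices):

<menu>
  <prompt>Say one of: <enumerate/></prompt>
  <choice next="http://www.example.com/news.vxml">news</choice>
  <choice next="http://www.example.com/weather.vxml">weather</choice>
  <choice next="http://www.example.com/sports.vxml">sports</choice>
  <noinput>Please say one of <enumerate/></noinput>
</menu>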
Since VoiceXML has the concept of an application root document where application-wide functions can be defined just once, it is not wise to repeat the definition of these functions in leaf documents. However, if certain functions are available in multiple delivery units, terminology should be consistent (see the comment on consistency above).
Not applicable (graphical components).
A noinput event may cause a transition to another VoiceXML document, for example because this event is caught and handled in the root document (so there is a transition from a leaf document to the root document), or the event handler may transfer the user to another VoiceXML document.
The 30 June 2005 Working Draft contains no success criteria for this guideline. In the 19 November 2004 Working Draft, this guideline contained the following success criteria.
The existence of this success criterion has to do with HTML extensions that can be helpful for screen readers. There are probably no such extensions in VoiceXML. Note that this success criterion does not justify the creation of new proprietary extensions.
Any VoiceXML application that uses VoiceXML, SSML, SRGS, JSGF, Nuance SpeechObjects, etcetera according to specification (which is more than just writing code that validates against a DTD or formal grammar) should pass this success criterion.
Content using baseline technologies or non-baseline technologies, must meet the following criteria:
- Content that violates international health and safety standards for general flash or red flash is marked in a way that the user can avoid its appearance
- If the user can enter the content using the keyboard, then the user can exit the content using the keyboard.
Criterion a does not apply to VoiceXML because flash is only relevant to visual media.
Criterion b may require modifications in existing VoiceXML applications. [@TODO check]
@TODO
@TODO
Christophe Strobbe: Christophe.Strobbe ( @ ) esat.kuleuven.be