This document collects some notes on how the Web Content Accessibility Guidelines (WCAG) 2.0 relate to VoiceXML and dialog systems. This document is a draft and refers to the 30 June 2005 Working Draft of WCAG 2.0.
@TODO: check appendix on accessibility in VoiceXML 2.0 specification.
Voice Extensible Markup Language (VoiceXML) Version 2.0 has been a W3C Recommendation since March 2004.
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
An example of a form that asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
        http://www.w3.org/TR/voicexml20/vxml.xsd"
      version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
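As a minimal sketch (the destination document is hypothetical), a link that specifies an inline grammar could look like this:

<link next="operator.vxml">
  <!-- The link is followed whenever the user says "operator" -->
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="operator">
    <rule id="operator">operator</rule>
  </grammar>
</link>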
Applies to recorded prompts and any other recordings of spoken text, and to earcons. The busy tone and the network busy tone are outside the scope of VoiceXML.
Speech Synthesis Markup Language has a desc element that can be used inside audio elements:
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of xml:lang in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output.
The desc element can only contain descriptive text.
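For illustration, a sketch of a desc element inside an SSML audio element (the file name is hypothetical):

<audio src="doorbell.wav">
  <!-- Description of non-speech audio, rendered in text-only output -->
  <desc>doorbell ringing</desc>
</audio>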
Text alternatives can be provided for each audio prompt within the audio elements, for example:
<audio src="greetings.wav">Greetings</audio>
Applies to flow control and scripts (ECMAScript) that can be used in VoiceXML. However, if flow control and scripts produce a result that should be presented to the user, this is always something that is covered by other success criteria. VoiceXML has no equivalent for graphical links or buttons.
This applies to music that does not include words (although telephones are not well suited for this because of the limited frequency range).
This may apply to telephone hold music. (Otherwise, non-text content that meets these criteria is probably not used a lot in voice applications: it would be a waste of time.) However, you cannot skip on-hold music.
Applies only to live audio-only content.
Not applicable (multimedia).
Not applicable (multimedia).
The main reasons for the existence of this guideline do not apply here: VoiceXML does not describe documents where structures such as headings, blockquotes, etcetera can be discerned, and there is no possibility to fake structure by means of presentational elements; even if it were possible, the voice application would not be usable.
Presentation elements for voice applications are defined in the W3C Speech Synthesis Markup Language (a Recommendation since September 2004) and reused in VoiceXML 2.0. Most importantly for this guideline, there is no technique to separate speech markup from dialog markup.
Until I find evidence to the contrary, a VoiceXML document must meet this success criterion or the voice application will not work. However, if VoiceXML is used to present structured text (newspaper articles, any non-fiction or fiction), there are no elements to mark headings, quotes, etcetera. There is not even an element for descriptive titles (see the discussion of Guideline 2.4 Level 2 SC 3 below).
Not applicable (color).
In voice applications, Speech Synthesis Markup Language can be used to control aspects of speech such as pronunciation, volume, pitch, rate etcetera.
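For illustration, a sketch of SSML's prosody element inside a VoiceXML prompt (the attribute values are only examples):

<prompt>
  <!-- Slow down and raise the volume of this passage -->
  <prosody rate="slow" volume="loud">
    Please listen carefully.
  </prosody>
</prompt>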
Not applicable (color).
Unlike HTML, VoiceXML cannot use a style language that rearranges content.
Not applicable (background image, color or text).
Not applicable (background image, color or text).
Not applicable (background pattern of lines).
This is very relevant to voice applications, as some applications play background audio that makes the prompts hard to understand.
VoiceXML elements that fetch VoiceXML documents also support the fetchaudio attribute. This attribute specifies

[t]he URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.
The VoiceXML specification also adds that
[t]he fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing.
By default, VoiceXML applications do not play background audio while fetching a resource of any kind. Apparently, background audio can only be played when the application fetches a VoiceXML document, not when it fetches audio, grammar or script files. Also, the background audio does not loop. Queued prompts are played during the fetch and before the background audio is played. It is not clear how VoiceXML supports background audio that plays during dialogs. WCAG 2.0 does not define background audio, but if VoiceXML cannot play background audio during prompts or other blocks of synthesized speech, then its concept of background audio does not fit WCAG's (implicit) concept of background audio.
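As a hedged sketch (file names hypothetical), fetchaudio can be set on an element that fetches the next document:

<form>
  <block>
    <!-- Play an audio clip while the next document is being fetched -->
    <goto next="menu.vxml" fetchaudio="holdmusic.wav"/>
  </block>
</form>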
Not applicable (background image, color or text).
In VoiceXML, audio content can only contain background sounds if it is pre-recorded and must then meet Guideline 1.1 Level 1. However, for people who are not deaf, the difference in volume between foreground and background audio is very relevant.
If this forces voice applications to provide both speech grammars and DTMF grammars for every dialog, this leads to longer prompts, which may harm the usability of voice applications. On the other hand, this success criterion (combined with Guideline 1.1 and others) is necessary to make VoiceXML applications accessible to people whose voice may be hard to recognize or who cannot speak at all.
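A minimal sketch of a field with both a speech grammar and an equivalent DTMF grammar (the grammar files and their contents are hypothetical):

<field name="confirm">
  <prompt>To confirm your order, say yes or press 1; otherwise say no or press 2.</prompt>
  <!-- Speech grammar -->
  <grammar mode="voice" src="yesno.grxml" type="application/srgs+xml"/>
  <!-- Equivalent DTMF grammar -->
  <grammar mode="dtmf" src="yesno-dtmf.grxml" type="application/srgs+xml"/>
</field>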
See comment on previous success criterion.
@TODO: check interdigittimeout in the appendix on timing properties in the VoiceXML 2.0 specification.
Content is designed so that time-outs are not an essential part of interaction, or at least one of the following is true for each time-out that is a function of the content:
Prompts in VoiceXML dialogs can specify a timeout attribute:
The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.
The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.
When a noinput event is thrown, it can be caught with <catch event="noinput">, or with <noinput>, which is shorthand notation for <catch event="noinput">. The noinput element has an optional count attribute that allows the developer to emit a different prompt when noinput events are repeated. For example, on the first noinput event (<noinput count="1">), the application may simply reprompt, and on subsequent noinput events (<noinput count="2">), it may emit more directed prompts. (See the example in the VoiceXML 2.1 Development Guide.) This technique is called tapered prompting. The noinput element can contain a prompt element that defines a longer timeout than the first prompt or any other previous prompts. However, the use of this feature does not meet any of the bullet points of this success criterion. I am not sure if it is possible to store the timeout value in a variable, then reference this variable in the timeout attribute and increment it in <noinput>, or to prompt the user for a new value.
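For illustration, a sketch of tapered timeouts using the count attribute on prompts (the grammar file is hypothetical):

<field name="drink">
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <!-- First attempt: five seconds of silence allowed -->
  <prompt count="1" timeout="5s">Would you like coffee, tea, milk, or nothing?</prompt>
  <!-- Later attempts: ten seconds -->
  <prompt count="2" timeout="10s">Please say coffee, tea, milk, or nothing.</prompt>
  <noinput>
    <!-- Replay the field's prompts; the prompt counter selects the next prompt -->
    <reprompt/>
  </noinput>
</field>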
Not applicable (blinking).
If a prompt does not specify a noinput timeout, the application uses a platform-specific default timeout. It is not obvious how content can be paused by the user.
Since it is not obvious how timeouts in VoiceXML applications can be disabled, it is not clear how this success criterion can be met.
VoiceXML applications do not use automatic updates of content that is being presented to the user, because that would cause major usability problems.
In the context of web sites, this success criterion does not require that the user is led back to the form he/she was filling in, only that the data are not lost. This should work in VoiceXML. For example, there is a disconnect element that allows an application to perform some necessary final cleanup, such as submitting data to the server.
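A sketch of such final cleanup (the server URI and variable name are hypothetical), catching the event the platform throws when the connection is dropped:

<catch event="connection.disconnect.hangup">
  <!-- Submit the data collected so far before the session ends -->
  <submit next="http://example.com/save.asp" namelist="drink"/>
</catch>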
Not applicable.
In voice applications, users navigate through several layers of ‘menus’ by answering prompts. Unlike HTML, where an image or paragraph can be turned into a link by means of JavaScript, VoiceXML has no techniques to turn, say, a prompt into a link, choice (in a menu) or field (in a form). It appears that VoiceXML applications meet this criterion by default.
In voice applications, there is usually only one way. If the developer can think of several ways (i.e. several ways to structure the menus) to reach a specific piece of content, he will probably only implement the most efficient one, because implementing the other ways would mean that prompts (containing lists of choices where to go next) would only become longer and therefore less usable.
Web applications where users are required to provide information in a specific sequence of forms also violate this criterion. The success criterion needs an exemption; see the editorial note:
There are some concerns about how this criterion would apply to web applications where the only way to expose certain types of content would require the user to follow a specific path.
The success criterion could be reworded as follows:
More than one way is available to locate content within a set of delivery units, except in computer-directed interactions.
This reformulation requires a definition:
The rewording and the definition are inspired by VoiceXML's concept of directed forms.
Not applicable, because perceivable units in voice applications cannot be broken down into anything smaller?
VoiceXML does not describe documents, so titles are useless to end users. A title can be provided in the meta element and/or metadata element, but the content of this element is not supposed to be played back to the user. This success criterion may need an exemption, for example as follows:
Delivery units have descriptive titles, except in interaction technologies where titles cannot be presented to the user.
VoiceXML applications always meet this success criterion if the items in a grammar are interpreted as “link text”, except where a programmatic reference is not matched by anything in a prompt or grammar (such a reference would be like a piece of unused code in a program).
Also, in voice applications, users do not notice when another delivery unit is fetched; there is no way for users to find out what the delivery units in a voice application are, with the exception of recorded sound.
In VoiceXML applications, users don't really “navigate” a delivery unit. For the content of a prompt or block, this success criterion is always met. For a VoiceXML document that contains multiple dialogs, the sequence depends on the user's input. Requiring that the dialogs should be navigated in source code order makes no sense.
Not applicable: location is not a useful concept in voice applications. However, it is useful and easily achievable to provide a user with the possibility to go back to the main menu (“home”), to quit the current dialog, to ask for help, etcetera.
It is not possible to detect errors, only that inputs don't match grammars, so VoiceXML applications don't need to do anything to meet this criterion.
See previous comment. However, it is possible to tell the user that there was no match between his input and the grammar, and to provide more useful prompts. When there is no match, the VoiceXML interpreter throws a nomatch event, which can be caught with <catch event="nomatch"> or <nomatch>. As with timeouts, which cause a noinput event, it is possible to provide tapered prompts (see comments on Guideline 2.2 Level 1 SC 1).
For forms that cause legal or financial transactions to occur, that modify or delete data in remote data storage systems, or that submit test responses, at least one of the following is true:
@TODO
This is easily achieved with tapered prompts and/or the help element (or <catch event="help">). However, this is a usability issue rather than an accessibility issue, except for people with cognitive disabilities.
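A minimal sketch of a help handler inside a field (the wording and grammar file are hypothetical):

<field name="drink">
  <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <help>
    You can order a drink by saying coffee, tea, or milk, or say nothing to skip.
    <reprompt/>
  </help>
</field>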
Use the xml:lang attribute.
Use the xml:lang attribute. @TODO: check the lowest level where xml:lang is available.
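As a sketch of both uses, assuming xml:lang is allowed on prompt (the TODO above still applies; the language values are only examples):

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <block>
      <!-- A passage in a different language from the rest of the document -->
      <prompt xml:lang="fr-FR">Bonjour</prompt>
    </block>
  </form>
</vxml>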
??? If such a mechanism is necessary, the content is not usable. ???
??? If such a mechanism is necessary, the content is not usable. ???
???
Not applicable, unless VoiceXML is used to read structured text to the user, but then there is no specific markup for section titles (see comment on Guideline 2.4 Level 2 SC 3).
When text requires reading ability at or above the upper secondary education level, one or more of the following supplements is available:
Voice applications don't require a certain level of reading skills, but they do require a certain level of concentration and memory.
Superficially, voice applications have no issues regarding placement and functionality of content. However, interface consistency is very important.
Interface consistency is a fundamental quality of a usable application. Speech interface consistency includes consistent use of terminology, personality, DTMF, etc. People will pick up implicit rules as they interact with the system. Overall consistency makes the learning process smoother, and helps users predict what will happen and what they need to do, which in turn helps build confidence in using the system.
Interfaces should be consistent in:
- Terminology: use the same words rather than synonyms to refer to an object or event. So, use “To delete the message, say delete” rather than “To remove the message, say delete”.
- Personality: To convey a particular personality to the user, use the same wording, attitude, and style in all dialogs in an application.
- Use of DTMF: assign the same key to the same word or action. For example, you may assign “1” to “yes” and “9” to “no” across the application.
System: “Would you like to make an appointment? If yes, press 1; otherwise, press 9.”
System: "Would you like to go back to the main menu? If yes, press 1; otherwise, press 9.”(Eve Astrid Andersson et al: Early Adopter VoiceXML. Birmingham, UK: Wrox, 2002. P. 123.)
It is not clear how there can be changes of context that can not be programmatically determined.
In the context of voice applications, the distinction between delivery unit and perceivable unit is not important, because there is no way for users to specify the address (URL) of a specific delivery unit. As mentioned before, users are not aware of transitions between delivery units, unless there is a noticeable delay.
Because functions that need to be available in the whole application need only be specified once in the root document, there is no need to repeat them in the leaf documents (see Executing a Multi-Document Application). HTML has no similar concept, but the success criterion is based on this deficiency of HTML-based sites. If leaf documents in voice applications do not repeat components that are also available in other leaf documents, this success criterion is met, but if components are repeated, the order in the source code does not matter (assuming that the user interface is consistent).
VoiceXML does not use the concept of “focus”. Unlike an HTML document, a VoiceXML document forms a state machine.
A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
If a VoiceXML form has only one field, setting the input field may automatically submit the form; if it has multiple input fields, setting the last field may automatically submit the form, but form submission is only one of the things that can happen: variables may be set, fields may be cleared, or there may be transitions to subdialogs. Subdialogs may be in another delivery unit, especially if they are reused in multiple dialogs.
A VoiceXML menu is essentially a syntactic shorthand for a form containing a single anonymous field. Picking a choice from a menu automatically causes a transition to the dialog or VoiceXML document referenced by this choice.
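A minimal sketch of a menu (the destination documents are hypothetical):

<menu>
  <prompt>Say news or weather.</prompt>
  <!-- The content of each choice doubles as its grammar phrase -->
  <choice next="news.vxml">news</choice>
  <choice next="weather.vxml">weather</choice>
</menu>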
Since VoiceXML has the concept of an application root document where application-wide functions can be defined just once, it is not wise to repeat the definition of these functions in leaf documents. However, if certain functions are available in multiple delivery units, terminology should be consistent (see the comment on consistency above).
Not applicable (graphical components).
A noinput event may cause a transition to another VoiceXML document, for example because this event is caught and handled in the root document (so there is a transition from a leaf document to the root document), or the event handler may transfer the user to another VoiceXML document.
@TODO
@TODO
Christophe Strobbe: Christophe.Strobbe ( @ ) esat.kuleuven.be