This document collects some notes on how the Web Content Accessibility Guidelines (WCAG) 2.0 relate to VoiceXML and dialog systems. This document is a draft and refers to the 30 June 2005 Working Draft of WCAG 2.0.
@TODO: check appendix on accessibility in VoiceXML 2.0 specification.
Voice Extensible Markup Language (VoiceXML) Version 2.0 has been a W3C Recommendation since March 2004.
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
An example of a form that asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
        http://www.w3.org/TR/voicexml20/vxml.xsd"
      version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
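As a minimal sketch (the destination document is hypothetical), a link that specifies an inline grammar could look like this:

<link next="operator.vxml">
  <!-- The link is followed whenever the user says "operator" -->
  <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="operator">
    <rule id="operator">operator</rule>
  </grammar>
</link>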
Applies to recorded prompts and any other recordings of spoken text, and to earcons. The busy tone and the network busy tone are outside the scope of VoiceXML.
Speech Synthesis Markup Language has a desc element that can be used inside audio elements:
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of xml:lang in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output.
The desc element can only contain descriptive text.
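For illustration, a sketch of a desc element inside an SSML audio element (the file name is hypothetical):

<audio src="doorbell.wav">
  <!-- Description of non-speech audio, rendered in text-only output -->
  <desc>doorbell ringing</desc>
</audio>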
Text alternatives can be provided for each audio prompt within the audio elements, for example:
<audio src="greetings.wav">Greetings</audio>
Applies to flow control and scripts (ECMAScript) that can be used in VoiceXML. However, if flow control and scripts produce a result that should be presented to the user, this is always something that is covered by other success criteria. VoiceXML has no equivalent for graphical links or buttons.
This applies to music that does not include words (although telephones are not well suited for this because of the limited frequency range).
This may apply to telephone hold music. (Otherwise, non-text content that meets these criteria is probably not used a lot in voice applications: it would be a waste of time.) However, you cannot skip on-hold music.
Applies only to live audio-only content.
Not applicable (multimedia).
Not applicable (multimedia).
The main reasons for the existence of this guideline do not apply here: VoiceXML does not describe documents where structures such as headings, blockquotes, etcetera can be discerned, and there is no possibility to fake structure by means of presentational elements; even if it were possible, the voice application would not be usable.
Presentation elements for voice applications are defined in the W3C Speech Synthesis Markup Language (a Recommendation since September 2004) and reused in VoiceXML 2.0. Most importantly for this guideline, there is no technique to separate speech markup from dialog markup.
Until I find evidence to the contrary, a VoiceXML document must meet this success criterion or the voice application will not work. However, if VoiceXML is used to present structured text (newspaper articles, any non-fiction or fiction), there are no elements to mark headings, quotes, etcetera. There is not even an element for descriptive titles (see the discussion of Guideline 2.4 Level 2 SC 3 below).
Not applicable (color).
In voice applications, Speech Synthesis Markup Language can be used to control aspects of speech such as pronunciation, volume, pitch, rate etcetera.
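For illustration, a sketch of SSML's prosody element inside a VoiceXML prompt (the attribute values are only examples):

<prompt>
  <!-- Slow down and raise the volume of this passage -->
  <prosody rate="slow" volume="loud">
    Please listen carefully.
  </prosody>
</prompt>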
Not applicable (color).
Unlike HTML, VoiceXML cannot use a style language that rearranges content.
Not applicable (background image, color or text).
Not applicable (background image, color or text).
Not applicable (background pattern of lines).
This is very relevant to voice applications, as some applications play background audio that makes the prompts hard to understand.
VoiceXML elements that fetch VoiceXML documents also support the fetchaudio attribute. This attribute specifies

[t]he URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.
The VoiceXML specification also adds that
[t]he fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing.
By default, VoiceXML applications do not play background audio while fetching a resource of any kind. Apparently, background audio can only be played when the application fetches a VoiceXML document, not when it fetches audio, grammar or script files. Also, the background audio does not loop. Queued prompts are played during the fetch and before the background audio is played. It is not clear how VoiceXML supports background audio that plays during dialogs. WCAG 2.0 does not define background audio, but if VoiceXML cannot play background audio during prompts or other blocks of synthesized speech, then its concept of background audio does not fit WCAG's (implicit) concept of background audio.
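As a hedged sketch (file names hypothetical), fetchaudio can be set on an element that fetches the next document:

<form>
  <block>
    <!-- Play an audio clip while the next document is being fetched -->
    <goto next="menu.vxml" fetchaudio="holdmusic.wav"/>
  </block>
</form>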
Not applicable (background image, color or text).
In VoiceXML, audio content can only contain background sounds if it is pre-recorded and must then meet Guideline 1.1 Level 1. However, for people who are not deaf, the difference in volume between foreground and background audio is very relevant.
If this forces voice applications to provide both speech grammars and DTMF grammars for every dialog, this leads to longer prompts, which may harm the usability of voice applications. On the other hand, this success criterion (combined with Guideline 1.1 and others) is necessary to make VoiceXML applications accessible to people whose voice may be hard to recognize or who cannot speak at all.
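A minimal sketch of a field with both a speech grammar and an equivalent DTMF grammar (the grammar files and their contents are hypothetical):

<field name="confirm">
  <prompt>To confirm your order, say yes or press 1; otherwise say no or press 2.</prompt>
  <!-- Speech grammar -->
  <grammar mode="voice" src="yesno.grxml" type="application/srgs+xml"/>
  <!-- Equivalent DTMF grammar -->
  <grammar mode="dtmf" src="yesno-dtmf.grxml" type="application/srgs+xml"/>
</field>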
See comment on previous success criterion.
@TODO: check interdigittimeout in the appendix on timing properties in the VoiceXML 2.0 specification.
Content is designed so that time-outs are not an essential part of interaction, or at least one of the following is true for each time-out that is a function of the content:
Prompts in VoiceXML dialogs can specify a timeout attribute:
The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.
The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.
When a noinput event is thrown, it can be caught with <catch event="noinput">, or with <noinput>, which is shorthand notation for <catch event="noinput">. The noinput element has an optional count attribute that allows the developer to emit a different prompt when noinput events are repeated. For example, on the first noinput event (<noinput count="1">), the application may simply reprompt, and on subsequent noinput events (<noinput count="2">), it may emit more directed prompts. (See the example in the VoiceXML 2.1 Development Guide.) This technique is called tapered prompting. The noinput element can contain a prompt element that defines a longer timeout than the first prompt or any other previous prompts. However, the use of this feature does not meet any of the bullet points of this success criterion. I am not sure if it is possible to store the timeout value in a variable, then reference this variable in the timeout attribute and increment it in <noinput>, or to prompt the user for a new value.
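For illustration, a sketch of tapered timeouts using the count attribute on prompts (the grammar file is hypothetical):

<field name="drink">
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <!-- First attempt: five seconds of silence allowed -->
  <prompt count="1" timeout="5s">Would you like coffee, tea, milk, or nothing?</prompt>
  <!-- Later attempts: ten seconds -->
  <prompt count="2" timeout="10s">Please say coffee, tea, milk, or nothing.</prompt>
  <noinput>
    <!-- Replay the field's prompts; the prompt counter selects the next prompt -->
    <reprompt/>
  </noinput>
</field>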
Not applicable (blinking).
If a prompt does not specify a noinput timeout, the application uses a platform-specific default timeout. It is not obvious how content can be paused by the user.
Since it is not obvious how timeouts in VoiceXML applications can be disabled, it is not clear how this success criterion can be met.
VoiceXML applications do not use automatic updates of content that is being presented to the user, because that would cause major usability problems.
In the context of web sites, this success criterion does not require that the user is led back to the form he/she was filling in, only that the data are not lost. This should work in VoiceXML. For example, there is a disconnect element that allows an application to perform some necessary final cleanup, such as submitting data to the server.
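A sketch of such final cleanup (the server URI and variable name are hypothetical), catching the event the platform throws when the connection is dropped:

<catch event="connection.disconnect.hangup">
  <!-- Submit the data collected so far before the session ends -->
  <submit next="http://example.com/save.asp" namelist="drink"/>
</catch>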
Not applicable.
In voice applications, users navigate through several layers of ‘menus’ by answering prompts. Unlike HTML, where an image or paragraph can be turned into a link by means of JavaScript, VoiceXML has no techniques to turn, say, a prompt into a link, choice (in a menu) or field (in a form). It appears that VoiceXML applications meet this criterion by default.
In voice applications, there is usually only one way. If the developer can think of several ways (i.e. several ways to structure the menus) to reach a specific piece of content, he will probably only implement the most efficient one, because implementing the other ways would mean that prompts (containing lists of choices where to go next) would only become longer and therefore less usable.
Web applications where users are required to provide information in a specific sequence of forms also violate this criterion. The success criterion needs an exemption; see the editorial note:
There are some concerns about how this criterion would apply to web applications where the only way to expose certain types of content would require the user to follow a specific path.
The success criterion could be reworded as follows:
More than one way is available to locate content within a set of delivery units, except in computer-directed interactions.
This reformulation requires a definition:
The rewording and the definition are inspired by VoiceXML's concept of directed forms.
Not applicable, because perceivable units in voice applications cannot be broken down into anything smaller?
VoiceXML does not describe documents, so titles are useless to end users. A title can be provided in the meta element and/or metadata element, but the content of this element is not supposed to be played back to the user. This success criterion may need an exemption, for example as follows:
Delivery units have descriptive titles, except in interaction technologies where titles cannot be presented to the user.
VoiceXML applications always meet this success criterion if the items in a grammar are interpreted as “link text”, except where a programmatic reference is not matched by anything in a prompt or grammar (such a reference would be like a piece of unused code in a program).
Also, in voice applications, users do not notice when another delivery unit is fetched; there is no way for users to find out what the delivery units in a voice application are, with the exception of recorded sound.
In VoiceXML applications, users don't really “navigate” a delivery unit. For the content of a prompt or block, this success criterion is always met. For a VoiceXML document that contains multiple dialogs, the sequence depends on the user's input. Requiring that the dialogs should be navigated in source code order makes no sense.
Not applicable: location is not a useful concept in voice applications. However, it is useful and easily achievable to provide a user with the possibility to go back to the main menu (“home”), to quit the current dialog, to ask for help, etcetera.
It is not possible to detect errors, only that inputs don't match grammars, so VoiceXML applications don't need to do anything to meet this criterion.
See previous comment. However, it is possible to tell the user that there was no match between his input and the grammar, and to provide more useful prompts. When there is no match, the VoiceXML interpreter throws a nomatch event, which can be caught with <catch event="nomatch"> or <nomatch>. As with timeouts, which cause a noinput event, it is possible to provide tapered prompts (see comments on Guideline 2.2 Level 1 SC 1).
For forms that cause legal or financial transactions to occur, that modify or delete data in remote data storage systems, or that submit test responses, at least one of the following is true:
@TODO
This is easily achieved with tapered prompts and/or the help element (or <catch event="help">). However, this is a usability issue rather than an accessibility issue, except for people with cognitive disabilities.
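A minimal sketch of a help handler inside a field (the wording and grammar file are hypothetical):

<field name="drink">
  <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <help>
    You can order a drink by saying coffee, tea, or milk, or say nothing to skip.
    <reprompt/>
  </help>
</field>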
Use the xml:lang attribute.
Use the xml:lang attribute. @TODO: check the lowest level where xml:lang is available.
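As a sketch of both uses, assuming xml:lang is allowed on prompt (the TODO above still applies; the language values are only examples):

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <block>
      <!-- A passage in a different language from the rest of the document -->
      <prompt xml:lang="fr-FR">Bonjour</prompt>
    </block>
  </form>
</vxml>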
??? If such a mechanism is necessary, the content is not usable. ???
??? If such a mechanism is necessary, the content is not usable. ???
???
Not applicable, unless VoiceXML is used to read structured text to the user, but then there is no specific markup for section titles (see comment on Guideline 2.4 Level 2 SC 3).
When text requires reading ability at or above the upper secondary education level, one or more of the following supplements is available:
Voice applications don't require a certain level of reading skills, but they do require a certain level of concentration and memory.
Superficially, voice applications have no issues regarding placement and functionality of content. However, interface consistency is very important.
Interface consistency is a fundamental quality of a usable application. Speech interface consistency includes consistent use of terminology, personality, DTMF, etc. People will pick up implicit rules as they interact with the system. Overall consistency makes the learning process smoother, and helps users predict what will happen and what they need to do, which in turn helps build confidence in using the system.
Interfaces should be consistent in:
- Terminology: use the same words rather than synonyms to refer to an object or event. So, use “To delete the message, say delete” rather than “To remove the message, say delete”.
- Personality: To convey a particular personality to the user, use the same wording, attitude, and style in all dialogs in an application.
- Use of DTMF: assign the same key to the same word or action. For example, you may assign “1” to “yes” and “9” to “no” across the application.
System: “Would you like to make an appointment? If yes, press 1; otherwise, press 9.”
System: "Would you like to go back to the main menu? If yes, press 1; otherwise, press 9.”(Eve Astrid Andersson et al: Early Adopter VoiceXML. Birmingham, UK: Wrox, 2002. P. 123.)
It is not clear how there can be changes of context that can not be programmatically determined.
In the context of voice applications, the distinction between delivery unit and perceivable unit is not important, because there is no way for users to specify the address (URL) of a specific delivery unit. As mentioned before, users are not aware of transitions between delivery units, unless there is a noticeable delay.
Because functions that need to be available in the whole application need only be specified once in the root document, there is no need to repeat them in the leaf documents (see Executing a Multi-Document Application). HTML has no similar concept, but the success criterion is based on this deficiency of HTML-based sites. If leaf documents in voice applications do not repeat components that are also available in other leaf documents, this success criterion is met, but if components are repeated, the order in the source code does not matter (assuming that the user interface is consistent).
VoiceXML does not use the concept of “focus”. Unlike an HTML document, a VoiceXML document forms a state machine.
A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
If a VoiceXML form has only one field, setting the input field may automatically submit the form; if it has multiple input fields, setting the last field may automatically submit the form, but form submission is only one of the things that can happen: variables may be set, fields may be cleared, or there may be transitions to subdialogs. Subdialogs may be in another delivery unit, especially if they are reused in multiple dialogs.
A VoiceXML menu is essentially a syntactic shorthand for a form containing a single anonymous field. Picking a choice from a menu automatically causes a transition to the dialog or VoiceXML document referenced by this choice.
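A minimal sketch of a menu (the destination documents are hypothetical):

<menu>
  <prompt>Say news or weather.</prompt>
  <!-- The content of each choice doubles as its grammar phrase -->
  <choice next="news.vxml">news</choice>
  <choice next="weather.vxml">weather</choice>
</menu>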
Since VoiceXML has the concept of an application root document where application-wide functions can be defined just once, it is not wise to repeat the definition of these functions in leaf documents. However, if certain functions are available in multiple delivery units, terminology should be consistent (see the comment on consistency above).
Not applicable (graphical components).
A noinput event may cause a transition to another VoiceXML document, for example because this event is caught and handled in the root document (so there is a transition from a leaf document to the root document), or the event handler may transfer the user to another VoiceXML document.
@TODO
@TODO
Christophe Strobbe: Christophe.Strobbe ( @ ) esat.kuleuven.be