This document collects some notes on how the Web Content Accessibility Guidelines 2.0 relate to VoiceXML and dialog systems. The intention is to find out which success criteria are hard or impossible to meet in VoiceXML, which success criteria should be modified to take dialog systems into account, and to describe techniques that VoiceXML developers can use to meet the success criteria. The techniques still need to be written; hopefully it will be possible to gather a sufficient number of techniques to create a W3C Note on “VoiceXML Techniques for WCAG 2.0”.
This document is a draft and refers to the 30 June 2005 Working Draft of WCAG 2.0.
Voice Extensible Markup Language (VoiceXML) Version 2.0 has been a W3C Recommendation since March 2004.
VoiceXML
is designed for creating audio dialogs that feature synthesized speech,
digitized audio, recognition of spoken and DTMF key input,
recording of spoken input, telephony, and mixed initiative conversations. Its
major goal is to bring the advantages of Web-based development and content
delivery to interactive voice response applications.
There are
two kinds of dialogs: forms and menus. Forms define an interaction
that collects values for a set of form item variables. Each field may specify
a grammar that defines the allowable inputs for that field. If a form-level
grammar is present, it can be used to fill several fields from one utterance.
A menu presents the user with a choice of options and then
transitions to another dialog based on that choice.
An example of a form that asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
Each
dialog has one or more speech and/or DTMF
grammars associated with it. In machine directed
applications, each dialog's grammars are active only when the user is in that
dialog. In mixed initiative applications, where the user and the
machine alternate in determining what to do next, some of the dialogs are
flagged to make their grammars active (i.e.,
listened for) even when the user is in another dialog in the same document,
or on another loaded document in the same application. In this situation, if
the user says something matching another dialog's active grammars, execution
transitions to that other dialog, with the user's utterance treated as if it
were said in that dialog. Mixed initiative adds flexibility and power to
voice applications.
A
link supports mixed initiative. It specifies a grammar that is
active whenever the user is in the scope of the link. If user input matches
the link's grammar, control transfers to the link's destination
URI. A link can be used to throw an event or go to a
destination URI.
The VoiceXML 2.0 specification contains an appendix on accessibility. The last part of this appendix contains some guidelines:
Additional guidelines for enabling persons with disabilities to access VoiceXML applications include the following:
- Reuse navigation structures that are highly usable and leverage learning across multiple applications, for example the navigational techniques of the ANSI/NISO Z39.86-2002 Digital Talking Book standard, http://www.loc.gov/nls/z3986/.
- Each element in which an event can occur should specify catch elements, including one with a fail-soft or recovery functionality.
- Enable users to control the length of time before timeout, the speaking rate of synthesized speech, and other such variables that provide a little extra time to respond or complete an input action, particularly when the VoiceXML interpreter Context has detected that the user is using an ancillary device instead of listening or speaking. These are especially useful for users with cognitive disabilities.
- Advertise alternative modes through which equivalent service is available, including transfer to a human operator, text telephone service, etc., or the availability of the same information via the World Wide Web.
It would be useful to have code examples for these guidelines, especially the third one.
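As a rough illustration of the third guideline, the following sketch (the form names, prompt wording and grammar file are invented for this example) first asks the user whether he or she needs extra time, and then transitions to a version of the dialog that sets a longer timeout with the property element:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="pace">
    <field name="moretime" type="boolean">
      <prompt>Would you like extra time to answer each question?</prompt>
      <filled>
        <if cond="moretime">
          <goto next="#drinks_slow"/>
        <else/>
          <goto next="#drinks"/>
        </if>
      </filled>
    </field>
  </form>
  <form id="drinks">
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
  </form>
  <form id="drinks_slow">
    <!-- allow a longer silence interval after each prompt in this dialog -->
    <property name="timeout" value="20s"/>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>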
Applies to recorded prompts and any other recordings of spoken text, and to earcons. The busy tone and the network busy tone are outside the scope of VoiceXML.
Speech Synthesis Markup Language has a desc element that can be used inside audio elements:
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of xml:lang in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output. [code example]
The desc element can only contain descriptive text.
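For example (a minimal sketch; the audio file name is invented, and support for the SSML desc element inside VoiceXML prompts may vary between platforms):

<prompt>
  <audio src="doorbell.wav">
    <desc>door bell ringing</desc>
    The door bell rings.
  </audio>
</prompt>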
Text alternatives can be provided for each audio prompt within the
audio
elements, for example:
<audio src="greetings.wav">Greetings</audio>
Applies to flow control and scripts (ECMAScript) that can be used in VoiceXML. However, if flow control and scripts produce a result that should be presented to the user, this is always something that is covered by other success criteria. VoiceXML has no equivalent for graphical links or buttons.
This applies to music that does not include words (although telephones are not well suited for this because of their limited frequency range).
This may apply to telephone hold music. (Otherwise, non-text content that meets these criteria is probably not used a lot in voice applications: it would be a waste of time.) However, you cannot skip on-hold music.
Applies only to live audio-only content.
Not applicable. In the context of this success criterion, multimedia means combined audio and video. Streaming video through a voice application doesn't seem to make sense. Prerecorded audio can be used, but this is already covered by the success criteria at level 1.
Requiring a text alternative (GL 1.1 L1 SC 1-5) seems reasonable because the person whose voice is being used (in the recording) also needs this text, so it is definitely available.
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
Not applicable (multimedia).
The main reasons for the existence of this guideline do not apply to VoiceXML: VoiceXML does not describe documents in which structures such as headings, blockquotes, etcetera can be discerned, and there is no way to fake structure by means of presentational elements; even if it were possible, the resulting voice application would not be usable.
Presentation elements for voice applications are defined in the W3C Speech Synthesis Markup Language (a Recommendation since September 2004) and reused in VoiceXML 2.0. Most importantly for this guideline, there is no technique to separate speech markup from dialog markup.
Until I find evidence to the contrary, a VoiceXML document must meet this success criterion or the voice application will not work. However, if VoiceXML is used to present structured text (newspaper articles, any non-fiction or fiction), there are no elements to mark up headings, quotes, etcetera. There is not even an element for descriptive titles (see the discussion of Guideline 2.4 Level 2 SC 3 below).
Note: the assumption behind the above comment is that <emphasis>, <break> and most other SSML elements (with the exception of <meta>, <metadata>, <p> and <s>) are presentational elements, not structural elements. One can use SSML to mark up the types of (structured) text I listed, but that would be like abusing presentational HTML markup to fake structure (e.g. using big, bold text instead of <hx>).
To my understanding, VoiceXML elements do not represent
structure in the sense of document structure, but dialogue/interaction
"structure", and without the proper use of these structures (prompts, forms,
menus, nomatch, etcetera) a VoiceXML document will not
work properly. This is where HTML differs from
VoiceXML: if you throw out structure and rely purely on
presentational elements, an HTML document will still work
for a large percentage of the population.
Not applicable (color).
In voice applications, Speech Synthesis Markup Language can be used to control aspects of speech such as pronunciation, volume, pitch, rate etcetera.
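A minimal sketch of such speech markup inside a VoiceXML prompt (the wording is invented; the rate and volume attribute values come from SSML):

<prompt>
  <prosody rate="slow" volume="loud">
    Please listen carefully to the following options.
  </prosody>
  You can say <emphasis>balance</emphasis>, <emphasis>transfer</emphasis> or <emphasis>operator</emphasis>.
</prompt>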
Not applicable (color).
Unlike HTML, VoiceXML cannot use a style language that rearranges content.
Not applicable (background image, color or text).
Not applicable (background image, color or text).
Not applicable (background pattern of lines).
This is very relevant to voice applications, as some applications play background audio that makes the prompts hard to understand.
VoiceXML elements that fetch VoiceXML documents also support the fetchaudio attribute. This attribute specifies
[t]he URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.
The VoiceXML specification also adds that
[t]he fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing.
By default, VoiceXML applications do not play background audio while fetching a resource of any kind. Apparently, background audio can only be played when the application fetches a VoiceXML document, not when it fetches audio, grammar or script files. Also, the background audio does not loop. Queued prompts are played during the fetch and before the background audio is played. It is not clear how VoiceXML supports background audio that plays during dialogs. WCAG 2.0 does not define background audio, but if VoiceXML cannot play background audio during prompts or other blocks of synthesized speech, then its concept of background audio does not fit WCAG's (implicit) concept of background audio.
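A minimal sketch of the fetchaudio attribute on a submit element (the URIs are hypothetical); the audio clip plays only while the next VoiceXML document is being fetched:

<form id="order">
  <field name="drink">
    <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
    <grammar src="drink.grxml" type="application/srgs+xml"/>
  </field>
  <block>
    <submit next="http://www.example.com/confirm.vxml"
      fetchaudio="http://www.example.com/audio/hold.wav"/>
  </block>
</form>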
Not applicable (background image, color or text).
In VoiceXML, audio content can only contain background sounds if it is pre-recorded and must then meet Guideline 1.1 Level 1. However, for people who are not deaf, the difference in volume between foreground and background audio is very relevant.
If this forces voice applications to provide both speech grammars and DTMF grammars for every dialog, it leads to longer prompts, which may harm the usability of voice applications. On the other hand, this success criterion (combined with Guideline 1.1 and others) is necessary to make VoiceXML applications accessible to people whose voice may be hard to recognize or who cannot speak at all. A sketch of a field with both kinds of grammar follows below.
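A field that accepts both speech and DTMF input could look like the following sketch (the grammar files are hypothetical; the mode attribute distinguishes the voice grammar from the DTMF grammar):

<field name="drink">
  <prompt>
    Would you like coffee, tea, milk, or nothing?
    You can also press 1 for coffee, 2 for tea, 3 for milk, or 4 for nothing.
  </prompt>
  <grammar mode="voice" src="drink.grxml" type="application/srgs+xml"/>
  <grammar mode="dtmf" src="drink-dtmf.grxml" type="application/srgs+xml"/>
</field>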
See comment on previous success criterion.
@TODO: check the interdigittimeout property in the appendix on timing properties in the VoiceXML 2.0 specification.
Content is designed so that time-outs are not an essential part of interaction, or at least one of the following is true for each time-out that is a function of the content:
- the user is allowed to deactivate the time-out or;
- the user is allowed to adjust the time-out over a wide range which is at least ten times the length of the default setting or;
- the user is warned before time expires, allowed to extend the time-out with a simple action (for example, "hit any key") and given at least 20 seconds to respond or;
- the time-out is an important part of a real-time event (for example, an auction), and no alternative to the time-out is possible or;
- the time-out is part of an activity where timing is essential (for example, competitive gaming or time-based testing) and time limits can not be extended further without invalidating the activity.
Prompts in VoiceXML dialogs can specify a timeout
attribute:
The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value.
The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next.
When a noinput event is thrown, it can be caught with <catch event="noinput">, or with <noinput>, which is shorthand notation for <catch event="noinput">. The noinput element has an optional count attribute that allows the developer to emit a different prompt when noinput events are repeated. For example, on the first noinput event (<noinput count="1">), the application may simply reprompt, and on subsequent noinput events (<noinput count="2">), it may emit more directed prompts. (See the example in the VoiceXML 2.1 Development Guide.) This technique is called tapered prompting. The noinput element can contain a prompt element that defines a longer timeout than the first prompt or any other previous prompts. See the example below.
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
    http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
  <form id="tapered">
    <field name="color">
      <prompt timeout="20s">
        Please pick a color for your Model T. You can have any color as long as it's black.
      </prompt>
      <grammar src="black.grxml" type="application/srgs+xml"/>
      <noinput count="1">
        <prompt timeout="40s">
          Pick a color for your new Model T.
        </prompt>
      </noinput>
      <noinput count="2">
        <prompt timeout="60s">
          Please choose the color of your new nineteen twenty four Ford Model T.
          Possible colors are black, black, or black. Please take your time.
        </prompt>
      </noinput>
    </field>
    <block>
      <submit next="http://www.ford-t.com/color.asp"/>
    </block>
  </form>
</vxml>
However, the use of this feature does not meet any of the bullet points of this success criterion:
The user is reprompted and given at least 20 seconds extra time: this comes close to the third bullet point in the success criterion. Maybe this should be added as an extra bullet point for “computer-directed interactions”.
I am not sure if it is possible to store the timeout value in a variable, then reference this variable in the timeout attribute and increment it in <noinput>, or to prompt the user for a new value.
It is not obvious how one can turn off timeout: section 6.3.4 states that the default value for timeout is platform-dependent. To my understanding, this means that some platforms (may) provide their own timeout value, while others don't provide a value, and that both categories of platforms conform to the specification. [If turning off timeout is possible, some example code would be useful.]
This success criterion does not only apply to the timeout attribute, but also to the maxtime and finalsilence attributes of the record element, and probably also to the maxspeechtimeout property (one of the generic speech recognizer properties).
Not applicable (blinking).
If a prompt does not specify a noinput timeout, the application uses a platform-specific default timeout. It is not obvious how content can be paused by the user. The mark element can be used to place a marker into the text/tag sequence; this location can then be referenced. The problem here is that mark is in SSML, but regarding its purpose in VoiceXML, the VoiceXML specification says that it is “ignored by VoiceXML platforms” (section 4.1.1: Speech Markup).
Since it is not obvious how timeouts in VoiceXML applications can be disabled, it is not clear how this success criterion can be met.
VoiceXML applications do not use automatic updates of content that is being presented to the user, because that would cause major usability problems.
In the context of web sites, this success criterion does not require that the user is led back to the form he/she was filling in, only that the data are not lost. This should work in VoiceXML. For example, there is a disconnect element that allows an application to perform some necessary final cleanup, such as submitting data to the server.
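A minimal sketch of such cleanup (the variable names and server URI are hypothetical): when the caller hangs up, the platform throws a connection.disconnect.hangup event, which can be caught to submit the data collected so far:

<catch event="connection.disconnect.hangup">
  <!-- the caller hung up: save what has been collected so far -->
  <submit next="http://www.example.com/save-partial.asp" namelist="drink size"/>
</catch>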
Not applicable.
In voice applications, users navigate through several layers of ‘menus’ by answering prompts. Unlike HTML, where an image or paragraph can be turned into a link by means of JavaScript, VoiceXML has no techniques to turn, say, a prompt into a link, choice (in a menu) or field (in a form). It appears that VoiceXML applications meet this criterion by default.
In voice applications, there is usually only one way. If the developer can think of several ways to reach a specific piece of content (i.e. several ways to structure the menus), he will probably only implement the most efficient one, because implementing the other ways would mean that prompts (containing lists of choices where to go next) would only become longer and therefore less usable.
Web applications where users are required to provide information in a specific sequence of forms also violate this criterion. The success criterion needs an exemption; see the editorial note:
There are some concerns about how this criterion would apply to web applications where the only way to expose certain types of content would require the user to follow a specific path.
The success criterion could be reworded as follows:
More than one way is available to locate content within a set of delivery units, except in computer-directed interactions.
This reformulation requires a definition:
The rewording and the definition are inspired by VoiceXML's concept of directed forms.
Not applicable, because perceivable units in voice applications cannot be broken down into anything smaller?
VoiceXML does not describe documents, so titles are useless to end users. A title can be provided in the meta element and/or metadata element, but the content of this element is not supposed to be played back to the user. This success criterion may need an exemption, for example as follows:
Delivery units have descriptive titles, except in interaction technologies where titles cannot be presented to the user.
VoiceXML applications always meet this success criterion if the items in a grammar are interpreted as “link text”, except where a programmatic reference is not matched by anything in a prompt or grammar (such a reference would be like a piece of unused code in a program).
Also, in voice applications, users do not notice when another delivery unit is fetched; there is no way for users to find out what the delivery units in a voice application are, with the exception of recorded sound.
In VoiceXML applications, users don't really “navigate” a delivery unit. For the content of a prompt or block, this success criterion is always met. For a VoiceXML document that contains multiple dialogs, the sequence depends on the user's input. Requiring that the dialogs should be navigated in source code order makes no sense.
Not applicable: location is not a useful concept in voice applications. However, it is useful and easily achievable to provide the user with the possibility to go back to the main menu (“home”), to quit the current dialog, to ask for help, etcetera.
It is not possible to detect errors, only that inputs don't match grammars, so VoiceXML applications don't need to do anything to meet this criterion (?).
(VoiceXML applications can also throw and catch other types of errors: for example, if a resource is not available, it is possible to warn the user. However, these are not input errors or “user errors”, as they were previously called. When a resource is not available, the user should be notified, but WCAG has no guideline on this: it is a usability issue rather than an accessibility issue.)
See previous comment. However, it is possible to tell the user that there was no match between his input and the grammar, and to provide more useful prompts. When there is no match, the VoiceXML interpreter throws a nomatch event, which can be caught with <catch event="nomatch"> or <nomatch>. As with timeouts, which cause a noinput event, it is possible to provide tapered prompts (see comments on Guideline 2.2 Level 1 SC 1).
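A minimal sketch of tapered prompts for nomatch events, analogous to the noinput example above (the grammar file is hypothetical):

<field name="drink">
  <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <nomatch count="1">
    Sorry, I did not understand you.
    <reprompt/>
  </nomatch>
  <nomatch count="2">
    Sorry, I still did not understand you.
    Please answer with a single word: coffee, tea, milk, or nothing.
  </nomatch>
</field>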
For forms that cause legal or financial transactions to occur, that modify or delete data in remote data storage systems, or that submit test responses, at least one of the following is true:
@TODO
This is easily achieved with tapered prompts and/or the help element (or <catch event="help">). However, this is a usability issue rather than an accessibility issue, except for people with cognitive disabilities.
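For example (a sketch with invented wording; the help element is shorthand for <catch event="help">):

<field name="drink">
  <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
  <grammar src="drink.grxml" type="application/srgs+xml"/>
  <help>
    This is the drink ordering service.
    Please answer with one of the words coffee, tea, milk, or nothing.
    <reprompt/>
  </help>
</field>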
Use the xml:lang attribute.
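For example (a minimal sketch; the French prompt merely illustrates a passage in another language):

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0" xml:lang="en-US">
  <form>
    <block>
      <prompt>Welcome.</prompt>
      <prompt xml:lang="fr-FR">Bienvenue.</prompt>
    </block>
  </form>
</vxml>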
Use the xml:lang attribute. @TODO: check the lowest level where xml:lang is available.
??? If such a mechanism is necessary, the content is not usable. ???
This is hardly feasible in any context because one would either need to
??? If such a mechanism is necessary, the content is not usable. ???
???
Not applicable, unless VoiceXML is used to read structured text to the user, but then there is no specific markup for section titles (see comment on Guideline 2.4 Level 2 SC 3).
When text requires reading ability at or above the upper secondary education level, one or more of the following supplements is available:
Voice applications don't require a certain level of reading skills, but they do require a certain level of concentration and memory.
Superficially, voice applications have no issues regarding placement and functionality of content. However, interface consistency is very important.
Interface consistency is a fundamental quality of a usable application. Speech interface consistency includes consistent use of terminology, personality, DTMF, etc. People will pick up implicit rules as they interact with the system. Overall consistency makes the learning process smoother, and helps users predict what will happen and what they need to do, which in turn helps build confidence in using the system.
Interfaces should be consistent in:
- Terminology: use the same words rather than synonyms to refer to an object or event. So, use “To delete the message, say delete” rather than “To remove the message, say delete”.
- Personality: To convey a particular personality to the user, use the same wording, attitude, and style in all dialogs in an application.
- Use of DTMF: assign the same key to the same word or action. For example, you may assign “1” to “yes” and “9” to “no” across the application.
System: “Would you like to make an appointment? If yes, press 1; otherwise, press 9.”
System: “Would you like to go back to the main menu? If yes, press 1; otherwise, press 9.” (Eve Astrid Andersson et al.: Early Adopter VoiceXML. Birmingham, UK: Wrox, 2002. P. 123.)
It is not clear how there can be changes of context that can not be programmatically determined.
In the context of voice applications, the distinction between delivery unit and perceivable unit is not important, because there is no way for users to specify the address (URL) of a specific delivery unit. As mentioned before, users are not aware of transitions between delivery units, unless there is a noticeable delay.
Because functions that need to be available in the whole application need only be specified once in the root document, there is no need to repeat them in the leaf documents (see Executing a Multi-Document Application). HTML has no similar concept, but the success criterion is based on this deficiency of HTML-based sites. If leaf documents in voice applications do not repeat components that are also available in other leaf documents, this success criterion is met, but if components are repeated, the order in the source code does not matter (assuming that the user interface is consistent).
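The following sketch illustrates this (the file names, grammar files and server URI are hypothetical): the root document defines a “main menu” link once, and the leaf document makes it available by pointing to the root document in its application attribute.

<?xml version="1.0" encoding="UTF-8"?>
<!-- root.vxml: application root document -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- this link is active in every document of the application -->
  <link next="root.vxml#home">
    <grammar src="mainmenu.grxml" type="application/srgs+xml"/>
  </link>
  <form id="home">
    <block>
      <prompt>Main menu.</prompt>
      <goto next="orders.vxml"/>
    </block>
  </form>
</vxml>

<?xml version="1.0" encoding="UTF-8"?>
<!-- orders.vxml: leaf document -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0" application="root.vxml">
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee or tea? You can say main menu at any time.</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
    </field>
    <block>
      <submit next="http://www.example.com/order.asp"/>
    </block>
  </form>
</vxml>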
VoiceXML does not use the concept of “focus”. Unlike HTML documents, a VoiceXML document forms a finite state machine.
A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
(Wikipedia has a definition of finite state machine.)
If a VoiceXML form has only one field, setting the input field may automatically submit the form; if it has multiple input fields, setting the last field may automatically submit the form, but form submission is only one of the things that can happen: variables may be set, fields may be cleared, or there may be transitions to subdialogs. Subdialogs may be in another delivery unit, especially if they are reused in multiple dialogs.
A VoiceXML menu is essentially a syntactic shorthand for a form containing a single anonymous field. Picking a choice from a menu automatically causes a transition to the dialog or VoiceXML document referenced by this choice.
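For example (a sketch with hypothetical document URIs; the enumerate element speaks the list of choices):

<menu>
  <prompt>Say one of: <enumerate/></prompt>
  <choice next="http://www.example.com/news.vxml">news</choice>
  <choice next="http://www.example.com/weather.vxml">weather</choice>
  <choice next="http://www.example.com/sports.vxml">sports</choice>
  <noinput>Please say one of <enumerate/></noinput>
</menu>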
Since VoiceXML has the concept of an application root document where application-wide functions can be defined just once, it is not wise to repeat the definition of these functions in leaf documents. However, if certain functions are available in multiple delivery units, terminology should be consistent (see the comment on consistency above).
Not applicable (graphical components).
A noinput event may cause a transition to another VoiceXML document, for example because this event is caught and handled in the root document (so there is a transition from a leaf document to the root document), or the event handler may transfer the user to another VoiceXML document.
The 30 June 2005 Working Draft contains no success criteria for this guideline. In the 19 November 2004 Working Draft, this guideline contained the following success criteria.
The existence of this success criterion has to do with HTML extensions that can be helpful for screen readers. There are probably no such extensions in VoiceXML. Note that this success criterion does not justify the creation of new proprietary extensions.
Any VoiceXML application that uses VoiceXML, SSML, SRGS, JSGF, Nuance SpeechObjects, etcetera according to specification (which is more than just writing code that validates against a DTD or formal grammar) should pass this success criterion.
Content using baseline technologies or non-baseline technologies, must meet the following criteria:
- Content that violates international health and safety standards for general flash or red flash is marked in a way that the user can avoid its appearance
- If the user can enter the content using the keyboard, then the user can exit the content using the keyboard.
Criterion a does not apply to VoiceXML because flash is only relevant to visual media.
Criterion b may require modifications in existing VoiceXML applications. [@TODO check]
@TODO
@TODO
Christophe Strobbe: Christophe.Strobbe ( @ ) esat.kuleuven.be