Re: Aural extensions

Like Al, I have several reservations about Or Ben-Natan's Proposal For
Aural HTML Extensions recently posted to the HC list.  Al pointed out the
overall problem: the proposal takes too much control away from the user and
gives it to the author.  Allowing the author to force a user to listen to
particular parts of a page, to make parts non-interruptable, and to
specify acceptable inputs by context violates much of what I consider to
be good UI design principles.  On a distributed medium like the Web, that
is a recipe for disaster.

What follows is a rather long attempt to justify that statement, so if
you're not interested in the details, you'd best stop reading now.

Consider:

>   InterruptSpeech controls the user's freedom in stopping the speech
>   rendering process and moving to the next input element.  It informs
>   the user agent to recommend that the user will not barge in and
>   interrupt a message with a voice or a keyboard selection.

This appears to recommend completely disabling all user control of
navigation while some portion of a document is being read.  What kinds of
things will authors consider important enough to be non-interruptable?
Certainly advertising messages, but I suspect many authors will disable
interrupts for entire pages.  Imagine the Web listening experience: you
follow a link and are suddenly stuck listening to 100K of text on
someone's favorite subject, with no way to stop it.
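For instance, nothing in the quoted text seems to prevent an author from
putting the attribute on an entire page (this placement is my own invented
abuse, not an example from the proposal):

	<BODY InterruptSpeech>
	... 100K of text the user cannot skip ...
	</BODY>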

>   When relying on text to speech engines rather than on pre-recorded
>   voice files, the offering of anchors and other input tags may be done
>   using the text associated with the anchor or with the input tag (text
>   may be associated with input tags using the <label> tag).
>   For example,
>       <A href=driving.htm> Driving instruction </A>
>   May be offered by the voice browser using the following words:
>       "for driving instructions press D1"

I'm not entirely sure of the purpose of this proposal, so I may be off-base
here, but it appears to address the issue of dealing with a menu of
hot-links.  It seems to want to associate a different keystroke with each
anchor in a set of anchors, emulating the voice-text telephone menu systems.
I see no need for this.  First off, those are terrible interfaces, and the
navigation functions of most screen readers make them unnecessary.
Consider the cognitive load in these two cases:
1.	You listen to a set of menu items and key names,
	construct a mental mapping of menu items to key names,
	choose an item from the menu,
	recall the key name for that menu item, and
	find that key on the keyboard and press it to follow the link.

2.	You listen to a set of menu items,
	remember the order of the items,
	choose an item from the menu,
	press the back-arrow key enough times to get back to that item, and
	press a "select" key to follow the link.

The second approach seems to have the lesser cognitive load, and provides
the user with feedback: if they don't press "select" right away, the user
agent will start reading the document at the menu item they reached.

Besides increasing the user's cognitive load, assigning keystrokes to
anchors will permit inconsistency between pages.  Some authors might use
numbers, some a mnemonic lettering scheme, and so on.  The user is forced
to figure out a new response system for each page.

>       The Select attribute allows a string to be used with the help of
>       speech recognition software for the selection of the input.

This proposal allows the author to control the inputs accepted by a speech
recognition engine for a particular link or <INPUT> tag.  I really hope
the intent is that the author's words be accepted in addition to whatever
inputs are expected by default.  If not, this will
destroy the consistency of speech interfaces.  If an author uses the Select
attribute to allow the responses "sure" and "OK" to a checkbox, will the
user no longer be able to say "yes"?  How is the user to know what the
permitted responses are at any given time?  There needs to be a consistent
set of responses available to the user in any given context, regardless of
what the author wants.  These are features of the user agent, not the
document being read.
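To make that concrete, I imagine the attribute being used something like
this (the exact syntax is my guess; the proposal doesn't spell it out):

	<INPUT type=checkbox name=subscribe Select="sure, OK">

If Select replaces the default vocabulary, a user who says "yes" here gets
nowhere; if it merely augments the defaults, "yes", "sure" and "OK" all
work.  The proposal needs to say which it means.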

Even if the proposal is to allow the author to augment the set of valid
responses, there are additional problems.  What if the recognition engine
cannot recognize those words?  What if a word for a positive response is
unexpectedly close, in the recognition space, to a word for a negative
response, so that saying one is quite likely to be recognized as the
other?  Most importantly, there is no user feedback.  How is the user to
know whether their response was correctly recognized?  Feedback is the
most important aspect of any voice interface.

>         1. OnSelectionTimeout
>            The browser may generate an OnSelectionTimeout event when the
>            user is asked to provide input of any kind, such as a
>            selection from a list of anchors or a text input box, and
>            fails to do so.
...
>         2. OnSelectionError
>   When the user selects an option not offered by the browser, the user
>   must be notified that an error occurred. The notification and the
>   resulting action is to be performed by a script associated with
>   OnSelectionError event.

Control of error response mechanisms belongs in the user agent, not the
document.  An experienced user is not going to want the same level of error
notification as a novice, and will not tolerate it for long.  Were I to
mistype a key, I wouldn't want to hear anything more than a half-second
tone.  Hearing the same message every time, especially one that is not
front-loaded, would get old really fast.

I have some suggestions for addressing these problems in a more friendly
manner.  They are:

1.  Instead of making text playback non-interruptable, allow the author to
specify points at which a user agent's navigation functions should stop.
For example, a user agent might have a navigation function that moves
forward to the "next section" of the document (however that might be
defined).  An author could place an attribute on any tag to get such a
browser to treat that tag as dividing a section, thus causing it to stop
and begin reading at that tag.  The user then gets to decide if they want
to skip forward again or keep listening.  This addresses the issue of
getting people to attend to new information on a page.
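In markup, such a stop-point might look like this (the attribute name is
my own, purely for illustration):

	<H2 SectionStop> Ordering Information </H2>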

2.  Dealing with speech input for links and <INPUT> tags is not a simple
issue.  There are three topics to address here: active vocabulary, word
assignment and feedback.

a) By "active vocabulary" I mean the set of words that a recognition engine
will accept at any given time.  This should be controlled by the user
agent, as it must be able to inform the user in some way just which words
are allowed.  We could allow an author to suggest additions to the active
vocabulary, but the user agent must decide to accept or reject those
additions.

b) By "word assignment" I mean how the user agent decides what to do in
response to each word.  Does it check or uncheck a checkbox when it hears
"sure"?  If we allow the author to suggest additions to the active
vocabulary, they must also specify what should be done when those words are
recognized.  You could have two attributes for this: SelectCheck and
SelectUncheck.
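In markup, that might look something like this (again, a sketch of my own,
not part of any proposal):

	<INPUT type=checkbox name=mail_list
		SelectCheck="sure, sign me up" SelectUncheck="no thanks">

The user agent could accept those phrases in addition to its default "yes"
and "no", and would know exactly what to do when it hears one of them.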

c) Feedback on the results of speech recognition is the most important
issue.  Personally, I'd like to hear a non-speech signal telling me if I
checked or unchecked a checkbox.  I'd also like to have a method of
reviewing an entire form quickly before selecting the "Submit" button,
hearing what I selected in each input item.  Note that both of these ideas
have to do with the user agent, not the document.

This has gotten far too long for a response to a proposal that is somewhat
peripheral to this list, so I'd better stop now.
	- MacK.
-----
Edmund R. MacKenty
    <mack@sonicon.com>	SONICON Development, Inc.	  (1-617-926-2131)
		 http://www.sonicon.com/	    (FAX:  1-617-926-2126)
