- From: <mack@sonicon.com>
- Date: Wed, 22 Oct 1997 16:43:14 GMT
- To: w3c-wai-ig@w3.org
Like Al, I have several reservations about Or Ben-Natan's Proposal For Aural HTML Extensions recently posted to the HC list. Al pointed out the overall problem: the proposal takes too much control away from the user and gives it to the author. Allowing the author to force a user to listen to particular parts of the page, making parts non-interruptable, and specifying acceptable inputs by context violate much of what I consider to be good UI design principles. On a distributed medium like the Web, it is a recipe for disaster. What follows is a rather long attempt to justify that statement, so if you're not interested in the details, you'd best stop reading now.

> InterruptSpeech controls the user's freedom in stopping the speech
> rendering process and moving to the next input element. It informs
> the user agent to recommend that the user will not barge in and
> interrupt a message with a voice or a keyboard selection.

This appears to recommend completely disabling all user control of navigation while some portion of a document is being read. What kinds of things will authors consider important enough to be non-interruptable? Certainly advertising messages, but I suspect many authors will disable interrupts for entire pages. Imagine the Web listening experience when you follow a link and suddenly find yourself stuck listening to 100K of text on someone's favorite subject, unable to stop it.

> When relying on text to speech engines rather than on pre-recorded
> voice files, the offering of anchors and other input tags may be done
> using the text associated with the anchor or with the input tag (text
> may be associated with input tags using the <label> tag).
> For example:
> <A href=driving.htm> Driving instruction </A>
> may be offered by the voice browser using the following words:
> "for driving instructions press D1"

I'm not entirely sure of the purpose of this proposal, so I may be off-base here, but it appears to address the issue of dealing with a menu of hot-links. It seems to want to associate a different keystroke with each anchor in a set of anchors, emulating voice-text telephone menu systems. I see no need for this. First off, those are terrible interfaces, and the navigation functions of most screen readers make them unnecessary. Consider the cognitive load in these two cases:

1. You listen to a set of menu items and key names, construct a mental mapping of menu items to key names, choose an item from the menu, recall the key name for that menu item, and find that key on the keyboard and press it to follow the link.

2. You listen to a set of menu items, remember the order of the items, choose an item from the menu, press the back-arrow key enough times to get back to that item, and press a "select" key to follow the link.

The second approach seems to have the lower cognitive load, and it provides the user with feedback: if they don't press "select" right away, the user agent will start reading the document at the menu item they reached.

Besides increasing the user's cognitive load, assigning keystrokes to anchors will permit inconsistency between pages. Some authors might use numbers, some a mnemonic lettering scheme, and so on. The user is forced to figure out a new response system for each page.

> The Select attribute allows a string to be used with the help of
> speech recognition software for the selection of the input.
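The proposal gives no markup example, so I'm guessing at the intended syntax here; I take it to mean something like the following (the attribute values are purely my own illustration, not anything from the draft):

    <A href=driving.htm Select="driving instructions"> Driving instruction </A>
    <INPUT type=checkbox name=confirm Select="sure, OK">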
This proposal allows the author to control the inputs accepted by a speech recognition engine for a particular link or <INPUT> tag. I really hope the proposal means that the author's suggestions are in addition to whatever inputs are accepted by default. If not, this will destroy the consistency of speech interfaces. If an author uses the Select attribute to allow the responses "sure" and "OK" to a checkbox, will the user no longer be able to say "yes"? How is the user to know what the permitted responses are at any given time? There needs to be a consistent set of responses available to the user in any given context, regardless of what the author wants. These are features of the user agent, not of the document being read.

Even if the proposal is to allow the author to augment the set of valid responses, there are additional problems. What if the recognition engine cannot recognize those words? What if a word for a positive response is unexpectedly close to some word for a negative response in the recognition space, so that saying one makes it quite likely the other will be recognized? Most importantly, there is no user feedback. How is the user to know if their response was correctly recognized? This is the most important aspect of any voice interface.

> 1. OnSelectionTimeout
>    The browser may generate the OnSelectionTimeout event when the
>    user is asked to provide input of any kind, such as a selection
>    from a list of anchors or a text input box, and fails to do so.
> ...
> 2. OnSelectionError
>    When the user selects an option not offered by the browser, the
>    user must be notified that an error occurred. The notification and
>    the resulting action are to be performed by a script associated
>    with the OnSelectionError event.

Control of error response mechanisms belongs in the user agent, not the document. An experienced user is not going to want the same level of error notification as a novice, and will not tolerate it for long. Were I to mistype a key, I don't want to hear anything more than a half-second tone. Hearing the same message every time, especially one that is not front-loaded, would get old really fast.

I have some suggestions for addressing these problems in a more friendly manner. They are:

1. Instead of making text playback non-interruptable, allow the author to specify points at which a user agent's navigation functions should stop. For example, a user agent might have a navigation function that moves forward to the "next section" of the document (however that might be defined). An author could place an attribute on any tag to get such a browser to treat that tag as dividing a section, thus causing it to stop and begin reading at that tag (there is a rough sketch of this at the end of these suggestions). The user then gets to decide if they want to skip forward again or keep listening. This addresses the issue of getting people to attend to new information on a page.

2. Dealing with speech input for links and <INPUT> tags is not a simple issue. There are three topics to address here: active vocabulary, word assignment and feedback.

a) By "active vocabulary" I mean the set of words that a recognition engine will accept at any given time. This should be controlled by the user agent, as it must be able to inform the user in some way just which words are allowed. We could allow an author to suggest additions to the active vocabulary, but the user agent must decide to accept or reject those additions.

b) By "word assignment" I mean how the user agent decides what to do in response to each word.
Does it check or uncheck a checkbox when it hears "sure"? If we allow the author to suggest additions to the active vocabulary, they must also specify what should be done when those words are recognized. You could have two attributes for this: SelectCheck and SelectUncheck (also sketched below).

c) Feedback on the results of speech recognition is the most important issue. Personally, I'd like to hear a non-speech signal telling me whether I checked or unchecked a checkbox. I'd also like to have a method of reviewing an entire form quickly before selecting the "Submit" button, hearing what I selected in each input item. Note that both of these ideas have to do with the user agent, not the document.
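To make these suggestions a bit more concrete, here is roughly the sort of markup I have in mind. None of it comes from the proposal: SectionBreak is just a name I made up for the sake of illustration, and SelectCheck/SelectUncheck are the attributes I suggested in 2(b):

    <H2 SectionBreak> New product announcements </H2>

    <INPUT type=checkbox name=subscribe
           SelectCheck="sure, OK, sign me up"
           SelectUncheck="no thanks, cancel">

In both cases the user agent remains free to ignore the author's hints, say, if the recognition engine cannot reliably distinguish the suggested words.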
This has gotten far too long for a response to a proposal that is somewhat peripheral to this list, so I'd better stop now.

- MacK.
-----
Edmund R. MacKenty <mack@sonicon.com>
SONICON Development, Inc. (1-617-926-2131)
http://www.sonicon.com/ (FAX: 1-617-926-2126)

Received on Wednesday, 22 October 1997 12:43:59 UTC