Re: [HTML Speech] Let's get started!

  On 9/9/2010 4:32 PM, Deborah Dahl wrote:
>   Use case 2: user-controlled speech parameters
> User has difficulty speaking quickly enough for the existing timeouts
> because of a speech, reading or cognitive disability and would like to
> lengthen the speech timeout. For example, I've heard anecdotally that speech
> timeouts are extremely stressful for people who stutter and actually make
> their stuttering worse.
You don't need to stutter or have a cognitive impairment to have trouble with the 
pressures imposed by continuous speech recognition.

You'll probably hear about this from other speech recognition users, but dictating 
continuous speech is sometimes very stressful because you need to put together the 
entire sentence or command as a single thought, work out all the arguments in your 
mind, then say it, correct any misrecognitions, and then move on to the next one. I 
know that when I'm writing text, I find myself saying the first half of one sentence 
and the second half of another, because as I go through the process of thinking about 
what I'm saying, I change my mind.

Long timeouts are also frustrating, because you learn not to dictate too much; the 
cost of correcting a misrecognition is so high. When I'm writing fiction, sometimes I 
don't pay attention to the screen for a paragraph or more, and usually I end up with 
half a paragraph of crap because the recognition process fell off the face of the 
earth and gave me a set of words that I didn't say, in a language I don't know. I've 
since learned that keeping my eye on the recognition box is critically important. 
When the delays in recognition performance cross the 5-second mark, the stress of 
holding in your mind what you want to say next becomes stress in your body, and you 
can't dictate as much before causing damage.
> B. Use case that motivates a requirement to make it easy to integrate input
> from different modalities.
> Use case: User is using a mobile friend-finding application and says, "is
> Mary Smith anywhere around here?" To answer this question the application
> should combine information from geolocation (to understand "here"), speech
> recognition, and potentially even speaker verification  information to
> insure that Mary Smith has actually authorized the user to know where she
> is. New modalities are continually becoming available, so it would be
> difficult to provide for integration on a case by case basis.

Wouldn't the application simply provide the "is <user> anywhere around here" 
grammar to the default recognizer, along with a list of values for "user"? I 
imagine in return it would get the top five users and their confidence values. 
Once the geolocation application has that information, it would go off into 
its own magic that makes the user happy.
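Just to make that shape concrete, here is a rough sketch in TypeScript. The 
recognizer interface is entirely made up for illustration (recognize(), Hypothesis, 
and so on are not a real or proposed API); the point is only "grammar plus slot 
values in, n-best list with confidences out".

// Hypothetical recognizer interface -- illustration only, not a real API.
interface Hypothesis {
  user: string;        // value matched for the <user> slot
  confidence: number;  // 0.0 .. 1.0
}

interface Recognizer {
  // Hand the recognizer a grammar template and the legal slot values;
  // get back an n-best list of hypotheses for one utterance.
  recognize(grammar: string, slots: Record<string, string[]>): Promise<Hypothesis[]>;
}

async function findFriend(recognizer: Recognizer, knownUsers: string[]): Promise<Hypothesis[]> {
  // "is <user> anywhere around here", filled from the application's own user list
  const hypotheses = await recognizer.recognize(
    "is <user> anywhere around here",
    { user: knownUsers },
  );

  // Keep the top five candidates; the geolocation side takes it from there.
  return hypotheses
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, 5);
}

The application would then correlate those top candidates with geolocation, speaker 
verification, or whatever other modalities it has, without the recognizer needing to 
know about any of them.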

Voice recognition and speech recognition are two radically different processes. 
You can do both with the same audio stream, but only if each works directly from 
the raw data. And yes, voice recognition would be a really good authentication 
tool, although I'm not sure about holding the camera up to your face for retinal 
scans. A bit too Minority Report for me.
> D. Use cases that motivate a requirement to always make the use of speech
> optional.
> Use case 1: the user can't speak, can't speak the language well enough to be
> recognized, the speech recognizer just doesn't work well for them, or they
> like typing.
> Use case 2: the user is in an environment where speech is inappropriate,
> like a meeting, or they want to communicate something private, or it's just
> noisy.

Train, plane, airport, bus station, subway stop. Most of these can be solved by 
using a steno mask, although you might draw the attention of law enforcement and 
terrorist-phobic people if you're wearing something that covers your face and 
looks like a gas mask.

> E. Use case that the standard should support completely hands-free
> operation.
> This would mean that there should be a way to speech-enable everything that
> you would do with a mouse, a touchscreen, or by typing.
> Use case: the user doesn't have the use of their hands, either temporarily
> or permanently, or using their hands is difficult. For example, someone is
> repairing a machine, their hands are holding tools and are dirty, but they
> want to browse an HTML manual for the machine.
> I realize there are a lot of difficulties in completely hands-free
> operation, but I wanted to put it out for discussion. It would be good to
> explore how close we can come.

We can do a lot better than we have with hands-free operation. As I've said 
elsewhere, the user interface needs to be radically different from a GUI 
interface. It needs to do appropriate hinting when the user stalls, and most 
importantly, it shouldn't do anything that even vaguely stresses a person's 
throat. I blew out my hands after 18 years of programming. I've managed to keep 
my voice intact over the last 15 years by not using the hands-free options in 
NaturallySpeaking. I've tried all of them, and they are all dangerous to the 
throat; I would have to be in dire straits to count on them. If I lose my 
voice, I am so royally screwed. I don't want SSDI and Section 8 housing to be 
part of my future. I know of two people in the Boston area who have had this 
happen to them. I've seen others potentially in that state, but I've lost 
track of them.

If you ever want to interview disabled speech recognition users, let me know. I 
can arrange for some through the Boston voice users group.

Received on Thursday, 9 September 2010 21:56:23 UTC