- From: Deborah Dahl <dahl@conversational-technologies.com>
- Date: Mon, 13 Sep 2010 18:21:13 -0400
- To: "'Eric S. Johansson'" <esj@harvee.org>
- Cc: <public-xg-htmlspeech@w3.org>
Thanks for your comments, some responses inline.

> -----Original Message-----
> From: Eric S. Johansson [mailto:esj@harvee.org]
> Sent: Thursday, September 09, 2010 5:55 PM
> To: Deborah Dahl
> Cc: public-xg-htmlspeech@w3.org
> Subject: Re: [HTML Speech] Let's get started!
>
> On 9/9/2010 4:32 PM, Deborah Dahl wrote:
> > Use case 2: user-controlled speech parameters
> > User has difficulty speaking quickly enough for the existing timeouts
> > because of a speech, reading or cognitive disability and would like to
> > lengthen the speech timeout. For example, I've heard anecdotally that
> > speech timeouts are extremely stressful for people who stutter and
> > actually make their stuttering worse.
>
> Stuttering or cognitive impairments are not necessary to have trouble with
> the pressures imposed by continuous speech recognition.
>
> You'll probably hear about this from other speech recognition users, but the
> process of speaking continuous speech is sometimes very stressful, because
> you need to put together the entire sentence or command as a single thought,
> work out all the arguments in your mind, then say it, correct any
> misrecognitions, and then move on to the next one. I know that when I'm
> writing text, I find myself saying the first half of one sentence and the
> second half of another, because as I go through the process of thinking
> about what I'm saying, I change my mind.
>
> Long timeouts are also frustrating because you learn not to dictate too
> much, because the cost of correcting a misrecognition is so high. When I'm
> writing fiction, sometimes I don't pay attention to the screen for a
> paragraph or more, and usually I end up with half a paragraph of crap
> because the recognition process fell off the face of the earth and gave me
> a set of words that I didn't say, in a language I don't know. I've since
> learned that keeping my eye on the recognition box is critically important.
> When the delays in recognition performance cross the 5-second mark, the
> stress of holding in your mind what you want to say next becomes stress in
> your body, and you can't dictate as much before causing damage.

This sounds like more good evidence that end users need a way to adjust
timeouts.
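Just to make "a way to adjust timeouts" concrete, here is a rough sketch of
the kind of knob I have in mind. The names are hypothetical, loosely modeled
on VoiceXML 2.0's completetimeout, incompletetimeout, and maxspeechtimeout
properties, and are not anything the group has proposed; the idea is only
that whatever defaults a page sets, the user (or their assistive technology)
should be able to scale them.

```typescript
// Hypothetical sketch only: these names are illustrative, not a proposed API.
interface RecognitionTimeouts {
  completeTimeout: number;   // ms of silence after a complete match before finalizing
  incompleteTimeout: number; // ms of silence allowed while a match is still incomplete
  maxSpeechTimeout: number;  // upper bound on the length of a single utterance, in ms
}

// A user-agent or assistive-technology preference could multiply whatever the
// page asks for, so a slow or disfluent speaker is never locked into the page
// author's deadlines.
function applyUserTimeoutPreference(
  pageDefaults: RecognitionTimeouts,
  userScale: number // e.g. 2.0 means "give me twice as long"
): RecognitionTimeouts {
  return {
    completeTimeout: pageDefaults.completeTimeout * userScale,
    incompleteTimeout: pageDefaults.incompleteTimeout * userScale,
    maxSpeechTimeout: pageDefaults.maxSpeechTimeout * userScale,
  };
}
```

The exact shape doesn't matter; what matters is that the scaling happens on
the user's side, outside the page author's control.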
> > B. Use case that motivates a requirement to make it easy to integrate
> > input from different modalities.
> > Use case: User is using a mobile friend-finding application and says, "is
> > Mary Smith anywhere around here?" To answer this question the application
> > should combine information from geolocation (to understand "here"),
> > speech recognition, and potentially even speaker verification information
> > to ensure that Mary Smith has actually authorized the user to know where
> > she is. New modalities are continually becoming available, so it would be
> > difficult to provide for integration on a case by case basis.
>
> Wouldn't the application simply provide the "is <user> anywhere around here"
> grammar to the default recognizer as well as a list of values for "user"? I
> imagine in return it would get the top five users and their confidence
> values. Once the geolocation application has that information, then it
> would go off into its own magic that makes the user happy.

Yes, but the point of the use case is that there should be support for making
it easy to integrate information from speech and geolocation. I don't know
too much about geolocation, but I was assuming that the geolocation
application typically just provides the current location of the user, and
there would need to be an integration step that decides that the
interpretation of "here" should be assigned to the current location of the
user.
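To make that integration step concrete, here is a very rough sketch.
Everything in it is made up for illustration (the interfaces, the 0.8
verification threshold, and so on); only navigator.geolocation.getCurrentPosition
is an existing API. It just shows the glue between the recognizer's n-best
answers for the <user> slot, geolocation for "here", and a speaker
verification/authorization check.

```typescript
// Sketch only: "NBestEntry", "FriendFinderBackend", and the functions below
// are hypothetical; the real point is the glue between the recognizer's
// n-best output, geolocation ("here"), and an authorization check.
interface NBestEntry {
  user: string;        // a filler for the <user> slot in the grammar
  confidence: number;  // recognizer confidence, 0..1
}

interface FriendFinderBackend {
  verifySpeaker(audio: Blob): Promise<number>;                       // speaker-verification score, 0..1
  isAuthorized(requester: string, target: string): Promise<boolean>; // has target shared location with requester?
  lastKnownPosition(user: string): Promise<GeolocationCoordinates | null>;
}

// Great-circle distance between two coordinates, in kilometers (haversine).
function distanceKm(a: GeolocationCoordinates, b: GeolocationCoordinates): number {
  const rad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = rad(b.latitude - a.latitude);
  const dLon = rad(b.longitude - a.longitude);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.latitude)) * Math.cos(rad(b.latitude)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

async function findFriendNearHere(
  nbest: NBestEntry[],   // e.g. the top five users returned for the <user> slot
  audio: Blob,           // the captured utterance, for speaker verification
  requester: string,
  backend: FriendFinderBackend
): Promise<string> {
  // Resolve "here" to the requester's current position via geolocation.
  const here = await new Promise<GeolocationPosition>((resolve, reject) =>
    navigator.geolocation.getCurrentPosition(resolve, reject)
  );

  // Only proceed if speaker verification is reasonably sure who is asking.
  if ((await backend.verifySpeaker(audio)) < 0.8) {
    return "Sorry, I couldn't confirm who is asking.";
  }

  // Walk the n-best list in confidence order; take the first authorized,
  // locatable match.
  for (const candidate of [...nbest].sort((a, b) => b.confidence - a.confidence)) {
    if (!(await backend.isAuthorized(requester, candidate.user))) continue;
    const position = await backend.lastKnownPosition(candidate.user);
    if (position === null) continue;
    return `${candidate.user} is about ${distanceKm(here.coords, position).toFixed(1)} km from here.`;
  }
  return "No one by that name is sharing their location with you.";
}
```

The particular flow doesn't matter; the point is that the standard should
make this hand-off from recognition results to the rest of the application
about this easy, rather than requiring case-by-case plumbing for every new
modality.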
> Voice recognition and speech recognition are two radically different
> processes. You can do it with the same audio stream, but only right from
> the raw data.

Right, so there would typically be an integration process to integrate the
results of the speech recognition process and the speaker
verification/identification process. Again, this should be easy to do.

> And yes, voice recognition would be a really good authentication tool,
> although I'm not sure about holding the camera up to your face for retinal
> scans. A bit too Minority Report for me.

I'm not sure I want to be identified by a retinal scan, either, but there are
other biometrics that you could do from camera input, like face recognition,
that might be less unnerving.

> > D. Use cases that motivate a requirement to always make the use of speech
> > optional.
> > Use case 1: the user can't speak, can't speak the language well enough to
> > be recognized, the speech recognizer just doesn't work well for them, or
> > they like typing.
> > Use case 2: the user is in an environment where speech is inappropriate,
> > like a meeting, or they want to communicate something private, or it's
> > just noisy.
>
> Train, plane, airport, bus station, subway stop. Most of these can be
> solved by using a steno mask, although you might draw the attention of law
> enforcement and terrorist-phobic people if you're wearing something that
> covers your face and looks like a gas mask.

That's an interesting point about the steno mask. It might address some use
cases, but not where the person can't speak at all or can't get speech
recognition to work. I still think it's important to avoid doing anything in
the standard that would REQUIRE speech.

> > E. Use case that the standard should support completely hands-free
> > operation.
> > This would mean that there should be a way to speech-enable everything
> > that you would do with a mouse, a touchscreen, or by typing.
> > Use case: the user doesn't have the use of their hands, either
> > temporarily or permanently, or using their hands is difficult. For
> > example, someone is repairing a machine, their hands are holding tools
> > and are dirty, but they want to browse an HTML manual for the machine.
> > I realize there are a lot of difficulties in completely hands-free
> > operation, but I wanted to put it out for discussion. It would be good to
> > explore how close we can come.
>
> We can do a lot better than we have with hands-free operation. As I've said
> elsewhere, the user interface needs to be radically different from a GUI
> interface. It needs to do appropriate hinting when the user stalls and,
> most importantly, you don't want to do anything that even vaguely stresses
> a person's throat. I blew out my hands after 18 years of programming. I've
> managed to keep my voice intact over the last 15 years by not using the
> hands-free options in NaturallySpeaking. I've tried all of them, and they
> are all dangerous to the throat; I would have to be in dire straits to
> count on them. If I lose my voice, I am so royally screwed. I don't want
> SSDI and Section 8 housing to be part of my future. I know of two people in
> the Boston area who have had this happen to them. I've seen others
> potentially be in that state, but I've lost track of them.
>
> If you ever want to interview disabled speech recognition users, let me
> know. I can arrange for some through the Boston voice users group.

Thanks, I might take you up on that at some point.
Received on Monday, 13 September 2010 22:21:48 UTC