- From: Eric S. Johansson <esj@harvee.org>
- Date: Thu, 09 Sep 2010 17:31:49 -0400
- To: jim@larson-tech.com
- CC: public-xg-htmlspeech@w3.org
On 9/9/2010 3:59 PM, James Larson wrote:

> I believe we should consider some use cases describing how speech might be
> used within HTML-5 applications. Below are brief sketches of four HTML use
> cases that use speech technology. I believe that developers will want to
> develop applications using these and similar use cases.

I highly encourage spending time with, and watching in operation, sophisticated blind users and sophisticated upper-extremity-disabled users of speech recognition like myself. The world we live in is significantly different from the current IVR-driven models of speech recognition. Most suggestions from nondisabled users are frightening and would encourage me, and I suspect others, to walk away from computers and computer use if there's any way possible. Yeah, they're that dangerous.

The lack of knowledge about what really works for speech interfaces is one of the reasons why I constantly harp on "give me the tools to do it myself, and get out of the way." As an example, we had a bridge that coupled NaturallySpeaking to Emacs (VR-mode). It was flawed, and we no longer have a developer with hands; I can't fix it because my hands don't work right. The Free Software Foundation has declared VR-mode generally evil. They won't help us because they assert that the needs of free software come before the needs of the disabled. I've been locked out of Emacs because of their stance, and out of all the things I used to do in it. Please don't lock us out of the browser by interface choices. We know better than anybody else what interfaces work right.

> 4. Grammar-based collector

A.k.a. an IVR/small-vocabulary, fixed-grammar environment?

> 5. Dictation collector (press to speak)

Please, no press to speak. If anything, press to go mute. Press to speak would significantly increase my pain level. Here is a real-world example from my life: I cannot hold a mouse button down long enough to accurately select a region of text more than a few lines long. I cannot target accurately on a line and need to use the arrow keys to move my cursor three or four characters to the left or right. If I had to press to speak, my speech would be interrupted every time my hand spasmed.

It just occurred to me that NaturallySpeaking does have a toggle key for the microphone (keypad +), but I never use it because I don't have a keypad. I have fallen into the habit of using the mute button on my headset, which is far easier for me to use than the keypad because my hands are near the cable anyway. Without thinking, I think "mute button, on the microphone" and I am closing the switch, versus searching the keyboard for where the keypad + is and how to reach for it. For some reason, the button on the cord on my chest is easier on the hands, easier on the mind.

> a. Example, the user dictates the contents of an e-mail message: "This meeting
> is going overtime. I will be late getting home. See you later." The user
> presses a button while speaking and releases the button when finished.

Take a message
Meetings going overtime. Homely, see later kisses
Select homely
home late
select later kisses
later. Kisses and more
Send to mary
My wife

That's the dialogue I expect, complete with errors, corrections, and an almost "whoops" moment.

> b. Input to ASR: start-speaking event, stop speaking event, audio to be
> transcribed to text.
> c. Output from ASR: dictated text.

The output should probably also contain the audio, so that downstream corrections can be made and training improvements can be made upstream, as in the sketch below.
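To make that concrete, here is a rough TypeScript sketch of what I'm asking for. None of these names come from any existing or proposed API; they're mine, purely for illustration: a result that carries the audio along with the text, and a microphone that toggles instead of demanding press-and-hold.

// Hypothetical sketch only -- these names are not from any existing
// proposal or browser API. Two points: (1) a recognition result should
// carry the audio along with the text, and (2) the microphone should be
// a toggle ("press to go mute"), never press-and-hold.

interface DictationResult {
  text: string;   // the recognizer's transcription
  audio: Blob;    // the raw utterance, kept so downstream corrections
                  // can be made and training improved upstream
}

interface DictationSession {
  // One action flips the microphone state, so a hand spasm cannot cut
  // speech off mid-utterance the way press-and-hold can.
  toggleMicrophone(): void;
  readonly microphoneOpen: boolean;

  // A correction pairs the stored audio with the text the user meant,
  // so the recognizer's model of this particular speaker improves.
  correct(result: DictationResult, correctedText: string): void;

  onresult: (result: DictationResult) => void;
}

The exact shape doesn't matter; what matters is that the audio survives past the first transcription and that nothing requires a held button.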
You should also know that I don't believe in speaker-independent recognition. After all, why should we expect machines that are more flawed than us to have better recognition capability than we do? An example is the story of my wife meeting my grandfather for the first time. My grandfather was an old Swedish merchant marine sailor who came to America when he was not yet old and started a business. My father and I grew up in that business, but we also grew up hearing him mumble with a thick Swedish accent and no teeth. We understood him perfectly. My girlfriend (who became my wife) visited and met him for the first time one fine Saturday afternoon. My grandfather looked at her and asked her a question, and my wife got a stricken look on her face. She looked at me, desperate for a clue, so I translated. She answered, he said something else, and after about the fourth or fifth go-around I didn't wait for her to ask; I just translated. On that day I finally understood just how special my grandfather's speech was, and how it was almost a private language for the grandkids.

So, I expect we'll have other people with special speech. They will always need training. Therefore, I argue for audio in addition to text. I guess by that definition I have special speech too, because I'm always correcting. Such is the life of the speech-recognition dependent.