- From: Eric S. Johansson <esj@harvee.org>
- Date: Thu, 09 Sep 2010 17:31:49 -0400
- To: jim@larson-tech.com
- CC: public-xg-htmlspeech@w3.org
On 9/9/2010 3:59 PM, James Larson wrote:

> I believe we should consider some use cases describing how speech might be
> used within HTML-5 applications. Below are brief sketches of four HTML use
> cases that use speech technology. I believe that developers will want to
> develop applications using these and similar use cases.

I highly encourage spending time with, and watching in operation, sophisticated blind users and sophisticated upper-extremity-disabled users of speech recognition like myself. The world we live in is significantly different from the current IVR-driven models of speech recognition. Most suggestions from nondisabled users are frightening and would encourage me, and I suspect others, to walk away from computers and computer use if there's any way possible. Yeah, they're that dangerous.

The lack of knowledge about what really works for speech interfaces is one of the reasons why I constantly harp on "give me the tools to do it myself, and get out of the way." As an example, we had a bridge that coupled NaturallySpeaking to Emacs (VR-mode). It was flawed, and we no longer have a developer with hands; I can't fix it because my hands don't work right. The Free Software Foundation has declared VR-mode generally evil. They won't help us because they assert that the needs of free software come before the needs of the disabled. I've been locked out of Emacs because of their stance, and out of all the things I used to do in it. Please don't lock us out of the browser by interface choices. We know better than anybody else what interfaces work right.

> 4. Grammar-based collector

A.k.a. an IVR/small-vocabulary, fixed-grammar environment?

> 5. Dictation collector (press to speak)

Please, no press to speak. If anything, press to go mute. Press to speak would significantly increase my pain level. Here is a real-world example from my life: I cannot hold a mouse button down long enough to accurately select a region of text more than a few lines long. I cannot target accurately on a line and need to use the arrow keys to move my cursor three or four characters to the left or right. If I had to press to speak, my speech would be interrupted every time my hand spasmed.

It just occurred to me that NaturallySpeaking does have a toggle key for the microphone (keypad +), but I never use it because I don't have a keypad. I have fallen into the habit of using the mute button on my headset, which is far easier for me to use than the keypad because my hands are near the cable anyway. Without thinking, I think "mute button, on the microphone" and I am closing the switch, versus searching the keyboard for where the keypad + is and how to reach for it. For some reason, the button on the cord on my chest is easier on the hands, easier on the mind.

> a. Example, the user dictates the contents of an e-mail message: "This meeting
> is going overtime. I will be late getting home. See you later." The user
> presses a button while speaking and releases the button when finished.

Take a message
Meetings going overtime. Homely, see later kisses
Select homely
home late
select later kisses
later. Kisses and more
Send to mary
My wife

That's the dialogue I expect, complete with errors, corrections, and an almost "whoops" moment.

> b. Input to ASR: start-speaking event, stop speaking event, audio to be
> transcribed to text.
> c. Output from ASR: dictated text.

The output should probably also contain the audio, so that downstream corrections can be made and training improvements can be made upstream, as in the sketch below.
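To make that concrete, here is a rough TypeScript sketch of what I'm asking for. None of these names come from any existing or proposed API; they're mine, purely for illustration: a result that carries the audio along with the text, and a microphone that toggles instead of demanding press-and-hold.

// Hypothetical sketch only -- these names are not from any existing
// proposal or browser API. Two points: (1) a recognition result should
// carry the audio along with the text, and (2) the microphone should be
// a toggle ("press to go mute"), never press-and-hold.

interface DictationResult {
  text: string;   // the recognizer's transcription
  audio: Blob;    // the raw utterance, kept so downstream corrections
                  // can be made and training improved upstream
}

interface DictationSession {
  // One action flips the microphone state, so a hand spasm cannot cut
  // speech off mid-utterance the way press-and-hold can.
  toggleMicrophone(): void;
  readonly microphoneOpen: boolean;

  // A correction pairs the stored audio with the text the user meant,
  // so the recognizer's model of this particular speaker improves.
  correct(result: DictationResult, correctedText: string): void;

  onresult: (result: DictationResult) => void;
}

The exact shape doesn't matter; what matters is that the audio survives past the first transcription and that nothing requires a held button.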
You should also know that I don't believe in speaker-independent recognition. After all, why should we expect machines that are more flawed than us to have better recognition capability than we do? An example is the story of my wife meeting my grandfather for the first time. My grandfather was an old Swedish merchant marine sailor who came to America when he was not yet old and started a business. My father and I grew up in that business, but we also grew up hearing him mumble with a thick Swedish accent and no teeth. We understood him perfectly. My girlfriend (who became my wife) visited and met him for the first time one fine Saturday afternoon. My grandfather looked at her and asked her a question, and my wife got a stricken look on her face. She looked at me, desperate for a clue, so I translated. She answered, he said something else, and after about the fourth or fifth go-around I didn't wait for her to ask; I just translated. On that day I finally understood just how special my grandfather's speech was, and how it was almost a private language for the grandkids.

So, I expect we'll have other people with special speech. They will always need training. Therefore, I argue for audio in addition to text. I guess by that definition I have special speech too, because I'm always correcting. Such is the life of the speech-recognition dependent.