- From: Eric S. Johansson <esj@harvee.org>
- Date: Thu, 09 Sep 2010 12:54:28 -0400
- To: Satish Sampath <satish@google.com>
- CC: public-xg-htmlspeech@w3.org
On 9/9/2010 11:55 AM, Satish Sampath wrote:

>>> - Web app developers should not have to run their own speech
>>> recognition services.
>> Nor should they be excluded from running their own speech recognition
>> services, for reasons of privacy. I dictate confidential information. I don't
>> want anything concerning my dictations leaving my machine.
> I think these two points are about different things. The first is
> about the web app developer not being forced to run and specify a
> recognition server for each use of speech recognition in html, whereas
> the second seems to be about allowing the UA to interface with a
> speech recognizer present in the local machine. Is that correct?

Interesting. I interpreted your original point as saying that a Web application is not required to run a local recognition engine. That doesn't exclude the possibility of requiring a remote engine because of some tie-in between a speech recognition engine vendor and an application vendor, which is why I said we shouldn't exclude running a local engine. I would go further and say that the local engine should take preference over a remote engine.

But as you point out, there's another interpretation, which is that the user agent interface should be local/remote agnostic: a local recognition engine and a remote engine should have the same interface, and a user should be able to select which engine they use and switch engines on a per-application basis. (A rough sketch of what that might look like is below.)

>> For reasons of privacy, the user should not be forced to store anything
>> about their speech recognition environment on the cloud.
> I think this is satisfied if as mentioned above the UA can interface
> with a local recognizer so the speech data doesn't have to be sent
> over to a server.

Yes, but I wanted that stated explicitly in case there was strong pressure to go to remote-only recognizers.

>> I see no mention of retrieval of the contents of a text area for editing
>> purposes. Look at NaturallySpeaking's Select-and-Say functionality. It works
>> very nicely for small-grain text editing. I'm also experimenting with speech
>> user interfaces for non-English text dictation. The basic model is: select a
>> region by speech, run the selected region through a transformation, edit the
>> transformed text by speech, run the text through the reverse transform, and
>> replace the selected region with the new text.
> This seems more related to a global voice IME than a speech aware/enabled
> web application, i.e. using voice to dictate text and select+edit
> portions of it should be possible in any web page, rather than just in
> pages which have speech-enabled features. However I can see complex
> web apps such as email clients which use speech input for
> command-and-edit cases (such as "Change subject to 'Pictures from our
> recent trip'" or "In subject change 'recent trip' to 'hawaii trip'")
> and these could be implemented by the web app.

I can see some value in being able to select, and correspondingly copy, any region on a web page. Being able to verbally select a region and operate on it is critical in a text region of any size. It gets weird when you start trying to search for text that isn't visible from the current text area window.

I'll argue that the disabled, at the very least, really want verbal selection and editing by voice everywhere, independent of the application. We have a 15-year history showing that counting on the application developer to add speech recognition features doesn't work.
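Just to make the engine-agnostic idea concrete, here is a rough sketch. None of these interface names come from the existing Web Speech proposals or any real API; they are made up purely to show a local engine and a remote engine sitting behind the same contract, with the user agent choosing per origin:

```typescript
// Hypothetical sketch only: none of these names come from any shipping API.
// The point is that a local engine and a remote engine expose the same
// contract, and the user agent (not the page) decides which one a given
// application gets.

interface RecognitionResult {
  transcript: string;
  confidence: number;
}

interface SpeechRecognizer {
  recognize(audio: Blob): Promise<RecognitionResult>;
}

class LocalRecognizer implements SpeechRecognizer {
  async recognize(audio: Blob): Promise<RecognitionResult> {
    // Placeholder: hand the audio to an engine on the host machine.
    // Nothing leaves the machine, which is the privacy property I care about.
    return { transcript: "(decoded locally)", confidence: 0.9 };
  }
}

class RemoteRecognizer implements SpeechRecognizer {
  constructor(private serviceUrl: string) {}
  async recognize(audio: Blob): Promise<RecognitionResult> {
    // Same contract, but the audio is shipped off to a network service.
    const response = await fetch(this.serviceUrl, { method: "POST", body: audio });
    return (await response.json()) as RecognitionResult;
  }
}

// Per-application (per-origin) engine selection kept by the user agent,
// defaulting to the local engine when the user has expressed no preference.
function recognizerFor(
  origin: string,
  userPrefs: Map<string, SpeechRecognizer>
): SpeechRecognizer {
  return userPrefs.get(origin) ?? new LocalRecognizer();
}
```

The important property is that the page never learns, and never controls, whether its audio stayed on the machine; that choice belongs to the user.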
Dragon/NaturallySpeaking has some very nice examples of what happens when you create a fully enabled application. You have nicely integrated commands, dictation is fast and highly accurate, and you are always chasing taillights. Every time the application changes, your updates need to be changed, and as a result they've enabled something like 10 applications in 15 years. Now they put a lot of energy into keeping up with changes to those applications.

Independent application vendors, by contrast, have less than zero experience at building a good speech user interface that won't strain the throat. Sometimes I think they have a similar level of experience with graphical user interfaces, but that's a different discussion. They have no incentive to create a speech user interface. They have no incentive to add accessibility features. Therefore, why should they do it, let alone do it right? It's important to let us Crips create good interfaces around the application, independent of the application, so that if they don't do it right, we can replace it with a right interface.

An example of not really understanding a speech interface is your example. It is a classic IVR expression using what I call "spoken language". It is imperative and has no room for errors. Try instead:

  Edit subject
  select recent trip
  Hawaii trip

This interaction reduces the vocal stress, gives you time to think about what you want to do next (i.e. a mental context switch), and breaks the command into multiple components, each of which has a low impact in case of misrecognition. Another way to look at it is that each command lets you check that it executed correctly before moving on. (A rough sketch of how such a stepwise interaction might work follows after my signature.)

One thing I've become sensitive to in working with speech recognition is the difference between spoken speech and written speech. Both are great at creating sets of data, but for the most part only written speech can be edited easily. Spoken speech is classic IVR language; in my experience it is the style most nondisabled speech recognition users are familiar with. Written speech is what disabled people are most familiar with.

There is also a whole discussion on out-of-context editing with something like a dictation box, but I've got to get some work done today.

---- eric
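As promised above, a rough sketch of the stepwise "Edit subject / select recent trip / Hawaii trip" interaction. The interpreter and its state are entirely made up for illustration; the point is that each utterance is small, independently verifiable, and cheap to recover from when it is misrecognized:

```typescript
// Hypothetical sketch: a stepwise command interpreter of the kind described
// above. Names and structure are invented for illustration only.

type EditorState = {
  field: string | null;      // which field is being edited ("subject", "body", ...)
  selection: string | null;  // text currently selected within that field
  text: Record<string, string>;
};

function applyUtterance(state: EditorState, utterance: string): EditorState {
  const editMatch = utterance.match(/^edit (\w+)$/i);
  if (editMatch) {
    // "Edit subject": just moves focus; nothing destructive can happen yet.
    return { ...state, field: editMatch[1].toLowerCase(), selection: null };
  }

  const selectMatch = utterance.match(/^select (.+)$/i);
  if (selectMatch && state.field) {
    // "select recent trip": highlight the target so the user can verify it.
    const target = selectMatch[1];
    return state.text[state.field]?.includes(target)
      ? { ...state, selection: target }
      : state; // misrecognition: nothing selected, nothing lost
  }

  if (state.field && state.selection) {
    // A bare utterance replaces the selection the user has already confirmed.
    const updated = state.text[state.field].replace(state.selection, utterance);
    return { ...state, selection: null, text: { ...state.text, [state.field]: updated } };
  }

  return state; // anything else changes nothing
}

// "Edit subject" -> "select recent trip" -> "Hawaii trip"
let state: EditorState = {
  field: null,
  selection: null,
  text: { subject: "Pictures from our recent trip" },
};
state = applyUtterance(state, "Edit subject");
state = applyUtterance(state, "select recent trip");
state = applyUtterance(state, "Hawaii trip");
// state.text.subject is now "Pictures from our Hawaii trip"
```

The design property this illustrates is the one described above: nothing destructive happens until after two cheap, individually checkable steps.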
Received on Thursday, 9 September 2010 16:56:09 UTC