Re: Requirements for the speech input API (derived from our earlier proposal)

  On 9/9/2010 11:55 AM, Satish Sampath wrote:
>>> - Web app developers should not have to run their own speech
>>> recognition services.
>> Nor should they be excluded from running their own speech recognition
>> services, for reasons of privacy. I dictate confidential information. I don't
>> want anything concerning my dictations leaving my machine.
> I think these two points are about different things. The first is
> about the web app developer not being forced to run and specify a
> recognition server for each use of speech recognition in html, whereas
> the second seems to be about allowing the UA to interface with a
> speech recognizer present in the local machine. Is that correct?

Interesting. I interpreted your original point as saying that a Web application 
is not required to run a local recognition engine. That doesn't exclude the 
possibility of a remote engine being required because of some tie-in between a 
speech recognition engine vendor and an application vendor. That's why I said we 
shouldn't exclude running a local engine. I would go further and say that a 
local engine should take precedence over a remote engine.

But as you point out, there's another interpretation, which is that the user 
agent interface should be local/remote agnostic: a local recognition engine and 
a remote engine should present the same interface, and a user should be able to 
select which engine they use and switch engines on a per-application basis.
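To make that concrete, here is a rough sketch in TypeScript of what an 
engine-agnostic interface could look like. All of the names below are purely 
illustrative, not taken from any proposed specification; the point is only that 
local and remote engines implement the same interface, so the choice between 
them belongs to the user or the user agent, not the web app.

interface RecognitionResult {
  transcript: string;
  confidence: number; // 0.0 to 1.0
}

interface SpeechRecognizer {
  recognize(audio: ArrayBuffer): Promise<RecognitionResult>;
}

// Audio never leaves the user's machine.
class LocalRecognizer implements SpeechRecognizer {
  async recognize(audio: ArrayBuffer): Promise<RecognitionResult> {
    // ...hand the audio to an on-device engine here...
    return { transcript: "(local result)", confidence: 0.9 };
  }
}

// Audio is sent to a service chosen by the user or the user agent.
class RemoteRecognizer implements SpeechRecognizer {
  constructor(private serviceUrl: string) {}
  async recognize(audio: ArrayBuffer): Promise<RecognitionResult> {
    const resp = await fetch(this.serviceUrl, { method: "POST", body: audio });
    return (await resp.json()) as RecognitionResult;
  }
}

// The web application never needs to know which kind it got.
function pickRecognizer(preferLocal: boolean): SpeechRecognizer {
  return preferLocal
    ? new LocalRecognizer()
    : new RemoteRecognizer("https://example.org/recognize"); // placeholder URL
}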

>> For reasons of privacy, the user should not be forced to store anything
>> about their speech recognition environment on the cloud.
> I think this is satisfied if as mentioned above the UA can interface
> with a local recognizer so the speech data doesn't have to be sent
> over to a server.

Yes, but I wanted that stated explicitly in case there is strong pressure to 
move to remote-only recognizers.

>> I see no mention of retrieval of the contents of a text area for editing
>> purposes. Look at NaturallySpeaking's Select-and-Say functionality. It works
>> very nicely for fine-grained text editing. I'm also experimenting with speech
>> user interfaces for non-English text dictation. The basic model is: select a
>> region by speech, run the selected region through a transformation, edit the
>> transformed text by speech, run the text through the reverse transform, and
>> replace the selected region with the new text.
> This seems more related to a global voice IME than to a speech-aware/enabled
> web application, i.e. using voice to dictate text and select+edit
> portions of it should be possible in any web page, rather than just in
> pages which have speech-enabled features. However I can see complex
> web apps such as email clients which use speech input for
> command-and-edit cases (such as "Change subject to 'Pictures from our
> recent trip'" or "In subject change 'recent trip' to 'hawaii trip'")
> and these could be implemented by the web app.

I can see some value in being able to select, and correspondingly copy, any 
region on a web page. Being able to verbally select a region and operate on it 
is critical in a text region of any size. It gets weird when you start trying to 
search for text that isn't visible in the current text area window.
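For the transformation model quoted above, here is a rough sketch in TypeScript 
of the select/transform/edit/reverse-transform/replace cycle. The names are 
hypothetical; the transform stands in for, e.g., a transliteration layer used 
for non-English dictation.

interface TextTransform {
  forward(text: string): string; // e.g. native script -> editable form
  reverse(text: string): string; // editable form -> native script
}

// Replace the verbally selected region of the text area's contents with the
// result of editing its transformed form by speech.
function editSelectedRegion(
  content: string,
  selStart: number,
  selEnd: number,
  transform: TextTransform,
  editBySpeech: (editable: string) => string // the spoken editing step
): string {
  const selected = content.slice(selStart, selEnd);   // region selected by speech
  const editable = transform.forward(selected);       // run it through the transformation
  const edited = editBySpeech(editable);              // edit the transformed text by speech
  const replacement = transform.reverse(edited);      // run it through the reverse transform
  return content.slice(0, selStart) + replacement + content.slice(selEnd);
}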

I'll argue that the disabled, at the very least, really want verbal selection 
and editing by voice everywhere, independent of the application. We have a 
15-year history showing that counting on the application developer to add speech 
recognition features doesn't work. Dragon/NaturallySpeaking has some very nice 
examples of what happens when you create a fully enabled application: you get 
nicely integrated commands, dictation is fast and highly accurate, and you are 
always chasing taillights. Every time the application changes, your speech 
enablement needs to change as well, and as a result they've enabled something 
like 10 applications in 15 years. Now they put a lot of energy into keeping up 
with changes to those applications.

As for independent application vendors, they have less than zero experience 
building a good speech user interface that won't strain the throat. Sometimes I 
think they have a similar level of experience with graphical user interfaces, 
but that's a different discussion. They have no incentive to create a speech 
user interface. They have no incentive to add accessibility features. So why 
should they do it, let alone do it right? It's important to let us Crips create 
good interfaces around the application, independent of the application, so that 
if the vendor doesn't do it right, we can replace it with one that is done right.

Your example is a case of not really understanding a speech interface. It is a 
classic IVR expression, using what I call "spoken language": imperative, with no 
room for errors. Try instead:

Edit subject
select recent trip
Hawaii trip

This interaction reduces vocal stress, gives you time to think about what you 
want to do next (i.e., a mental context switch), and breaks the command into 
multiple components, each of which has a low impact in case of misrecognition. 
Another way to look at it is that each command lets you check that it executed 
correctly before moving on.
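As a rough illustration of that stepwise style (again in TypeScript, with all 
names hypothetical), each utterance can be treated as one small, checkable step, 
so a misrecognition only costs that step:

type EditState =
  | { mode: "idle" }
  | { mode: "editingField"; field: string }
  | { mode: "awaitingReplacement"; field: string; target: string };

function handleUtterance(state: EditState, utterance: string): EditState {
  if (state.mode === "idle" && utterance.toLowerCase().startsWith("edit ")) {
    // "Edit subject" -> focus the field; the user can verify before continuing.
    return { mode: "editingField", field: utterance.slice(5).trim() };
  }
  if (state.mode === "editingField" && utterance.toLowerCase().startsWith("select ")) {
    // "select recent trip" -> highlight the target text; again verifiable.
    return { mode: "awaitingReplacement", field: state.field, target: utterance.slice(7).trim() };
  }
  if (state.mode === "awaitingReplacement") {
    // "Hawaii trip" -> replace the highlighted text (replacement logic omitted).
    console.log(`In ${state.field}, replace "${state.target}" with "${utterance}"`);
    return { mode: "idle" };
  }
  return state; // unrecognized utterance: ignore, at low cost
}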

One thing I've become sensitive to in working with speech recognition is the 
difference between spoken speech and written speech. Both are great at creating 
sets of data, but for the most part, only written speech can be edited easily. 
Spoken speech is classic IVR language; in my experience, it is what most 
nondisabled speech recognition users are familiar with. Written speech is what 
disabled people are most familiar with.

There is also a whole discussion to be had about out-of-context editing with 
something like a dictation box, but I've got to get some work done today.

---- eric

Received on Thursday, 9 September 2010 16:56:09 UTC