Re: Requirements for the speech input API (derived from our earlier proposal)

  On 9/9/2010 8:59 AM, Satish Sampath wrote:
> Here are some requirements we came up with as part of our earlier API proposal.
>
> - The API must notify the web app when a spoken utterance has been recognized.
>
> - The API must notify the web app on speech recognition errors.
>
> - The API should provide access to a list of speech recognition hypotheses.
>
> - The API should allow, but not require, specifying a grammar for the
> speech recognizer to use.
>
> - The API should allow specifying the natural language in which to
> perform speech recognition. This will override the language of the web
> page.
>
> - For privacy reasons, the API should not allow web apps access to raw
> audio data but only provide recognition results.
>
> - For privacy reasons, speech recognition should only be started in
> response to user action.
>
> - Web app developers should not have to run their own speech
> recognition services.
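
To make the shape of such an API concrete, here is a rough sketch of what a 
surface meeting those requirements might look like. Every name below is 
invented for illustration only; it is not part of any proposal.

    // Sketch only -- all names are hypothetical.
    interface SpeechHypothesis {
      transcript: string;   // recognized text
      confidence: number;   // 0..1
    }

    interface SpeechInputResult {
      hypotheses: SpeechHypothesis[];   // n-best list, best first
    }

    interface SpeechInputOptions {
      grammar?: string;    // optional grammar (e.g. an SRGS URI); never required
      language?: string;   // BCP 47 tag; overrides the language of the page
    }

    interface SpeechInput {
      // For privacy, start() would only work in response to a user action.
      start(options?: SpeechInputOptions): void;
      stop(): void;
      onresult: (result: SpeechInputResult) => void;   // utterance recognized
      onerror: (error: { code: string; message: string }) => void;
      // Note: no access to raw audio, only recognition results.
    }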

Nor should web app developers be excluded from running their own speech 
recognition services, for reasons of privacy. I dictate confidential 
information, and I don't want anything concerning my dictations leaving my 
machine.

If speech recognition is present, all keystroke shortcuts to application 
functions should be turned off, because misrecognition and accidental 
recognition events can cause unintended actions.

End users should not be prevented from creating new grammars or extending 
existing ones, on both a global and a per-application basis.

End-user extensions should be accessible either from the desktop or from the cloud.

For reasons of privacy, the user should not be forced to store anything about 
their speech recognition environment on the cloud.

Any public interfaces for creating extensions should be "speakable". A user 
should never need to touch the keyboard in order to expand a grammar, reference 
data, or add functionality.
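
As a rough illustration of what I mean by "speakable" (every name here is made 
up; nothing below exists):

    // Hypothetical sketch: extending a grammar is itself a spoken command,
    // so a user never needs the keyboard to add new phrases or functionality.
    type Action = () => void;

    class VoiceGrammar {
      private commands = new Map<string, Action>();

      add(phrase: string, action: Action): void {
        this.commands.set(phrase, action);
      }

      handle(utterance: string): boolean {
        const action = this.commands.get(utterance);
        if (action) { action(); return true; }
        return false;
      }
    }

    // The grammar ships with a built-in spoken command that opens a
    // voice-only dialog for adding further commands -- keyboard never required.
    const grammar = new VoiceGrammar();
    grammar.add("add new command", () => {
      // ...collect the new phrase and its action through voice prompts...
    });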

I've been trying to figure out the right way to express these last few 
concepts, but I'm sure they will come with time and conversation.

Currently, local speech recognition services (e.g. NaturallySpeaking) degrade 
in both performance and accuracy when they are coupled to a slow application. 
This is a well-known phenomenon, and Nuance doesn't seem interested in fixing 
it. Web applications are among the worst offenders for degrading recognition 
accuracy and speed. I don't know of any fixes right now, but this is something 
to keep an eye on.

The services described for web applications would be good for the desktop as 
well. Given that I'm a person who rarely uses web applications (see the 
performance/reliability problems above, especially Chrome crashing when 
receiving dictation events), it would be useful to many users like myself to 
have this kind of capability on the desktop. At the very least, there should be 
no boundary between desktop and web app speech recognition functionality.

I see no mention of retrieving the contents of a text area for editing 
purposes. Look at NaturallySpeaking's Select-and-Say functionality; it works 
very nicely for fine-grained text editing. I'm also experimenting with speech 
user interfaces for non-English text dictation. The basic model is: select a 
region by speech, run the selected region through a transformation, edit the 
transformed text by speech, run the text through the reverse transform, and 
replace the selected region with the new text.
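
A rough sketch of that loop (the functions here are placeholders that just 
name the steps, not any existing API):

    // Hypothetical sketch of the dictation-editing loop described above.
    interface Region { start: number; end: number; text: string; }

    declare function selectBySpeech(text: string): Promise<Region>;   // "select <words>"
    declare function editBySpeech(text: string): Promise<string>;     // Select-and-Say style edits
    declare function transform(text: string): string;                 // e.g. to a speakable form
    declare function reverseTransform(text: string): string;          // back to the target script

    async function editRegionBySpeech(buffer: string): Promise<string> {
      const region = await selectBySpeech(buffer);    // 1. select a region by speech
      const speakable = transform(region.text);       // 2. transform it into an editable form
      const edited = await editBySpeech(speakable);   // 3. edit the transformed text by speech
      const restored = reverseTransform(edited);      // 4. reverse the transformation
      // 5. replace the selected region with the new text
      return buffer.slice(0, region.start) + restored + buffer.slice(region.end);
    }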

For additional examples of what disabled speech recognition users have been 
working with for the past 10 years, check out Vocola, Dragonfly, Unimacro, and 
the base layer they build on, NatLink.

Received on Thursday, 9 September 2010 15:42:27 UTC