Re: Organized first draft of Use Case and Requirements Document

  On 10/7/2010 11:03 AM, Satish Sampath wrote:
>
>> U8. Voice Activity Detection
> While the UA may use voice activity detection either at start of
> speech or stop or both, I don't think it should be exposed as a part
> of the HTML speech API. In particular I'm concerned about letting
> random web pages initiate speech input automatically and get
> recognition results without the user's consent. This is the reason our
> draft proposal to the XG required explicit user action to start
> recognition (e.g. by clicking on a mic button).
>
> The main use case mentioned here is hands-free dialog, which could be
> performed by using the continuous recognition but having the user
> initiate the dialog with a gesture (click, touch or something
> similar). This assures the user that third parties cannot snoop on
> their speech/audio without their knowledge, while also avoiding
> unwanted notifications and permission popups (since users are known to
> dismiss them without paying much attention).

For people with working hands, that's probably not a bad solution. If you are 
running completely hands-free, for example in a mobile environment, an explicit 
acknowledgment can be an unnecessary distraction from other tasks (i.e. 
driving). If you're disabled, it's an unnecessary drain on whatever hand 
capacity you may have.

Real-life example: I use Skype, and something about the Windows sound system 
lets me feed both Skype and NaturallySpeaking at the same time. So I can be 
talking with someone on Skype, tell NaturallySpeaking to "wake up", dictate 
something (usually to the annoyance of the person on Skype), and then tell 
NaturallySpeaking to "go to sleep" again.

What I wish were different is that the audio would switch to NaturallySpeaking 
and away from Skype, so that my dictation would not be heard. The verbal 
commands to go to sleep and to wake up are not reliable. Sometimes the 
recognizer wakes up at the wrong time and starts dictating into whatever 
application is focused. The keystroke toggle doesn't always work either, for a 
variety of reasons, mostly because it's extremely clumsy and my keyboard 
doesn't have a keypad, so the default hotkey (keypad +) is not available.

I think the concern about capturing audio without the user's knowledge is an 
important issue, especially on mobile devices, where the NaturallySpeaking 
application for the iPhone ships audio upstream to a server and Nuance then 
harvests the data for its own use. I don't think the issue is resolvable except 
by making it a local policy decision whether or not microphones may be enabled 
remotely. People will still allow it for convenience, but at least you've given 
them the decision.
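
Just to illustrate what I mean by a local policy decision, here is a rough 
TypeScript-flavored sketch. The names (LocalSpeechPolicy, canStartCapture, 
allowRemoteActivation) are made up for this message and are not from the draft 
or any existing API:

    // Hypothetical sketch: the UA consults a user-configured local policy
    // before a page may start audio capture without an explicit gesture.
    interface LocalSpeechPolicy {
      // Set by the user in browser/OS settings, never by the web page.
      allowRemoteActivation: boolean;  // may a page start capture on its own?
      allowAudioUpload: boolean;       // may captured audio leave the device?
    }

    function canStartCapture(policy: LocalSpeechPolicy,
                             userGesture: boolean): boolean {
      // An explicit user gesture always suffices; otherwise fall back to the
      // policy the user set up once, ahead of time.
      return userGesture || policy.allowRemoteActivation;
    }

    // A hands-free user opts in once, locally, and pages then work without a
    // per-use click; everyone else keeps the gesture requirement.
    const handsFreePolicy: LocalSpeechPolicy = {
      allowRemoteActivation: true,
      allowAudioUpload: false,
    };
    console.log(canStartCapture(handsFreePolicy, false)); // true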


> I think these two requirements should be removed and left to the UA.
> In my mind speech input equates to keyboard and other forms of input,
> and each user prefers to use a device/platform/OS which suits them
> best. Some people use an English keyboard while others use one in their
> native tongue, some use IME software to manage both at the same time, some
> get ergonomic keyboards etc. In all these cases the applications and
> websites are happy to just work with the text/data they are given.
>
> In the speech input context the recognizer becomes part of the
> device/platform/OS and not something chosen by a web page. In real
> life terms having the UA manage speech resources means:
> - the user gets consistent results each time whatever website they visit
> - the recognizer can train itself to perform better over time since
> all the user's speech goes through it
> - the user can easily upgrade/purchase new software if they need to
> improve on that.

All very good points, but also consider this: it should be possible for the 
recognizer and the application to be remote from the machine with the 
microphone. The use case combinations would be local recognizer / local 
application (i.e. today), remote recognizer / local application (iPhone/Dragon 
today), local recognizer / remote application, and remote recognizer / remote 
application.
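
To make those combinations concrete, here is a TypeScript-flavored sketch. The 
type names and endpoint URIs are hypothetical; the only point is that the 
recognizer endpoint is configured by the user or UA rather than chosen by the 
page:

    // Hypothetical sketch of the four deployment combinations; none of these
    // names come from the draft, they are placeholders for this discussion.
    type Endpoint =
      | { kind: "local" }                 // on the machine with the microphone
      | { kind: "remote"; uri: string };  // reached over the network

    interface SpeechDeployment {
      recognizer: Endpoint;   // configured by the user or UA, not the page
      application: Endpoint;  // where the app consuming the text actually runs
    }

    // The four combinations mentioned above:
    const combinations: SpeechDeployment[] = [
      { recognizer: { kind: "local" },
        application: { kind: "local" } },                // today
      { recognizer: { kind: "remote", uri: "wss://asr.example.net" },
        application: { kind: "local" } },                // iPhone/Dragon today
      { recognizer: { kind: "local" },
        application: { kind: "remote", uri: "https://corp.example.com" } },
      { recognizer: { kind: "remote", uri: "wss://asr.example.net" },
        application: { kind: "remote", uri: "https://corp.example.com" } },
    ];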

A use example would be a speech recognition user working at a company who is 
not allowed to, or doesn't have a license to, install a speech recognition 
system on their work machine. They can't run a remote gateway because of audio 
security concerns. They bring in their own machine with recognition software 
but can't or don't want to install corporate applications on a personal 
machine; the daily solution would be to bridge to the remote machine and run 
the corporate applications off of their own recognition system.

Another variant of this is the system administrator using speech recognition 
to control a remote machine that may be in the next room or thousands of miles 
away. You don't want to ship the audio and run the recognizer on the remote 
machine; you want to have one recognizer, under your own control, that connects 
to many machines.

Yes, this is far afield from the "in the cloud" application, but it is the 
kind of functionality needed for a more general speech recognition environment. 
Not sure if it fits in this context, so feel free to toss my comments if 
necessary.

Received on Thursday, 7 October 2010 15:58:51 UTC