- From: Eric S. Johansson <esj@harvee.org>
- Date: Thu, 07 Oct 2010 11:57:43 -0400
- To: Satish Sampath <satish@google.com>
- CC: Michael Bodell <mbodell@microsoft.com>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
On 10/7/2010 11:03 AM, Satish Sampath wrote:
>
>> U8. Voice Activity Detection
>
> While the UA may use voice activity detection either at start of
> speech or stop or both, I don't think it should be exposed as a part
> of the HTML speech API. In particular I'm concerned about letting
> random web pages initiate speech input automatically and get
> recognition results without the user consent. This is the reason our
> draft proposal to the XG required explicit user action to start
> recognition (for e.g. by clicking on a mic button).
>
> The main use case mentioned here is hands-free dialog, which could be
> performed by using the continuous recognition but having the user
> initiate the dialog with a gesture (click, touch or something
> similar). This assures the user that third parties cannot snoop on
> their speech/audio without their knowledge, while also avoiding
> unwanted notifications and permission popups (since users are known to
> dismiss them without paying much attention).

For people with working hands, that's probably not a bad solution. If you are running completely hands-free, for example in a mobile environment, an explicit acknowledgment can be an unnecessary distraction from the other tasks (i.e. driving). If you're disabled, it's an unnecessary consumption of whatever hand capacity one may have.

Real-life example: I use Skype, and something about the Windows sound system lets me feed both Skype and NaturallySpeaking at the same time. So I can be talking with someone on Skype, tell NaturallySpeaking to "wake up", dictate something (usually to the annoyance of the person on Skype), and then tell NaturallySpeaking to "go to sleep" again. What I wish were different is that the audio would switch to NaturallySpeaking and away from Skype so that my dictation would not be heard.

The verbal commands to go to sleep and to wake up are not reliable. Sometimes it wakes up at the wrong time and starts chatting away at whatever application is active. The keystroke on/off toggle doesn't always work either, for a variety of reasons, mostly because it's extremely clumsy and my keyboard doesn't have a keypad, so the default hotkey (keypad +) is not available.

I think the concern about capturing audio without the user's knowledge is an important issue, especially on mobile devices, where the NaturallySpeaking application for the iPhone ships audio upstream to a server and Nuance then harvests the data for its own use. I don't think the issue is resolvable except by making it a local policy decision whether or not to enable microphones remotely. People will still do it for convenience, but at least you have given them the decision.
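To make the trade-off concrete, here is a rough sketch of what gesture-initiated, hands-free dialog could look like from a page's point of view. The SpeechSession interface and createSpeechSession factory below are purely illustrative assumptions, not names from any draft: the point is that one explicit click grants consent, after which the dialog continues hands-free until a verbal command ends it.

```typescript
// Illustrative only: SpeechSession and createSpeechSession are assumed names,
// not from any W3C draft. One click = consent; the rest of the dialog is hands-free.

interface SpeechSession {
  continuous: boolean;                      // keep listening across utterances
  onresult: (transcript: string) => void;   // called once per recognized utterance
  start(): void;
  stop(): void;
}

// Assumed to be provided by the UA, which owns the microphone and the recognizer.
declare function createSpeechSession(): SpeechSession;

const micButton = document.getElementById("mic") as HTMLButtonElement;

micButton.addEventListener("click", () => {
  const session = createSpeechSession();
  session.continuous = true;                // hands-free dialog after the initial gesture
  session.onresult = (transcript) => {
    // A verbal end-of-dialog command, so no further keystrokes or clicks are needed.
    if (transcript.trim().toLowerCase() === "go to sleep") {
      session.stop();
      return;
    }
    handleUtterance(transcript);
  };
  session.start();
});

function handleUtterance(transcript: string): void {
  console.log("recognized:", transcript);
}
```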
> I think these two requirements should be removed and left to the UA.
> In my mind speech input equates to keyboard and other forms of input,
> and each user prefers to use a device/platform/OS which suits them
> best. Some people use an english keyboard while others in their native
> tongue, some use IME software to manage both at the same time, some
> get ergonomic keyboards etc. In all these cases the applications and
> websites are happy to just work with the text/data they are given.
>
> In the speech input context the recognizer becomes part of the
> device/platform/OS and not something chosen by a web page. In real
> life terms having the UA manage speech resources means:
> - the user gets consistent results each time whatever website they visit
> - the recognizer can train itself to perform better over time since
> all the user's speech goes through it
> - the user can easily upgrade/purchase new software if they need to
> improve on that.

All very good points, but also consider this: it should be possible for the recognizer and the application to be remote from the machine with the microphone. The use case combinations would be local recognizer - local application (i.e. today), remote recognizer - local application (iPhone/Dragon today), local recognizer - remote application, and remote recognizer - remote application.

An example would be a speech recognition user working at a company who is not allowed to, or doesn't have a license to, install a speech recognition system on their work machine. They can't run the remote gateway because of audio security concerns. They bring in their own machine with recognition software but can't, or don't want to, install corporate applications on a personal machine. The day-to-day solution would be to bridge to the remote machine and run the corporate applications off of their own recognition system.

Another variant of this is a system administrator using speech recognition to control a remote machine that may be in the next room or thousands of miles away. You don't want to ship the audio to, and run a recognizer on, the remote machine; you want to have one recognizer that you control that connects to many machines.

Yes, this is far afield from the "in the cloud" application, but it is the kind of functionality needed for a more general speech recognition environment. Not sure if it fits in this context, so feel free to toss my comments if necessary.
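As a rough illustration of those four combinations (all names and shapes here are my own and hypothetical, not from any proposal), the key point is that where the recognizer runs and where the application runs are independent choices:

```typescript
// Hypothetical sketch: recognizer placement and application placement are
// independent axes, giving the four combinations described above.

type Placement = "local" | "remote";

interface SpeechTopology {
  recognizer: Placement;      // where audio becomes text
  application: Placement;     // where the recognized text is consumed
  recognizerUrl?: string;     // set only when the recognizer is remote (example URL, assumed)
  applicationUrl?: string;    // set only when the application is remote (example URL, assumed)
}

const combinations: SpeechTopology[] = [
  // local recognizer, local application: today's desktop setup
  { recognizer: "local", application: "local" },
  // remote recognizer, local application: the iPhone/Dragon model
  { recognizer: "remote", application: "local", recognizerUrl: "wss://asr.example.com" },
  // local recognizer, remote application: a personal recognizer driving corporate apps
  { recognizer: "local", application: "remote", applicationUrl: "https://apps.example.com" },
  // remote recognizer, remote application
  { recognizer: "remote", application: "remote",
    recognizerUrl: "wss://asr.example.com", applicationUrl: "https://apps.example.com" },
];

console.log(combinations.length, "possible placements of recognizer and application");
```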
Received on Thursday, 7 October 2010 15:58:51 UTC