- From: Eric S. Johansson <esj@harvee.org>
- Date: Thu, 18 Nov 2010 22:04:48 -0500
- To: Robert Brown <Robert.Brown@microsoft.com>
- CC: "Young, Milan" <Milan.Young@nuance.com>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
On 11/18/2010 8:49 PM, Robert Brown wrote:
> I mostly agree. But do we need bidirectional events? I suspect all the
> interesting ones originate at the server: start-of-speech; hypothesis; partial
> result; warnings of noise, crosstalk, etc. I’m trying to think why the server
> would care about events from the client, other than when the client is done
> sending audio (which it could do in response to a click or end-point detection).

I think we do need bidirectional events, but more specifically, we need two unidirectional event streams that can terminate on different machines. In my mind, when I look at a speech-driven application, there are four major subsystems:

 o the recognizer,
 o the application,
 o the speech user interface application from the vendor, and
 o a speech user interface application from the end-user.

These subsystems exist whether the recognizer is local or remote, and the same interaction between the recognition engine and the vendor and end-user interface applications also exists independent of recognizer location. The user interface application can be located either on the server or on the client. Two significant chunks of the application can be encapsulated as external subsystems that reside either locally or remotely. If the applications are local, then you want results passed down for local action. I don't think there's symmetry upstream, because the results would just be handed to the server-side copy and not sent over the wire to the local copy.

We should look at what role the client will have at a minimum. I think it would be smart for the client to control a lot of the front-end signal processing and audio management, since that's a fair amount of detail to hand off upstream. Is there any quality-of-service data that should be gathered at the client on the front end?

I need to go back and check the archives to see whether we've talked about A-B speech recognition environments.
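To make the "two unidirectional streams" idea concrete, here is a minimal sketch. All the type and class names are hypothetical, not from any proposal: one channel carries recognizer-originated events downstream (start-of-speech, partial results, noise warnings), the other carries client-originated events upstream (end of audio), and each channel's sink can be registered on whichever machine hosts the interested subsystem.

```typescript
// Hypothetical event shapes, following the examples in the thread.
type ServerEvent =
  | { kind: "start-of-speech" }
  | { kind: "hypothesis"; text: string }
  | { kind: "partial-result"; text: string }
  | { kind: "warning"; reason: "noise" | "crosstalk" };

type ClientEvent =
  | { kind: "audio-start" }
  | { kind: "audio-end"; cause: "click" | "endpoint" };

// A channel carries events in one direction only; subscribers stand in
// for whatever subsystem (local or remote) wants those events.
class Channel<E> {
  private sinks: Array<(e: E) => void> = [];
  subscribe(sink: (e: E) => void): void { this.sinks.push(sink); }
  emit(e: E): void { for (const s of this.sinks) s(e); }
}

const downstream = new Channel<ServerEvent>(); // recognizer -> UI apps
const upstream = new Channel<ClientEvent>();   // client -> recognizer

const seen: string[] = [];
downstream.subscribe(e => seen.push(e.kind)); // e.g. a local UI application
upstream.subscribe(e => seen.push(e.kind));   // e.g. the remote recognizer

downstream.emit({ kind: "partial-result", text: "hello wor" });
upstream.emit({ kind: "audio-end", cause: "endpoint" });
```

The point of the sketch is that nothing ties the two channels' endpoints to the same pair of machines: the downstream sink could live on the client while the upstream sink lives on the server, or elsewhere entirely.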
A-B environments are where you dictate on one machine and all the results are delivered to a remote machine, including the results of a vendor or user application. You might even have a user application on the remote machine reacting to utterances. Think of dictating to a virtual machine from your host. If you have a remote recognition engine, you need to connect both machines to the recognition engine, so that one can receive the recognition results while the engine hears what you say through the other.
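A rough sketch of how an A-B session might hang together, with all names hypothetical: machine A attaches as the audio source, machine B attaches as the result sink, and both share one recognizer session so results flow to B while audio comes from A.

```typescript
type Role = "audio-source" | "result-sink";

// Stand-in for a remote recognizer session that both machines join.
class RecognizerSession {
  private sinks: Array<(result: string) => void> = [];
  attach(role: Role, onResult?: (r: string) => void): void {
    if (role === "result-sink" && onResult) this.sinks.push(onResult);
  }
  // Pretend recognition: uppercase the utterance as a stand-in for
  // decoding, then deliver the "result" to every attached sink.
  sendAudio(utterance: string): void {
    const result = utterance.toUpperCase();
    for (const s of this.sinks) s(result);
  }
}

const session = new RecognizerSession();
const receivedOnB: string[] = [];
session.attach("result-sink", r => receivedOnB.push(r)); // machine B
session.attach("audio-source");                          // machine A
session.sendAudio("open the editor");                    // dictated on A
```

The design point is only that the session, not either machine, is the rendezvous: the engine doesn't care that the machine sending audio and the machine consuming results are different hosts.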
Received on Friday, 19 November 2010 03:06:22 UTC