- From: Eric S. Johansson <esj@harvee.org>
- Date: Thu, 18 Nov 2010 22:04:48 -0500
- To: Robert Brown <Robert.Brown@microsoft.com>
- CC: "Young, Milan" <Milan.Young@nuance.com>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
On 11/18/2010 8:49 PM, Robert Brown wrote:
> I mostly agree. But do we need bidirectional events? I suspect all the
> interesting ones originate at the server: start-of-speech; hypothesis; partial
> result; warnings of noise, crosstalk, etc. I’m trying to think why the server
> would care about events from the client, other than when the client is done
> sending audio (which it could do in response to a click or end-point detection).

I think we do need bidirectional events, but more specifically, we need two unidirectional event streams that can terminate on different machines. In my mind, when I look at a speech-driven application, there are four major subsystems:

 o the recognizer,
 o the application,
 o the speech user interface application from the vendor, and
 o a speech user interface application from the end-user.

These subsystems exist whether the recognizer is local or remote, and the same interaction between the recognition engine and the vendor and end-user interface applications also exists independent of recognizer location. The user interface application can be located either on the server or on the client. Two significant chunks of the application can be encapsulated as external subsystems that reside either locally or remotely. If the applications are local, then you want results passed down for local action. I don't think there's symmetry upstream, because the results would just be handed to the server-side copy and not sent over the wire to the local copy.

We should look at what role the client will have at a minimum. I think it would be smart for the client to control a lot of the front-end signal processing and audio management, since that's a fair amount of detail to hand off upstream. Is there any quality-of-service data that should be gathered at the client on the front end?

I need to go back and check the archives to see whether we've talked about A-B speech recognition environments.
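To make the "two unidirectional streams" idea concrete, here is a minimal sketch. All the type and class names are hypothetical, not from any proposal: one channel carries recognizer-originated events downstream (start-of-speech, partial results, noise warnings), the other carries client-originated events upstream (end of audio), and each channel's sink can be registered on whichever machine hosts the interested subsystem.

```typescript
// Hypothetical event shapes, following the examples in the thread.
type ServerEvent =
  | { kind: "start-of-speech" }
  | { kind: "hypothesis"; text: string }
  | { kind: "partial-result"; text: string }
  | { kind: "warning"; reason: "noise" | "crosstalk" };

type ClientEvent =
  | { kind: "audio-start" }
  | { kind: "audio-end"; cause: "click" | "endpoint" };

// A channel carries events in one direction only; subscribers stand in
// for whatever subsystem (local or remote) wants those events.
class Channel<E> {
  private sinks: Array<(e: E) => void> = [];
  subscribe(sink: (e: E) => void): void { this.sinks.push(sink); }
  emit(e: E): void { for (const s of this.sinks) s(e); }
}

const downstream = new Channel<ServerEvent>(); // recognizer -> UI apps
const upstream = new Channel<ClientEvent>();   // client -> recognizer

const seen: string[] = [];
downstream.subscribe(e => seen.push(e.kind)); // e.g. a local UI application
upstream.subscribe(e => seen.push(e.kind));   // e.g. the remote recognizer

downstream.emit({ kind: "partial-result", text: "hello wor" });
upstream.emit({ kind: "audio-end", cause: "endpoint" });
```

The point of the sketch is that nothing ties the two channels' endpoints to the same pair of machines: the downstream sink could live on the client while the upstream sink lives on the server, or elsewhere entirely.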
A-B environments are where you dictate on one machine and all the results are delivered to a remote machine, including the results of a vendor or user application. You might even have a user application on the remote machine reacting to utterances. Think of dictating to a virtual machine from your host. If you have a remote recognition engine, you need to connect both machines to the recognition engine, so that one can receive the recognition results while the engine hears what you say through the other.
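A rough sketch of how an A-B session might hang together, with all names hypothetical: machine A attaches as the audio source, machine B attaches as the result sink, and both share one recognizer session so results flow to B while audio comes from A.

```typescript
type Role = "audio-source" | "result-sink";

// Stand-in for a remote recognizer session that both machines join.
class RecognizerSession {
  private sinks: Array<(result: string) => void> = [];
  attach(role: Role, onResult?: (r: string) => void): void {
    if (role === "result-sink" && onResult) this.sinks.push(onResult);
  }
  // Pretend recognition: uppercase the utterance as a stand-in for
  // decoding, then deliver the "result" to every attached sink.
  sendAudio(utterance: string): void {
    const result = utterance.toUpperCase();
    for (const s of this.sinks) s(result);
  }
}

const session = new RecognizerSession();
const receivedOnB: string[] = [];
session.attach("result-sink", r => receivedOnB.push(r)); // machine B
session.attach("audio-source");                          // machine A
session.sendAudio("open the editor");                    // dictated on A
```

The design point is only that the session, not either machine, is the rendezvous: the engine doesn't care that the machine sending audio and the machine consuming results are different hosts.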
Received on Friday, 19 November 2010 03:06:22 UTC