- From: Young, Milan <Milan.Young@nuance.com>
- Date: Fri, 19 Nov 2010 07:23:25 -0800
- To: "Eric S. Johansson" <esj@harvee.org>, "Robert Brown" <Robert.Brown@microsoft.com>
- Cc: <public-xg-htmlspeech@w3.org>
Hello Eric,

I must admit that web applications are not my expertise. I'm having a hard time understanding why the protocol needs to be expanded to handle these new unidirectional events.

If the event should be sent from the web app to the application server, couldn't this be done using AJAX or some other standard web technology (see the sketch at the end of this message)? If the event is to be sent between the SS and the application server, then shouldn't this be triggered with an implementation-specific parameter? It seems like a stretch to make this part of the specification.

Thanks

-----Original Message-----
From: Eric S. Johansson [mailto:esj@harvee.org]
Sent: Thursday, November 18, 2010 7:05 PM
To: Robert Brown
Cc: Young, Milan; public-xg-htmlspeech@w3.org
Subject: Re: Requirement for UA / SS protocol

On 11/18/2010 8:49 PM, Robert Brown wrote:
>
> I mostly agree. But do we need bidirectional events? I suspect all the
> interesting ones originate at the server: start-of-speech; hypothesis; partial
> result; warnings of noise, crosstalk, etc. I'm trying to think why the server
> would care about events from the client, other than when the client is done
> sending audio (which it could do in response to a click or end-point detection).
>

I think we do need bidirectional events, but more specifically, we need two unidirectional events that can end up on different machines.

In my mind, when I look at a speech-driven application, there are four major subsystems:

 o The recognizer,
 o The application,
 o The speech user interface application from the vendor, and
 o A speech user interface application from the end user.

These subsystems exist whether the recognizer is local or remote, and the same interaction between the recognition engine and the vendor and user interface applications also exists independent of recognizer location. It's possible to locate the user interface application either on the server or on the client. There can be two significant chunks of the application encapsulated as external subsystems that could reside either locally or remotely. If those applications are local, then you want results passed down for local action. I don't think there's symmetry upstream, because the results would just be handed to the server-side copy and not sent over the wire to the local copy.

We should look at what role the client will have at a minimum. I think it would be smart for the client to control a lot of the front-end signal processing and audio management, which is a fair amount of detail to hand off upstream. Is there any quality-of-service data that should be gathered at the client on the front end?

I need to go back and check the archives to see if we've talked about A-B speech recognition environments. A-B environments are where you dictate on one machine and all of the results are delivered to a remote machine, including the results of a vendor or user application. You might even have a user application on the remote machine reacting to utterances. Think of dictating to a virtual machine from your host. If you have a remote recognition engine, you need to connect both machines to the recognition engine so that one can receive recognition results while the other captures what you say.
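
As an illustration of the A-B split Eric describes, here is a minimal JavaScript sketch. The speech-service URL and every function name in it (openSpeechService, startMicrophoneCapture, deliverToUserApplication) are hypothetical, invented for this example rather than drawn from any proposal in this thread:

    // Hypothetical A-B setup: machine A captures audio, machine B receives
    // the recognition results. Both connect to the same remote speech service.
    // All helper names below are invented for illustration.

    // Machine A: stream captured audio upstream to the speech service.
    var sessionA = openSpeechService("wss://speech.example.com/session/1234");
    startMicrophoneCapture(function (audioChunk) {
      sessionA.sendAudio(audioChunk);            // upstream channel: audio only
    });

    // Machine B: join the same session, but only listen for results.
    var sessionB = openSpeechService("wss://speech.example.com/session/1234");
    sessionB.onResult = function (result) {
      deliverToUserApplication(result);          // downstream channel: results only
    };

The point of the sketch is simply that the audio-bearing connection and the result-bearing connection are two unidirectional flows that need not terminate on the same machine.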
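
And, returning to Milan's point above, a minimal sketch of the "AJAX or some other standard web technology" route for web-app-to-application-server events; the /speech-events endpoint and the event payload are made up for illustration and are not part of any specification discussed here:

    // Post a client-side speech UI event to the application server using plain
    // XMLHttpRequest, entirely outside the UA <-> SS protocol. The endpoint
    // name and payload fields are hypothetical.
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/speech-events", true);
    xhr.setRequestHeader("Content-Type", "application/json");
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
        // the application server has acknowledged the event
      }
    };
    xhr.send(JSON.stringify({
      type: "end-of-speech",            // example event name, not from any spec
      timestamp: new Date().getTime()
    }));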
Received on Friday, 19 November 2010 15:24:06 UTC