
RE: EMMA in Speech API (was RE: Speech API: first editor's draft posted)

From: Young, Milan <Milan.Young@nuance.com>
Date: Wed, 30 May 2012 15:08:53 +0000
To: Bjorn Bringert <bringert@google.com>, Deborah Dahl <dahl@conversational-technologies.com>, "Satish S (satish@google.com)" <satish@google.com>
CC: Satish S <satish@google.com>, Glen Shires <gshires@google.com>, "Hans Wennborg" <hwennborg@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <B236B24082A4094A85003E8FFB8DDC3C1A45D938@SOM-EXCH04.nuance.com>
Satish, please take a look at the use cases below.  Items #1 and #3 cannot be achieved unless EMMA is always present.

I'd like to add a fourth use case: the application needs to post the recognition result to a server before proceeding in the dialog.  The server might be a traditional application server, or it could be the controller in an MMI architecture.  EMMA is a standard serialized representation, so the result can be sent as-is.
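
A rough sketch of use case #4, assuming the API hands the application the EMMA document as a string.  The helper name buildEmmaRequest, the sample payload, and the endpoint URL in the comment are all hypothetical; application/emma+xml is the media type registered by the EMMA 1.0 specification.

```javascript
// Hypothetical helper: wraps an EMMA string in an options object for fetch().
// Nothing here is part of the draft Speech API.
function buildEmmaRequest(emmaString) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/emma+xml" },
    body: emmaString, // forwarded unchanged; the app never parses the XML
  };
}

// Illustrative EMMA payload only.
const sampleEmma =
  '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">' +
  '<emma:interpretation id="int1" emma:tokens="flights from boston to denver"/>' +
  "</emma:emma>";

const request = buildEmmaRequest(sampleEmma);
// In a page: fetch("https://example.com/dialog", request).then(...)
```

The point of the sketch is that the application only relays the string; whatever sits on the server side (application server or MMI controller) does the unpacking.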



-----Original Message-----
From: Bjorn Bringert [mailto:bringert@google.com] 
Sent: Tuesday, May 22, 2012 8:51 AM
To: Deborah Dahl
Cc: Satish S; Young, Milan; Glen Shires; Hans Wennborg; public-speech-api@w3.org
Subject: Re: EMMA in Speech API (was RE: Speech API: first editor's draft posted)

These sound like valid use cases, and as previously discussed, it's not hard to produce basic EMMA with just the recognition hypothesis.

So I suggest that the whole results object contains an EMMA string, and that each hypothesis contains an object corresponding to the SISR interpretation of that hypothesis.
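
One possible shape for such a result, as a sketch only.  The property names emma, alternatives, transcript, and interpretation are assumptions for illustration (interpretation follows the example quoted below), and the EMMA content is made up.

```javascript
// Hypothetical result shape: one EMMA string for the whole result,
// plus one SISR-style interpretation object per hypothesis.
const result = {
  emma:
    '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">' +
    '<emma:interpretation id="int1">' +
    "<origin>Boston</origin><destination>Denver</destination>" +
    "</emma:interpretation></emma:emma>",
  alternatives: [
    {
      transcript: "flights from boston to denver",
      interpretation: { origin: "Boston", destination: "Denver" },
    },
  ],
};

// Simple apps read the native object and ignore the EMMA string.
const { origin, destination } = result.alternatives[0].interpretation;

// Apps that log or forward results keep the EMMA string untouched.
const forLogging = result.emma;
```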

On Tue, May 22, 2012 at 12:09 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:
> Hi, a couple of comments.
>
>
>
> From: Satish S [mailto:satish@google.com]
> Sent: Monday, May 21, 2012 5:35 PM
> To: Deborah Dahl
> Cc: Bjorn Bringert; Young, Milan; Glen Shires; Hans Wennborg; public-speech-api@w3.org
>
> Subject: Re: EMMA in Speech API (was RE: Speech API: first editor's draft posted)
>
>
>
> I agree that having a uniform representation of results and semantic 
> interpretation is necessary. The only question I have is why XML 
> formatted according to EMMA is preferred over native JS objects. To 
> clarify, I'm suggesting that semantic information, if received as EMMA 
> from the recognizer, be converted by the UA to native JS objects so 
> accessing them is far simpler.
>
>
>
> With EMMA XML:
>
>   var doc = alternative.emmaXML;
>
>   var interpretation = doc.getElementsByTagName("emma:interpretation")[0];
>
>   var origin = interpretation.getElementsByTagName("origin")[0].childNodes[0].nodeValue;
>
>   var destination = interpretation.getElementsByTagName("destination")[0].childNodes[0].nodeValue;
>
>
>
> Instead, with native JS object:
>
>   var origin = alternative.interpretation.origin
>
>   var destination = alternative.interpretation.destination
>
>
>
> I prefer the latter as it does away with the boilerplate that every 
> single web app has to go through.
>
>
>
> I don’t disagree with making the JS object available as well as the 
> EMMA – both could be available.
>
> There are at least two use cases where the web app doesn’t have to do 
> anything directly with the EMMA – (1) passing the EMMA along to a 
> dialog manager, and (2) saving the EMMA result for later logging and 
> analysis. For those use cases the web app doesn’t have to unpack the EMMA.
>
>
>
> Yes, SISR is a standard for representing the semantic result, but it 
> doesn’t provide a way to represent any metadata.
>
>
>
> Could you explain what you mean by metadata in this context, with a 
> use case? It should be possible to fit that into the above proposal as well.
>
>
>
> Here are some examples.
>
> Use case 1: I’m testing different speech recognition services. I would 
> like to know which service processed the speech associated with a 
> particular result, so that I can compare the services for accuracy. I 
> can use the emma:process parameter for that.
>
> Use case 2: I want the system to dynamically slow down its TTS for 
> users who speak more slowly. The EMMA timestamps, duration, and token 
> parameters can be used to determine the speech rate for a particular utterance.
>
> Use case 3: I’m testing several different grammars to compare their 
> accuracy. I use the emma:grammar parameter to record which grammar was 
> used for each result.
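
As a sketch of use case 2: the speech rate can be estimated from three EMMA annotations on an interpretation.  The helper name wordsPerSecond and the sample values are hypothetical.  EMMA 1.0 gives emma:start and emma:end as absolute times in milliseconds and emma:tokens as a space-separated token string; how the values are pulled out of the XML is left out here.

```javascript
// Hypothetical helper. Arguments are the values of the emma:start,
// emma:end, and emma:tokens annotations of one interpretation.
function wordsPerSecond(startMs, endMs, tokens) {
  // Count whitespace-separated tokens; an empty string counts as zero.
  const words = tokens.trim().split(/\s+/).filter(Boolean).length;
  const seconds = (endMs - startMs) / 1000;
  if (seconds <= 0) return null; // missing or inconsistent timestamps
  return words / seconds;
}

// Five tokens spoken over two seconds.
const rate = wordsPerSecond(
  1087995961542,
  1087995963542,
  "flights from boston to denver"
);
```

An application could compare this rate against a threshold of its own choosing before adjusting the TTS rate.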
>
>
>
> Obviously you could write Javascript or server-side processing to 
> record all this information, but it would have to be done repeatedly 
> for every application, and it’s much more convenient to have it all 
> available in the EMMA result.
>
> I also think it would be a waste of time for this group to go through 
> the exercise of figuring out how to represent all the EMMA metadata 
> attributes in a native JS fashion. We would inevitably have to spend 
> time agreeing on which EMMA metadata attributes are important enough 
> to work on and I think it would just be less work to make the EMMA 
> result available for those applications that need it.
>
>
>
> Cheers
> Satish
>
> On Mon, May 21, 2012 at 6:36 PM, Deborah Dahl 
> <dahl@conversational-technologies.com> wrote:
>
> Many applications will have a dialog manager that uses the speech 
> recognition result to conduct a spoken dialog with the user. In that 
> case it is extremely useful for the dialog manager to have a uniform 
> representation for speech recognition results, so that the dialog 
> manager can be somewhat independent of the recognizer. In fact, there 
> are existing applications that I know of that do expect EMMA-formatted 
> results. It would be very inconvenient for these dialog managers to 
> have to be modified to accommodate different formats depending on the 
> recognition service. Similarly, another type of consumer of speech 
> recognition results is likely to be logging and analysis applications, which again could benefit from uniform EMMA results.
> I believe it’s also undesirable for the application developer to have 
> to look at the result and then manually create an EMMA wrapper for it.
>
> Yes, SISR is a standard for representing the semantic result, but it 
> doesn’t provide a way to represent any metadata. In addition, it won’t 
> help if the language model is an SLM rather than a grammar.
>
> Also, just a general comment about APIs and novice developers. I 
> think developers in general are very good at ignoring aspects of an 
> API that they don’t plan to use, as long as they have a simple way to 
> get started. I think developer problems mainly arise with APIs where 
> there’s a huge learning curve just to do hello world.
>
>
>
> From: Satish S [mailto:satish@google.com]
> Sent: Monday, May 21, 2012 12:17 PM
> To: Bjorn Bringert
> Cc: Young, Milan; Deborah Dahl; Glen Shires; Hans Wennborg; public-speech-api@w3.org
>
> Subject: Re: EMMA in Speech API (was RE: Speech API: first editor's draft posted)
>
>
>
> I would prefer having an easy solution for the majority of apps which 
> just want the interpretation, which is either just a string or a JS 
> object (when using SISR). Boilerplate code sucks. Having EMMA 
> available sounds ok too, but that seems like a minority feature to me.
>
>
>
> Seems like the current type "any" is suited for that. Since SISR 
> represents the results of semantic interpretation as ECMAScript that 
> is interoperable and non-proprietary, the goal of a cross-browser 
> semantic interpretation format seems satisfied. Are there other reasons to add EMMA support?
>
>



--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
Received on Wednesday, 30 May 2012 15:09:28 GMT
