Re: Review of EMMA usage in the Speech API (first editor's draft) from Jerry Carter on 2012-06-15 (public-speech-api@w3.org from June 2012)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Fri, 15 Jun 2012 07:06:31 -0400
To: Satish S <satish@google.com>
Cc: Deborah Dahl <dahl@conversational-technologies.com>, public-speech-api@w3.org
Message-Id: <E8D95758-7A7D-4599-AABE-266F717A1A1F@jerrycarter.org>
On Jun 15, 2012, at 5:06 AM, Satish S wrote:
> Is this roughly what you had in mind?
> 
> 
> I understood what Jerry wrote as
> - There is a local recognizer, probably with a device specific grammar such as contacts and apps
> - There is a remote recognizer that caters to a much wider scope
> The UA would send audio to both and combine the results to deliver to Javascript
> 
> Jerry, could you clarify which use case you meant? The language I proposed was aimed towards a recognizer sitting outside the UA and generating EMMA data in which case it seemed appropriate that the UA would pass it through unmodified. If the UA is indeed generating EMMA data (whether combining from multiple recognizers or where the recognizer doesn't give EMMA data) it should be allowed to do so.

Your understanding is correct.  I've seen different architectures used.  

Most often, there is a single vendor providing a single recognizer (either on device or in the cloud).  Also common is a single vendor providing a consolidated product in which on-device and in-cloud resources are used together.  Here the interface may generate a single result which the UA would be free to pass along as-is.  In either case, the vendor will argue for using their result directly.

Then, there are rarer but real cases (e.g. certain Samsung products), where multiple vendors are used in combination or as alternatives.  When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource.  When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers.  Either way, some entity is preparing the final EMMA result.  Coming from the media resource side, I think of that entity as the UA.  Someone from the UA side might very well think of that as being just a different recognition resource!

In my mind, and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification.  There are, in fact, mechanisms within EMMA to describe intermediate results and their connections.  A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain.  A hybrid resource might offer separate processing chains and then a unified result.  The example that drove much of the discussion was a map application with a touch screen.  The user says "How do I drive from here to here" with corresponding touches.  The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
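
To make the map example concrete, here is a rough sketch of what such a derivation chain could look like and how a developer might walk it.  The EMMA document is hand-written for illustration, not output from any real recognizer: the ids (route, speech, phonetic, touches), the application payload elements (mode, origin, destination, touch) and the function name derivation_chain are all assumptions of mine, and the phonetic lattice is left empty.  Only the emma: element and attribute names come from EMMA 1.0.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical EMMA result: the final interpretation is derived from a
# speech chain (phonetic -> tokens) and from a sequence of touches.
EMMA_DOC = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="route">
    <emma:derived-from resource="#speech" composite="true"/>
    <emma:derived-from resource="#touches" composite="true"/>
    <mode>drive</mode>
    <origin>location1</origin>
    <destination>location2</destination>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="speech" emma:medium="acoustic" emma:mode="voice"
        emma:tokens="how do I drive from here to here">
      <emma:derived-from resource="#phonetic" composite="false"/>
    </emma:interpretation>
    <emma:interpretation id="phonetic" emma:medium="acoustic" emma:mode="voice"/>
    <emma:interpretation id="touches" emma:medium="tactile" emma:mode="gui">
      <touch>touch1</touch>
      <touch>touch2</touch>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>
"""


def derivation_chain(root, start_id):
    """Return interpretation ids reachable from start_id, depth first."""
    interp_tag = f"{{{EMMA_NS}}}interpretation"
    derived_tag = f"{{{EMMA_NS}}}derived-from"
    by_id = {el.get("id"): el for el in root.iter(interp_tag)}
    chain, stack = [], [start_id]
    while stack:
        current = stack.pop()
        chain.append(current)
        # Each emma:derived-from points at an earlier stage via "#id".
        sources = [d.get("resource").lstrip("#")
                   for d in by_id[current].findall(derived_tag)]
        stack.extend(reversed(sources))
    return chain


root = ET.fromstring(EMMA_DOC)
print(derivation_chain(root, "route"))
```

The point is only that the intermediate stages remain addressable from the final result, so a developer who needs the tokens or the touch sequence can still reach them.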

What is critical is that the application developer have access to the information required for their task.  Developers, in my experience, object when information is stripped.  Media resource vendors, working in response to developer requests, want assurances that additional details that they add will be passed through to the developer.  The current language

> UA implementations for recognizers that supply EMMA must pass that EMMA structure directly.

is too restrictive.   Let me suggest instead

> "The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s).  The UA MAY add additional annotations to provide a richer result for the developer."

I offered SHOULD or MUST.  I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason.  I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.
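
As a sketch of how the suggested wording might play out in a UA, the following adds annotations to a recognizer's EMMA without removing or overwriting anything the recognizer supplied, and then checks that nothing was stripped.  Everything here is illustrative: the function names add_ua_annotations and preserves_all, the sample recognizer output, the example.com process URI and the payload elements (action, contact) are my own assumptions, not anything from the draft.

```python
import copy
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical recognizer output, for illustration only.
RECOGNIZER_EMMA = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="int1" emma:confidence="0.75"
      emma:process="http://example.com/recognizer" emma:tokens="call home">
    <action>call</action>
    <contact>home</contact>
  </emma:interpretation>
</emma:emma>
"""


def add_ua_annotations(recognizer_root, extra):
    """Copy the result and add emma:* attributes the recognizer did not set."""
    result = copy.deepcopy(recognizer_root)
    for interp in result.iter(f"{{{EMMA_NS}}}interpretation"):
        for name, value in extra.items():
            # setdefault never overwrites a recognizer-supplied annotation.
            interp.attrib.setdefault(f"{{{EMMA_NS}}}{name}", value)
    return result


def preserves_all(original, enriched):
    """True if every element, attribute and text of original survives."""
    if original.tag != enriched.tag:
        return False
    if (original.text or "").strip() != (enriched.text or "").strip():
        return False
    if any(enriched.get(k) != v for k, v in original.attrib.items()):
        return False
    kids_o, kids_e = list(original), list(enriched)
    return len(kids_o) <= len(kids_e) and all(
        preserves_all(o, e) for o, e in zip(kids_o, kids_e))


original = ET.fromstring(RECOGNIZER_EMMA)
enriched = add_ua_annotations(original, {"lang": "en-US", "confidence": "1.0"})
interp = enriched.find(f"{{{EMMA_NS}}}interpretation")
print(interp.get(f"{{{EMMA_NS}}}lang"))        # annotation added by the UA
print(interp.get(f"{{{EMMA_NS}}}confidence"))  # recognizer value is kept
print(preserves_all(original, enriched))
```

A UA that combines several recognizers would need more than this, of course, but the same check (nothing the recognizer produced goes missing) is what the MUST wording would ask for.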

> Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer's EMMA?
> 
> 
> The primary reason I added that clause was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now and I agree that dropping it enables use cases like the one I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.
> 
> --
> Cheers
> Satish

Makes sense.

-=- Jerry
Received on Friday, 15 June 2012 11:07:30 UTC