Re: Review of EMMA usage in the Speech API (first editor's draft) from Jerry Carter on 2012-06-20 (public-speech-api@w3.org from June 2012)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Wed, 20 Jun 2012 00:21:48 -0400
To: Satish S <satish@google.com>, Milan Young <Milan.Young@nuance.com>
Cc: public-speech-api@w3.org
Message-Id: <923B38C9-A3A5-496A-B75D-AE4CFF25960A@jerrycarter.org>
Been off-line for a few days, so trying to catch up a bit…

Where we agree

It sounds like there is general agreement, but that additional word-smithing may be required.  By my reading of this thread, no one has objected to language allowing the UA to add additional annotations or to present a result which combines annotations from various sources.  Likewise, no one has objected to a goal of presenting the full set of annotations generated by the resources used.  This is encouraging common ground.


Single resource model: There might be no problem

The point of contention concerns how the UA handles cases where multiple resources might be used to build a result.  This may not be a real problem.  I do not see any language in the current draft which describes how multiple resources would be specified or acquired, so I'm not at all surprised that matters are unsettled.  One option is to draw the architectural lines such that there is never more than one recognition service.  Here the serviceURI would be required to point to a single logical entity rather than a document describing a collection of resources.  The single entity might employ disparate resources under the covers, but this would be outside of the Speech API specification.

In this case, the proposed language may be fine:

> "The EMMA document MUST contain all annotations and content generated by the recognizer(s).  The UA MAY add additional annotations to provide a richer result for the developer."


and I can propose further language describing speech recognition services in more detail.


Dealing with Multiple Resources

Where the serviceURI may refer to multiple resources, and at the risk of overgeneralizing, there are two cases.  The first case is where the 'start()' method invokes one or more resources and the result is built from all of them.  This is the most common case that I've seen in live systems.  There appears to be consensus that the result presented to the application author should contain the detailed annotations produced by the resources involved.

So we're left with the second case, where the 'start()' method invokes multiple resources but only a subset are used to generate the result.  Here at least one of the resources must be optional for result production, otherwise a recognition error would be generated.  Perhaps that optional resource was temporarily unavailable, or perhaps resource constraints prevented it from being used, or perhaps it failed to return a result before a timeout occurred.  Whatever the reason, the optional resource did not contribute to the recognition.  I have no problem with the EMMA result excluding any mention of the disregarded optional resource.  I also see nothing wrong with the UA adding annotations to describe any selection criteria used to generate the EMMA result.

Let me offer a slight variation on my original language and on Milan's proposal to capture this case:

> "The EMMA document MUST contain all annotations and content generated by the recognition resources utilized for recognition.  The UA MAY add additional annotations to provide a richer result for the developer."


Again, further language describing speech recognition services will be helpful.
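
As a rough illustration of the wording above (a sketch only, not text from the draft): the Python below models each recognition resource's output as a plain dictionary of annotations, builds a combined result from the resources that actually returned, and leaves out an optional resource that did not contribute. The names `ResourceResult`, `build_result`, and the `ua:selection` key are hypothetical and chosen for illustration; how per-resource annotations would be merged into a single EMMA document is deliberately left open.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceResult:
    """Output of one recognition resource (hypothetical shape)."""
    resource: str                                    # illustrative resource name
    required: bool                                   # False for an optional resource
    annotations: dict = field(default_factory=dict)  # e.g. {"emma:confidence": "0.8"}
    returned: bool = True                            # False if unavailable or timed out


def build_result(results):
    """Combine annotations from the resources that actually contributed."""
    missing = [r.resource for r in results if r.required and not r.returned]
    if missing:
        # A required resource failing is a recognition error, not a partial result.
        raise RuntimeError(f"required resources failed: {missing}")

    used = [r for r in results if r.returned]
    return {
        # MUST: every annotation from every utilized resource is carried over.
        "resources": {r.resource: dict(r.annotations) for r in used},
        # MAY: a UA-added annotation (hypothetical key) describing the selection.
        "ua:selection": f"used {len(used)} of {len(results)} resources",
    }


# Usage: the optional remote resource timed out, so only the local one appears.
local = ResourceResult("local", True, {"emma:confidence": "0.8", "emma:tokens": "call bob"})
remote = ResourceResult("remote", False, returned=False)
document = build_result([local, remote])
```

Keeping annotations grouped per resource sidesteps the question of EMMA attributes that cannot be repeated with different values; a real UA would still have to decide how to represent those.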


-=- Jerry



On Jun 19, 2012, at 11:13 AM, Satish S wrote:

> Since there could be data coming from both recognizers and the UA might need to pick one of them and drop the rest (e.g. any EMMA attribute that can't be repeated with different values), we can't say the UA MUST send all content and all annotations. Using SHOULD allows the UA to implement this use case. 
> 
> Cheers
> Satish
> 
> 
> On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> Good to hear.  So what did you think about the proposed text?  I didn’t like the previous SHOULD-based suggestion.
> 
>  
> 
> Thanks
> 
>  
> 
>  
> 
> From: Satish S [mailto:satish@google.com] 
> Sent: Tuesday, June 19, 2012 4:27 AM
> To: Young, Milan
> Cc: Jerry Carter; Deborah Dahl; public-speech-api@w3.org
> 
> 
> Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)
> 
>  
> 
> Yes I believe that was the use case Jerry mentioned earlier as well. There could be data coming from both recognizers and the UA might need to pick one of them (e.g. any EMMA attribute that can't be repeated with different values).
> 
> 
> Cheers
> Satish
> 
> 
> On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:
> 
> Satish, are you thinking of a scenario in which the UA runs multiple recognizers in parallel and then selects among the results?  If so, I think that’s a reasonable use case, but I’d like to preserve Jerry’s MUST clause wrt annotations.  Could we agree on something like:
> 
>  
> 
> "The EMMA document MUST contain all annotations and content generated by the recognizer(s) that were used to produce the corresponding result.  The UA MAY add additional annotations to provide a richer result for the developer."
> 
>  
> 
> Thanks
> 
>  
> 
>  
> 
> From: Satish S [mailto:satish@google.com] 
> Sent: Friday, June 15, 2012 4:12 AM
> To: Jerry Carter
> Cc: Deborah Dahl; public-speech-api@w3.org
> Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)
> 
>  
> 
> "The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s).  The UA MAY add additional annotations to provide a richer result for the developer."
> 
>  
> 
> If we use MUST above it would disallow UAs from selecting content from one recognizer over the other. So I think SHOULD would be more relevant.
> 
>  
> 
> Cheers
> Satish
> 
> On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:
> 
>  
> 
> On Jun 15, 2012, at 5:06 AM, Satish S wrote:
> 
> Is this roughly what you had in mind?
> 
>  
> 
> I understood what Jerry wrote as
> 
> - There is a local recognizer, probably with a device specific grammar such as contacts and apps
> 
> - There is a remote recognizer that caters to a much wider scope
> 
> The UA would send audio to both and combine the results to deliver to Javascript
> 
>  
> 
> Jerry, could you clarify which use case you meant? The language I proposed was aimed towards a recognizer sitting outside the UA and generating EMMA data in which case it seemed appropriate that the UA would pass it through unmodified. If the UA is indeed generating EMMA data (whether combining from multiple recognizers or where the recognizer doesn't give EMMA data) it should be allowed to do so.
> 
>  
> 
> Your understanding is correct.  I've seen different architectures used.  
> 
>  
> 
> Most often, there is a single vendor providing a single recognizer (either on device or in the cloud).  Also common, a single vendor will provide a consolidated product in which on-device and in-cloud resources are used together.  Here the interface may generate a single result which the UA would be free to pass along as-is.  In either case, the vendor will argue for using their result directly.
> 
>  
> 
> Then, there are rarer but real cases (e.g. certain Samsung products), where multiple vendors are used in combination or as alternatives.  When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource.  When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers.  Either way, some entity is preparing the final EMMA result.  Coming from the media resource side, I think of that entity as the UA.  Someone from the UA side might very well think of that as being just a different recognition resource!
> 
>  
> 
> In my mind and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification.  There are, in fact, mechanisms within EMMA to describe intermediate results and their connections.  A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain.  A hybrid resource might offer separate processing chains and then a unified result.  The example that drove much of the discussion was a map application with a touch screen.  The user says "How do I drive from here to here" with corresponding touches.  The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
> 
>  
> 
> What is critical is that the application developer have access to the information required for their task.  Developers, in my experience, object when information is stripped.  Media resource vendors, working in response to developer requests, want assurances that additional details that they add will be passed through to the developer.  The current language
> 
>  
> 
> UA implementations for recognizers that supply EMMA must pass that EMMA structure directly.
> 
>  
> 
> is too restrictive.   Let me suggest instead
> 
>  
> 
> "The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s).  The UA MAY add additional annotations to provide a richer result for the developer."
> 
>  
> 
> I offered SHOULD or MUST.  I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason.  I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.
> 
>  
> 
> Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer’s EMMA?
> 
>  
> 
> The primary reason I added that clause in was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now and I agree it does enable use cases like what I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.
> 
>  
> 
> --
> 
> Cheers
> 
> Satish
> 
>  
> 
> Makes sense.
> 
>  
> 
> -=- Jerry
> 
>  
> 
>  
> 
>
Received on Wednesday, 20 June 2012 04:22:22 UTC