RE: Review of EMMA usage in the Speech API (first editor's draft)

Sorry about that.  I read your previous posting too quickly and missed your point.

Which entity drives the decision to use more than one recognizer?  Was this specified by the developer, the UA, the OS, or the recognition engine?



From: Satish S [mailto:satish@google.com]
Sent: Tuesday, June 19, 2012 8:13 AM
To: Young, Milan
Cc: Jerry Carter; Deborah Dahl; public-speech-api@w3.org
Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

Since there could be data coming from both recognizers and the UA might need to pick one of them and drop the rest (e.g. any EMMA attribute that can't be repeated with different values), we can't say the UA MUST send all content and all annotations. Using SHOULD allows the UA to implement this use case.
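
For example (purely illustrative; the process URIs and values below are made up): if a local and a network recognizer each return their own document, a single interpretation can only carry one emma:process value, so a UA that wants to return one interpretation ends up keeping one result and dropping the other:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- result from the on-device recognizer -->
  <emma:interpretation id="local1"
      emma:process="http://example.com/local-asr"
      emma:confidence="0.6" emma:tokens="call bob">
    <contact>bob</contact>
  </emma:interpretation>
</emma:emma>

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- result from the network recognizer; its emma:process differs,
       so the two cannot simply be merged into one interpretation -->
  <emma:interpretation id="remote1"
      emma:process="http://example.com/cloud-asr"
      emma:confidence="0.8" emma:tokens="call bob mobile">
    <contact>bob</contact>
  </emma:interpretation>
</emma:emma>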

Cheers
Satish

On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com> wrote:
Good to hear.  So what did you think about the proposed text?  I didn't like the previous SHOULD-based suggestion.

Thanks


From: Satish S [mailto:satish@google.com]
Sent: Tuesday, June 19, 2012 4:27 AM
To: Young, Milan
Cc: Jerry Carter; Deborah Dahl; public-speech-api@w3.org

Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

Yes I believe that was the use case Jerry mentioned earlier as well. There could be data coming from both recognizers and the UA might need to pick one of them (e.g. any EMMA attribute that can't be repeated with different values).

Cheers
Satish
On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:
Satish, are you thinking of a scenario in which the UA runs multiple recognizers in parallel and then selects among the results?  If so, I think that's a reasonable use case, but I'd like to preserve Jerry's MUST clause wrt annotations.  Could we agree on something like:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s) that were used to produce the corresponding result.  The UA MAY add additional annotations to provide a richer result for the developer."

Thanks


From: Satish S [mailto:satish@google.com]
Sent: Friday, June 15, 2012 4:12 AM
To: Jerry Carter
Cc: Deborah Dahl; public-speech-api@w3.org
Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s).  The UA MAY add additional annotations to provide a richer result for the developer."

If we use MUST above, it would prevent UAs from selecting content from one recognizer over the other. So I think SHOULD would be more appropriate.

Cheers
Satish
On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:

On Jun 15, 2012, at 5:06 AM, Satish S wrote:
Is this roughly what you had in mind?

I understood what Jerry wrote as
- There is a local recognizer, probably with a device specific grammar such as contacts and apps
- There is a remote recognizer that caters to a much wider scope
The UA would send audio to both and combine the results to deliver to JavaScript
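
As a purely illustrative sketch (ids, URIs and values invented) of a combined document the UA might deliver in that case:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="combined">
    <!-- from the local recognizer with the device-specific grammar -->
    <emma:interpretation id="local1"
        emma:process="http://example.com/local-asr"
        emma:confidence="0.6" emma:tokens="call bob">
      <contact>bob</contact>
    </emma:interpretation>
    <!-- from the remote recognizer with the wider scope -->
    <emma:interpretation id="remote1"
        emma:process="http://example.com/cloud-asr"
        emma:confidence="0.8" emma:tokens="call bob mobile">
      <contact>bob</contact>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>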

Jerry, could you clarify which use case you meant? The language I proposed was aimed at a recognizer sitting outside the UA and generating EMMA data, in which case it seemed appropriate for the UA to pass it through unmodified. If the UA is indeed generating EMMA data (whether combining results from multiple recognizers or because the recognizer doesn't give EMMA data), it should be allowed to do so.

Your understanding is correct.  I've seen different architectures used.

Most often, there is a single vendor providing a single recognizer (either on device or in the cloud).  Also common is the case where a single vendor provides a consolidated product in which on-device and in-cloud resources are used together.  Here the interface may generate a single result which the UA would be free to pass along as-is.  In either case, the vendor will argue for using their result directly.

Then there are rarer but real cases (e.g. certain Samsung products) where multiple vendors are used in combination or as alternatives.  When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource.  When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers.  Either way, some entity is preparing the final EMMA result.  Coming from the media resource side, I think of that entity as the UA.  Someone from the UA side might very well think of it as being just a different recognition resource!

In my mind, and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification.  There are, in fact, mechanisms within EMMA to describe intermediate results and their connections.  A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain.  A hybrid resource might offer separate processing chains and then a unified result.  The example that drove much of the discussion was a map application with a touch screen.  The user says "How do I drive from here to here" with corresponding touches.  The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
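
Sketched very roughly (ids and values invented, and I am not being careful about schema details here), such a result might look like:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- final combined result for the routing application -->
  <emma:interpretation id="route1">
    <emma:derived-from resource="#speech-sem" composite="true"/>
    <emma:derived-from resource="#touch1" composite="true"/>
    <emma:derived-from resource="#touch2" composite="true"/>
    <mode>drive</mode>
    <origin>location1</origin>
    <destination>location2</destination>
  </emma:interpretation>

  <!-- intermediate stages preserved in the derivation chain -->
  <emma:derivation>
    <emma:interpretation id="speech-tokens"
        emma:medium="acoustic" emma:mode="voice"
        emma:tokens="how do I drive from here to here"/>
    <emma:interpretation id="speech-sem">
      <emma:derived-from resource="#speech-tokens"/>
      <mode>drive</mode>
    </emma:interpretation>
    <emma:interpretation id="touch1" emma:medium="tactile" emma:mode="touch">
      <location>location1</location>
    </emma:interpretation>
    <emma:interpretation id="touch2" emma:medium="tactile" emma:mode="touch">
      <location>location2</location>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>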

What is critical is that the application developer have access to the information required for their task.  Developers, in my experience, object when information is stripped.  Media resource vendors, working in response to developer requests, want assurances that additional details that they add will be passed through to the developer.  The current language

UA implementations for recognizers that supply EMMA must pass that EMMA structure directly.

is too restrictive.   Let me suggest instead

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s).  The UA MAY add additional annotations to provide a richer result for the developer."

I offered SHOULD or MUST.  I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason.  I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.

Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer's EMMA?

The primary reason I added that clause was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now, and I agree that dropping it enables use cases like the one I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.

--
Cheers
Satish

Makes sense.

-=- Jerry

Received on Tuesday, 19 June 2012 15:22:01 UTC