Re: Review of EMMA usage in the Speech API (first editor's draft)

Yes, I believe that was the use case Jerry mentioned earlier as well. There
could be data coming from both recognizers and the UA might need to pick
one of them (e.g. when an EMMA attribute can't be repeated with different
values).
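
For example (a purely hypothetical sketch; the ids, URIs and confidence
values below are invented), a UA could keep both results as alternatives
under emma:one-of, with each interpretation carrying its own emma:process
annotation; but if it collapses them into a single interpretation, only one
emma:process value can survive:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="alts">
    <!-- Hypothetical: recognizer URIs, ids and confidences are invented -->
    <emma:interpretation id="int1" emma:confidence="0.8"
        emma:process="http://example.com/local-recognizer">
      <command>call bob</command>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.6"
        emma:process="http://example.com/cloud-recognizer">
      <command>call rob</command>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>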

Cheers
Satish


On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:

>  Satish, are you thinking of a scenario in which the UA runs multiple
> recognizers in parallel and then selects among the results?  If so, I think
> that’s a reasonable use case, but I’d like to preserve Jerry’s MUST clause
> wrt annotations.  Could we agree on something like:
>
> "The EMMA document MUST contain all annotations and content generated by
> the recognizer(s) that were used to produce the corresponding result.  The
> UA MAY add additional annotations to provide a richer result for the
> developer."****
>
> ** **
>
> Thanks****
>
> ** **
>
> ** **
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Friday, June 15, 2012 4:12 AM
> *To:* Jerry Carter
> *Cc:* Deborah Dahl; public-speech-api@w3.org
> *Subject:* Re: Review of EMMA usage in the Speech API (first editor's
> draft)
>
> "The EMMA document MUST/SHOULD contain all annotations and content
> generated by the recognizer(s).  The UA MAY add additional annotations to
> provide a richer result for the developer."****
>
>  ** **
>
> If we use MUST above, it would disallow UAs from selecting content from one
> recognizer over the other, so I think SHOULD would be more appropriate.
>
> Cheers
> Satish
>
>
> On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org>
> wrote:
>
> On Jun 15, 2012, at 5:06 AM, Satish S wrote:
>
>    Is this roughly what you had in mind?
>
> I understood what Jerry wrote as:
>
> - There is a local recognizer, probably with a device-specific grammar
> such as contacts and apps
>
> - There is a remote recognizer that caters to a much wider scope
>
> The UA would send audio to both and combine the results to deliver to
> JavaScript.
>
> Jerry, could you clarify which use case you meant? The language I proposed
> was aimed at a recognizer sitting outside the UA and generating EMMA
> data, in which case it seemed appropriate for the UA to pass it through
> unmodified. If the UA is indeed generating EMMA data (whether combining
> results from multiple recognizers or because the recognizer doesn't produce
> EMMA data), it should be allowed to do so.
>
> Your understanding is correct.  I've seen different architectures used.
>
> Most often, there is a single vendor providing a single recognizer (either
> on device or in the cloud).  Also common is the case where a single vendor
> provides a consolidated product in which on-device and in-cloud resources
> are used together.  Here the interface may generate a single result which
> the UA would be free to pass along as-is.  In either case, the vendor will
> argue for using their result directly.
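>
> In the single-recognizer case, such a pass-through result can be as simple
> as this minimal sketch (the ids, values, and application elements below
> are invented for illustration):
>
> <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
>   <!-- Hypothetical single-recognizer result, passed along unmodified -->
>   <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice"
>       emma:confidence="0.9"
>       emma:tokens="flights from boston to denver">
>     <origin>Boston</origin>
>     <destination>Denver</destination>
>   </emma:interpretation>
> </emma:emma>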
>
> Then, there are rarer but real cases (e.g. certain Samsung products),
> where multiple vendors are used in combination or as alternatives.  When
> used in combination, a consolidated recognition result would be generated
> outside what I think of as the recognition resource.  When used as
> alternatives, the results might be restructured or altered with the goal of
> providing consistent content or formats for developers.  Either way, some
> entity is preparing the final EMMA result.  Coming from the media resource
> side, I think of that entity as the UA.  Someone from the UA side might
> very well think of it as just a different recognition resource!
>
> In my mind, and in the discussions as EMMA was coming to maturity, there is
> no reason that an EMMA result need pass through layers without
> modification.  There are, in fact, mechanisms within EMMA to describe
> intermediate results and their connections.  A speech recognition result
> might detail the phonetic (or sub-phonetic) lattice, the corresponding
> tokens from a grammar, and the semantic meaning as a derivation chain.  A
> hybrid resource might offer separate processing chains and then a unified
> result.  The example that drove much of the discussion was a map
> application with a touch screen.  The user says "How do I drive from here
> to here" with corresponding touches.  The EMMA result could include the
> entire recognition chain (phonetics -> tokens -> semantics) with the
> sequence of touches (touch1, then touch2) and then produce a final result
> (mode=drive + location1 + location2) for passing to a routing application.
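>
> A rough sketch of such a composite result (purely illustrative: the ids,
> mode values, and application elements are invented) might look like:
>
> <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
>   <emma:derivation>
>     <!-- Hypothetical intermediate results; all values are invented -->
>     <emma:interpretation id="speech" emma:medium="acoustic"
>         emma:mode="voice"
>         emma:tokens="how do I drive from here to here"/>
>     <emma:interpretation id="touch1" emma:medium="tactile"
>         emma:mode="touch"/>
>     <emma:interpretation id="touch2" emma:medium="tactile"
>         emma:mode="touch"/>
>   </emma:derivation>
>   <emma:interpretation id="final" emma:medium="acoustic tactile"
>       emma:mode="voice touch">
>     <emma:derived-from resource="#speech" composite="true"/>
>     <emma:derived-from resource="#touch1" composite="true"/>
>     <emma:derived-from resource="#touch2" composite="true"/>
>     <mode>drive</mode>
>     <location1>x1,y1</location1>
>     <location2>x2,y2</location2>
>   </emma:interpretation>
> </emma:emma>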
>
>
> What is critical is that the application developer have access to the
> information required for their task.  Developers, in my experience, object
> when information is stripped.  Media resource vendors, working in response
> to developer requests, want assurances that additional details that they
> add will be passed through to the developer.  The current language
>
> UA implementations for recognizers that supply EMMA *must* pass that EMMA
> structure directly.
>
> is too restrictive.  Let me suggest instead
>
>  "The EMMA document MUST/SHOULD contain all annotations and content
> generated by the recognizer(s).  The UA MAY add additional annotations to
> provide a richer result for the developer."****
>
>  ** **
>
> I offered SHOULD or MUST.  I prefer MUST because I believe that the
> contents of a result generated by the recognizer exist for a reason.  I
> can accept SHOULD if there is a strong argument for presenting a simplified
> or altered result.
>
>    Milan and Satish, could you elaborate on what you had in mind when you
> raised concerns about the UA modifying the speech recognizer’s EMMA?
>
> The primary reason I added that clause was to preserve those EMMA
> attributes (emma:process, ...) from the recognizer to JS without calling
> out specific attributes. Since we agreed that instead of calling out
> attributes we'll add use cases as examples, there is less reason for this
> clause now, and I agree it does enable use cases like the one I mentioned
> above. So I'm fine with dropping that clause if there are no other strong
> reasons to keep it in.
>
> --
>
> Cheers
>
> Satish
>
> Makes sense.
>
> -=- Jerry
>

Received on Tuesday, 19 June 2012 11:27:51 UTC