- From: Glen Shires <gshires@google.com>
- Date: Thu, 4 Oct 2012 18:30:36 -0700
- To: "Young, Milan" <Milan.Young@nuance.com>
- Cc: Jerry Carter <jerry@jerrycarter.org>, Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
- Message-ID: <CAEE5bchP-XecApL4bCk-e=JN56-PUw5+-g-bChj72+v+cNJb3A@mail.gmail.com>
If there's no disagreement with the proposed text below (Sep 12), I will update the spec tomorrow (Friday).

On Tue, Sep 18, 2012 at 2:28 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Sorry Glen,

I got busy with other things and need time to catch up on this thread. Please hold off until the end of the week on the change.

Thanks

*From:* Glen Shires [mailto:gshires@google.com]
*Sent:* Tuesday, September 18, 2012 1:18 PM
*To:* Young, Milan; Jerry Carter; Satish S; public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

If there's no disagreement with the proposed text below (Sep 12), I will update the spec on Wednesday.

On Wed, Sep 12, 2012 at 8:56 AM, Glen Shires <gshires@google.com> wrote:

There seems to be agreement on Jerry's wording except for the case that Satish raises, in which conflicting data comes from both recognizers and can't be represented in EMMA (an EMMA attribute that can't be repeated with different values). I propose the following slight modification to Jerry's wording that addresses this case. Here's my proposed full definition of the EMMA attribute. (The first two sentences are copied from the current definition in the spec.)

"EMMA 1.0 representation of this result. The contents of this result could vary across UAs and recognition engines, but all implementations must expose a valid XML document complete with EMMA namespace. UA implementations for recognizers that supply EMMA MUST contain all annotations and content generated by the recognition resources utilized for recognition, except where infeasible due to conflicting attributes. The UA MAY add additional annotations to provide a richer result for the developer."

/Glen Shires
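For concreteness, here is a minimal sketch of how a page might read the attribute defined above. It assumes the editor's draft's event-level `emma` attribute and a prefixed `webkitSpeechRecognition` constructor (both assumptions, not requirements from this thread); the namespace lookups are just one plausible way to inspect whatever EMMA document the UA exposes.

```javascript
// Sketch only: assumes the draft's SpeechRecognitionEvent.emma attribute
// (an XML Document in the EMMA namespace) and a prefixed constructor.
var EMMA_NS = 'http://www.w3.org/2003/04/emma';

var recognition = new webkitSpeechRecognition();

recognition.onresult = function (event) {
  var emmaDoc = event.emma;        // EMMA document, if the UA supplies one
  if (!emmaDoc) return;

  // Under the proposed MUST clause, engine-generated annotations such as
  // emma:confidence or emma:process are preserved; the UA MAY add more.
  var interps = emmaDoc.getElementsByTagNameNS(EMMA_NS, 'interpretation');
  for (var i = 0; i < interps.length; i++) {
    console.log(interps[i].getAttributeNS(EMMA_NS, 'confidence'),
                interps[i].textContent);
  }
};

recognition.start();
```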
On Wed, Jun 20, 2012 at 10:03 AM, Young, Milan <Milan.Young@nuance.com> wrote:

Nice write-up. I agree with the language. I also agree that further details on running engines in parallel are necessary.

*From:* Jerry Carter [mailto:jerry@jerrycarter.org]
*Sent:* Tuesday, June 19, 2012 9:22 PM
*To:* Satish S; Young, Milan
*Cc:* public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

Been off-line for a few days, so trying to catch up a bit…

*Where we agree*

It sounds like there is general agreement, but that additional word-smithing may be required. By my reading of this thread, no one has objected to language allowing the UA to add additional annotations or to present a result which combines annotations from various sources. Likewise, no one has objected to the goal of presenting the full set of annotations generated by the resources used. This is encouraging common ground.

*Single resource model: There might be no problem*

The point of contention concerns how the UA handles cases where multiple resources might be used to build a result. This may not be a real problem. I do not see any language in the current draft which describes how multiple resources would be specified or acquired, so I'm not at all surprised that matters are unsettled. One option is to draw the architectural lines such that there is never more than one recognition service. Here the *serviceURI* would be required to point to a single logical entity rather than a document describing a collection of resources. The single entity might employ disparate resources under the covers, but this would be outside of the Speech API specification.
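Read that way, the single-service option is just a matter of what the *serviceURI* names. A minimal sketch, assuming the draft's serviceURI attribute and a prefixed constructor (the URL is a placeholder, not a real endpoint):

```javascript
// Sketch only: serviceURI names one logical recognition service; any
// fan-out to multiple engines happens behind that endpoint, outside the
// scope of the Speech API. The URL below is a placeholder.
var recognition = new webkitSpeechRecognition();
recognition.serviceURI = 'https://speech.example.com/recognize';
recognition.lang = 'en-US';
recognition.start();
```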
In this case, the proposed language may be fine:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

and I can propose further language describing speech recognition services in more detail.

*Dealing with Multiple Resources*

Where the *serviceURI* may refer to multiple resources, and at the risk of over-generalizing, there are two cases. The first case is where the 'start()' method invokes one or more resources and the result is built from all of them. This is the most common case that I've seen in live systems. There appears to be consensus that the result presented to the application author should contain the detailed annotations produced by the resources involved.

So we're left with the second case, where the 'start()' method invokes multiple resources but only a subset are used to generate the result. Here at least one of the resources must be optional for result production, otherwise a recognition error would be generated. Perhaps that optional resource was temporarily unavailable, or perhaps resource constraints prevented it from being used, or perhaps it failed to return a result before a timeout occurred. Whatever the reason, the optional resource did not contribute to the recognition. I have no problem with the EMMA result excluding any mention of the disregarded optional resource. I also see nothing wrong with the UA adding annotations to describe any selection criteria used to generate the EMMA result.

Let me offer a slight variation on my original language and on Milan's proposal to capture this case:

"The EMMA document MUST contain all annotations and content generated by the recognition resources utilized for recognition. The UA MAY add additional annotations to provide a richer result for the developer."

Again, further language describing speech recognition services will be helpful.

-=- Jerry

On Jun 19, 2012, at 11:13 AM, Satish S wrote:

Since there could be data coming from both recognizers and the UA might need to pick one of them and drop the rest (e.g. any EMMA attribute that can't be repeated with different values), we can't say the UA MUST send all content and all annotations. Using SHOULD allows the UA to implement this use case.

Cheers
Satish

On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Good to hear. So what did you think about the proposed text? I didn't like the previous SHOULD-based suggestion.

Thanks

*From:* Satish S [mailto:satish@google.com]
*Sent:* Tuesday, June 19, 2012 4:27 AM
*To:* Young, Milan
*Cc:* Jerry Carter; Deborah Dahl; public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

Yes, I believe that was the use case Jerry mentioned earlier as well. There could be data coming from both recognizers and the UA might need to pick one of them (e.g. any EMMA attribute that can't be repeated with different values).

Cheers
Satish

On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:

Satish, are you thinking of a scenario in which the UA runs multiple recognizers in parallel and then selects among the results? If so, I think that's a reasonable use case, but I'd like to preserve Jerry's MUST clause wrt annotations. Could we agree on something like:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s) that were used to produce the corresponding result. The UA MAY add additional annotations to provide a richer result for the developer."

Thanks

*From:* Satish S [mailto:satish@google.com]
*Sent:* Friday, June 15, 2012 4:12 AM
*To:* Jerry Carter
*Cc:* Deborah Dahl; public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

If we use MUST above it would disallow UAs from selecting content from one recognizer over the other. So I think SHOULD would be more relevant.

Cheers
Satish

On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:

On Jun 15, 2012, at 5:06 AM, Satish S wrote:

Is this roughly what you had in mind?

I understood what Jerry wrote as:

- There is a local recognizer, probably with a device-specific grammar such as contacts and apps
- There is a remote recognizer that caters to a much wider scope

The UA would send audio to both and combine the results to deliver to JavaScript.

Jerry, could you clarify which use case you meant? The language I proposed was aimed towards a recognizer sitting outside the UA and generating EMMA data, in which case it seemed appropriate that the UA would pass it through unmodified. If the UA is indeed generating EMMA data (whether combining from multiple recognizers or where the recognizer doesn't give EMMA data) it should be allowed to do so.

Your understanding is correct. I've seen different architectures used.

Most often, there is a single vendor providing a single recognizer (either on device or in the cloud). Also common, a single vendor will provide a consolidated product in which on-device and in-cloud resources are used together. Here the interface may generate a single result which the UA would be free to pass along as-is. In either case, the vendor will argue for using their result directly.

Then, there are rarer but real cases (e.g. certain Samsung products) where multiple vendors are used in combination or as alternatives. When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource. When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers. Either way, some entity is preparing the final EMMA result. Coming from the media resource side, I think of that entity as the UA. Someone from the UA side might very well think of that as being just a different recognition resource!
In my mind, and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification. There are, in fact, mechanisms within EMMA to describe intermediate results and their connections. A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain. A hybrid resource might offer separate processing chains and then a unified result. The example that drove much of the discussion was a map application with a touch screen. The user says "How do I drive from here to here" with corresponding touches. The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
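To make that map example concrete, here is a rough sketch of what such a combined result might look like, loosely modeled on EMMA 1.0. The <route> payload and all attribute values are invented for illustration, and a fuller result could also carry the intermediate chain (lattice, tokens, touches) using emma:derivation and emma:derived-from.

```javascript
// Illustrative only: a hand-written combined speech + touch result,
// parsed the same way a page could inspect the document a UA exposes.
var emmaText =
  '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">' +
    '<emma:interpretation id="route1" emma:confidence="0.82"' +
    ' emma:medium="acoustic tactile" emma:mode="voice touch"' +
    ' emma:tokens="how do I drive from here to here">' +
      '<route mode="drive" from="location1" to="location2"/>' +  // app-specific payload (invented)
    '</emma:interpretation>' +
  '</emma:emma>';

var emmaDoc = new DOMParser().parseFromString(emmaText, 'application/xml');
console.log(emmaDoc.documentElement.namespaceURI);  // http://www.w3.org/2003/04/emma
```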
What is critical is that the application developer have access to the information required for their task. Developers, in my experience, object when information is stripped. Media resource vendors, working in response to developer requests, want assurances that additional details they add will be passed through to the developer. The current language

"UA implementations for recognizers that supply EMMA *must* pass that EMMA structure directly."

is too restrictive. Let me suggest instead:

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

I offered SHOULD or MUST. I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason. I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.

Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer's EMMA?

The primary reason I added that clause was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now, and I agree it does enable use cases like what I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.

--
Cheers
Satish

Makes sense.

-=- Jerry

Received on Friday, 5 October 2012 01:31:53 UTC