Re: Review of EMMA usage in the Speech API (first editor's draft)

There seems to be agreement on Jerry's wording except for the case Satish
raises, in which conflicting data comes from both recognizers and cannot be
represented in a single EMMA document (for example, an EMMA attribute that
cannot be repeated with different values).  I propose the following slight
modification to Jerry's wording to address this case. Here's my proposed full
definition of the EMMA attribute. (The first two sentences are copied from
the current definition in the spec.)

"EMMA 1.0 representation of this result. The contents of this result could
vary across UAs and recognition engines, but all implementations must
expose a valid XML document complete with EMMA namespace. UA
implementations for recognizers that supply EMMA MUST contain all
annotations and content generated by the recognition resources utilized for
recognition, except where infeasible due to conflicting attributes.  The UA
MAY add additional annotations to provide a richer result for the
developer."

/Glen Shires

On Wed, Jun 20, 2012 at 10:03 AM, Young, Milan <Milan.Young@nuance.com> wrote:

>  Nice write-up.  I agree with the language.  I also agree that further
> details on running engines in parallel are necessary.
>
> *From:* Jerry Carter [mailto:jerry@jerrycarter.org]
> *Sent:* Tuesday, June 19, 2012 9:22 PM
> *To:* Satish S; Young, Milan
> *Cc:* public-speech-api@w3.org
>
> *Subject:* Re: Review of EMMA usage in the Speech API (first editor's
> draft)
>
> Been off-line for a few days, so trying to catch up a bit…
>
> *Where we agree*
>
> It sounds like there is general agreement, but that additional
> word-smithing may be required.  By my reading of this thread, no one has
> objected to language allowing the UA to add additional annotations or to
> present a result which combines annotations from various sources.
>  Likewise, no one has objected to a goal of presenting the full set of
> annotations generated by the resources used.  This is encouraging common
> ground.
>
> *Single resource model: There might be no problem*
>
> The point of contention concerns how the UA handles cases where multiple
> resources might be used to build a result.  This may not be a real problem.
>  I do not see any language in the current draft which describes how
> multiple resources would be specified or acquired, so I'm not at all
> surprised that matters are unsettled.  One option is to draw the
> architectural lines such that there is never more than one recognition
> service.  Here the *serviceURI* would be required to point to a single
> logical entity rather than a document describing a collection of resources.
>  The single entity might employ disparate resources under the covers, but
> this would be outside of the Speech API specification.
>
> In this case, the proposed language may be fine:
>
> "The EMMA document MUST contain all annotations and content generated by
> the recognizer(s).  The UA MAY add additional annotations to provide a
> richer result for the developer."
>
> and I can propose further language describing speech recognition services
> in more detail.
>
> *Dealing with Multiple Resources*
>
> Where the *serviceURI*  may refer to multiple resources, and at the risk
> of over generalizing, there are two cases.  The first case is where the
> 'start()' method invokes one or more resources and the result is built from
> all of them.  This is the most common case that I've seen in live systems.
>  There appears to be consensus that the result presented to the application
> author should contain the detailed annotations produced by the resources
> involved.
>
> So we're left with the second case, where the 'start()' method invokes
> multiple resources but only a subset are used to generate the result.  Here
> at least one of the resources must be optional for result production,
> otherwise a recognition error would be generated.  Perhaps that optional
> resource was temporarily unavailable, or perhaps resource constraints
> prevented it from being used, or perhaps it failed to return a result
> before a timeout occurred.  Whatever the reason, the optional resource did
> not contribute to the recognition.  I have no problem with the EMMA result
> excluding any mention of the disregarded optional resource.  I also see
> nothing wrong with the UA adding annotations to describe any selection
> criteria used to generate the EMMA result.
>
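> As a rough sketch (the markup inside emma:info is hypothetical, not defined
> by EMMA 1.0, and the resource URI is a placeholder), such a UA-added
> selection annotation might look like:
>
>   <emma:info xmlns:emma="http://www.w3.org/2003/04/emma">
>     <selection>
>       <resource uri="http://example.com/remote-asr" used="false"
>                 reason="timeout"/>
>     </selection>
>   </emma:info>
>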
> Let me offer a slight variation on my original language and on Milan's
> proposal to capture this case:
>
> "The EMMA document MUST contain all annotations and content generated by
> the recognition resources utilized for recognition.  The UA MAY add
> additional annotations to provide a richer result for the developer."
>
> Again, further language describing speech recognition services will be
> helpful.
>
> -=- Jerry
>
> On Jun 19, 2012, at 11:13 AM, Satish S wrote:
>
> Since there could be data coming from both recognizers and the UA might
> need to pick one of them and drop the rest (e.g. any EMMA attribute that
> can't be repeated with different values), we can't say the UA MUST send all
> content and all annotations. Using SHOULD allows the UA to implement this
> use case.
>
> Cheers
> Satish
>
> On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com>
> wrote:
>
> Good to hear.  So what did you think about the proposed text?  I didn’t
> like the previous SHOULD-based suggestion.
>
> Thanks
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Tuesday, June 19, 2012 4:27 AM
> *To:* Young, Milan
> *Cc:* Jerry Carter; Deborah Dahl; public-speech-api@w3.org
>
> *Subject:* Re: Review of EMMA usage in the Speech API (first editor's
> draft)
>
> Yes I believe that was the use case Jerry mentioned earlier as well. There
> could be data coming from both recognizers and the UA might need to pick
> one of them (e.g. any EMMA attribute that can't be repeated with different
> values).
>
>
> Cheers
> Satish
>
> On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com>
> wrote:
>
> Satish, are you thinking of a scenario in which the UA runs multiple
> recognizers in parallel and then selects among the results?  If so, I think
> that’s a reasonable use case, but I’d like to preserve Jerry’s MUST  clause
> wrt annotations.  Could we agree on something like:
>
> "The EMMA document MUST contain all annotations and content generated by
> the recognizer(s) that were used to produce the corresponding result.  The
> UA MAY add additional annotations to provide a richer result for the
> developer."
>
> Thanks
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Friday, June 15, 2012 4:12 AM
> *To:* Jerry Carter
> *Cc:* Deborah Dahl; public-speech-api@w3.org
> *Subject:* Re: Review of EMMA usage in the Speech API (first editor's
> draft)
>
> "The EMMA document MUST/SHOULD contain all annotations and content
> generated by the recognizer(s).  The UA MAY add additional annotations to
> provide a richer result for the developer."
>
> If we use MUST above it would disallow UAs from selecting content from one
> recognizer over the other. So I think SHOULD would be more relevant.
>
> Cheers
> Satish
>
> On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org>
> wrote:
>
> On Jun 15, 2012, at 5:06 AM, Satish S wrote:
>
>    Is this roughly what you had in mind?
>
> I understood what Jerry wrote as:
>
> - There is a local recognizer, probably with a device-specific grammar
> such as contacts and apps
>
> - There is a remote recognizer that caters to a much wider scope
>
> The UA would send audio to both and combine the results to deliver to
> Javascript.
>
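> For instance (purely a sketch, with hypothetical engine URIs and content),
> the UA might combine the two hypotheses in a single EMMA document using
> emma:one-of:
>
>   <emma:emma version="1.0"
>              xmlns:emma="http://www.w3.org/2003/04/emma">
>     <emma:one-of id="r1">
>       <emma:interpretation id="local1" emma:confidence="0.9"
>                            emma:process="http://example.com/local-asr"
>                            emma:tokens="call bob">
>         <contact>Bob</contact>
>       </emma:interpretation>
>       <emma:interpretation id="remote1" emma:confidence="0.6"
>                            emma:process="http://example.com/cloud-asr"
>                            emma:tokens="call bob mobile">
>         <contact>Bob (mobile)</contact>
>       </emma:interpretation>
>     </emma:one-of>
>   </emma:emma>
>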
> Jerry, could you clarify which use case you meant? The language I proposed
> was aimed towards a recognizer sitting outside the UA and generating EMMA
> data in which case it seemed appropriate that the UA would pass it through
> unmodified. If the UA is indeed generating EMMA data (whether combining
> from multiple recognizers or where the recognizer doesn't give EMMA data)
> it should be allowed to do so.
>
> Your understanding is correct.  I've seen different architectures used.
>
> Most often, there is a single vendor providing a single recognizer (either
> on device or in the cloud).   Also common, a single vendor will provide a
> consolidated product in which on device and in cloud resources are used
> together.  Here the interface may generate a single result which the UA
> would be free to pass along as-is.  In either case, the vendor will argue
> for using their result directly.
>
> Then, there are rarer but real cases (e.g. certain Samsung products),
> where multiple vendors are used in combination or as alternatives.  When
> used in combination, a consolidated recognition result would be generated
> outside what I think of as the recognition resource.  When used as
> alternatives, the results might be restructured or altered with the goal of
> providing consistent content or formats for developers.  Either way, some
> entity is preparing the final EMMA result.  Coming from the media resource
> side, I think of that entity as the UA.  Someone from the UA side might
> very well think of that as being just a different recognition resource!
>
> In my mind and in the discussions as EMMA was coming to maturity, there is
> no reason that an EMMA result need pass through layers without
> modification.  There are, in fact, mechanisms within EMMA to describe
> intermediate results and their connections.  A speech recognition result
> might detail the phonetic (or sub-phonetic) lattice, the corresponding
> tokens from a grammar, and the semantic meaning as a derivation chain.  A
> hybrid resource might offer separate processing chains and then a unified
> result.  The example that drove much of the discussion was a map
> application with a touch screen.  The user says "How do I drive from here
> to here" with corresponding touches.  The EMMA result could include the
> entire recognition chain (phonetics -> tokens -> semantics) with the
> sequence of touches (touch1, then touch2) and then produce a final result
> (mode=drive + location1 + location2) for passing to a routing application.
>
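> A rough sketch of such a result (ids, URIs, and application markup are
> hypothetical; only the EMMA structure is the point) might use
> emma:derivation and emma:derived-from:
>
>   <emma:emma version="1.0"
>              xmlns:emma="http://www.w3.org/2003/04/emma">
>     <emma:derivation>
>       <emma:interpretation id="speech1" emma:medium="acoustic"
>                            emma:mode="voice"
>                            emma:tokens="how do I drive from here to here">
>         <command>route query</command>
>       </emma:interpretation>
>       <emma:interpretation id="touch1" emma:medium="tactile" emma:mode="touch">
>         <point>first touch location</point>
>       </emma:interpretation>
>       <emma:interpretation id="touch2" emma:medium="tactile" emma:mode="touch">
>         <point>second touch location</point>
>       </emma:interpretation>
>     </emma:derivation>
>     <emma:interpretation id="route1">
>       <emma:derived-from resource="#speech1" composite="true"/>
>       <emma:derived-from resource="#touch1" composite="true"/>
>       <emma:derived-from resource="#touch2" composite="true"/>
>       <mode>drive</mode>
>       <location1>first touch location</location1>
>       <location2>second touch location</location2>
>     </emma:interpretation>
>   </emma:emma>
>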
> What is critical is that the application developer have access to the
> information required for their task.  Developers, in my experience, object
> when information is stripped.  Media resource vendors, working in response
> to developer requests, want assurances that additional details that they
> add will be passed through to the developer.  The current language
>
>  UA implementations for recognizers that supply EMMA *must* pass that
> EMMA structure directly.
>
> is too restrictive.  Let me suggest instead
>
>  "The EMMA document MUST/SHOULD contain all annotations and content
> generated by the recognizer(s).  The UA MAY add additional annotations to
> provide a richer result for the developer."
>
> I offered SHOULD or MUST.  I prefer MUST because I believe that the
> contents of a result generated by the recognizer exist for a reason.  I
> can accept SHOULD if there is a strong argument for presenting a simplified
> or altered result.
>
>    Milan and Satish, could you elaborate on what you had in mind when you
> raised concerns about the UA modifying the speech recognizer’s EMMA?
>
> The primary reason I added that clause in was to preserve those EMMA
> attributes (emma:process, ...) from the recognizer to JS without calling
> out specific attributes. Since we agreed that instead of calling out
> attributes we'll add use cases as examples, there is less reason for this
> clause now and I agree it does enable use cases like what I mentioned
> above. So I'm fine dropping that clause if there are no other strong
> reasons to keep it in.
>
> --
>
> Cheers
>
> Satish
>
> Makes sense.
>
> -=- Jerry
>

Received on Wednesday, 12 September 2012 15:58:08 UTC