- From: Glen Shires <gshires@google.com>
- Date: Fri, 5 Oct 2012 11:34:14 -0700
- To: "Young, Milan" <Milan.Young@nuance.com>, Jerry Carter <jerry@jerrycarter.org>, Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
- Message-ID: <CAEE5bchfSA8dXx8aChnS5Qhu1wiuGtX5_1kmWmWdWOPf_r2bEg@mail.gmail.com>
I've updated the spec with this updated definition for the EMMA attribute:
https://dvcs.w3.org/hg/speech-api/rev/dcc75df666a5

As always, the current draft spec is at:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

/Glen Shires


On Thu, Oct 4, 2012 at 6:30 PM, Glen Shires <gshires@google.com> wrote:

If there's no disagreement with the proposed text below (Sep 12) I will update the spec tomorrow (Friday).


On Tue, Sep 18, 2012 at 2:28 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Sorry Glen,

I got busy with other things and need time to catch up on this thread. Please hold off until the end of the week on the change.

Thanks


From: Glen Shires [mailto:gshires@google.com]
Sent: Tuesday, September 18, 2012 1:18 PM
To: Young, Milan; Jerry Carter; Satish S; public-speech-api@w3.org
Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

If there's no disagreement with the proposed text below (Sep 12) I will update the spec on Wednesday.


On Wed, Sep 12, 2012 at 8:56 AM, Glen Shires <gshires@google.com> wrote:

There seems to be agreement on Jerry's wording except for the case that Satish raises, in which conflicting data comes from both recognizers and can't be represented in EMMA (an EMMA attribute that can't be repeated with different values). I propose the following slight modification to Jerry's wording that addresses this case. Here's my proposed full definition of the EMMA attribute. (The first two sentences are copied from the current definition in the spec.)

"EMMA 1.0 representation of this result. The contents of this result could vary across UAs and recognition engines, but all implementations must expose a valid XML document complete with EMMA namespace. UA implementations for recognizers that supply EMMA MUST contain all annotations and content generated by the recognition resources utilized for recognition, except where infeasible due to conflicting attributes. The UA MAY add additional annotations to provide a richer result for the developer."

/Glen Shires
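For reference, a minimal sketch of the kind of document such an EMMA attribute would expose. The namespace and the emma:* annotations below are standard EMMA 1.0; the utterance, confidence value, and application elements are invented for illustration:

    <emma:emma version="1.0"
               xmlns:emma="http://www.w3.org/2003/04/emma"
               xmlns="http://example.com/media-app">
      <!-- One recognition hypothesis. emma:medium, emma:mode, emma:tokens and
           emma:confidence are standard EMMA 1.0 annotations. -->
      <emma:interpretation id="int1"
                           emma:medium="acoustic"
                           emma:mode="voice"
                           emma:confidence="0.82"
                           emma:tokens="play some jazz">
        <!-- Application semantics from the recognizer's grammar
             (hypothetical element names). -->
        <action>play</action>
        <genre>jazz</genre>
      </emma:interpretation>
    </emma:emma>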
On Wed, Jun 20, 2012 at 10:03 AM, Young, Milan <Milan.Young@nuance.com> wrote:

Nice write-up. I agree with the language. I also agree that further details on running engines in parallel are necessary.


From: Jerry Carter [mailto:jerry@jerrycarter.org]
Sent: Tuesday, June 19, 2012 9:22 PM
To: Satish S; Young, Milan
Cc: public-speech-api@w3.org
Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

Been off-line for a few days, so trying to catch up a bit…

Where we agree

It sounds like there is general agreement, but that additional word-smithing may be required. By my reading of this thread, no one has objected to language allowing the UA to add additional annotations or to present a result which combines annotations from various sources. Likewise, no one has objected to a goal of presenting the full set of annotations generated by the resources used. This is encouraging common ground.

Single resource model: There might be no problem

The point of contention concerns how the UA handles cases where multiple resources might be used to build a result. This may not be a real problem. I do not see any language in the current draft which describes how multiple resources would be specified or acquired, so I'm not at all surprised that matters are unsettled. One option is to draw the architectural lines such that there is never more than one recognition service. Here the serviceURI would be required to point to a single logical entity rather than a document describing a collection of resources. The single entity might employ disparate resources under the covers, but this would be outside of the Speech API specification.

In this case, the proposed language may be fine:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

and I can propose further language describing speech recognition services in more detail.

Dealing with Multiple Resources

Where the serviceURI may refer to multiple resources, and at the risk of over-generalizing, there are two cases. The first case is where the 'start()' method invokes one or more resources and the result is built from all of them. This is the most common case that I've seen in live systems. There appears to be consensus that the result presented to the application author should contain the detailed annotations produced by the resources involved.

So we're left with the second case, where the 'start()' method invokes multiple resources but only a subset are used to generate the result. Here at least one of the resources must be optional for result production, otherwise a recognition error would be generated. Perhaps that optional resource was temporarily unavailable, or perhaps resource constraints prevented it from being used, or perhaps it failed to return a result before a timeout occurred. Whatever the reason, the optional resource did not contribute to the recognition. I have no problem with the EMMA result excluding any mention of the disregarded optional resource. I also see nothing wrong with the UA adding annotations to describe any selection criteria used to generate the EMMA result.

Let me offer a slight variation on my original language and on Milan's proposal to capture this case:

"The EMMA document MUST contain all annotations and content generated by the recognition resources utilized for recognition. The UA MAY add additional annotations to provide a richer result for the developer."

Again, further language describing speech recognition services will be helpful.

-=- Jerry


On Jun 19, 2012, at 11:13 AM, Satish S wrote:

Since there could be data coming from both recognizers and the UA might need to pick one of them and drop the rest (e.g. any EMMA attribute that can't be repeated with different values), we can't say the UA MUST send all content and all annotations. Using SHOULD allows the UA to implement this use case.
Cheers
Satish


On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Good to hear. So what did you think about the proposed text? I didn’t like the previous SHOULD-based suggestion.

Thanks


From: Satish S [mailto:satish@google.com]
Sent: Tuesday, June 19, 2012 4:27 AM
To: Young, Milan
Cc: Jerry Carter; Deborah Dahl; public-speech-api@w3.org
Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

Yes, I believe that was the use case Jerry mentioned earlier as well. There could be data coming from both recognizers and the UA might need to pick one of them (e.g. any EMMA attribute that can't be repeated with different values).

Cheers
Satish


On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:

Satish, are you thinking of a scenario in which the UA runs multiple recognizers in parallel and then selects among the results? If so, I think that’s a reasonable use case, but I’d like to preserve Jerry’s MUST clause wrt annotations. Could we agree on something like:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s) that were used to produce the corresponding result. The UA MAY add additional annotations to provide a richer result for the developer."

Thanks


From: Satish S [mailto:satish@google.com]
Sent: Friday, June 15, 2012 4:12 AM
To: Jerry Carter
Cc: Deborah Dahl; public-speech-api@w3.org
Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

If we use MUST above it would disallow UAs from selecting content from one recognizer over the other. So I think SHOULD would be more relevant.

Cheers
Satish


On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:

On Jun 15, 2012, at 5:06 AM, Satish S wrote:

Is this roughly what you had in mind?

I understood what Jerry wrote as:

- There is a local recognizer, probably with a device-specific grammar such as contacts and apps
- There is a remote recognizer that caters to a much wider scope

The UA would send audio to both and combine the results to deliver to Javascript.
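A rough sketch of what such a combined result might look like in EMMA 1.0, assuming the UA keeps each engine's hypothesis as a separate interpretation under emma:one-of (the emma:process URIs, utterances, confidence values, and application elements below are invented for illustration):

    <emma:emma version="1.0"
               xmlns:emma="http://www.w3.org/2003/04/emma"
               xmlns="http://example.com/phone-app">
      <!-- Competing hypotheses from different engines, each retaining the
           annotations of the recognizer that produced it. -->
      <emma:one-of id="result1" emma:medium="acoustic" emma:mode="voice">
        <emma:interpretation id="local1"
                             emma:confidence="0.90"
                             emma:tokens="call bob"
                             emma:process="http://example.com/on-device-recognizer">
          <callee>Bob</callee>
        </emma:interpretation>
        <emma:interpretation id="cloud1"
                             emma:confidence="0.60"
                             emma:tokens="call bond"
                             emma:process="http://example.com/cloud-recognizer">
          <callee>Bond</callee>
        </emma:interpretation>
      </emma:one-of>
    </emma:emma>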
Jerry, could you clarify which use case you meant? The language I proposed was aimed towards a recognizer sitting outside the UA and generating EMMA data, in which case it seemed appropriate that the UA would pass it through unmodified. If the UA is indeed generating EMMA data (whether combining from multiple recognizers or where the recognizer doesn't give EMMA data) it should be allowed to do so.

Your understanding is correct. I've seen different architectures used.

Most often, there is a single vendor providing a single recognizer (either on device or in the cloud). Also common, a single vendor will provide a consolidated product in which on-device and in-cloud resources are used together. Here the interface may generate a single result which the UA would be free to pass along as-is. In either case, the vendor will argue for using their result directly.

Then, there are rarer but real cases (e.g. certain Samsung products), where multiple vendors are used in combination or as alternatives. When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource. When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers. Either way, some entity is preparing the final EMMA result. Coming from the media resource side, I think of that entity as the UA. Someone from the UA side might very well think of that as being just a different recognition resource!

In my mind and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification. There are, in fact, mechanisms within EMMA to describe intermediate results and their connections. A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain. A hybrid resource might offer separate processing chains and then a unified result. The example that drove much of the discussion was a map application with a touch screen. The user says "How do I drive from here to here" with corresponding touches. The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
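A very rough sketch of how EMMA 1.0 can carry such a derivation chain. Only the speech leg is shown; the ids and semantic payload are invented, and a real composite speech-plus-touch result would reference the touch inputs as well:

    <emma:emma version="1.0"
               xmlns:emma="http://www.w3.org/2003/04/emma"
               xmlns="http://example.com/maps-app">
      <!-- Earlier processing stages are preserved under emma:derivation. -->
      <emma:derivation>
        <emma:interpretation id="speech1"
                             emma:medium="acoustic" emma:mode="voice"
                             emma:tokens="how do I drive from here to here"/>
      </emma:derivation>
      <!-- The final interpretation points back at the stage(s) it was derived
           from; the semantic elements are hypothetical. -->
      <emma:interpretation id="route1" emma:medium="acoustic" emma:mode="voice">
        <emma:derived-from resource="#speech1" composite="true"/>
        <mode>drive</mode>
        <origin>location1</origin>
        <destination>location2</destination>
      </emma:interpretation>
    </emma:emma>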
What is critical is that the application developer have access to the information required for their task. Developers, in my experience, object when information is stripped. Media resource vendors, working in response to developer requests, want assurances that additional details that they add will be passed through to the developer. The current language

"UA implementations for recognizers that supply EMMA must pass that EMMA structure directly."

is too restrictive. Let me suggest instead

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

I offered SHOULD or MUST. I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason. I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.

Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer’s EMMA?

The primary reason I added that clause in was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now and I agree it does enable use cases like what I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.

--
Cheers
Satish

Makes sense.

-=- Jerry
Received on Friday, 5 October 2012 18:35:23 UTC