- From: Glen Shires <gshires@google.com>
- Date: Thu, 4 Oct 2012 18:30:36 -0700
- To: "Young, Milan" <Milan.Young@nuance.com>
- Cc: Jerry Carter <jerry@jerrycarter.org>, Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
- Message-ID: <CAEE5bchP-XecApL4bCk-e=JN56-PUw5+-g-bChj72+v+cNJb3A@mail.gmail.com>
If there's no disagreement with the proposed text below (Sep 12), I will update the spec tomorrow (Friday).

On Tue, Sep 18, 2012 at 2:28 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Sorry Glen,

I got busy with other things and need time to catch up on this thread. Please hold off until the end of the week on the change.

Thanks

*From:* Glen Shires [mailto:gshires@google.com]
*Sent:* Tuesday, September 18, 2012 1:18 PM
*To:* Young, Milan; Jerry Carter; Satish S; public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

If there's no disagreement with the proposed text below (Sep 12), I will update the spec on Wednesday.

On Wed, Sep 12, 2012 at 8:56 AM, Glen Shires <gshires@google.com> wrote:

There seems to be agreement on Jerry's wording except for the case that Satish raises, in which conflicting data comes from both recognizers and can't be represented in EMMA (an EMMA attribute that can't be repeated with different values). I propose the following slight modification to Jerry's wording that addresses this case. Here's my proposed full definition of the EMMA attribute. (The first two sentences are copied from the current definition in the spec.)

"EMMA 1.0 representation of this result. The contents of this result could vary across UAs and recognition engines, but all implementations must expose a valid XML document complete with EMMA namespace. UA implementations for recognizers that supply EMMA MUST contain all annotations and content generated by the recognition resources utilized for recognition, except where infeasible due to conflicting attributes. The UA MAY add additional annotations to provide a richer result for the developer."

/Glen Shires
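For concreteness, here is a minimal sketch of how a page might read the attribute defined above. It assumes the editor's draft's event-level `emma` attribute and a prefixed `webkitSpeechRecognition` constructor (both assumptions, not requirements from this thread); the namespace lookups are just one plausible way to inspect whatever EMMA document the UA exposes.

```javascript
// Sketch only: assumes the draft's SpeechRecognitionEvent.emma attribute
// (an XML Document in the EMMA namespace) and a prefixed constructor.
var EMMA_NS = 'http://www.w3.org/2003/04/emma';

var recognition = new webkitSpeechRecognition();

recognition.onresult = function (event) {
  var emmaDoc = event.emma;        // EMMA document, if the UA supplies one
  if (!emmaDoc) return;

  // Under the proposed MUST clause, engine-generated annotations such as
  // emma:confidence or emma:process are preserved; the UA MAY add more.
  var interps = emmaDoc.getElementsByTagNameNS(EMMA_NS, 'interpretation');
  for (var i = 0; i < interps.length; i++) {
    console.log(interps[i].getAttributeNS(EMMA_NS, 'confidence'),
                interps[i].textContent);
  }
};

recognition.start();
```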
On Wed, Jun 20, 2012 at 10:03 AM, Young, Milan <Milan.Young@nuance.com> wrote:

Nice write-up. I agree with the language. I also agree that further details on running engines in parallel are necessary.

*From:* Jerry Carter [mailto:jerry@jerrycarter.org]
*Sent:* Tuesday, June 19, 2012 9:22 PM
*To:* Satish S; Young, Milan
*Cc:* public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

Been off-line for a few days, so trying to catch up a bit…

*Where we agree*

It sounds like there is general agreement, but that additional word-smithing may be required. By my reading of this thread, no one has objected to language allowing the UA to add additional annotations or to present a result which combines annotations from various sources. Likewise, no one has objected to the goal of presenting the full set of annotations generated by the resources used. This is encouraging common ground.

*Single resource model: There might be no problem*

The point of contention concerns how the UA handles cases where multiple resources might be used to build a result. This may not be a real problem. I do not see any language in the current draft which describes how multiple resources would be specified or acquired, so I'm not at all surprised that matters are unsettled. One option is to draw the architectural lines such that there is never more than one recognition service. Here the *serviceURI* would be required to point to a single logical entity rather than a document describing a collection of resources. The single entity might employ disparate resources under the covers, but this would be outside of the Speech API specification.
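Read that way, the single-service option is just a matter of what the *serviceURI* names. A minimal sketch, assuming the draft's serviceURI attribute and a prefixed constructor (the URL is a placeholder, not a real endpoint):

```javascript
// Sketch only: serviceURI names one logical recognition service; any
// fan-out to multiple engines happens behind that endpoint, outside the
// scope of the Speech API. The URL below is a placeholder.
var recognition = new webkitSpeechRecognition();
recognition.serviceURI = 'https://speech.example.com/recognize';
recognition.lang = 'en-US';
recognition.start();
```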
In this case, the proposed language may be fine:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

and I can propose further language describing speech recognition services in more detail.

*Dealing with Multiple Resources*

Where the *serviceURI* may refer to multiple resources, and at the risk of over-generalizing, there are two cases. The first case is where the 'start()' method invokes one or more resources and the result is built from all of them. This is the most common case that I've seen in live systems. There appears to be consensus that the result presented to the application author should contain the detailed annotations produced by the resources involved.

So we're left with the second case, where the 'start()' method invokes multiple resources but only a subset are used to generate the result. Here at least one of the resources must be optional for result production, otherwise a recognition error would be generated. Perhaps that optional resource was temporarily unavailable, or perhaps resource constraints prevented it from being used, or perhaps it failed to return a result before a timeout occurred. Whatever the reason, the optional resource did not contribute to the recognition. I have no problem with the EMMA result excluding any mention of the disregarded optional resource. I also see nothing wrong with the UA adding annotations to describe any selection criteria used to generate the EMMA result.

Let me offer a slight variation on my original language and on Milan's proposal to capture this case:

"The EMMA document MUST contain all annotations and content generated by the recognition resources utilized for recognition. The UA MAY add additional annotations to provide a richer result for the developer."

Again, further language describing speech recognition services will be helpful.

-=- Jerry

On Jun 19, 2012, at 11:13 AM, Satish S wrote:

Since there could be data coming from both recognizers and the UA might need to pick one of them and drop the rest (e.g. any EMMA attribute that can't be repeated with different values), we can't say the UA MUST send all content and all annotations. Using SHOULD allows the UA to implement this use case.

Cheers
Satish

On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com> wrote:

Good to hear. So what did you think about the proposed text? I didn't like the previous SHOULD-based suggestion.

Thanks

*From:* Satish S [mailto:satish@google.com]
*Sent:* Tuesday, June 19, 2012 4:27 AM
*To:* Young, Milan
*Cc:* Jerry Carter; Deborah Dahl; public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

Yes, I believe that was the use case Jerry mentioned earlier as well. There could be data coming from both recognizers and the UA might need to pick one of them (e.g. any EMMA attribute that can't be repeated with different values).

Cheers
Satish

On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:

Satish, are you thinking of a scenario in which the UA runs multiple recognizers in parallel and then selects among the results? If so, I think that's a reasonable use case, but I'd like to preserve Jerry's MUST clause wrt annotations. Could we agree on something like:

"The EMMA document MUST contain all annotations and content generated by the recognizer(s) that were used to produce the corresponding result. The UA MAY add additional annotations to provide a richer result for the developer."

Thanks

*From:* Satish S [mailto:satish@google.com]
*Sent:* Friday, June 15, 2012 4:12 AM
*To:* Jerry Carter
*Cc:* Deborah Dahl; public-speech-api@w3.org
*Subject:* Re: Review of EMMA usage in the Speech API (first editor's draft)

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

If we use MUST above it would disallow UAs from selecting content from one recognizer over the other. So I think SHOULD would be more relevant.

Cheers
Satish

On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:

On Jun 15, 2012, at 5:06 AM, Satish S wrote:

Is this roughly what you had in mind?

I understood what Jerry wrote as:

- There is a local recognizer, probably with a device-specific grammar such as contacts and apps
- There is a remote recognizer that caters to a much wider scope

The UA would send audio to both and combine the results to deliver to JavaScript.

Jerry, could you clarify which use case you meant? The language I proposed was aimed towards a recognizer sitting outside the UA and generating EMMA data, in which case it seemed appropriate that the UA would pass it through unmodified. If the UA is indeed generating EMMA data (whether combining from multiple recognizers or where the recognizer doesn't give EMMA data) it should be allowed to do so.

Your understanding is correct. I've seen different architectures used.

Most often, there is a single vendor providing a single recognizer (either on device or in the cloud). Also common, a single vendor will provide a consolidated product in which on-device and in-cloud resources are used together. Here the interface may generate a single result which the UA would be free to pass along as-is. In either case, the vendor will argue for using their result directly.

Then, there are rarer but real cases (e.g. certain Samsung products) where multiple vendors are used in combination or as alternatives. When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource. When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers. Either way, some entity is preparing the final EMMA result. Coming from the media resource side, I think of that entity as the UA. Someone from the UA side might very well think of that as being just a different recognition resource!
In my mind, and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification. There are, in fact, mechanisms within EMMA to describe intermediate results and their connections. A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain. A hybrid resource might offer separate processing chains and then a unified result. The example that drove much of the discussion was a map application with a touch screen. The user says "How do I drive from here to here" with corresponding touches. The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
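To make that map example concrete, here is a rough sketch of what such a combined result might look like, loosely modeled on EMMA 1.0. The <route> payload and all attribute values are invented for illustration, and a fuller result could also carry the intermediate chain (lattice, tokens, touches) using emma:derivation and emma:derived-from.

```javascript
// Illustrative only: a hand-written combined speech + touch result,
// parsed the same way a page could inspect the document a UA exposes.
var emmaText =
  '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">' +
    '<emma:interpretation id="route1" emma:confidence="0.82"' +
    ' emma:medium="acoustic tactile" emma:mode="voice touch"' +
    ' emma:tokens="how do I drive from here to here">' +
      '<route mode="drive" from="location1" to="location2"/>' +  // app-specific payload (invented)
    '</emma:interpretation>' +
  '</emma:emma>';

var emmaDoc = new DOMParser().parseFromString(emmaText, 'application/xml');
console.log(emmaDoc.documentElement.namespaceURI);  // http://www.w3.org/2003/04/emma
```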
What is critical is that the application developer have access to the information required for their task. Developers, in my experience, object when information is stripped. Media resource vendors, working in response to developer requests, want assurances that additional details they add will be passed through to the developer. The current language

"UA implementations for recognizers that supply EMMA *must* pass that EMMA structure directly."

is too restrictive. Let me suggest instead:

"The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."

I offered SHOULD or MUST. I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason. I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.

Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer's EMMA?

The primary reason I added that clause was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now, and I agree it does enable use cases like what I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.

--
Cheers
Satish

Makes sense.

-=- Jerry

Received on Friday, 5 October 2012 01:31:53 UTC