- From: Glen Shires <gshires@google.com>
- Date: Tue, 18 Sep 2012 13:18:27 -0700
- To: "Young, Milan" <Milan.Young@nuance.com>, Jerry Carter <jerry@jerrycarter.org>, Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
- Message-ID: <CAEE5bcjqcAn8TiYii-Fhd=vsrtJM5MtDvyzdVYObQ_nGq-pwwA@mail.gmail.com>
If there's no disagreement with the proposed text below (Sep 12) I will update the spec on Wednesday.

On Wed, Sep 12, 2012 at 8:56 AM, Glen Shires <gshires@google.com> wrote:

> There seems to be agreement on Jerry's wording except for the case that Satish raises in which conflicting data comes from both recognizers and can't be represented in EMMA (an EMMA attribute that can't be repeated with different values). I propose the following slight modification to Jerry's wording that addresses this case. Here's my proposed full definition of the EMMA attribute. (The first two sentences are copied from the current definition in the spec.)
>
> "EMMA 1.0 representation of this result. The contents of this result could vary across UAs and recognition engines, but all implementations must expose a valid XML document complete with EMMA namespace. UA implementations for recognizers that supply EMMA MUST contain all annotations and content generated by the recognition resources utilized for recognition, except where infeasible due to conflicting attributes. The UA MAY add additional annotations to provide a richer result for the developer."
>
> /Glen Shires
>
> On Wed, Jun 20, 2012 at 10:03 AM, Young, Milan <Milan.Young@nuance.com> wrote:
>
>> Nice write-up. I agree with the language. I also agree that further details on running engines in parallel are necessary.
>>
>> From: Jerry Carter [mailto:jerry@jerrycarter.org]
>> Sent: Tuesday, June 19, 2012 9:22 PM
>> To: Satish S; Young, Milan
>> Cc: public-speech-api@w3.org
>> Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)
>>
>> Been off-line for a few days, so trying to catch up a bit…
>>
>> Where we agree
>>
>> It sounds like there is general agreement, but that additional word-smithing may be required. By my reading of this thread, no one has objected to language allowing the UA to add additional annotations or to present a result which combines annotations from various sources. Likewise, no one has objected to a goal of presenting the full set of annotations generated by the resources used. This is encouraging common ground.
>>
>> Single resource model: There might be no problem
>>
>> The point of contention concerns how the UA handles cases where multiple resources might be used to build a result. This may not be a real problem. I do not see any language in the current draft which describes how multiple resources would be specified or acquired, so I'm not at all surprised that matters are unsettled. One option is to draw the architectural lines such that there is never more than one recognition service. Here the serviceURI would be required to point to a single logical entity rather than a document describing a collection of resources. The single entity might employ disparate resources under the covers, but this would be outside of the Speech API specification.
>>
>> In this case, the proposed language may be fine:
>>
>> "The EMMA document MUST contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."
>>
>> and I can propose further language describing speech recognition services in more detail.
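To make the proposed EMMA attribute concrete, here is a minimal sketch of how a page might consume it, assuming the recognition result surfaces the EMMA document to script as a parsed XML Document (the proposal above only requires "a valid XML document complete with EMMA namespace"). The event shape and the annotation names read out below (emma:tokens, emma:confidence, emma:process) are illustrative assumptions, not anything mandated by the draft.

```typescript
// A minimal sketch of consuming the EMMA attribute proposed above. Assumes the
// UA surfaces the EMMA document to script as a parsed XML Document; the event
// shape below is a stand-in, not the draft IDL.
interface SpeechRecognitionEventLike {
  emma: Document | null;
}

const EMMA_NS = "http://www.w3.org/2003/04/emma"; // EMMA 1.0 namespace

function handleResult(event: SpeechRecognitionEventLike): void {
  const emma = event.emma;
  if (!emma) return; // no EMMA supplied by the recognizer or the UA

  // Because annotations from the recognition resources are carried through,
  // engine-specific detail can be read straight off the interpretations.
  const interps = emma.getElementsByTagNameNS(EMMA_NS, "interpretation");
  for (const interp of Array.from(interps)) {
    console.log(
      interp.getAttributeNS(EMMA_NS, "tokens"),     // recognized tokens, if annotated
      interp.getAttributeNS(EMMA_NS, "confidence"), // engine confidence, if annotated
      interp.getAttributeNS(EMMA_NS, "process")     // processing resource, if annotated
    );
  }
}
```

Nothing in this sketch depends on whether the final wording uses MUST or SHOULD; it only depends on annotations not being stripped on the way to the page.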
>> Dealing with Multiple Resources
>>
>> Where the serviceURI may refer to multiple resources, and at the risk of over-generalizing, there are two cases. The first case is where the 'start()' method invokes one or more resources and the result is built from all of them. This is the most common case that I've seen in live systems. There appears to be consensus that the result presented to the application author should contain the detailed annotations produced by the resources involved.
>>
>> So we're left with the second case, where the 'start()' method invokes multiple resources but only a subset are used to generate the result. Here at least one of the resources must be optional for result production, otherwise a recognition error would be generated. Perhaps that optional resource was temporarily unavailable, or perhaps resource constraints prevented it from being used, or perhaps it failed to return a result before a timeout occurred. Whatever the reason, the optional resource did not contribute to the recognition. I have no problem with the EMMA result excluding any mention of the disregarded optional resource. I also see nothing wrong with the UA adding annotations to describe any selection criteria used to generate the EMMA result.
>>
>> Let me offer a slight variation on my original language and on Milan's proposal to capture this case:
>>
>> "The EMMA document MUST contain all annotations and content generated by the recognition resources utilized for recognition. The UA MAY add additional annotations to provide a richer result for the developer."
>>
>> Again, further language describing speech recognition services will be helpful.
>>
>> -=- Jerry
>>
>> On Jun 19, 2012, at 11:13 AM, Satish S wrote:
>>
>> Since there could be data coming from both recognizers and the UA might need to pick one of them and drop the rest (e.g. any EMMA attribute that can't be repeated with different values), we can't say the UA MUST send all content and all annotations. Using SHOULD allows the UA to implement this use case.
>>
>> Cheers
>> Satish
>>
>> On Tue, Jun 19, 2012 at 4:07 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>>
>> Good to hear. So what did you think about the proposed text? I didn't like the previous SHOULD-based suggestion.
>>
>> Thanks
>>
>> From: Satish S [mailto:satish@google.com]
>> Sent: Tuesday, June 19, 2012 4:27 AM
>> To: Young, Milan
>> Cc: Jerry Carter; Deborah Dahl; public-speech-api@w3.org
>> Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)
>>
>> Yes I believe that was the use case Jerry mentioned earlier as well. There could be data coming from both recognizers and the UA might need to pick one of them (e.g. any EMMA attribute that can't be repeated with different values).
>>
>> Cheers
>> Satish
>>
>> On Tue, Jun 19, 2012 at 1:55 AM, Young, Milan <Milan.Young@nuance.com> wrote:
>>
>> Satish, are you thinking of a scenario in which the UA runs multiple recognizers in parallel and then selects among the results? If so, I think that's a reasonable use case, but I'd like to preserve Jerry's MUST clause wrt annotations. Could we agree on something like:
>>
>> "The EMMA document MUST contain all annotations and content generated by the recognizer(s) that were used to produce the corresponding result. The UA MAY add additional annotations to provide a richer result for the developer."
>>
>> Thanks
>>
>> From: Satish S [mailto:satish@google.com]
>> Sent: Friday, June 15, 2012 4:12 AM
>> To: Jerry Carter
>> Cc: Deborah Dahl; public-speech-api@w3.org
>> Subject: Re: Review of EMMA usage in the Speech API (first editor's draft)
>>
>> "The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."
>>
>> If we use MUST above it would disallow UAs from selecting content from one recognizer over the other. So I think SHOULD would be more relevant.
>>
>> Cheers
>> Satish
>>
>> On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:
>>
>> On Jun 15, 2012, at 5:06 AM, Satish S wrote:
>>
>> Is this roughly what you had in mind?
>>
>> I understood what Jerry wrote as
>>
>> - There is a local recognizer, probably with a device-specific grammar such as contacts and apps
>> - There is a remote recognizer that caters to a much wider scope
>>
>> The UA would send audio to both and combine the results to deliver to Javascript
>>
>> Jerry, could you clarify which use case you meant? The language I proposed was aimed towards a recognizer sitting outside the UA and generating EMMA data, in which case it seemed appropriate that the UA would pass it through unmodified. If the UA is indeed generating EMMA data (whether combining from multiple recognizers or where the recognizer doesn't give EMMA data) it should be allowed to do so.
>>
>> Your understanding is correct. I've seen different architectures used.
>>
>> Most often, there is a single vendor providing a single recognizer (either on device or in the cloud). Also common, a single vendor will provide a consolidated product in which on-device and in-cloud resources are used together. Here the interface may generate a single result which the UA would be free to pass along as-is. In either case, the vendor will argue for using their result directly.
>>
>> Then, there are rarer but real cases (e.g. certain Samsung products), where multiple vendors are used in combination or as alternatives. When used in combination, a consolidated recognition result would be generated outside what I think of as the recognition resource. When used as alternatives, the results might be restructured or altered with the goal of providing consistent content or formats for developers. Either way, some entity is preparing the final EMMA result. Coming from the media resource side, I think of that entity as the UA. Someone from the UA side might very well think of that as being just a different recognition resource!
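The "except where infeasible due to conflicting attributes" carve-out in Glen's proposed wording covers exactly this multi-recognizer situation. As a rough sketch only: if a UA were to fold annotations from a second recognizer into the interpretation it has chosen, it has to drop one value whenever an attribute cannot carry both. The merge function and the list of single-valued attributes below are assumptions for illustration, not anything defined by EMMA 1.0 or the draft spec.

```typescript
// Sketch of a UA folding annotations from a second recognizer into the chosen
// interpretation element. The function name and the single-valued attribute
// list are assumptions for illustration only.
const EMMA_NS = "http://www.w3.org/2003/04/emma";

// Attributes treated here as unable to carry two different values at once.
const SINGLE_VALUED = ["confidence", "process", "media-type"];

function mergeAnnotations(target: Element, source: Element): void {
  for (const attr of Array.from(source.attributes)) {
    if (attr.namespaceURI !== EMMA_NS) continue; // only EMMA annotations
    const existing = target.getAttributeNS(EMMA_NS, attr.localName);
    const conflicts =
      existing !== null && existing !== attr.value && SINGLE_VALUED.includes(attr.localName);
    if (conflicts) continue; // "except where infeasible due to conflicting attributes"
    target.setAttributeNS(EMMA_NS, attr.name, attr.value);
  }
}
```

Everything that is not in genuine conflict still flows through, which is the property the MUST wording is trying to guarantee.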
>> In my mind and in the discussions as EMMA was coming to maturity, there is no reason that an EMMA result need pass through layers without modification. There are, in fact, mechanisms within EMMA to describe intermediate results and their connections. A speech recognition result might detail the phonetic (or sub-phonetic) lattice, the corresponding tokens from a grammar, and the semantic meaning as a derivation chain. A hybrid resource might offer separate processing chains and then a unified result. The example that drove much of the discussion was a map application with a touch screen. The user says "How do I drive from here to here" with corresponding touches. The EMMA result could include the entire recognition chain (phonetics -> tokens -> semantics) with the sequence of touches (touch1, then touch2) and then produce a final result (mode=drive + location1 + location2) for passing to a routing application.
>>
>> What is critical is that the application developer have access to the information required for their task. Developers, in my experience, object when information is stripped. Media resource vendors, working in response to developer requests, want assurances that additional details that they add will be passed through to the developer. The current language
>>
>> UA implementations for recognizers that supply EMMA must pass that EMMA structure directly.
>>
>> is too restrictive. Let me suggest instead
>>
>> "The EMMA document MUST/SHOULD contain all annotations and content generated by the recognizer(s). The UA MAY add additional annotations to provide a richer result for the developer."
>>
>> I offered SHOULD or MUST. I prefer MUST because I believe that the contents of a result generated by the recognizer exist for a reason. I can accept SHOULD if there is a strong argument for presenting a simplified or altered result.
>>
>> Milan and Satish, could you elaborate on what you had in mind when you raised concerns about the UA modifying the speech recognizer's EMMA?
>>
>> The primary reason I added that clause in was to preserve those EMMA attributes (emma:process, ...) from the recognizer to JS without calling out specific attributes. Since we agreed that instead of calling out attributes we'll add use cases as examples, there is less reason for this clause now and I agree it does enable use cases like what I mentioned above. So I'm fine dropping that clause if there are no other strong reasons to keep it in.
>>
>> --
>> Cheers
>> Satish
>>
>> Makes sense.
>>
>> -=- Jerry
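For readers unfamiliar with EMMA derivation chains, here is a hand-written sketch of roughly what Jerry's map example could look like once the touch inputs and the speech interpretation are combined. The element and attribute names (emma:derivation, emma:derived-from, emma:group, emma:medium, emma:mode, emma:tokens) come from EMMA 1.0, but the document shape, the ids, and the application payload (<mode>, <origin>, <destination>) are invented for illustration; a real recognizer or UA would emit something richer.

```typescript
// Hand-written EMMA document approximating the map example: the speech and
// touch stages are preserved under emma:derivation, and the combined routing
// request is the top-level interpretation. Ids and payload element names are
// invented for illustration.
const exampleEmma = `
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:derivation>
    <emma:interpretation id="speech1" emma:medium="acoustic" emma:mode="voice"
                         emma:tokens="how do I drive from here to here"/>
    <emma:group id="touches1" emma:medium="tactile" emma:mode="touch">
      <emma:interpretation id="touch1"/>
      <emma:interpretation id="touch2"/>
    </emma:group>
  </emma:derivation>
  <emma:interpretation id="route1" emma:medium="acoustic tactile" emma:mode="voice touch">
    <emma:derived-from resource="#speech1" composite="true"/>
    <emma:derived-from resource="#touches1" composite="true"/>
    <mode>drive</mode>
    <origin>location1</origin>
    <destination>location2</destination>
  </emma:interpretation>
</emma:emma>`;

const EMMA_NS = "http://www.w3.org/2003/04/emma";
const doc = new DOMParser().parseFromString(exampleEmma, "application/xml");
const root = doc.documentElement;

// The routing application only needs the final, combined interpretation...
const finalInterps = Array.from(root.children).filter(
  (el) => el.namespaceURI === EMMA_NS && el.localName === "interpretation"
);
console.log(finalInterps[0]?.querySelector("mode")?.textContent); // "drive"

// ...while the intermediate stages remain available to developers who want them.
console.log(root.getElementsByTagNameNS(EMMA_NS, "derivation").length); // 1
```

The point of the sketch is the one made in the thread: passing the derivation chain through costs the routing application nothing, while stripping it would take information away from developers who ask for it.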
Received on Tuesday, 18 September 2012 20:19:38 UTC