- From: Satish S <satish@google.com>
- Date: Fri, 15 Jun 2012 12:11:46 +0100
- To: Jerry Carter <jerry@jerrycarter.org>
- Cc: Deborah Dahl <dahl@conversational-technologies.com>, public-speech-api@w3.org
- Message-ID: <CAHZf7RmQNvzJPgnDvp0qKLqk-_bTt6uq_AvDVMVTSPeZ3=Hecg@mail.gmail.com>
> > "The EMMA document MUST/SHOULD contain all annotations and content > generated by the recognizer(s). The UA MAY add additional annotations to > provide a richer result for the developer." If we use MUST above it would disallow UAs from selecting content from one recognizer over the other. So I think SHOULD would be more relevant. Cheers Satish On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org>wrote: > > On Jun 15, 2012, at 5:06 AM, Satish S wrote: > > Is this roughly what you had in mind? >> > > I understood what Jerry wrote as > - There is a local recognizer, probably with a device specific grammar > such as contacts and apps > - There is a remote recognizer that caters to a much wider scope > The UA would send audio to both and combine the results to deliver to > Javascript > > Jerry, could you clarify which use case you meant? The language I proposed > was aimed towards a recognizer sitting outside the UA and generating EMMA > data in which case it seemed appropriate that the UA would pass it through > unmodified. If the UA is indeed generating EMMA data (whether combining > from multiple recognizers or where the recognizer doesn't give EMMA data) > it should be allowed to do so. > > > Your understanding is correct. I've seen different architectures used. > > Most often, there is a single vendor providing a single recognizer (either > on device or in the cloud). Also common, a single vendor will provide a > consolidated product in which on device and in cloud resources are used > together. Here the interface may generate a single result which the UA > would be free to pass along as-is. In either case, the vendor will argue > for using their result directly. > > Then, there are rarer but real cases (e.g. certain Samsung products), > where multiple venders are used in combination or as alternatives. When > used in combination, a consolidated recognition result would be generated > outside what I think of as the recognition resource. When used as > alternatives, the results might be restructured or altered with the goal of > providing consistent content or formats for developers. Either way, some > entity is preparing the final EMMA result. Coming from the media resource > side, I think of that entity as the UA. Someone from the UA side might > very well think of that as being just a different recognition resource! > > In my mind and in the discussions as EMMA was coming to maturity, there is > no reason that an EMMA result need pass through layers without > modification. There are, in fact, mechanisms within EMMA do describe > intermediate results and their connections. A speech recognition result > might detail the phonetic (or sub-phonetic) lattice, the corresponding > tokens from a grammar, and the semantic meaning as a derivation chain. A > hybrid resource might offer separate processing chains and then a unified > result. The example that drove much of the discussion was a map > application with a touch screen. The user says "How do I drive from here > to here" with corresponding touches. The EMMA result could include the > entire recognition chain (phonetics -> tokens -> semantics) with the > sequence of touches (touch1, then touch2) and then produce a final result > (mode=drive + location1 + location2) for passing to a routing application. > > What is critical is that the application developer have access to the > information required for their task. Developers, in my experience, object > when information is stripped. 
On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:

> On Jun 15, 2012, at 5:06 AM, Satish S wrote:
>
> > > Is this roughly what you had in mind?
> >
> > I understood what Jerry wrote as
> > - There is a local recognizer, probably with a device specific grammar
> >   such as contacts and apps
> > - There is a remote recognizer that caters to a much wider scope
> > The UA would send audio to both and combine the results to deliver to
> > JavaScript.
> >
> > Jerry, could you clarify which use case you meant? The language I
> > proposed was aimed at a recognizer sitting outside the UA and
> > generating EMMA data, in which case it seemed appropriate that the UA
> > would pass it through unmodified. If the UA is indeed generating EMMA
> > data (whether combining from multiple recognizers or where the
> > recognizer doesn't give EMMA data) it should be allowed to do so.
>
> Your understanding is correct. I've seen different architectures used.
>
> Most often, there is a single vendor providing a single recognizer
> (either on device or in the cloud). Also common, a single vendor will
> provide a consolidated product in which on-device and in-cloud resources
> are used together. Here the interface may generate a single result which
> the UA would be free to pass along as-is. In either case, the vendor
> will argue for using their result directly.
>
> Then, there are rarer but real cases (e.g. certain Samsung products)
> where multiple vendors are used in combination or as alternatives. When
> used in combination, a consolidated recognition result would be
> generated outside what I think of as the recognition resource. When used
> as alternatives, the results might be restructured or altered with the
> goal of providing consistent content or formats for developers. Either
> way, some entity is preparing the final EMMA result. Coming from the
> media resource side, I think of that entity as the UA. Someone from the
> UA side might very well think of that as being just a different
> recognition resource!
>
> In my mind, and in the discussions as EMMA was coming to maturity, there
> is no reason that an EMMA result need pass through layers without
> modification. There are, in fact, mechanisms within EMMA to describe
> intermediate results and their connections. A speech recognition result
> might detail the phonetic (or sub-phonetic) lattice, the corresponding
> tokens from a grammar, and the semantic meaning as a derivation chain. A
> hybrid resource might offer separate processing chains and then a
> unified result. The example that drove much of the discussion was a map
> application with a touch screen. The user says "How do I drive from here
> to here" with corresponding touches. The EMMA result could include the
> entire recognition chain (phonetics -> tokens -> semantics) with the
> sequence of touches (touch1, then touch2) and then produce a final
> result (mode=drive + location1 + location2) for passing to a routing
> application.
>
> What is critical is that the application developer have access to the
> information required for their task. Developers, in my experience,
> object when information is stripped. Media resource vendors, working in
> response to developer requests, want assurances that additional details
> they add will be passed through to the developer. The current language
>
> > UA implementations for recognizers that supply EMMA *must* pass that
> > EMMA structure directly.
>
> is too restrictive. Let me suggest instead
>
> > "The EMMA document MUST/SHOULD contain all annotations and content
> > generated by the recognizer(s). The UA MAY add additional annotations
> > to provide a richer result for the developer."
>
> I offered SHOULD or MUST. I prefer MUST because I believe that the
> contents of a result generated by the recognizer exist for a reason. I
> can accept SHOULD if there is a strong argument for presenting a
> simplified or altered result.
>
> > > Milan and Satish, could you elaborate on what you had in mind when
> > > you raised concerns about the UA modifying the speech recognizer’s
> > > EMMA?
> >
> > The primary reason I added that clause was to preserve those EMMA
> > attributes (emma:process, ...) from the recognizer to JS without
> > calling out specific attributes. Since we agreed that instead of
> > calling out attributes we'll add use cases as examples, there is less
> > reason for this clause now, and I agree it does enable use cases like
> > the ones I mentioned above. So I'm fine dropping that clause if there
> > are no other strong reasons to keep it in.
> >
> > --
> > Cheers
> > Satish
>
> Makes sense.
>
> -=- Jerry
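Jerry's map example above might be realized along these lines: a rough
EMMA 1.0 sketch of a unified result derived from one speech input and two
touches. The coordinates, the application elements (mode, origin,
destination), and the medium/mode values for touch are invented, and a
real result would likely also carry the phonetic lattice and grammar
token stages inside emma:derivation.

    <emma:emma version="1.0"
        xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:derivation>
        <!-- Speech stage (lattice and semantics omitted here) -->
        <emma:interpretation id="speech1"
            emma:medium="acoustic" emma:mode="voice"
            emma:tokens="how do I drive from here to here"/>
        <!-- Touch stage: touch1, then touch2, in temporal order -->
        <emma:sequence id="touches">
          <emma:interpretation id="touch1"
              emma:medium="tactile" emma:mode="touch">
            <point>42.3601 -71.0589</point>
          </emma:interpretation>
          <emma:interpretation id="touch2"
              emma:medium="tactile" emma:mode="touch">
            <point>42.4430 -71.2290</point>
          </emma:interpretation>
        </emma:sequence>
      </emma:derivation>
      <!-- Unified result (mode=drive + location1 + location2) for
           passing to the routing application -->
      <emma:interpretation id="route1">
        <emma:derived-from resource="#speech1" composite="true"/>
        <emma:derived-from resource="#touches" composite="true"/>
        <mode>drive</mode>
        <origin>42.3601 -71.0589</origin>
        <destination>42.4430 -71.2290</destination>
      </emma:interpretation>
    </emma:emma>

Whichever entity assembles the unified interpretation (the UA or a separate
integration stage), the recognizer's own stages remain available to the
developer inside emma:derivation.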
Received on Friday, 15 June 2012 11:12:25 UTC