Re: Review of EMMA usage in the Speech API (first editor's draft) from Satish S on 2012-06-15 (public-speech-api@w3.org from June 2012)

From: Satish S <satish@google.com>
Date: Fri, 15 Jun 2012 12:11:46 +0100
To: Jerry Carter <jerry@jerrycarter.org>
Cc: Deborah Dahl <dahl@conversational-technologies.com>, public-speech-api@w3.org
Message-ID: <CAHZf7RmQNvzJPgnDvp0qKLqk-_bTt6uq_AvDVMVTSPeZ3=Hecg@mail.gmail.com>
>
> "The EMMA document MUST/SHOULD contain all annotations and content
> generated by the recognizer(s).  The UA MAY add additional annotations to
> provide a richer result for the developer."


If we use MUST above it would disallow UAs from selecting content from one
recognizer over the other. So I think SHOULD would be more relevant.
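
To illustrate the concern, here is a minimal Python sketch, assuming a UA that receives one result from a local recognizer and one from a remote recognizer and keeps only the higher-confidence one. The `RecognizerResult` type, the `select_result` function, and the sample transcripts and confidence values are all hypothetical; none of them come from the Speech API or EMMA.

```python
from dataclasses import dataclass


@dataclass
class RecognizerResult:
    source: str        # hypothetical label: "local" or "remote"
    transcript: str    # recognized text (made-up sample data)
    confidence: float  # assumed to be in the range 0.0 to 1.0


def select_result(results):
    """Return the single highest-confidence result; the rest are discarded."""
    return max(results, key=lambda r: r.confidence)


local = RecognizerResult("local", "call Jerry Carter", 0.62)
remote = RecognizerResult("remote", "call Gerry Carter", 0.81)

chosen = select_result([local, remote])
# Everything the UA did not select never reaches the developer.
dropped = [r for r in (local, remote) if r is not chosen]
```

Under MUST, the content in `dropped` would also have to appear in the EMMA document; SHOULD leaves the UA free to omit it.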

Cheers
Satish


On Fri, Jun 15, 2012 at 12:06 PM, Jerry Carter <jerry@jerrycarter.org> wrote:

>
> On Jun 15, 2012, at 5:06 AM, Satish S wrote:
>
>  Is this roughly what you had in mind?
>>
>
> I understood what Jerry wrote as
> - There is a local recognizer, probably with a device specific grammar
> such as contacts and apps
> - There is a remote recognizer that caters to a much wider scope
> The UA would send audio to both and combine the results to deliver to
> Javascript
>
> Jerry, could you clarify which use case you meant? The language I proposed
> was aimed towards a recognizer sitting outside the UA and generating EMMA
> data in which case it seemed appropriate that the UA would pass it through
> unmodified. If the UA is indeed generating EMMA data (whether combining
> from multiple recognizers or where the recognizer doesn't give EMMA data)
> it should be allowed to do so.
>
>
> Your understanding is correct.  I've seen different architectures used.
>
> Most often, there is a single vendor providing a single recognizer (either
> on device or in the cloud).   Also common, a single vendor will provide a
> consolidated product in which on device and in cloud resources are used
> together.  Here the interface may generate a single result which the UA
> would be free to pass along as-is.  In either case, the vendor will argue
> for using their result directly.
>
> Then, there are rarer but real cases (e.g. certain Samsung products),
> where multiple vendors are used in combination or as alternatives.  When
> used in combination, a consolidated recognition result would be generated
> outside what I think of as the recognition resource.  When used as
> alternatives, the results might be restructured or altered with the goal of
> providing consistent content or formats for developers.  Either way, some
> entity is preparing the final EMMA result.  Coming from the media resource
> side, I think of that entity as the UA.  Someone from the UA side might
> very well think of that as being just a different recognition resource!
>
> In my mind and in the discussions as EMMA was coming to maturity, there is
> no reason that an EMMA result need pass through layers without
> modification.  There are, in fact, mechanisms within EMMA to describe
> intermediate results and their connections.  A speech recognition result
> might detail the phonetic (or sub-phonetic) lattice, the corresponding
> tokens from a grammar, and the semantic meaning as a derivation chain.  A
> hybrid resource might offer separate processing chains and then a unified
> result.  The example that drove much of the discussion was a map
> application with a touch screen.  The user says "How do I drive from here
> to here" with corresponding touches.  The EMMA result could include the
> entire recognition chain (phonetics -> tokens -> semantics) with the
> sequence of touches (touch1, then touch2) and then produce a final result
> (mode=drive + location1 + location2) for passing to a routing application.
>
> What is critical is that the application developer have access to the
> information required for their task.  Developers, in my experience, object
> when information is stripped.  Media resource vendors, working in response
> to developer requests, want assurances that additional details that they
> add will be passed through to the developer.  The current language
>
> UA implementations for recognizers that supply EMMA *must* pass that EMMA
> structure directly.
>
>
> is too restrictive.   Let me suggest instead
>
> "The EMMA document MUST/SHOULD contain all annotations and content
> generated by the recognizer(s).  The UA MAY add additional annotations to
> provide a richer result for the developer."
>
>
> I offered SHOULD or MUST.  I prefer MUST because I believe that the
> contents of a result generated by the recognizer exist for a reason.  I
> can accept SHOULD if there is a strong argument for presenting a simplified
> or altered result.
>
> Milan and Satish, could you elaborate on what you had in mind when you
>> raised concerns about the UA modifying the speech recognizer’s EMMA?
>>
>
> The primary reason I added that clause in was to preserve those EMMA
> attributes (emma:process, ...) from the recognizer to JS without calling
> out specific attributes. Since we agreed that instead of calling out
> attributes we'll add use cases as examples, there is less reason for this
> clause now and I agree it does enable use cases like what I mentioned
> above. So I'm fine dropping that clause if there are no other strong
> reasons to keep it in.
>
> --
> Cheers
> Satish
>
>
> Makes sense.
>
> -=- Jerry
>
Received on Friday, 15 June 2012 11:12:25 UTC