Review of EMMA usage in the Speech API (first editor's draft) from Jerry Carter on 2012-06-14 (public-speech-api@w3.org from June 2012)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Wed, 13 Jun 2012 22:30:33 -0400
To: public-speech-api@w3.org, Deborah Dahl <dahl@conversational-technologies.com>
Message-Id: <7E6B4B1C-0A58-46AD-8968-3F5DCDF7A606@jerrycarter.org>

The current language is fairly minimal:

> emma
> EMMA 1.0 representation of this result. The contents of this result could vary across UAs and recognition engines, but all implementations must expose a valid XML document complete with EMMA namespace. UA implementations for recognizers that supply EMMA must pass that EMMA structure directly.

I have mixed feelings about whether EMMA is appropriate for this specification.  Arguing against, the EMMA specification is fairly large and rather complex, which may adversely impact the usability of the Speech API for many web application developers.  Arguing in favor, EMMA provides a nice framework for representing complex semantic results and their derivations through multiple engines.  I have read the arguments on the list and am encouraged that the consensus has favored the inclusion of EMMA.  At the same time, I hope that future drafts of the Speech API or of supporting documents will help clarify how user results are represented in EMMA.  I see that Milan has offered a few possibilities for future consideration, but I do not believe these are sufficient.

The last sentence, which requires UAs to pass recognizer-supplied EMMA through directly, is troublesome.  I do not see any reason that the UA would need to pass EMMA results directly.  In fact, doing so runs counter to the original intent of the EMMA specification.  As my co-editor explained in an earlier post [1]:

> I’m not sure why a web developer would care whether the EMMA they get from the UA is exactly what the speech recognizer supplied. On the other hand, I can think of useful things that the UA could add to the EMMA, for example, something in the <info> tag about the UA that the request originated from, that the recognizer wouldn’t necessarily know about. In that case you might actually want modified EMMA.
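
To make the point above concrete, here is a minimal sketch of a UA annotating recognizer-supplied EMMA before exposing it.  It assumes the recognizer returns a well-formed EMMA 1.0 document and that emma:info is acceptable as a child of the root emma:emma element, as I read the EMMA 1.0 spec.  The function name add_ua_info and the <user-agent> child element are hypothetical, chosen only for illustration; they are not part of EMMA or of the Speech API draft.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace
ET.register_namespace("emma", EMMA_NS)      # keep the "emma:" prefix on output


def add_ua_info(emma_xml: str, ua_name: str) -> str:
    """Return a copy of the recognizer's EMMA with a UA annotation added."""
    root = ET.fromstring(emma_xml)
    if root.tag != f"{{{EMMA_NS}}}emma":
        raise ValueError("expected an emma:emma root element")
    # emma:info carries application- or vendor-specific metadata.
    info = ET.SubElement(root, f"{{{EMMA_NS}}}info")
    # Hypothetical application-specific element naming the originating UA.
    ua = ET.SubElement(info, "user-agent")
    ua.text = ua_name
    return ET.tostring(root, encoding="unicode")
```

The recognizer's own content is left untouched; the UA only appends metadata that the recognizer could not have known.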

One recurring implementation strategy that I have seen for mobile devices is to combine local signal processing resources with cloud-based ones.  Here the result of a recognition would necessarily combine information from the two different resources, and it would be inappropriate to return the EMMA result from a single resource.  Much better from the perspective of EMMA would be to build a composite result with separate derivation chains.  [Debbie, I know you later said that a direct result would be okay [2], but you may have been thinking of a simpler architecture.]
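
As a rough illustration of such a composite result, the sketch below builds an EMMA 1.0 document whose final interpretation points back to two earlier stages through emma:derived-from, with the stages themselves held in emma:derivation.  The stage ids, the process URIs, and the function name build_composite are assumptions made up for this example, and I have not checked the output against the EMMA schema.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace
ET.register_namespace("emma", EMMA_NS)


def q(name: str) -> str:
    """Return the ElementTree qualified name for an EMMA element or attribute."""
    return f"{{{EMMA_NS}}}{name}"


def build_composite(tokens: str, stages: dict) -> str:
    """Build one final interpretation derived from several processing stages.

    `stages` maps a hypothetical stage id to the URI of the process
    (local front end, cloud recognizer, ...) that produced that stage.
    """
    root = ET.Element(q("emma"), {"version": "1.0"})
    final = ET.SubElement(root, q("interpretation"), {
        "id": "final",
        q("medium"): "acoustic",
        q("mode"): "voice",
        q("tokens"): tokens,
    })
    # Earlier stages live in emma:derivation, outside the final result.
    derivation = ET.SubElement(root, q("derivation"))
    for stage_id, process_uri in stages.items():
        # composite="true" marks the result as combining multiple sources.
        ET.SubElement(final, q("derived-from"),
                      {"resource": f"#{stage_id}", "composite": "true"})
        ET.SubElement(derivation, q("interpretation"), {
            "id": stage_id,
            q("medium"): "acoustic",
            q("mode"): "voice",
            q("process"): process_uri,
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_composite("call home", {
        "local1": "http://example.com/local-frontend",    # hypothetical URI
        "cloud1": "http://example.com/cloud-recognizer",  # hypothetical URI
    }))
```

A UA following this pattern would return neither engine's EMMA verbatim; it would return the combined interpretation together with a record of where each part came from.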

Thanks for the discussion to date and for the first draft.

-=- Jerry

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0056.html
[2] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0059.html

Received on Thursday, 14 June 2012 02:31:02 UTC