
RE: Review of EMMA usage in the Speech API (first editor's draft)

From: Deborah Dahl <dahl@conversational-technologies.com>
Date: Thu, 14 Jun 2012 09:13:04 -0400
To: "'Jerry Carter'" <jerry@jerrycarter.org>, <public-speech-api@w3.org>
Message-ID: <00ba01cd4a2f$72a162f0$57e428d0$@conversational-technologies.com>
Thanks for bringing up this very common architecture that combines local
signal processing with cloud-based processing. We should definitely keep
this architecture in mind as we discuss use cases.

To make sure I understand your suggestion, I think one instantiation of the
process you're suggesting might be:

1. Speech is captured locally and transmitted to the cloud-based recognizer.

2. The UA builds part of the EMMA with locally-known information like
the timestamps, a reference to the emma:process, emma:source, emma:grammar,
and possibly a reference to the emma:signal, if it knows that.

3. The speech recognizer comes back with its own EMMA representing the
recognition results.

4. The local EMMA and the speech recognizer's EMMA are combined by the
UA as a derivation to create the EMMA that's made available as part of the
speech result through the API.

Is this roughly what you had in mind?
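Concretely, step 4 might look something like the sketch below (this is just
an illustration, not anything from the draft: the ids, attribute values, and
semantic payload are made up, and only the emma:* names come from the EMMA
1.0 specification):

```python
# Illustrative sketch of step 4: the UA wraps the recognizer's EMMA in a
# derivation, adding locally-known annotations (timestamps, emma:source).
# All ids, URIs, and the <destination> payload are hypothetical.
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)

# EMMA as it might come back from the cloud recognizer (step 3).
recognizer_emma = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="reco1" emma:confidence="0.82"
                       emma:tokens="flights to boston">
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>"""

def combine(recognizer_xml, start_ms, end_ms, source_uri):
    """Build the UA's composite EMMA: the recognizer's interpretation
    becomes the emma:derived-from input of a new interpretation that
    carries the UA's locally-known annotations (step 2)."""
    reco_root = ET.fromstring(recognizer_xml)
    reco_interp = reco_root.find(f"{{{EMMA_NS}}}interpretation")

    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    derived = ET.SubElement(root, f"{{{EMMA_NS}}}interpretation", {
        "id": "ua1",
        f"{{{EMMA_NS}}}start": str(start_ms),
        f"{{{EMMA_NS}}}end": str(end_ms),
        f"{{{EMMA_NS}}}source": source_uri,
    })
    # Point at the recognizer's interpretation as the derivation input.
    ET.SubElement(derived, f"{{{EMMA_NS}}}derived-from",
                  {"resource": "#reco1", "composite": "false"})
    # Carry the recognizer's semantic payload up to the derived result.
    for child in list(reco_interp):
        derived.append(child)
    # Preserve the recognizer's original interpretation under emma:derivation.
    derivation = ET.SubElement(root, f"{{{EMMA_NS}}}derivation")
    derivation.append(reco_interp)
    return ET.tostring(root, encoding="unicode")

print(combine(recognizer_emma, 0, 1870, "http://example.com/microphone"))
```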

I agree with you that it would actually be very convenient if the UA did
this processing, but I don't think anything's preventing the application
from doing it if there are other reasons for the UA not to modify the speech
recognition result.

Milan and Satish, could you elaborate on what you had in mind when you
raised concerns about the UA modifying the speech recognizer's EMMA?


From: Jerry Carter [mailto:jerry@jerrycarter.org] 
Sent: Wednesday, June 13, 2012 10:31 PM
To: public-speech-api@w3.org; Deborah Dahl
Subject: Review of EMMA usage in the Speech API (first editor's draft)



The current language is fairly minimal:



EMMA 1.0 (http://www.w3.org/TR/emma/) representation of this result. The
contents of this result could vary across UAs and recognition engines, but
all implementations must expose a valid XML document complete with EMMA
namespace. UA implementations for recognizers that supply EMMA must pass
that EMMA structure directly.


I have mixed feelings about whether EMMA is appropriate for this
specification.  Arguing against, the EMMA specification is fairly large and
rather complex, which may adversely impact the usability of the Speech API
for many web application developers.  Arguing in favor, EMMA provides a nice
framework for representing complex semantic results and their derivations
through multiple engines.  I have read the arguments on the list and am
encouraged that the consensus has favored the inclusion of EMMA.  At the
same time, I hope that future drafts of the Speech API or of supporting
documents will help clarify how user results are represented in EMMA.  I
see that Milan has offered a few possibilities for future consideration, but
I do not believe these are sufficient.


The second sentence is troublesome.  I do not see any reason that the UA
would need to pass EMMA results directly.  In fact, doing so runs counter to
the original intent of the EMMA specification.  As my co-editor explained in
an earlier post [1]:


I'm not sure why a web developer would care whether the EMMA they get from
the UA is exactly what the speech recognizer supplied. On the other hand, I
can think of useful things that the UA could add to the EMMA, for example,
something in the <info> tag about the UA that the request originated from,
that the recognizer wouldn't necessarily know about. In that case you might
actually want modified EMMA.


One recurring implementation strategy that I have seen for mobile devices is
to combine local signal processing resources with cloud-based ones.  Here
the result of a recognition would necessarily combine information from the
two different resources, and it would be inappropriate to return the EMMA
result from a single resource.  Much better from the perspective of EMMA
would be to build a composite result with separate derivation chains.
[Debbie, I know you later said that a direct result would be okay [2], but
you may have been thinking of a simpler architecture.]
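For illustration, a composite result of this kind might look roughly as
follows (the ids, process values, timestamps, and semantic payload are all
invented; only the emma:* elements and attributes come from the EMMA 1.0
specification):

```xml
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- One interpretation derived from both the local and the cloud
       resource, each preserved in its own derivation chain. -->
  <emma:interpretation id="combined"
                       emma:start="1339640400000" emma:end="1339640401870">
    <emma:derived-from resource="#local" composite="true"/>
    <emma:derived-from resource="#cloud" composite="true"/>
    <destination>Boston</destination>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="local"
                         emma:process="local:endpointer"
                         emma:signal="local:audio-capture"/>
    <emma:interpretation id="cloud"
                         emma:process="http://example.com/recognizer"
                         emma:confidence="0.82">
      <destination>Boston</destination>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>
```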


Thanks for the discussion to date and for the first draft.


-=- Jerry


[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0056.html

[2] http://lists.w3.org/Archives/Public/public-speech-api/2012Jun/0059.html
Received on Thursday, 14 June 2012 13:13:47 UTC
