RE: R27. Grammars, TTS, media composition, and recognition results should all use standard formats from Deborah Dahl on 2010-10-27 (public-xg-htmlspeech@w3.org from October 2010)

From: Deborah Dahl <dahl@conversational-technologies.com>
Date: Wed, 27 Oct 2010 10:30:53 -0400
To: "'Dave Burke'" <daveburke@google.com>, "'Satish Sampath'" <satish@google.com>
Cc: "'Michael Bodell'" <mbodell@microsoft.com>, "'Bjorn Bringert'" <bringert@google.com>, "'Dan Burnett'" <dburnett@voxeo.com>, <public-xg-htmlspeech@w3.org>
Message-ID: <00fc01cb75e3$9065d590$b13180b0$@conversational-technologies.com>

I think something like that could work and could be pretty simple. Then when
developers realize that they would like to use, for example, timestamps, if
the recognizer supports timestamps and adds them to its EMMA results,
developers could use them right away, without us having to worry about how
to add them to a future API, and we won't have to spend time talking about
things like whether we should use "start" or "begin" as the name of the
timestamp attribute.

Speaking for myself, I use a lot of EMMA in desktop applications, and I
always use it as a Java object, so I think it makes sense for the result to
be available as an object. I find that the XML form is mainly useful for
debugging and logging, where a human might want to look at it.
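
To make the timestamp point concrete, here is a rough sketch, not a proposal.
It assumes a hypothetical helper, annotationsToObject, that copies whatever
annotations the recognizer put on an EMMA interpretation into a plain
JavaScript object. The helper name, the attribute-pair input, and the
resulting object shape are all assumptions made up for this illustration;
only the attribute names and values come from the EMMA example quoted further
down the thread.

```javascript
// Hypothetical sketch: copy every annotation into a plain object.
// Input is a list of [name, value] pairs, as they might be read from an
// <emma:interpretation> element. Nothing here is a real or proposed API.
function annotationsToObject(attributes) {
  const result = {};
  for (const [name, value] of attributes) {
    // Drop the "emma:" prefix; keep every other name unchanged.
    const key = name.startsWith("emma:") ? name.slice("emma:".length) : name;
    result[key] = value;
  }
  return result;
}

// A recognizer that only reports confidence and tokens.
const basic = annotationsToObject([
  ["id", "int1"],
  ["emma:confidence", "0.75"],
  ["emma:tokens", "flights from boston to denver"],
]);

// A recognizer that also adds timestamps: the same helper passes them
// through, so no new property has to be designed or named.
const withTimestamps = annotationsToObject([
  ["id", "int1"],
  ["emma:confidence", "0.75"],
  ["emma:tokens", "flights from boston to denver"],
  ["emma:start", "1087995961542"],
  ["emma:end", "1087995963542"],
]);

console.log(basic.tokens);
// Duration of the utterance in milliseconds.
console.log(Number(withTimestamps.end) - Number(withTimestamps.start));
```

The design choice being illustrated is that the object mirrors whatever EMMA
already names, so "start" versus "begin" never has to be debated separately.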

> -----Original Message-----
> From: Dave Burke [mailto:daveburke@google.com]
> Sent: Wednesday, October 27, 2010 9:48 AM
> To: Satish Sampath
> Cc: Deborah Dahl; Michael Bodell; Bjorn Bringert; Dan Burnett;
> public-xg-htmlspeech@w3.org
> Subject: Re: R27. Grammars, TTS, media composition, and recognition results
> should all use standard formats
> 
> Yep - I was thinking basically a "JSON binding of EMMA" (either formally or in
> spirit) rather than having to expose EMMA in HTML and have recourse to
> DOM APIs to parse it.
> 
> Dave
> 
> 
> On Wed, Oct 27, 2010 at 2:31 PM, Satish Sampath <satish@google.com>
> wrote:
> 
> 
> 	Perhaps it is possible to define a JavaScript-based object model (i.e. a
> 	JSON definition) which meets the requirements and use cases that we
> 	intend to address in the first proposal? JSON is extensible by its very
> 	nature and can scale up as the API grows in future iterations of the
> 	proposal.
> 
> 	Cheers
> 	Satish
> 
> 
> 
> 	On Wed, Oct 27, 2010 at 2:24 PM, Deborah Dahl
> 	<dahl@conversational-technologies.com> wrote:
> 
> 
> 		Developers don't have to understand the complete EMMA specification
> 		to do simple things.
> 		I also think it's probably not true that we can realistically have a
> 		simpler JavaScript object that will meet developers' needs, because
> 		developers' needs will grow quickly as they start using speech in web
> 		applications. With EMMA, more advanced features have already been
> 		standardized and are available for when developers need them. And
> 		sooner or later, they do inevitably need more features. This is the
> 		progression that I've seen -- first, developers think they just want
> 		the recognized string. Then they realize they need semantic tags,
> 		then they need confidences, then they need the nbest, and so on and
> 		so on. It would be very time-consuming to have to keep going back to
> 		reinvent the more advanced capabilities just because we started out
> 		with a limited idea of what developers were going to want. And
> 		tacking on more advanced capabilities after the fact also makes it
> 		difficult to ensure backwards compatibility. We could spend a lot of
> 		time trying to define a simpler JavaScript object and end up with
> 		something that doesn't meet developers' needs, is not simple, or
> 		both.
> 		Another advantage of EMMA is that it is already available as output
> 		from a number of speech recognizers, so using it promotes
> 		interoperability.
> 
> 
> 		> -----Original Message-----
> 		> From: Dave Burke [mailto:daveburke@google.com]
> 		> Sent: Tuesday, October 26, 2010 5:48 PM
> 		> To: Michael Bodell
> 		> Cc: Bjorn Bringert; Dan Burnett; Deborah Dahl;
> 		> public-xg-htmlspeech@w3.org
> 		> Subject: Re: R27. Grammars, TTS, media composition, and recognition
> 		> results should all use standard formats
> 		>
> 		> Seems convoluted to force developers to have to understand EMMA when
> 		> we could have a simpler JavaScript object. What does EMMA buy the
> 		> typical Web developer?
> 		>
> 		> Dave
> 		>
> 		>
> 		> On Tue, Oct 26, 2010 at 10:43 PM, Michael Bodell <mbodell@microsoft.com>
> 		> wrote:
> 		>
> 		>
> 		>       Here's the first EMMA example from the specification:
> 		>
> 		>       <emma:emma version="1.0"
> 		>          xmlns:emma="http://www.w3.org/2003/04/emma"
> 		>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> 		>          xsi:schemaLocation="http://www.w3.org/2003/04/emma
> 		>           http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
> 		>          xmlns="http://www.example.com/example">
> 		>        <emma:one-of id="r1" emma:start="1087995961542"
> 		>           emma:end="1087995963542"
> 		>           emma:medium="acoustic" emma:mode="voice">
> 		>          <emma:interpretation id="int1" emma:confidence="0.75"
> 		>          emma:tokens="flights from boston to denver">
> 		>            <origin>Boston</origin>
> 		>            <destination>Denver</destination>
> 		>          </emma:interpretation>
> 		>
> 		>          <emma:interpretation id="int2" emma:confidence="0.68"
> 		>          emma:tokens="flights from austin to denver">
> 		>            <origin>Austin</origin>
> 		>            <destination>Denver</destination>
> 		>          </emma:interpretation>
> 		>        </emma:one-of>
> 		>       </emma:emma>
> 		>
> 		>       Using something like XPath it is very simple to do something like
> 		>       '//emma:interpretation[@emma:confidence > 0.6][1]' or
> 		>       '//emma:interpretation/origin'.
> 		>
> 		>       Using DOM one could easily do something like
> 		>       getElementById("int1") and inspect that element or else
> 		>       getElementsByTagName("emma:interpretation").
> 		>
> 		>       If you had a more E4X approach you could imagine
> 		>       result["one-of"].interpretation[0] would give you the first result.
> 		>
> 		>       The JSON representation of content might be:
> 		>       ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
> 		>       {origin:"Austin", destination:"Denver"}]}}).
> 		>
> 		>       In addition, depending on how the recognition is defined there might
> 		>       be one or more default bindings of recognition results to input
> 		>       elements in HTML such that scripting isn't needed for the "common
> 		>       tasks" but the scripting is there for the more advanced tasks.
> 		>
> 		>
> 		>       -----Original Message-----
> 		>       From: Bjorn Bringert [mailto:bringert@google.com]
> 		>       Sent: Monday, October 25, 2010 5:43 AM
> 		>       To: Dan Burnett
> 		>       Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
> 		>       Subject: Re: R27. Grammars, TTS, media composition, and recognition
> 		>       results should all use standard formats
> 		>
> 		>       I haven't used EMMA, but it looks like it could be a bit complex for
> 		>       a script to simply get the top utterance or interpretation out. Are
> 		>       there any shorthands or DOM methods for this? Any Hello World
> 		>       examples to show the basic usage?
> 		>
> 		>       /Bjorn
> 		>
> 		>       On Mon, Oct 25, 2010 at 1:38 PM, Dan Burnett <dburnett@voxeo.com>
> 		>       wrote:
> 		>       > +1
> 		>       > On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
> 		>       >
> 		>       >> I agree that SRGS, SISR, EMMA, and SSML seem like the obvious W3C
> 		>       >> standard formats that we should use.
> 		>       >>
> 		>       >> -----Original Message-----
> 		>       >> From: public-xg-htmlspeech-request@w3.org
> 		>       >> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of
> 		>       >> Deborah Dahl
> 		>       >> Sent: Friday, October 22, 2010 6:39 AM
> 		>       >> To: 'Bjorn Bringert'; 'Dan Burnett'
> 		>       >> Cc: public-xg-htmlspeech@w3.org
> 		>       >> Subject: RE: R27. Grammars, TTS, media composition, and
> 		>       >> recognition results should all use standard formats
> 		>       >>
> 		>       >> For recognition results, EMMA
> 		>       >> http://www.w3.org/TR/2009/REC-emma-20090210/
> 		>       >> is a much more recent and more complete standard than NLSML. EMMA
> 		>       >> has a very rich set of capabilities, but most of them are optional,
> 		>       >> so that using it doesn't have to be complex. Quite a few
> 		>       >> recognizers support it. I think one of the most valuable aspects of
> 		>       >> EMMA is that as applications eventually start finding that they
> 		>       >> need more and more information about the recognition result, much
> 		>       >> of that more advanced information has already been worked out and
> 		>       >> standardized in EMMA.
> 		>       >>
> 		>       >>> -----Original Message-----
> 		>       >>> From: public-xg-htmlspeech-request@w3.org
> 		>       >>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn
> 		>       >>> Bringert
> 		>       >>> Sent: Friday, October 22, 2010 7:01 AM
> 		>       >>> To: Dan Burnett
> 		>       >>> Cc: public-xg-htmlspeech@w3.org
> 		>       >>> Subject: Re: R27. Grammars, TTS, media composition, and
> 		>       >>> recognition results should all use standard formats
> 		>       >>>
> 		>       >>> For grammars, SRGS + SISR seems like the obvious choice.
> 		>       >>>
> 		>       >>> For TTS, SSML seems like the obvious choice.
> 		>       >>>
> 		>       >>> I'm not exactly sure what is meant by media composition here. Is
> 		>       >>> it using TTS output together with other media? Is there a use case
> 		>       >>> for this? And is there anything we need to specify here at all?
> 		>       >>>
> 		>       >>> For recognition results, there is NLSML, but as far as I can tell,
> 		>       >>> that hasn't been widely adopted. Also, it seems like it could be a
> 		>       >>> bit complex for web applications to process.
> 		>       >>>
> 		>       >>> /Bjorn
> 		>       >>>
> 		>       >>> On Fri, Oct 22, 2010 at 1:06 AM, Dan Burnett <dburnett@voxeo.com>
> 		>       >>> wrote:
> 		>       >>>>
> 		>       >>>> Group,
> 		>       >>>>
> 		>       >>>> This is the second of the requirements to discuss and prioritize
> 		>       >>>> based on our ranking approach [1].
> 		>       >>>>
> 		>       >>>> This email is the beginning of a thread for questions, discussion,
> 		>       >>>> and opinions regarding our first draft of Requirement 27 [2].
> 		>       >>>>
> 		>       >>>> After our discussion and any modifications to the requirement, our
> 		>       >>>> goal is to prioritize this requirement as either "Should Address"
> 		>       >>>> or "For Future Consideration".
> 		>       >>>>
> 		>       >>>> -- dan
> 		>       >>>>
> 		>       >>>> [1]
> 		>       >>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
> 		>       >>>>
> 		>       >>>> [2]
> 		>       >>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
> 		>       >>>>
> 		>       >>>>
> 		>       >>>
> 		>       >>>
> 		>       >>>
> 		>       >>> --
> 		>       >>> Bjorn Bringert
> 		>       >>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> 		>       >>> Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
> 		>       >>
> 		>       >>
> 		>       >>
> 		>       >>
> 		>       >
> 		>       >
> 		>
> 		>
> 		>
> 		>       --
> 		>       Bjorn Bringert
> 		>       Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> 		>       Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
> 		>
> 		>
> 		>
> 
> 
> 
> 
> 
>
Received on Wednesday, 27 October 2010 14:31:32 UTC