RE: R27. Grammars, TTS, media composition, and recognition results should all use standard formats

I think it's good to make the requirements more fine-grained and to separate recognition from TTS; that will make them easier to discuss.
We could also apply that strategy to the proposed 27c, so we could have:
27c. Recognition results should be based upon a standard such as EMMA.
27d. Recognition results should be in an easy-to-process format such as JSON.
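
To make 27c/27d concrete, here is a rough sketch (not a format proposal; the property names are just invented here, loosely following the EMMA attribute names) of how the EMMA example Michael quoted below might look as JSON:

  { "one-of": [
      { "confidence": 0.75,
        "tokens": "flights from boston to denver",
        "interpretation": { "origin": "Boston", "destination": "Denver" } },
      { "confidence": 0.68,
        "tokens": "flights from austin to denver",
        "interpretation": { "origin": "Austin", "destination": "Denver" } } ] }

A script could then read the top interpretation with something like result["one-of"][0].interpretation, while the full EMMA document could still be made available to applications that need the richer information.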

I have to admit that I was never sure what "media composition" meant. If it means synchronized media, for example the kind of thing you would use SMIL for, then that seems out of scope for this group, but maybe the intended meaning was something else.

> -----Original Message-----
> From: Bjorn Bringert [mailto:bringert@google.com]
> Sent: Friday, October 29, 2010 4:37 AM
> To: Dan Burnett
> Cc: Olli@pettay.fi; Raj(Openstream); Dave Burke; Michael Bodell; Deborah
> Dahl; public-xg-htmlspeech@w3.org
> Subject: Re: R27. Grammars, TTS, media composition, and recognition results
> should all use standard formats
> 
> Sounds like a good idea. How about going a bit further:
> 
> 27a. Speech recognition grammars should use standard formats such as
> SRGS and SISR.
> 27b. TTS should use standard formats such as SSML.
> 27c. Recognition results should be based upon a standard such as EMMA
> but be in an easy-to-process format such as JSON.
> 
> - This puts recognition and synthesis in separate requirements, since
> we might end up with separate specs for them.
> 
> - This drops media composition. There have been no proposed use cases
> that need it, and no existing standard for it has been proposed on
> this thread.
> 
> /Bjorn
> 
> On Fri, Oct 29, 2010 at 1:40 AM, Dan Burnett <dburnett@voxeo.com> wrote:
> > Because
> > a) we are operating at a requirements level currently,
> > b) we essentially have agreement on SRGS, SISR, and SSML, and
> > c) we are beginning to agree on a direction for the recognition results,
> >
> > I propose we split this requirement into two:
> >
> > 27a.  Grammars, TTS, and media composition should all use standard
> > formats such as SRGS, SISR, and SSML.
> > 27b.  Recognition results should be based upon a standard such as EMMA
> > but be in an easy-to-process format such as JSON.
> >
> > I suspect this will simplify our determination of which requirements we can
> > list as "Should Address".
> > Thoughts?  Objections?
> >
> > -- dan
> >
> >
> >
> >
> > On Oct 27, 2010, at 1:13 PM, Olli Pettay wrote:
> >
> >> On 10/27/2010 05:09 PM, Bjorn Bringert wrote:
> >>>
> >>> What's the simplest code (e.g. in JavaScript + DOM) needed to extract
> >>> the text of the best utterance from any EMMA document that a
> >>> recognizer might return? Michael's code works for the given example,
> >>> but not for an arbitrary EMMA document.
> >>
> >> Actually, the code might not work, *if* I read the EMMA spec correctly,
> >> since it uses getElementById and the id attribute is not defined to be
> >> an ID on emma:interpretation.
> >> (Though that would just be a spec bug.)
> >>
> >>> I understand that many apps want to do more complex things, but I
> >>> would like the API that we end up with to satisfy both parts of
> >>> "Simple things should be easy and complex things should be possible".
> >>
> >> Totally agree with this.
> >>
> >> I wonder if we could specify some *small* subset of the features we need
> >> from EMMA and expose those as JSON or some other JS-friendly object
> >> in the first version of the upcoming API.
> >> Then in v2 support for full EMMA could be added.
> >> And in the meantime the MMI WG could perhaps develop a JSON version
> >> of the result format.
> >>
> >>
> >> I'm hoping we could come up with some reasonably small and simple API
> >> as version 1 and then do more in the next revisions.
> >> Something similar to what is happening with Web Notifications.
> >>
> >> -Olli
> >>
> >>
> >>>
> >>> /Bjorn
> >>>
> >>> On Wed, Oct 27, 2010 at 2:57 PM, Raj(Openstream)
> >>> <raj@openstream.com> wrote:
> >>>>
> >>>> From our developers' experience, they don't seem to find JavaScript
> >>>> any simpler than using EMMA... and all of them, needless to say, are
> >>>> Web developers to begin with.
> >>>>
> >>>> Raj
> >>>>
> >>>> ----- Original Message -----
> >>>> From: Dave Burke
> >>>> To: Michael Bodell
> >>>> Cc: Bjorn Bringert; Dan Burnett; Deborah Dahl;
> >>>> public-xg-htmlspeech@w3.org
> >>>> Sent: Tuesday, October 26, 2010 5:48 PM
> >>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
> >>>> results should all use standard formats
> >>>>
> >>>> Seems convoluted to force developers to have to understand EMMA when
> >>>> we could have a simpler JavaScript object. What does EMMA buy the
> >>>> typical Web developer?
> >>>> Dave
> >>>>
> >>>> On Tue, Oct 26, 2010 at 10:43 PM, Michael Bodell
> >>>> <mbodell@microsoft.com> wrote:
> >>>>>
> >>>>> Here's the first EMMA example from the specification:
> >>>>>
> >>>>> <emma:emma version="1.0"
> >>>>>   xmlns:emma="http://www.w3.org/2003/04/emma"
> >>>>>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> >>>>>   xsi:schemaLocation="http://www.w3.org/2003/04/emma
> >>>>>    http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
> >>>>>   xmlns="http://www.example.com/example">
> >>>>>  <emma:one-of id="r1" emma:start="1087995961542"
> >>>>>    emma:end="1087995963542"
> >>>>>    emma:medium="acoustic" emma:mode="voice">
> >>>>>   <emma:interpretation id="int1" emma:confidence="0.75"
> >>>>>   emma:tokens="flights from boston to denver">
> >>>>>     <origin>Boston</origin>
> >>>>>     <destination>Denver</destination>
> >>>>>   </emma:interpretation>
> >>>>>
> >>>>>   <emma:interpretation id="int2" emma:confidence="0.68"
> >>>>>   emma:tokens="flights from austin to denver">
> >>>>>     <origin>Austin</origin>
> >>>>>     <destination>Denver</destination>
> >>>>>   </emma:interpretation>
> >>>>>  </emma:one-of>
> >>>>> </emma:emma>
> >>>>>
> >>>>> Using something like XPath it is very simple to do something like
> >>>>> '//interpretation[@confidence > 0.6][1]' or '//interpretation/origin'.
> >>>>>
> >>>>> Using DOM one could easily do something like getElementById("int1")
> >>>>> and inspect that element, or else
> >>>>> getElementsByTagName("emma:interpretation").
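> >>>>>
> >>>>> For instance, a rough sketch (assuming the result has been parsed
> >>>>> into a DOM Document called emmaDoc, and that the id attribute is
> >>>>> treated as an ID):
> >>>>>
> >>>>>   // Pick out the first interpretation and read its slots.
> >>>>>   var interp = emmaDoc.getElementById("int1");
> >>>>>   var origin =
> >>>>>       interp.getElementsByTagName("origin")[0].textContent;
> >>>>>   var destination =
> >>>>>       interp.getElementsByTagName("destination")[0].textContent;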
> >>>>>
> >>>>> If you had a more E4X approach you could imagine
> >>>>> result["one-of"].interpretation[0] would give you the first result.
> >>>>>
> >>>>> The JSON representation of content might be:
> >>>>> ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
> >>>>> {origin:"Austin", destination:"Denver"}]}}).
> >>>>>
> >>>>> In addition, depending on how the recognition is defined there might
> >>>>> be one or more default bindings of recognition results to input
> >>>>> elements in HTML such that scripting isn't needed for the "common
> >>>>> tasks" but the scripting is there for the more advanced tasks.
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Bjorn Bringert [mailto:bringert@google.com]
> >>>>> Sent: Monday, October 25, 2010 5:43 AM
> >>>>> To: Dan Burnett
> >>>>> Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
> >>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
> >>>>> results should all use standard formats
> >>>>>
> >>>>> I haven't used EMMA, but it looks like it could be a bit complex for a
> >>>>> script to simply get the top utterance or interpretation out. Are there
> >>>>> any shorthands or DOM methods for this? Any Hello World examples to
> >>>>> show the basic usage?
> >>>>>
> >>>>> /Bjorn
> >>>>>
> >>>>> On Mon, Oct 25, 2010 at 1:38 PM, Dan Burnett
> >>>>> <dburnett@voxeo.com> wrote:
> >>>>>>
> >>>>>> +1
> >>>>>> On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
> >>>>>>
> >>>>>>> I agree that SRGS, SISR, EMMA, and SSML seem like the obvious W3C
> >>>>>>> standard formats that we should use.
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: public-xg-htmlspeech-request@w3.org
> >>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of
> >>>>>>> Deborah Dahl
> >>>>>>> Sent: Friday, October 22, 2010 6:39 AM
> >>>>>>> To: 'Bjorn Bringert'; 'Dan Burnett'
> >>>>>>> Cc: public-xg-htmlspeech@w3.org
> >>>>>>> Subject: RE: R27. Grammars, TTS, media composition, and
> >>>>>>> recognition results should all use standard formats
> >>>>>>>
> >>>>>>> For recognition results, EMMA
> >>>>>>> http://www.w3.org/TR/2009/REC-emma-20090210/
> >>>>>>> is a much more recent and more complete standard than NLSML. EMMA
> >>>>>>> has a very rich set of capabilities, but most of them are optional,
> >>>>>>> so that using it doesn't have to be complex. Quite a few recognizers
> >>>>>>> support it. I think one of the most valuable aspects of EMMA is that
> >>>>>>> as applications eventually start finding that they need more and more
> >>>>>>> information about the recognition result, much of that more advanced
> >>>>>>> information has already been worked out and standardized in EMMA.
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: public-xg-htmlspeech-request@w3.org
> >>>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of
> >>>>>>>> Bjorn Bringert
> >>>>>>>> Sent: Friday, October 22, 2010 7:01 AM
> >>>>>>>> To: Dan Burnett
> >>>>>>>> Cc: public-xg-htmlspeech@w3.org
> >>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and
> >>>>>>>> recognition results should all use standard formats
> >>>>>>>>
> >>>>>>>> For grammars, SRGS + SISR seems like the obvious choice.
> >>>>>>>>
> >>>>>>>> For TTS, SSML seems like the obvious choice.
> >>>>>>>>
> >>>>>>>> I'm not exactly sure what is meant by media composition here. Is it
> >>>>>>>> using TTS output together with other media? Is there a use case for
> >>>>>>>> this? And is there anything we need to specify here at all?
> >>>>>>>>
> >>>>>>>> For recognition results, there is NLSML, but as far as I can tell,
> >>>>>>>> that hasn't been widely adopted. Also, it seems like it could be a
> >>>>>>>> bit complex for web applications to process.
> >>>>>>>>
> >>>>>>>> /Bjorn
> >>>>>>>>
> >>>>>>>> On Fri, Oct 22, 2010 at 1:06 AM, Dan Burnett
> >>>>>>>> <dburnett@voxeo.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Group,
> >>>>>>>>>
> >>>>>>>>> This is the second of the requirements to discuss and prioritize
> >>>>>>>>> based on our ranking approach [1].
> >>>>>>>>>
> >>>>>>>>> This email is the beginning of a thread for questions, discussion,
> >>>>>>>>> and opinions regarding our first draft of Requirement 27 [2].
> >>>>>>>>>
> >>>>>>>>> After our discussion and any modifications to the requirement, our
> >>>>>>>>> goal is to prioritize this requirement as either "Should Address"
> >>>>>>>>> or "For Future Consideration".
> >>>>>>>>>
> >>>>>>>>> -- dan
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
> >>>>>>>>>
> >>>>>>>>> [2]
> >>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Bjorn Bringert
> >>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> >>>>>>>> Palace Road, London, SW1W 9TQ
> >>>>>>>> Registered in England Number: 3977902
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Bjorn Bringert
> >>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> >>>>> Palace Road, London, SW1W 9TQ
> >>>>> Registered in England Number: 3977902
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> 
> 
> 
> --
> Bjorn Bringert
> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> Palace Road, London, SW1W 9TQ
> Registered in England Number: 3977902

Received on Friday, 29 October 2010 14:10:27 UTC