- From: Dan Burnett <dburnett@voxeo.com>
- Date: Sun, 31 Oct 2010 12:23:10 -0400
- To: Deborah Dahl <dahl@conversational-technologies.com>
- Cc: "'Bjorn Bringert'" <bringert@google.com>, <Olli@pettay.fi>, "'Raj(Openstream)'" <raj@openstream.com>, "'Dave Burke'" <daveburke@google.com>, "'Michael Bodell'" <mbodell@microsoft.com>, <public-xg-htmlspeech@w3.org>
I don't think we need to split 27c into the two parts, but I have no real objection to it.

I also didn't know what "media composition" meant. I agree with dropping it unless and until someone comes up with a good explanation.

If anyone has an objection to the following new requirements derived from and intended to replace the original #27, please speak up:

27a. Speech recognition grammars should use standard formats such as SRGS and SISR.
27b. TTS should use standard formats such as SSML.
27c. Recognition results should be based upon a standard such as EMMA
27d. Recognition results should be in an easy to process format such as JSON.

-- dan

On Oct 29, 2010, at 10:09 AM, Deborah Dahl wrote:

> I think it's good to make the requirements more fine-grained and to separate recognition from TTS. I think it will make them easier to discuss.
> We could also apply that strategy to the proposed 27c, so we could have
> 27c. Recognition results should be based upon a standard such as EMMA
> 27d. Recognition results should be in an easy to process format such as JSON.
>
> I have to admit that I was never sure what "media composition" meant. If it means synchronized media, for example that you would use SMIL for, then that seems out of scope for this group, but maybe the intended meaning was something else.
>
>> -----Original Message-----
>> From: Bjorn Bringert [mailto:bringert@google.com]
>> Sent: Friday, October 29, 2010 4:37 AM
>> To: Dan Burnett
>> Cc: Olli@pettay.fi; Raj(Openstream); Dave Burke; Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
>> Subject: Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats
>>
>> Sounds like a good idea. How about going a bit further:
>>
>> 27a. Speech recognition grammars should use standard formats such as SRGS and SISR.
>> 27b. TTS should use standard formats such as SSML.
>> 27c. Recognition results should be based upon a standard such as EMMA but be in an easy-to-process format such as JSON.
>>
>> - This puts recognition and synthesis in separate requirements, since we might end up with separate specs for them.
>>
>> - This drops media composition. There have been no proposed use cases that need it, and no existing standard for it has been proposed on this thread.
>>
>> /Bjorn
>>
>> On Fri, Oct 29, 2010 at 1:40 AM, Dan Burnett <dburnett@voxeo.com> wrote:
>>> Because
>>> a) we are operating at a requirements level currently,
>>> b) we essentially have agreement on SRGS, SISR, and SSML, and
>>> c) we are beginning to agree on a direction for the recognition results,
>>>
>>> I propose we split this requirement into two:
>>>
>>> 27a. Grammars, TTS, and media composition should all use standard formats such as SRGS, SISR, and SSML.
>>> 27b. Recognition results should be based upon a standard such as EMMA but be in an easy-to-process format such as JSON.
>>>
>>> I suspect this will simplify our determination of which requirements we can list as "Should Address".
>>> Thoughts? Objections?
>>>
>>> -- dan
>>>
>>> On Oct 27, 2010, at 1:13 PM, Olli Pettay wrote:
>>>
>>>> On 10/27/2010 05:09 PM, Bjorn Bringert wrote:
>>>>>
>>>>> What's the simplest code (e.g. in JavaScript + DOM) needed to extract the text of the best utterance from any EMMA document that a recognizer might return? Michael's code works for the given example, but not for an arbitrary EMMA document.
>>>>
>>>> Actually, the code might not work, *if* I read EMMA spec correctly, since it uses getElementById and id attribute is not defined to be ID in emma:interpretation.
>>>> (Though, that would be just a spec bug)
>>>>
>>>>> I understand that many apps want to do more complex things, but I would like the API that we end up with to satisfy both parts of "Simple things should be easy and complex things should be possible".
>>>>
>>>> Totally agree with this.
>>>>
>>>> I wonder if we could specify some *small* subset of features we need from EMMA and expose those as a JSON or some other JS friendly object in the first version of the becoming API.
>>>> Then in the v2 support for full EMMA could be added.
>>>> And in the mean while MMI WG could perhaps develop JSON version of the result format.
>>>>
>>>> I'm hoping we could come up some reasonable small and simple API as version 1 and then do more in the next revisions.
>>>> Something similar what is happening with Web Notifications.
>>>>
>>>> -Olli
>>>>
>>>>>
>>>>> /Bjorn
>>>>>
>>>>> On Wed, Oct 27, 2010 at 2:57 PM, Raj(Openstream)<raj@openstream.com> wrote:
>>>>>>
>>>>>> From our developers' experience, they don't seem to find Javascript any simpler than using EMMA....and all of them needless to say are Web developers to being with..
>>>>>>
>>>>>> Raj
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: Dave Burke
>>>>>> To: Michael Bodell
>>>>>> Cc: Bjorn Bringert ; Dan Burnett ; Deborah Dahl ; public-xg-htmlspeech@w3.org
>>>>>> Sent: Tuesday, October 26, 2010 5:48 PM
>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats
>>>>>> Seems convoluted to force developers to have to understand EMMA when we could have a simpler JavaScript object. What does EMMA buy the typical Web developer?
>>>>>> Dave
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 10:43 PM, Michael Bodell<mbodell@microsoft.com> wrote:
>>>>>>>
>>>>>>> Here's the first EMMA example from the specification:
>>>>>>>
>>>>>>> <emma:emma version="1.0"
>>>>>>>     xmlns:emma="http://www.w3.org/2003/04/emma"
>>>>>>>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>>>>     xsi:schemaLocation="http://www.w3.org/2003/04/emma
>>>>>>>       http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>>>>>>>     xmlns="http://www.example.com/example">
>>>>>>>   <emma:one-of id="r1" emma:start="1087995961542"
>>>>>>>       emma:end="1087995963542"
>>>>>>>       emma:medium="acoustic" emma:mode="voice">
>>>>>>>     <emma:interpretation id="int1" emma:confidence="0.75"
>>>>>>>         emma:tokens="flights from boston to denver">
>>>>>>>       <origin>Boston</origin>
>>>>>>>       <destination>Denver</destination>
>>>>>>>     </emma:interpretation>
>>>>>>>
>>>>>>>     <emma:interpretation id="int2" emma:confidence="0.68"
>>>>>>>         emma:tokens="flights from austin to denver">
>>>>>>>       <origin>Austin</origin>
>>>>>>>       <destination>Denver</destination>
>>>>>>>     </emma:interpretation>
>>>>>>>   </emma:one-of>
>>>>>>> </emma:emma>
>>>>>>>
>>>>>>> Using something like xpath it is very simple to do something like '//interpretation[@confidence> 0.6][1]' or '//interpretation/origin'.
>>>>>>>
>>>>>>> Using DOM one could easily do something like getElementsById("int1") and inspect that element or else getElementsByName("interpretation").
>>>>>>>
>>>>>>> If you had a more E4X approach you could imagine result["one-of"].interpretation[0] would give you the first result.
>>>>>>>
>>>>>>> The JSON representation of content might be:
>>>>>>> ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"}, {origin:"Austin", destination:"Denver"}]}}).
>>>>>>>
>>>>>>> In addition, depending on how the recognition is defined there might be one or more default bindings of recognition results to input elements in HTML such that scripting isn't needed for the "common tasks" but the scripting is there for the more advanced tasks.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Bjorn Bringert [mailto:bringert@google.com]
>>>>>>> Sent: Monday, October 25, 2010 5:43 AM
>>>>>>> To: Dan Burnett
>>>>>>> Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats
>>>>>>>
>>>>>>> I haven't used EMMA, but it looks like it could be a bit complex for a script to simply get the top utterance or interpretation out. Are there any shorthands or DOM methods for this? Any Hello World examples to show the basic usage?
>>>>>>>
>>>>>>> /Bjorn
>>>>>>>
>>>>>>> On Mon, Oct 25, 2010 at 1:38 PM, Dan Burnett<dburnett@voxeo.com> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>> On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
>>>>>>>>
>>>>>>>>> I agree that SRGS, SISR, EMMA, and SSML seems like the obvious W3C standard formats that we should use.
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Deborah Dahl
>>>>>>>>> Sent: Friday, October 22, 2010 6:39 AM
>>>>>>>>> To: 'Bjorn Bringert'; 'Dan Burnett'
>>>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>>>> Subject: RE: R27. Grammars, TTS, media composition, and recognition results should all use standard formats
>>>>>>>>>
>>>>>>>>> For recognition results, EMMA http://www.w3.org/TR/2009/REC-emma-20090210/ is a much more recent and more complete standard than NLSML. EMMA has a very rich set of capabilities, but most of them are optional, so that using it doesn't have to be complex. Quite a few recognizers support it. I think one of the most valuable aspects of EMMA is that as applications eventually start finding that they need more and more information about the recognition result, much of that more advanced information has already been worked out and standardized in EMMA.
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
>>>>>>>>>> Sent: Friday, October 22, 2010 7:01 AM
>>>>>>>>>> To: Dan Burnett
>>>>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats
>>>>>>>>>>
>>>>>>>>>> For grammars, SRGS + SISR seems like the obvious choice.
>>>>>>>>>>
>>>>>>>>>> For TTS, SSML seems like the obvious choice.
>>>>>>>>>>
>>>>>>>>>> I'm not exactly what is meant by media composition here. Is it using TTS output together with other media? Is there a use case for this? And is there anything we need to specify here at all?
>>>>>>>>>>
>>>>>>>>>> For recognition results, there is NLSML, but as far as I can tell, that hasn't been widely adopted. Also, it seems like it could be a bit complex for web applications to process.
>>>>>>>>>>
>>>>>>>>>> /Bjorn
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 22, 2010 at 1:06 AM, Dan Burnett<dburnett@voxeo.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Group,
>>>>>>>>>>>
>>>>>>>>>>> This is the second of the requirements to discuss and prioritize based our ranking approach [1].
>>>>>>>>>>>
>>>>>>>>>>> This email is the beginning of a thread for questions, discussion, and opinions regarding our first draft of Requirement 27 [2].
>>>>>>>>>>>
>>>>>>>>>>> After our discussion and any modifications to the requirement, our goal is to prioritize this requirement as either "Should Address" or "For Future Consideration".
>>>>>>>>>>>
>>>>>>>>>>> -- dan
>>>>>>>>>>>
>>>>>>>>>>> [1] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>>>>>>>>>> [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Bjorn Bringert
>>>>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>>>>>>>>> Registered in England Number: 3977902
>>>>>>>
>>>>>>> --
>>>>>>> Bjorn Bringert
>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>>>>>>> Registered in England Number: 3977902
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
Received on Sunday, 31 October 2010 16:23:50 UTC
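
On Bjorn's question in the quoted thread (the simplest JavaScript needed to get the best utterance out of a result), the following is one possible sketch, not anything the group agreed on. It assumes two things that are not in the thread: a JSON-style result shaped like Michael's `{'one-of': {interpretation: [...]}}` example with `confidence` and `tokens` fields added, and three hypothetical helper names (`pickBest`, `bestFromJson`, `bestFromEmmaDocument`). The DOM variant looks elements up by namespace rather than by `id`, which sidesteps the getElementById concern Olli raised, but it only handles `emma:interpretation` elements and is not a general EMMA reader.

```javascript
// Namespace URI taken from the EMMA example quoted in the thread.
const EMMA_NS = "http://www.w3.org/2003/04/emma";

// Hypothetical helper: return the candidate with the highest numeric
// confidence, or null for an empty list. A missing or non-numeric
// confidence is treated as 0; on ties the earliest candidate wins.
function pickBest(candidates) {
  let best = null;
  for (const candidate of candidates) {
    const score = Number.isFinite(candidate.confidence) ? candidate.confidence : 0;
    if (best === null || score > best.confidence) {
      best = { ...candidate, confidence: score };
    }
  }
  return best;
}

// JSON-style result. Assumed shape: Michael's 'one-of' example with
// `confidence` and `tokens` added to each interpretation.
function bestFromJson(result) {
  const oneOf = result["one-of"];
  const list = oneOf && Array.isArray(oneOf.interpretation) ? oneOf.interpretation : [];
  return pickBest(list);
}

// DOM variant: expects an already parsed EMMA Document and reads the
// namespaced emma:tokens and emma:confidence attributes.
function bestFromEmmaDocument(doc) {
  const nodes = Array.from(doc.getElementsByTagNameNS(EMMA_NS, "interpretation"));
  return pickBest(
    nodes.map((el) => ({
      tokens: el.getAttributeNS(EMMA_NS, "tokens"),
      confidence: parseFloat(el.getAttributeNS(EMMA_NS, "confidence")),
    }))
  );
}

// Sample data mirroring the two interpretations in the quoted EMMA example.
const sample = {
  "one-of": {
    interpretation: [
      { origin: "Boston", destination: "Denver", confidence: 0.75, tokens: "flights from boston to denver" },
      { origin: "Austin", destination: "Denver", confidence: 0.68, tokens: "flights from austin to denver" },
    ],
  },
};

console.log(bestFromJson(sample).tokens); // prints "flights from boston to denver"
```

In a browser, the DOM variant could be fed with `new DOMParser().parseFromString(xml, "application/xml")`. Whether a v1 API should expose something like the JSON shape, the full EMMA document, or both is exactly the open question in requirements 27c and 27d.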