Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats from Dan Burnett on 2010-10-31 (public-xg-htmlspeech@w3.org from October 2010)

From: Dan Burnett <dburnett@voxeo.com>
Date: Sun, 31 Oct 2010 12:23:10 -0400
To: Deborah Dahl <dahl@conversational-technologies.com>
Cc: "'Bjorn Bringert'" <bringert@google.com>, <Olli@pettay.fi>, "'Raj\(Openstream\)'" <raj@openstream.com>, "'Dave Burke'" <daveburke@google.com>, "'Michael Bodell'" <mbodell@microsoft.com>, <public-xg-htmlspeech@w3.org>
Message-Id: <205637E2-4A66-4623-9FCF-7E1AFC614D00@voxeo.com>
I don't think we need to split 27c into the two parts, but I have no  
real objection to it.

I also didn't know what "media composition" meant.  I agree with  
dropping it unless and until someone comes up with a good explanation.

If anyone has an objection to the following new requirements derived  
from and intended to replace the original #27, please speak up:
27a. Speech recognition grammars should use standard formats such as  
SRGS and SISR.
27b. TTS should use standard formats such as SSML.
27c. Recognition results should be based upon a standard such as EMMA.
27d. Recognition results should be in an easy to process format such  
as JSON.
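
For illustration only, here is a rough sketch of how 27c and 27d might fit
together: a result that borrows a few concepts from EMMA (confidence,
tokens) but arrives as JSON. The field names "interpretations",
"confidence", "tokens", and "value", and the helper bestInterpretation,
are assumptions made up for this example, not an agreed format or API.
The sample data is taken from the first example in the EMMA specification.

```javascript
// Hypothetical EMMA-derived result serialized as JSON.
// Field names loosely mirror emma:confidence and emma:tokens; they are
// assumptions for illustration, not part of any agreed format.
const resultJson = `{
  "interpretations": [
    {"confidence": 0.68, "tokens": "flights from austin to denver",
     "value": {"origin": "Austin", "destination": "Denver"}},
    {"confidence": 0.75, "tokens": "flights from boston to denver",
     "value": {"origin": "Boston", "destination": "Denver"}}
  ]
}`;

// Return the interpretation with the highest confidence, or null if the
// result carries no interpretations at all.
function bestInterpretation(result) {
  const list = result.interpretations || [];
  return list.reduce(
    (best, cur) =>
      best === null || cur.confidence > best.confidence ? cur : best,
    null
  );
}

const best = bestInterpretation(JSON.parse(resultJson));
console.log(best.tokens); // prints "flights from boston to denver"
```

The point of the sketch is only that the common case (get the top result)
stays a one-liner for the page author, while the underlying data model
could still follow EMMA.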

-- dan

On Oct 29, 2010, at 10:09 AM, Deborah Dahl wrote:

> I think it's good to make the requirements more fine-grained and to  
> separate recognition from TTS. I think it will make them easier to  
> discuss.
> We could also apply that strategy to the proposed 27c, so we could  
> have
> 27c. Recognition results should be based upon a standard such as EMMA
> 27d. Recognition results should be in an easy to process format such  
> as JSON.
>
> I have to admit that I was never sure what "media composition"  
> meant. If it means synchronized media, for example that you would  
> use SMIL for, then that seems out of scope for this group, but maybe  
> the intended meaning was something else.
>
>> -----Original Message-----
>> From: Bjorn Bringert [mailto:bringert@google.com]
>> Sent: Friday, October 29, 2010 4:37 AM
>> To: Dan Burnett
>> Cc: Olli@pettay.fi; Raj(Openstream); Dave Burke; Michael Bodell;  
>> Deborah
>> Dahl; public-xg-htmlspeech@w3.org
>> Subject: Re: R27. Grammars, TTS, media composition, and recognition  
>> results
>> should all use standard formats
>>
>> Sounds like a good idea. How about going a bit further:
>>
>> 27a. Speech recognition grammars should use standard formats such as
>> SRGS and SISR.
>> 27b. TTS should use standard formats such as SSML.
>> 27c. Recognition results should be based upon a standard such as EMMA
>> but be in an easy-to-process format such as JSON.
>>
>> - This puts recognition and synthesis in separate requirements, since
>> we might end up with separate specs for them.
>>
>> - This drops media composition. There have been no proposed use cases
>> that need it, and no existing standard for it has been proposed on
>> this thread.
>>
>> /Bjorn
>>
>> On Fri, Oct 29, 2010 at 1:40 AM, Dan Burnett <dburnett@voxeo.com>  
>> wrote:
>>> Because
>>> a) we are operating at a requirements level currently,
>>> b) we essentially have agreement on SRGS, SISR, and SSML, and
>>> c) we are beginning to agree on a direction for the recognition results,
>>>
>>> I propose we split this requirement into two:
>>>
>>> 27a.  Grammars, TTS, and media composition should all use standard
>>> formats such as SRGS, SISR, and SSML.
>>> 27b.  Recognition results should be based upon a standard such as EMMA
>>> but be in an easy-to-process format such as JSON.
>>>
>>> I suspect this will simplify our determination of which requirements we
>>> can list as "Should Address".
>>> Thoughts?  Objections?
>>>
>>> -- dan
>>>
>>>
>>>
>>>
>>> On Oct 27, 2010, at 1:13 PM, Olli Pettay wrote:
>>>
>>>> On 10/27/2010 05:09 PM, Bjorn Bringert wrote:
>>>>>
>>>>> What's the simplest code (e.g. in JavaScript + DOM) needed to extract
>>>>> the text of the best utterance from any EMMA document that a
>>>>> recognizer might return? Michael's code works for the given example,
>>>>> but not for an arbitrary EMMA document.
>>>>
>>>> Actually, the code might not work, *if* I read the EMMA spec correctly,
>>>> since it uses getElementById and the id attribute is not defined to be
>>>> of type ID in emma:interpretation.
>>>> (Though that would be just a spec bug.)
>>>>
>>>>> I understand that many apps want to do more complex things, but I
>>>>> would like the API that we end up with to satisfy both parts of
>>>>> "Simple things should be easy and complex things should be possible".
>>>>
>>>> Totally agree with this.
>>>>
>>>> I wonder if we could specify some *small* subset of the features we
>>>> need from EMMA and expose those as JSON or some other JS-friendly
>>>> object in the first version of the upcoming API.
>>>> Then support for full EMMA could be added in v2.
>>>> And in the meantime the MMI WG could perhaps develop a JSON version
>>>> of the result format.
>>>>
>>>>
>>>> I'm hoping we could come up with some reasonably small and simple API
>>>> as version 1 and then do more in the next revisions.
>>>> Something similar to what is happening with Web Notifications.
>>>>
>>>> -Olli
>>>>
>>>>
>>>>>
>>>>> /Bjorn
>>>>>
>>>>> On Wed, Oct 27, 2010 at 2:57 PM,
>> Raj(Openstream)<raj@openstream.com>
>>>>> wrote:
>>>>>>
>>>>>> From our developers' experience, they don't seem to find JavaScript
>>>>>> any simpler than using EMMA... and all of them, needless to say, are
>>>>>> Web developers to begin with.
>>>>>>
>>>>>> Raj
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: Dave Burke
>>>>>> To: Michael Bodell
>>>>>> Cc: Bjorn Bringert ; Dan Burnett ; Deborah Dahl ;
>>>>>> public-xg-htmlspeech@w3.org
>>>>>> Sent: Tuesday, October 26, 2010 5:48 PM
>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and  
>>>>>> recognition
>>>>>> results
>>>>>> should all use standard formats
>>>>>> Seems convoluted to force developers to have to understand EMMA when
>>>>>> we could have a simpler JavaScript object. What does EMMA buy the
>>>>>> typical Web developer?
>>>>>> Dave
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 10:43 PM, Michael
>> Bodell<mbodell@microsoft.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Here's the first EMMA example from the specification:
>>>>>>>
>>>>>>> <emma:emma version="1.0"
>>>>>>>  xmlns:emma="http://www.w3.org/2003/04/emma"
>>>>>>>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>>>>  xsi:schemaLocation="http://www.w3.org/2003/04/emma
>>>>>>>   http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>>>>>>>  xmlns="http://www.example.com/example">
>>>>>>> <emma:one-of id="r1" emma:start="1087995961542"
>>>>>>> emma:end="1087995963542"
>>>>>>>   emma:medium="acoustic" emma:mode="voice">
>>>>>>>  <emma:interpretation id="int1" emma:confidence="0.75"
>>>>>>>  emma:tokens="flights from boston to denver">
>>>>>>>    <origin>Boston</origin>
>>>>>>>    <destination>Denver</destination>
>>>>>>>  </emma:interpretation>
>>>>>>>
>>>>>>>  <emma:interpretation id="int2" emma:confidence="0.68"
>>>>>>>  emma:tokens="flights from austin to denver">
>>>>>>>    <origin>Austin</origin>
>>>>>>>    <destination>Denver</destination>
>>>>>>>  </emma:interpretation>
>>>>>>> </emma:one-of>
>>>>>>> </emma:emma>
>>>>>>>
>>>>>>> Using something like XPath it is very simple to do something like
>>>>>>> '//interpretation[@confidence > 0.6][1]' or
>>>>>>> '//interpretation/origin'.
>>>>>>>
>>>>>>> Using DOM one could easily do something like getElementById("int1")
>>>>>>> and inspect that element, or else
>>>>>>> getElementsByTagName("emma:interpretation").
>>>>>>>
>>>>>>> If you had a more E4X approach you could imagine
>>>>>>> result["one-of"].interpretation[0] would give you the first result.
>>>>>>>
>>>>>>> The JSON representation of content might be:
>>>>>>> ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
>>>>>>> {origin:"Austin", destination:"Denver"}]}}).
>>>>>>>
>>>>>>> In addition, depending on how the recognition is defined there might
>>>>>>> be one or more default bindings of recognition results to input
>>>>>>> elements in HTML such that scripting isn't needed for the "common
>>>>>>> tasks" but the scripting is there for the more advanced tasks.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Bjorn Bringert [mailto:bringert@google.com]
>>>>>>> Sent: Monday, October 25, 2010 5:43 AM
>>>>>>> To: Dan Burnett
>>>>>>> Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and  
>>>>>>> recognition
>>>>>>> results should all use standard formats
>>>>>>>
>>>>>>> I haven't used EMMA, but it looks like it could be a bit complex for
>>>>>>> a script to simply get the top utterance or interpretation out. Are
>>>>>>> there any shorthands or DOM methods for this? Any Hello World
>>>>>>> examples to show the basic usage?
>>>>>>>
>>>>>>> /Bjorn
>>>>>>>
>>>>>>> On Mon, Oct 25, 2010 at 1:38 PM, Dan
>> Burnett<dburnett@voxeo.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>> On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
>>>>>>>>
>>>>>>>>> I agree that SRGS, SISR, EMMA, and SSML seem like the obvious W3C
>>>>>>>>> standard formats that we should use.
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: public-xg-htmlspeech-request@w3.org
>>>>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of
>> Deborah
>>>>>>>>> Dahl
>>>>>>>>> Sent: Friday, October 22, 2010 6:39 AM
>>>>>>>>> To: 'Bjorn Bringert'; 'Dan Burnett'
>>>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>>>> Subject: RE: R27. Grammars, TTS, media composition, and
>> recognition
>>>>>>>>> results should all use standard formats
>>>>>>>>>
>>>>>>>>> For recognition results, EMMA
>>>>>>>>> http://www.w3.org/TR/2009/REC-emma-20090210/
>>>>>>>>> is a much more recent and more complete standard than NLSML. EMMA
>>>>>>>>> has a very rich set of capabilities, but most of them are optional,
>>>>>>>>> so that using it doesn't have to be complex. Quite a few recognizers
>>>>>>>>> support it. I think one of the most valuable aspects of EMMA is that
>>>>>>>>> as applications eventually start finding that they need more and
>>>>>>>>> more information about the recognition result, much of that more
>>>>>>>>> advanced information has already been worked out and standardized in
>>>>>>>>> EMMA.
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: public-xg-htmlspeech-request@w3.org
>>>>>>>>>> [mailto:public-xg-htmlspeech- request@w3.org] On Behalf Of
>> Bjorn
>>>>>>>>>> Bringert
>>>>>>>>>> Sent: Friday, October 22, 2010 7:01 AM
>>>>>>>>>> To: Dan Burnett
>>>>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and
>> recognition
>>>>>>>>>> results should all use standard formats
>>>>>>>>>>
>>>>>>>>>> For grammars, SRGS + SISR seems like the obvious choice.
>>>>>>>>>>
>>>>>>>>>> For TTS, SSML seems like the obvious choice.
>>>>>>>>>>
>>>>>>>>>> I'm not exactly sure what is meant by media composition here. Is it
>>>>>>>>>> using TTS output together with other media? Is there a use case for
>>>>>>>>>> this? And is there anything we need to specify here at all?
>>>>>>>>>>
>>>>>>>>>> For recognition results, there is NLSML, but as far as I can tell,
>>>>>>>>>> that hasn't been widely adopted. Also, it seems like it could be a
>>>>>>>>>> bit complex for web applications to process.
>>>>>>>>>>
>>>>>>>>>> /Bjorn
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 22, 2010 at 1:06 AM, Dan
>> Burnett<dburnett@voxeo.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Group,
>>>>>>>>>>>
>>>>>>>>>>> This is the second of the requirements to discuss and prioritize
>>>>>>>>>>> based on our ranking approach [1].
>>>>>>>>>>>
>>>>>>>>>>> This email is the beginning of a thread for questions, discussion,
>>>>>>>>>>> and opinions regarding our first draft of Requirement 27 [2].
>>>>>>>>>>>
>>>>>>>>>>> After our discussion and any modifications to the requirement, our
>>>>>>>>>>> goal is to prioritize this requirement as either "Should Address"
>>>>>>>>>>> or "For Future Consideration".
>>>>>>>>>>>
>>>>>>>>>>> -- dan
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>>>>>>>>>>
>>>>>>>>>>> [2]
>>>>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Bjorn Bringert
>>>>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76
>> Buckingham
>>>>>>>>>> Palace Road, London, SW1W 9TQ Registered in England Number:
>> 3977902
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bjorn Bringert
>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76  
>>>>>>> Buckingham
>>>>>>> Palace
>>>>>>> Road, London, SW1W 9TQ Registered in England Number: 3977902
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902
>
>
Received on Sunday, 31 October 2010 16:23:50 UTC