Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats from Dan Burnett on 2010-10-29 (public-xg-htmlspeech@w3.org from October 2010)

From: Dan Burnett <dburnett@voxeo.com>
Date: Thu, 28 Oct 2010 20:40:52 -0400
To: Olli@pettay.fi
Cc: Bjorn Bringert <bringert@google.com>, "Raj(Openstream)" <raj@openstream.com>, Dave Burke <daveburke@google.com>, Michael Bodell <mbodell@microsoft.com>, Deborah Dahl <dahl@conversational-technologies.com>, public-xg-htmlspeech@w3.org
Message-Id: <AD4C3D58-5083-4F46-917B-60518754A794@voxeo.com>
Because
a) we are operating at a requirements level currently,
b) we essentially have agreement on SRGS, SISR, and SSML, and
c) we are beginning to agree on a direction for the recognition results,

I propose we split this requirement into two:

27a.  Grammars, TTS, and media composition should all use standard  
formats such as SRGS, SISR, and SSML.
27b.  Recognition results should be based upon a standard such as EMMA  
but be in an easy-to-process format such as JSON.

I suspect this will simplify our determination of which requirements  
we can list as "Should Address".
Thoughts?  Objections?

-- dan
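
As a rough illustration of what 27b could look like in practice, here is a minimal sketch assuming a JSON result that carries over only a handful of EMMA fields (tokens, confidence, and the application-specific interpretation). The field names and the bestInterpretation helper are hypothetical, not something the group has agreed on; the values are taken from the EMMA example quoted below.

```javascript
// Hypothetical JSON result shape, loosely modelled on EMMA's emma:one-of /
// emma:interpretation structure. All field names are illustrative assumptions.
const result = {
  medium: "acoustic",
  mode: "voice",
  interpretations: [
    { confidence: 0.75, tokens: "flights from boston to denver",
      value: { origin: "Boston", destination: "Denver" } },
    { confidence: 0.68, tokens: "flights from austin to denver",
      value: { origin: "Austin", destination: "Denver" } }
  ]
};

// Return the interpretation with the highest confidence, or null if none.
function bestInterpretation(res) {
  const list = (res && res.interpretations) || [];
  return list.reduce(
    (best, cur) => (best === null || cur.confidence > best.confidence ? cur : best),
    null
  );
}

const best = bestInterpretation(result);
console.log(best.tokens);
```

With a shape like this, the "simple thing" (the top utterance) is one property access away, while richer EMMA annotations could still be added as extra fields later.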




On Oct 27, 2010, at 1:13 PM, Olli Pettay wrote:

> On 10/27/2010 05:09 PM, Bjorn Bringert wrote:
>> What's the simplest code (e.g. in JavaScript + DOM) needed to extract
>> the text of the best utterance from any EMMA document that a
>> recognizer might return? Michael's code works for the given example,
>> but not for an arbitrary EMMA document.
> Actually, the code might not work, *if* I read the EMMA spec correctly,
> since it uses getElementById and the id attribute is not defined to be
> of type ID in emma:interpretation.
> (Though, that would be just a spec bug)
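
One way around the getElementById concern above is to avoid IDs entirely and look elements up by namespace. A minimal sketch, assuming the result arrives as a parsed XML Document; the helper name bestTokens is hypothetical, and it only handles emma:interpretation elements that carry emma:confidence and emma:tokens, not arbitrary EMMA documents.

```javascript
const EMMA_NS = "http://www.w3.org/2003/04/emma";

// Return the emma:tokens value of the highest-confidence
// emma:interpretation, or null if the document contains none.
// `doc` is any DOM Document (or object exposing getElementsByTagNameNS).
function bestTokens(doc) {
  const nodes = doc.getElementsByTagNameNS(EMMA_NS, "interpretation");
  let best = null;
  let bestConf = -Infinity;
  for (let i = 0; i < nodes.length; i++) {
    // A missing or unparsable confidence is treated as 0
    // (an assumption of this sketch, not something EMMA mandates).
    const conf = parseFloat(nodes[i].getAttributeNS(EMMA_NS, "confidence")) || 0;
    if (conf > bestConf) {
      bestConf = conf;
      best = nodes[i];
    }
  }
  return best ? best.getAttributeNS(EMMA_NS, "tokens") : null;
}
```

In a browser this could be called as bestTokens(new DOMParser().parseFromString(xml, "application/xml")), where xml is the EMMA string returned by the recognizer.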
>
>> I understand that many apps want to do more complex things, but I
>> would like the API that we end up with to satisfy both parts of
>> "Simple things should be easy and complex things should be possible".
> Totally agree with this.
>
> I wonder if we could specify some *small* subset of the features we need
> from EMMA and expose those as JSON or some other JS-friendly object
> in the first version of the upcoming API.
> Then support for full EMMA could be added in v2.
> And in the meantime the MMI WG could perhaps develop a JSON version
> of the result format.
>
>
> I'm hoping we could come up with some reasonably small and simple API as
> version 1 and then do more in the next revisions.
> Something similar to what is happening with Web Notifications.
>
> -Olli
>
>
>>
>> /Bjorn
>>
>> On Wed, Oct 27, 2010 at 2:57 PM, Raj(Openstream) <raj@openstream.com> wrote:
>>> From our developers' experience, they don't seem to find JavaScript any
>>> simpler than using EMMA... and all of them, needless to say, are Web
>>> developers to begin with.
>>>
>>> Raj
>>>
>>> ----- Original Message -----
>>> From: Dave Burke
>>> To: Michael Bodell
>>> Cc: Bjorn Bringert ; Dan Burnett ; Deborah Dahl ;
>>> public-xg-htmlspeech@w3.org
>>> Sent: Tuesday, October 26, 2010 5:48 PM
>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>>> results should all use standard formats
>>> Seems convoluted to force developers to have to understand EMMA when we
>>> could have a simpler JavaScript object. What does EMMA buy the typical
>>> Web developer?
>>> Dave
>>>
>>> On Tue, Oct 26, 2010 at 10:43 PM, Michael Bodell <mbodell@microsoft.com>
>>> wrote:
>>>>
>>>> Here's the first EMMA example from the specification:
>>>>
>>>> <emma:emma version="1.0"
>>>>    xmlns:emma="http://www.w3.org/2003/04/emma"
>>>>    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>    xsi:schemaLocation="http://www.w3.org/2003/04/emma
>>>>     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>>>>    xmlns="http://www.example.com/example">
>>>>  <emma:one-of id="r1" emma:start="1087995961542"
>>>>     emma:end="1087995963542"
>>>>     emma:medium="acoustic" emma:mode="voice">
>>>>    <emma:interpretation id="int1" emma:confidence="0.75"
>>>>    emma:tokens="flights from boston to denver">
>>>>      <origin>Boston</origin>
>>>>      <destination>Denver</destination>
>>>>    </emma:interpretation>
>>>>
>>>>    <emma:interpretation id="int2" emma:confidence="0.68"
>>>>    emma:tokens="flights from austin to denver">
>>>>      <origin>Austin</origin>
>>>>      <destination>Denver</destination>
>>>>    </emma:interpretation>
>>>>  </emma:one-of>
>>>> </emma:emma>
>>>>
>>>> Using something like xpath it is very simple to do something like
>>>> '//interpretation[@confidence > 0.6][1]' or '//interpretation/origin'.
>>>>
>>>> Using DOM one could easily do something like getElementById("int1") and
>>>> inspect that element or else getElementsByTagName("emma:interpretation").
>>>>
>>>> If you had a more E4X approach you could imagine
>>>> result["one-of"].interpretation[0] would give you the first result.
>>>>
>>>> The JSON representation of content might be:
>>>> ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
>>>> {origin:"Austin", destination:"Denver"}]}}).
>>>>
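
Note that the literal above is a JavaScript object expression rather than strict JSON (unquoted keys, single-quoted strings, wrapping parentheses). A small sketch of how it would be accessed from script and how it serializes to strict JSON; the variable names are illustrative only.

```javascript
// The literal from the message, evaluated as a JavaScript expression.
const asObject = ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
  {origin:"Austin", destination:"Denver"}]}});

// Strict JSON needs double-quoted keys and no wrapping parentheses;
// JSON.stringify produces that form, and JSON.parse reads it back.
const asJson = JSON.stringify(asObject);
const roundTripped = JSON.parse(asJson);

// First (top) interpretation, accessed the way the E4X-style example suggests.
const first = roundTripped["one-of"].interpretation[0];
console.log(first.origin, "->", first.destination);
```

This representation keeps only the application payload; the confidence and tokens annotations from the EMMA original are dropped.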
>>>> In addition, depending on how the recognition is defined there might be
>>>> one or more default bindings of recognition results to input elements in
>>>> HTML such that scripting isn't needed for the "common tasks" but the
>>>> scripting is there for the more advanced tasks.
>>>>
>>>> -----Original Message-----
>>>> From: Bjorn Bringert [mailto:bringert@google.com]
>>>> Sent: Monday, October 25, 2010 5:43 AM
>>>> To: Dan Burnett
>>>> Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>>>> results should all use standard formats
>>>>
>>>> I haven't used EMMA, but it looks like it could be a bit complex for a
>>>> script to simply get the top utterance or interpretation out. Are there
>>>> any shorthands or DOM methods for this? Any Hello World examples to show
>>>> the basic usage?
>>>>
>>>> /Bjorn
>>>>
>>>> On Mon, Oct 25, 2010 at 1:38 PM, Dan Burnett <dburnett@voxeo.com> wrote:
>>>>> +1
>>>>> On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
>>>>>
>>>>>> I agree that SRGS, SISR, EMMA, and SSML seem like the obvious W3C
>>>>>> standard formats that we should use.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: public-xg-htmlspeech-request@w3.org
>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Deborah
>>>>>> Dahl
>>>>>> Sent: Friday, October 22, 2010 6:39 AM
>>>>>> To: 'Bjorn Bringert'; 'Dan Burnett'
>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>> Subject: RE: R27. Grammars, TTS, media composition, and recognition
>>>>>> results should all use standard formats
>>>>>>
>>>>>> For recognition results, EMMA
>>>>>> http://www.w3.org/TR/2009/REC-emma-20090210/
>>>>>> is a much more recent and more complete standard than NLSML. EMMA has
>>>>>> a very rich set of capabilities, but most of them are optional, so
>>>>>> that using it doesn't have to be complex. Quite a few recognizers
>>>>>> support it. I think one of the most valuable aspects of EMMA is that
>>>>>> as applications eventually start finding that they need more and more
>>>>>> information about the recognition result, much of that more advanced
>>>>>> information has already been worked out and standardized in EMMA.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: public-xg-htmlspeech-request@w3.org
>>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn
>>>>>>> Bringert
>>>>>>> Sent: Friday, October 22, 2010 7:01 AM
>>>>>>> To: Dan Burnett
>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>>>>>>> results should all use standard formats
>>>>>>>
>>>>>>> For grammars, SRGS + SISR seems like the obvious choice.
>>>>>>>
>>>>>>> For TTS, SSML seems like the obvious choice.
>>>>>>>
>>>>>>> I'm not exactly sure what is meant by media composition here. Is it
>>>>>>> using TTS output together with other media? Is there a use case for
>>>>>>> this? And is there anything we need to specify here at all?
>>>>>>>
>>>>>>> For recognition results, there is NLSML, but as far as I can tell,
>>>>>>> that hasn't been widely adopted. Also, it seems like it could be a
>>>>>>> bit complex for web applications to process.
>>>>>>>
>>>>>>> /Bjorn
>>>>>>>
>>>>>>> On Fri, Oct 22, 2010 at 1:06 AM, Dan Burnett<dburnett@voxeo.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Group,
>>>>>>>>
>>>>>>>> This is the second of the requirements to discuss and prioritize
>>>>>>>> based on our ranking approach [1].
>>>>>>>>
>>>>>>>> This email is the beginning of a thread for questions, discussion,
>>>>>>>> and opinions regarding our first draft of Requirement 27 [2].
>>>>>>>>
>>>>>>>> After our discussion and any modifications to the requirement, our
>>>>>>>> goal is to prioritize this requirement as either "Should Address"
>>>>>>>> or "For Future Consideration".
>>>>>>>>
>>>>>>>> -- dan
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>>>>>>>
>>>>>>>> [2]
>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bjorn Bringert
>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>>>>> Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>> Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
>>>>
>>>
>>>
>>
>>
>>
>
Received on Friday, 29 October 2010 00:41:27 UTC