Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats from Bjorn Bringert on 2010-10-29 (public-xg-htmlspeech@w3.org from October 2010)

From: Bjorn Bringert <bringert@google.com>
Date: Fri, 29 Oct 2010 09:36:48 +0100
To: Dan Burnett <dburnett@voxeo.com>
Cc: Olli@pettay.fi, "Raj(Openstream)" <raj@openstream.com>, Dave Burke <daveburke@google.com>, Michael Bodell <mbodell@microsoft.com>, Deborah Dahl <dahl@conversational-technologies.com>, public-xg-htmlspeech@w3.org
Message-ID: <AANLkTikp0z36-vr3ioBtiLTpW3iWV3vb=X=O_mPus8YM@mail.gmail.com>
Sounds like a good idea. How about going a bit further:

27a. Speech recognition grammars should use standard formats such as
SRGS and SISR.
27b. TTS should use standard formats such as SSML.
27c. Recognition results should be based upon a standard such as EMMA
but be in an easy-to-process format such as JSON.

- This puts recognition and synthesis in separate requirements, since
we might end up with separate specs for them.

- This drops media composition. There have been no proposed use cases
that need it, and no existing standard for it has been proposed on
this thread.
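
As a rough illustration of 27c, here is a minimal sketch of how a script
might pick the best result out of a JSON-style recognition result. The
object shape and the field names (interpretations, confidence, tokens,
slots) and the helper bestInterpretation are assumptions made up for this
example, loosely modelled on the EMMA example quoted below; nothing like
this has been specified yet.

```javascript
// Hypothetical JSON rendering of the EMMA example quoted later in this thread.
// All field names here are illustrative assumptions, not a proposed format.
const result = {
  interpretations: [
    { confidence: 0.68, tokens: "flights from austin to denver",
      slots: { origin: "Austin", destination: "Denver" } },
    { confidence: 0.75, tokens: "flights from boston to denver",
      slots: { origin: "Boston", destination: "Denver" } }
  ]
};

// Return the interpretation with the highest confidence,
// or null if the result is missing or has no interpretations.
function bestInterpretation(res) {
  const list = (res && res.interpretations) || [];
  let best = null;
  for (const interp of list) {
    if (best === null || interp.confidence > best.confidence) {
      best = interp;
    }
  }
  return best;
}

const best = bestInterpretation(result);
console.log(best ? best.tokens : "(no result)");
```

The helper does not rely on the interpretations being sorted, so it would
keep working if a recognizer returned them in arbitrary order.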

/Bjorn

On Fri, Oct 29, 2010 at 1:40 AM, Dan Burnett <dburnett@voxeo.com> wrote:
> Because
> a) we are operating at a requirements level currently,
> b) we essentially have agreement on SRGS, SISR, and SSML, and
> c) we are beginning to agree on a direction for the recognition results,
>
> I propose we split this requirement into two:
>
> 27a.  Grammars, TTS, and media composition should all use standard formats
> such as SRGS, SISR, and SSML.
> 27b.  Recognition results should be based upon a standard such as EMMA but
> be in an easy-to-process format such as JSON.
>
> I suspect this will simplify our determination of which requirements we can
> list as "Should Address".
> Thoughts?  Objections?
>
> -- dan
>
>
>
>
> On Oct 27, 2010, at 1:13 PM, Olli Pettay wrote:
>
>> On 10/27/2010 05:09 PM, Bjorn Bringert wrote:
>>>
>>> What's the simplest code (e.g. in JavaScript + DOM) needed to extract
>>> the text of the best utterance from any EMMA document that a
>>> recognizer might return? Michael's code works for the given example,
>>> but not for an arbitrary EMMA document.
>>
>> Actually, the code might not work, *if* I read the EMMA spec correctly,
>> since it uses getElementById and the id attribute is not defined to be
>> an ID in emma:interpretation.
>> (Though that would just be a spec bug.)
>>
>>> I understand that many apps want to do more complex things, but I
>>> would like the API that we end up with to satisfy both parts of
>>> "Simple things should be easy and complex things should be possible".
>>
>> Totally agree with this.
>>
>> I wonder if we could specify some *small* subset of the features we need
>> from EMMA and expose those as JSON or some other JS-friendly object
>> in the first version of the upcoming API.
>> Then support for full EMMA could be added in v2.
>> And in the meantime the MMI WG could perhaps develop a JSON version
>> of the result format.
>>
>>
>> I'm hoping we could come up with a reasonably small and simple API as
>> version 1 and then do more in the next revisions.
>> Something similar to what is happening with Web Notifications.
>>
>> -Olli
>>
>>
>>>
>>> /Bjorn
>>>
>>> On Wed, Oct 27, 2010 at 2:57 PM, Raj(Openstream)<raj@openstream.com>
>>>  wrote:
>>>>
>>>> From our developers' experience, they don't seem to find JavaScript any
>>>> simpler than using EMMA, and all of them, needless to say, are Web
>>>> developers to begin with.
>>>>
>>>> Raj
>>>>
>>>> ----- Original Message -----
>>>> From: Dave Burke
>>>> To: Michael Bodell
>>>> Cc: Bjorn Bringert ; Dan Burnett ; Deborah Dahl ;
>>>> public-xg-htmlspeech@w3.org
>>>> Sent: Tuesday, October 26, 2010 5:48 PM
>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>>>> results
>>>> should all use standard formats
>>>> Seems convoluted to force developers to have to understand EMMA when we
>>>> could have a simpler JavaScript object. What does EMMA buy the typical
>>>> Web developer?
>>>> Dave
>>>>
>>>> On Tue, Oct 26, 2010 at 10:43 PM, Michael Bodell<mbodell@microsoft.com>
>>>> wrote:
>>>>>
>>>>> Here's the first EMMA example from the specification:
>>>>>
>>>>> <emma:emma version="1.0"
>>>>>   xmlns:emma="http://www.w3.org/2003/04/emma"
>>>>>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>>>   xsi:schemaLocation="http://www.w3.org/2003/04/emma
>>>>>    http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>>>>>   xmlns="http://www.example.com/example">
>>>>>  <emma:one-of id="r1" emma:start="1087995961542"
>>>>>    emma:end="1087995963542"
>>>>>    emma:medium="acoustic" emma:mode="voice">
>>>>>   <emma:interpretation id="int1" emma:confidence="0.75"
>>>>>   emma:tokens="flights from boston to denver">
>>>>>     <origin>Boston</origin>
>>>>>     <destination>Denver</destination>
>>>>>   </emma:interpretation>
>>>>>
>>>>>   <emma:interpretation id="int2" emma:confidence="0.68"
>>>>>   emma:tokens="flights from austin to denver">
>>>>>     <origin>Austin</origin>
>>>>>     <destination>Denver</destination>
>>>>>   </emma:interpretation>
>>>>>  </emma:one-of>
>>>>> </emma:emma>
>>>>>
>>>>> Using something like XPath it is very simple to do something like
>>>>> '//emma:interpretation[@emma:confidence > 0.6][1]' or
>>>>> '//emma:interpretation/origin'.
>>>>>
>>>>> Using DOM one could easily do something like getElementById("int1")
>>>>> and inspect that element or else
>>>>> getElementsByTagName("emma:interpretation").
>>>>>
>>>>> If you had a more E4X approach you could imagine
>>>>> result["one-of"].interpretation[0] would give you the first result.
>>>>>
>>>>> The JSON representation of content might be:
>>>>> ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
>>>>> {origin:"Austin", destination:"Denver"}]}}).
>>>>>
>>>>> In addition, depending on how the recognition is defined there might be
>>>>> one or more default bindings of recognition results to input elements
>>>>> in HTML such that scripting isn't needed for the "common tasks" but the
>>>>> scripting is there for the more advanced tasks.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Bjorn Bringert [mailto:bringert@google.com]
>>>>> Sent: Monday, October 25, 2010 5:43 AM
>>>>> To: Dan Burnett
>>>>> Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
>>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>>>>> results should all use standard formats
>>>>>
>>>>> I haven't used EMMA, but it looks like it could be a bit complex for a
>>>>> script to simply get the top utterance or interpretation out. Are there
>>>>> any shorthands or DOM methods for this? Any Hello World examples to show
>>>>> the basic usage?
>>>>>
>>>>> /Bjorn
>>>>>
>>>>> On Mon, Oct 25, 2010 at 1:38 PM, Dan Burnett<dburnett@voxeo.com>
>>>>>  wrote:
>>>>>>
>>>>>> +1
>>>>>> On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
>>>>>>
>>>>>>> I agree that SRGS, SISR, EMMA, and SSML seem like the obvious W3C
>>>>>>> standard formats that we should use.
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: public-xg-htmlspeech-request@w3.org
>>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Deborah
>>>>>>> Dahl
>>>>>>> Sent: Friday, October 22, 2010 6:39 AM
>>>>>>> To: 'Bjorn Bringert'; 'Dan Burnett'
>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>> Subject: RE: R27. Grammars, TTS, media composition, and recognition
>>>>>>> results should all use standard formats
>>>>>>>
>>>>>>> For recognition results, EMMA
>>>>>>> http://www.w3.org/TR/2009/REC-emma-20090210/
>>>>>>> is a much more recent and more complete standard than NLSML. EMMA has
>>>>>>> a very rich set of capabilities, but most of them are optional, so
>>>>>>> that using it doesn't have to be complex. Quite a few recognizers
>>>>>>> support it. I think one of the most valuable aspects of EMMA is that
>>>>>>> as applications eventually start finding that they need more and more
>>>>>>> information about the recognition result, much of that more advanced
>>>>>>> information has already been worked out and standardized in EMMA.
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: public-xg-htmlspeech-request@w3.org
>>>>>>>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn
>>>>>>>> Bringert
>>>>>>>> Sent: Friday, October 22, 2010 7:01 AM
>>>>>>>> To: Dan Burnett
>>>>>>>> Cc: public-xg-htmlspeech@w3.org
>>>>>>>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>>>>>>>> results should all use standard formats
>>>>>>>>
>>>>>>>> For grammars, SRGS + SISR seems like the obvious choice.
>>>>>>>>
>>>>>>>> For TTS, SSML seems like the obvious choice.
>>>>>>>>
>>>>>>>> I'm not exactly sure what is meant by media composition here. Is it using
>>>>>>>> TTS output together with other media? Is there a use case for this?
>>>>>>>> And is there anything we need to specify here at all?
>>>>>>>>
>>>>>>>> For recognition results, there is NLSML, but as far as I can tell,
>>>>>>>> that hasn't been widely adopted. Also, it seems like it could be a
>>>>>>>> bit complex for web applications to process.
>>>>>>>>
>>>>>>>> /Bjorn
>>>>>>>>
>>>>>>>> On Fri, Oct 22, 2010 at 1:06 AM, Dan Burnett<dburnett@voxeo.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Group,
>>>>>>>>>
>>>>>>>>> This is the second of the requirements to discuss and prioritize
>>>>>>>>> based on our ranking approach [1].
>>>>>>>>>
>>>>>>>>> This email is the beginning of a thread for questions, discussion,
>>>>>>>>> and opinions regarding our first draft of Requirement 27 [2].
>>>>>>>>>
>>>>>>>>> After our discussion and any modifications to the requirement, our
>>>>>>>>> goal is to prioritize this requirement as either "Should Address"
>>>>>>>>> or "For Future Consideration".
>>>>>>>>>
>>>>>>>>> -- dan
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>>>>>>>>>
>>>>>>>>> [2]
>>>>>>>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bjorn Bringert
>>>>>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>>>>>> Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Friday, 29 October 2010 08:37:19 UTC