Re: R27. Grammars, TTS, media composition, and recognition results should all use standard formats from Bjorn Bringert on 2010-10-27 (public-xg-htmlspeech@w3.org from October 2010)

From: Bjorn Bringert <bringert@google.com>
Date: Wed, 27 Oct 2010 15:09:43 +0100
To: "Raj(Openstream)" <raj@openstream.com>
Cc: Dave Burke <daveburke@google.com>, Michael Bodell <mbodell@microsoft.com>, Dan Burnett <dburnett@voxeo.com>, Deborah Dahl <dahl@conversational-technologies.com>, public-xg-htmlspeech@w3.org
Message-ID: <AANLkTikf04saNHLKcBE-MAK_pkXybuF9EwoJ7pMPw_z-@mail.gmail.com>

What's the simplest code (e.g. in JavaScript + DOM) needed to extract
the text of the best utterance from any EMMA document that a
recognizer might return? Michael's code works for the given example,
but not for an arbitrary EMMA document.
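
To make the question concrete, a naive attempt might look like the
sketch below. It assumes the result arrives as a parsed XML DOM
Document, that the best interpretation is simply the one with the
highest emma:confidence, and that the recognizer filled in emma:tokens.
The name bestUtteranceText is hypothetical and not part of any proposed
API. It ignores other EMMA structure (lattices, for example), so it
still would not cover an arbitrary EMMA document.

```javascript
// EMMA 1.0 namespace, as declared in the example from the specification.
const EMMA_NS = "http://www.w3.org/2003/04/emma";

// Hypothetical helper: returns the emma:tokens text of the interpretation
// with the highest emma:confidence, or null if nothing usable is found.
function bestUtteranceText(emmaDoc) {
  // Collect every emma:interpretation element, wherever it is nested.
  const interpretations = Array.from(
    emmaDoc.getElementsByTagNameNS(EMMA_NS, "interpretation"));

  let best = null;
  let bestConfidence = -Infinity;
  for (const interp of interpretations) {
    const parsed = parseFloat(interp.getAttributeNS(EMMA_NS, "confidence"));
    // Assumption: an interpretation without a usable confidence ranks as 0.
    const confidence = Number.isNaN(parsed) ? 0 : parsed;
    // Strict comparison, so the earliest interpretation wins a tie.
    if (confidence > bestConfidence) {
      best = interp;
      bestConfidence = confidence;
    }
  }
  if (best === null) {
    return null;
  }

  // emma:tokens is optional, so the text may simply not be there.
  const tokens = best.getAttributeNS(EMMA_NS, "tokens");
  return tokens ? tokens : null;
}
```

In a browser the document could come from
new DOMParser().parseFromString(xml, "application/xml"). Even this
small helper has to know about namespaces and optional attributes,
which is the kind of overhead I am asking about.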

I understand that many apps want to do more complex things, but I
would like the API that we end up with to satisfy both parts of
"Simple things should be easy and complex things should be possible".

/Bjorn

On Wed, Oct 27, 2010 at 2:57 PM, Raj(Openstream) <raj@openstream.com> wrote:
> From our developers' experience, they don't seem to find JavaScript any
> simpler than using
> EMMA... and all of them, needless to say, are Web developers to begin with.
>
> Raj
>
> ----- Original Message -----
> From: Dave Burke
> To: Michael Bodell
> Cc: Bjorn Bringert ; Dan Burnett ; Deborah Dahl ;
> public-xg-htmlspeech@w3.org
> Sent: Tuesday, October 26, 2010 5:48 PM
> Subject: Re: R27. Grammars, TTS, media composition, and recognition results
> should all use standard formats
>
> Seems convoluted to force developers to have to understand EMMA when we
> could have a simpler JavaScript object. What does EMMA buy the typical Web
> developer?
> Dave
>
> On Tue, Oct 26, 2010 at 10:43 PM, Michael Bodell <mbodell@microsoft.com>
> wrote:
>>
>> Here's the first EMMA example from the specification:
>>
>> <emma:emma version="1.0"
>>    xmlns:emma="http://www.w3.org/2003/04/emma"
>>    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>    xsi:schemaLocation="http://www.w3.org/2003/04/emma
>>     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
>>    xmlns="http://www.example.com/example">
>>  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"
>>     emma:medium="acoustic" emma:mode="voice">
>>    <emma:interpretation id="int1" emma:confidence="0.75"
>>    emma:tokens="flights from boston to denver">
>>      <origin>Boston</origin>
>>      <destination>Denver</destination>
>>    </emma:interpretation>
>>
>>    <emma:interpretation id="int2" emma:confidence="0.68"
>>    emma:tokens="flights from austin to denver">
>>      <origin>Austin</origin>
>>      <destination>Denver</destination>
>>    </emma:interpretation>
>>  </emma:one-of>
>> </emma:emma>
>>
>> Using something like XPath it is very simple to do something like
>> '//interpretation[@confidence > 0.6][1]' or '//interpretation/origin'.
>>
>> Using DOM one could easily do something like getElementById("int1") and
>> inspect that element or else getElementsByTagName("emma:interpretation").
>>
>> If you had a more E4X approach you could imagine
>> result["one-of"].interpretation[0] would give you the first result.
>>
>> The JSON representation of content might be:
>> ({'one-of':{interpretation:[{origin:"Boston", destination:"Denver"},
>> {origin:"Austin", destination:"Denver"}]}}).
>>
>> In addition, depending on how the recognition is defined there might be
>> one or more default bindings of recognition results to input elements in
>> HTML such that scripting isn't needed for the "common tasks" but the
>> scripting is there for the more advanced tasks.
>>
>> -----Original Message-----
>> From: Bjorn Bringert [mailto:bringert@google.com]
>> Sent: Monday, October 25, 2010 5:43 AM
>> To: Dan Burnett
>> Cc: Michael Bodell; Deborah Dahl; public-xg-htmlspeech@w3.org
>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>> results should all use standard formats
>>
>> I haven't used EMMA, but it looks like it could be a bit complex for a
>> script to simply get the top utterance or interpretation out. Are there any
>> shorthands or DOM methods for this? Any Hello World examples to show the
>> basic usage?
>>
>> /Bjorn
>>
>> On Mon, Oct 25, 2010 at 1:38 PM, Dan Burnett <dburnett@voxeo.com> wrote:
>> > +1
>> > On Oct 22, 2010, at 2:57 PM, Michael Bodell wrote:
>> >
>> >> I agree that SRGS, SISR, EMMA, and SSML seem like the obvious W3C
>> >> standard formats that we should use.
>> >>
>> >> -----Original Message-----
>> >> From: public-xg-htmlspeech-request@w3.org
>> >> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Deborah
>> >> Dahl
>> >> Sent: Friday, October 22, 2010 6:39 AM
>> >> To: 'Bjorn Bringert'; 'Dan Burnett'
>> >> Cc: public-xg-htmlspeech@w3.org
>> >> Subject: RE: R27. Grammars, TTS, media composition, and recognition
>> >> results should all use standard formats
>> >>
>> >> For recognition results, EMMA
>> >> http://www.w3.org/TR/2009/REC-emma-20090210/
>> >> is a much more recent and more complete standard than NLSML. EMMA has
>> >> a very rich set of capabilities, but most of them are optional, so
>> >> that using it doesn't have to be complex. Quite a few recognizers
>> >> support it. I think one of the most valuable aspects of EMMA is that
>> >> as applications eventually start finding that they need more and more
>> >> information about the recognition result, much of that more advanced
>> >> information has already been worked out and standardized in EMMA.
>> >>
>> >>> -----Original Message-----
>> >>> From: public-xg-htmlspeech-request@w3.org
>> >>> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn
>> >>> Bringert
>> >>> Sent: Friday, October 22, 2010 7:01 AM
>> >>> To: Dan Burnett
>> >>> Cc: public-xg-htmlspeech@w3.org
>> >>> Subject: Re: R27. Grammars, TTS, media composition, and recognition
>> >>> results should all use standard formats
>> >>>
>> >>> For grammars, SRGS + SISR seems like the obvious choice.
>> >>>
>> >>> For TTS, SSML seems like the obvious choice.
>> >>>
>> >>> I'm not exactly sure what is meant by media composition here. Is it using
>> >>> TTS output together with other media? Is there a use case for this?
>> >>> And is there anything we need to specify here at all?
>> >>>
>> >>> For recognition results, there is NLSML, but as far as I can tell,
>> >>> that hasn't been widely adopted. Also, it seems like it could be a
>> >>> bit complex for web applications to process.
>> >>>
>> >>> /Bjorn
>> >>>
>> >>> On Fri, Oct 22, 2010 at 1:06 AM, Dan Burnett <dburnett@voxeo.com>
>> >>> wrote:
>> >>>>
>> >>>> Group,
>> >>>>
>> >>>> This is the second of the requirements to discuss and prioritize
>> >>>> based on our ranking approach [1].
>> >>>>
>> >>>> This email is the beginning of a thread for questions, discussion,
>> >>>> and opinions regarding our first draft of Requirement 27 [2].
>> >>>>
>> >>>> After our discussion and any modifications to the requirement, our
>> >>>> goal is to prioritize this requirement as either "Should Address"
>> >>>> or "For Future Consideration".
>> >>>>
>> >>>> -- dan
>> >>>>
>> >>>> [1]
>> >>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/0024.html
>> >>>>
>> >>>> [2]
>> >>>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Oct/att-0001/speech.html#r27
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Bjorn Bringert
>> >>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> >>> Palace Road, London, SW1W 9TQ Registered in England Number: 3977902
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham Palace
>> Road, London, SW1W 9TQ Registered in England Number: 3977902
>>
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Wednesday, 27 October 2010 14:10:45 UTC