Re: SpeechRecognitionAlternative.interpretation when interpretation can't be provided

Debbie,
In my proposal, the single emma document is updated with each
new SpeechRecognitionEvent. Therefore, in continuous = true mode, the emma
document is populated in "real time" as the user speaks each field, without
waiting for the user to finish speaking. A JavaScript author could use this
to populate a form in "real time".
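A minimal sketch of that pattern (the handler wiring and slot names are illustrative assumptions, not from the spec; the form is modeled as a plain object):

```javascript
// Sketch only: assumes the cumulative interpretation is an object whose
// keys are form-field names, with null for slots not yet spoken.
function fillForm(form, interpretation) {
  for (const slot in interpretation) {
    if (interpretation[slot] != null) {  // skip slots the user hasn't filled yet
      form[slot] = interpretation[slot];
    }
  }
  return form;
}

// Hypothetical wiring in a page (browser-only, shown for context):
// recognition.continuous = true;
// recognition.onresult = function (event) {
//   fillForm(tripFormModel, event.interpretation);
// };
```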


Also, I now realize that the SpeechRecognitionEvent.transcript is not
useful in continuous = false mode because only one final result is
returned, and thus SpeechRecognitionEvent.results[0].transcript always
contains the same string (no concatenation needed).  I also don't see it as
very useful in continuous = true mode because if an author is using this
mode, it's presumably because he wants to show continuous final results
(and perhaps interim as well). Since the author is already writing code to
concatenate results to display them "real-time", there's little or no
savings with this new attribute.  So I now retract that portion of my
proposal.
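For reference, the concatenation in question is only a couple of lines. A mock sketch (plain objects stand in for SpeechRecognitionResult here; `final` and the `[0]` alternative indexing mirror the draft API's shape):

```javascript
// Sketch: rebuild the cumulative transcript from a results list by
// concatenating the first (highest-confidence) alternative of each
// final result.
function cumulativeTranscript(results) {
  let text = '';
  for (const result of results) {
    if (result.final) {
      text += result[0].transcript;
    }
  }
  return text;
}
```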

So to clarify, here are my proposed changes to the spec. If there's no
disagreement by the end of the week, I'll add them to the spec...


Delete SpeechRecognitionAlternative.interpretation

Delete SpeechRecognitionResult.emma

Add interpretation and emma attributes to SpeechRecognitionEvent.
Specifically:

    interface SpeechRecognitionEvent : Event {
        readonly attribute short resultIndex;
        readonly attribute SpeechRecognitionResultList results;
        readonly attribute any interpretation;
        readonly attribute Document emma;
    };

I do not propose to change the definitions of interpretation and emma at
this time (because there is on-going discussion), but rather to simply move
their current definitions to the new heading: "5.1.8 Speech Recognition
Event".
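Under this proposal a handler reads everything off the event itself; a rough sketch (the event shape is mocked here, since the attribute definitions are still under discussion):

```javascript
// Sketch: with interpretation and emma moved onto the event, a handler
// no longer loops over results to assemble the semantics.
function onResult(event) {
  return {
    latest: event.results[event.resultIndex][0].transcript,
    interpretation: event.interpretation,  // single cumulative interpretation
    emma: event.emma                       // single cumulative EMMA document
  };
}
```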

/Glen Shires


On Thu, Aug 30, 2012 at 8:36 AM, Deborah Dahl <
dahl@conversational-technologies.com> wrote:

> Hi Glen,
>
> I agree that a single cumulative emma document is preferable to multiple
> emma documents in general, although I think that there might be use cases
> where it would be convenient to have both.  For example, you want to
> populate a form in real time as the user speaks each field, without waiting
> for the user to finish speaking. After the result is final the application
> could send the cumulative result to the server, but seeing the interim
> results would be helpful feedback to the user.
>
> Debbie
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, August 29, 2012 2:57 PM
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I believe the same is true for emma: a single, cumulative emma document is
> preferable to multiple emma documents.
>
> I propose the following changes to the spec:
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>     interface SpeechRecognitionEvent : Event {
>         readonly attribute short resultIndex;
>         readonly attribute SpeechRecognitionResultList results;
>         readonly attribute DOMString transcript;
>         readonly attribute any interpretation;
>         readonly attribute Document emma;
>     };
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is on-going discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> I also propose adding a transcript attribute to SpeechRecognitionEvent (but
> also retaining SpeechRecognitionAlternative.transcript). This provides a
> simple option for JavaScript authors to get at the full, cumulative
> transcript.  I propose the definition under "5.1.8 Speech Recognition
> Event" be:
>
> transcript
>
> The transcript string represents the raw words that the user spoke. This
> is a concatenation of the first (highest-confidence) alternative of all
> final SpeechRecognitionAlternative.transcript strings.
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I agree with having a single interpretation that represents the cumulative
> interpretation of the utterance so far.
>
> I think an example of what Jim is talking about, when the interpretation
> wouldn’t be final even if the transcript is, might be the utterance “from
> Chicago … Midway”. Maybe the grammar has a default of “Chicago O’Hare”, and
> returns “from: ORD”, because most people don’t bother to say “O’Hare”, but
> then it hears “Midway” and changes the interpretation to “from: MDW”.
> However, “from Chicago” is still the transcript.
>
> Also, the problem that Glen points out is bad enough with two slots, but
> it gets even worse as the number of slots gets bigger. For example, you
> might have a pizza-ordering utterance with five or six ingredients (“I want
> a large pizza with mushrooms…pepperoni…onions…olives…anchovies”). It would
> be very cumbersome to have to go back through all the results to fill in
> the slots separately.
>
> *From:* Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> *Sent:* Wednesday, August 29, 2012 12:37 PM
> *To:* Glen Shires; Deborah Dahl
>
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>
> *Subject:* RE: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I agree with the idea of having a single interpretation.  There is no
> guarantee that the different parts of the string have independent
> interpretations.  For example, even if the transcription “from New York” is
> final, its interpretation may not be, since it may depend on the
> remaining parts of the utterance (that depends on how complicated the
> grammar is, of course.)
>
> - Jim
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, August 29, 2012 11:44 AM
> *To:* Deborah Dahl
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> How should interpretation work with continuous speech?
>
> Specifically, as each portion becomes final (each SpeechRecognitionResult
> with final=true), the corresponding alternative(s) for transcription and
> interpretation become final.
>
> It's easy for the JavaScript author to handle the consecutive list of
> transcription strings - simply concatenate them.
>
> However, if the interpretation returns a semantic structure (such as the
> depart/arrive example), it's unclear to me how they should be returned.
> For example, if the first final result was "from New York" and the second
> "to San Francisco", then:
>
> After the first final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[0].item[0].interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[0].item[0].interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> event.results[1].item[0].transcription = "to San Francisco"
> event.results[1].item[0].interpretation = {
>   depart: null,
>   arrive: "San Francisco"
> };
>
> If so, this makes using the interpretation structure very messy for the
> author because he needs to loop through all the results to find each
> interpretation slot that he needs.
>
> I suggest that we instead consider changing the spec to provide a single
> interpretation that always represents the most current interpretation.
>
> After the first final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[1].item[0].transcription = "to San Francisco"
> event.interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> This not only makes it simple for the author to process the
> interpretation, it also solves the problem that the interpretation may not
> be available at the same point in time that the transcription becomes
> final.  If alternative interpretations are important, then it's easy to add
> them to the interpretation structure that is returned, and this format is
> far easier for the author to process than
> multiple SpeechRecognitionAlternative.interpretations.  For example:
>
> event.interpretation = {
>   depart: ["New York", "Newark"],
>   arrive: ["San Francisco", "San Bernardino"]
> };
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I don’t think there’s a big difference in complexity in this use case, but
> here’s another one, that I think might be more common.
>
> Suppose the application is something like search or composing email, and
> the transcript alone would serve the application's purposes. However, some
> implementations might also provide useful normalizations, like converting
> text numbers to digits or capitalization, that would make the dictated text
> look more like written language, and this normalization fills the
> "interpretation" slot. If the developer can count on the "interpretation"
> slot being filled by the transcript if there's nothing better, then the
> developer only has to ask for the interpretation.
>
> e.g.
>
> document.write(interpretation)
>
> vs.
>
> if (interpretation)
>     document.write(interpretation)
> else
>     document.write(transcript)
>
> which I think is simpler. The developer doesn’t have to worry about type
> checking because in this application the “interpretation” will always be a
> string.
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Tuesday, August 28, 2012 10:44 PM
> *To:* Deborah Dahl
>
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
> Looking at this from the viewpoint of what is easier for the JavaScript
> author, I believe:
>
> SpeechRecognitionAlternative.transcript must return a string (even if an
> empty string). Thus, an author wishing to use the transcript doesn't need
> to perform any type checking.
>
> SpeechRecognitionAlternative.interpretation must be null if no
> interpretation is provided.  This simplifies the required conditional by
> eliminating type checking.  For example:
>
> transcript = "from New York to San Francisco";
>
> interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> if (interpretation)  // this works if interpretation is present or if null
>   document.write("Depart " + interpretation.depart + " and arrive in " +
>                  interpretation.arrive);
> else
>   document.write(transcript);
>
> Whereas, if the interpretation contains the transcript string when no
> interpretation is present, the condition would have to be:
>
> if (typeof(interpretation) != "string")
>
> which is more complex, and more prone to errors (e.g. if "string" is
> misspelled).
>
> /Glen Shires
>
> On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
>
> In the case of an SLM, if there’s a classification, I think the
> classification would be the interpretation. If the SLM is just used to
> improve dictation results, without classification, then the interpretation
> would be whatever we say it is – either the transcript, null, or undefined.
>
> My point about stating that the “transcript” attribute is required or
> optional wasn’t whether or not there was a use case where it would be
> desirable not to return a transcript. My point was that the spec needs to
> be explicit about the optional/required status of every feature. It’s
> fine to postpone that decision if there’s any controversy, but if we all
> agree we might as well add it to the spec.
>
> I can’t think of any cases where it would be bad to return a transcript,
> although I can think of use cases where the developer wouldn’t choose to do
> anything with the transcript (like multi-slot form filling – all the end
> user really needs to see is the correctly filled slots).
>
> Debbie
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Thursday, August 23, 2012 3:48 AM
> *To:* Deborah Dahl
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
> I agree with the need to support SLMs. This implies that, in some cases,
> the author may not specify semantic information, and thus there would not
> be an interpretation.
>
> Under what circumstances (except error conditions) do you envision that a
> transcript would not be returned?
>
> /Glen Shires
>
> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Actually, Satish's comment made me think that we probably have a few other
> things to agree on before we decide what the default value of
> "interpretation" should be, because we haven't settled on a lot of issues
> about what is required and what is optional.
> Satish's argument is only relevant if we require SRGS/SISR for grammars and
> semantic interpretation, but we actually don't require either of those
> right
> now, so it doesn't matter what they do as far as the current spec goes.
> (Although it's worth noting that  SRGS doesn't require anything to be
> returned at all, even the transcript
> http://www.w3.org/TR/speech-grammar/#S1.10).
> So I think we first need to decide and explicitly state in the spec ---
>
> 1. what we want to say about grammar formats (which are allowed/required,
> or
> is the grammar format open). It probably needs to be somewhat open because
> of SLM's.
> 2. what we want to say about semantic tag formats (are proprietary formats
> allowed, is SISR required or is the semantic tag format just whatever the
> grammar format uses)
> 3. is "transcript" required?
> 4. is "interpretation" required?
>
> Debbie
>
>
> > -----Original Message-----
> > From: Hans Wennborg [mailto:hwennborg@google.com]
> > Sent: Tuesday, August 21, 2012 12:50 PM
> > To: Glen Shires
> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org
> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > interpretation can't be provided
> >
> > Björn, Deborah, are you ok with this as well? I.e. that the spec
> > shouldn't mandate a "default" value for the interpretation attribute,
> > but rather return null when there is no interpretation?
> >
> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> wrote:
> > > I agree, return "null" (not "undefined") in such cases.
> > >
> > >
> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote:
> > >>
> > >> > I may have missed something, but I don’t see in the spec where it
> > >> > says that “interpretation” is optional.
> > >>
> > >> Developers specify the interpretation value with SISR and if they
> > >> don't specify there is no 'default' interpretation available. In that
> > >> sense it is optional because grammars don't mandate it. So I think
> > >> this API shouldn't mandate providing a default value if the engine
> > >> did not provide one, and return null in such cases.
> > >>
> > >> Cheers
> > >> Satish
> > >>
> > >>
> > >>
> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl
> > >> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> I may have missed something, but I don’t see in the spec where it
> > >>> says that “interpretation” is optional.
> > >>>
> > >>> From: Satish S [mailto:satish@google.com]
> > >>> Sent: Thursday, August 16, 2012 7:38 PM
> > >>> To: Deborah Dahl
> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org
> > >>>
> > >>>
> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> interpretation can't be provided
> > >>>
> > >>>
> > >>>
> > >>> 'interpretation' is an optional attribute because engines are not
> > >>> required to provide an interpretation on their own (unlike
> > >>> 'transcript'). As such I think it should return null when there
> > >>> isn't a value to be returned as that is the convention for optional
> > >>> attributes, not 'undefined' or a copy of some other attribute.
> > >>>
> > >>>
> > >>>
> > >>> If an engine chooses to return the same value for 'transcript' and
> > >>> 'interpretation', or do textnorm of the value and return it in
> > >>> 'interpretation', that will be an implementation detail of the
> > >>> engine. But in the absence of any such value for 'interpretation'
> > >>> from the engine I think the UA should return null.
> > >>>
> > >>>
> > >>> Cheers
> > >>> Satish
> > >>>
> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl
> > >>> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> That's a good point. There are lots of use cases where some simple
> > >>> normalization is extremely useful, as in your example, or collapsing
> > >>> all the ways that the user might say "yes" or "no". However, you
> > >>> could say that once the implementation has modified or normalized
> > >>> the transcript that means it has some kind of interpretation, so
> > >>> putting a normalized value in the interpretation slot should be
> > >>> fine. Nothing says that the "interpretation" has to be a
> > >>> particularly fine-grained interpretation, or one with a lot of
> > >>> structure.
> > >>>
> > >>>
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
> > >>> > To: Hans Wennborg
> > >>> > Cc: Conversational; public-speech-api@w3.org
> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> > interpretation can't be provided
> > >>> >
> > >>> > I'm not sure that it has to be that strict in requiring that the
> > >>> > value is the same as the "transcript" attribute. For example, an
> > >>> > engine might return the words recognized in "transcript" and apply
> > >>> > some extra textnorm to the text that it returns in
> > >>> > "interpretation", e.g. converting digit words to digits ("three"
> > >>> > -> "3"). Not sure if that's useful though.
> > >>> >
> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
> > >>> > <hwennborg@google.com> wrote:
> > >>> > > Yes, the raw text is in the 'transcript' attribute.
> > >>> > >
> > >>> > > The description of 'interpretation' is currently: "The
> > >>> > > interpretation represents the semantic meaning from what the
> > >>> > > user said. This might be determined, for instance, through the
> > >>> > > SISR specification of semantics in a grammar."
> > >>> > >
> > >>> > > I propose that we change it to "The interpretation represents
> > >>> > > the semantic meaning from what the user said. This might be
> > >>> > > determined, for instance, through the SISR specification of
> > >>> > > semantics in a grammar. If no semantic meaning can be
> > >>> > > determined, the attribute must be a string with the same value
> > >>> > > as the 'transcript' attribute."
> > >>> > >
> > >>> > > Does that sound good to everyone? If there are no objections,
> > >>> > > I'll make the change to the draft next week.
> > >>> > >
> > >>> > > Thanks,
> > >>> > > Hans
> > >>> > >
> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
> > >>> > > <dahl@conversational-technologies.com> wrote:
> > >>> > >> I can't check the spec right now, but I assume there's already
> > >>> > >> an attribute that currently is defined to contain the raw
> > >>> > >> text. So I think we could say that if there's no interpretation
> > >>> > >> the value of the interpretation attribute would be the same as
> > >>> > >> the value of the "raw string" attribute.
> > >>> > >>
> > >>> > >> Sent from my iPhone
> > >>> > >>
> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg
> > >>> > >> <hwennborg@google.com> wrote:
> > >>> > >>
> > >>> > >>> OK, that would work I suppose.
> > >>> > >>>
> > >>> > >>> What would the spec text look like? Something like "[...] If no
> > >>> > >>> semantic meaning can be determined, the attribute will be a string
> > >>> > >>> representing the raw words that the user spoke."?
> > >>> > >>>
> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert
> > >>> > >>> <bringert@google.com> wrote:
> > >>> > >>>> Yeah, that would be my preference too.
> > >>> > >>>>
> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
> > >>> > >>>>> If there isn't an interpretation I think it would make the
> > >>> > >>>>> most sense for the attribute to contain the literal string
> > >>> > >>>>> result. I believe this is what happens in VoiceXML.
> > >>> > >>>>>
> > >>> > >>>>>> My question is: for implementations that cannot provide an
> > >>> > >>>>>> interpretation, what should the attribute's value be? null?
> > >>> > >>>>>> undefined?
> > >>> >
> > >>> >
> > >>> >
> > >>> > --
> > >>> > Bjorn Bringert
> > >>> > Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> > >>> > Palace Road, London, SW1W 9TQ
> > >>> > Registered in England Number: 3977902
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >

Received on Thursday, 30 August 2012 16:46:13 UTC