W3C home > Mailing lists > Public > public-speech-api@w3.org > September 2012

Re: SpeechRecognitionAlternative.interpretation when interpretation can't be provided

From: Glen Shires <gshires@google.com>
Date: Tue, 4 Sep 2012 11:04:00 -0700
Message-ID: <CAEE5bcgKKtNtpec47XxEFs5pKjLcQt-pk9vBi_V2Ob+7Ms=xbQ@mail.gmail.com>
To: Deborah Dahl <dahl@conversational-technologies.com>
Cc: Jim Barnett <Jim.Barnett@genesyslab.com>, Hans Wennborg <hwennborg@google.com>, Satish S <satish@google.com>, Bjorn Bringert <bringert@google.com>, public-speech-api@w3.org
I've updated the spec with this change (moved interpretation and emma
attributes to SpeechRecognitionEvent):
https://dvcs.w3.org/hg/speech-api/rev/48a58e558fcc

As always, the current draft spec is at:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

/Glen Shires

On Thu, Aug 30, 2012 at 10:07 AM, Deborah Dahl <
dahl@conversational-technologies.com> wrote:

> Thanks for the clarification, that makes sense. When each new version of
> the emma document arrives in a SpeechRecognitionEvent, the author can just
> repopulate all the earlier form fields, as well as the newest one, with
> the data from the most recent emma version.
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Thursday, August 30, 2012 12:45 PM
>
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
> In my proposal, the single emma document is updated with each
> new SpeechRecognitionEvent. Therefore, in continuous = true mode, the emma
> document is populated in "real time" as the user speaks each field, without
> waiting for the user to finish speaking. A JavaScript author could use this
> to populate a form in "real time".
>
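[Editor's note: a minimal sketch of the real-time form-filling described above. The slot names ("depart", "arrive") and the plain `form` object are hypothetical stand-ins for real form fields; this is not code from the spec.]

```javascript
// Sketch: populate form fields from the cumulative interpretation carried by
// each SpeechRecognitionEvent. Slot names and the form object are hypothetical.
function applyInterpretation(form, interpretation) {
  if (!interpretation) return form;      // engine provided no interpretation yet
  for (const slot of Object.keys(interpretation)) {
    if (interpretation[slot] != null) {  // only copy slots that are filled
      form[slot] = interpretation[slot];
    }
  }
  return form;
}

// Simulate two successive events in continuous = true mode:
const form = { depart: "", arrive: "" };
applyInterpretation(form, { depart: "New York", arrive: null });
applyInterpretation(form, { depart: "New York", arrive: "San Francisco" });
console.log(form); // { depart: 'New York', arrive: 'San Francisco' }
```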
>
> Also, I now realize that the SpeechRecognitionEvent.transcript is not
> useful in continuous = false mode because only one final result is
> returned, and thus SpeechRecognitionEvent.results[0].transcript always
> contains the same string (no concatenation needed). I also don't see it as
> very useful in continuous = true mode, because if an author is using this
> mode, it's presumably because he wants to show continuous final results
> (and perhaps interim as well). Since the author is already writing code to
> concatenate results to display them in real time, there's little or no
> savings with this new attribute. So I now retract that portion of my
> proposal.
>
>
> So to clarify, here are my proposed changes to the spec. If there's no
> disagreement by the end of the week, I'll add them to the spec:
>
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>     interface SpeechRecognitionEvent : Event {
>         readonly attribute short resultIndex;
>         readonly attribute SpeechRecognitionResultList results;
>         readonly attribute any interpretation;
>         readonly attribute Document emma;
>     };
>
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is ongoing discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> /Glen Shires
>
>
> On Thu, Aug 30, 2012 at 8:36 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
>
> I agree that a single cumulative emma document is preferable to multiple
> emma documents in general, although I think there might be use cases
> where it would be convenient to have both. For example, you might want to
> populate a form in real time as the user speaks each field, without waiting
> for the user to finish speaking. After the result is final, the application
> could send the cumulative result to the server, but seeing the interim
> results would be helpful feedback to the user.
>
> Debbie
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, August 29, 2012 2:57 PM
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I believe the same is true for emma: a single, cumulative emma document is
> preferable to multiple emma documents.
>
> I propose the following changes to the spec:
>
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>     interface SpeechRecognitionEvent : Event {
>         readonly attribute short resultIndex;
>         readonly attribute SpeechRecognitionResultList results;
>         readonly attribute DOMString transcript;
>         readonly attribute any interpretation;
>         readonly attribute Document emma;
>     };
>
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is ongoing discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> I also propose adding a transcript attribute to SpeechRecognitionEvent (but
> also retaining SpeechRecognitionAlternative.transcript). This provides a
> simple option for JavaScript authors to get at the full, cumulative
> transcript. I propose the definition under "5.1.8 Speech Recognition
> Event" be:
>
> transcript
>
> The transcript string represents the raw words that the user spoke. This
> is a concatenation of the first (highest confidence) alternative of all
> final SpeechRecognitionAlternative.transcript strings.
>
> /Glen Shires
>
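[Editor's note: a sketch of the concatenation the proposed cumulative transcript attribute would perform, joining the first (highest-confidence) alternative of every final result. The plain objects below stand in for a SpeechRecognitionResultList; only the attribute names `final` and `transcript` come from the draft spec.]

```javascript
// Concatenate the top alternative of every final result, as the proposed
// cumulative transcript attribute would.
function cumulativeTranscript(results) {
  return results
    .filter(r => r.final)            // only final results contribute
    .map(r => r[0].transcript)       // first (highest-confidence) alternative
    .join(" ");
}

const results = [
  { final: true,  0: { transcript: "from New York" } },
  { final: true,  0: { transcript: "to San Francisco" } },
  { final: false, 0: { transcript: "next tues" } }  // interim, excluded
];
console.log(cumulativeTranscript(results)); // "from New York to San Francisco"
```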
>
> On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I agree with having a single interpretation that represents the cumulative
> interpretation of the utterance so far.
>
> I think an example of what Jim is talking about, when the interpretation
> wouldn’t be final even if the transcript is, might be the utterance “from
> Chicago … Midway”. Maybe the grammar has a default of “Chicago O’Hare”, and
> returns “from: ORD”, because most people don’t bother to say “O’Hare”, but
> then it hears “Midway” and changes the interpretation to “from: MDW”.
> However, “from Chicago” is still the transcript.
>
> Also, the problem that Glen points out is bad enough with two slots, but
> it gets even worse as the number of slots grows. For example, you might
> have a pizza-ordering utterance with five or six ingredients (“I want
> a large pizza with mushrooms…pepperoni…onions…olives…anchovies”). It would
> be very cumbersome to have to go back through all the results to fill in
> the slots separately.
>
>
> *From:* Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> *Sent:* Wednesday, August 29, 2012 12:37 PM
> *To:* Glen Shires; Deborah Dahl
>
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>
> *Subject:* RE: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I agree with the idea of having a single interpretation. There is no
> guarantee that the different parts of the string have independent
> interpretations. For example, even if the transcription “from New York” is
> final, its interpretation may not be, since it may depend on the
> remaining parts of the utterance (that depends on how complicated the
> grammar is, of course).
>
> - Jim
>
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, August 29, 2012 11:44 AM
> *To:* Deborah Dahl
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
>
> How should interpretation work with continuous speech?
>
> Specifically, as each portion becomes final (each SpeechRecognitionResult
> with final=true), the corresponding alternative(s) for transcription and
> interpretation become final.
>
> It's easy for the JavaScript author to handle the consecutive list of
> transcription strings - simply concatenate them.
>
> However, if the interpretation returns a semantic structure (such as the
> depart/arrive example), it's unclear to me how it should be returned.
> For example, if the first final result was "from New York" and the second
> "to San Francisco", then:
>
>
> After the first final result, the list is:
>
> event.results[0].item(0).transcript = "from New York"
> event.results[0].item(0).interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item(0).transcript = "from New York"
> event.results[0].item(0).interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> event.results[1].item(0).transcript = "to San Francisco"
> event.results[1].item(0).interpretation = {
>   depart: null,
>   arrive: "San Francisco"
> };
>
>
> If so, this makes using the interpretation structure very messy for the
> author, because he needs to loop through all the results to find each
> interpretation slot that he needs.
>
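[Editor's note: a sketch of the bookkeeping this per-result design would force on authors: scan every result and keep the latest non-null value for each slot. The plain objects stand in for event.results; the helper and slot names are hypothetical.]

```javascript
// Merge per-result partial interpretations into one slot object,
// keeping the latest non-null value seen for each slot.
function mergeSlots(results) {
  const merged = {};
  for (const result of results) {
    const interp = result[0].interpretation || {};
    for (const slot of Object.keys(interp)) {
      if (interp[slot] != null) merged[slot] = interp[slot];
    }
  }
  return merged;
}

const results = [
  { 0: { interpretation: { depart: "New York", arrive: null } } },
  { 0: { interpretation: { depart: null, arrive: "San Francisco" } } }
];
console.log(mergeSlots(results)); // { depart: 'New York', arrive: 'San Francisco' }
```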
>
> I suggest that we instead consider changing the spec to provide a single
> interpretation that always represents the most current interpretation.
>
>
> After the first final result, the list is:
>
> event.results[0].item(0).transcript = "from New York"
> event.interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item(0).transcript = "from New York"
> event.results[1].item(0).transcript = "to San Francisco"
> event.interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> This not only makes it simple for the author to process the
> interpretation, it also solves the problem that the interpretation may not
> be available at the same point in time that the transcription becomes
> final. If alternative interpretations are important, then it's easy to add
> them to the interpretation structure that is returned, and this format is
> far easier for the author to process than
> multiple SpeechRecognitionAlternative.interpretations. For example:
>
> event.interpretation = {
>   depart: ["New York", "Newark"],
>   arrive: ["San Francisco", "San Bernardino"]
> };
>
> /Glen Shires
>
>
> On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I don’t think there’s a big difference in complexity in this use case, but
> here’s another one that I think might be more common.
>
> Suppose the application is something like search or composing email, and
> the transcript alone would serve the application's purposes. However, some
> implementations might also provide useful normalizations, like converting
> text numbers to digits or capitalization, that would make the dictated text
> look more like written language, and this normalization fills the
> "interpretation" slot. If the developer can count on the "interpretation"
> slot being filled by the transcript if there's nothing better, then the
> developer only has to ask for the interpretation.
>
> e.g.
>
> document.write(interpretation)
>
> vs.
>
> if (interpretation)
>     document.write(interpretation)
> else
>     document.write(transcript)
>
> which I think is simpler. The developer doesn’t have to worry about type
> checking because in this application the “interpretation” will always be a
> string.
>
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Tuesday, August 28, 2012 10:44 PM
> *To:* Deborah Dahl
>
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
>
> Debbie,
>
> Looking at this from the viewpoint of what is easier for the JavaScript
> author, I believe:
>
> SpeechRecognitionAlternative.transcript must return a string (even if an
> empty string). Thus, an author wishing to use the transcript doesn't need
> to perform any type checking.
>
> SpeechRecognitionAlternative.interpretation must be null if no
> interpretation is provided. This simplifies the required conditional by
> eliminating type checking. For example:
>
> transcript = "from New York to San Francisco";
>
> interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> if (interpretation)  // this works whether interpretation is an object or null
>   document.write("Depart " + interpretation.depart + " and arrive in " +
>                  interpretation.arrive);
> else
>   document.write(transcript);
>
> Whereas, if the interpretation contains the transcript string when no
> interpretation is present, the condition would have to be:
>
> if (typeof(interpretation) != "string")
>
> which is more complex and more prone to errors (e.g. if you spell "string"
> wrong).
>
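[Editor's note: a runnable sketch of the null-convention conditional described above, with a hypothetical `render` helper in place of `document.write` so the behavior can be checked outside a browser.]

```javascript
// With null as the "no interpretation" value, a plain truthiness test
// suffices; no typeof check is needed. The helper is hypothetical.
function render(transcript, interpretation) {
  if (interpretation)  // works whether interpretation is an object or null
    return "Depart " + interpretation.depart +
           " and arrive in " + interpretation.arrive;
  return transcript;   // fall back to the raw transcript
}

console.log(render("from New York to San Francisco",
                   { depart: "New York", arrive: "San Francisco" }));
// "Depart New York and arrive in San Francisco"
console.log(render("from New York to San Francisco", null));
// "from New York to San Francisco"
```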
>
> /Glen Shires
>
>
> On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
>
> In the case of an SLM, if there’s a classification, I think the
> classification would be the interpretation. If the SLM is just used to
> improve dictation results, without classification, then the interpretation
> would be whatever we say it is – either the transcript, null, or undefined.
>
> My point about stating that the “transcript” attribute is required or
> optional wasn’t about whether there is a use case where it would be
> desirable not to return a transcript. My point was that the spec needs to
> be explicit about the optional/required status of every feature. It’s
> fine to postpone that decision if there’s any controversy, but if we all
> agree, we might as well add it to the spec.
>
> I can’t think of any cases where it would be bad to return a transcript,
> although I can think of use cases where the developer wouldn’t choose to do
> anything with the transcript (like multi-slot form filling – all the end
> user really needs to see is the correctly filled slots).
>
> Debbie
>
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Thursday, August 23, 2012 3:48 AM
> *To:* Deborah Dahl
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
>
> I agree with the need to support SLMs. This implies that, in some cases,
> the author may not specify semantic information, and thus there would not
> be an interpretation.
>
> Under what circumstances (except error conditions) do you envision that a
> transcript would not be returned?
>
> /Glen Shires
>
>
> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Actually, Satish's comment made me think that we probably have a few other
> things to agree on before we decide what the default value of
> "interpretation" should be, because we haven't settled on a lot of issues
> about what is required and what is optional.
> Satish's argument is only relevant if we require SRGS/SISR for grammars and
> semantic interpretation, but we actually don't require either of those
> right now, so it doesn't matter what they do as far as the current spec
> goes. (Although it's worth noting that SRGS doesn't require anything to be
> returned at all, even the transcript:
> http://www.w3.org/TR/speech-grammar/#S1.10)
> So I think we first need to decide and explicitly state in the spec:
>
> 1. What we want to say about grammar formats (which are allowed/required,
> or is the grammar format open). It probably needs to be somewhat open
> because of SLMs.
> 2. What we want to say about semantic tag formats (are proprietary formats
> allowed, is SISR required, or is the semantic tag format just whatever the
> grammar format uses).
> 3. Is "transcript" required?
> 4. Is "interpretation" required?
>
> Debbie
>
>
> > -----Original Message-----
> > From: Hans Wennborg [mailto:hwennborg@google.com]
> > Sent: Tuesday, August 21, 2012 12:50 PM
> > To: Glen Shires
> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org
> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > interpretation can't be provided
> >
> > Björn, Deborah, are you ok with this as well? I.e. that the spec
> > shouldn't mandate a "default" value for the interpretation attribute,
> > but rather return null when there is no interpretation?
> >
> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> wrote:
> > > I agree, return "null" (not "undefined") in such cases.
> > >
> > >
> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote:
> > >>
> > >> > I may have missed something, but I don’t see in the spec where it
> > >> > says that “interpretation” is optional.
> > >>
> > >> Developers specify the interpretation value with SISR, and if they
> > >> don't specify one, there is no 'default' interpretation available. In
> > >> that sense it is optional, because grammars don't mandate it. So I
> > >> think this API shouldn't mandate providing a default value if the
> > >> engine did not provide one, and should return null in such cases.
>
>
> > >>
> > >> Cheers
> > >> Satish
> > >>
> > >>
> > >>
> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl
> > >> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> I may have missed something, but I don’t see in the spec where it
> > >>> says that “interpretation” is optional.
> > >>>
> > >>> From: Satish S [mailto:satish@google.com]
> > >>> Sent: Thursday, August 16, 2012 7:38 PM
> > >>> To: Deborah Dahl
> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org
> > >>>
> > >>>
> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> interpretation can't be provided
> > >>>
> > >>>
> > >>>
> > >>> 'interpretation' is an optional attribute because engines are not
> > >>> required to provide an interpretation on their own (unlike
> > >>> 'transcript'). As such, I think it should return null when there
> > >>> isn't a value to be returned, as that is the convention for optional
> > >>> attributes, not 'undefined' or a copy of some other attribute.
> > >>>
> > >>>
> > >>>
> > >>> If an engine chooses to return the same value for 'transcript' and
> > >>> 'interpretation', or do textnorm of the value and return it in
> > >>> 'interpretation', that will be an implementation detail of the
> > >>> engine. But in the absence of any such value for 'interpretation'
> > >>> from the engine, I think the UA should return null.
> > >>>
> > >>>
> > >>> Cheers
> > >>> Satish
> > >>>
> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl
> > >>> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> That's a good point. There are lots of use cases where some simple
> > >>> normalization is extremely useful, as in your example, or collapsing
> > >>> all the ways that the user might say "yes" or "no". However, you
> > >>> could say that once the implementation has modified or normalized
> > >>> the transcript, that means it has some kind of interpretation, so
> > >>> putting a normalized value in the interpretation slot should be
> > >>> fine. Nothing says that the "interpretation" has to be a
> > >>> particularly fine-grained interpretation, or one with a lot of
> > >>> structure.
> > >>>
> > >>>
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
> > >>> > To: Hans Wennborg
> > >>> > Cc: Conversational; public-speech-api@w3.org
> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> > interpretation can't be provided
> > >>> >
> > >>> > I'm not sure that it has to be that strict in requiring that the
> > >>> > value is the same as the "transcript" attribute. For example, an
> > >>> > engine might return the words recognized in "transcript" and apply
> > >>> > some extra textnorm to the text that it returns in
> > >>> > "interpretation", e.g. converting digit words to digits ("three"
> > >>> > -> "3"). Not sure if that's useful though.
> > >>> >
> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
> > >>> > <hwennborg@google.com> wrote:
> > >>> > > Yes, the raw text is in the 'transcript' attribute.
> > >>> > >
> > >>> > > The description of 'interpretation' is currently: "The
> > >>> > > interpretation represents the semantic meaning from what the
> > >>> > > user said. This might be determined, for instance, through the
> > >>> > > SISR specification of semantics in a grammar."
> > >>> > >
> > >>> > > I propose that we change it to "The interpretation represents
> > >>> > > the semantic meaning from what the user said. This might be
> > >>> > > determined, for instance, through the SISR specification of
> > >>> > > semantics in a grammar. If no semantic meaning can be
> > >>> > > determined, the attribute must be a string with the same value
> > >>> > > as the 'transcript' attribute."
> > >>> > >
> > >>> > > Does that sound good to everyone? If there are no objections,
> > >>> > > I'll make the change to the draft next week.
> > >>> > >
> > >>> > > Thanks,
> > >>> > > Hans
> > >>> > >
> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
> > >>> > > <dahl@conversational-technologies.com> wrote:
> > >>> > >> I can't check the spec right now, but I assume there's already
> > >>> > >> an attribute that currently is defined to contain the raw
> > >>> > >> text. So I think we could say that if there's no
> > >>> > >> interpretation, the value of the interpretation attribute
> > >>> > >> would be the same as the value of the "raw string" attribute.
> > >>> > >>
> > >>> > >> Sent from my iPhone
> > >>> > >>
> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg
> > <hwennborg@google.com>
> > >>> > wrote:
> > >>> > >>
> > >>> > >>> OK, that would work I suppose.
> > >>> > >>>
> > >>> > >>> What would the spec text look like? Something like "[...] If no
> > >>> > >>> semantic meaning can be determined, the attribute will be a
> > >>> > >>> string representing the raw words that the user spoke."?
> > >>> > >>>
> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert
> > >>> > <bringert@google.com> wrote:
> > >>> > >>>> Yeah, that would be my preference too.
> > >>> > >>>>
> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
> > >>> > >>>>> If there isn't an interpretation, I think it would make the
> > >>> > >>>>> most sense for the attribute to contain the literal string
> > >>> > >>>>> result. I believe this is what happens in VoiceXML.
> > >>> > >>>>>
> > >>> > >>>>>> My question is: for implementations that cannot provide an
> > >>> > >>>>>> interpretation, what should the attribute's value be? null?
> > >>> > undefined?
> > >>> >
> > >>> >
> > >>> >
> > >>> > --
> > >>> > Bjorn Bringert
> > >>> > Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> > >>> > Palace Road, London, SW1W 9TQ
> > >>> > Registered in England Number: 3977902
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >
>
>
Received on Tuesday, 4 September 2012 18:05:14 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:02:28 UTC