- From: Glen Shires <gshires@google.com>
- Date: Wed, 29 Aug 2012 11:56:42 -0700
- To: Deborah Dahl <dahl@conversational-technologies.com>
- Cc: Jim Barnett <Jim.Barnett@genesyslab.com>, Hans Wennborg <hwennborg@google.com>, Satish S <satish@google.com>, Bjorn Bringert <bringert@google.com>, public-speech-api@w3.org
- Message-ID: <CAEE5bcgNhX3pcF0nAQBy2dB4qgVaB05pPiRsvB9=QY_bdoBEbQ@mail.gmail.com>
I believe the same is true for emma: a single, cumulative emma document is
preferable to multiple emma documents.

I propose the following changes to the spec:

  Delete SpeechRecognitionAlternative.interpretation
  Delete SpeechRecognitionResult.emma
  Add interpretation and emma attributes to SpeechRecognitionEvent.

Specifically:

  interface SpeechRecognitionEvent : Event {
    readonly attribute short resultIndex;
    readonly attribute SpeechRecognitionResultList results;
    readonly attribute DOMString transcript;
    readonly attribute any interpretation;
    readonly attribute Document emma;
  };

I do not propose to change the definitions of interpretation and emma at
this time (because there is on-going discussion), but rather to simply move
their current definitions to the new heading: "5.1.8 Speech Recognition
Event".

I also propose adding a transcript attribute to SpeechRecognitionEvent (but
also retaining SpeechRecognitionAlternative.transcript). This provides a
simple option for JavaScript authors to get at the full, cumulative
transcript. I propose the definition under "5.1.8 Speech Recognition Event"
be:

  transcript
    The transcript string represents the raw words that the user spoke.
    This is a concatenation of the first (highest confidence) alternative
    of all final SpeechRecognitionAlternative.transcript strings.

/Glen Shires

On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl
<dahl@conversational-technologies.com> wrote:

> I agree with having a single interpretation that represents the cumulative
> interpretation of the utterance so far.
>
> I think an example of what Jim is talking about, when the interpretation
> wouldn’t be final even if the transcript is, might be the utterance “from
> Chicago … Midway”. Maybe the grammar has a default of “Chicago O’Hare”, and
> returns “from: ORD”, because most people don’t bother to say “O’Hare”, but
> then it hears “Midway” and changes the interpretation to “from: MDW”.
> However, “from Chicago” is still the transcript.
>
> Also the problem that Glen points out is bad enough with two slots, but
> it gets even worse as the number of slots gets bigger. For example, you
> might have a pizza-ordering utterance with five or six ingredients (“I want
> a large pizza with mushrooms…pepperoni…onions…olives…anchovies”). It would
> be very cumbersome to have to go back through all the results to fill in
> the slots separately.
>
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Wednesday, August 29, 2012 12:37 PM
> To: Glen Shires; Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: RE: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I agree with the idea of having a single interpretation. There is no
> guarantee that the different parts of the string have independent
> interpretations. For example, even if the transcription “from New York” is
> final, its interpretation may not be, since it may depend on the
> remaining parts of the utterance (that depends on how complicated the
> grammar is, of course.)
>
> - Jim
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Wednesday, August 29, 2012 11:44 AM
> To: Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> How should interpretation work with continuous speech?
>
> Specifically, as each portion becomes final (each SpeechRecognitionResult
> with final=true), the corresponding alternative(s) for transcription and
> interpretation become final.
>
> It's easy for the JavaScript author to handle the consecutive list of
> transcription strings - simply concatenate them.
>
> However, if the interpretation returns a semantic structure (such as the
> depart/arrive example), it's unclear to me how they should be returned.
> For example, if the first final result was "from New York" and the second
> "to San Francisco", then:
>
> After the first final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[0].item[0].interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[0].item[0].interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> event.results[1].item[0].transcription = "to San Francisco"
> event.results[1].item[0].interpretation = {
>   depart: null,
>   arrive: "San Francisco"
> };
>
> If so, this makes using the interpretation structure very messy for the
> author because he needs to loop through all the results to find each
> interpretation slot that he needs.
>
> I suggest that we instead consider changing the spec to provide a single
> interpretation that always represents the most current interpretation.
>
> After the first final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[1].item[0].transcription = "to San Francisco"
> event.interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> This not only makes it simple for the author to process the
> interpretation, it also solves the problem that the interpretation may not
> be available at the same point in time that the transcription becomes
> final. If alternative interpretations are important, then it's easy to add
> them to the interpretation structure that is returned, and this format is
> far easier for the author to process than multiple
> SpeechRecognitionAlternative.interpretations. For example:
>
> event.interpretation = {
>   depart: ["New York", "Newark"],
>   arrive: ["San Francisco", "San Bernardino"],
> };
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> I don’t think there’s a big difference in complexity in this use case, but
> here’s another one, that I think might be more common.
>
> Suppose the application is something like search or composing email, and
> the transcript alone would serve the application's purposes. However, some
> implementations might also provide useful normalizations like converting
> text numbers to digits or capitalization that would make the dictated text
> look more like written language, and this normalization fills the
> "interpretation slot".
> If the developer can count on the "interpretation"
> slot being filled by the transcript if there's nothing better, then the
> developer only has to ask for the interpretation.
>
> e.g.
>
>   document.write(interpretation)
>
> vs.
>
>   if (interpretation)
>     document.write(interpretation)
>   else
>     document.write(transcript)
>
> which I think is simpler. The developer doesn’t have to worry about type
> checking because in this application the “interpretation” will always be a
> string.
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Tuesday, August 28, 2012 10:44 PM
> To: Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
>
> Looking at this from the viewpoint of what is easier for the JavaScript
> author, I believe:
>
> SpeechRecognitionAlternative.transcript must return a string (even if an
> empty string). Thus, an author wishing to use the transcript doesn't need
> to perform any type checking.
>
> SpeechRecognitionAlternative.interpretation must be null if no
> interpretation is provided. This simplifies the required conditional by
> eliminating type checking.
> For example:
>
> transcript = "from New York to San Francisco";
>
> interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> if (interpretation) // this works if interpretation is present or if null
>   document.write("Depart " + interpretation.depart + " and arrive in " +
>     interpretation.arrive);
> else
>   document.write(transcript);
>
> Whereas, if the interpretation contains the transcript string when no
> interpretation is present, the condition would have to be:
>
> if (typeof(interpretation) != "string")
>
> Which is more complex, and more prone to errors (e.g. if you spell
> "string" wrong).
>
> /Glen Shires
>
> On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
>
> In the case of an SLM, if there’s a classification, I think the
> classification would be the interpretation. If the SLM is just used to
> improve dictation results, without classification, then the interpretation
> would be whatever we say it is – either the transcript, null, or undefined.
>
> My point about stating that the “transcript” attribute is required or
> optional wasn’t whether or not there was a use case where it would be
> desirable not to return a transcript. My point was that the spec needs to
> be explicit about the optional/required status of every feature. It’s
> fine to postpone that decision if there’s any controversy, but if we all
> agree we might as well add it to the spec.
>
> I can’t think of any cases where it would be bad to return a transcript,
> although I can think of use cases where the developer wouldn’t choose to do
> anything with the transcript (like multi-slot form filling – all the end
> user really needs to see is the correctly filled slots).
>
> Debbie
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Thursday, August 23, 2012 3:48 AM
> To: Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
>
> I agree with the need to support SLMs. This implies that, in some cases,
> the author may not specify semantic information, and thus there would not
> be an interpretation.
>
> Under what circumstances (except error conditions) do you envision that a
> transcript would not be returned?
>
> /Glen Shires
>
> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> Actually, Satish's comment made me think that we probably have a few other
> things to agree on before we decide what the default value of
> "interpretation" should be, because we haven't settled on a lot of issues
> about what is required and what is optional.
> Satish's argument is only relevant if we require SRGS/SISR for grammars and
> semantic interpretation, but we actually don't require either of those
> right now, so it doesn't matter what they do as far as the current spec
> goes. (Although it's worth noting that SRGS doesn't require anything to be
> returned at all, even the transcript
> http://www.w3.org/TR/speech-grammar/#S1.10).
> So I think we first need to decide and explicitly state in the spec ---
>
> 1. what we want to say about grammar formats (which are allowed/required,
>    or is the grammar format open). It probably needs to be somewhat open
>    because of SLM's.
> 2. what we want to say about semantic tag formats (are proprietary formats
>    allowed, is SISR required or is the semantic tag format just whatever
>    the grammar format uses)
> 3. is "transcript" required?
> 4. is "interpretation" required?
>
> Debbie
>
> > -----Original Message-----
> > From: Hans Wennborg [mailto:hwennborg@google.com]
> > Sent: Tuesday, August 21, 2012 12:50 PM
> > To: Glen Shires
> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org
> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > interpretation can't be provided
> >
> > Björn, Deborah, are you ok with this as well? I.e. that the spec
> > shouldn't mandate a "default" value for the interpretation attribute,
> > but rather return null when there is no interpretation?
> >
> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> wrote:
> > > I agree, return "null" (not "undefined") in such cases.
> > >
> > >
> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote:
> > >>
> > >> > I may have missed something, but I don’t see in the spec where it
> > >> > says that “interpretation” is optional.
> > >>
> > >> Developers specify the interpretation value with SISR and if they
> > >> don't specify there is no 'default' interpretation available. In that
> > >> sense it is optional because grammars don't mandate it. So I think
> > >> this API shouldn't mandate providing a default value if the engine
> > >> did not provide one, and return null in such cases.
> > >>
> > >> Cheers
> > >> Satish
> > >>
> > >>
> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl
> > >> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> I may have missed something, but I don’t see in the spec where it
> > >>> says that “interpretation” is optional.
> > >>>
> > >>> From: Satish S [mailto:satish@google.com]
> > >>> Sent: Thursday, August 16, 2012 7:38 PM
> > >>> To: Deborah Dahl
> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org
> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> interpretation can't be provided
> > >>>
> > >>> 'interpretation' is an optional attribute because engines are not
> > >>> required to provide an interpretation on their own (unlike
> > >>> 'transcript'). As such I think it should return null when there
> > >>> isn't a value to be returned as that is the convention for optional
> > >>> attributes, not 'undefined' or a copy of some other attribute.
> > >>>
> > >>> If an engine chooses to return the same value for 'transcript' and
> > >>> 'interpretation' or do textnorm of the value and return in
> > >>> 'interpretation' that will be an implementation detail of the
> > >>> engine. But in the absence of any such value for 'interpretation'
> > >>> from the engine I think the UA should return null.
> > >>>
> > >>> Cheers
> > >>> Satish
> > >>>
> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl
> > >>> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> That's a good point. There are lots of use cases where some simple
> > >>> normalization is extremely useful, as in your example, or collapsing
> > >>> all the ways that the user might say "yes" or "no". However, you
> > >>> could say that once the implementation has modified or normalized
> > >>> the transcript that means it has some kind of interpretation, so
> > >>> putting a normalized value in the interpretation slot should be
> > >>> fine. Nothing says that the "interpretation" has to be a
> > >>> particularly fine-grained interpretation, or one with a lot of
> > >>> structure.
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
> > >>> > To: Hans Wennborg
> > >>> > Cc: Conversational; public-speech-api@w3.org
> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> > interpretation can't be provided
> > >>> >
> > >>> > I'm not sure that it has to be that strict in requiring that the
> > >>> > value is the same as the "transcript" attribute. For example, an
> > >>> > engine might return the words recognized in "transcript" and apply
> > >>> > some extra textnorm to the text that it returns in
> > >>> > "interpretation", e.g. converting digit words to digits ("three"
> > >>> > -> "3"). Not sure if that's useful though.
> > >>> >
> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
> > >>> > <hwennborg@google.com> wrote:
> > >>> > > Yes, the raw text is in the 'transcript' attribute.
> > >>> > >
> > >>> > > The description of 'interpretation' is currently: "The
> > >>> > > interpretation represents the semantic meaning from what the
> > >>> > > user said. This might be determined, for instance, through the
> > >>> > > SISR specification of semantics in a grammar."
> > >>> > >
> > >>> > > I propose that we change it to "The interpretation represents
> > >>> > > the semantic meaning from what the user said. This might be
> > >>> > > determined, for instance, through the SISR specification of
> > >>> > > semantics in a grammar. If no semantic meaning can be
> > >>> > > determined, the attribute must be a string with the same value
> > >>> > > as the 'transcript' attribute."
> > >>> > >
> > >>> > > Does that sound good to everyone? If there are no objections,
> > >>> > > I'll make the change to the draft next week.
> > >>> > >
> > >>> > > Thanks,
> > >>> > > Hans
> > >>> > >
> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
> > >>> > > <dahl@conversational-technologies.com> wrote:
> > >>> > >> I can't check the spec right now, but I assume there's already
> > >>> > >> an attribute that currently is defined to contain the raw
> > >>> > >> text. So I think we could say that if there's no
> > >>> > >> interpretation the value of the interpretation attribute would
> > >>> > >> be the same as the value of the "raw string" attribute,
> > >>> > >>
> > >>> > >> Sent from my iPhone
> > >>> > >>
> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg
> > >>> > >> <hwennborg@google.com> wrote:
> > >>> > >>
> > >>> > >>> OK, that would work I suppose.
> > >>> > >>>
> > >>> > >>> What would the spec text look like? Something like "[...] If
> > >>> > >>> no semantic meaning can be determined, the attribute will be
> > >>> > >>> a string representing the raw words that the user spoke."?
> > >>> > >>>
> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert
> > >>> > >>> <bringert@google.com> wrote:
> > >>> > >>>> Yeah, that would be my preference too.
> > >>> > >>>>
> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
> > >>> > >>>>> If there isn't an interpretation I think it would make the
> > >>> > >>>>> most sense for the attribute to contain the literal string
> > >>> > >>>>> result. I believe this is what happens in VoiceXML.
> > >>> > >>>>>
> > >>> > >>>>>> My question is: for implementations that cannot provide an
> > >>> > >>>>>> interpretation, what should the attribute's value be? null?
> > >>> > >>>>>> undefined?
> > >>> >
> > >>> > --
> > >>> > Bjorn Bringert
> > >>> > Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> > >>> > Palace Road, London, SW1W 9TQ
> > >>> > Registered in England Number: 3977902
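
A minimal sketch of how the cumulative transcript and the null-interpretation fallback discussed in this thread might look to a JavaScript author. The helper names cumulativeTranscript and describeTrip, the mock results data, the `final` flag on each result, and the single-space separator are assumptions made for illustration; none of them come from the spec draft.

```javascript
// Hypothetical helper (not part of the spec draft): concatenates the first
// (highest confidence) alternative of every final result.
// Assumes each result is array-like, ordered by confidence, and carries a
// boolean `final` flag as used in this thread.
function cumulativeTranscript(results) {
  const parts = [];
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    if (result.final && result.length > 0) {
      parts.push(result[0].transcript);
    }
  }
  // The separator is an assumption; the thread does not specify one.
  return parts.join(" ");
}

// Hypothetical helper: use the interpretation when present, otherwise fall
// back to the transcript. Relies on interpretation being null (not a copy
// of the transcript) when the engine provides none.
function describeTrip(event) {
  const interp = event.interpretation;
  if (interp) {
    return "Depart " + interp.depart + " and arrive in " + interp.arrive;
  }
  return event.transcript;
}

// Mock data standing in for a SpeechRecognitionResultList.
const results = [
  Object.assign(
    [{ transcript: "from New York" }, { transcript: "from Newark" }],
    { final: true }
  ),
  Object.assign([{ transcript: "to San Francisco" }], { final: true }),
  Object.assign([{ transcript: "on Tues" }], { final: false }), // interim, skipped
];

const event = {
  results: results,
  transcript: cumulativeTranscript(results),
  interpretation: { depart: "New York", arrive: "San Francisco" },
};

console.log(event.transcript);
console.log(describeTrip(event));
console.log(describeTrip({ transcript: event.transcript, interpretation: null }));
```

With a single event-level interpretation, the author reads one object instead of looping over every result to collect slots, which is the simplification argued for above.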
Received on Wednesday, 29 August 2012 18:57:53 UTC