- From: Glen Shires <gshires@google.com>
- Date: Thu, 13 Sep 2012 10:42:09 -0700
- To: Deborah Dahl <dahl@conversational-technologies.com>
- Cc: Jim Barnett <Jim.Barnett@genesyslab.com>, Hans Wennborg <hwennborg@google.com>, Satish S <satish@google.com>, Bjorn Bringert <bringert@google.com>, public-speech-api@w3.org
- Message-ID: <CAEE5bciXvsQ88RLVa-vfYU+96MqNE7Z1sZrtCQOCGHK9RpKKQQ@mail.gmail.com>
Debbie,
Yes, I like the text you propose. If there's no disagreement, I'll add it
to the spec on Monday.

On Thu, Sep 13, 2012 at 10:36 AM, Deborah Dahl
<dahl@conversational-technologies.com> wrote:

> Having a more specific error like “TAG_FORMAT_NOT_SUPPORTED” would be more
> informative, but I think using BAD_GRAMMAR is ok. If so, the text should
> probably say something like "There was an error in the speech recognition
> grammar or semantic tags, or the grammar format or tag format is
> unsupported."
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Thursday, September 13, 2012 12:40 PM
> To: Deborah Dahl
> Cc: Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> The spec already defines SpeechRecognitionError BAD_GRAMMAR. I propose we
> use this same error for bad tag formats, since they're so related (and in
> fact there may be some edge-cases in which it's not clear whether the error
> is parsed as a grammar error or a semantic tag error.)
>
> The current definition in the spec for BAD_GRAMMAR is:
>
> "There was an error in the speech recognition grammar."
>
> I propose changing this to:
>
> "There was an error in the speech recognition grammar or semantic tags."
>
> /Glen Shires
>
> On Thu, Sep 13, 2012 at 8:53 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> The example of the author supplying semantics that the recognizer can’t
> interpret I think is Bjorn’s “open question B” in his email --
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html
>
> I proposed that this situation should raise an error in this email, but I
> don’t think there’s been any other discussion, so we should discuss this at
> some point.
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0072.html
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Wednesday, September 12, 2012 5:58 PM
> To: Deborah Dahl
> Cc: Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> > ...any use cases where the interpretation is of interest to the
> > developer and it’s not known whether the interpretation is an object or a
> > string. What would be an example of that?
>
> An example is: if the author supplies semantics that the recognizer can't
> interpret, then the recognizer might return a normalized result.
>
> > I also think that the third use case would be very rare, since it would
> > involve asking the user to make a decision about whether they want a
> > normalized or non-normalized version of the result, and it’s not clear when
> > the user would actually be interested in making that kind of choice.
>
> If the user is shown alternatives, one option might be normalized. I
> provided an example of this, where the non-normalized might be preferred by
> the user.
>
>   transcript: "Like I've done one million times before."
>   normalized: "Like I've done 1,000,000 times before."
>
> I understand that this may be a rare use case, but regardless of that, I
> still don't know of any use case in which returning a copy of the
> transcript is preferable to null.
>
> I'd prefer that we put the specific behavior in the spec, but if all we
> can agree on at this point is: “The group is currently discussing options
> for the value of the interpretation attribute when no interpretation has
> been returned by the recognizer.
> Current options are ‘null’ or a copy of
> the transcript.”, then I will agree to that.
>
> I too would like to hear others' opinions.
>
> /Glen Shires
>
> On Wed, Sep 12, 2012 at 2:16 PM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> I’m not sure I can think of any use cases where the interpretation is of
> interest to the developer and it’s not known whether the interpretation is
> an object or a string. What would be an example of that? I also think that
> the third use case would be very rare, since it would involve asking the
> user to make a decision about whether they want a normalized or
> non-normalized version of the result, and it’s not clear when the user
> would actually be interested in making that kind of choice.
>
> I think it would be good at this point to get some other opinions about
> this.
>
> Also, in the interest of moving forward, I think it’s perfectly fine to
> have language in the spec that just says “The group is currently discussing
> options for the value of the interpretation attribute when no
> interpretation has been returned by the recognizer. Current options are
> ‘null’ or a copy of the transcript.” This may also serve to encourage
> external comments from developers who have an opinion about this.
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Wednesday, September 12, 2012 4:21 PM
> To: Deborah Dahl
> Cc: Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I disagree with the code [1] for this use case.
> Since the interpretation
> may be a non-string object, good defensive coding practice is:
>
>   if (typeof(interpretation) == "string") {
>     document.write(interpretation);
>   } else {
>     document.write(transcript);
>   }
>
> Thus, for this use case it doesn't matter. The code is identical for
> either definition of what the interpretation attribute returns when there
> is no interpretation. (That is, whether interpretation is defined to return
> null or to return a copy of transcript.)
>
> In contrast, [2] shows a use case where it does matter: the code is
> simpler and less error-prone if the interpretation attribute returns null
> when there is no interpretation.
>
> Below is a third use case where it also matters. Since interpretation may
> return a normalized string, an author may wish to show both the normalized
> string and the transcript string to the user, and let them choose which one
> to use. For example:
>
>   interpretation: "Like I've done 1,000,000 times before."
>   transcript: "Like I've done one million times before."
>
> (The author might also add transcript alternatives to this choice list,
> but I'll omit that to keep the example simple.)
>
> For the option where interpretation returns a copy of transcript when
> there is no interpretation:
>
>   var choices = [];
>   if (typeof(interpretation) == "string" && interpretation != transcript) {
>     choices.push(interpretation);
>   }
>   choices.push(transcript);
>   if (choices.length > 1) {
>     AskUserToDisambiguate(choices);
>   }
>
> For the option where interpretation returns a null when there is no
> interpretation:
>
>   var choices = [];
>   if (typeof(interpretation) == "string") {
>     choices.push(interpretation);
>   }
>   choices.push(transcript);
>   if (choices.length > 1) {
>     AskUserToDisambiguate(choices);
>   }
>
> So there are clearly use cases in which returning null allows for simpler
> and less error-prone code, whereas it's not clear to me there is any use
> case in which returning a copy of the transcript simplifies the code.
> Together, these use cases cover all the scenarios:
>
> - where there is an interpretation that contains a complex object
> - where there is an interpretation that contains a string, and
> - where there is no interpretation.
>
> So, I continue to propose adding one additional sentence.
>
> "If no interpretation is available, this attribute MUST return null."
>
> If there's no disagreement, I will add this sentence to the spec on Friday.
>
> (Please note, this is very different than the reasoning behind requiring
> the emma attribute to never be null. "emma" is always of type "Document"
> and always returns a valid emma document, not simply a copy of some other
> attribute. Here, "interpretation" is an attribute of type "any", so it
> must always be type-checked.)
>
> /Glen Shires
>
> [1] http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
> [2] http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0107.html
>
> On Wed, Sep 12, 2012 at 6:44 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> I would still prefer that the interpretation slot always be filled, at
> least by the transcript if there’s nothing better. I think that the use
> case I described in
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
> is going to be pretty common and in that case being able to rely on
> something other than null being in the interpretation field is very
> convenient.
> On the other hand, if the application really depends on the
> availability of a more complex interpretation object, the developer is
> going to have to make sure that a specific speech service that can provide
> that kind of interpretation is used. In that case, I don’t see how there
> can be a transcript without an interpretation.
>
> On a related topic, I think we should also include some of the points that
> Bjorn made about support for grammars and semantic tagging as discussed in
> this thread --
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html.
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Tuesday, September 11, 2012 8:33 PM
> To: Deborah Dahl; Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> The current definition of interpretation in the spec is:
>
> "The interpretation represents the semantic meaning from what the user
> said. This might be determined, for instance, through the SISR
> specification of semantics in a grammar."
>
> I propose adding an additional sentence at the end.
>
> "If no interpretation is available, this attribute MUST return null."
>
> My reasoning (based on this lengthy thread):
>
> - If an SISR / etc interpretation is available, the UA must return it.
> - If an alternative string interpretation is available, such as
>   a normalization, the UA may return it.
> - If there's no more information available than in the transcript,
>   then "null" provides a very simple way for the author to check for this
>   condition.
>   The author avoids a clumsy conditional (typeof(interpretation)
>   != "string") and can easily distinguish the case where the
>   interpretation returns a normalized string from the case where it
>   merely copied the transcript verbatim.
> - "null" is more commonly used than "undefined" in these circumstances.
>
> If there's no disagreement, I will add this sentence to the spec on
> Thursday.
>
> /Glen Shires
>
> On Tue, Sep 4, 2012 at 11:04 AM, Glen Shires <gshires@google.com> wrote:
>
> I've updated the spec with this change (moved interpretation and emma
> attributes to SpeechRecognitionEvent):
> https://dvcs.w3.org/hg/speech-api/rev/48a58e558fcc
>
> As always, the current draft spec is at:
> http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
>
> /Glen Shires
>
> On Thu, Aug 30, 2012 at 10:07 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> Thanks for the clarification, that makes sense. When each new version of
> the emma document arrives in a SpeechRecognitionEvent, the author can just
> repopulate all the earlier form fields, as well as the newest one, with
> the data from the most recent emma version.
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Thursday, August 30, 2012 12:45 PM
> To: Deborah Dahl
> Cc: Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
> In my proposal, the single emma document is updated with each
> new SpeechRecognitionEvent. Therefore, in continuous = true mode, the emma
> document is populated in "real time" as the user speaks each field, without
> waiting for the user to finish speaking.
> A JavaScript author could use this
> to populate a form in "real time".
>
> Also, I now realize that the SpeechRecognitionEvent.transcript is not
> useful in continuous = false mode because only one final result is
> returned, and thus SpeechRecognitionEvent.results[0].transcript always
> contains the same string (no concatenation needed). I also don't see it as
> very useful in continuous = true mode because if an author is using this
> mode, it's presumably because he wants to show continuous final results
> (and perhaps interim as well). Since the author is already writing code to
> concatenate results to display them in "real time", there's little or no
> savings with this new attribute. So I now retract that portion of my
> proposal.
>
> So to clarify, here are my proposed changes to the spec. If there's no
> disagreement by the end of the week I'll add them to the spec...
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>   interface SpeechRecognitionEvent : Event {
>     readonly attribute short resultIndex;
>     readonly attribute SpeechRecognitionResultList results;
>     readonly attribute any interpretation;
>     readonly attribute Document emma;
>   };
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is on-going discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> /Glen Shires
>
> On Thu, Aug 30, 2012 at 8:36 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
> I agree that a single cumulative emma document is preferable to multiple
> emma documents in general, although I think that there might be use cases
> where it would be convenient to have both. For example, you want to
> populate a form in real time as the user speaks each field, without waiting
> for the user to finish speaking. After the result is final the application
> could send the cumulative result to the server, but seeing the interim
> results would be helpful feedback to the user.
> Debbie
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Wednesday, August 29, 2012 2:57 PM
> To: Deborah Dahl
> Cc: Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I believe the same is true for emma: a single, cumulative emma document is
> preferable to multiple emma documents.
>
> I propose the following changes to the spec:
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>   interface SpeechRecognitionEvent : Event {
>     readonly attribute short resultIndex;
>     readonly attribute SpeechRecognitionResultList results;
>     readonly attribute DOMString transcript;
>     readonly attribute any interpretation;
>     readonly attribute Document emma;
>   };
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is on-going discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> I also propose adding a transcript attribute to SpeechRecognitionEvent (but
> also retaining SpeechRecognitionAlternative.transcript). This provides a
> simple option for JavaScript authors to get at the full, cumulative
> transcript. I propose the definition under "5.1.8 Speech Recognition
> Event" be:
>
> transcript
> The transcript string represents the raw words that the user spoke. This
> is a concatenation of the first (highest confidence) alternative of all
> final SpeechRecognitionAlternative.transcript strings.
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> I agree with having a single interpretation that represents the cumulative
> interpretation of the utterance so far.
>
> I think an example of what Jim is talking about, when the interpretation
> wouldn’t be final even if the transcript is, might be the utterance “from
> Chicago … Midway”. Maybe the grammar has a default of “Chicago O’Hare”, and
> returns “from: ORD”, because most people don’t bother to say “O’Hare”, but
> then it hears “Midway” and changes the interpretation to “from: MDW”.
> However, “from Chicago” is still the transcript.
>
> Also the problem that Glen points out is bad enough with two slots, but
> it gets even worse as the number of slots gets bigger.
> For example, you
> might have a pizza-ordering utterance with five or six ingredients (“I want
> a large pizza with mushrooms…pepperoni…onions…olives…anchovies”). It would
> be very cumbersome to have to go back through all the results to fill in
> the slots separately.
>
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Wednesday, August 29, 2012 12:37 PM
> To: Glen Shires; Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: RE: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I agree with the idea of having a single interpretation. There is no
> guarantee that the different parts of the string have independent
> interpretations. For example, even if the transcription “from New York” is
> final, its interpretation may not be, since it may depend on the
> remaining parts of the utterance (that depends on how complicated the
> grammar is, of course.)
>
> - Jim
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Wednesday, August 29, 2012 11:44 AM
> To: Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> How should interpretation work with continuous speech?
>
> Specifically, as each portion becomes final (each SpeechRecognitionResult
> with final=true), the corresponding alternative(s) for transcription and
> interpretation become final.
>
> It's easy for the JavaScript author to handle the consecutive list of
> transcription strings - simply concatenate them.
>
> However, if the interpretation returns a semantic structure (such as the
> depart/arrive example), it's unclear to me how they should be returned.
> For example, if the first final result was "from New York" and the second
> "to San Francisco", then:
>
> After the first final result, the list is:
>
>   event.results[0].item[0].transcription = "from New York"
>   event.results[0].item[0].interpretation = {
>     depart: "New York",
>     arrive: null
>   };
>
> After the second final result, the list is:
>
>   event.results[0].item[0].transcription = "from New York"
>   event.results[0].item[0].interpretation = {
>     depart: "New York",
>     arrive: null
>   };
>
>   event.results[1].item[0].transcription = "to San Francisco"
>   event.results[1].item[0].interpretation = {
>     depart: null,
>     arrive: "San Francisco"
>   };
>
> If so, this makes using the interpretation structure very messy for the
> author because he needs to loop through all the results to find each
> interpretation slot that he needs.
>
> I suggest that we instead consider changing the spec to provide a single
> interpretation that always represents the most current interpretation.
>
> After the first final result, the list is:
>
>   event.results[0].item[0].transcription = "from New York"
>   event.interpretation = {
>     depart: "New York",
>     arrive: null
>   };
>
> After the second final result, the list is:
>
>   event.results[0].item[0].transcription = "from New York"
>   event.results[1].item[0].transcription = "to San Francisco"
>   event.interpretation = {
>     depart: "New York",
>     arrive: "San Francisco"
>   };
>
> This not only makes it simple for the author to process the
> interpretation, it also solves the problem that the interpretation may not
> be available at the same point in time that the transcription becomes
> final.
> If alternative interpretations are important, then it's easy to add
> them to the interpretation structure that is returned, and this format is
> far easier for the author to process than
> multiple SpeechRecognitionAlternative.interpretations. For example:
>
>   event.interpretation = {
>     depart: ["New York", "Newark"],
>     arrive: ["San Francisco", "San Bernardino"],
>   };
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> I don’t think there’s a big difference in complexity in this use case, but
> here’s another one, that I think might be more common.
>
> Suppose the application is something like search or composing email, and
> the transcript alone would serve the application's purposes. However, some
> implementations might also provide useful normalizations like converting
> text numbers to digits or capitalization that would make the dictated text
> look more like written language, and this normalization fills the
> "interpretation slot". If the developer can count on the "interpretation"
> slot being filled by the transcript if there's nothing better, then the
> developer only has to ask for the interpretation.
>
> e.g.
>
>   document.write(interpretation)
>
> vs.
>
>   if (interpretation)
>     document.write(interpretation)
>   else
>     document.write(transcript)
>
> The first form is the one I think is simpler.
> The developer doesn’t have to worry about type
> checking because in this application the “interpretation” will always be a
> string.
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Tuesday, August 28, 2012 10:44 PM
> To: Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
> Looking at this from the viewpoint of what is easier for the JavaScript
> author, I believe:
>
> SpeechRecognitionAlternative.transcript must return a string (even if an
> empty string). Thus, an author wishing to use the transcript doesn't need
> to perform any type checking.
>
> SpeechRecognitionAlternative.interpretation must be null if no
> interpretation is provided. This simplifies the required conditional by
> eliminating type checking. For example:
>
>   transcript = "from New York to San Francisco";
>
>   interpretation = {
>     depart: "New York",
>     arrive: "San Francisco"
>   };
>
>   if (interpretation)  // this works if interpretation is present or if null
>     document.write("Depart " + interpretation.depart + " and arrive in " +
>                    interpretation.arrive);
>   else
>     document.write(transcript);
>
> Whereas, if the interpretation contains the transcript string when no
> interpretation is present, the condition would have to be:
>
>   if (typeof(interpretation) != "string")
>
> Which is more complex, and more prone to errors (e.g. if one misspells
> "string").
>
> /Glen Shires
>
> On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
> In the case of an SLM, if there’s a classification, I think the
> classification would be the interpretation.
> If the SLM is just used to
> improve dictation results, without classification, then the interpretation
> would be whatever we say it is: either the transcript, null, or undefined.
>
> My point about stating that the “transcript” attribute is required or
> optional wasn’t whether or not there was a use case where it would be
> desirable not to return a transcript. My point was that the spec needs to
> be explicit about the optional/required status of every feature. It’s
> fine to postpone that decision if there’s any controversy, but if we all
> agree we might as well add it to the spec.
>
> I can’t think of any cases where it would be bad to return a transcript,
> although I can think of use cases where the developer wouldn’t choose to do
> anything with the transcript (like multi-slot form filling: all the end
> user really needs to see is the correctly filled slots).
>
> Debbie
>
> From: Glen Shires [mailto:gshires@google.com]
> Sent: Thursday, August 23, 2012 3:48 AM
> To: Deborah Dahl
> Cc: Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> Subject: Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
> I agree with the need to support SLMs. This implies that, in some cases,
> the author may not specify semantic information, and thus there would not
> be an interpretation.
>
> Under what circumstances (except error conditions) do you envision that a
> transcript would not be returned?
>
> /Glen Shires
>
> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl
> <dahl@conversational-technologies.com> wrote:
>
> Actually, Satish's comment made me think that we probably have a few other
> things to agree on before we decide what the default value of
> "interpretation" should be, because we haven't settled on a lot of issues
> about what is required and what is optional.
> Satish's argument is only relevant if we require SRGS/SISR for grammars and
> semantic interpretation, but we actually don't require either of those
> right now, so it doesn't matter what they do as far as the current spec
> goes. (Although it's worth noting that SRGS doesn't require anything to be
> returned at all, even the transcript:
> http://www.w3.org/TR/speech-grammar/#S1.10).
> So I think we first need to decide and explicitly state in the spec:
>
> 1. what we want to say about grammar formats (which are allowed/required,
>    or is the grammar format open). It probably needs to be somewhat open
>    because of SLMs.
> 2. what we want to say about semantic tag formats (are proprietary formats
>    allowed, is SISR required or is the semantic tag format just whatever
>    the grammar format uses)
> 3. is "transcript" required?
> 4. is "interpretation" required?
>
> Debbie
>
> > -----Original Message-----
> > From: Hans Wennborg [mailto:hwennborg@google.com]
> > Sent: Tuesday, August 21, 2012 12:50 PM
> > To: Glen Shires
> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org
> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > interpretation can't be provided
> >
> > Björn, Deborah, are you ok with this as well? I.e. that the spec
> > shouldn't mandate a "default" value for the interpretation attribute,
> > but rather return null when there is no interpretation?
> >
> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> wrote:
> > > I agree, return "null" (not "undefined") in such cases.
> > >
> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote:
> > >>
> > >> > I may have missed something, but I don’t see in the spec where it
> > >> > says that “interpretation” is optional.
> > >>
> > >> Developers specify the interpretation value with SISR and if they
> > >> don't specify there is no 'default' interpretation available.
> > >> In that sense it is optional because grammars don't mandate it. So I
> > >> think this API shouldn't mandate providing a default value if the
> > >> engine did not provide one, and return null in such cases.
> > >>
> > >> Cheers
> > >> Satish
> > >>
> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl
> > >> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> I may have missed something, but I don’t see in the spec where it
> > >>> says that “interpretation” is optional.
> > >>>
> > >>> From: Satish S [mailto:satish@google.com]
> > >>> Sent: Thursday, August 16, 2012 7:38 PM
> > >>> To: Deborah Dahl
> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org
> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> interpretation can't be provided
> > >>>
> > >>> 'interpretation' is an optional attribute because engines are not
> > >>> required to provide an interpretation on their own (unlike
> > >>> 'transcript'). As such I think it should return null when there isn't
> > >>> a value to be returned as that is the convention for optional
> > >>> attributes, not 'undefined' or a copy of some other attribute.
> > >>>
> > >>> If an engine chooses to return the same value for 'transcript' and
> > >>> 'interpretation' or do textnorm of the value and return in
> > >>> 'interpretation' that will be an implementation detail of the engine.
> > >>> But in the absence of any such value for 'interpretation' from the
> > >>> engine I think the UA should return null.
> > >>>
> > >>> Cheers
> > >>> Satish
> > >>>
> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl
> > >>> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> That's a good point.
> > >>> There are lots of use cases where some simple normalization is
> > >>> extremely useful, as in your example, or collapsing all the ways that
> > >>> the user might say "yes" or "no". However, you could say that once
> > >>> the implementation has modified or normalized the transcript that
> > >>> means it has some kind of interpretation, so putting a normalized
> > >>> value in the interpretation slot should be fine. Nothing says that
> > >>> the "interpretation" has to be a particularly fine-grained
> > >>> interpretation, or one with a lot of structure.
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
> > >>> > To: Hans Wennborg
> > >>> > Cc: Conversational; public-speech-api@w3.org
> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> > interpretation can't be provided
> > >>> >
> > >>> > I'm not sure that it has to be that strict in requiring that the
> > >>> > value is the same as the "transcript" attribute. For example, an
> > >>> > engine might return the words recognized in "transcript" and apply
> > >>> > some extra textnorm to the text that it returns in
> > >>> > "interpretation", e.g. converting digit words to digits
> > >>> > ("three" -> "3"). Not sure if that's useful though.
> > >>> >
> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
> > >>> > <hwennborg@google.com> wrote:
> > >>> > > Yes, the raw text is in the 'transcript' attribute.
> > >>> > >
> > >>> > > The description of 'interpretation' is currently: "The
> > >>> > > interpretation represents the semantic meaning from what the user
> > >>> > > said. This might be determined, for instance, through the SISR
> > >>> > > specification of semantics in a grammar."
> > >>> > >
> > >>> > > I propose that we change it to "The interpretation represents the
> > >>> > > semantic meaning from what the user said. This might be determined,
> > >>> > > for instance, through the SISR specification of semantics in a
> > >>> > > grammar. If no semantic meaning can be determined, the attribute must
> > >>> > > be a string with the same value as the 'transcript' attribute."
> > >>> > >
> > >>> > > Does that sound good to everyone? If there are no objections, I'll
> > >>> > > make the change to the draft next week.
> > >>> > >
> > >>> > > Thanks,
> > >>> > > Hans
> > >>> > >
> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
> > >>> > > <dahl@conversational-technologies.com> wrote:
> > >>> > >> I can't check the spec right now, but I assume there's already an
> > >>> > >> attribute that currently is defined to contain the raw text. So I think we
> > >>> > >> could say that if there's no interpretation the value of the interpretation
> > >>> > >> attribute would be the same as the value of the "raw string" attribute.
> > >>> > >>
> > >>> > >> Sent from my iPhone
> > >>> > >>
> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg <hwennborg@google.com> wrote:
> > >>> > >>
> > >>> > >>> OK, that would work I suppose.
> > >>> > >>>
> > >>> > >>> What would the spec text look like? Something like "[...] If no
> > >>> > >>> semantic meaning can be determined, the attribute will be a string
> > >>> > >>> representing the raw words that the user spoke."?
> > >>> > >>>
> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert <bringert@google.com> wrote:
> > >>> > >>>> Yeah, that would be my preference too.
> > >>> > >>>>
> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
> > >>> > >>>>> If there isn't an interpretation I think it would make the most
> > >>> > >>>>> sense for the attribute to contain the literal string result. I believe
> > >>> > >>>>> this is what happens in VoiceXML.
> > >>> > >>>>>
> > >>> > >>>>>> My question is: for implementations that cannot provide an
> > >>> > >>>>>> interpretation, what should the attribute's value be? null? undefined?
> > >>> >
> > >>> > --
> > >>> > Bjorn Bringert
> > >>> > Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> > >>> > Palace Road, London, SW1W 9TQ
> > >>> > Registered in England Number: 3977902
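For readers following the null-versus-transcript question in the quoted thread, here is a minimal sketch of how a page could get the "fall back to the transcript" behaviour itself if the UA returns null. It is only an illustration: the AlternativeLike interface and the interpretationOrTranscript helper are hypothetical names, not part of the draft, and the interface models only the two attributes discussed here.

```typescript
// Hypothetical shape modelled on the draft's SpeechRecognitionAlternative.
// Only the two attributes discussed in this thread are included.
interface AlternativeLike {
  transcript: string;
  // Whatever the engine produced (e.g. via SISR tags); null or absent if nothing.
  interpretation?: unknown;
}

// Return the engine's interpretation when there is one; otherwise fall back
// to the raw transcript, as first proposed in the thread.
function interpretationOrTranscript(alt: AlternativeLike): unknown {
  // `== null` matches both null and undefined, so the helper behaves the same
  // whichever value ends up meaning "no interpretation".
  return alt.interpretation == null ? alt.transcript : alt.interpretation;
}

// Engine supplied no interpretation: the caller sees the transcript.
const plain = interpretationOrTranscript({ transcript: "three", interpretation: null });

// Engine applied textnorm: the caller sees the normalized value.
const normalized = interpretationOrTranscript({ transcript: "three", interpretation: "3" });
```

Because the check is against null and undefined only, falsy but meaningful interpretations such as 0 or an empty string are passed through unchanged.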
Received on Thursday, 13 September 2012 17:43:23 UTC