- From: Glen Shires <gshires@google.com>
- Date: Mon, 17 Sep 2012 22:31:32 -0700
- To: Deborah Dahl <dahl@conversational-technologies.com>
- Cc: Jim Barnett <Jim.Barnett@genesyslab.com>, Hans Wennborg <hwennborg@google.com>, Satish S <satish@google.com>, Bjorn Bringert <bringert@google.com>, public-speech-api@w3.org
- Message-ID: <CAEE5bcj0X8Xf22vfOQYBwV=oKQJfUbWMV1DNcrUYKGaMObROAA@mail.gmail.com>
I've updated the spec with this change: https://dvcs.w3.org/hg/speech-api/rev/5e2d87e7d977

As always, the current draft spec is at: http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

On Thu, Sep 13, 2012 at 10:42 AM, Glen Shires <gshires@google.com> wrote:

Debbie,
Yes, I like the text you propose. If there's no disagreement, I'll add it to the spec on Monday.

On Thu, Sep 13, 2012 at 10:36 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

Having a more specific error like "TAG_FORMAT_NOT_SUPPORTED" would be more informative, but I think using BAD_GRAMMAR is ok. If so, the text should probably say something like "There was an error in the speech recognition grammar or semantic tags, or the grammar format or tag format is unsupported."

On Thursday, September 13, 2012 12:40 PM, Glen Shires wrote:

The spec already defines SpeechRecognitionError BAD_GRAMMAR. I propose we use this same error for bad tag formats, since they're so closely related (and in fact there may be some edge cases in which it's not clear whether the error should be classified as a grammar error or a semantic tag error).

The current definition in the spec for BAD_GRAMMAR is:
"There was an error in the speech recognition grammar."

I propose changing this to:
"There was an error in the speech recognition grammar or semantic tags."

/Glen Shires

On Thu, Sep 13, 2012 at 8:53 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

The example of the author supplying semantics that the recognizer can't interpret is, I think, Bjorn's "open question B" in his email:
http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html

I proposed that this situation should raise an error in this email, but I don't think there's been any other discussion, so we should discuss it at some point:
http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0072.html

On Wednesday, September 12, 2012 5:58 PM, Glen Shires wrote:

> ...any use cases where the interpretation is of interest to the developer and it's not known whether the interpretation is an object or a string. What would be an example of that?

An example is: if the author supplies semantics that the recognizer can't interpret, then the recognizer might return a normalized result.

> I also think that the third use case would be very rare, since it would involve asking the user to make a decision about whether they want a normalized or non-normalized version of the result, and it's not clear when the user would actually be interested in making that kind of choice.

If the user is shown alternatives, one option might be normalized. I provided an example of this, where the non-normalized version might be preferred by the user:

    transcript: "Like I've done one million times before."
    normalized: "Like I've done 1,000,000 times before."

I understand that this may be a rare use case, but regardless, I still don't know of any use case in which returning a copy of the transcript is preferable to null.

I'd prefer that we put the specific behavior in the spec, but if all we can agree on at this point is: "The group is currently discussing options for the value of the interpretation attribute when no interpretation has been returned by the recognizer. Current options are 'null' or a copy of the transcript.", then I will agree to that.

I too would like to hear others' opinions.

/Glen Shires

On Wed, Sep 12, 2012 at 2:16 PM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

I'm not sure I can think of any use cases where the interpretation is of interest to the developer and it's not known whether the interpretation is an object or a string. What would be an example of that?
I also think that the third use case would be very rare, since it would involve asking the user to make a decision about whether they want a normalized or non-normalized version of the result, and it's not clear when the user would actually be interested in making that kind of choice.

I think it would be good at this point to get some other opinions about this.

Also, in the interest of moving forward, I think it's perfectly fine to have language in the spec that just says "The group is currently discussing options for the value of the interpretation attribute when no interpretation has been returned by the recognizer. Current options are 'null' or a copy of the transcript." This may also serve to encourage external comments from developers who have an opinion about this.

On Wednesday, September 12, 2012 4:21 PM, Glen Shires wrote:

I disagree with the code [1] for this use case. Since the interpretation may be a non-string object, good defensive coding practice is:

    if (typeof(interpretation) == "string") {
      document.write(interpretation);
    } else {
      document.write(transcript);
    }

Thus, for this use case it doesn't matter: the code is identical for either definition of what the interpretation attribute returns when there is no interpretation. (That is, whether interpretation is defined to return null or a copy of the transcript.)

In contrast, [2] shows a use case where it does matter: the code is simpler and less error-prone if the interpretation attribute returns null when there is no interpretation.

Below is a third use case where it also matters. Since interpretation may return a normalized string, an author may wish to show both the normalized string and the transcript string to the user, and let them choose which one to use. For example:

    interpretation: "Like I've done 1,000,000 times before."
    transcript: "Like I've done one million times before."

(The author might also add transcript alternatives to this choice list, but I'll omit that to keep the example simple.)

For the option where interpretation returns a copy of the transcript when there is no interpretation:

    var choices = [];
    if (typeof(interpretation) == "string" && interpretation != transcript) {
      choices.push(interpretation);
    }
    choices.push(transcript);
    if (choices.length > 1) {
      AskUserToDisambiguate(choices);
    }

For the option where interpretation returns null when there is no interpretation:

    var choices = [];
    if (typeof(interpretation) == "string") {
      choices.push(interpretation);
    }
    choices.push(transcript);
    if (choices.length > 1) {
      AskUserToDisambiguate(choices);
    }

So there are clearly use cases in which returning null allows for simpler and less error-prone code, whereas it's not clear to me that there is any use case in which returning a copy of the transcript simplifies the code.
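(Editor's aside: the two conventions Glen compares can be restated as self-contained, testable helpers. This is a sketch only; the function names and the choice-list shape are illustrative, not part of the draft spec.)

```javascript
// Editor's sketch of the two conventions under discussion. Each helper
// builds the list of choices that would be shown to the user.

// Convention 1: when there is no interpretation, the recognizer returns a
// copy of the transcript, so equal strings must be filtered out explicitly.
function choicesIfCopyReturned(interpretation, transcript) {
  var choices = [];
  if (typeof interpretation === "string" && interpretation !== transcript) {
    choices.push(interpretation);
  }
  choices.push(transcript);
  return choices;
}

// Convention 2: when there is no interpretation, the recognizer returns
// null, so a plain type check is enough.
function choicesIfNullReturned(interpretation, transcript) {
  var choices = [];
  if (typeof interpretation === "string") {
    choices.push(interpretation);
  }
  choices.push(transcript);
  return choices;
}
```

With Glen's example, both helpers yield a two-element list when a normalization is present and a one-element list when it is not; the difference is only the extra equality clause that convention 1 forces on the author.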
Together, these use cases cover all the scenarios:

- where there is an interpretation that contains a complex object,
- where there is an interpretation that contains a string, and
- where there is no interpretation.

So, I continue to propose adding one additional sentence:

"If no interpretation is available, this attribute MUST return null."

If there's no disagreement, I will add this sentence to the spec on Friday.

(Please note, this is very different from the reasoning behind requiring the emma attribute to never be null. "emma" is always of type "Document" and always returns a valid EMMA document, not simply a copy of some other attribute. Here, "interpretation" is an attribute of type "any", so it must always be type-checked.)

/Glen Shires

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
[2] http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0107.html

On Wed, Sep 12, 2012 at 6:44 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

I would still prefer that the interpretation slot always be filled, at least by the transcript if there's nothing better. I think that the use case I described in http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html is going to be pretty common, and in that case being able to rely on something other than null being in the interpretation field is very convenient. On the other hand, if the application really depends on the availability of a more complex interpretation object, the developer is going to have to make sure that a specific speech service that can provide that kind of interpretation is used. In that case, I don't see how there can be a transcript without an interpretation.

On a related topic, I think we should also include some of the points that Bjorn made about support for grammars and semantic tagging as discussed in this thread:
http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html

On Tuesday, September 11, 2012 8:33 PM, Glen Shires wrote:

The current definition of interpretation in the spec is:

"The interpretation represents the semantic meaning from what the user said. This might be determined, for instance, through the SISR specification of semantics in a grammar."

I propose adding an additional sentence at the end:

"If no interpretation is available, this attribute MUST return null."

My reasoning (based on this lengthy thread):

- If an SISR (or similar) interpretation is available, the UA must return it.
- If an alternative string interpretation is available, such as a normalization, the UA may return it.
- If there's no more information available than in the transcript, then null provides a very simple way for the author to check for this condition. The author avoids a clumsy conditional (typeof(interpretation) != "string"), and can easily distinguish the case where the interpretation returns a normalization string from the case where it has just copied the transcript verbatim.
- "null" is more commonly used than "undefined" in these circumstances.

If there's no disagreement, I will add this sentence to the spec on Thursday.

/Glen Shires

On Tue, Sep 4, 2012 at 11:04 AM, Glen Shires <gshires@google.com> wrote:

I've updated the spec with this change (moved the interpretation and emma attributes to SpeechRecognitionEvent):
https://dvcs.w3.org/hg/speech-api/rev/48a58e558fcc

As always, the current draft spec is at:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

/Glen Shires

On Thu, Aug 30, 2012 at 10:07 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

Thanks for the clarification, that makes sense. When each new version of the EMMA document arrives in a SpeechRecognitionEvent, the author can just repopulate all the earlier form fields, as well as the newest one, with the data from the most recent EMMA version.

On Thursday, August 30, 2012 12:45 PM, Glen Shires wrote:

Debbie,
In my proposal, the single EMMA document is updated with each new SpeechRecognitionEvent. Therefore, in continuous = true mode, the EMMA document is populated in "real time" as the user speaks each field, without waiting for the user to finish speaking. A JavaScript author could use this to populate a form in "real time".

Also, I now realize that SpeechRecognitionEvent.transcript is not useful in continuous = false mode, because only one final result is returned and thus SpeechRecognitionEvent.results[0].transcript always contains the same string (no concatenation needed). I also don't see it as very useful in continuous = true mode, because if an author is using this mode, it's presumably because he wants to show continuous final results (and perhaps interim results as well). Since the author is already writing code to concatenate results to display them in real time, there's little or no savings with this new attribute. So I now retract that portion of my proposal.

So to clarify, here are my proposed changes to the spec. If there's no disagreement by the end of the week I'll add them to the spec:

Delete SpeechRecognitionAlternative.interpretation

Delete SpeechRecognitionResult.emma

Add interpretation and emma attributes to SpeechRecognitionEvent. Specifically:

    interface SpeechRecognitionEvent : Event {
      readonly attribute short resultIndex;
      readonly attribute SpeechRecognitionResultList results;
      readonly attribute any interpretation;
      readonly attribute Document emma;
    };

I do not propose to change the definitions of interpretation and emma at this time (because there is ongoing discussion), but rather to simply move their current definitions to the new heading: "5.1.8 Speech Recognition Event".

/Glen Shires

On Thu, Aug 30, 2012 at 8:36 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

Hi Glen,
I agree that a single cumulative EMMA document is generally preferable to multiple EMMA documents, although I think there might be use cases where it would be convenient to have both. For example, you want to populate a form in real time as the user speaks each field, without waiting for the user to finish speaking. After the result is final the application could send the cumulative result to the server, but seeing the interim results would be helpful feedback to the user.
Debbie

On Wednesday, August 29, 2012 2:57 PM, Glen Shires wrote:

I believe the same is true for emma: a single, cumulative EMMA document is preferable to multiple EMMA documents.
I propose the following changes to the spec:

Delete SpeechRecognitionAlternative.interpretation

Delete SpeechRecognitionResult.emma

Add interpretation and emma attributes to SpeechRecognitionEvent. Specifically:

    interface SpeechRecognitionEvent : Event {
      readonly attribute short resultIndex;
      readonly attribute SpeechRecognitionResultList results;
      readonly attribute DOMString transcript;
      readonly attribute any interpretation;
      readonly attribute Document emma;
    };

I do not propose to change the definitions of interpretation and emma at this time (because there is ongoing discussion), but rather to simply move their current definitions to the new heading: "5.1.8 Speech Recognition Event".

I also propose adding a transcript attribute to SpeechRecognitionEvent (while also retaining SpeechRecognitionAlternative.transcript). This provides a simple option for JavaScript authors to get the full, cumulative transcript. I propose the definition under "5.1.8 Speech Recognition Event" be:

transcript
The transcript string represents the raw words that the user spoke. This is a concatenation of the first (highest confidence) alternative of all final SpeechRecognitionAlternative.transcript strings.

/Glen Shires

On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

I agree with having a single interpretation that represents the cumulative interpretation of the utterance so far.

I think an example of what Jim is talking about, where the interpretation wouldn't be final even if the transcript is, might be the utterance "from Chicago … Midway". Maybe the grammar has a default of "Chicago O'Hare" and returns "from: ORD", because most people don't bother to say "O'Hare", but then it hears "Midway" and changes the interpretation to "from: MDW". However, "from Chicago" is still the transcript.

Also, the problem that Glen points out is bad enough with two slots, but it gets even worse as the number of slots grows. For example, you might have a pizza-ordering utterance with five or six ingredients ("I want a large pizza with mushrooms…pepperoni…onions…olives…anchovies"). It would be very cumbersome to have to go back through all the results to fill in the slots separately.

On Wednesday, August 29, 2012 12:37 PM, Jim Barnett wrote:

I agree with the idea of having a single interpretation. There is no guarantee that the different parts of the string have independent interpretations. For example, even if the transcription "from New York" is final, its interpretation may not be, since it may depend on the remaining parts of the utterance (that depends on how complicated the grammar is, of course).

- Jim

On Wednesday, August 29, 2012 11:44 AM, Glen Shires wrote:

How should interpretation work with continuous speech?

Specifically, as each portion becomes final (each SpeechRecognitionResult with final=true), the corresponding alternative(s) for transcription and interpretation become final.

It's easy for the JavaScript author to handle the consecutive list of transcription strings: simply concatenate them.

However, if the interpretation returns a semantic structure (such as the depart/arrive example), it's unclear to me how the interpretations should be returned. For example, if the first final result was "from New York" and the second "to San Francisco", then:

After the first final result, the list is:

    event.results[0].item[0].transcription = "from New York"
    event.results[0].item[0].interpretation = {
      depart: "New York",
      arrive: null
    };

After the second final result, the list is:

    event.results[0].item[0].transcription = "from New York"
    event.results[0].item[0].interpretation = {
      depart: "New York",
      arrive: null
    };

    event.results[1].item[0].transcription = "to San Francisco"
    event.results[1].item[0].interpretation = {
      depart: null,
      arrive: "San Francisco"
    };

If so, this makes using the interpretation structure very messy for the author, because he needs to loop through all the results to find each interpretation slot that he needs.

I suggest that we instead consider changing the spec to provide a single interpretation that always represents the most current interpretation.

After the first final result, the list is:

    event.results[0].item[0].transcription = "from New York"
    event.interpretation = {
      depart: "New York",
      arrive: null
    };

After the second final result, the list is:

    event.results[0].item[0].transcription = "from New York"
    event.results[1].item[0].transcription = "to San Francisco"
    event.interpretation = {
      depart: "New York",
      arrive: "San Francisco"
    };

This not only makes it simple for the author to process the interpretation, it also solves the problem that the interpretation may not be available at the same point in time that the transcription becomes final. If alternative interpretations are important, then it's easy to add them to the interpretation structure that is returned, and this format is far easier for the author to process than multiple SpeechRecognitionAlternative.interpretations. For example:

    event.interpretation = {
      depart: ["New York", "Newark"],
      arrive: ["San Francisco", "San Bernardino"]
    };

/Glen Shires

On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

I don't think there's a big difference in complexity in this use case, but here's another one that I think might be more common.

Suppose the application is something like search or composing email, and the transcript alone would serve the application's purposes. However, some implementations might also provide useful normalizations, like converting text numbers to digits or adding capitalization that would make the dictated text look more like written language, and this normalization fills the interpretation slot. If the developer can count on the interpretation slot being filled by the transcript when there's nothing better, then the developer only has to ask for the interpretation, e.g.:

    document.write(interpretation);

vs.

    if (interpretation)
      document.write(interpretation);
    else
      document.write(transcript);

which I think is simpler. The developer doesn't have to worry about type checking, because in this application the interpretation will always be a string.

On Tuesday, August 28, 2012 10:44 PM, Glen Shires wrote:

Debbie,
Looking at this from the viewpoint of what is easier for the JavaScript author, I believe:

SpeechRecognitionAlternative.transcript must return a string (even if an empty string). Thus, an author wishing to use the transcript doesn't need to perform any type checking.

SpeechRecognitionAlternative.interpretation must be null if no interpretation is provided. This simplifies the required conditional by eliminating type checking. For example:

    transcript = "from New York to San Francisco";

    interpretation = {
      depart: "New York",
      arrive: "San Francisco"
    };

    if (interpretation) // this works whether interpretation is an object or null
      document.write("Depart " + interpretation.depart + " and arrive in " + interpretation.arrive);
    else
      document.write(transcript);

Whereas, if the interpretation contains the transcript string when no interpretation is present, the condition would have to be:

    if (typeof(interpretation) != "string")

which is more complex and more prone to errors (e.g. if you spell "string" wrong).

/Glen Shires

On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl <dahl@conversational-technologies.com> wrote:

Hi Glen,
In the case of an SLM, if there's a classification, I think the classification would be the interpretation. If the SLM is just used to improve dictation results, without classification, then the interpretation would be whatever we say it is: either the transcript, null, or undefined.

My point about stating that the "transcript" attribute is required or optional wasn't about whether there is a use case where it would be desirable not to return a transcript. My point was that the spec needs to be explicit about the optional/required status of every feature. It's fine to postpone that decision if there's any controversy, but if we all agree we might as well add it to the spec.
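(Editor's aside: Glen's depart/arrive example above can be restated as a small, testable helper. A sketch only; the function name is mine, and the null-returning convention it assumes is the one Glen proposes, not yet settled spec text.)

```javascript
// Editor's sketch of Glen's null-check example: if a structured
// interpretation is present, format it; otherwise fall back to the raw
// transcript. Assumes interpretation is either null or an object with
// depart/arrive fields, as in Glen's example.
function describeTrip(interpretation, transcript) {
  if (interpretation) {
    return "Depart " + interpretation.depart +
           " and arrive in " + interpretation.arrive;
  }
  return transcript;
}
```

The single truthiness test is the whole point of Glen's argument: under the null convention no typeof check is needed, and misspelling "string" in a typeof comparison cannot silently break the fallback path.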
**** >> >> I can’t think of any cases where it would be bad to return a transcript, >> although I can think of use cases where the developer wouldn’t choose to do >> anything with the transcript (like multi-slot form filling – all the end >> user really needs to see is the correctly filled slots). **** >> >> Debbie**** >> >> **** >> >> *From:* Glen Shires [mailto:gshires@google.com] >> *Sent:* Thursday, August 23, 2012 3:48 AM >> *To:* Deborah Dahl >> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org** >> ** >> >> >> *Subject:* Re: SpeechRecognitionAlternative.interpretation when >> interpretation can't be provided**** >> >> **** >> >> Debbie,**** >> >> I agree with the need to support SLMs. This implies that, in some cases, >> the author may not specify semantic information, and thus there would not >> be an interpretation.**** >> >> **** >> >> Under what circumstances (except error conditions) do you envision that a >> transcript would not be returned?**** >> >> **** >> >> /Glen Shires**** >> >> **** >> >> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl < >> dahl@conversational-technologies.com> wrote:**** >> >> Actually, Satish's comment made me think that we probably have a few other >> things to agree on before we decide what the default value of >> "interpretation" should be, because we haven't settled on a lot of issues >> about what is required and what is optional. >> Satish's argument is only relevant if we require SRGS/SISR for grammars >> and >> semantic interpretation, but we actually don't require either of those >> right >> now, so it doesn't matter what they do as far as the current spec goes. >> (Although it's worth noting that SRGS doesn't require anything to be >> returned at all, even the transcript >> http://www.w3.org/TR/speech-grammar/#S1.10). >> So I think we first need to decide and explicitly state in the spec --- >> >> 1. 
what we want to say about grammar formats (which are allowed/required, >> or >> is the grammar format open). It probably needs to be somewhat open because >> of SLM's. >> 2. what we want to say about semantic tag formats (are proprietary formats >> allowed, is SISR required or is the semantic tag format just whatever the >> grammar format uses) >> 3. is "transcript" required? >> 4. is "interpretation" required? >> >> Debbie**** >> >> >> > -----Original Message----- >> > From: Hans Wennborg [mailto:hwennborg@google.com] >> > Sent: Tuesday, August 21, 2012 12:50 PM >> > To: Glen Shires >> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org >> > Subject: Re: SpeechRecognitionAlternative.interpretation when >> > interpretation can't be provided >> > >> > Björn, Deborah, are you ok with this as well? I.e. that the spec >> > shouldn't mandate a "default" value for the interpretation attribute, >> > but rather return null when there is no interpretation? >> > >> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> >> wrote: >> > > I agree, return "null" (not "undefined") in such cases. >> > > >> > > >> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote: >> > >> >> > >> > I may have missed something, but I don’t see in the spec where it >> says >> > >> > that “interpretation” is optional. >> > >> >> > >> Developers specify the interpretation value with SISR and if they >> don't >> > >> specify there is no 'default' interpretation available. In that sense >> it is >> > >> optional because grammars don't mandate it. So I think this API >> shouldn't >> > >> mandate providing a default value if the engine did not provide one, >> and >> > >> return null in such cases. 
>> >> >> >> > >> >> > >> Cheers >> > >> Satish >> > >> >> > >> >> > >> >> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl >> > >> <dahl@conversational-technologies.com> wrote: >> > >>> >> > >>> I may have missed something, but I don’t see in the spec where it >> says >> > >>> that “interpretation” is optional. >> > >>> >> > >>> From: Satish S [mailto:satish@google.com] >> > >>> Sent: Thursday, August 16, 2012 7:38 PM >> > >>> To: Deborah Dahl >> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org >> > >>> >> > >>> >> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when >> > >>> interpretation can't be provided >> > >>> >> > >>> >> > >>> >> > >>> 'interpretation' is an optional attribute because engines are not >> > >>> required to provide an interpretation on their own (unlike >> 'transcript'). >> > As >> > >>> such I think it should return null when there isn't a value to be >> returned >> > >>> as that is the convention for optional attributes, not 'undefined' >> or >> a >> > copy >> > >>> of some other attribute. >> > >>> >> > >>> >> > >>> >> > >>> If an engine chooses to return the same value for 'transcript' and >> > >>> 'interpretation' or do textnorm of the value and return in >> 'interpretation' >> > >>> that will be an implementation detail of the engine. But in the >> absence >> > of >> > >>> any such value for 'interpretation' from the engine I think the UA >> should >> > >>> return null. >> > >>> >> > >>> >> > >>> Cheers >> > >>> Satish >> > >>> >> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl >> > >>> <dahl@conversational-technologies.com> wrote: >> > >>> >> > >>> That's a good point. There are lots of use cases where some simple >> > >>> normalization is extremely useful, as in your example, or collapsing >> all >> > the >> > >>> ways that the user might say "yes" or "no". 
However, you could say that
>> > >>> once the implementation has modified or normalized the transcript,
>> > >>> that means it has some kind of interpretation, so putting a
>> > >>> normalized value in the interpretation slot should be fine.
>> > >>> Nothing says that the "interpretation" has to be a particularly
>> > >>> fine-grained interpretation, or one with a lot of structure.
>> > >>>
>> > >>> > -----Original Message-----
>> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
>> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
>> > >>> > To: Hans Wennborg
>> > >>> > Cc: Conversational; public-speech-api@w3.org
>> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
>> > >>> > interpretation can't be provided
>> > >>> >
>> > >>> > I'm not sure that it has to be that strict in requiring that the
>> > >>> > value is the same as the "transcript" attribute. For example, an
>> > >>> > engine might return the words recognized in "transcript" and
>> > >>> > apply some extra textnorm to the text that it returns in
>> > >>> > "interpretation", e.g. converting digit words to digits
>> > >>> > ("three" -> "3"). Not sure if that's useful though.
>> > >>> >
>> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
>> > >>> > <hwennborg@google.com> wrote:
>> > >>> > > Yes, the raw text is in the 'transcript' attribute.
>> > >>> > >
>> > >>> > > The description of 'interpretation' is currently: "The
>> > >>> > > interpretation represents the semantic meaning from what the
>> > >>> > > user said. This might be determined, for instance, through the
>> > >>> > > SISR specification of semantics in a grammar."
>> > >>> > >
>> > >>> > > I propose that we change it to "The interpretation represents
>> > >>> > > the semantic meaning from what the user said. This might be
>> > >>> > > determined, for instance, through the SISR specification of
>> > >>> > > semantics in a grammar.
If no semantic meaning can be
>> > >>> > > determined, the attribute must be a string with the same value
>> > >>> > > as the 'transcript' attribute."
>> > >>> > >
>> > >>> > > Does that sound good to everyone? If there are no objections,
>> > >>> > > I'll make the change to the draft next week.
>> > >>> > >
>> > >>> > > Thanks,
>> > >>> > > Hans
>> > >>> > >
>> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
>> > >>> > > <dahl@conversational-technologies.com> wrote:
>> > >>> > >> I can't check the spec right now, but I assume there's already
>> > >>> > >> an attribute that currently is defined to contain the raw
>> > >>> > >> text. So I think we could say that if there's no
>> > >>> > >> interpretation, the value of the interpretation attribute
>> > >>> > >> would be the same as the value of the "raw string" attribute.
>> > >>> > >>
>> > >>> > >> Sent from my iPhone
>> > >>> > >>
>> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg
>> > >>> > >> <hwennborg@google.com> wrote:
>> > >>> > >>
>> > >>> > >>> OK, that would work I suppose.
>> > >>> > >>>
>> > >>> > >>> What would the spec text look like? Something like "[...] If
>> > >>> > >>> no semantic meaning can be determined, the attribute will be
>> > >>> > >>> a string representing the raw words that the user spoke."?
>> > >>> > >>>
>> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert
>> > >>> > >>> <bringert@google.com> wrote:
>> > >>> > >>>> Yeah, that would be my preference too.
>> > >>> > >>>>
>> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
>> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
>> > >>> > >>>>> If there isn't an interpretation, I think it would make the
>> > >>> > >>>>> most sense for the attribute to contain the literal string
>> > >>> > >>>>> result. I believe this is what happens in VoiceXML.
>> > >>> > >>>>>
>> > >>> > >>>>>> My question is: for implementations that cannot provide
>> > >>> > >>>>>> an interpretation, what should the attribute's value be?
>> > >>> > >>>>>> null? undefined?
>> > >>> >
>> > >>> > --
>> > >>> > Bjorn Bringert
>> > >>> > Google UK Limited, Registered Office: Belgrave House, 76
>> > >>> > Buckingham Palace Road, London, SW1W 9TQ
>> > >>> > Registered in England Number: 3977902
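[Editor's note: the thread above settled on returning null (not undefined, and not a copy of 'transcript') when the engine provides no interpretation. The sketch below, which is not from the thread, shows how a page might consume that behavior; the result objects are hypothetical stand-ins for the `SpeechRecognitionAlternative` instances a browser would deliver in a recognition result event.]

```javascript
// Hypothetical SpeechRecognitionAlternative-like results. In a browser these
// would come from a SpeechRecognition 'result' event; here they are mocked so
// the fallback logic can be shown on its own.
const alternatives = [
  // Engine produced a SISR interpretation for this alternative.
  { transcript: "flight to Boston", interpretation: { intent: "book", city: "Boston" } },
  // Engine produced no interpretation, so per the thread's conclusion the
  // attribute is null rather than undefined or a transcript copy.
  { transcript: "three", interpretation: null },
];

function semanticValue(alt) {
  // Fall back to the raw transcript when no semantic meaning was determined.
  return alt.interpretation !== null ? alt.interpretation : alt.transcript;
}

console.log(semanticValue(alternatives[0])); // the SISR interpretation object
console.log(semanticValue(alternatives[1])); // prints: three
```

Note the deliberate `!== null` check: an application that tested truthiness instead would also discard legitimate interpretations such as `0` or `""` that a grammar's semantic tags might produce.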
Received on Tuesday, 18 September 2012 05:32:45 UTC