Re: SpeechRecognitionAlternative.interpretation when interpretation can't be provided from Glen Shires on 2012-09-18 (public-speech-api@w3.org from September 2012)

From: Glen Shires <gshires@google.com>
Date: Mon, 17 Sep 2012 22:31:32 -0700
To: Deborah Dahl <dahl@conversational-technologies.com>
Cc: Jim Barnett <Jim.Barnett@genesyslab.com>, Hans Wennborg <hwennborg@google.com>, Satish S <satish@google.com>, Bjorn Bringert <bringert@google.com>, public-speech-api@w3.org
Message-ID: <CAEE5bcj0X8Xf22vfOQYBwV=oKQJfUbWMV1DNcrUYKGaMObROAA@mail.gmail.com>
I've updated the spec with this change:
https://dvcs.w3.org/hg/speech-api/rev/5e2d87e7d977

As always, the current draft spec is at:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

On Thu, Sep 13, 2012 at 10:42 AM, Glen Shires <gshires@google.com> wrote:

> Debbie,
> Yes, I like the text you propose.  If there's no disagreement, I'll add it
> to the spec on Monday.
>
>
> On Thu, Sep 13, 2012 at 10:36 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
>> Having a more specific error like “TAG_FORMAT_NOT_SUPPORTED” would be
>> more informative, but I think using BAD_GRAMMAR is ok. If so, the text
>> should probably say something like "There was an error in the speech
>> recognition grammar or semantic tags, or the grammar format or tag format
>> is unsupported."
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Thursday, September 13, 2012 12:40 PM
>>
>> *To:* Deborah Dahl
>> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
>> public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> The spec already defines SpeechRecognitionError BAD_GRAMMAR.  I propose
>> we use this same error for bad tag formats, since they're so related (and
>> in fact there may be some edge cases in which it's not clear whether the
>> error is parsed as a grammar error or a semantic tag error).
>>
>> The current definition in the spec for BAD_GRAMMAR is:
>>
>> "There was an error in the speech recognition grammar."
>>
>> I propose changing this to:
>>
>> "There was an error in the speech recognition grammar or semantic tags."
>>
>> /Glen Shires
>>
>> On Thu, Sep 13, 2012 at 8:53 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> The example of the author supplying semantics that the recognizer can’t
>> interpret is, I think, Bjorn’s “open question B” in his email:
>>
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html
>>
>> I proposed in the email linked below that this situation should raise an
>> error, but I don’t think there’s been any other discussion, so we should
>> discuss this at some point.
>>
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0072.html
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Wednesday, September 12, 2012 5:58 PM
>> *To:* Deborah Dahl
>> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
>> public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> > ...any use cases where the interpretation is of interest to the
>> > developer and it’s not known whether the interpretation is an object or a
>> > string.  What would be an example of that?
>>
>> An example is: if the author supplies semantics that the recognizer can't
>> interpret, then the recognizer might return a normalized result.
>>
>> > I also think that the third use case would be very rare, since it would
>> > involve asking the user to make a decision about whether they want a
>> > normalized or non-normalized version of the result, and it’s not clear when
>> > the user would actually be interested in making that kind of choice.
>>
>> If the user is shown alternatives, one option might be normalized. I
>> provided an example of this, where the non-normalized might be preferred by
>> the user.
>>
>>    transcript: "Like I've done one million times before."
>>    normalized: "Like I've done 1,000,000 times before."
>>
>> I understand that this may be a rare use case, but regardless of that, I
>> still don't know of any use case in which returning a copy of the
>> transcript is preferable to null.
>>
>> I'd prefer that we put the specific behavior in the spec, but if all we
>> can agree on at this point is: “The group is currently discussing options
>> for the value of the interpretation attribute when no interpretation has
>> been returned by the recognizer. Current options are ‘null’ or a copy of
>> the transcript.”, then I will agree to that.
>>
>> I too would like to hear others' opinions.
>>
>> /Glen Shires
>>
>> On Wed, Sep 12, 2012 at 2:16 PM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> I’m not sure I can think of any use cases where the interpretation is of
>> interest to the developer and it’s not known whether the interpretation is
>> an object or a string. What would be an example of that? I also think that
>> the third use case would be very rare, since it would involve asking the
>> user to make a decision about whether they want a normalized or
>> non-normalized version of the result, and it’s not clear when the user
>> would actually be interested in making that kind of choice.
>>
>> I think it would be good at this point to get some other opinions about
>> this.
>>
>> Also, in the interest of moving forward, I think it’s perfectly fine to
>> have language in the spec that just says “The group is currently discussing
>> options for the value of the interpretation attribute when no
>> interpretation has been returned by the recognizer. Current options are
>> ‘null’ or a copy of the transcript.” This may also serve to encourage
>> external comments from developers who have an opinion about this.
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Wednesday, September 12, 2012 4:21 PM
>> *To:* Deborah Dahl
>> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
>> public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> I disagree with the code [1] for this use case. Since the interpretation
>> may be a non-string object, good defensive coding practice is:
>>
>>   if (typeof(interpretation) == "string") {
>>     document.write(interpretation);
>>   } else {
>>     document.write(transcript);
>>   }
>>
>> Thus, for this use case it doesn't matter. The code is identical for
>> either definition of what the interpretation attribute returns when there
>> is no interpretation. (That is, whether interpretation is defined to return
>> null or to return a copy of transcript.)
>>
>> In contrast, [2] shows a use case where it does matter: the code is
>> simpler and less error-prone if the interpretation attribute returns null
>> when there is no interpretation.
>>
>> Below is a third use case where it also matters. Since interpretation may
>> return a normalized string, an author may wish to show both the normalized
>> string and the transcript string to the user, and let them choose which one
>> to use.  For example:
>>
>>    interpretation: "Like I've done 1,000,000 times before."
>>    transcript: "Like I've done one million times before."
>>
>> (The author might also add transcript alternatives to this choice list,
>> but I'll omit that to keep the example simple.)
>>
>> For the option where interpretation returns a copy of transcript when
>> there is no interpretation:
>>
>>   var choices = [];
>>   if (typeof(interpretation) == "string" && interpretation != transcript) {
>>     choices.push(interpretation);
>>   }
>>   choices.push(transcript);
>>   if (choices.length > 1) {
>>     AskUserToDisambiguate(choices);
>>   }
>>
>> For the option where interpretation returns null when there is no
>> interpretation:
>>
>>   var choices = [];
>>   if (typeof(interpretation) == "string") {
>>     choices.push(interpretation);
>>   }
>>   choices.push(transcript);
>>   if (choices.length > 1) {
>>     AskUserToDisambiguate(choices);
>>   }
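
The two options can be compared side by side in a small runnable sketch. This is an illustration only: the function names choicesWhenCopied and choicesWhenNull are hypothetical, and the values are passed in as plain arguments rather than read from a real recognition event.

```javascript
// Option A (assumed behavior): the UA copies the transcript into
// interpretation when the recognizer supplies none, so the author has to
// filter out the duplicate explicitly.
function choicesWhenCopied(interpretation, transcript) {
  var choices = [];
  if (typeof interpretation === "string" && interpretation !== transcript) {
    choices.push(interpretation);
  }
  choices.push(transcript);
  return choices;
}

// Option B (assumed behavior): the UA returns null when the recognizer
// supplies no interpretation, so a plain type check is enough.
function choicesWhenNull(interpretation, transcript) {
  var choices = [];
  if (typeof interpretation === "string") {
    choices.push(interpretation);
  }
  choices.push(transcript);
  return choices;
}

var transcript = "Like I've done one million times before.";
var normalized = "Like I've done 1,000,000 times before.";

console.log(choicesWhenCopied(normalized, transcript).length); // 2
console.log(choicesWhenCopied(transcript, transcript).length); // 1
console.log(choicesWhenNull(normalized, transcript).length);   // 2
console.log(choicesWhenNull(null, transcript).length);         // 1
```

Both helpers give the same answers; the difference is only the extra string comparison that Option A forces on the author.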
>>
>> So there are clearly use cases in which returning null allows for simpler
>> and less error-prone code, whereas it's not clear to me that there is any
>> use case in which returning a copy of the transcript simplifies the code.
>> Together, these use cases cover all the scenarios:
>>
>> - where there is an interpretation that contains a complex object,
>> - where there is an interpretation that contains a string, and
>> - where there is no interpretation.
>>
>> So, I continue to propose adding one additional sentence.
>>
>>     "If no interpretation is available, this attribute MUST return null."
>>
>> If there's no disagreement, I will add this sentence to the spec on
>> Friday.
>>
>> (Please note, this is very different from the reasoning behind requiring
>> the emma attribute to never be null. "emma" is always of type "Document"
>> and always returns a valid emma document, not simply a copy of some other
>> attribute.  Here, "interpretation" is an attribute of type "any", so it
>> must always be type-checked.)
>>
>> /Glen Shires
>>
>> [1]
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
>>
>> [2]
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0107.html
>>
>> On Wed, Sep 12, 2012 at 6:44 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> I would still prefer that the interpretation slot always be filled, at
>> least by the transcript if there’s nothing better. I think that the use
>> case I described in
>>
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
>>
>> is going to be pretty common and in that case being able to rely on
>> something other than null being in the interpretation field is very
>> convenient. On the other hand, if the application really depends on the
>> availability of a more complex interpretation object, the developer is
>> going to have to make sure that a specific speech service that can provide
>> that kind of interpretation is used. In that case, I don’t see how there
>> can be a transcript without an interpretation.
>>
>> On a related topic, I think we should also include some of the points
>> that Bjorn made about support for grammars and semantic tagging as
>> discussed in this thread:
>> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Tuesday, September 11, 2012 8:33 PM
>> *To:* Deborah Dahl; Jim Barnett; Hans Wennborg; Satish S; Bjorn
>> Bringert; public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> The current definition of interpretation in the spec is:
>>
>>     "The interpretation represents the semantic meaning from what the
>>     user said. This might be determined, for instance, through the SISR
>>     specification of semantics in a grammar."
>>
>> I propose adding an additional sentence at the end.
>>
>>     "If no interpretation is available, this attribute MUST return null."
>>
>> My reasoning (based on this lengthy thread):
>>
>>    - If an SISR / etc interpretation is available, the UA must return it.
>>    - If an alternative string interpretation is available, such as
>>    a normalization, the UA may return it.
>>    - If there's no more information available than in the transcript,
>>    then "null" provides a very simple way for the author to check for this
>>    condition. The author avoids a clumsy conditional (typeof(interpretation)
>>    != "string") and can easily distinguish the case where the
>>    interpretation is a normalized string from the case where it is merely
>>    a verbatim copy of the transcript.
>>    - "null" is more commonly used than "undefined" in these
>>    circumstances.
>>
>> If there's no disagreement, I will add this sentence to the spec on
>> Thursday.
>>
>> /Glen Shires
>>
>> On Tue, Sep 4, 2012 at 11:04 AM, Glen Shires <gshires@google.com> wrote:
>>
>> I've updated the spec with this change (moved interpretation and emma
>> attributes to SpeechRecognitionEvent):
>>
>> https://dvcs.w3.org/hg/speech-api/rev/48a58e558fcc
>>
>> As always, the current draft spec is at:
>>
>> http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
>>
>> /Glen Shires
>>
>> On Thu, Aug 30, 2012 at 10:07 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> Thanks for the clarification, that makes sense.  When each new version of
>> the emma document arrives in a SpeechRecognitionEvent, the author can just
>> repopulate all the earlier form fields, as well as the newest one, with
>> the data from the most recent emma version.
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Thursday, August 30, 2012 12:45 PM
>> *To:* Deborah Dahl
>> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
>> public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> Debbie,
>>
>> In my proposal, the single emma document is updated with each
>> new SpeechRecognitionEvent. Therefore, in continuous = true mode, the emma
>> document is populated in "real time" as the user speaks each field, without
>> waiting for the user to finish speaking. A JavaScript author could use this
>> to populate a form in "real time".
>>
>> Also, I now realize that the SpeechRecognitionEvent.transcript is not
>> useful in continuous = false mode because only one final result is
>> returned, and thus SpeechRecognitionEvent.results[0].transcript always
>> contains the same string (no concatenation needed).  I also don't see it as
>> very useful in continuous = true mode because if an author is using this
>> mode, it's presumably because he wants to show continuous final results
>> (and perhaps interim as well). Since the author is already writing code to
>> concatenate results to display them in "real time", there's little or no
>> savings with this new attribute.  So I now retract that portion of my
>> proposal.
>>
>> So to clarify, here are my proposed changes to the spec. If there's no
>> disagreement by the end of the week I'll add them to the spec...
>>
>> Delete SpeechRecognitionAlternative.interpretation
>>
>> Delete SpeechRecognitionResult.emma
>>
>> Add interpretation and emma attributes to SpeechRecognitionEvent.
>> Specifically:
>>
>>     interface SpeechRecognitionEvent : Event {
>>         readonly attribute short resultIndex;
>>         readonly attribute SpeechRecognitionResultList results;
>>         readonly attribute any interpretation;
>>         readonly attribute Document emma;
>>     };
>>
>> I do not propose to change the definitions of interpretation and emma at
>> this time (because there is on-going discussion), but rather to simply move
>> their current definitions to the new heading: "5.1.8 Speech Recognition
>> Event".
>>
>> /Glen Shires
>>
>> On Thu, Aug 30, 2012 at 8:36 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> Hi Glen,
>>
>> I agree that a single cumulative emma document is preferable to multiple
>> emma documents in general, although I think that there might be use cases
>> where it would be convenient to have both.  For example, you want to
>> populate a form in real time as the user speaks each field, without waiting
>> for the user to finish speaking. After the result is final the application
>> could send the cumulative result to the server, but seeing the interim
>> results would be helpful feedback to the user.
>>
>> Debbie
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Wednesday, August 29, 2012 2:57 PM
>> *To:* Deborah Dahl
>> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
>> public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> I believe the same is true for emma: a single, cumulative emma document
>> is preferable to multiple emma documents.
>>
>> I propose the following changes to the spec:
>>
>> Delete SpeechRecognitionAlternative.interpretation
>>
>> Delete SpeechRecognitionResult.emma
>>
>> Add interpretation and emma attributes to SpeechRecognitionEvent.
>> Specifically:
>>
>>     interface SpeechRecognitionEvent : Event {
>>         readonly attribute short resultIndex;
>>         readonly attribute SpeechRecognitionResultList results;
>>         readonly attribute DOMString transcript;
>>         readonly attribute any interpretation;
>>         readonly attribute Document emma;
>>     };
>>
>> I do not propose to change the definitions of interpretation and emma at
>> this time (because there is on-going discussion), but rather to simply move
>> their current definitions to the new heading: "5.1.8 Speech Recognition
>> Event".
>>
>> I also propose adding a transcript attribute to SpeechRecognitionEvent (but
>> also retaining SpeechRecognitionAlternative.transcript). This provides a
>> simple option for JavaScript authors to get at the full, cumulative
>> transcript.  I propose the definition under "5.1.8 Speech Recognition
>> Event" be:
>>
>> transcript
>>
>> The transcript string represents the raw words that the user spoke. This
>> is a concatenation of the first (highest confidence) alternative of all
>> final SpeechRecognitionAlternative.transcript strings.
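
The concatenation described in that definition could be written by an author roughly as follows. This is a sketch under stated assumptions: the results are modeled as plain objects with hypothetical "alternatives" and "final" fields so that it runs outside a browser, and joining with a single space is a guess, since the proposed definition does not name a separator.

```javascript
// Plain-object stand-in for a result list (an assumption, not the real
// SpeechRecognitionResultList). Each result lists its alternatives in
// order of decreasing confidence and carries a "final" flag.
function cumulativeTranscript(results) {
  var parts = [];
  for (var i = 0; i < results.length; i++) {
    var result = results[i];
    // Only final results contribute, and only their first alternative.
    if (result.final && result.alternatives.length > 0) {
      parts.push(result.alternatives[0].transcript);
    }
  }
  // The separator is an assumption; the proposal does not specify one.
  return parts.join(" ");
}

var sampleResults = [
  { final: true,
    alternatives: [{ transcript: "from New York" }, { transcript: "from Newark" }] },
  { final: true,
    alternatives: [{ transcript: "to San Francisco" }] },
  { final: false,
    alternatives: [{ transcript: "on Tues" }] }
];

console.log(cumulativeTranscript(sampleResults)); // "from New York to San Francisco"
```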
>>
>> /Glen Shires
>>
>> On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> I agree with having a single interpretation that represents the
>> cumulative interpretation of the utterance so far.
>>
>> I think an example of what Jim is talking about, when the interpretation
>> wouldn’t be final even if the transcript is, might be the utterance “from
>> Chicago … Midway”. Maybe the grammar has a default of “Chicago O’Hare”, and
>> returns “from: ORD”, because most people don’t bother to say “O’Hare”, but
>> then it hears “Midway” and changes the interpretation to “from: MDW”.
>> However, “from Chicago” is still the transcript.
>>
>> Also the problem that Glen points out is bad enough with two slots, but
>> it gets even worse as the number of slots gets bigger. For example, you
>> might have a pizza-ordering utterance with five or six ingredients (“I want
>> a large pizza with mushrooms…pepperoni…onions…olives…anchovies”). It would
>> be very cumbersome to have to go back through all the results to fill in
>> the slots separately.
>>
>> *From:* Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
>> *Sent:* Wednesday, August 29, 2012 12:37 PM
>> *To:* Glen Shires; Deborah Dahl
>> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>> *Subject:* RE: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> I agree with the idea of having a single interpretation.  There is no
>> guarantee that the different parts of the string have independent
>> interpretations.  For example, even if the transcription “from New York” is
>> final, its interpretation may not be, since it may depend on the
>> remaining parts of the utterance (that depends on how complicated the
>> grammar is, of course).
>>
>> - Jim
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Wednesday, August 29, 2012 11:44 AM
>> *To:* Deborah Dahl
>> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> How should interpretation work with continuous speech?
>>
>> Specifically, as each portion becomes final (each SpeechRecognitionResult
>> with final=true), the corresponding alternative(s) for transcription and
>> interpretation become final.
>>
>> It's easy for the JavaScript author to handle the consecutive list of
>> transcription strings - simply concatenate them.
>>
>> However, if the interpretation returns a semantic structure (such as the
>> depart/arrive example), it's unclear to me how they should be returned.
>> For example, if the first final result was "from New York" and the second
>> "to San Francisco", then:
>>
>> After the first final result, the list is:
>>
>>   event.results[0].item(0).transcript = "from New York"
>>   event.results[0].item(0).interpretation = {
>>     depart: "New York",
>>     arrive: null
>>   };
>>
>> After the second final result, the list is:
>>
>>   event.results[0].item(0).transcript = "from New York"
>>   event.results[0].item(0).interpretation = {
>>     depart: "New York",
>>     arrive: null
>>   };
>>
>>   event.results[1].item(0).transcript = "to San Francisco"
>>   event.results[1].item(0).interpretation = {
>>     depart: null,
>>     arrive: "San Francisco"
>>   };
>>
>> If so, this makes using the interpretation structure very messy for the
>> author because he needs to loop through all the results to find each
>> interpretation slot that he needs.
>>
>> I suggest that we instead consider changing the spec to provide a single
>> interpretation that always represents the most current interpretation.
>>
>> After the first final result, the list is:
>>
>>   event.results[0].item(0).transcript = "from New York"
>>   event.interpretation = {
>>     depart: "New York",
>>     arrive: null
>>   };
>>
>> After the second final result, the list is:
>>
>>   event.results[0].item(0).transcript = "from New York"
>>   event.results[1].item(0).transcript = "to San Francisco"
>>   event.interpretation = {
>>     depart: "New York",
>>     arrive: "San Francisco"
>>   };
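
To make the "messy" per-result case concrete, here is a rough sketch of the merge loop an author would have to write if each final result carried its own partial interpretation. The function name mergeSlots and the rule that a later non-null slot overrides an earlier one are assumptions for illustration; nothing in the thread specifies them.

```javascript
// Hypothetical merge of per-result interpretations into one slot structure.
// A non-null slot value replaces whatever was collected before; a null slot
// is recorded only if that slot has not been seen yet.
function mergeSlots(perResultInterpretations) {
  var merged = {};
  perResultInterpretations.forEach(function (interp) {
    // Skip results with no interpretation, or with a plain string.
    if (interp === null || typeof interp !== "object") {
      return;
    }
    Object.keys(interp).forEach(function (slot) {
      if (interp[slot] !== null) {
        merged[slot] = interp[slot];
      } else if (!(slot in merged)) {
        merged[slot] = null;
      }
    });
  });
  return merged;
}

var perResult = [
  { depart: "New York", arrive: null },
  { depart: null, arrive: "San Francisco" }
];
var merged = mergeSlots(perResult);

console.log(merged.depart); // "New York"
console.log(merged.arrive); // "San Francisco"
```

With a single cumulative event.interpretation, the author would read the slots directly and none of this code would be needed.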
>>
>> This not only makes it simple for the author to process the
>> interpretation, it also solves the problem that the interpretation may not
>> be available at the same point in time that the transcription becomes
>> final.  If alternative interpretations are important, then it's easy to add
>> them to the interpretation structure that is returned, and this format is
>> far easier for the author to process than
>> multiple SpeechRecognitionAlternative.interpretations.  For example:
>>
>>   event.interpretation = {
>>     depart: ["New York", "Newark"],
>>     arrive: ["San Francisco", "San Bernardino"]
>>   };
>>
>> /Glen Shires
>>
>> On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> I don’t think there’s a big difference in complexity in this use case,
>> but here’s another one that I think might be more common.
>>
>> Suppose the application is something like search or composing email, and
>> the transcript alone would serve the application's purposes. However, some
>> implementations might also provide useful normalizations like converting
>> text numbers to digits or capitalization that would make the dictated text
>> look more like written language, and this normalization fills the
>> "interpretation" slot. If the developer can count on the "interpretation"
>> slot being filled by the transcript if there's nothing better, then the
>> developer only has to ask for the interpretation.
>>
>> e.g.
>>
>>   document.write(interpretation)
>>
>> vs.
>>
>>   if (interpretation)
>>     document.write(interpretation)
>>   else
>>     document.write(transcript)
>>
>> The first, I think, is simpler. The developer doesn’t have to worry about
>> type checking because in this application the “interpretation” will always
>> be a string.
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Tuesday, August 28, 2012 10:44 PM
>> *To:* Deborah Dahl
>> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> Debbie,
>>
>> Looking at this from the viewpoint of what is easier for the JavaScript
>> author, I believe:
>>
>> SpeechRecognitionAlternative.transcript must return a string (even if an
>> empty string). Thus, an author wishing to use the transcript doesn't need
>> to perform any type checking.
>>
>> SpeechRecognitionAlternative.interpretation must be null if no
>> interpretation is provided.  This simplifies the required conditional by
>> eliminating type checking.  For example:
>>
>>   transcript = "from New York to San Francisco";
>>
>>   interpretation = {
>>     depart: "New York",
>>     arrive: "San Francisco"
>>   };
>>
>>   if (interpretation)  // this works if interpretation is present or if null
>>     document.write("Depart " + interpretation.depart + " and arrive in " +
>>                    interpretation.arrive);
>>   else
>>     document.write(transcript);
>>
>> Whereas, if the interpretation contains the transcript string when no
>> interpretation is present, the condition would have to be:
>>
>>   if (typeof(interpretation) != "string")
>>
>> Which is more complex, and more prone to errors (e.g. if you spell "string"
>> wrong).
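
For reference, a fully defensive version that copes with an object, a string, or null could look like the sketch below. The function name describeResult is hypothetical, and it returns the text instead of calling document.write so that it runs outside a browser; it also assumes an object interpretation has the depart and arrive slots from the example above.

```javascript
// Hypothetical helper: pick the text to display for one recognition result.
function describeResult(interpretation, transcript) {
  // typeof null is "object", so null has to be excluded explicitly.
  if (interpretation !== null && typeof interpretation === "object") {
    return "Depart " + interpretation.depart +
           " and arrive in " + interpretation.arrive;
  }
  // A string interpretation (for example a normalization) is shown as is.
  if (typeof interpretation === "string") {
    return interpretation;
  }
  // No interpretation available: fall back to the transcript.
  return transcript;
}

var spoken = "from New York to San Francisco";

console.log(describeResult({ depart: "New York", arrive: "San Francisco" }, spoken));
console.log(describeResult(null, spoken));
```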
>>
>> /Glen Shires
>>
>> On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> Hi Glen,
>>
>> In the case of an SLM, if there’s a classification, I think the
>> classification would be the interpretation. If the SLM is just used to
>> improve dictation results, without classification, then the interpretation
>> would be whatever we say it is: either the transcript, null, or undefined.
>>
>> My point about stating that the “transcript” attribute is required or
>> optional wasn’t whether or not there was a use case where it would be
>> desirable not to return a transcript. My point was that the spec needs to
>> be explicit about the optional/required status of every feature. It’s
>> fine to postpone that decision if there’s any controversy, but if we all
>> agree we might as well add it to the spec.
>>
>> I can’t think of any cases where it would be bad to return a transcript,
>> although I can think of use cases where the developer wouldn’t choose to do
>> anything with the transcript (like multi-slot form filling, where all the
>> end user really needs to see is the correctly filled slots).
>>
>> Debbie
>>
>> *From:* Glen Shires [mailto:gshires@google.com]
>> *Sent:* Thursday, August 23, 2012 3:48 AM
>> *To:* Deborah Dahl
>> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
>> interpretation can't be provided
>>
>> Debbie,
>>
>> I agree with the need to support SLMs. This implies that, in some cases,
>> the author may not specify semantic information, and thus there would not
>> be an interpretation.
>>
>> Under what circumstances (except error conditions) do you envision that a
>> transcript would not be returned?
>>
>> /Glen Shires
>>
>> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl <
>> dahl@conversational-technologies.com> wrote:
>>
>> Actually, Satish's comment made me think that we probably have a few other
>> things to agree on before we decide what the default value of
>> "interpretation" should be, because we haven't settled on a lot of issues
>> about what is required and what is optional.
>> Satish's argument is only relevant if we require SRGS/SISR for grammars
>> and semantic interpretation, but we actually don't require either of those
>> right now, so it doesn't matter what they do as far as the current spec
>> goes. (Although it's worth noting that SRGS doesn't require anything to be
>> returned at all, even the transcript
>> http://www.w3.org/TR/speech-grammar/#S1.10).
>> So I think we first need to decide and explicitly state in the spec:
>>
>> 1. what we want to say about grammar formats (which are allowed/required,
>>    or is the grammar format open). It probably needs to be somewhat open
>>    because of SLMs.
>> 2. what we want to say about semantic tag formats (are proprietary formats
>>    allowed, is SISR required or is the semantic tag format just whatever the
>>    grammar format uses)
>> 3. is "transcript" required?
>> 4. is "interpretation" required?
>>
>> Debbie
>>
>>
>> > -----Original Message-----
>> > From: Hans Wennborg [mailto:hwennborg@google.com]
>> > Sent: Tuesday, August 21, 2012 12:50 PM
>> > To: Glen Shires
>> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org
>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
>> > interpretation can't be provided
>> >
>> > Björn, Deborah, are you ok with this as well? I.e. that the spec
>> > shouldn't mandate a "default" value for the interpretation attribute,
>> > but rather return null when there is no interpretation?
>> >
>> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> wrote:
>> > > I agree, return "null" (not "undefined") in such cases.
>> > >
>> > >
>> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote:
>> > >>
>> > >> > I may have missed something, but I don’t see in the spec where it
>> > >> > says that “interpretation” is optional.
>> > >>
>> > >> Developers specify the interpretation value with SISR and if they
>> > >> don't specify one, there is no 'default' interpretation available. In
>> > >> that sense it is optional because grammars don't mandate it. So I think
>> > >> this API shouldn't mandate providing a default value if the engine did
>> > >> not provide one, and should return null in such cases.
>> > >>
>> > >> Cheers
>> > >> Satish
>> > >>
>> > >>
>> > >>
>> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl
>> > >> <dahl@conversational-technologies.com> wrote:
>> > >>>
>> > >>> I may have missed something, but I don’t see in the spec where it
>> > >>> says that “interpretation” is optional.
>> > >>>
>> > >>> From: Satish S [mailto:satish@google.com]
>> > >>> Sent: Thursday, August 16, 2012 7:38 PM
>> > >>> To: Deborah Dahl
>> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org
>> > >>>
>> > >>>
>> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when
>> > >>> interpretation can't be provided
>> > >>>
>> > >>>
>> > >>>
>> > >>> 'interpretation' is an optional attribute because engines are not
>> > >>> required to provide an interpretation on their own (unlike
>> > >>> 'transcript'). As such I think it should return null when there isn't
>> > >>> a value to be returned as that is the convention for optional
>> > >>> attributes, not 'undefined' or a copy of some other attribute.
>> > >>>
>> > >>>
>> > >>>
>> > >>> If an engine chooses to return the same value for 'transcript' and
>> > >>> 'interpretation', or to do textnorm of the value and return it in
>> > >>> 'interpretation', that will be an implementation detail of the engine.
>> > >>> But in the absence of any such value for 'interpretation' from the
>> > >>> engine I think the UA should return null.
>> > >>>
>> > >>>
>> > >>> Cheers
>> > >>> Satish
>> > >>>
>> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl
>> > >>> <dahl@conversational-technologies.com> wrote:
>> > >>>
>> > >>> That's a good point. There are lots of use cases where some simple
>> > >>> normalization is extremely useful, as in your example, or collapsing
>> > >>> all the ways that the user might say "yes" or "no". However, you could
>> > >>> say that once the implementation has modified or normalized the
>> > >>> transcript that means it has some kind of interpretation, so putting a
>> > >>> normalized value in the interpretation slot should be fine. Nothing
>> > >>> says that the "interpretation" has to be a particularly fine-grained
>> > >>> interpretation, or one with a lot of structure.
>> > >>>
>> > >>>
>> > >>>
>> > >>> > -----Original Message-----
>> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
>> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
>> > >>> > To: Hans Wennborg
>> > >>> > Cc: Conversational; public-speech-api@w3.org
>> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
>> > >>> > interpretation can't be provided
>> > >>> >
>> > >>> > I'm not sure that it has to be that strict in requiring that the
>> > >>> > value is the same as the "transcript" attribute. For example, an
>> > >>> > engine might return the words recognized in "transcript" and apply
>> > >>> > some extra textnorm to the text that it returns in
>> > >>> > "interpretation", e.g. converting digit words to digits ("three" ->
>> > >>> > "3"). Not sure if that's useful though.
>> > >>> >
>> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
>> > >>> > <hwennborg@google.com> wrote:
>> > >>> > > Yes, the raw text is in the 'transcript' attribute.
>> > >>> > >
>> > >>> > > The description of 'interpretation' is currently: "The
>> > >>> > > interpretation represents the semantic meaning from what the user
>> > >>> > > said. This might be determined, for instance, through the SISR
>> > >>> > > specification of semantics in a grammar."
>> > >>> > >
>> > >>> > > I propose that we change it to "The interpretation represents the
>> > >>> > > semantic meaning from what the user said. This might be
>> > >>> > > determined, for instance, through the SISR specification of
>> > >>> > > semantics in a grammar. If no semantic meaning can be determined,
>> > >>> > > the attribute must be a string with the same value as the
>> > >>> > > 'transcript' attribute."
>> > >>> > >
>> > >>> > > Does that sound good to everyone? If there are no objections, I'll
>> > >>> > > make the change to the draft next week.
>> > >>> > >
>> > >>> > > Thanks,
>> > >>> > > Hans
>> > >>> > >
>> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
>> > >>> > > <dahl@conversational-technologies.com> wrote:
>> > >>> > >> I can't check the spec right now, but I assume there's already
>> > >>> > >> an attribute that currently is defined to contain the raw text.
>> > >>> > >> So I think we could say that if there's no interpretation the
>> > >>> > >> value of the interpretation attribute would be the same as the
>> > >>> > >> value of the "raw string" attribute.
>> > >>> > >>
>> > >>> > >> Sent from my iPhone
>> > >>> > >>
>> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg <hwennborg@google.com> wrote:
>> > >>> > >>
>> > >>> > >>> OK, that would work I suppose.
>> > >>> > >>>
>> > >>> > >>> What would the spec text look like? Something like "[...] If no
>> > >>> > >>> semantic meaning can be determined, the attribute will be a
>> > >>> > >>> string representing the raw words that the user spoke."?
>> > >>> > >>>
>> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert <bringert@google.com> wrote:
>> > >>> > >>>> Yeah, that would be my preference too.
>> > >>> > >>>>
>> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
>> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
>> > >>> > >>>>> If there isn't an interpretation I think it would make the
>> > >>> > >>>>> most sense for the attribute to contain the literal string
>> > >>> > >>>>> result. I believe this is what happens in VoiceXML.
>> > >>> > >>>>>
>> > >>> > >>>>>> My question is: for implementations that cannot provide an
>> > >>> > >>>>>> interpretation, what should the attribute's value be? null?
>> > >>> > >>>>>> undefined?
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > --
>> > >>> > Bjorn Bringert
>> > >>> > Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> > >>> > Palace Road, London, SW1W 9TQ
>> > >>> > Registered in England Number: 3977902
>> > >>>
>> > >>>
>> > >>>
>> > >>
>> > >>
>> > >
>>
>
>
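
For readers of this archived thread, the outcome discussed above (return null, rather than undefined or a copy of 'transcript', when no interpretation is available) might look like this from the page author's side. This is only an illustrative sketch: the AlternativeLike interface and the interpretationOrTranscript helper are hypothetical names introduced here, not part of the spec. Only the 'transcript' and 'interpretation' attributes come from the discussion.

```typescript
// Hypothetical local shape mirroring the two attributes discussed in the
// thread; it is an assumption for illustration, not the spec's IDL.
interface AlternativeLike {
  transcript: string;
  interpretation: unknown; // null when the engine supplies no interpretation
}

// Hypothetical helper: use the interpretation when one exists, otherwise
// fall back to the raw transcript (the behaviour some participants wanted
// the UA to provide by default, done here in page script instead).
function interpretationOrTranscript(alt: AlternativeLike): unknown {
  // `== null` is true for both null and undefined, so an implementation
  // that returns undefined is handled the same way.
  return alt.interpretation == null ? alt.transcript : alt.interpretation;
}

const withTags: AlternativeLike = { transcript: "three", interpretation: "3" };
const withoutTags: AlternativeLike = { transcript: "three", interpretation: null };

console.log(interpretationOrTranscript(withTags));    // prints 3
console.log(interpretationOrTranscript(withoutTags)); // prints three
```

The explicit null check (rather than a truthiness test) keeps legitimate falsy interpretations such as 0 or an empty string from being replaced by the transcript.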
Received on Tuesday, 18 September 2012 05:32:45 UTC