Re: SpeechRecognitionAlternative.interpretation when interpretation can't be provided from Glen Shires on 2012-09-13 (public-speech-api@w3.org from September 2012)

From: Glen Shires <gshires@google.com>
Date: Thu, 13 Sep 2012 10:42:09 -0700
To: Deborah Dahl <dahl@conversational-technologies.com>
Cc: Jim Barnett <Jim.Barnett@genesyslab.com>, Hans Wennborg <hwennborg@google.com>, Satish S <satish@google.com>, Bjorn Bringert <bringert@google.com>, public-speech-api@w3.org
Message-ID: <CAEE5bciXvsQ88RLVa-vfYU+96MqNE7Z1sZrtCQOCGHK9RpKKQQ@mail.gmail.com>
Debbie,
Yes, I like the text you propose.  If there's no disagreement, I'll add it
to the spec on Monday.
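
A minimal sketch of how an author might surface the combined error text quoted below. The `ErrorCode` table and `describeError` helper are hypothetical names for illustration; only the name BAD_GRAMMAR comes from this thread, and the numeric values are made up.

```javascript
// Hypothetical stand-in for the draft's error codes. Only BAD_GRAMMAR is
// named in this thread; the numeric values here are invented.
const ErrorCode = { BAD_GRAMMAR: 1, OTHER: 2 };

// Map an error code to a user-facing message. Grammar errors and semantic
// tag errors share one code, so they also share one message.
function describeError(code) {
  if (code === ErrorCode.BAD_GRAMMAR) {
    return "There was an error in the speech recognition grammar or " +
           "semantic tags, or the grammar format or tag format is unsupported.";
  }
  return "Unrecognized error code: " + code;
}

console.log(describeError(ErrorCode.BAD_GRAMMAR));
```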

On Thu, Sep 13, 2012 at 10:36 AM, Deborah Dahl <
dahl@conversational-technologies.com> wrote:

> Having a more specific error like “TAG_FORMAT_NOT_SUPPORTED” would be more
> informative, but I think using BAD_GRAMMAR is ok. If so, the text should
> probably say something like "There was an error in the speech recognition
> grammar or semantic tags, or the grammar format or tag format is
> unsupported."
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Thursday, September 13, 2012 12:40 PM
>
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> The spec already defines SpeechRecognitionError BAD_GRAMMAR.  I propose we
> use this same error for bad tag formats, since they're so related (and in
> fact there may be some edge cases in which it's not clear whether the error
> is parsed as a grammar error or a semantic tag error).
>
> The current definition in the spec for BAD_GRAMMAR is:
>
> "There was an error in the speech recognition grammar."
>
> I propose changing this to:
>
> "There was an error in the speech recognition grammar or semantic tags."
>
> /Glen Shires
>
> On Thu, Sep 13, 2012 at 8:53 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> The example of the author supplying semantics that the recognizer can’t
> interpret is, I think, Bjorn’s “open question B” in his email:
>
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html
>
> I proposed in the email below that this situation should raise an error, but
> I don’t think there’s been any other discussion, so we should discuss this at
> some point.
>
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0072.html
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, September 12, 2012 5:58 PM
>
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> > ...any use cases where the interpretation is of interest to the
> > developer and it’s not known whether the interpretation is an object or a
> > string.  What would be an example of that?
>
> An example is: if the author supplies semantics that the recognizer can't
> interpret, then the recognizer might return a normalized result.
>
> > I also think that the third use case would be very rare, since it would
> > involve asking the user to make a decision about whether they want a
> > normalized or non-normalized version of the result, and it’s not clear when
> > the user would actually be interested in making that kind of choice.
>
> If the user is shown alternatives, one option might be normalized. I
> provided an example of this, where the non-normalized version might be
> preferred by the user.
>
>    transcript: "Like I've done one million times before."
>    normalized: "Like I've done 1,000,000 times before."
>
> I understand that this may be a rare use case, but regardless of that, I
> still don't know of any use case in which returning a copy of the
> transcript is preferable to null.
>
> I'd prefer that we put the specific behavior in the spec, but if all we
> can agree on at this point is: “The group is currently discussing options
> for the value of the interpretation attribute when no interpretation has
> been returned by the recognizer. Current options are ‘null’ or a copy of
> the transcript.”, then I will agree to that.
>
> I too would like to hear others' opinions.
>
> /Glen Shires
>
> On Wed, Sep 12, 2012 at 2:16 PM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I’m not sure I can think of any use cases where the interpretation is of
> interest to the developer and it’s not known whether the interpretation is
> an object or a string. What would be an example of that? I also think that
> the third use case would be very rare, since it would involve asking the
> user to make a decision about whether they want a normalized or
> non-normalized version of the result, and it’s not clear when the user
> would actually be interested in making that kind of choice.
>
> I think it would be good at this point to get some other opinions about
> this.
>
> Also, in the interest of moving forward, I think it’s perfectly fine to
> have language in the spec that just says “The group is currently discussing
> options for the value of the interpretation attribute when no
> interpretation has been returned by the recognizer. Current options are
> ‘null’ or a copy of the transcript.” This may also serve to encourage
> external comments from developers who have an opinion about this.
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, September 12, 2012 4:21 PM
>
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I disagree with the code [1] for this use case. Since the interpretation
> may be a non-string object, good defensive coding practice is:
>
> if (typeof(interpretation) == "string") {
>   document.write(interpretation);
> } else {
>   document.write(transcript);
> }
>
> Thus, for this use case it doesn't matter. The code is identical for
> either definition of what the interpretation attribute returns when there
> is no interpretation. (That is, whether interpretation is defined to return
> null or to return a copy of transcript.)
>
> In contrast, [2] shows a use case where it does matter: the code is
> simpler and less error-prone if the interpretation attribute returns null
> when there is no interpretation.
>
> Below is a third use case where it also matters. Since interpretation may
> return a normalized string, an author may wish to show both the normalized
> string and the transcript string to the user, and let them choose which one
> to use.  For example:
>
>    interpretation: "Like I've done 1,000,000 times before."
>    transcript: "Like I've done one million times before."
>
> (The author might also add transcript alternatives to this choice list,
> but I'll omit that to keep the example simple.)
>
> For the option where interpretation returns a copy of transcript when
> there is no interpretation:
>
> var choices = [];  // must be initialized before push() is called
> if (typeof(interpretation) == "string" && interpretation != transcript) {
>   choices.push(interpretation);
> }
> choices.push(transcript);
> if (choices.length > 1) {
>   AskUserToDisambiguate(choices);
> }
>
> For the option where interpretation returns null when there is no
> interpretation:
>
> var choices = [];  // must be initialized before push() is called
> if (typeof(interpretation) == "string") {
>   choices.push(interpretation);
> }
> choices.push(transcript);
> if (choices.length > 1) {
>   AskUserToDisambiguate(choices);
> }
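
As a self-contained sketch of the null convention described above: `buildChoices` is an illustrative name, not part of the draft API, and the disambiguation UI is left out.

```javascript
// Build the list of strings to offer the user, assuming interpretation is
// null whenever the recognizer has nothing beyond the transcript.
function buildChoices(interpretation, transcript) {
  const choices = [];
  // Under the null convention a string interpretation is always a real
  // normalization, so no comparison against the transcript is needed.
  if (typeof interpretation === "string") {
    choices.push(interpretation);
  }
  choices.push(transcript);
  return choices;
}

const withNormalization = buildChoices(
  "Like I've done 1,000,000 times before.",
  "Like I've done one million times before."
);
const withoutInterpretation = buildChoices(
  null,
  "Like I've done one million times before."
);
console.log(withNormalization.length, withoutInterpretation.length);
```

Only the first call yields more than one choice, so only that case would need to ask the user to disambiguate.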
>
> So there are clearly use cases in which returning null allows for simpler
> and less error-prone code, whereas it's not clear to me there is any use
> case in which returning a copy of the transcript simplifies the code.
> Together, these use cases cover all the scenarios:
>
> - where there is an interpretation that contains a complex object,
> - where there is an interpretation that contains a string, and
> - where there is no interpretation.
>
> So, I continue to propose adding one additional sentence.
>
>     "If no interpretation is available, this attribute MUST return null."
>
> If there's no disagreement, I will add this sentence to the spec on Friday.
>
> (Please note, this is very different than the reasoning behind requiring
> the emma attribute to never be null. "emma" is always of type "Document"
> and always returns a valid emma document, not simply a copy of some other
> attribute.  Here, "interpretation" is an attribute of type "any", so it
> must always be type-checked.)
>
> /Glen Shires
>
> [1]
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
>
> [2]
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0107.html
>
> On Wed, Sep 12, 2012 at 6:44 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I would still prefer that the interpretation slot always be filled, at
> least by the transcript if there’s nothing better. I think that the use
> case I described in
>
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0108.html
>
> is going to be pretty common and in that case being able to rely on
> something other than null being in the interpretation field is very
> convenient. On the other hand, if the application really depends on the
> availability of a more complex interpretation object, the developer is
> going to have to make sure that a specific speech service that can provide
> that kind of interpretation is used. In that case, I don’t see how there
> can be a transcript without an interpretation.
>
> On a related topic, I think we should also include some of the points that
> Bjorn made about support for grammars and semantic tagging as discussed in
> this thread:
> http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0071.html
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Tuesday, September 11, 2012 8:33 PM
> *To:* Deborah Dahl; Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> The current definition of interpretation in the spec is:
>
>     "The interpretation represents the semantic meaning from what the user
>     said. This might be determined, for instance, through the SISR
>     specification of semantics in a grammar."
>
> I propose adding an additional sentence at the end.
>
>     "If no interpretation is available, this attribute MUST return null."
>
> My reasoning (based on this lengthy thread):
>
>    - If an SISR / etc interpretation is available, the UA must return it.
>    - If an alternative string interpretation is available, such as
>    a normalization, the UA may return it.
>    - If there's no more information available than in the transcript,
>    then "null" provides a very simple way for the author to check for this
>    condition. The author avoids a clumsy conditional (typeof(interpretation)
>    != "string") and the author can easily distinguish between the case when
>    the interpretation returns a normalization string as opposed to if it had
>    just copied the transcript verbatim.
>    - "null" is more commonly used than "undefined" in these circumstances.
>
> If there's no disagreement, I will add this sentence to the spec on
> Thursday.
>
> /Glen Shires
>
> On Tue, Sep 4, 2012 at 11:04 AM, Glen Shires <gshires@google.com> wrote:
>
> I've updated the spec with this change (moved interpretation and emma
> attributes to SpeechRecognitionEvent):
>
> https://dvcs.w3.org/hg/speech-api/rev/48a58e558fcc
>
> As always, the current draft spec is at:
>
> http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
>
> /Glen Shires
>
> On Thu, Aug 30, 2012 at 10:07 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Thanks for the clarification, that makes sense.  When each new version of
> the emma document arrives in a SpeechRecognitionEvent, the author can just
> repopulate all the earlier form fields, as well as the newest one, with
> the data from the most recent emma version.
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Thursday, August 30, 2012 12:45 PM
>
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
>
> In my proposal, the single emma document is updated with each
> new SpeechRecognitionEvent. Therefore, in continuous = true mode, the emma
> document is populated in "real time" as the user speaks each field, without
> waiting for the user to finish speaking. A JavaScript author could use this
> to populate a form in "real time".
>
> Also, I now realize that the SpeechRecognitionEvent.transcript is not
> useful in continuous = false mode because only one final result is
> returned, and thus SpeechRecognitionEvent.results[0].transcript always
> contains the same string (no concatenation needed).  I also don't see it as
> very useful in continuous = true mode because if an author is using this
> mode, it's presumably because he wants to show continuous final results
> (and perhaps interim as well). Since the author is already writing code to
> concatenate results to display them "real-time", there's little or no
> savings with this new attribute.  So I now retract that portion of my
> proposal.
>
> So to clarify, here are my proposed changes to the spec. If there's no
> disagreement by the end of the week I'll add them to the spec...
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>     interface SpeechRecognitionEvent : Event {
>         readonly attribute short resultIndex;
>         readonly attribute SpeechRecognitionResultList results;
>         readonly attribute any interpretation;
>         readonly attribute Document emma;
>     };
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is on-going discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> /Glen Shires
>
> On Thu, Aug 30, 2012 at 8:36 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
>
> I agree that a single cumulative emma document is preferable to multiple
> emma documents in general, although I think that there might be use cases
> where it would be convenient to have both.  For example, you want to
> populate a form in real time as the user speaks each field, without waiting
> for the user to finish speaking. After the result is final the application
> could send the cumulative result to the server, but seeing the interim
> results would be helpful feedback to the user.
>
> Debbie
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, August 29, 2012 2:57 PM
> *To:* Deborah Dahl
> *Cc:* Jim Barnett; Hans Wennborg; Satish S; Bjorn Bringert;
> public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I believe the same is true for emma: a single, cumulative emma document is
> preferable to multiple emma documents.
>
> I propose the following changes to the spec:
>
> Delete SpeechRecognitionAlternative.interpretation
>
> Delete SpeechRecognitionResult.emma
>
> Add interpretation and emma attributes to SpeechRecognitionEvent.
> Specifically:
>
>     interface SpeechRecognitionEvent : Event {
>         readonly attribute short resultIndex;
>         readonly attribute SpeechRecognitionResultList results;
>         readonly attribute DOMString transcript;
>         readonly attribute any interpretation;
>         readonly attribute Document emma;
>     };
>
> I do not propose to change the definitions of interpretation and emma at
> this time (because there is on-going discussion), but rather to simply move
> their current definitions to the new heading: "5.1.8 Speech Recognition
> Event".
>
> I also propose adding a transcript attribute to SpeechRecognitionEvent (but
> also retaining SpeechRecognitionAlternative.transcript). This provides a
> simple option for JavaScript authors to get at the full, cumulative
> transcript.  I propose the definition under "5.1.8 Speech Recognition
> Event" be:
>
> transcript
>
> The transcript string represents the raw words that the user spoke. This
> is a concatenation of the first (highest confidence) alternative of all
> final SpeechRecognitionAlternative.transcript strings.
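
A rough sketch of the concatenation this definition describes, using plain objects in place of the real result list. The `final` and `alternatives` field names and the single-space separator are assumptions made for illustration; the proposed text does not specify a separator.

```javascript
// results: array of { final: boolean, alternatives: [{ transcript: string }] }
// These plain objects only mimic the shape of the draft's result list.
function cumulativeTranscript(results) {
  return results
    .filter((r) => r.final && r.alternatives.length > 0) // final results only
    .map((r) => r.alternatives[0].transcript)             // highest-confidence alternative
    .join(" ");                                           // separator is an assumption
}

const sampleResults = [
  { final: true, alternatives: [{ transcript: "from New York" }, { transcript: "from Newark" }] },
  { final: true, alternatives: [{ transcript: "to San Francisco" }] },
  { final: false, alternatives: [{ transcript: "on" }] }
];
console.log(cumulativeTranscript(sampleResults));
```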
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 10:30 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I agree with having a single interpretation that represents the cumulative
> interpretation of the utterance so far.
>
> I think an example of what Jim is talking about, when the interpretation
> wouldn’t be final even if the transcript is, might be the utterance “from
> Chicago … Midway”. Maybe the grammar has a default of “Chicago O’Hare”, and
> returns “from: ORD”, because most people don’t bother to say “O’Hare”, but
> then it hears “Midway” and changes the interpretation to “from: MDW”.
> However, “from Chicago” is still the transcript.
>
> Also the problem that Glen points out is bad enough with two slots, but
> it gets even worse as the number of slots gets bigger. For example, you
> might have a pizza-ordering utterance with five or six ingredients (“I want
> a large pizza with mushrooms…pepperoni…onions…olives…anchovies”). It would
> be very cumbersome to have to go back through all the results to fill in
> the slots separately.
>
> *From:* Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> *Sent:* Wednesday, August 29, 2012 12:37 PM
> *To:* Glen Shires; Deborah Dahl
>
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>
> *Subject:* RE: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> I agree with the idea of having a single interpretation.  There is no
> guarantee that the different parts of the string have independent
> interpretations.  For example, even if the transcription “from New York” is
> final, its interpretation may not be, since it may depend on the
> remaining parts of the utterance (that depends on how complicated the
> grammar is, of course).
>
> - Jim
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Wednesday, August 29, 2012 11:44 AM
> *To:* Deborah Dahl
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> How should interpretation work with continuous speech?
>
> Specifically, as each portion becomes final (each SpeechRecognitionResult
> with final=true), the corresponding alternative(s) for transcription and
> interpretation become final.
>
> It's easy for the JavaScript author to handle the consecutive list of
> transcription strings - simply concatenate them.
>
> However, if the interpretation returns a semantic structure (such as the
> depart/arrive example), it's unclear to me how they should be returned.
> For example, if the first final result was "from New York" and the second
> "to San Francisco", then:
>
> After the first final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[0].item[0].interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[0].item[0].interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> event.results[1].item[0].transcription = "to San Francisco"
> event.results[1].item[0].interpretation = {
>   depart: null,
>   arrive: "San Francisco"
> };
>
> If so, this makes using the interpretation structure very messy for the
> author because he needs to loop through all the results to find each
> interpretation slot that he needs.
>
> I suggest that we instead consider changing the spec to provide a single
> interpretation that always represents the most current interpretation.
>
> After the first final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.interpretation = {
>   depart: "New York",
>   arrive: null
> };
>
> After the second final result, the list is:
>
> event.results[0].item[0].transcription = "from New York"
> event.results[1].item[0].transcription = "to San Francisco"
> event.interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> This not only makes it simple for the author to process the
> interpretation, it also solves the problem that the interpretation may not
> be available at the same point in time that the transcription becomes
> final.  If alternative interpretations are important, then it's easy to add
> them to the interpretation structure that is returned, and this format is
> far easier for the author to process than
> multiple SpeechRecognitionAlternative.interpretations.  For example:
>
> event.interpretation = {
>   depart: ["New York", "Newark"],
>   arrive: ["San Francisco", "San Bernardino"],
> };
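
To illustrate the bookkeeping an author would otherwise need with per-result interpretations, here is a small sketch that folds them into one slot structure. `mergeInterpretations` is a hypothetical helper, and the rule that a null slot never overwrites an earlier value is an assumption, not something the draft specifies.

```javascript
// Fold a list of per-result interpretation objects into one object.
// Assumed rule: a later non-null slot value wins; a null slot never
// erases a value that an earlier result already filled in.
function mergeInterpretations(perResult) {
  const merged = {};
  for (const interp of perResult) {
    if (interp === null || typeof interp !== "object") {
      continue; // skip results that carry no structured interpretation
    }
    for (const [slot, value] of Object.entries(interp)) {
      if (value !== null || !(slot in merged)) {
        merged[slot] = value;
      }
    }
  }
  return merged;
}

const mergedTrip = mergeInterpretations([
  { depart: "New York", arrive: null },
  { depart: null, arrive: "San Francisco" }
]);
console.log(mergedTrip.depart, mergedTrip.arrive);
```

With a single cumulative interpretation on the event, the author would read the two slots directly and this loop would not be needed.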
>
> /Glen Shires
>
> On Wed, Aug 29, 2012 at 7:07 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> I don’t think there’s a big difference in complexity in this use case, but
> here’s another one, that I think might be more common.
>
> Suppose the application is something like search or composing email, and
> the transcript alone would serve the application's purposes. However, some
> implementations might also provide useful normalizations like converting
> text numbers to digits or capitalization that would make the dictated text
> look more like written language, and this normalization fills the
> "interpretation slot". If the developer can count on the "interpretation"
> slot being filled by the transcript if there's nothing better, then the
> developer only has to ask for the interpretation.
>
> e.g.
>
> document.write(interpretation)
>
> vs.
>
> if (interpretation)
>   document.write(interpretation)
> else
>   document.write(transcript)
>
> I think the first version is simpler. The developer doesn’t have to worry
> about type checking because in this application the “interpretation” will
> always be a string.
>
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Tuesday, August 28, 2012 10:44 PM
> *To:* Deborah Dahl
>
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
>
> Looking at this from the viewpoint of what is easier for the JavaScript
> author, I believe:
>
> SpeechRecognitionAlternative.transcript must return a string (even if an
> empty string). Thus, an author wishing to use the transcript doesn't need
> to perform any type checking.
>
> SpeechRecognitionAlternative.interpretation must be null if no
> interpretation is provided.  This simplifies the required conditional by
> eliminating type checking.  For example:
>
> transcript = "from New York to San Francisco";
>
> interpretation = {
>   depart: "New York",
>   arrive: "San Francisco"
> };
>
> if (interpretation) {  // this works if interpretation is present or if null
>   document.write("Depart " + interpretation.depart + " and arrive in " +
>                  interpretation.arrive);
> } else {
>   document.write(transcript);
> }
>
> Whereas, if the interpretation contains the transcript string when no
> interpretation is present, the condition would have to be:
>
> if (typeof(interpretation) != "string")
>
> Which is more complex, and more prone to errors (e.g. if you spell "string"
> wrong).
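
A short sketch of why the simple truthiness check only holds up under the null convention. `describeTrip` is an illustrative name, and the function returns a string rather than writing to the document so it can run outside a browser.

```javascript
// Format a trip using only a truthiness check on the interpretation.
function describeTrip(transcript, interpretation) {
  if (interpretation) {
    return "Depart " + interpretation.depart +
           " and arrive in " + interpretation.arrive;
  }
  return transcript;
}

const spoken = "from New York to San Francisco";

// Structured interpretation: the slots are read as intended.
const structured = describeTrip(spoken, { depart: "New York", arrive: "San Francisco" });

// Null convention: the check falls through to the transcript.
const nullCase = describeTrip(spoken, null);

// Copy-of-transcript convention: a non-empty string is truthy, so the slot
// lookups run on a string and both come back undefined.
const copyCase = describeTrip(spoken, spoken);

console.log(structured);
console.log(nullCase);
console.log(copyCase);
```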
>
> /Glen Shires
>
> On Thu, Aug 23, 2012 at 6:37 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Hi Glen,
>
> In the case of an SLM, if there’s a classification, I think the
> classification would be the interpretation. If the SLM is just used to
> improve dictation results, without classification, then the interpretation
> would be whatever we say it is – either the transcript, null, or undefined.
>
> My point about stating that the “transcript” attribute is required or
> optional wasn’t whether or not there was a use case where it would be
> desirable not to return a transcript. My point was that the spec needs to
> be explicit about the optional/required status of every feature. It’s
> fine to postpone that decision if there’s any controversy, but if we all
> agree we might as well add it to the spec.
>
> I can’t think of any cases where it would be bad to return a transcript,
> although I can think of use cases where the developer wouldn’t choose to do
> anything with the transcript (like multi-slot form filling – all the end
> user really needs to see is the correctly filled slots).
>
> Debbie
>
> *From:* Glen Shires [mailto:gshires@google.com]
> *Sent:* Thursday, August 23, 2012 3:48 AM
> *To:* Deborah Dahl
> *Cc:* Hans Wennborg; Satish S; Bjorn Bringert; public-speech-api@w3.org
>
> *Subject:* Re: SpeechRecognitionAlternative.interpretation when
> interpretation can't be provided
>
> Debbie,
>
> I agree with the need to support SLMs. This implies that, in some cases,
> the author may not specify semantic information, and thus there would not
> be an interpretation.
>
> Under what circumstances (except error conditions) do you envision that a
> transcript would not be returned?
>
> /Glen Shires
>
> On Wed, Aug 22, 2012 at 6:08 AM, Deborah Dahl <
> dahl@conversational-technologies.com> wrote:
>
> Actually, Satish's comment made me think that we probably have a few other
> things to agree on before we decide what the default value of
> "interpretation" should be, because we haven't settled on a lot of issues
> about what is required and what is optional.
> Satish's argument is only relevant if we require SRGS/SISR for grammars and
> semantic interpretation, but we actually don't require either of those right
> now, so it doesn't matter what they do as far as the current spec goes.
> (Although it's worth noting that SRGS doesn't require anything to be
> returned at all, even the transcript:
> http://www.w3.org/TR/speech-grammar/#S1.10).
> So I think we first need to decide and explicitly state in the spec:
>
> 1. what we want to say about grammar formats (which are allowed/required, or
> is the grammar format open). It probably needs to be somewhat open because
> of SLMs.
> 2. what we want to say about semantic tag formats (are proprietary formats
> allowed, is SISR required or is the semantic tag format just whatever the
> grammar format uses)
> 3. is "transcript" required?
> 4. is "interpretation" required?
>
> Debbie
>
>
> > -----Original Message-----
> > From: Hans Wennborg [mailto:hwennborg@google.com]
> > Sent: Tuesday, August 21, 2012 12:50 PM
> > To: Glen Shires
> > Cc: Satish S; Deborah Dahl; Bjorn Bringert; public-speech-api@w3.org
> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > interpretation can't be provided
> >
> > Björn, Deborah, are you ok with this as well? I.e. that the spec
> > shouldn't mandate a "default" value for the interpretation attribute,
> > but rather return null when there is no interpretation?
> >
> > On Fri, Aug 17, 2012 at 6:32 PM, Glen Shires <gshires@google.com> wrote:
> > > I agree, return "null" (not "undefined") in such cases.
> > >
> > >
> > > On Fri, Aug 17, 2012 at 7:41 AM, Satish S <satish@google.com> wrote:
> > >>
> > >> > I may have missed something, but I don’t see in the spec where it
> > >> > says that “interpretation” is optional.
> > >>
> > >> Developers specify the interpretation value with SISR and if they don't
> > >> specify there is no 'default' interpretation available. In that sense it
> > >> is optional because grammars don't mandate it. So I think this API
> > >> shouldn't mandate providing a default value if the engine did not provide
> > >> one, and return null in such cases.
> > >>
> > >> Cheers
> > >> Satish
> > >>
> > >>
> > >>
> > >> On Fri, Aug 17, 2012 at 1:57 PM, Deborah Dahl
> > >> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> I may have missed something, but I don’t see in the spec where it
> > >>> says that “interpretation” is optional.
> > >>>
> > >>> From: Satish S [mailto:satish@google.com]
> > >>> Sent: Thursday, August 16, 2012 7:38 PM
> > >>> To: Deborah Dahl
> > >>> Cc: Bjorn Bringert; Hans Wennborg; public-speech-api@w3.org
> > >>>
> > >>>
> > >>> Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> interpretation can't be provided
> > >>>
> > >>>
> > >>>
> > >>> 'interpretation' is an optional attribute because engines are not
> > >>> required to provide an interpretation on their own (unlike
> > >>> 'transcript'). As such I think it should return null when there isn't a
> > >>> value to be returned as that is the convention for optional attributes,
> > >>> not 'undefined' or a copy of some other attribute.
> > >>>
> > >>> If an engine chooses to return the same value for 'transcript' and
> > >>> 'interpretation' or do textnorm of the value and return in
> > >>> 'interpretation' that will be an implementation detail of the engine.
> > >>> But in the absence of any such value for 'interpretation' from the
> > >>> engine I think the UA should return null.
> > >>>
> > >>>
> > >>> Cheers
> > >>> Satish
> > >>>
> > >>> On Thu, Aug 16, 2012 at 2:52 PM, Deborah Dahl
> > >>> <dahl@conversational-technologies.com> wrote:
> > >>>
> > >>> That's a good point. There are lots of use cases where some simple
> > >>> normalization is extremely useful, as in your example, or collapsing
> > >>> all the ways that the user might say "yes" or "no". However, you could
> > >>> say that once the implementation has modified or normalized the
> > >>> transcript that means it has some kind of interpretation, so putting a
> > >>> normalized value in the interpretation slot should be fine. Nothing
> > >>> says that the "interpretation" has to be a particularly fine-grained
> > >>> interpretation, or one with a lot of structure.
> > >>>
> > >>>
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Bjorn Bringert [mailto:bringert@google.com]
> > >>> > Sent: Thursday, August 16, 2012 9:09 AM
> > >>> > To: Hans Wennborg
> > >>> > Cc: Conversational; public-speech-api@w3.org
> > >>> > Subject: Re: SpeechRecognitionAlternative.interpretation when
> > >>> > interpretation can't be provided
> > >>> >
> > >>> > I'm not sure that it has to be that strict in requiring that the
> > >>> > value is the same as the "transcript" attribute. For example, an
> > >>> > engine might return the words recognized in "transcript" and apply
> > >>> > some extra textnorm to the text that it returns in "interpretation",
> > >>> > e.g. converting digit words to digits ("three" -> "3"). Not sure if
> > >>> > that's useful though.
> > >>> >
> > >>> > On Thu, Aug 16, 2012 at 1:58 PM, Hans Wennborg
> > >>> > <hwennborg@google.com> wrote:
> > >>> > > Yes, the raw text is in the 'transcript' attribute.
> > >>> > >
> > >>> > > The description of 'interpretation' is currently: "The
> > >>> > > interpretation represents the semantic meaning from what the user
> > >>> > > said. This might be determined, for instance, through the SISR
> > >>> > > specification of semantics in a grammar."
> > >>> > >
> > >>> > > I propose that we change it to "The interpretation represents the
> > >>> > > semantic meaning from what the user said. This might be
> > >>> > > determined, for instance, through the SISR specification of
> > >>> > > semantics in a grammar. If no semantic meaning can be determined,
> > >>> > > the attribute must be a string with the same value as the
> > >>> > > 'transcript' attribute."
> > >>> > >
> > >>> > > Does that sound good to everyone? If there are no objections,
> > >>> > > I'll make the change to the draft next week.
> > >>> > >
> > >>> > > Thanks,
> > >>> > > Hans
> > >>> > >
> > >>> > > On Wed, Aug 15, 2012 at 5:29 PM, Conversational
> > >>> > > <dahl@conversational-technologies.com> wrote:
> > >>> > >> I can't check the spec right now, but I assume there's already
> > >>> > >> an attribute that currently is defined to contain the raw text.
> > >>> > >> So I think we could say that if there's no interpretation the
> > >>> > >> value of the interpretation attribute would be the same as the
> > >>> > >> value of the "raw string" attribute.
> > >>> > >>
> > >>> > >> Sent from my iPhone
> > >>> > >>
> > >>> > >> On Aug 15, 2012, at 9:57 AM, Hans Wennborg
> > >>> > >> <hwennborg@google.com> wrote:
> > >>> > >>
> > >>> > >>> OK, that would work I suppose.
> > >>> > >>>
> > >>> > >>> What would the spec text look like? Something like "[...] If no
> > >>> > >>> semantic meaning can be determined, the attribute will be a
> > >>> > >>> string representing the raw words that the user spoke."?
> > >>> > >>>
> > >>> > >>> On Wed, Aug 15, 2012 at 2:24 PM, Bjorn Bringert
> > >>> > >>> <bringert@google.com> wrote:
> > >>> > >>>> Yeah, that would be my preference too.
> > >>> > >>>>
> > >>> > >>>> On Wed, Aug 15, 2012 at 2:19 PM, Conversational
> > >>> > >>>> <dahl@conversational-technologies.com> wrote:
> > >>> > >>>>> If there isn't an interpretation I think it would make the
> > >>> > >>>>> most sense for the attribute to contain the literal string
> > >>> > >>>>> result. I believe this is what happens in VoiceXML.
> > >>> > >>>>>
> > >>> > >>>>>> My question is: for implementations that cannot provide an
> > >>> > >>>>>> interpretation, what should the attribute's value be? null?
> > >>> > >>>>>> undefined?
> > >>> >
> > >>> >
> > >>> >
> > >>> > --
> > >>> > Bjorn Bringert
> > >>> > Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> > >>> > Palace Road, London, SW1W 9TQ
> > >>> > Registered in England Number: 3977902
> > >>>
> > >>>
> > >>>
> > >>
> > >>
Received on Thursday, 13 September 2012 17:43:23 UTC