Re: Concatenating transcript results from Glen Shires on 2012-09-01 (public-speech-api@w3.org from September 2012)

From: Glen Shires <gshires@google.com>
Date: Sat, 1 Sep 2012 07:48:23 -0700
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: Satish S <satish@google.com>, "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <CAEE5bcguZMf-bh1m3c7KuoWqoMBLGJUvyZRODkktBxOw_kafJg@mail.gmail.com>
Wonderful, it seems we're all in agreement: a JavaScript author can
simply concatenate SpeechRecognitionResults to create a proper transcript,
and does NOT need to add additional whitespace. This simple concatenation
works for all languages, including compound words (no edge cases) and CJK.
This would also continue to be the default behavior if we do choose to add
a flag for alternative behavior in the future.

To make this clearer, I've slightly reworded my proposed text for the
spec as follows. If there's no disagreement, I'll add this to the spec on
Tuesday.

"For continuous recognition, leading or trailing whitespace MUST be
included where necessary such that concatenation of consecutive
SpeechRecognitionResults produces a proper transcript of the session."
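
As an illustration only (not part of the proposed spec text), here is a
minimal JavaScript sketch of what the wording above would let an author do.
It assumes each result is array-like, with the top alternative at index 0
exposing a `transcript` string. The helper names (joinTranscripts,
containsPhrase, matchesCommand) and the sample data are hypothetical.

```javascript
// Illustrative sketch only; helper names and sample data are hypothetical.
// Assumption: each result is array-like and result[0].transcript is the
// top alternative's text, with any needed whitespace already included by
// the recognizer.

// Build the session transcript by plain concatenation (no separator).
function joinTranscripts(results) {
  let text = "";
  for (let i = 0; i < results.length; i++) {
    text += results[i][0].transcript;
  }
  return text;
}

// Keyword spotting across result boundaries: search the joined text.
function containsPhrase(results, phrase) {
  return joinTranscripts(results).toLowerCase().includes(phrase.toLowerCase());
}

// Command-and-control: ignore leading/trailing whitespace when comparing.
function matchesCommand(transcript, command) {
  return transcript.trim().toLowerCase() === command.toLowerCase();
}

// Stand-in for two consecutive results; note the leading space on the second.
const sampleResults = [
  [{ transcript: "I'd like a peanut" }],
  [{ transcript: " butter sandwich." }],
];

console.log(joinTranscripts(sampleResults));
// prints: I'd like a peanut butter sandwich.
```

With this shape, containsPhrase(sampleResults, "peanut butter") finds the
keywords even though they arrive in separate results, and an app that
matches commands only has to trim before comparing. For a language that
does not use whitespace, the same join should apply unchanged, since the
recognizer would simply include none.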

On Fri, Aug 31, 2012 at 5:39 PM, Young, Milan <Milan.Young@nuance.com> wrote:

> I’m uncomfortable with language-specific behavior.  That might become a
> mess if one wanted to write a multi-lingual page.
>
> If we’re really intent upon doing this, Glen’s suggestion of a flag that
> defaults to true seems like the best route.  That way English and other
> language users can easily disable the behavior if they choose.
>
> Also, as Jerry pointed out in a fork of this thread, there are some
> English edge cases where engine-driven whitespace may make sense.
>
> Thanks
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Friday, August 31, 2012 10:23 AM
> *To:* Glen Shires
> *Cc:* Young, Milan; public-speech-api@w3.org
> *Subject:* Re: Concatenating transcript results
>
> Glen and I talked about this later and I also looked at other speech
> recognition APIs where appending a white space in some form is the norm
> (either as flags or as a space character). With those in mind, Glen's
> suggestion of making the flag default to true makes sense, and in v1 we
> could leave out the flag. So I am OK with the original proposal of
> appending a white space in the transcript for languages where it is
> applicable, and if we get developer feedback that a flag to turn it off is
> necessary, it can be added in a future revision of the spec proposal.
>
> Cheers
> Satish
>
> On Fri, Aug 31, 2012 at 11:01 AM, Satish S <satish@google.com> wrote:
>
> Looking at it from another angle - if there was automatic binding to an
> HTML element and the spoken text was entered into the element by the
> browser, then adding spaces automatically is the right thing to do. The
> equivalent for this is the keyboard IME on mobile phones where tapping on a
> word in the suggestion bar enters the word and a space with it. But the
> events that get dispatched to JS should not contain spaces appended or
> prepended.
>
> Some web apps may want to also offer correction of a word/phrase based on
> the list of hypotheses in the results, so when the user taps/clicks on the
> phrase it may offer a drop down list of suggestions. If we add spaces
> before or after the phrase then the UI would include those in the highlight
> instead of just the text, so developers may end up stripping off the space
> to show a better UI. This feels like working against the framework and
> something we should avoid.
>
> Perhaps we could look at it post v1 of the spec based on developer
> feedback?
>
> Cheers
> Satish
>
> On Fri, Aug 31, 2012 at 1:51 AM, Glen Shires <gshires@google.com> wrote:
>
> If this is an optional flag that we add in the future, I strongly believe
> the default should be true.  (That is, until we add this feature, proper
> whitespace must be inserted by the speech recognizer.)
>
> If a user is searching for consecutive keywords such as "peanut
> butter", there is no guarantee that they will be returned in the same
> final result. For example:
>
> result[0].transcript = "I'd like a peanut"
>
> result[1].transcript = "butter sandwich."
>
> While there are various algorithms that might be used to find the
> consecutive keywords, perhaps the easiest is to concatenate the results
> together with a space in between, and search for "peanut butter". So for
> this use case, it would be simpler if the speech recognizer had returned
> the results with proper whitespace.
>
> But frankly, I think the complexity of writing a JavaScript algorithm that
> knows how to insert proper whitespace, and works on a wide variety of
> international languages, far outweighs any minor simplification of scanning
> for keywords by ignoring leading/trailing whitespace. I believe there will
> be many applications that do use dictation to generate emails, documents,
> product reviews, etc. So I believe we must ensure that authoring a
> dictation app is not more difficult than it needs to be.
>
> /Glen Shires
>
> On Thu, Aug 30, 2012 at 4:04 PM, Satish S <satish@google.com> wrote:
>
> Stripping whitespace is something that almost every app that doesn't use
> the API for dictation would need. To me this looks like an optional
> feature, something which gets turned on based on a flag such as
> "SpeechRecognition.autoWhiteSpace" that the developer would set if they
> want it, and as such it could be added in a future revision of the API if
> we see developers asking for it.
>
> Cheers
> Satish
>
> On Thu, Aug 30, 2012 at 9:48 PM, Glen Shires <gshires@google.com> wrote:
>
> Inserting whitespace is non-trivial, particularly when considering
> punctuation and internationalization. Some punctuation is placed before the
> whitespace, others after. Some languages don't use whitespace. I'd prefer
> to avoid placing this burden on the JavaScript author.  Speech recognition
> engines already contain this logic.
>
> Conversely, stripping leading and trailing whitespace is trivial, as is
> writing a comparison routine that ignores whitespace.
>
> On Thu, Aug 30, 2012 at 1:35 PM, Young, Milan <Milan.Young@nuance.com>
> wrote:
>
> I prefer Satish’s suggestion.  If the web author needs to concatenate,
> sandwiching in some whitespace seems like a trivial adjustment.
>
> *From:* Satish S [mailto:satish@google.com]
> *Sent:* Thursday, August 30, 2012 1:28 PM
> *To:* Glen Shires
> *Cc:* public-speech-api@w3.org
> *Subject:* Re: Concatenating transcript results
>
> We could also say the transcript should not include leading or trailing
> spaces, so the web app should always use a whitespace if it needs to
> concatenate.  This would work better for apps that check the transcript
> against known words (e.g. command and control) instead of having to
> append/prepend whitespace to their string literals. Also, depending on the
> language of the recognized text, whitespace may not be appropriate (e.g. CJK
> languages don't use white spaces).
>
> Cheers
> Satish
>
> On Thu, Aug 30, 2012 at 6:11 PM, Glen Shires <gshires@google.com> wrote:
>
> If there's no disagreement by the end of the week I'll add it to the
> spec...
>
> On Wed, Aug 29, 2012 at 9:36 AM, Glen Shires <gshires@google.com> wrote:
>
> I propose adding the following sentence to the definition
> of SpeechRecognitionAlternative.transcript to make it clear that a
> JavaScript author can simply concatenate SpeechRecognitionResults without
> the author having to worry about where/when to add whitespace.
>
> "For continuous recognition, whitespace MUST be included in the
> transcript, including leading or trailing whitespace, as necessary such
> that concatenation of consecutive SpeechRecognitionResults produces a
> proper transcript of the session."
>
Received on Saturday, 1 September 2012 14:49:33 UTC