
Re: Continuous recognition API

From: Bjorn Bringert <bringert@google.com>
Date: Mon, 23 May 2011 16:05:59 +0100
Message-ID: <BANLkTimtB2N8WrA83tJ9MQUV1pdSc-j52g@mail.gmail.com>
To: "Young, Milan" <Milan.Young@nuance.com>
Cc: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
On Mon, May 23, 2011 at 3:50 PM, Young, Milan <Milan.Young@nuance.com> wrote:
> You say below that the speech service "divides the audio chunks".  This
> could be interpreted that the SS could apply a portion of a chunk in one
> result, and the remainder applied to the next.  May want to discuss this
> further once we better understand the streaming model.

I meant that each Result comes from a single audio chunk. But that's
really a service implementation detail I guess.

> I also thought we agreed the web-app could send continuous correction
> feedback following the same model as feedback in the form-filling case.
> The main difference to consider is that in the continuous case, feedback
> could trigger the SS sending replacement results.

Yeah, I forgot to write about that. The Result object could contain a
feedback method for telling the speech recognition service when the
user corrects a result. The SS could then send a replace event for
some other Result if it wants.
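
Something along these lines, perhaps — the `feedback` method name and the service callback shape are purely hypothetical, just to make the idea concrete:

```typescript
// Illustrative sketch only: how a feedback hook on Result might work.
// 'feedback' and the service callback are invented names, not agreed API.

interface Alternative {
  utterance: string;
  confidence: number;
}

// The web app reports a user correction via Result.feedback(); the
// speech service may later push a 'replace' event for some other
// result it re-scores in light of the correction.
class Result {
  constructor(
    public id: number,
    public alternatives: Alternative[],
    private service: { onFeedback(id: number, corrected: string): void }
  ) {}

  feedback(corrected: string): void {
    // Forward the correction to the speech service.
    this.service.onFeedback(this.id, corrected);
  }
}
```

A page would call something like `result.feedback("my hovercraft is full of eels.")` after the user edits the recognized text.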

> I suggest that we come up with a good use case before specing
> intermediate results.  Even marking them as optional will consume time
> to: 1) spec, 2) implement in UA, and 3) handle conformance.

Yeah, I'm fine with omitting intermediate events for now. I just
wanted to capture how we could add them if we want.

> -----Original Message-----
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
> Sent: Monday, May 23, 2011 7:29 AM
> To: public-xg-htmlspeech@w3.org
> Subject: Continuous recognition API
>
> This is a summary of the continuous recognition API proposed in the
> face-to-face today. I'm sorry if it's not comprehensible to those who
> did not attend the face-to-face.
>
> As already agreed, a one-shot recognition returns a single Result:
>
> Result { EMMA; Alternative[] }
> Alternative { utterance, confidence, interpretation }
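
In TypeScript-ish terms, those two shapes might look something like this (field names and types are my guesses for illustration, not an agreed API):

```typescript
// Hypothetical TypeScript rendering of the Result / Alternative shapes
// above. Field names and types are illustrative only.

interface Alternative {
  utterance: string;        // recognized text for this hypothesis
  confidence: number;       // e.g. 0.0 - 1.0
  interpretation: unknown;  // semantic result from the grammar, if any
}

interface Result {
  emma: string;                 // serialized EMMA document
  alternatives: Alternative[];  // hypotheses, best first
}
```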
>
> Continuous recognition ('result' event), REQUIRED:
>
> - In continuous recognition mode, audio is continuously captured and
> passed to the speech recognition service.
> - The speech recognition service divides the audio into chunks in some
> way (e.g. at sentence boundaries).
> - If an SRGS grammar is specified for the continuous recognition
> request, each Result should correspond to a single utterance in the
> grammar.
> - For each chunk, the speech recognition service sends a 'result'
> event containing a Result object.
>
> Continuous recognition ('intermediate' event), OPTIONAL:
>
> - The speech recognition service may return 'intermediate' events.
> - An intermediate event contains a Result which represents the entire
> audio from the last 'result' event.
>
> Continuous recognition ('replace' event), OPTIONAL:
>
> - Each 'result' event has an ID.
> - The speech recognition service can send 'replace' events containing
> { ID of result to replace, new Result }.
> - This must refer to a previous result event.
> - It does not represent any new input.
>
>
> An example using all three:
>
> User says "my hovercraft is full of eels. they are tasty."
>
> 1. 'intermediate': "may"
> 2. 'intermediate': "my hovercraft"
> 3. 'intermediate': "my hovercraft is fool"
> 4. 'intermediate': "my hovercraft is full of eel"
> 5. 'result': ID=0, "my hovercraft is full of eel."
> 6. 'intermediate': "they"
> 7. 'intermediate': "they are"
> 8. 'intermediate': "they aren't tasty"
> 9. 'result': ID=1 "they are tasty."
> 10. 'replace': ID=0, "my hovercraft is full of eels."
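
A client could fold that event sequence into displayed text roughly like this — the event shapes and class are invented for illustration, only the semantics match the proposal:

```typescript
// Toy client that folds 'intermediate', 'result', and 'replace' events
// into the text shown to the user. Event shapes are invented for
// illustration; only the semantics follow the proposal above.

type SpeechEvent =
  | { type: "intermediate"; text: string }
  | { type: "result"; id: number; text: string }
  | { type: "replace"; id: number; text: string };

class Transcript {
  private finals = new Map<number, string>(); // committed results by ID
  private pending = "";                       // latest intermediate text

  apply(e: SpeechEvent): void {
    if (e.type === "intermediate") {
      this.pending = e.text;            // overwrites previous intermediate
    } else if (e.type === "result") {
      this.finals.set(e.id, e.text);    // commit chunk; clear intermediate
      this.pending = "";
    } else {
      this.finals.set(e.id, e.text);    // replace an earlier result in place
    }
  }

  render(): string {
    const parts = Array.from(this.finals.entries())
      .sort((a, b) => a[0] - b[0])
      .map((pair) => pair[1]);
    if (this.pending) parts.push(this.pending);
    return parts.join(" ");
  }
}
```

Feeding the ten events above through `apply` ends with the display reading "my hovercraft is full of eels. they are tasty."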
>
>
> It should be possible to change parameters and grammars during
> continuous recognition. All 'result' events returned after a grammar
> or parameter is changed must reflect that change. This means that the
> speech recognition service may need to buffer audio since the last
> 'result' event to rerecognize it in case of a parameter or grammar
> change.
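
On the service side, that buffering might look roughly like this (class and method names invented for illustration):

```typescript
// Sketch of the buffering the paragraph above implies: the service keeps
// the audio received since the most recent 'result' event, so it can
// re-run recognition on it if a grammar or parameter changes mid-stream.
// All names here are invented for illustration.

class RerecognitionBuffer {
  private sinceLastResult: Uint8Array[] = [];

  push(chunk: Uint8Array): void {
    this.sinceLastResult.push(chunk);
  }

  // Called when a 'result' event is emitted: that audio is now final
  // and no longer needs to be retained.
  commit(): void {
    this.sinceLastResult = [];
  }

  // Called on a grammar/parameter change: hand back the buffered audio
  // so it can be re-recognized under the new configuration.
  takeForRerecognition(): Uint8Array[] {
    const buffered = this.sinceLastResult;
    this.sinceLastResult = [];
    return buffered;
  }
}
```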
>
>
> --
> Bjorn Bringert
> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
> Palace Road, London, SW1W 9TQ
> Registered in England Number: 3977902
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Monday, 23 May 2011 15:06:27 GMT
