RE: Continuous recognition API from Young, Milan on 2011-05-23 (public-xg-htmlspeech@w3.org from May 2011)

From: Young, Milan <Milan.Young@nuance.com>
Date: Mon, 23 May 2011 07:50:32 -0700
To: "Bjorn Bringert" <bringert@google.com>, <public-xg-htmlspeech@w3.org>
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0B4D2029@SUN-EXCH01.nuance.com>

You say below that the speech service "divides the audio into chunks".  This
could be read as allowing the SS to apply part of a chunk to one result
and the remainder to the next.  We may want to discuss this further once
we better understand the streaming model.

I also thought we agreed the web-app could send continuous correction
feedback following the same model as feedback in the form-filling case.
The main difference to consider is that, in the continuous case, feedback
could trigger the SS to send replacement results.

I suggest that we come up with a good use case before speccing
intermediate results.  Even marking them as optional will consume time
to: 1) spec, 2) implement in the UA, and 3) handle conformance.


-----Original Message-----
From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Bjorn Bringert
Sent: Monday, May 23, 2011 7:29 AM
To: public-xg-htmlspeech@w3.org
Subject: Continuous recognition API

This is a summary of the continuous recognition API proposed in the
face-to-face today. I'm sorry if it's not comprehensible for those not
attending the face-to-face.

As already agreed, a one-shot recognition returns a single Result:

Result { EMMA; Alternative[] }
Alternative { utterance, confidence, interpretation }
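
The two records above could be written down as types roughly as
follows. This is only a sketch: the field types (EMMA as serialized XML
text, confidence as a number in [0, 1], interpretation as an opaque
value) and the helper bestAlternative are assumptions for illustration,
not something the proposal specifies.

```typescript
// Sketch of the one-shot result shape. All field types are assumptions.
interface Alternative {
  utterance: string;        // recognized text
  confidence: number;       // assumed to lie in [0, 1]
  interpretation: unknown;  // semantic interpretation, shape left open
}

interface Result {
  emma: string;             // EMMA document, assumed serialized as XML text
  alternatives: Alternative[];
}

// Hypothetical helper: pick the alternative with the highest confidence.
// Returns undefined when the result carries no alternatives.
function bestAlternative(result: Result): Alternative | undefined {
  let best: Alternative | undefined;
  for (const alt of result.alternatives) {
    if (best === undefined || alt.confidence > best.confidence) {
      best = alt;
    }
  }
  return best;
}
```

Whether alternatives arrive sorted by confidence is not stated above,
so the helper does not rely on any ordering.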

Continuous recognition ('result' event), REQUIRED:

- In continuous recognition mode, audio is continuously captured and
passed to the speech recognition service.
- The speech recognition service divides the audio into chunks in some
way (e.g. at sentence boundaries).
- If an SRGS grammar is specified for the continuous recognition
request, each Result should correspond to a single utterance in the
grammar.
- For each chunk, the speech recognition service sends a 'result'
event containing a Result object.

Continuous recognition ('intermediate' event), OPTIONAL:

- The speech recognition service may return 'intermediate' events.
- An intermediate event contains a Result which represents the entire
audio since the last 'result' event.

Continuous recognition ('replace' event), OPTIONAL:

- Each 'result' event has an ID.
- The speech recognition service can send 'replace' events containing
{ ID of result to replace, new Result }.
- This must refer to a previous result event.
- It does not represent any new input.


An example using all three:

User says "my hovercraft is full of eels. they are tasty."

1. 'intermediate': "may"
2. 'intermediate': "my hovercraft"
3. 'intermediate': "my hovercraft is fool"
4. 'intermediate': "my hovercraft is full of eel"
5. 'result': ID=0, "my hovercraft is full of eel."
6. 'intermediate': "they"
7. 'intermediate': "they are"
8. 'intermediate': "they aren't tasty"
9. 'result': ID=1, "they are tasty."
10. 'replace': ID=0, "my hovercraft is full of eels."
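
The sequence above can be replayed through a small client-side sketch
to see how the three events combine into a transcript. The event shape
(a type tag plus the top utterance as plain text) and the Transcript
class are hypothetical simplifications; in the proposal each event
carries a full Result.

```typescript
// Simplified event payloads: only the top utterance is kept (assumption).
type RecognitionEvent =
  | { type: "intermediate"; text: string }
  | { type: "result"; id: number; text: string }
  | { type: "replace"; id: number; text: string };

class Transcript {
  private finals: { id: number; text: string }[] = [];
  private pending = ""; // latest hypothesis since the last 'result'

  handle(ev: RecognitionEvent): void {
    if (ev.type === "intermediate") {
      // Covers all audio since the last 'result', so it overwrites.
      this.pending = ev.text;
    } else if (ev.type === "result") {
      this.finals.push({ id: ev.id, text: ev.text });
      this.pending = "";
    } else {
      // 'replace' must refer to a previous result and adds no new input.
      const target = this.finals.filter((f) => f.id === ev.id)[0];
      if (target === undefined) {
        throw new Error(`replace refers to unknown result ${ev.id}`);
      }
      target.text = ev.text;
    }
  }

  text(): string {
    const parts = this.finals.map((f) => f.text);
    if (this.pending !== "") parts.push(this.pending);
    return parts.join(" ");
  }
}

const events: RecognitionEvent[] = [
  { type: "intermediate", text: "may" },
  { type: "intermediate", text: "my hovercraft" },
  { type: "intermediate", text: "my hovercraft is fool" },
  { type: "intermediate", text: "my hovercraft is full of eel" },
  { type: "result", id: 0, text: "my hovercraft is full of eel." },
  { type: "intermediate", text: "they" },
  { type: "intermediate", text: "they are" },
  { type: "intermediate", text: "they aren't tasty" },
  { type: "result", id: 1, text: "they are tasty." },
  { type: "replace", id: 0, text: "my hovercraft is full of eels." },
];

const transcript = new Transcript();
for (const ev of events) transcript.handle(ev);
console.log(transcript.text());
```

After the last event the transcript reads "my hovercraft is full of
eels. they are tasty.", with result 0 corrected in place.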


It should be possible to change parameters and grammars during
continuous recognition. All 'result' events returned after a grammar
or parameter is changed must reflect that change. This means that the
speech recognition service may need to buffer audio since the last
'result' event to rerecognize it in case of a parameter or grammar
change.
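
One way to picture the buffering requirement is sketched below: the
service keeps every audio frame received since the last 'result' event
and hands that backlog back for re-recognition when a grammar or
parameter changes. The class PendingAudio and its method names are
hypothetical, and the frame type is left generic because the audio
format is not discussed here.

```typescript
// Hypothetical buffer of audio not yet covered by a 'result' event.
class PendingAudio<Frame> {
  private frames: Frame[] = [];

  // Called for every captured audio frame.
  append(frame: Frame): void {
    this.frames.push(frame);
  }

  // Called once a 'result' event has been sent for the buffered audio.
  commit(): void {
    this.frames = [];
  }

  // Called on a grammar or parameter change: returns a copy of the
  // uncommitted audio so it can be re-recognized under the new settings.
  // The buffer itself is kept, since no 'result' has been sent for it.
  replay(): Frame[] {
    return this.frames.slice();
  }
}
```

A real service would also need to bound this buffer, which the
proposal does not address.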


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Monday, 23 May 2011 14:51:03 UTC