Continuous recognition API from Bjorn Bringert on 2011-05-23 (public-xg-htmlspeech@w3.org from May 2011)

From: Bjorn Bringert <bringert@google.com>
Date: Mon, 23 May 2011 15:28:51 +0100
To: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <BANLkTikLXE=zBpfDJSmFsY9Jvk0OWRfkGQ@mail.gmail.com>

This is a summary of the continuous recognition API proposed at the
face-to-face meeting today. I'm sorry if it's not comprehensible to
those who did not attend the face-to-face.

As already agreed, a one-shot recognition returns a single Result:

Result { EMMA; Alternative[] }
Alternative { utterance, confidence, interpretation }
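
A minimal sketch of these two structures in Python, for readers who
prefer code to notation. The field names `emma` and `alternatives`, the
`best()` helper, the 0.0-1.0 confidence range and the sample scores are
all assumptions made for illustration; none of them were agreed in the
group.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Alternative:
    utterance: str              # recognized text
    confidence: float           # assumed here to lie in 0.0-1.0
    interpretation: Any = None  # semantic interpretation, if any


@dataclass
class Result:
    emma: Optional[str]         # EMMA document, kept as a raw XML string here
    alternatives: List[Alternative] = field(default_factory=list)

    def best(self) -> Optional[Alternative]:
        """Hypothetical helper: highest-confidence alternative, or None."""
        return max(self.alternatives, key=lambda a: a.confidence, default=None)


# Made-up confidence values, for illustration only.
result = Result(
    emma=None,
    alternatives=[
        Alternative("my hovercraft is fool of eels", 0.11),
        Alternative("my hovercraft is full of eels", 0.82),
    ],
)
print(result.best().utterance)
```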

Continuous recognition ('result' event), REQUIRED:

- In continuous recognition mode, audio is continuously captured and
passed to the speech recognition service.
- The speech recognition service divides the audio into chunks in some
way (e.g. at sentence boundaries).
- If an SRGS grammar is specified for the continuous recognition
request, each Result should correspond to a single utterance in the
grammar.
- For each chunk, the speech recognition service sends a 'result'
event containing a Result object.

Continuous recognition ('intermediate' event), OPTIONAL:

- The speech recognition service may return 'intermediate' events.
- An intermediate event contains a Result which represents the entire
audio since the last 'result' event.

Continuous recognition ('replace' event), OPTIONAL:

- Each 'result' event has an ID.
- The speech recognition service can send 'replace' events containing
{ ID of result to replace, new Result }.
- A 'replace' event must refer to a previous 'result' event.
- It does not represent any new input.


An example using all three:

User says "my hovercraft is full of eels. they are tasty."

1. 'intermediate': "may"
2. 'intermediate': "my hovercraft"
3. 'intermediate': "my hovercraft is fool"
4. 'intermediate': "my hovercraft is full of eel"
5. 'result': ID=0, "my hovercraft is full of eel."
6. 'intermediate': "they"
7. 'intermediate': "they are"
8. 'intermediate': "they aren't tasty"
9. 'result': ID=1, "they are tasty."
10. 'replace': ID=0, "my hovercraft is full of eels."
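
To make the semantics of the three events concrete, here is a rough
sketch of how a client might fold the sequence above into displayed
text. The tuple encoding of events and the `replay` function are
hypothetical, and Results are reduced to plain strings. It assumes that
each 'intermediate' supersedes the previous one (since it covers all
audio since the last 'result'), that a 'result' clears the pending
intermediate text, and that a 'replace' overwrites an earlier result
without adding anything new.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, Optional[int], str]  # (event type, result ID, text)

# The ten events from the example, in order.
EVENTS: List[Event] = [
    ("intermediate", None, "may"),
    ("intermediate", None, "my hovercraft"),
    ("intermediate", None, "my hovercraft is fool"),
    ("intermediate", None, "my hovercraft is full of eel"),
    ("result", 0, "my hovercraft is full of eel."),
    ("intermediate", None, "they"),
    ("intermediate", None, "they are"),
    ("intermediate", None, "they aren't tasty"),
    ("result", 1, "they are tasty."),
    ("replace", 0, "my hovercraft is full of eels."),
]


def replay(events: List[Event]) -> List[str]:
    """Return the text a client would display after each event."""
    finals: Dict[int, str] = {}  # final results by ID, in arrival order
    pending = ""                 # latest intermediate hypothesis
    snapshots: List[str] = []
    for kind, result_id, text in events:
        if kind == "intermediate":
            pending = text       # supersedes the previous intermediate
        elif kind == "result":
            finals[result_id] = text
            pending = ""         # this audio is now covered by a result
        elif kind == "replace":
            if result_id not in finals:
                raise ValueError(f"replace for unknown result ID {result_id}")
            finals[result_id] = text  # no new input, just a correction
        else:
            raise ValueError(f"unknown event type {kind!r}")
        parts = list(finals.values())
        if pending:
            parts.append(pending)
        snapshots.append(" ".join(parts))
    return snapshots


for shown in replay(EVENTS):
    print(shown)
```

The last line printed is the corrected transcript: the 'replace' in
step 10 changes the first sentence in place and leaves the second
untouched.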


It should be possible to change parameters and grammars during
continuous recognition. All 'result' events returned after a grammar
or parameter is changed must reflect that change. This means that the
speech recognition service may need to buffer audio since the last
'result' event so that it can be re-recognized after a parameter or
grammar change.
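
One possible shape for that buffering, sketched under heavy
assumptions: the `BufferedSession` class, its method names and the
`fake_recognize` stand-in are all hypothetical, and a real recognizer
would work on audio rather than on ASCII bytes. The point is only that
audio received since the last 'result' is kept, so a grammar change can
trigger re-recognition of it.

```python
from typing import Callable

Recognizer = Callable[[bytes, str], str]  # (audio, grammar) -> hypothesis


class BufferedSession:
    """Buffers audio since the last 'result' so it can be re-recognized."""

    def __init__(self, recognize: Recognizer, grammar: str) -> None:
        self._recognize = recognize
        self._grammar = grammar
        self._buffer = bytearray()  # audio since the last 'result'
        self.partial = ""           # current hypothesis for buffered audio

    def feed(self, audio: bytes) -> str:
        self._buffer.extend(audio)
        self.partial = self._recognize(bytes(self._buffer), self._grammar)
        return self.partial

    def set_grammar(self, grammar: str) -> None:
        self._grammar = grammar
        if self._buffer:
            # Re-recognize the buffered audio under the new grammar.
            self.partial = self._recognize(bytes(self._buffer), self._grammar)

    def emit_result(self) -> str:
        final = self.partial
        self._buffer.clear()        # the next result starts from fresh audio
        self.partial = ""
        return final


def fake_recognize(audio: bytes, grammar: str) -> str:
    # Stand-in recognizer: tags the "audio" with the grammar that was used.
    return f"[{grammar}] {audio.decode('ascii')}"


session = BufferedSession(fake_recognize, grammar="digits")
session.feed(b"call ")
session.set_grammar("names")   # changed before the result is emitted
session.feed(b"bob")
first = session.emit_result()
session.feed(b"now")
second = session.emit_result()
print(first)
print(second)
```

Both results reflect the "names" grammar, including the audio that
arrived before the change, which is the behaviour the requirement asks
for.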


-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Monday, 23 May 2011 14:29:16 UTC