Speech API breakdown of work items

On our last call, Debbie suggested we break down the API work into items so that folks can volunteer for the parts they want to work on (we'll need everyone to take about 2 of these each, on average).  As we discussed on the call, some of this might need to wait for the requirements and design-decision work that is ongoing, but much of it could be covered by the discussions we've already had on the calls, over email, and at the face-to-face.  Here is my breakdown (in no particular order) of the basic things we have to do, and the couple of people who have volunteered so far.  Every item except the requirements and design-decisions one needs both the IDL API outline *and* text describing the details of the semantics involved.  If the people on the To: line could reply with which sections they want to take first, that would help (and anyone else in the group who wants to jump in is also welcome).  Also, if anyone thinks of a major section I've omitted, please reply and add it.


1.       Go through the requirements and design decisions and flag and organize the ones that relate to the API, similar to what Marc did for the protocol.  Raj volunteered for this task on the last call.


2.       The markup associated with the recognition element and any associated properties and API (i.e., the element itself, the for/label association, and the implied behavior).  This is still controversial, but it should be orthogonal to the rest of the API.


3.       The API hooks relating to "setting up", "preparing", or "checking the capabilities" of a request (both recognition and TTS).  Dan Druita volunteered for this task on the last call.  (A rough illustrative sketch of what items 3-6 might look like appears after this list.)


4.       The API hooks for specifying grammars and other recognition properties (both what these properties are and how to specify them).  We covered some of this at the F2F.


5.       The API hooks for getting speech results back (both the EMMA XML and text representations that were in a couple of proposals and that Bjorn outlined, and also the continuous results we talked about at the F2F; this possibly also covers feedback functionality).


6.       The recognition events that are raised and the associated handlers and data (including any semantics about time stamps and other related information we covered at the F2F).


7.       The API hooks related to the protocol for both speech and synthesis (which speech service to use, and anything else the protocol team identifies as a need).  This might have to wait until the protocol is further along (and might also be something someone on the protocol team wants to take).


8.       The API hooks for connecting to the capture system.


9.       The API hooks associated with actually doing the recognition.  (This may or may not be different from a combination of 3 and 4 above.)


10.   The API hooks related to actually doing a synthesis transaction.  (A rough sketch covering items 10-12 appears after this list.)


11.   The synthesis events that are raised and the associated handlers and data (same caveat about timing as with 6).


12.   The API hooks for controlling synthesis, if any (pause, resume, play, etc.).


13.   The API to do text-based recognition.  We covered this some at the F2F.


14.   The API to do a combination of bargeable synthesis and recognition.  This was a little controversial, but we discussed it at the F2F.


15.   The API hooks to do continuous recognition (both open microphone and dictation).  This was covered some at the F2F and may just be part of 3, 4, and 9 above (see the small sketch after this list).
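
To make some of these more concrete, here is a very rough sketch (TypeScript-style notation, since we haven't settled on the IDL details) of the kind of surface items 3-6 describe.  Every name and shape below is a placeholder I made up for illustration, not a proposal:

    // Placeholder names only; nothing here is an agreed proposal.

    // Item 4: grammars and other recognition properties.
    interface RecognitionParameters {
      grammars: string[];          // e.g. URIs of SRGS grammars
      language?: string;           // e.g. "en-US"
      maxNBest?: number;
      confidenceThreshold?: number;
    }

    // Item 5: one result, with both a text view and the EMMA XML view.
    interface RecognitionResult {
      utterance: string;           // plain-text representation
      confidence: number;
      emmaXML: string;             // serialized EMMA document
      interim: boolean;            // true for continuous/partial results
    }

    // Item 6: events carry a timestamp plus any result data.
    interface RecognitionEvent {
      type: "audiostart" | "speechstart" | "result" | "error" | "end";
      timestamp: number;           // timing semantics still to be nailed down
      result?: RecognitionResult;
    }

    // Item 3: setting up / preparing / checking capabilities, plus the
    // start/stop hooks that item 9 would flesh out.
    interface RecognitionRequest {
      parameters: RecognitionParameters;
      supports(capability: string): boolean;     // capability check
      prepare(): void;                           // warm up / set up the request
      onevent?: (e: RecognitionEvent) => void;   // handler registration
      start(): void;
      stop(): void;
    }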
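
And the same kind of placeholder sketch for the synthesis side, items 10-12 (again, every name is made up for illustration):

    // Placeholder names only; nothing here is an agreed proposal.

    // Item 11: synthesis events (same caveat about timing as item 6).
    interface SynthesisEvent {
      type: "start" | "mark" | "wordboundary" | "end" | "error";
      timestamp: number;
      markName?: string;           // e.g. for SSML mark events
    }

    // Items 10 and 12: run a synthesis transaction and control playback.
    interface SynthesisRequest {
      text: string;                // plain text or SSML
      serviceURI?: string;         // item 7: which speech service to use
      onevent?: (e: SynthesisEvent) => void;
      play(): void;
      pause(): void;
      resume(): void;
      cancel(): void;
    }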
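
Finally, a small sketch for item 15, mostly to show why continuous recognition may just end up being a mode on the ordinary request rather than a separate API (placeholder names again):

    // Placeholder names only.  If continuous recognition is just a mode plus
    // a couple of properties on the ordinary request, it folds into items 3,
    // 4, and 9 rather than needing its own API.
    type RecognitionMode = "single" | "open-microphone" | "dictation";

    interface ContinuousRecognitionOptions {
      mode: RecognitionMode;
      interimResults: boolean;       // deliver partial results as they stabilize
      endpointDetection: boolean;    // recognizer decides when an utterance ends
    }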
