- From: James Larson <jim@larson-tech.com>
- Date: Thu, 09 Sep 2010 12:59:52 -0700
- To: public-xg-htmlspeech@w3.org
- Message-ID: <4C893CB8.1060308@larson-tech.com>
I believe we should consider some use cases describing how speech might be used within HTML-5 applications. Below are brief sketches of five HTML use cases that use speech technology. I believe that developers will want to build applications based on these and similar use cases. I encourage others to respond to this e-mail with comments about these use cases, and to offer alternative or additional use cases that will help us identify requirements.

1. Hello world
   a. Example: When a page is loaded, the speech synthesis engine renders the text "Hello World".
   b. Input to TTS
      i. Notification of triggering event /start/
      ii. Name of the speech synthesis engine and associated parameters, e.g., voice (male or female), volume, replay speed, pronunciation lexicon (contains pronunciations for unusual words)
      iii. Text to be rendered
   c. Output from TTS
      i. Audio stream containing the rendered content for presentation to the user
      ii. Return code (e.g., successful, missing input parameter, missing text to be rendered, etc.)

2. Basic VCR-like text reader
   a. Example: Users may /start, pause, resume, rewind/ synthesized content. Users may also /increase speed, decrease speed, increase volume, decrease volume/.
   b. Input to TTS
      i. User-triggered events, e.g., /start, pause, resume, rewind, increase speed, decrease speed, increase volume, decrease volume/
      ii. Name of the speech synthesis engine and associated parameters, e.g., voice (male or female), volume, replay speed, pronunciation lexicon (contains pronunciations for unusual words)
      iii. Text to be rendered
   c. Output from TTS
      i. Audio stream containing the rendered content
      ii. Return codes, e.g., /started, paused, resumed, rewound, changed volume, changed speed, no-more-text-to-read/

3. Free-form collector
   a. Example: User enters a page, reads a prompt stating "What is your name?", and then speaks his/her name. After speaking, the user reads the recognized text, which is displayed on the screen.
   b. Input to ASR: audio spoken by the user
   c. Output from ASR: text that the ASR recognized, /done/ return code
   d. Comment: this uses a free-form speech recognition engine, so no grammar is required.

4. Grammar-based collector
   a. Example: User enters a page, reads a prompt stating "Where do you want to go?", and then speaks the destination "Austin". The speech recognition engine compares the spoken phrase with the words in the grammar and determines that the word might be either "Austin" or "Boston". The ASR returns an n-best list consisting of two words, Austin and Boston, and their associated confidence scores. The collector displays a menu to the user asking "Did you say (1) Austin, or (2) Boston?" The user selects one of the menu options.
   b. Input to ASR: audio spoken by the user, a grammar describing the words the ASR will listen for
   c. Output from ASR: text that the ASR recognized, /done/ return code, n-best list of words and their confidence scores

5. Dictation collector (press to speak)
   a. Example: The user dictates the contents of an e-mail message: "This meeting is going overtime. I will be late getting home. See you later." The user presses a button while speaking and releases the button when finished.
   b. Input to ASR: start-speaking event, stop-speaking event, audio to be transcribed to text
   c. Output from ASR: dictated text
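
To make the grammar-based collector (use case 4) a little more concrete, here is a rough Python sketch of how a collector might turn an n-best list into the confirmation menu. Everything in it is an assumption for illustration only: the Hypothesis type, the function names, and the confidence values are hypothetical and do not come from any existing ASR engine or proposed API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hypothesis:
    """One entry of a hypothetical ASR n-best list."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def rank(n_best: List[Hypothesis]) -> List[Hypothesis]:
    """Order hypotheses from most to least confident."""
    return sorted(n_best, key=lambda h: h.confidence, reverse=True)


def build_confirmation_prompt(n_best: List[Hypothesis]) -> Optional[str]:
    """Return a menu prompt, or None when there is nothing to disambiguate."""
    ranked = rank(n_best)
    if len(ranked) < 2:
        return None
    options = [f"({i}) {h.text}" for i, h in enumerate(ranked, start=1)]
    return "Did you say " + ", ".join(options[:-1]) + ", or " + options[-1] + "?"


def resolve_choice(n_best: List[Hypothesis], choice: int) -> str:
    """Map the user's 1-based menu selection back to the recognized text."""
    ranked = rank(n_best)
    if not 1 <= choice <= len(ranked):
        raise ValueError(f"choice must be between 1 and {len(ranked)}")
    return ranked[choice - 1].text


# Illustrative values only; real confidence scores would come from the ASR.
example = [Hypothesis("Boston", 0.41), Hypothesis("Austin", 0.47)]
prompt = build_confirmation_prompt(example)
selected = resolve_choice(example, 1)
```

The sketch always asks for confirmation when more than one hypothesis is returned. A real collector might instead skip the menu when the top confidence score is well above the rest, which is one of the requirements this use case could help us pin down.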
Received on Thursday, 9 September 2010 20:00:26 UTC