Use cases

I believe we should consider some use cases describing how speech might 
be used within HTML5 applications.  Below are brief sketches of five 
HTML use cases that use speech technology.  I believe that developers 
will want to build applications based on these and similar use cases.

I encourage others to respond to this e-mail with comments about these 
use cases, and to offer alternative or additional use cases that will 
help us to identify requirements.

1. Hello world
a. Example: When a page is loaded, the speech synthesizer renders the text 
"Hello World".
b. Input to TTS
i. Notification of triggering event /start/
ii. Name of the speech synthesis engine and associated parameters, e.g., 
voice (male or female), volume, replay speed, and pronunciation lexicon 
(contains pronunciations for unusual words)
iii. Text to be rendered
c. Output from TTS
i. Audio stream containing the rendered content for presentation to the user
ii. Return code (e.g., successful, missing input parameter, missing text 
to be rendered)
2. Basic VCR-like text reader
a. Example: Users may /start, pause, resume, rewind/ synthesized 
content. Users may also /increase speed, decrease speed, increase volume, 
decrease volume/.
b. Input to TTS
i. User-triggered events, e.g., /start, pause, resume, rewind, 
increase speed, decrease speed, increase volume, decrease volume/
ii. Name of the speech synthesis engine and associated parameters, e.g., 
voice (male or female), volume, replay speed, and pronunciation lexicon 
(contains pronunciations for unusual words)
iii. Text to be rendered
c. Output from TTS
i. Audio stream containing the rendered content
ii. Return codes, e.g., /started, paused, resumed, rewound, changed 
volume, changed speed, no-more-text-to-read/
3. Free-form collector
a. Example: User enters a page, reads a prompt stating "What is your 
name?", and then speaks his/her name. After speaking, the user reads the 
recognized text, which is displayed on the screen.
b. Input to ASR: audio spoken by the user
c. Output from ASR: text that the ASR recognized, /done/ return code
d. Comment: this uses a free-form speech recognition engine so no 
grammar is required.
4. Grammar-based collector
a. Example: User enters a page, reads a prompt stating "Where do you 
want to go?", and then speaks the destination "Austin". The speech 
recognition engine compares the spoken phrase with the words in the 
grammar and determines that the word might be either "Austin" or 
"Boston". The ASR returns an n-best list consisting of two words, Austin 
and Boston, and associated confidence scores. The collector displays a 
menu to the user asking "Did you say (1) Austin, or (2) Boston?" The 
user selects one of the menu options.
b. Input to ASR: audio spoken by the user, a grammar describing the 
words which the ASR will listen for
c. Output from ASR: text that the ASR recognized, /done/ return code, 
n-best list of words and their confidence scores.
5. Dictation collector (press to speak)
a. Example: The user dictates the contents of an e-mail message: "This 
meeting is going overtime. I will be late getting home. See you later." 
The user presses a button while speaking and releases the button when 
finished.
b. Input to ASR: start-speaking event, stop-speaking event, audio to be 
transcribed to text.
c. Output from ASR: dictated text.
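
To make use case 2 a little more concrete, here is a minimal sketch in 
Python of how the user-triggered events might map to return codes. It is 
only an illustration, not a proposed API. The class name TextReader, the 
method names, the step sizes and limits, the extra "ignored" code, and the 
choice to rewind by one chunk are all my own assumptions. No real speech 
synthesis engine is called; the sketch only tracks what would be rendered 
next.

```python
from typing import List, Optional, Tuple


class TextReader:
    """Hypothetical controller for the VCR-like text reader (use case 2)."""

    SPEED_STEP = 25    # assumed step, in percent of normal speed
    VOLUME_STEP = 10   # assumed step, in percent of full volume

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = list(chunks)  # text to be rendered, split into chunks
        self.position = 0           # index of the next chunk to render
        self.state = "idle"         # "idle", "playing", or "paused"
        self.speed = 100            # percent of normal speed
        self.volume = 50            # percent of full volume

    def handle(self, event: str) -> str:
        """Apply one user-triggered event and return a return code."""
        if event == "start":
            self.position = 0
            self.state = "playing"
            return "started"
        if event == "pause" and self.state == "playing":
            self.state = "paused"
            return "paused"
        if event == "resume" and self.state == "paused":
            self.state = "playing"
            return "resumed"
        if event == "rewind":
            # Assumption: rewinding moves back by one chunk.
            self.position = max(0, self.position - 1)
            return "rewound"
        if event in ("increase speed", "decrease speed"):
            sign = 1 if event.startswith("increase") else -1
            self.speed = min(300, max(25, self.speed + sign * self.SPEED_STEP))
            return "changed speed"
        if event in ("increase volume", "decrease volume"):
            sign = 1 if event.startswith("increase") else -1
            self.volume = min(100, max(0, self.volume + sign * self.VOLUME_STEP))
            return "changed volume"
        # Assumed extra code for events that do not apply in the current state.
        return "ignored"

    def next_chunk(self) -> Tuple[Optional[str], Optional[str]]:
        """Return (text to hand to the TTS, return code); either may be None."""
        if self.state != "playing":
            return None, None
        if self.position >= len(self.chunks):
            self.state = "idle"
            return None, "no-more-text-to-read"
        text = self.chunks[self.position]
        self.position += 1
        return text, None
```

A page would call handle() from its button handlers and call next_chunk() 
whenever the synthesizer is ready for more text, stopping when it sees 
/no-more-text-to-read/.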

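Similarly, the confirmation step in use case 4 could be sketched as 
follows. Again this is only an illustration in Python under my own 
assumptions: the names Hypothesis, build_confirmation_menu, and 
resolve_choice are hypothetical, confidence scores are assumed to be 
numbers where higher means more likely, and the scores in the usage 
lines are made up. The ASR itself is not modeled; the sketch starts from 
the n-best list it returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hypothesis:
    """One n-best entry: recognized text plus its confidence score."""
    text: str
    confidence: float  # assumption: higher means more likely


def rank(n_best: List[Hypothesis]) -> List[Hypothesis]:
    """Order the n-best list from most to least confident."""
    return sorted(n_best, key=lambda h: h.confidence, reverse=True)


def build_confirmation_menu(n_best: List[Hypothesis]) -> str:
    """Build the 'Did you say ...?' menu text shown to the user."""
    if not n_best:
        raise ValueError("n-best list is empty")
    options = [f"({i}) {h.text}" for i, h in enumerate(rank(n_best), start=1)]
    if len(options) == 1:
        return f"Did you say {options[0]}?"
    return "Did you say " + ", ".join(options[:-1]) + ", or " + options[-1] + "?"


def resolve_choice(n_best: List[Hypothesis], choice: int) -> str:
    """Map the menu number the user selected back to the recognized text."""
    ranked = rank(n_best)
    if not 1 <= choice <= len(ranked):
        raise ValueError(f"choice must be between 1 and {len(ranked)}")
    return ranked[choice - 1].text


# Usage with made-up confidence scores:
n_best = [Hypothesis("Boston", 0.41), Hypothesis("Austin", 0.47)]
menu = build_confirmation_menu(n_best)
destination = resolve_choice(n_best, 1)
```

With these inputs the menu lists Austin first because it has the higher 
score, and selecting option 1 yields "Austin".
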
Received on Thursday, 9 September 2010 20:00:26 UTC