Re: Use cases

On Thu, Sep 9, 2010 at 8:59 PM, James Larson <jim@larson-tech.com> wrote:
>
> I believe we should consider some use cases describing how speech might be used within HTML5 applications.  Below are brief sketches of five HTML use cases that use speech technology.  I believe that developers will want to develop applications using these and similar use cases.
> I encourage others to respond to this e-mail with comments about these use cases, and to offer alternative or additional use cases that will help us to identify requirements.
>
> 1. Hello world
>    a. Example: When a page is loaded, speech synthesis renders the
>       text “Hello World”.
>    b. Input to TTS:
>       i.   Notification that the triggering event has occurred
>       ii.  Name of the speech synthesis engine and associated
>            parameters, e.g. voice (male or female), volume, replay
>            speed, and a pronunciation lexicon (pronunciations for
>            unusual words)
>       iii. Text to be rendered
>    c. Output from TTS:
>       i.   Audio stream containing the rendered content for
>            presentation to the user
>       ii.  Return code (e.g. successful, missing input parameter,
>            missing text to be rendered)
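
As a strawman, use case 1 could be as small as the script below. The
SpeechSynthesisUtterance-style names are just one assumed shape for a
browser TTS interface, not anything that has been agreed on:

  <script>
    // Speak "Hello World" once the page has loaded.
    window.onload = function () {
      var utterance = new SpeechSynthesisUtterance("Hello World");
      utterance.volume = 1.0;  // 0.0 to 1.0
      utterance.rate = 1.0;    // replay speed
      // Return codes could surface as events:
      utterance.onerror = function (e) { console.log("TTS error: " + e.error); };
      utterance.onend = function () { console.log("TTS done"); };
      speechSynthesis.speak(utterance);
    };
  </script>
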
> 2. Basic VCR-like text reader
>    a. Example: Users may start, pause, resume, and rewind synthesized
>       content. Users may also increase or decrease the speed and the
>       volume.
>    b. Input to TTS:
>       i.   User-triggered events, e.g. start, pause, resume, rewind,
>            increase/decrease speed, increase/decrease volume
>       ii.  Name of the speech synthesis engine and associated
>            parameters, e.g. voice (male or female), volume, replay
>            speed, and a pronunciation lexicon (pronunciations for
>            unusual words)
>       iii. Text to be rendered
>    c. Output from TTS:
>       i.   Audio stream containing the rendered content
>       ii.  Return codes, e.g. started, paused, resumed, rewound,
>            changed volume, changed speed, no-more-text-to-read
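
The VCR-style controls in use case 2 then map onto methods and events.
A sketch, using the same assumed interface as above (a real engine
might only apply rate/volume changes to subsequent utterances, and
rewind would need support beyond plain pause/resume):

  var synth = window.speechSynthesis;
  var utterance = new SpeechSynthesisUtterance(document.body.textContent);
  // "no-more-text-to-read" maps naturally onto an end event:
  utterance.onend = function () { console.log("done reading"); };

  function start()  { synth.speak(utterance); }
  function pause()  { synth.pause(); }
  function resume() { synth.resume(); }
  function faster() { utterance.rate   = Math.min(utterance.rate * 1.25, 10); }
  function slower() { utterance.rate   = Math.max(utterance.rate / 1.25, 0.1); }
  function louder() { utterance.volume = Math.min(utterance.volume + 0.1, 1); }
  function softer() { utterance.volume = Math.max(utterance.volume - 0.1, 0); }
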
> 3. Free-form collector
>    a. Example: The user enters a page, reads a prompt stating “What
>       is your name?”, and then speaks his/her name. After speaking,
>       the user reads the recognized text, which is displayed on the
>       screen.
>    b. Input to ASR: audio spoken by the user
>    c. Output from ASR: text that the ASR recognized, plus a “done”
>       return code
>    d. Comment: This uses a free-form speech recognition engine, so no
>       grammar is required.
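
Use case 3 needs very little API surface. A minimal sketch, where
SpeechRecognition is again an assumed name and the "name" element is a
placeholder for wherever the page displays the result:

  var recognition = new SpeechRecognition();
  recognition.onresult = function (event) {
    // Display the recognized text back to the user.
    document.getElementById("name").textContent =
        event.results[0][0].transcript;
  };
  recognition.onerror = function (event) {
    console.log("ASR error: " + event.error);
  };
  recognition.start();  // begin capturing audio from the user
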
> 4. Grammar-based collector
>    a. Example: The user enters a page, reads a prompt stating “Where
>       do you want to go?”, and then speaks the destination “Austin”.
>       The speech recognition engine compares the spoken phrase with
>       the words in the grammar and determines that the word might be
>       either “Austin” or “Boston”. The ASR returns an n-best list
>       consisting of the two words, Austin and Boston, with associated
>       confidence scores. The collector displays a menu asking the
>       user “Did you say (1) Austin, or (2) Boston?” The user selects
>       one of the menu options.
>    b. Input to ASR: audio spoken by the user, plus a grammar
>       describing the words the ASR will listen for
>    c. Output from ASR: text that the ASR recognized, a “done” return
>       code, and an n-best list of words and their confidence scores
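
For use case 4, the grammar and the n-best list are the interesting
parts. In the sketch below, SpeechGrammarList, the JSGF grammar string
and maxAlternatives are all assumptions about how a grammar and an
n-best request might be expressed:

  var recognition = new SpeechRecognition();
  var grammars = new SpeechGrammarList();
  // Grammar listing the destinations the ASR will listen for.
  grammars.addFromString(
      "#JSGF V1.0; grammar city; public <city> = Austin | Boston ;", 1.0);
  recognition.grammars = grammars;
  recognition.maxAlternatives = 2;  // request a 2-best list

  recognition.onresult = function (event) {
    var result = event.results[0];
    // Each alternative carries a transcript and a confidence score,
    // e.g. "Austin" (0.52) and "Boston" (0.48), enough to build the
    // "Did you say (1) Austin, or (2) Boston?" menu.
    for (var i = 0; i < result.length; i++) {
      console.log(result[i].transcript + " " + result[i].confidence);
    }
  };
  recognition.start();
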
> 5. Dictation collector (press to speak)
>    a. Example: The user dictates the contents of an e-mail message:
>       “This meeting is going overtime. I will be late getting home.
>       See you later.” The user presses a button while speaking and
>       releases the button when finished.
>    b. Input to ASR: start-speaking event, stop-speaking event, and
>       the audio to be transcribed to text
>    c. Output from ASR: dictated text
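
Press-to-speak in use case 5 maps onto explicit start/stop calls. A
sketch, with the button wiring and the "continuous" flag assumed:

  var recognition = new SpeechRecognition();
  recognition.continuous = true;  // keep transcribing until told to stop

  recognition.onresult = function (event) {
    var text = "";
    for (var i = 0; i < event.results.length; i++) {
      text += event.results[i][0].transcript;
    }
    document.getElementById("body").value = text;  // e-mail body field
  };

  var talk = document.getElementById("talk");  // press-to-speak button
  talk.onmousedown = function () { recognition.start(); };  // start-speaking
  talk.onmouseup   = function () { recognition.stop();  };  // stop-speaking
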
>

Here are some more use cases (copied from the HTML Speech Input
Element proposal at
https://docs.google.com/View?id=dcfg79pz_3drj79fhq):

- Web search by voice: Speak a search query and get search results.

- Speech translation: The app works as an interpreter between two
users that speak different languages.

- Speech-enabled webmail client, e.g. for in-car use. Reads out
e-mails and listens for commands, e.g. "archive", "star", "reply, ok,
let's meet at 2 pm", "forward to bob".

- Speech shell: Allows multiple commands, most of which take
arguments, some of which are free-form. E.g. "call <number>", "call
<contact>", "calculate <arithmetic expression>", "search for <query>".
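
The shell could do its command dispatch entirely in script, treating
the transcript as free-form text. The patterns and handlers below are
made up for illustration:

  // Hypothetical handlers; a real app would place the call, etc.
  function call(who)       { console.log("calling " + who); }
  function calculate(expr) { console.log("calculating " + expr); }
  function search(query)   { console.log("searching for " + query); }

  // Map command patterns to handlers; the first match wins.
  var commands = [
    { pattern: /^call (.+)$/,       run: function (m) { call(m[1]); } },
    { pattern: /^calculate (.+)$/,  run: function (m) { calculate(m[1]); } },
    { pattern: /^search for (.+)$/, run: function (m) { search(m[1]); } }
  ];

  function dispatch(transcript) {
    for (var i = 0; i < commands.length; i++) {
      var m = transcript.match(commands[i].pattern);
      if (m) { commands[i].run(m); return; }
    }
    console.log("no matching command: " + transcript);
  }

  dispatch("search for pizza near soho");  // -> searching for pizza near soho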

- Turn-by-turn navigation: Speaks driving instructions, and accepts
spoken commands, e.g. "navigate to <address>", "navigate to <contact
name>", "navigate to <business name>", "reroute", "suspend
navigation".

- Dialog systems, e.g. flight booking, pizza ordering.

- Multimodal interaction: Say "I want to go here", and click on a map.
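
Fusing the two input streams could be done by buffering each event and
combining them once both have arrived. Everything below (the map
element, the goTo call, the absence of any timing window) is assumed:

  var recognition = new SpeechRecognition();
  var map = document.getElementById("map");  // assumed map widget
  var pendingUtterance = null, pendingClick = null;

  function maybeCombine() {
    if (pendingUtterance && pendingClick &&
        /go here/.test(pendingUtterance)) {
      goTo(pendingClick);  // hypothetical navigation call
      pendingUtterance = pendingClick = null;
    }
  }

  recognition.onresult = function (event) {
    pendingUtterance = event.results[0][0].transcript;
    maybeCombine();
  };
  map.onclick = function (event) {
    pendingClick = { x: event.clientX, y: event.clientY };
    maybeCombine();
  };
  recognition.start();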

- VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest,
and interprets it using JavaScript and DOM.
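
The fetch half of that last case needs nothing speech-specific. A
minimal sketch (the URL is illustrative, and interpret() stands in for
the actual interpreter logic):

  function interpret(vxmlDoc) {
    // Walk the VoiceXML DOM and execute it; the real interpreter
    // logic goes here.
    console.log("root element: " + vxmlDoc.documentElement.nodeName);
  }

  var xhr = new XMLHttpRequest();
  xhr.open("GET", "app.vxml", true);
  // Parse the response as XML even if the server sends a generic
  // MIME type:
  xhr.overrideMimeType("application/xml");
  xhr.onload = function () { interpret(xhr.responseXML); };
  xhr.send();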

--
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902

Received on Friday, 10 September 2010 10:46:18 UTC