- From: Michael Bodell <mbodell@microsoft.com>
- Date: Thu, 23 Sep 2010 04:50:40 +0000
- To: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
- Message-ID: <22CD592CCD76414085591204EB19F4E805E4936D@TK5EX14MBXC201.redmond.corp.microsoft.>
In order to make more structured progress on addressing all the requirements and use cases sent to the list, I've collated them into one comprehensive set in the order they were received. If anyone has use cases or requirements that they didn't send yet, or that they don't see in this list, please send them by Monday the 27th. I've tried to be exhaustive here, to be complete and fair, and have not worried if some of the requirements are similar to one another or if others are exact opposites. I'll work on a more organized representation of both the additional information sent and this list of requirements and use cases next week.

1. Web search by voice: Speak a search query, and get search results. [1]
2. Speech translation: The app works as an interpreter between two users that speak different languages. [1]
3. Speech-enabled webmail client, e.g. for in-car use. Reads out e-mails and listens for commands, e.g. "archive", "star", "reply, ok, let's meet at 2 pm", "forward to bob". [1]
4. Speech shell: Allows multiple commands, most of which take arguments, some of which are free-form. E.g. "call <number>", "call <contact>", "calculate <arithmetic expression>", "search for <query>". [1]
5. Turn-by-turn navigation: Speaks driving instructions, and accepts spoken commands, e.g. "navigate to <address>", "navigate to <contact name>", "navigate to <business name>", "reroute", "suspend navigation". [1]
6. Dialog systems, e.g. flight booking, pizza ordering. [1]
7. Multimodal interaction: Say "I want to go here", and click on a map. [1]
8. VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
9. The HTML+Speech standard must allow specification of the speech resource (e.g. speech recognizer) to be used for processing of the audio collected from the user. [2]
10. The ability to switch between grammar-based recognition and free-form recognition. [3]
11. Ability to specify field relationships. For example, when a country field is selected, the state field's choices change, so the corresponding grammar/choices should also change (see the first sketch after the references). [3]
12. The API must notify the web app when a spoken utterance has been recognized (see the second sketch after the references). [4]
13. The API must notify the web app on speech recognition errors. [4]
14. The API should provide access to a list of speech recognition hypotheses. [4]
15. The API should allow, but not require, specifying a grammar for the speech recognizer to use. [4]
16. The API should allow specifying the natural language in which to perform speech recognition. This will override the language of the web page. [4]
17. For privacy reasons, the API should not allow web apps access to raw audio data but only provide recognition results. [4]
18. For privacy reasons, speech recognition should only be started in response to user action. [4]
19. Web app developers should not have to run their own speech recognition services. [4]
20. Provide temporal structure of synthesized speech, e.g. to highlight the current word in a visual rendition of the speech, to synchronize with other modalities in a multimodal presentation, or to know when to interrupt (see the third sketch after the references). [5]
21. Allow streaming for longer stretches of spoken output. [5]
22. Use full SSML features including gender, language, pronunciations, etc. [5]
23. Web app developers should not be excluded from running their own speech recognition services. [6]
24. End users should not be prevented from creating new grammars or extending existing ones, on both a global and a per-application basis. [6]
25. End-user extensions should be accessible either from the desktop or from the cloud. [6]
26. For reasons of privacy, the user should not be forced to store anything about their speech recognition environment in the cloud. [6]
27. Any public interfaces for creating extensions should be "speakable". [6]
28. TTS in Speech translation: The app works as an interpreter between two users that speak different languages. [7]
29. TTS in Speech-enabled webmail client, e.g. for in-car use. Reads out e-mails and listens for commands, e.g. "archive", "star", "reply, ok, let's meet at 2 pm", "forward to bob". [7]
30. TTS in Turn-by-turn navigation: Speaks driving instructions, and accepts spoken commands, e.g. "navigate to <address>", "navigate to <contact name>", "navigate to <business name>", "reroute", "suspend navigation". [7]
31. TTS in Dialog systems, e.g. flight booking, pizza ordering. [7]
32. TTS in VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
33. A developer creating a (multimodal) interface combining speech input with graphical output needs the ability to provide a consistent user experience, not just for graphical elements but also for voice. [8]
34. Hello world example. [9]
35. Basic VCR-like text reader example. [9]
36. Free-form collector example. [9]
37. Grammar-based collector example. [9]
38. User-selected recognizer. [10]
39. User-controlled speech parameters. [10]
40. Make it easy to integrate input from different modalities. [10]
41. Allow an author to specify an application-specific statistical language model. [10]
42. Make the use of speech optional. [10]
43. Support for completely hands-free operation. [10]
44. Make the standard easy to extend. [10]
45. Selection of the speech engine should be a user setting in the browser, not a Web developer setting. [11]
46. It should be possible to specify a target TTS engine not only via the "URI" attribute, but via a more generic "source" attribute, which can point to a local TTS engine as well. [12]
47. TTS should provide the user, or developer, with finer granularity of control over the text segments being synthesized. [13]
48. Interacting with multiple input elements. [14]
49. Interacting without visible input elements. [14]
50. Re-recognition. [14]
51. Continuous recognition. [14]
52. Voice activity detection. [14]
53. Minimize user-perceived latency. [14]
54. High-quality default, but application-customizable, speech recognition graphical user interface. [14]
55. Rich recognition results allowing analysis and complex expression (i.e., confidence, alternatives, structured output). [14]
56. Ability to specify domain-specific grammars. [14]
57. Web author able to write one speech experience that performs identically across user agents and/or devices. [14]
58. Synthesis that is synchronized with other media (in particular, visual display). [14]
59. Ability to effect barge-in (interrupting synthesis). [14]
60. Ability to mitigate false barge-in scenarios. [14]
61. Playback controls (repeat, skip forward, skip backward, not just by time but by spoken language segments like words, sentences, and paragraphs). [14]
62. A user agent needs to provide clear indication to the user whenever it is using a microphone to listen to the user. [14]
63. Ability of users to explicitly grant permission for the browser, or an application, to listen to them. [14]
64. There needs to be a way to establish a trust relationship between the user and whatever processes their utterances. [14]
65. Any user agent should work with any vendor's speech services, provided the service meets specific open protocol requirements. [14]
66. Grammars, TTS and media composition, and recognition results should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
67. Ability to specify service capabilities and hints. [14]
68. Ability to enable multiple languages/dialects for the same page. [15]
69. It is critical that the markup support specification of a network speech resource to be used for recognition or synthesis. [16]
70. End users need a way to adjust properties such as timeouts. [17]

References:
1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
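To make requirements 10 and 11 concrete, here is a minimal sketch in TypeScript of how an application might swap grammars when a related field changes. Nothing here is specified anywhere; the SpeechInput interface, getSpeechInput, and the grammarSrc property are names invented purely for illustration.

    // Hypothetical API: a speech-enabled input whose active grammar can
    // be replaced at runtime. All names are invented for this sketch.
    interface SpeechInput {
      grammarSrc: string | null; // URI of an SRGS grammar, or null for free-form
      start(): void;
    }

    declare function getSpeechInput(elementId: string): SpeechInput;

    const countryField = document.getElementById("country") as HTMLSelectElement;
    const stateInput = getSpeechInput("state");

    // Requirement 11: when the country changes, the set of valid states
    // changes, so the state field's grammar must change with it.
    countryField.addEventListener("change", () => {
      stateInput.grammarSrc = `/grammars/states-${countryField.value}.grxml`;
    });

    // Requirement 10: dropping the grammar switches to free-form dictation.
    function switchToFreeForm(): void {
      stateInput.grammarSrc = null;
    }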
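A similar sketch for requirements 12-19. Again, every name below is hypothetical; the point is only that recognition starts from a user action (17-18), the page receives result and error events (12-13), results carry an n-best list (14), and grammar and language are optional hints (15-16).

    // Hypothetical recognition API, shaped only by requirements 12-19.
    interface RecognitionHypothesis {
      transcript: string;
      confidence: number; // 0..1
    }

    interface RecognitionResultEvent {
      hypotheses: RecognitionHypothesis[]; // req. 14: n-best list
    }

    interface SpeechRecognizer {
      grammarSrc?: string; // req. 15: optional SRGS grammar
      lang?: string;       // req. 16: overrides the page language
      onresult: (e: RecognitionResultEvent) => void; // req. 12
      onerror: (e: { code: string }) => void;        // req. 13
      start(): void; // req. 17: no raw audio exposed, only results
    }

    declare function createSpeechRecognizer(): SpeechRecognizer;

    const reco = createSpeechRecognizer();
    reco.lang = "fr-FR";
    reco.onresult = (e) => {
      // Use the top hypothesis; alternatives remain available for reranking.
      (document.getElementById("q") as HTMLInputElement).value =
        e.hypotheses[0].transcript;
    };
    reco.onerror = (e) => console.warn("recognition failed:", e.code);

    // req. 18: recognition starts only in response to a user action.
    document.getElementById("mic")!.addEventListener("click", () => reco.start());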
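Finally, a sketch of the temporal structure asked for in requirements 20, 58, and 61: boundary callbacks let the page highlight the word being spoken and synchronize with other media, while segment-based seeking supports the playback controls of requirement 61. The Synthesizer interface, its skip method, and the highlightWordAt helper are all hypothetical, invented for this sketch.

    // Hypothetical synthesis API with timing callbacks (req. 20) and
    // segment-level playback control (req. 61). Invented names throughout.
    interface BoundaryEvent {
      unit: "word" | "sentence" | "paragraph";
      charIndex: number; // offset into the source text
      time: number;      // ms from start of audio
    }

    interface Synthesizer {
      speakSSML(ssml: string): void; // req. 22: full SSML input
      onboundary: (e: BoundaryEvent) => void;
      skip(units: number, unit: "word" | "sentence" | "paragraph"): void;
      cancel(): void; // req. 59: barge-in interrupts synthesis
    }

    declare function createSynthesizer(): Synthesizer;
    // Assumed page helper, not part of the sketched API surface.
    declare function highlightWordAt(charIndex: number): void;

    const tts = createSynthesizer();
    tts.onboundary = (e) => {
      if (e.unit === "word") highlightWordAt(e.charIndex); // req. 20
    };
    tts.speakSSML('<speak xml:lang="en-US">It is 2 pm.</speak>');

    // req. 61: "skip backward ... by spoken language segments".
    function replayLastSentence(): void {
      tts.skip(-1, "sentence");
    }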