- From: T.V Raman <raman@google.com>
- Date: Thu, 23 Sep 2010 08:22:40 -0700
- To: mbodell@microsoft.com
- Cc: public-xg-htmlspeech@w3.org
Good job Michael! Next step -- at this point we've pooled all the requirements of the last 10+ years of the MMIWG, plus a few additional ones to boot from the VBWG. Now, given that those have not been addressed by a single solution in 10+ years, I believe it would be both naive and extremely egotistic of this XG to try to address all of them at one fell swoop -- we'll be here another 10 years -- during which time the requirements will only increase. I urge everyone to take a deep breath, then proceed in small practical steps toward building things that the Web needs today.

Michael Bodell writes:
> In order to make more structured progress on addressing all the requirements and use cases sent to the list, I've collated them into one comprehensive set in the order they were received. If anyone has use cases or requirements that they didn't send yet, or that they don't see in this list, please send them by Monday the 27th. I've tried to be exhaustive here, to be complete and fair, and have not worried at all if some of the requirements are similar to one another or if other requirements are exact opposites. I'll work on a more organized representation of both the additional information sent and this list of requirements and use cases next week.
>
> 1. Web search by voice: Speak a search query, and get search results. [1]
>
> 2. Speech translation: The app works as an interpreter between two users that speak different languages. [1]
>
> 3. Speech-enabled webmail client, e.g. for in-car use. Reads out e-mails and listens for commands, e.g. "archive", "star", "reply, ok, let's meet at 2 pm", "forward to bob". [1]
>
> 4. Speech shell: Allows multiple commands, most of which take arguments, some of which are free-form. E.g. "call <number>", "call <contact>", "calculate <arithmetic expression>", "search for <query>". [1]
>
> 5. Turn-by-turn navigation: Speaks driving instructions, and accepts spoken commands, e.g. "navigate to <address>", "navigate to <contact name>", "navigate to <business name>", "reroute", "suspend navigation". [1]
>
> 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
>
> 7. Multimodal interaction: Say "I want to go here", and click on a map. [1]
>
> 8. VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
>
> 9. The HTML+Speech standard must allow specification of the speech resource (e.g. speech recognizer) to be used for processing of the audio collected from the user. [2]
>
> 10. The ability to switch from grammar-based recognition to free-form recognition. [3]
>
> 11. Ability to specify field relationships. For example, when a country field is selected, the state field selections change, so the corresponding grammar/choices should also change. [3]
>
> 12. The API must notify the web app when a spoken utterance has been recognized. [4]
>
> 13. The API must notify the web app on speech recognition errors. [4]
>
> 14. The API should provide access to a list of speech recognition hypotheses. [4]
>
> 15. The API should allow, but not require, specifying a grammar for the speech recognizer to use. [4]
>
> 16. The API should allow specifying the natural language in which to perform speech recognition. This will override the language of the web page. [4]
>
> 17. For privacy reasons, the API should not allow web apps access to raw audio data but only provide recognition results. [4]
>
> 18. For privacy reasons, speech recognition should only be started in response to user action. [4]
>
> 19. Web app developers should not have to run their own speech recognition services. [4]
>
> 20. Provide temporal structure of synthesized speech, e.g. to highlight the word in a visual rendition of the speech, to synchronize with other modalities in a multimodal presentation, or to know when to interrupt. [5]
>
> 21. Allow streaming for longer stretches of spoken output. [5]
>
> 22. Use full SSML features including gender, language, pronunciations, etc. [5]
>
> 23. Web app developers should not be excluded from running their own speech recognition services. [6]
>
> 24. End users should not be prevented from creating grammars or extending existing ones, on both a global and a per-application basis. [6]
>
> 25. End-user extensions should be accessible either from the desktop or from the cloud. [6]
>
> 26. For reasons of privacy, the user should not be forced to store anything about their speech recognition environment in the cloud. [6]
>
> 27. Any public interfaces for creating extensions should be "speakable". [6]
>
> 28. TTS in speech translation: The app works as an interpreter between two users that speak different languages. [7]
>
> 29. TTS in a speech-enabled webmail client, e.g. for in-car use. Reads out e-mails and listens for commands, e.g. "archive", "star", "reply, ok, let's meet at 2 pm", "forward to bob". [7]
>
> 30. TTS in turn-by-turn navigation: Speaks driving instructions, and accepts spoken commands, e.g. "navigate to <address>", "navigate to <contact name>", "navigate to <business name>", "reroute", "suspend navigation". [7]
>
> 31. TTS in dialog systems, e.g. flight booking, pizza ordering. [7]
>
> 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
>
> 33. A developer creating a (multimodal) interface combining speech input with graphical output needs the ability to provide a consistent user experience not just for graphical elements but also for voice. [8]
>
> 34. Hello-world example. [9]
>
> 35. Basic VCR-like text reader example. [9]
>
> 36. Free-form collector example. [9]
>
> 37. Grammar-based collector example. [9]
>
> 38. User-selected recognizer. [10]
>
> 39. User-controlled speech parameters. [10]
>
> 40. Make it easy to integrate input from different modalities. [10]
>
> 41. Allow an author to specify an application-specific statistical language model. [10]
>
> 42. Make the use of speech optional. [10]
>
> 43. Support for completely hands-free operation. [10]
>
> 44. Make the standard easy to extend. [10]
>
> 45. Selection of the speech engine should be a user setting in the browser, not a Web developer setting. [11]
>
> 46. It should be possible to specify a target TTS engine not only via the "URI" attribute, but via a more generic "source" attribute, which can point to a local TTS engine as well. [12]
>
> 47. TTS should provide the user, or developer, with finer granularity of control over the text segments being synthesized. [13]
>
> 48. Interacting with multiple input elements. [14]
>
> 49. Interacting without visible input elements. [14]
>
> 50. Re-recognition. [14]
>
> 51. Continuous recognition. [14]
>
> 52. Voice activity detection. [14]
>
> 53. Minimize user-perceived latency. [14]
>
> 54. High-quality default, but application-customizable, speech recognition graphical user interface. [14]
>
> 55. Rich recognition results allowing analysis and complex expression (i.e., confidence, alternatives, structured output). [14]
>
> 56. Ability to specify domain-specific grammars. [14]
>
> 57. Web author able to write one speech experience that performs identically across user agents and/or devices. [14]
>
> 58. Synthesis that is synchronized with other media (particularly visual display). [14]
>
> 59. Ability to effect barge-in (interrupt synthesis). [14]
>
> 60. Ability to mitigate false barge-in scenarios. [14]
>
> 61. Playback controls (repeat, skip forward, skip backwards, not just by time but by spoken-language segments like words, sentences, and paragraphs). [14]
>
> 62. A user agent needs to provide clear indication to the user whenever it is using a microphone to listen to the user. [14]
>
> 63. Ability of users to explicitly grant permission for the browser, or an application, to listen to them. [14]
>
> 64. There needs to be a way to have a trust relationship between the user and whatever processes their utterance. [14]
>
> 65. Any user agent should work with any vendor's speech services, provided it meets specific open protocol requirements. [14]
>
> 66. Grammars, TTS and media composition, and recognition results should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
>
> 67. Ability to specify service capabilities and hints. [14]
>
> 68. Ability to enable multiple languages/dialects on the same page. [15]
>
> 69. It is critical that the markup support specification of a network speech resource to be used for recognition or synthesis. [16]
>
> 70. End users need a way to adjust properties such as timeouts. [17]
>
> References:
>
> 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
>
> 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
>
> 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
>
> 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
>
> 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
>
> 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
>
> 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
>
> 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
>
> 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
>
> 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
>
> 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
>
> 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
>
> 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
>
> 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
>
> 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
>
> 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
>
> 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html

--
Best Regards,
--raman

Title:  Research Scientist
Email:  raman@google.com
WWW:    http://emacspeak.sf.net/raman/
Google: tv+raman
GTalk:  raman@google.com
PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
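[Editorial sketch] The API-surface requirements quoted above (items 12-18: result and error notification, an n-best hypothesis list, an optional grammar, a language override, no raw-audio access, and user-initiated capture) suggest an event-driven shape. The following is a minimal illustration only -- every name in it (`SpeechInput`, `onresult`, `onerror`, `grammarURI`) is an assumption of this sketch, not anything the XG has proposed or agreed on:

```javascript
// Hypothetical event-driven recognition interface; all names here are
// illustrative assumptions, not part of any standard.
class SpeechInput {
  constructor({ grammarURI = null, lang = "en-US" } = {}) {
    this.grammarURI = grammarURI; // optional SRGS grammar (req. 15)
    this.lang = lang;             // overrides the page language (req. 16)
    this.onresult = null;         // fired when an utterance is recognized (req. 12)
    this.onerror = null;          // fired on recognition errors (req. 13)
  }

  // A real user agent would begin capture only on user action (req. 18)
  // and would hand the page recognition results, never raw audio
  // (req. 17). Here the outcome is injected so the event flow can be
  // exercised without a microphone.
  start(outcome) {
    if (outcome.error) {
      if (this.onerror) this.onerror({ error: outcome.error });
    } else if (this.onresult) {
      // n-best hypothesis list, most confident first (req. 14)
      this.onresult({ hypotheses: outcome.hypotheses });
    }
  }
}

// Usage: wire up a handler, then simulate one recognition result.
const input = new SpeechInput({ lang: "fr-FR" });
let best = null;
input.onresult = (e) => { best = e.hypotheses[0].transcript; };
input.start({ hypotheses: [{ transcript: "bonjour", confidence: 0.92 }] });
console.log(best); // "bonjour"
```

The point of the event-handler shape is that the page never touches the audio stream itself: the user agent (or a network speech service it delegates to) owns capture and decoding, and the page only ever sees structured results or errors.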
Received on Thursday, 23 September 2010 15:23:14 UTC