Re: Collection of requirements and use cases from T.V Raman on 2010-09-23 (public-xg-htmlspeech@w3.org from September 2010)

From: T.V Raman <raman@google.com>
Date: Thu, 23 Sep 2010 09:43:11 -0700
To: raj@openstream.com
Cc: raman@google.com, mbodell@microsoft.com, public-xg-htmlspeech@w3.org
Message-ID: <19611.33695.498064.179763@retriever.mtv.corp.google.com>
Raj, I believe that the MMIWG and VBWG as fully-chartered WGs
should continue developing the long-term work. If the XG is going
to do what those two WGs are doing, then we might as well shut
those down (not necessarily advocating that on this thread).

Raj (Openstream) writes:
 > Great job, Michael. And excellent commentary by TVRaman, as usual.
 > 
 > During the deep-breath exercise that TVR suggested, I would also
 > add, at the risk of sounding trite, that the "simple & practical"
 > alternatives did not work either over the last 10 years, resulting
 > in where we are today, and I am afraid any "quick & practical"
 > approach will result in greater fragmentation of development, which
 > I am sure we are all trying to avoid here.
 > 
 > Perhaps it would be easier to highlight the requirements and link
 > the illustrative example use cases, to make for easy reading of the
 > set/union of must-haves for the feature set, as TVR suggests, for a
 > realistic initial set of XG deliverables.
 > 
 > 
 > Regards,
 > Raj
 >   
 > 
 > On Thu, 23 Sep 2010 08:22:40 -0700
 >   raman@google.com (T.V Raman) wrote:
 > > 
 > > Good job Michael!
 > > 
 > > Next step -- at this point we've pooled all the requirements of
 > > the last 10+ years of the MMIWG, plus a few additional ones to
 > > boot from the VBWG.
 > > 
 > > Now, given that those have not been addressed by a single
 > > solution in 10+ years, I believe it would be both naive and
 > > extremely egotistic of this XG to try to address all of them in
 > > one fell swoop -- we'll be here another 10 years -- during which
 > > time the requirements will only increase.
 > > 
 > > I urge everyone to take a deep breath, then proceed in small
 > > practical steps toward building things that the Web needs today.
 > > 
 > > Michael Bodell writes:
 > > > In order to make more structured progress on addressing all the
 > > > requirements and use cases sent to the list I’ve collated them into
 > > > one comprehensive set in the order they were received.  If anyone has
 > > > use cases or requirements that they didn’t send yet or that they
 > > > don’t see in this list please send them by Monday the 27th.  I've
 > > > tried to be exhaustive here to be complete and fair and am not
 > > > worried at all if some of the requirements are similar to one another
 > > > and if other requirements are exact opposites.  I’ll work on a more
 > > > organized representation of both additional information sent and this
 > > > list of requirements and use cases next week.
 > > > 
 > > > 1.  Web search by voice: Speak a search query, and get search
 > > > results. [1]
 > > > 
 > > > 2.  Speech translation: The app works as an interpreter between two
 > > > users that speak different languages. [1]
 > > > 
 > > > 3.  Speech-enabled webmail client, e.g. for in-car use. Reads out
 > > > e-mails and listens for commands, e.g. "archive", "star", "reply,
 > > > ok, let's meet at 2 pm", "forward to bob". [1]
 > > > 
 > > > 4.  Speech shell: Allows multiple commands, most of which take
 > > > arguments, some of which are free-form. E.g. "call <number>",
 > > > "call <contact>", "calculate <arithmetic expression>",
 > > > "search for <query>". [1]
 > > > 
 > > > 5.  Turn-by-turn navigation: Speaks driving instructions, and
 > > > accepts spoken commands, e.g. "navigate to <address>",
 > > > "navigate to <contact name>", "navigate to <business name>",
 > > > "reroute", "suspend navigation". [1]
 > > > 
 > > > 6.  Dialog systems, e.g. flight booking, pizza ordering. [1]
 > > > 
 > > > 7.  Multimodal interaction: Say "I want to go here", and click on a
 > > > map. [1]
 > > > 
 > > > 8.  VoiceXML interpreter: Fetches a VoiceXML app using
 > > > XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
 > > > 
 > > > 9.  The HTML+Speech standard must allow specification of the speech
 > > > resource (e.g. speech recognizer) to be used for processing of the
 > > > audio collected from the user. [2]
 > > > 
 > > > 10.  The ability to switch between grammar-based recognition and
 > > > free-form recognition. [3]
 > > > 
 > > > 11.  Ability to specify the field relationships. For example, when a
 > > > country field is selected, the state field selections change, so
 > > > corresponding grammar/choices should also be changed. [3]
 > > > 
 > > > 12.  The API must notify the web app when a spoken utterance has
 > > > been recognized. [4]
 > > > 
 > > > 13.  The API must notify the web app on speech recognition errors.
 > > > [4]
 > > > 
 > > > 14.  The API should provide access to a list of speech recognition
 > > > hypotheses. [4]
 > > > 
 > > > 15.  The API should allow, but not require, specifying a grammar for
 > > > the speech recognizer to use. [4]
 > > > 
 > > > 16.  The API should allow specifying the natural language in which
 > > > to perform speech recognition. This will override the language of
 > > > the web page. [4]
 > > > 
 > > > 17.  For privacy reasons, the API should not allow web apps access
 > > > to raw audio data but only provide recognition results. [4]
 > > > 
 > > > 18.  For privacy reasons, speech recognition should only be started
 > > > in response to user action. [4]
 > > > 
 > > > 19.  Web app developers should not have to run their own speech
 > > > recognition services. [4]
 > > > 
 > > > 20.  Provide temporal structure of synthesized speech.  E.g., to
 > > > highlight the word in a visual rendition of the speech, to
 > > > synchronize with other modalities in a multimodal presentation, to
 > > > know when to interrupt. [5]
 > > > 
 > > > 21.  Allow streaming for longer stretches of spoken output. [5]
 > > > 
 > > > 22.  Use full SSML features including gender, language,
 > > > pronunciations, etc. [5]
 > > > 
 > > > 23.  Web app developers should not be excluded from running their
 > > > own speech recognition services. [6]
 > > > 
 > > > 24.  End users should not be prevented from creating or extending
 > > > existing grammars on both a global and per application basis. [6]
 > > > 
 > > > 25.  End-user extensions should be accessible either from the
 > > > desktop or from the cloud. [6]
 > > > 
 > > > 26.  For reasons of privacy, the user should not be forced to store
 > > > anything about their speech recognition environment on the cloud.
 > > > [6]
 > > > 
 > > > 27.  Any public interfaces for creating extensions should be
 > > > "speakable". [6]
 > > > 
 > > > 28.  TTS in Speech translation: The app works as an interpreter
 > > > between two users that speak different languages. [7]
 > > > 
 > > > 29.  TTS in Speech-enabled webmail client, e.g. for in-car use.
 > > > Reads out e-mails and listens for commands, e.g. "archive", "star",
 > > > "reply, ok, let's meet at 2 pm", "forward to bob". [7]
 > > > 
 > > > 30.  TTS in Turn-by-turn navigation: Speaks driving instructions,
 > > > and accepts spoken commands, e.g. "navigate to <address>",
 > > > "navigate to <contact name>", "navigate to <business name>",
 > > > "reroute", "suspend navigation". [7]
 > > > 
 > > > 31.  TTS in Dialog systems, e.g. flight booking, pizza ordering. [7]
 > > > 
 > > > 32.  TTS in VoiceXML interpreter: Fetches a VoiceXML app using
 > > > XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
 > > > 
 > > > 33.  A developer creating a (multimodal) interface combining speech
 > > > input with graphical output needs to have the ability to provide a
 > > > consistent user experience not just for graphical elements but also
 > > > for voice. [8]
 > > > 
 > > > 34.  Hello world example. [9]
 > > > 
 > > > 35.  Basic VCR-like text reader example. [9]
 > > > 
 > > > 36.  Free-form collector example. [9]
 > > > 
 > > > 37.  Grammar-based collector example. [9]
 > > > 
 > > > 38.  User-selected recognizer. [10]
 > > > 
 > > > 39.  User-controlled speech parameters. [10]
 > > > 
 > > > 40.  Make it easy to integrate input from different modalities. [10]
 > > > 
 > > > 41.  Allow an author to specify an application-specific statistical
 > > > language model. [10]
 > > > 
 > > > 42.  Make the use of speech optional. [10]
 > > > 
 > > > 43.  Support for completely hands-free operation. [10]
 > > > 
 > > > 44.  Make the standard easy to extend. [10]
 > > > 
 > > > 45.  Selection of the speech engine should be a user-setting in the
 > > > browser, not a Web developer setting. [11]
 > > > 
 > > > 46.  It should be possible to specify a target TTS engine not only
 > > > via the "URI" attribute, but via a more generic "source" attribute,
 > > > which can point to a local TTS engine as well. [12]
 > > > 
 > > > 47.  TTS should provide the user, or developer, with finer
 > > > granularity in control over the text segments being synthesized.
 > > > [13]
 > > > 
 > > > 48.  Interacting with multiple input elements. [14]
 > > > 
 > > > 49.  Interacting without visible input elements. [14]
 > > > 
 > > > 50.  Re-recognition. [14]
 > > > 
 > > > 51.  Continuous recognition. [14]
 > > > 
 > > > 52.  Voice activity detection. [14]
 > > > 
 > > > 53.  Minimize user perceived latency. [14]
 > > > 
 > > > 54.  High quality default, but application customizable, speech
 > > > recognition graphical user interface. [14]
 > > > 
 > > > 55.  Rich recognition results allowing analysis and complex
 > > > expression (i.e., confidence, alternatives, structured output). [14]
 > > > 
 > > > 56.  Ability to specify domain specific grammars. [14]
 > > > 
 > > > 57.  Web author able to write one speech experience that performs
 > > > identically across user agents and/or devices. [14]
 > > > 
 > > > 58.  Synthesis that is synchronized with other media (particularly
 > > > visual display). [14]
 > > > 
 > > > 59.  Ability to effect barge-in (interrupt synthesis). [14]
 > > > 
 > > > 60.  Ability to mitigate false-barge-in scenarios. [14]
 > > > 
 > > > 61.  Playback controls (repeat, skip forward, skip backwards, not
 > > > just by time but by spoken language segments like words, sentences,
 > > > and paragraphs). [14]
 > > > 
 > > > 62.  A user agent needs to provide clear indication to the user
 > > > whenever it is using a microphone to listen to the user. [14]
 > > > 
 > > > 63.  Ability of users to explicitly grant permission for the
 > > > browser, or an application, to listen to them. [14]
 > > > 
 > > > 64.  Needs to be a way to have a trust relationship between the user
 > > > and whatever processes their utterance. [14]
 > > > 
 > > > 65.  Any user agent should work with any vendor's speech services,
 > > > provided it meets specific open protocol requirements. [14]
 > > > 
 > > > 66.  Grammars, TTS and media composition, and recognition results
 > > > should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
 > > > 
 > > > 67.  Ability to specify service capabilities and hints. [14]
 > > > 
 > > > 68.  Ability to enable multiple languages/dialects for the same
 > > > page. [15]
 > > > 
 > > > 69.  It is critical that the markup support specification of a
 > > > network speech resource to be used for recognition or synthesis.
 > > > [16]
 > > > 
 > > > 70.  End users need a way to adjust properties such as timeouts.
 > > > [17]
 > > > 
 > > > References:
 > > > 
 > > > 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
 > > > [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in
 > > > http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
 > > > 
 > > > 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
 > > > 
 > > > 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
 > > > 
 > > > 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
 > > > 
 > > > 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
 > > > 
 > > > 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
 > > > 
 > > > 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
 > > > [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
 > > > 
 > > > 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
 > > > 
 > > > 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
 > > > 
 > > > 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
 > > > 
 > > > 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
 > > > 
 > > > 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
 > > > 
 > > > 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
 > > > 
 > > > 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
 > > > 
 > > > 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
 > > > 
 > > > 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
 > > > 
 > > > 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
 > > > 
 > > 
 > > -- 
 > > Best Regards,
 > > --raman
 > > 
 > > Title:  Research Scientist                              
 > > Email:  raman@google.com                                
 > > WWW:    http://emacspeak.sf.net/raman/                  
 > > Google: tv+raman                                        
 > > GTalk:  raman@google.com                                
 > > PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc    
 > > 
 > 

-- 
Best Regards,
--raman

Title:  Research Scientist                              
Email:  raman@google.com                                
WWW:    http://emacspeak.sf.net/raman/                  
Google: tv+raman                                        
GTalk:  raman@google.com                                
PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
Received on Thursday, 23 September 2010 16:43:46 UTC