- From: Raj (Openstream) <raj@openstream.com>
- Date: Thu, 23 Sep 2010 11:51:09 -0400
- To: raman@google.com (T.V Raman), mbodell@microsoft.com
- Cc: public-xg-htmlspeech@w3.org
Great job Michael, and excellent commentary by T.V. Raman as usual. During the deep-breath exercise that TVR suggested, I would add, at the risk of sounding trite, that the "simple & practical" alternatives did not work either over the last 10 years, which is how we arrived where we are today. I am afraid that any "quick & practical" approach will result in greater fragmentation of development, which I am sure we are all trying to avoid here. Perhaps it would be easier to highlight each requirement and link the illustrative example use cases to it, so that the set/union of must-haves for the feature set makes for easy reading, as TVR suggests, toward a realistic initial set of XG deliverables.

Regards,
Raj

On Thu, 23 Sep 2010 08:22:40 -0700 raman@google.com (T.V Raman) wrote:

> Good job Michael!
>
> Next step -- at this point we've pooled all the requirements of the
> last 10+ years of the MMIWG, plus a few additional ones to boot from
> the VBWG.
>
> Now, given that those have not been addressed by a single solution in
> 10+ years, I believe it would be both naive and extremely egotistic
> of this XG to try to address all of them at one fell swoop -- we'll
> be here another 10 years, during which time the requirements will
> only increase.
>
> I urge everyone to take a deep breath, then proceed in small,
> practical steps toward building things that the Web needs today.
>
> Michael Bodell writes:
> > In order to make more structured progress on addressing all the
> > requirements and use cases sent to the list, I've collated them
> > into one comprehensive set in the order they were received. If
> > anyone has use cases or requirements that they haven't sent yet,
> > or that they don't see in this list, please send them by Monday
> > the 27th. I've tried to be exhaustive here, to be complete and
> > fair, and have not worried at all that some of the requirements
> > are similar to one another or that others are exact opposites.
> > I'll work on a more organized representation of both the
> > additional information sent and this list of requirements and use
> > cases next week.
> >
> > 1. Web search by voice: Speak a search query, and get search
> > results. [1]
> >
> > 2. Speech translation: The app works as an interpreter between
> > two users who speak different languages. [1]
> >
> > 3. Speech-enabled webmail client, e.g. for in-car use. Reads out
> > e-mails and listens for commands, e.g. "archive", "star", "reply,
> > ok, let's meet at 2 pm", "forward to bob". [1]
> >
> > 4. Speech shell: Allows multiple commands, most of which take
> > arguments, some of which are free-form. E.g. "call <number>",
> > "call <contact>", "calculate <arithmetic expression>", "search
> > for <query>". [1]
> >
> > 5. Turn-by-turn navigation: Speaks driving instructions, and
> > accepts spoken commands, e.g. "navigate to <address>", "navigate
> > to <contact name>", "navigate to <business name>", "reroute",
> > "suspend navigation". [1]
> >
> > 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
> >
> > 7. Multimodal interaction: Say "I want to go here", and click on
> > a map. [1]
> >
> > 8. VoiceXML interpreter: Fetches a VoiceXML app using
> > XMLHttpRequest, and interprets it using JavaScript and the
> > DOM. [1]
> >
> > 9. The HTML+Speech standard must allow specification of the
> > speech resource (e.g. speech recognizer) to be used for
> > processing the audio collected from the user. [2]
> >
> > 10. The ability to switch between grammar-based recognition and
> > free-form recognition. [3]
> >
> > 11. Ability to specify field relationships. For example, when a
> > country field is selected, the state field selections change, so
> > the corresponding grammar/choices should also change. [3]
> >
> > 12. The API must notify the web app when a spoken utterance has
> > been recognized. [4]
> >
> > 13. The API must notify the web app of speech recognition
> > errors. [4]
> >
> > 14. The API should provide access to a list of speech
> > recognition hypotheses. [4]
> >
> > 15. The API should allow, but not require, specifying a grammar
> > for the speech recognizer to use. [4]
> >
> > 16. The API should allow specifying the natural language in
> > which to perform speech recognition. This will override the
> > language of the web page. [4]
> >
> > 17. For privacy reasons, the API should not allow web apps
> > access to raw audio data, but only provide recognition
> > results. [4]
> >
> > 18. For privacy reasons, speech recognition should only be
> > started in response to a user action. [4]
> >
> > 19. Web app developers should not have to run their own speech
> > recognition services. [4]
> >
> > 20. Provide the temporal structure of synthesized speech, e.g.
> > to highlight the current word in a visual rendition of the
> > speech, to synchronize with other modalities in a multimodal
> > presentation, or to know when to interrupt. [5]
> >
> > 21. Allow streaming for longer stretches of spoken output. [5]
> >
> > 22. Use full SSML features, including gender, language,
> > pronunciations, etc. [5]
> >
> > 23. Web app developers should not be excluded from running their
> > own speech recognition services. [6]
> >
> > 24. End users should not be prevented from creating or extending
> > existing grammars, on both a global and a per-application
> > basis. [6]
> >
> > 25. End-user extensions should be accessible either from the
> > desktop or from the cloud. [6]
> >
> > 26. For reasons of privacy, the user should not be forced to
> > store anything about their speech recognition environment in the
> > cloud. [6]
> >
> > 27. Any public interfaces for creating extensions should be
> > "speakable". [6]
> >
> > 28. TTS in speech translation: The app works as an interpreter
> > between two users who speak different languages. [7]
> >
> > 29. TTS in a speech-enabled webmail client, e.g. for in-car use.
> > Reads out e-mails and listens for commands, e.g. "archive",
> > "star", "reply, ok, let's meet at 2 pm", "forward to bob". [7]
> >
> > 30. TTS in turn-by-turn navigation: Speaks driving instructions,
> > and accepts spoken commands, e.g. "navigate to <address>",
> > "navigate to <contact name>", "navigate to <business name>",
> > "reroute", "suspend navigation". [7]
> >
> > 31. TTS in dialog systems, e.g. flight booking, pizza
> > ordering. [7]
> >
> > 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using
> > XMLHttpRequest, and interprets it using JavaScript and the
> > DOM. [7]
> >
> > 33. A developer creating a (multimodal) interface combining
> > speech input with graphical output needs the ability to provide
> > a consistent user experience not just for graphical elements but
> > also for voice. [8]
> >
> > 34. Hello world example. [9]
> >
> > 35. Basic VCR-like text reader example. [9]
> >
> > 36. Free-form collector example. [9]
> >
> > 37. Grammar-based collector example. [9]
> >
> > 38. User-selected recognizer. [10]
> >
> > 39. User-controlled speech parameters. [10]
> >
> > 40. Make it easy to integrate input from different
> > modalities. [10]
> >
> > 41. Allow an author to specify an application-specific
> > statistical language model. [10]
> >
> > 42. Make the use of speech optional. [10]
> >
> > 43. Support for completely hands-free operation. [10]
> >
> > 44. Make the standard easy to extend. [10]
> >
> > 45. Selection of the speech engine should be a user setting in
> > the browser, not a web developer setting. [11]
> >
> > 46. It should be possible to specify a target TTS engine not
> > only via the "URI" attribute, but also via a more generic
> > "source" attribute, which can point to a local TTS engine as
> > well. [12]
> >
> > 47. TTS should provide the user, or developer, with
> > finer-grained control over the text segments being
> > synthesized. [13]
> >
> > 48. Interacting with multiple input elements. [14]
> >
> > 49. Interacting without visible input elements. [14]
> >
> > 50. Re-recognition. [14]
> >
> > 51. Continuous recognition. [14]
> >
> > 52. Voice activity detection. [14]
> >
> > 53. Minimize user-perceived latency. [14]
> >
> > 54. A high-quality default, but application-customizable, speech
> > recognition graphical user interface. [14]
> >
> > 55. Rich recognition results allowing analysis and complex
> > expression (i.e., confidence, alternatives, structured
> > output). [14]
> >
> > 56. Ability to specify domain-specific grammars. [14]
> >
> > 57. A web author should be able to write one speech experience
> > that performs identically across user agents and/or
> > devices. [14]
> >
> > 58. Synthesis that is synchronized with other media (in
> > particular, visual display). [14]
> >
> > 59. Ability to effect barge-in (interrupt synthesis). [14]
> >
> > 60. Ability to mitigate false barge-in scenarios. [14]
> >
> > 61. Playback controls (repeat, skip forward, skip backward, not
> > just by time but by spoken-language segments such as words,
> > sentences, and paragraphs). [14]
> >
> > 62. A user agent needs to provide a clear indication to the user
> > whenever it is using a microphone to listen to the user. [14]
> >
> > 63. Ability for users to explicitly grant permission for the
> > browser, or an application, to listen to them. [14]
> >
> > 64. There needs to be a way to have a trust relationship between
> > the user and whatever processes their utterance. [14]
> >
> > 65. Any user agent should work with any vendor's speech
> > services, provided it meets specific open-protocol
> > requirements. [14]
> >
> > 66. Grammars, TTS and media composition, and recognition results
> > should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
> >
> > 67. Ability to specify service capabilities and hints. [14]
> >
> > 68. Ability to enable multiple languages/dialects on the same
> > page. [15]
> >
> > 69. It is critical that the markup support specification of a
> > network speech resource to be used for recognition or
> > synthesis. [16]
> >
> > 70. End users need a way to adjust properties such as
> > timeouts. [17]
> >
> > References:
> >
> > 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
> >     [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in
> >     http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
> > 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
> > 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
> > 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
> > 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
> > 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
> > 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
> >     [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
> > 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
> > 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
> > 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
> > 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
> > 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
> > 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
> > 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
> > 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
> > 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
> > 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
>
> --
> Best Regards,
> --raman
>
> Title: Research Scientist
> Email: raman@google.com
> WWW: http://emacspeak.sf.net/raman/
> Google: tv+raman
> GTalk: raman@google.com
> PGP: http://emacspeak.sf.net/raman/raman-almaden.asc
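[Editor's note: requirements 12-14 above describe an event-style recognition API in which the page is notified when an utterance is recognized, is told about errors, and can inspect an n-best hypothesis list. A minimal sketch of that notification pattern follows; every name in it (`SpeechRecognizer`, `onresult`, `onerror`, the result shape) is purely illustrative, not any API the group has agreed on.]

```javascript
// Illustrative sketch of the notification pattern in requirements
// 12-14. All identifiers are hypothetical.

class SpeechRecognizer {
  constructor() {
    this.onresult = null; // called with { hypotheses: [...] } (req. 12, 14)
    this.onerror = null;  // called with { message: ... } (req. 13)
  }

  // A real implementation would capture audio (only after a user
  // action, per req. 18) and stream it to a recognition service.
  // Here we simulate a completed recognition for illustration; note
  // that only results, never raw audio, reach the page (req. 17).
  _deliverResult(hypotheses) {
    if (this.onresult) this.onresult({ hypotheses });
  }

  _deliverError(message) {
    if (this.onerror) this.onerror({ message });
  }
}

// Usage: the page reacts to recognition events instead of polling.
const rec = new SpeechRecognizer();
rec.onresult = (e) => {
  // Hypotheses are ordered best-first (req. 14).
  console.log('Best guess:', e.hypotheses[0].transcript);
};
rec.onerror = (e) => console.log('Recognition failed:', e.message);

rec._deliverResult([
  { transcript: 'navigate to main street', confidence: 0.92 },
  { transcript: 'navigate to maple street', confidence: 0.71 },
]);
```

The point of the sketch is only that the notification flow is asynchronous and callback-driven, which is what lets a page satisfy 12 and 13 without blocking while audio is processed.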
Received on Thursday, 23 September 2010 16:42:20 UTC