- From: Raj (Openstream) <raj@openstream.com>
- Date: Thu, 23 Sep 2010 11:51:09 -0400
- To: raman@google.com (T.V Raman), mbodell@microsoft.com
- Cc: public-xg-htmlspeech@w3.org
Great job Michael, and excellent commentary by T.V. Raman as usual. During the deep-breath exercise that TVR suggested, I would add, at the risk of sounding trite, that the "simple & practical" alternatives did not work either over the last 10 years, which is how we arrived where we are today. I am afraid that any "quick & practical" approach will result in greater fragmentation of development, which I am sure we are all trying to avoid here. Perhaps it would be easier to highlight each requirement and link the illustrative example use cases to it, so that the set/union of must-haves for the feature set makes for easy reading, as TVR suggests, toward a realistic initial set of XG deliverables.

Regards,
Raj

On Thu, 23 Sep 2010 08:22:40 -0700 raman@google.com (T.V Raman) wrote:

> Good job Michael!
>
> Next step -- at this point we've pooled all the requirements of the
> last 10+ years of the MMIWG, plus a few additional ones to boot from
> the VBWG.
>
> Now, given that those have not been addressed by a single solution in
> 10+ years, I believe it would be both naive and extremely egotistic
> of this XG to try to address all of them at one fell swoop -- we'll
> be here another 10 years, during which time the requirements will
> only increase.
>
> I urge everyone to take a deep breath, then proceed in small,
> practical steps toward building things that the Web needs today.
>
> Michael Bodell writes:
> > In order to make more structured progress on addressing all the
> > requirements and use cases sent to the list, I've collated them
> > into one comprehensive set in the order they were received. If
> > anyone has use cases or requirements that they haven't sent yet,
> > or that they don't see in this list, please send them by Monday
> > the 27th. I've tried to be exhaustive here, to be complete and
> > fair, and have not worried at all that some of the requirements
> > are similar to one another or that others are exact opposites.
> > I'll work on a more organized representation of both the
> > additional information sent and this list of requirements and use
> > cases next week.
> >
> > 1. Web search by voice: Speak a search query, and get search
> > results. [1]
> >
> > 2. Speech translation: The app works as an interpreter between
> > two users who speak different languages. [1]
> >
> > 3. Speech-enabled webmail client, e.g. for in-car use. Reads out
> > e-mails and listens for commands, e.g. "archive", "star", "reply,
> > ok, let's meet at 2 pm", "forward to bob". [1]
> >
> > 4. Speech shell: Allows multiple commands, most of which take
> > arguments, some of which are free-form. E.g. "call <number>",
> > "call <contact>", "calculate <arithmetic expression>", "search
> > for <query>". [1]
> >
> > 5. Turn-by-turn navigation: Speaks driving instructions, and
> > accepts spoken commands, e.g. "navigate to <address>", "navigate
> > to <contact name>", "navigate to <business name>", "reroute",
> > "suspend navigation". [1]
> >
> > 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
> >
> > 7. Multimodal interaction: Say "I want to go here", and click on
> > a map. [1]
> >
> > 8. VoiceXML interpreter: Fetches a VoiceXML app using
> > XMLHttpRequest, and interprets it using JavaScript and the
> > DOM. [1]
> >
> > 9. The HTML+Speech standard must allow specification of the
> > speech resource (e.g. speech recognizer) to be used for
> > processing the audio collected from the user. [2]
> >
> > 10. The ability to switch between grammar-based recognition and
> > free-form recognition. [3]
> >
> > 11. Ability to specify field relationships. For example, when a
> > country field is selected, the state field selections change, so
> > the corresponding grammar/choices should also change. [3]
> >
> > 12. The API must notify the web app when a spoken utterance has
> > been recognized. [4]
> >
> > 13. The API must notify the web app of speech recognition
> > errors. [4]
> >
> > 14. The API should provide access to a list of speech
> > recognition hypotheses. [4]
> >
> > 15. The API should allow, but not require, specifying a grammar
> > for the speech recognizer to use. [4]
> >
> > 16. The API should allow specifying the natural language in
> > which to perform speech recognition. This will override the
> > language of the web page. [4]
> >
> > 17. For privacy reasons, the API should not allow web apps
> > access to raw audio data, but only provide recognition
> > results. [4]
> >
> > 18. For privacy reasons, speech recognition should only be
> > started in response to a user action. [4]
> >
> > 19. Web app developers should not have to run their own speech
> > recognition services. [4]
> >
> > 20. Provide the temporal structure of synthesized speech, e.g.
> > to highlight the current word in a visual rendition of the
> > speech, to synchronize with other modalities in a multimodal
> > presentation, or to know when to interrupt. [5]
> >
> > 21. Allow streaming for longer stretches of spoken output. [5]
> >
> > 22. Use full SSML features, including gender, language,
> > pronunciations, etc. [5]
> >
> > 23. Web app developers should not be excluded from running their
> > own speech recognition services. [6]
> >
> > 24. End users should not be prevented from creating or extending
> > existing grammars, on both a global and a per-application
> > basis. [6]
> >
> > 25. End-user extensions should be accessible either from the
> > desktop or from the cloud. [6]
> >
> > 26. For reasons of privacy, the user should not be forced to
> > store anything about their speech recognition environment in the
> > cloud. [6]
> >
> > 27. Any public interfaces for creating extensions should be
> > "speakable". [6]
> >
> > 28. TTS in speech translation: The app works as an interpreter
> > between two users who speak different languages. [7]
> >
> > 29. TTS in a speech-enabled webmail client, e.g. for in-car use.
> > Reads out e-mails and listens for commands, e.g. "archive",
> > "star", "reply, ok, let's meet at 2 pm", "forward to bob". [7]
> >
> > 30. TTS in turn-by-turn navigation: Speaks driving instructions,
> > and accepts spoken commands, e.g. "navigate to <address>",
> > "navigate to <contact name>", "navigate to <business name>",
> > "reroute", "suspend navigation". [7]
> >
> > 31. TTS in dialog systems, e.g. flight booking, pizza
> > ordering. [7]
> >
> > 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using
> > XMLHttpRequest, and interprets it using JavaScript and the
> > DOM. [7]
> >
> > 33. A developer creating a (multimodal) interface combining
> > speech input with graphical output needs the ability to provide
> > a consistent user experience not just for graphical elements but
> > also for voice. [8]
> >
> > 34. Hello world example. [9]
> >
> > 35. Basic VCR-like text reader example. [9]
> >
> > 36. Free-form collector example. [9]
> >
> > 37. Grammar-based collector example. [9]
> >
> > 38. User-selected recognizer. [10]
> >
> > 39. User-controlled speech parameters. [10]
> >
> > 40. Make it easy to integrate input from different
> > modalities. [10]
> >
> > 41. Allow an author to specify an application-specific
> > statistical language model. [10]
> >
> > 42. Make the use of speech optional. [10]
> >
> > 43. Support for completely hands-free operation. [10]
> >
> > 44. Make the standard easy to extend. [10]
> >
> > 45. Selection of the speech engine should be a user setting in
> > the browser, not a web developer setting. [11]
> >
> > 46. It should be possible to specify a target TTS engine not
> > only via the "URI" attribute, but also via a more generic
> > "source" attribute, which can point to a local TTS engine as
> > well. [12]
> >
> > 47. TTS should provide the user, or developer, with
> > finer-grained control over the text segments being
> > synthesized. [13]
> >
> > 48. Interacting with multiple input elements. [14]
> >
> > 49. Interacting without visible input elements. [14]
> >
> > 50. Re-recognition. [14]
> >
> > 51. Continuous recognition. [14]
> >
> > 52. Voice activity detection. [14]
> >
> > 53. Minimize user-perceived latency. [14]
> >
> > 54. A high-quality default, but application-customizable, speech
> > recognition graphical user interface. [14]
> >
> > 55. Rich recognition results allowing analysis and complex
> > expression (i.e., confidence, alternatives, structured
> > output). [14]
> >
> > 56. Ability to specify domain-specific grammars. [14]
> >
> > 57. A web author should be able to write one speech experience
> > that performs identically across user agents and/or
> > devices. [14]
> >
> > 58. Synthesis that is synchronized with other media (in
> > particular, visual display). [14]
> >
> > 59. Ability to effect barge-in (interrupt synthesis). [14]
> >
> > 60. Ability to mitigate false barge-in scenarios. [14]
> >
> > 61. Playback controls (repeat, skip forward, skip backward, not
> > just by time but by spoken-language segments such as words,
> > sentences, and paragraphs). [14]
> >
> > 62. A user agent needs to provide a clear indication to the user
> > whenever it is using a microphone to listen to the user. [14]
> >
> > 63. Ability for users to explicitly grant permission for the
> > browser, or an application, to listen to them. [14]
> >
> > 64. There needs to be a way to have a trust relationship between
> > the user and whatever processes their utterance. [14]
> >
> > 65. Any user agent should work with any vendor's speech
> > services, provided it meets specific open-protocol
> > requirements. [14]
> >
> > 66. Grammars, TTS and media composition, and recognition results
> > should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
> >
> > 67. Ability to specify service capabilities and hints. [14]
> >
> > 68. Ability to enable multiple languages/dialects on the same
> > page. [15]
> >
> > 69. It is critical that the markup support specification of a
> > network speech resource to be used for recognition or
> > synthesis. [16]
> >
> > 70. End users need a way to adjust properties such as
> > timeouts. [17]
> >
> > References:
> >
> > 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
> >     [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in
> >     http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
> > 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
> > 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
> > 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
> > 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
> > 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
> > 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
> >     [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
> > 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
> > 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
> > 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
> > 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
> > 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
> > 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
> > 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
> > 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
> > 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
> > 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
>
> --
> Best Regards,
> --raman
>
> Title: Research Scientist
> Email: raman@google.com
> WWW: http://emacspeak.sf.net/raman/
> Google: tv+raman
> GTalk: raman@google.com
> PGP: http://emacspeak.sf.net/raman/raman-almaden.asc
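[Editor's note: requirements 12-14 above describe an event-style recognition API in which the page is notified when an utterance is recognized, is told about errors, and can inspect an n-best hypothesis list. A minimal sketch of that notification pattern follows; every name in it (`SpeechRecognizer`, `onresult`, `onerror`, the result shape) is purely illustrative, not any API the group has agreed on.]

```javascript
// Illustrative sketch of the notification pattern in requirements
// 12-14. All identifiers are hypothetical.

class SpeechRecognizer {
  constructor() {
    this.onresult = null; // called with { hypotheses: [...] } (req. 12, 14)
    this.onerror = null;  // called with { message: ... } (req. 13)
  }

  // A real implementation would capture audio (only after a user
  // action, per req. 18) and stream it to a recognition service.
  // Here we simulate a completed recognition for illustration; note
  // that only results, never raw audio, reach the page (req. 17).
  _deliverResult(hypotheses) {
    if (this.onresult) this.onresult({ hypotheses });
  }

  _deliverError(message) {
    if (this.onerror) this.onerror({ message });
  }
}

// Usage: the page reacts to recognition events instead of polling.
const rec = new SpeechRecognizer();
rec.onresult = (e) => {
  // Hypotheses are ordered best-first (req. 14).
  console.log('Best guess:', e.hypotheses[0].transcript);
};
rec.onerror = (e) => console.log('Recognition failed:', e.message);

rec._deliverResult([
  { transcript: 'navigate to main street', confidence: 0.92 },
  { transcript: 'navigate to maple street', confidence: 0.71 },
]);
```

The point of the sketch is only that the notification flow is asynchronous and callback-driven, which is what lets a page satisfy 12 and 13 without blocking while audio is processed.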
Received on Thursday, 23 September 2010 16:42:20 UTC