- From: Raj\(Openstream\) <raj@openstream.com>
- Date: Thu, 23 Sep 2010 12:48:32 -0400
- To: "T.V Raman" <raman@google.com>
- Cc: <raman@google.com>, <mbodell@microsoft.com>, <public-xg-htmlspeech@w3.org>
Sure.. I recognize that both of us are aware of the scope and roles of these
respective groups... I believe my comments are in line with your stated
objective of biting off as much as we can practically chew from an HTML
speech-enablement perspective, without embarking on a broader feature-set
that other fully-commissioned Working Groups are currently set to pursue..

Regards,
Raj

----- Original Message -----
From: "T.V Raman" <raman@google.com>
To: <raj@openstream.com>
Cc: <raman@google.com>; <mbodell@microsoft.com>; <public-xg-htmlspeech@w3.org>
Sent: Thursday, September 23, 2010 12:43 PM
Subject: Re: Collection of requirements and use cases

> Raj, I believe that the MMIWG and VBWG, as fully-chartered WGs,
> should continue developing the long-term work. If the XG is going
> to do what those two WGs are doing, then we might as well shut
> those down (not necessarily advocating that on this thread) --
>
> Raj (Openstream) writes:
> > Great job Michael.. And excellent commentary by TVRaman as usual..
> >
> > During the deep-breath exercise that TVR suggested, I would also
> > add, at the risk of sounding trite, that the "simple & practical"
> > alternatives did not work either over the last 10 years, resulting
> > in where we are today, and I am afraid any "quick & practical"
> > approach will result in greater fragmentation of development, which
> > I am sure we are all trying to avoid here..
> >
> > Perhaps it would make it easier to highlight each requirement and
> > link the illustrative example use cases, to make easy reading of
> > the set/union of must-haves for the feature-set, as TVR suggests,
> > for a realistic initial set of XG deliverables.
> >
> > Regards,
> > Raj
> >
> > On Thu, 23 Sep 2010 08:22:40 -0700
> > raman@google.com (T.V Raman) wrote:
> >
> > > Good job Michael!
> > > Next step -- at this point we've pooled all the requirements of
> > > the last 10+ years of the MMIWG, plus a few additional ones to
> > > boot from the VBWG.
> > >
> > > Now, given that those have not been addressed by a single
> > > solution in 10+ years, I believe it would be both naive and
> > > extremely egotistic of this XG to try to address all of them at
> > > one fell swoop -- we'll be here another 10 years -- during which
> > > time the requirements will only increase.
> > >
> > > I urge everyone to take a deep breath, then proceed in small
> > > practical steps toward building things that the Web needs today.
> > >
> > > Michael Bodell writes:
> > > > In order to make more structured progress on addressing all the
> > > > requirements and use cases sent to the list, I've collated them
> > > > into one comprehensive set, in the order they were received. If
> > > > anyone has use cases or requirements that they didn't send yet,
> > > > or that they don't see in this list, please send them by Monday
> > > > the 27th. I've tried to be exhaustive here, to be complete and
> > > > fair, and have not worried at all if some of the requirements
> > > > are similar to one another or if other requirements are exact
> > > > opposites. I'll work on a more organized representation of both
> > > > the additional information sent and this list of requirements
> > > > and use cases next week.
> > > >
> > > > 1. Web search by voice: Speak a search query, and get
> > > > search results. [1]
> > > >
> > > > 2. Speech translation: The app works as an interpreter
> > > > between two users that speak different languages. [1]
> > > >
> > > > 3. Speech-enabled webmail client, e.g. for in-car use. Reads
> > > > out e-mails and listens for commands, e.g. "archive", "star",
> > > > "reply, ok, let's meet at 2 pm", "forward to bob". [1]
> > > >
> > > > 4. Speech shell: Allows multiple commands, most of which
> > > > take arguments, some of which are free-form. E.g.
"call <number>", > > >"call <contact>", > > > > "calculate <arithmetic expression>", "search for <query>".. [1] > > > > > > > > 5. Turn-by-turn navigation: Speaks driving instructions, > > >and accepts spoken commands, e.g. "navigate to <address>", "navigate > > >to <contact name>", > > > > "navigate to <business name>", "reroute", "suspend navigation". > > >[1] > > > > > > > > 6. Dialog systems, e.g. flight booking, pizza ordering. [1] > > > > > > > > 7. Multimodal interaction: Say "I want to go here", and > > >click on a map. [1] > > > > > > > > 8. VoiceXML interpreter: Fetches a VoiceXML app using > > >XMLHttpRequest, and interprets it using JavaScript and DOM. [1] > > > > > > > > 9. The HTML+Speech standard must allow specification of the > > >speech resource (e.g. speech recognizer) to be used for processing of > > >the audio collected > > > > from the user. [2] > > > > > > > > 10.. The ability to switch between a grammar based recognition > > >to free form recognition. [3] > > > > > > > > 11.. Ability to specify the field relationships. For example > > >when a country field is selected, the state field selections change, > > >so corresponding grammar/ > > > > choices should also be changed. [3] > > > > > > > > 12.. The API must notify the web app when a spoken utterance has > > >been recognized. [4] > > > > > > > > 13.. The API must notify the web app on speech recognition > > >errors. [4] > > > > > > > > 14.. The API should provide access to a list of speech > > >recognition hypotheses. [4] > > > > > > > > 15.. The API should allow, but not require, specifying a grammar > > >for the speech recognizer to use. [4] > > > > > > > > 16.. The API should allow specifying the natural language in > > >which to perform speech recognition. This will override the language > > >of the web page. [4] > > > > > > > > 17.. For privacy reasons, the API should not allow web apps > > >access to raw audio data but only provide recognition results. [4] > > > > > > > > 18.. 
For privacy reasons, speech recognition should only be
> > > > started in response to user action. [4]
> > > >
> > > > 19. Web app developers should not have to run their own
> > > > speech recognition services. [4]
> > > >
> > > > 20. Provide the temporal structure of synthesized speech,
> > > > e.g. to highlight the word in a visual rendition of the
> > > > speech, to synchronize with other modalities in a multimodal
> > > > presentation, or to know when to interrupt. [5]
> > > >
> > > > 21. Allow streaming for longer stretches of spoken output. [5]
> > > >
> > > > 22. Use full SSML features, including gender, language,
> > > > pronunciations, etc. [5]
> > > >
> > > > 23. Web app developers should not be excluded from running
> > > > their own speech recognition services. [6]
> > > >
> > > > 24. End users should not be prevented from creating new or
> > > > extending existing grammars, on both a global and a
> > > > per-application basis. [6]
> > > >
> > > > 25. End-user extensions should be accessible either from the
> > > > desktop or from the cloud. [6]
> > > >
> > > > 26. For reasons of privacy, the user should not be forced to
> > > > store anything about their speech recognition environment in
> > > > the cloud. [6]
> > > >
> > > > 27. Any public interfaces for creating extensions should be
> > > > "speakable". [6]
> > > >
> > > > 28. TTS in speech translation: The app works as an
> > > > interpreter between two users that speak different
> > > > languages. [7]
> > > >
> > > > 29. TTS in a speech-enabled webmail client, e.g. for in-car
> > > > use. Reads out e-mails and listens for commands, e.g.
> > > > "archive", "star", "reply, ok, let's meet at 2 pm",
> > > > "forward to bob". [7]
> > > >
> > > > 30. TTS in turn-by-turn navigation: Speaks driving
> > > > instructions, and accepts spoken commands, e.g. "navigate to
> > > > <address>", "navigate to <contact name>", "navigate to
> > > > <business name>", "reroute", "suspend navigation". [7]
> > > >
> > > > 31. TTS in dialog systems, e.g.
flight booking, pizza ordering. [7]
> > > >
> > > > 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app
> > > > using XMLHttpRequest, and interprets it using JavaScript and
> > > > DOM. [7]
> > > >
> > > > 33. A developer creating a (multimodal) interface combining
> > > > speech input with graphical output needs the ability to
> > > > provide a consistent user experience not just for graphical
> > > > elements but also for voice. [8]
> > > >
> > > > 34. Hello-world example. [9]
> > > >
> > > > 35. Basic VCR-like text reader example. [9]
> > > >
> > > > 36. Free-form collector example. [9]
> > > >
> > > > 37. Grammar-based collector example. [9]
> > > >
> > > > 38. User-selected recognizer. [10]
> > > >
> > > > 39. User-controlled speech parameters. [10]
> > > >
> > > > 40. Make it easy to integrate input from different
> > > > modalities. [10]
> > > >
> > > > 41. Allow an author to specify an application-specific
> > > > statistical language model. [10]
> > > >
> > > > 42. Make the use of speech optional. [10]
> > > >
> > > > 43. Support for completely hands-free operation. [10]
> > > >
> > > > 44. Make the standard easy to extend. [10]
> > > >
> > > > 45. Selection of the speech engine should be a user setting
> > > > in the browser, not a Web developer setting. [11]
> > > >
> > > > 46. It should be possible to specify a target TTS engine not
> > > > only via the "URI" attribute, but via a more generic "source"
> > > > attribute, which can point to a local TTS engine as well. [12]
> > > >
> > > > 47. TTS should provide the user, or developer, with finer
> > > > granularity of control over the text segments being
> > > > synthesized. [13]
> > > >
> > > > 48. Interacting with multiple input elements. [14]
> > > >
> > > > 49. Interacting without visible input elements. [14]
> > > >
> > > > 50. Re-recognition. [14]
> > > >
> > > > 51. Continuous recognition. [14]
> > > >
> > > > 52. Voice activity detection. [14]
> > > >
> > > > 53. Minimize user-perceived latency.
[14]
> > > >
> > > > 54. High-quality default, but application-customizable,
> > > > speech recognition graphical user interface. [14]
> > > >
> > > > 55. Rich recognition results allowing analysis and complex
> > > > expression (i.e., confidence, alternatives, structured
> > > > output). [14]
> > > >
> > > > 56. Ability to specify domain-specific grammars. [14]
> > > >
> > > > 57. Web authors able to write one speech experience that
> > > > performs identically across user agents and/or devices. [14]
> > > >
> > > > 58. Synthesis that is synchronized with other media
> > > > (particularly the visual display). [14]
> > > >
> > > > 59. Ability to effect barge-in (interrupt synthesis). [14]
> > > >
> > > > 60. Ability to mitigate false barge-in scenarios. [14]
> > > >
> > > > 61. Playback controls (repeat, skip forward, skip backward,
> > > > not just by time but by spoken-language segments like words,
> > > > sentences, and paragraphs). [14]
> > > >
> > > > 62. A user agent needs to provide a clear indication to the
> > > > user whenever it is using a microphone to listen to the
> > > > user. [14]
> > > >
> > > > 63. Ability of users to explicitly grant permission for the
> > > > browser, or an application, to listen to them. [14]
> > > >
> > > > 64. There needs to be a way to have a trust relationship
> > > > between the user and whatever processes their utterance. [14]
> > > >
> > > > 65. Any user agent should work with any vendor's speech
> > > > services, provided they meet specific open protocol
> > > > requirements. [14]
> > > >
> > > > 66. Grammars, TTS and media composition, and recognition
> > > > results should use standard formats (e.g. SRGS, SSML, SMIL,
> > > > EMMA). [14]
> > > >
> > > > 67. Ability to specify service capabilities and hints. [14]
> > > >
> > > > 68. Ability to enable multiple languages/dialects for the
> > > > same page. [15]
> > > >
> > > > 69. It is critical that the markup support specification of a
> > > > network speech resource to be used for recognition or
> > > > synthesis. [16]
> > > >
> > > > 70.
End users need a way to adjust properties such as timeouts. [17]
> > > >
> > > > References:
> > > >
> > > > 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
> > > > [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in
> > > > http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
> > > >
> > > > 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
> > > >
> > > > 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
> > > >
> > > > 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
> > > >
> > > > 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
> > > >
> > > > 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
> > > >
> > > > 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
> > > > [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
> > > >
> > > > 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
> > > >
> > > > 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
> > > >
> > > > 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
> > > >
> > > > 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
> > > >
> > > > 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
> > > >
> > > > 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
> > > >
> > > > 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
> > > >
> > > > 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
> > > >
> > > > 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
> > > >
> > > > 17 -
> > > > http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
> > >
> > > --
> > > Best Regards,
> > > --raman
> > >
> > > Title:  Research Scientist
> > > Email:  raman@google.com
> > > WWW:    http://emacspeak.sf.net/raman/
> > > Google: tv+raman
> > > GTalk:  raman@google.com
> > > PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
> >
> > --
> > NOTICE TO RECIPIENT:
> > THIS E-MAIL IS MEANT FOR ONLY THE INTENDED RECIPIENT OF THE
> > TRANSMISSION, AND MAY BE A COMMUNICATION PRIVILEGED BY LAW. IF YOU
> > RECEIVED THIS E-MAIL IN ERROR, ANY REVIEW, USE, DISSEMINATION,
> > DISTRIBUTION, OR COPYING OF THIS E-MAIL IS STRICTLY PROHIBITED. PLEASE
> > NOTIFY US IMMEDIATELY OF THE ERROR BY RETURN E-MAIL AND PLEASE DELETE
> > THIS MESSAGE FROM YOUR SYSTEM. THANK YOU IN ADVANCE FOR YOUR
> > COOPERATION.
> > Reply to: legal@openstream.com
>
> --
> Best Regards,
> --raman
>
> Title:  Research Scientist
> Email:  raman@google.com
> WWW:    http://emacspeak.sf.net/raman/
> Google: tv+raman
> GTalk:  raman@google.com
> PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
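[Editor's illustration] Requirements 12-15 in the collated list describe the event surface a page script would see: notification on a recognized utterance, notification on errors, access to the n-best hypothesis list, and an optional grammar. No API had been defined by the XG at the time of this thread, so the sketch below is purely illustrative; every name in it (`SpeechInput`, `hypotheses`, `confidence`, `grammarURI`) is an assumption, not anything the group or any browser had specified.

```javascript
// Hypothetical sketch of the event surface implied by requirements 12-15.
// A real implementation would capture audio; here the recognizer response
// is injected via _deliver() purely to show the event flow.
class SpeechInput {
  constructor() {
    this.grammarURI = null; // req 15: a grammar may be given, but is optional
    this.lang = null;       // req 16: would override the page language
    this.onresult = null;   // req 12: called when an utterance is recognized
    this.onerror = null;    // req 13: called on recognition errors
  }

  _deliver(response) {
    if (response.error) {
      if (this.onerror) this.onerror({ error: response.error });
      return;
    }
    // req 14: expose the full n-best list, best hypothesis first
    const hypotheses = [...response.hypotheses].sort(
      (a, b) => b.confidence - a.confidence
    );
    if (this.onresult) this.onresult({ hypotheses });
  }
}
```

Keeping the result a sorted list rather than a single string is what lets the page do its own analysis (requirement 55) instead of trusting a single best guess.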
Received on Thursday, 23 September 2010 17:19:34 UTC
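[Editor's illustration] Requirement 20 in the list above asks for the temporal structure of synthesized speech, e.g. so a page can highlight each word as it is spoken. One minimal reading of that requirement, assuming (hypothetically; no engine interface was defined in this thread) that a synthesizer can report one duration per word, is to convert those durations into absolute start offsets:

```javascript
// Sketch only: assumes a synthesizer that reports a per-word duration in
// milliseconds. Turning durations into absolute start offsets gives the
// page the timing marks it needs for word highlighting (requirement 20).
function wordMarks(words, durationsMs) {
  const marks = [];
  let t = 0; // running offset from the start of playback, in ms
  for (let i = 0; i < words.length; i++) {
    marks.push({ word: words[i], startMs: t });
    t += durationsMs[i];
  }
  return marks;
}
```

The same marks would also serve the synchronization use cases (requirements 58 and 61), since segment boundaries for skip-by-word playback fall out of the same offsets.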