- From: T.V Raman <raman@google.com>
- Date: Thu, 23 Sep 2010 08:22:40 -0700
- To: mbodell@microsoft.com
- Cc: public-xg-htmlspeech@w3.org
Good job Michael! Next step -- at this point we've pooled all the requirements of the last 10+ years of the MMIWG, plus a few additional ones to boot from the VBWG. Now, given that those have not been addressed by a single solution in 10+ years, I believe it would be both naive and extremely egotistic of this XG to try to address all of them at one fell swoop -- we'll be here another 10 years -- during which time the requirements will only increase. I urge everyone to take a deep breath, then proceed in small practical steps toward building things that the Web needs today.

Michael Bodell writes:
> In order to make more structured progress on addressing all the requirements and use cases sent to the list, I've collated them into one comprehensive set in the order they were received. If anyone has use cases or requirements that they didn't send yet, or that they don't see in this list, please send them by Monday the 27th. I've tried to be exhaustive here, to be complete and fair, and have not worried at all if some of the requirements are similar to one another or if other requirements are exact opposites. I'll work on a more organized representation of both the additional information sent and this list of requirements and use cases next week.
>
> 1. Web search by voice: Speak a search query, and get search results. [1]
>
> 2. Speech translation: The app works as an interpreter between two users that speak different languages. [1]
>
> 3. Speech-enabled webmail client, e.g. for in-car use. Reads out e-mails and listens for commands, e.g. "archive", "star", "reply, ok, let's meet at 2 pm", "forward to bob". [1]
>
> 4. Speech shell: Allows multiple commands, most of which take arguments, some of which are free-form. E.g. "call <number>", "call <contact>", "calculate <arithmetic expression>", "search for <query>". [1]
>
> 5. Turn-by-turn navigation: Speaks driving instructions, and accepts spoken commands, e.g. "navigate to <address>", "navigate to <contact name>", "navigate to <business name>", "reroute", "suspend navigation". [1]
>
> 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
>
> 7. Multimodal interaction: Say "I want to go here", and click on a map. [1]
>
> 8. VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
>
> 9. The HTML+Speech standard must allow specification of the speech resource (e.g. speech recognizer) to be used for processing of the audio collected from the user. [2]
>
> 10. The ability to switch from grammar-based recognition to free-form recognition. [3]
>
> 11. Ability to specify field relationships. For example, when a country field is selected, the state field selections change, so the corresponding grammar/choices should also change. [3]
>
> 12. The API must notify the web app when a spoken utterance has been recognized. [4]
>
> 13. The API must notify the web app on speech recognition errors. [4]
>
> 14. The API should provide access to a list of speech recognition hypotheses. [4]
>
> 15. The API should allow, but not require, specifying a grammar for the speech recognizer to use. [4]
>
> 16. The API should allow specifying the natural language in which to perform speech recognition. This will override the language of the web page. [4]
>
> 17. For privacy reasons, the API should not allow web apps access to raw audio data but only provide recognition results. [4]
>
> 18. For privacy reasons, speech recognition should only be started in response to user action. [4]
>
> 19. Web app developers should not have to run their own speech recognition services. [4]
>
> 20. Provide temporal structure of synthesized speech, e.g. to highlight the word in a visual rendition of the speech, to synchronize with other modalities in a multimodal presentation, or to know when to interrupt. [5]
>
> 21. Allow streaming for longer stretches of spoken output. [5]
>
> 22. Use full SSML features including gender, language, pronunciations, etc. [5]
>
> 23. Web app developers should not be excluded from running their own speech recognition services. [6]
>
> 24. End users should not be prevented from creating grammars or extending existing ones, on both a global and a per-application basis. [6]
>
> 25. End-user extensions should be accessible either from the desktop or from the cloud. [6]
>
> 26. For reasons of privacy, the user should not be forced to store anything about their speech recognition environment in the cloud. [6]
>
> 27. Any public interfaces for creating extensions should be "speakable". [6]
>
> 28. TTS in speech translation: The app works as an interpreter between two users that speak different languages. [7]
>
> 29. TTS in a speech-enabled webmail client, e.g. for in-car use. Reads out e-mails and listens for commands, e.g. "archive", "star", "reply, ok, let's meet at 2 pm", "forward to bob". [7]
>
> 30. TTS in turn-by-turn navigation: Speaks driving instructions, and accepts spoken commands, e.g. "navigate to <address>", "navigate to <contact name>", "navigate to <business name>", "reroute", "suspend navigation". [7]
>
> 31. TTS in dialog systems, e.g. flight booking, pizza ordering. [7]
>
> 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
>
> 33. A developer creating a (multimodal) interface combining speech input with graphical output needs the ability to provide a consistent user experience not just for graphical elements but also for voice. [8]
>
> 34. Hello-world example. [9]
>
> 35. Basic VCR-like text reader example. [9]
>
> 36. Free-form collector example. [9]
>
> 37. Grammar-based collector example. [9]
>
> 38. User-selected recognizer. [10]
>
> 39. User-controlled speech parameters. [10]
>
> 40. Make it easy to integrate input from different modalities. [10]
>
> 41. Allow an author to specify an application-specific statistical language model. [10]
>
> 42. Make the use of speech optional. [10]
>
> 43. Support for completely hands-free operation. [10]
>
> 44. Make the standard easy to extend. [10]
>
> 45. Selection of the speech engine should be a user setting in the browser, not a Web developer setting. [11]
>
> 46. It should be possible to specify a target TTS engine not only via the "URI" attribute, but via a more generic "source" attribute, which can point to a local TTS engine as well. [12]
>
> 47. TTS should provide the user, or developer, with finer granularity of control over the text segments being synthesized. [13]
>
> 48. Interacting with multiple input elements. [14]
>
> 49. Interacting without visible input elements. [14]
>
> 50. Re-recognition. [14]
>
> 51. Continuous recognition. [14]
>
> 52. Voice activity detection. [14]
>
> 53. Minimize user-perceived latency. [14]
>
> 54. High-quality default, but application-customizable, speech recognition graphical user interface. [14]
>
> 55. Rich recognition results allowing analysis and complex expression (i.e., confidence, alternatives, structured output). [14]
>
> 56. Ability to specify domain-specific grammars. [14]
>
> 57. Web author able to write one speech experience that performs identically across user agents and/or devices. [14]
>
> 58. Synthesis that is synchronized with other media (particularly visual display). [14]
>
> 59. Ability to effect barge-in (interrupt synthesis). [14]
>
> 60. Ability to mitigate false barge-in scenarios. [14]
>
> 61. Playback controls (repeat, skip forward, skip backwards, not just by time but by spoken-language segments like words, sentences, and paragraphs). [14]
>
> 62. A user agent needs to provide clear indication to the user whenever it is using a microphone to listen to the user. [14]
>
> 63. Ability of users to explicitly grant permission for the browser, or an application, to listen to them. [14]
>
> 64. There needs to be a way to have a trust relationship between the user and whatever processes their utterance. [14]
>
> 65. Any user agent should work with any vendor's speech services, provided it meets specific open protocol requirements. [14]
>
> 66. Grammars, TTS and media composition, and recognition results should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
>
> 67. Ability to specify service capabilities and hints. [14]
>
> 68. Ability to enable multiple languages/dialects on the same page. [15]
>
> 69. It is critical that the markup support specification of a network speech resource to be used for recognition or synthesis. [16]
>
> 70. End users need a way to adjust properties such as timeouts. [17]
>
> References:
>
> 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and repeated in http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
>
> 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
>
> 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
>
> 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
>
> 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
>
> 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
>
> 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
>
> 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
>
> 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
>
> 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
>
> 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
>
> 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
>
> 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
>
> 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
>
> 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
>
> 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
>
> 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html

--
Best Regards,
--raman

Title:  Research Scientist
Email:  raman@google.com
WWW:    http://emacspeak.sf.net/raman/
Google: tv+raman
GTalk:  raman@google.com
PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
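[Editorial sketch] The API-surface requirements quoted above (items 12-18: result and error notification, an n-best hypothesis list, an optional grammar, a language override, no raw-audio access, and user-initiated capture) suggest an event-driven shape. The following is a minimal illustration only -- every name in it (`SpeechInput`, `onresult`, `onerror`, `grammarURI`) is an assumption of this sketch, not anything the XG has proposed or agreed on:

```javascript
// Hypothetical event-driven recognition interface; all names here are
// illustrative assumptions, not part of any standard.
class SpeechInput {
  constructor({ grammarURI = null, lang = "en-US" } = {}) {
    this.grammarURI = grammarURI; // optional SRGS grammar (req. 15)
    this.lang = lang;             // overrides the page language (req. 16)
    this.onresult = null;         // fired when an utterance is recognized (req. 12)
    this.onerror = null;          // fired on recognition errors (req. 13)
  }

  // A real user agent would begin capture only on user action (req. 18)
  // and would hand the page recognition results, never raw audio
  // (req. 17). Here the outcome is injected so the event flow can be
  // exercised without a microphone.
  start(outcome) {
    if (outcome.error) {
      if (this.onerror) this.onerror({ error: outcome.error });
    } else if (this.onresult) {
      // n-best hypothesis list, most confident first (req. 14)
      this.onresult({ hypotheses: outcome.hypotheses });
    }
  }
}

// Usage: wire up a handler, then simulate one recognition result.
const input = new SpeechInput({ lang: "fr-FR" });
let best = null;
input.onresult = (e) => { best = e.hypotheses[0].transcript; };
input.start({ hypotheses: [{ transcript: "bonjour", confidence: 0.92 }] });
console.log(best); // "bonjour"
```

The point of the event-handler shape is that the page never touches the audio stream itself: the user agent (or a network speech service it delegates to) owns capture and decoding, and the page only ever sees structured results or errors.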
Received on Thursday, 23 September 2010 15:23:14 UTC