- From: Dave Burke <daveburke@google.com>
- Date: Thu, 23 Sep 2010 18:48:50 +0200
- To: "T.V Raman" <raman@google.com>
- Cc: mbodell <mbodell@microsoft.com>, public-xg-htmlspeech <public-xg-htmlspeech@w3.org>
- Message-ID: <AANLkTimg5_r1osXTCReZhFoUubWxSrou6wxRY8Y1MtSg@mail.gmail.com>
+1 And remember that the XG is on a much more compressed timescale than VBWG/MMI. Let's be focused and pragmatic...

Dave

2010/9/23 T.V Raman <raman@google.com>

> Good job Michael!
>
> Next step -- at this point we've pooled all the requirements of the
> last 10+ years of the MMIWG, plus a few additional ones to boot from
> VBWG.
>
> Now, given that those have not been addressed by a single solution in
> 10+ years, I believe it would be both naive and extremely egotistic of
> this XG to try to address all of them at one fell swoop -- we'll be
> here another 10 years -- during which time the requirements will only
> increase.
>
> I urge everyone to take a deep breath, then proceed in small practical
> steps toward building things that the Web needs today.
>
> Michael Bodell writes:
> > In order to make more structured progress on addressing all the
> > requirements and use cases sent to the list, I've collated them into
> > one comprehensive set in the order they were received. If anyone has
> > use cases or requirements that they haven't yet sent, or that they
> > don't see in this list, please send them by Monday the 27th. I've
> > tried to be exhaustive here, to be complete and fair, and have not
> > worried at all whether some of the requirements are similar to one
> > another or whether other requirements are exact opposites. I'll work
> > on a more organized representation of both the additional
> > information sent and this list of requirements and use cases next
> > week.
> >
> > 1. Web search by voice: Speak a search query, and get search
> > results. [1]
> >
> > 2. Speech translation: The app works as an interpreter between two
> > users that speak different languages. [1]
> >
> > 3. Speech-enabled webmail client, e.g. for in-car use. Reads out
> > e-mails and listens for commands, e.g. "archive", "star", "reply,
> > ok, let's meet at 2 pm", "forward to bob". [1]
> >
> > 4. Speech shell: Allows multiple commands, most of which take
> > arguments, some of which are free-form. E.g. "call <number>", "call
> > <contact>", "calculate <arithmetic expression>", "search for
> > <query>". [1]
> >
> > 5. Turn-by-turn navigation: Speaks driving instructions, and
> > accepts spoken commands, e.g. "navigate to <address>", "navigate to
> > <contact name>", "navigate to <business name>", "reroute", "suspend
> > navigation". [1]
> >
> > 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
> >
> > 7. Multimodal interaction: Say "I want to go here", and click on a
> > map. [1]
> >
> > 8. VoiceXML interpreter: Fetches a VoiceXML app using
> > XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
> >
> > 9. The HTML+Speech standard must allow specification of the speech
> > resource (e.g. speech recognizer) to be used for processing of the
> > audio collected from the user. [2]
> >
> > 10. The ability to switch between grammar-based recognition and
> > free-form recognition. [3]
> >
> > 11. Ability to specify field relationships. For example, when a
> > country field is selected, the state field selections change, so
> > the corresponding grammar/choices should also be changed. [3]
> >
> > 12. The API must notify the web app when a spoken utterance has
> > been recognized. [4]
> >
> > 13. The API must notify the web app on speech recognition
> > errors. [4]
> >
> > 14. The API should provide access to a list of speech recognition
> > hypotheses. [4]
> >
> > 15. The API should allow, but not require, specifying a grammar for
> > the speech recognizer to use. [4]
> > 16. The API should allow specifying the natural language in which
> > to perform speech recognition. This will override the language of
> > the web page. [4]
> >
> > 17. For privacy reasons, the API should not allow web apps access
> > to raw audio data but only provide recognition results. [4]
> >
> > 18. For privacy reasons, speech recognition should only be started
> > in response to user action. [4]
> >
> > 19. Web app developers should not have to run their own speech
> > recognition services. [4]
> >
> > 20. Provide temporal structure of synthesized speech. E.g., to
> > highlight the word in a visual rendition of the speech, to
> > synchronize with other modalities in a multimodal presentation, to
> > know when to interrupt. [5]
> >
> > 21. Allow streaming for longer stretches of spoken output. [5]
> >
> > 22. Use full SSML features including gender, language,
> > pronunciations, etc. [5]
> >
> > 23. Web app developers should not be excluded from running their
> > own speech recognition services. [6]
> >
> > 24. End users should not be prevented from creating new grammars or
> > extending existing ones, on both a global and a per-application
> > basis. [6]
> >
> > 25. End-user extensions should be accessible either from the
> > desktop or from the cloud. [6]
> >
> > 26. For reasons of privacy, the user should not be forced to store
> > anything about their speech recognition environment in the
> > cloud. [6]
> >
> > 27. Any public interfaces for creating extensions should be
> > "speakable". [6]
> >
> > 28. TTS in speech translation: The app works as an interpreter
> > between two users that speak different languages. [7]
> >
> > 29. TTS in a speech-enabled webmail client, e.g. for in-car use.
> > Reads out e-mails and listens for commands, e.g. "archive", "star",
> > "reply, ok, let's meet at 2 pm", "forward to bob". [7]
> >
> > 30. TTS in turn-by-turn navigation: Speaks driving instructions,
> > and accepts spoken commands, e.g. "navigate to <address>",
> > "navigate to <contact name>", "navigate to <business name>",
> > "reroute", "suspend navigation". [7]
> >
> > 31. TTS in dialog systems, e.g. flight booking, pizza ordering. [7]
> >
> > 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using
> > XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
> >
> > 33. A developer creating a (multimodal) interface combining speech
> > input with graphical output needs the ability to provide a
> > consistent user experience not just for graphical elements but also
> > for voice. [8]
> >
> > 34. Hello world example. [9]
> >
> > 35. Basic VCR-like text reader example. [9]
> >
> > 36. Free-form collector example. [9]
> >
> > 37. Grammar-based collector example. [9]
> >
> > 38. User-selected recognizer. [10]
> >
> > 39. User-controlled speech parameters. [10]
> >
> > 40. Make it easy to integrate input from different
> > modalities. [10]
> >
> > 41. Allow an author to specify an application-specific statistical
> > language model. [10]
> >
> > 42. Make the use of speech optional. [10]
> >
> > 43. Support for completely hands-free operation. [10]
> >
> > 44. Make the standard easy to extend. [10]
> >
> > 45. Selection of the speech engine should be a user setting in the
> > browser, not a web developer setting. [11]
> >
> > 46. It should be possible to specify a target TTS engine not only
> > via the "URI" attribute, but via a more generic "source" attribute,
> > which can point to a local TTS engine as well. [12]
> > 47. TTS should provide the user, or developer, with finer
> > granularity of control over the text segments being
> > synthesized. [13]
> >
> > 48. Interacting with multiple input elements. [14]
> >
> > 49. Interacting without visible input elements. [14]
> >
> > 50. Re-recognition. [14]
> >
> > 51. Continuous recognition. [14]
> >
> > 52. Voice activity detection. [14]
> >
> > 53. Minimize user-perceived latency. [14]
> >
> > 54. High-quality default, but application-customizable, speech
> > recognition graphical user interface. [14]
> >
> > 55. Rich recognition results allowing analysis and complex
> > expression (i.e., confidence, alternatives, structured
> > output). [14]
> >
> > 56. Ability to specify domain-specific grammars. [14]
> >
> > 57. Web author able to write one speech experience that performs
> > identically across user agents and/or devices. [14]
> >
> > 58. Synthesis that is synchronized with other media (particularly
> > visual display). [14]
> >
> > 59. Ability to effect barge-in (interrupt synthesis). [14]
> >
> > 60. Ability to mitigate false barge-in scenarios. [14]
> >
> > 61. Playback controls (repeat, skip forward, skip backward -- not
> > just by time but by spoken-language segments like words, sentences,
> > and paragraphs). [14]
> >
> > 62. A user agent needs to provide clear indication to the user
> > whenever it is using a microphone to listen to the user. [14]
> >
> > 63. Ability of users to explicitly grant permission for the
> > browser, or an application, to listen to them. [14]
> >
> > 64. There needs to be a way to have a trust relationship between
> > the user and whatever processes their utterance. [14]
> >
> > 65. Any user agent should work with any vendor's speech services,
> > provided they meet specific open protocol requirements. [14]
> >
> > 66. Grammars, TTS and media composition, and recognition results
> > should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
> >
> > 67. Ability to specify service capabilities and hints. [14]
> >
> > 68. Ability to enable multiple languages/dialects for the same
> > page. [15]
> >
> > 69. It is critical that the markup support specification of a
> > network speech resource to be used for recognition or
> > synthesis. [16]
> >
> > 70. End users need a way to adjust properties such as
> > timeouts. [17]
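To make the recognition-side requirements in the list above concrete (in particular 12-19: result and error notification, n-best hypotheses, an optional grammar, and a per-request language), here is a minimal sketch of what such a JavaScript API might look like. It is purely illustrative: the thread defines no API, and every name below (`SpeechRecognizer`, `onresult`, `hypotheses`, and so on) is invented for the example.

```js
// Hypothetical sketch only: none of these names are defined by the
// thread; they illustrate requirements 12-19 from the list above.
var recognizer = new SpeechRecognizer({
  grammar: "grammars/pizza.grxml", // optional SRGS grammar (req. 15); path is made up
  lang: "en-US",                   // overrides the page language (req. 16)
  maxHypotheses: 5                 // request an n-best list (req. 14)
});

// Req. 12: notify the web app when an utterance has been recognized.
recognizer.onresult = function (event) {
  // Req. 14 / 55: a ranked list of hypotheses with confidence scores.
  event.hypotheses.forEach(function (h) {
    console.log(h.utterance, h.confidence);
  });
};

// Req. 13: notify the web app on recognition errors.
recognizer.onerror = function (event) {
  console.error("recognition failed:", event.code);
};

// Req. 18: capture starts only in response to a user action, and the
// app never sees raw audio (req. 17) -- only recognition results.
document.getElementById("mic-button").onclick = function () {
  recognizer.start();
};
```

Note that nothing in this shape commits to where recognition runs: the same surface could be backed by a network speech resource (req. 9, 69) or a local engine chosen by the user (req. 45).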
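A similarly hypothetical sketch for the synthesis side, touching requirements 20-22 and 58-59: SSML input, mark events for word-level synchronization with a visual display, and barge-in. Only the markup itself is standard (SSML 1.0 `<speak>` and `<mark>`); the surrounding object and event names, and the `highlightWord` helper, are invented, and the barge-in trigger reuses the `recognizer` object from the previous sketch.

```js
// Hypothetical synthesis sketch for requirements 20-22 and 58-59.
var tts = new SpeechSynthesizer();

// Req. 22: full SSML, including language and pronunciation control.
var ssml =
  '<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
  '  Turn <mark name="w1"/>left on <mark name="w2"/>Main Street.' +
  '</speak>';

// Req. 20 / 58: temporal structure -- fire an event as each mark is
// reached, e.g. to highlight the matching word in a visual rendition.
tts.onmark = function (event) {
  highlightWord(event.name); // hypothetical page-side helper
};

tts.speak(ssml);

// Req. 59: barge-in -- cancel synthesis as soon as the user speaks.
recognizer.onspeechstart = function () {
  tts.cancel();
};
```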
> > References:
> >
> > 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
> > [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and
> > repeated in
> > http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
> >
> > 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
> >
> > 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
> >
> > 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
> >
> > 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
> >
> > 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
> >
> > 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
> > [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
> >
> > 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
> >
> > 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
> >
> > 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
> >
> > 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
> >
> > 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
> >
> > 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
> >
> > 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
> >
> > 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
> >
> > 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
> >
> > 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
>
> --
> Best Regards,
> --raman
>
> Title: Research Scientist
> Email: raman@google.com
> WWW: http://emacspeak.sf.net/raman/
> Google: tv+raman
> GTalk: raman@google.com
> PGP: http://emacspeak.sf.net/raman/raman-almaden.asc
Received on Thursday, 23 September 2010 16:49:21 UTC