- From: Olli Pettay <Olli.Pettay@helsinki.fi>
- Date: Thu, 23 Sep 2010 20:10:14 +0300
- To: public-xg-htmlspeech@w3.org
On 09/23/2010 06:22 PM, T.V Raman wrote:
> Good job Michael!
>
> Next step -- at this point we've pooled all the requirements of the
> last 10+ years of the MMIWG, plus a few additional ones to boot from
> the VBWG.
>
> Now, given that those have not been addressed by a single solution in
> 10+ years, I believe it would be both naive and extremely egotistic
> of this XG to try to address all of them at one fell swoop -- we'll
> be here another 10 years -- during which time the requirements will
> only increase.
>
> I urge everyone to take a deep breath, then proceed in small
> practical steps toward building things that the Web needs today.

+1

...which is why I hope we end up with some simple API to control ASR
and TTS. Mini-SALT? If we had that kind of API, script libraries could
then add support for binding ASR/TTS to HTML form elements etc. (A
rough sketch of what I mean is at the bottom of this message.)

-Olli

> Michael Bodell writes:
>> In order to make more structured progress on addressing all the
>> requirements and use cases sent to the list, I've collated them into
>> one comprehensive set in the order they were received. If anyone has
>> use cases or requirements that they didn't send yet, or that they
>> don't see in this list, please send them by Monday the 27th. I've
>> tried to be exhaustive here to be complete and fair, and have not
>> worried at all if some of the requirements are similar to one
>> another or if other requirements are exact opposites. I'll work on a
>> more organized representation of both the additional information
>> sent and this list of requirements and use cases next week.
>>
>> 1. Web search by voice: Speak a search query, and get search
>> results. [1]
>>
>> 2. Speech translation: The app works as an interpreter between two
>> users that speak different languages. [1]
>>
>> 3. Speech-enabled webmail client, e.g. for in-car use. Reads out
>> e-mails and listens for commands, e.g. "archive", "star", "reply,
>> ok, let's meet at 2 pm", "forward to bob". [1]
>>
>> 4. Speech shell: Allows multiple commands, most of which take
>> arguments, some of which are free-form. E.g. "call <number>",
>> "call <contact>", "calculate <arithmetic expression>", "search
>> for <query>". [1]
>>
>> 5. Turn-by-turn navigation: Speaks driving instructions, and
>> accepts spoken commands, e.g. "navigate to <address>", "navigate
>> to <contact name>", "navigate to <business name>", "reroute",
>> "suspend navigation". [1]
>>
>> 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
>>
>> 7. Multimodal interaction: Say "I want to go here", and click on a
>> map. [1]
>>
>> 8. VoiceXML interpreter: Fetches a VoiceXML app using
>> XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
>>
>> 9. The HTML+Speech standard must allow specification of the speech
>> resource (e.g. speech recognizer) to be used for processing of the
>> audio collected from the user. [2]
>>
>> 10. The ability to switch from grammar-based recognition to
>> free-form recognition. [3]
>>
>> 11. Ability to specify field relationships. For example, when a
>> country field is selected, the state field selections change, so
>> the corresponding grammar/choices should also be changed. [3]
>>
>> 12. The API must notify the web app when a spoken utterance has
>> been recognized. [4]
>>
>> 13. The API must notify the web app on speech recognition
>> errors. [4]
>>
>> 14. The API should provide access to a list of speech recognition
>> hypotheses. [4]
>>
>> 15. The API should allow, but not require, specifying a grammar
>> for the speech recognizer to use. [4]
>>
>> 16. The API should allow specifying the natural language in which
>> to perform speech recognition. This will override the language of
>> the web page. [4]
>>
>> 17. For privacy reasons, the API should not allow web apps access
>> to raw audio data but only provide recognition results. [4]
>>
>> 18. For privacy reasons, speech recognition should only be started
>> in response to a user action. [4]
>>
>> 19. Web app developers should not have to run their own speech
>> recognition services. [4]
>>
>> 20. Provide the temporal structure of synthesized speech, e.g. to
>> highlight the current word in a visual rendition of the speech, to
>> synchronize with other modalities in a multimodal presentation, or
>> to know when to interrupt. [5]
>>
>> 21. Allow streaming for longer stretches of spoken output. [5]
>>
>> 22. Use full SSML features including gender, language,
>> pronunciations, etc. [5]
>>
>> 23. Web app developers should not be excluded from running their
>> own speech recognition services. [6]
>>
>> 24. End users should not be prevented from creating new grammars
>> or extending existing ones, on both a global and a per-application
>> basis. [6]
>>
>> 25. End-user extensions should be accessible either from the
>> desktop or from the cloud. [6]
>>
>> 26. For reasons of privacy, the user should not be forced to store
>> anything about their speech recognition environment in the
>> cloud. [6]
>>
>> 27. Any public interfaces for creating extensions should be
>> "speakable". [6]
>>
>> 28. TTS in speech translation: The app works as an interpreter
>> between two users that speak different languages. [7]
>>
>> 29. TTS in a speech-enabled webmail client, e.g. for in-car use.
>> Reads out e-mails and listens for commands, e.g. "archive", "star",
>> "reply, ok, let's meet at 2 pm", "forward to bob". [7]
>>
>> 30. TTS in turn-by-turn navigation: Speaks driving instructions,
>> and accepts spoken commands, e.g. "navigate to <address>",
>> "navigate to <contact name>", "navigate to <business name>",
>> "reroute", "suspend navigation". [7]
>>
>> 31. TTS in dialog systems, e.g. flight booking, pizza
>> ordering. [7]
>>
>> 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using
>> XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
>>
>> 33. A developer creating a (multimodal) interface combining speech
>> input with graphical output needs the ability to provide a
>> consistent user experience not just for graphical elements but
>> also for voice. [8]
>>
>> 34. Hello-world example. [9]
>>
>> 35. Basic VCR-like text reader example. [9]
>>
>> 36. Free-form collector example. [9]
>>
>> 37. Grammar-based collector example. [9]
>>
>> 38. User-selected recognizer. [10]
>>
>> 39. User-controlled speech parameters. [10]
>>
>> 40. Make it easy to integrate input from different
>> modalities. [10]
>>
>> 41. Allow an author to specify an application-specific statistical
>> language model. [10]
>>
>> 42. Make the use of speech optional. [10]
>>
>> 43. Support for completely hands-free operation. [10]
>>
>> 44. Make the standard easy to extend. [10]
>>
>> 45. Selection of the speech engine should be a user setting in the
>> browser, not a web developer setting. [11]
>>
>> 46. It should be possible to specify a target TTS engine not only
>> via the "URI" attribute, but via a more generic "source" attribute,
>> which can point to a local TTS engine as well. [12]
>>
>> 47. TTS should provide the user, or developer, with finer
>> granularity of control over the text segments being
>> synthesized. [13]
>>
>> 48. Interacting with multiple input elements. [14]
>>
>> 49. Interacting without visible input elements. [14]
>>
>> 50. Re-recognition. [14]
>>
>> 51. Continuous recognition. [14]
>>
>> 52. Voice activity detection. [14]
>>
>> 53. Minimize user-perceived latency. [14]
>>
>> 54. High-quality default, but application-customizable, speech
>> recognition graphical user interface. [14]
>>
>> 55. Rich recognition results allowing analysis and complex
>> expression (i.e., confidence, alternatives, structured
>> output). [14]
>>
>> 56. Ability to specify domain-specific grammars. [14]
>>
>> 57. Web author able to write one speech experience that performs
>> identically across user agents and/or devices. [14]
>>
>> 58. Synthesis that is synchronized with other media (in
>> particular, visual display). [14]
>>
>> 59. Ability to effect barge-in (interrupt synthesis). [14]
>>
>> 60. Ability to mitigate false barge-in scenarios. [14]
>>
>> 61. Playback controls (repeat, skip forward, skip backwards, not
>> just by time but by spoken-language segments like words,
>> sentences, and paragraphs). [14]
>>
>> 62. A user agent needs to provide clear indication to the user
>> whenever it is using a microphone to listen to the user. [14]
>>
>> 63. Ability of users to explicitly grant permission for the
>> browser, or an application, to listen to them. [14]
>>
>> 64. There needs to be a way to have a trust relationship between
>> the user and whatever processes their utterance. [14]
>>
>> 65. Any user agent should work with any vendor's speech services,
>> provided they meet specific open protocol requirements. [14]
>>
>> 66. Grammars, TTS and media composition, and recognition results
>> should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
>>
>> 67. Ability to specify service capabilities and hints. [14]
>>
>> 68. Ability to enable multiple languages/dialects for the same
>> page. [15]
>>
>> 69. It is critical that the markup support specification of a
>> network speech resource to be used for recognition or
>> synthesis. [16]
>>
>> 70. End users need a way to adjust properties such as
>> timeouts. [17]
>>
>> References:
>>
>> 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
>> [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and
>> repeated in
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
>>
>> 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
>>
>> 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
>>
>> 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
>>
>> 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
>>
>> 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
>>
>> 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
>> [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
>>
>> 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
>>
>> 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
>>
>> 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
>>
>> 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
>>
>> 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
>>
>> 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
>>
>> 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
>>
>> 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
>>
>> 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
>>
>> 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
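P.S. To make the "Mini-SALT" idea above a little more concrete, here is
the rough sketch I promised. It is purely illustrative: none of these
names (SpeechSynthesizer, SpeechRecognizer, their properties and event
handlers) exist in any spec or implementation; they are all invented
for this example.

  // Purely hypothetical sketch -- every identifier here is made up.
  // TTS: speak an SSML prompt, track marks, listen when playback ends.
  var tts = new SpeechSynthesizer();           // invented object
  tts.onmark = function (e) {
    // e.name could drive word highlighting etc. (cf. req. 20)
  };
  tts.onend = function () { listen(); };       // fires when playback finishes
  tts.speak('<speak xml:lang="en-US">Which <mark name="m1"/>topping ' +
            'would you like?</speak>');        // SSML input (cf. req. 22)

  // ASR: recognize one utterance against an optional grammar.
  function listen() {
    var reco = new SpeechRecognizer();         // invented object
    reco.grammar = "toppings.grxml";           // optional SRGS grammar (cf. req. 15)
    reco.onresult = function (event) {
      // A script library could bind results like this to arbitrary
      // HTML form elements (cf. req. 12, 14).
      document.getElementById("topping").value = event.results[0].text;
    };
    reco.onerror = function (event) {          // recognition errors (cf. req. 13)
      window.alert("recognition error: " + event.code);
    };
    reco.start();  // called only after an explicit user action (cf. req. 18)
  }

With primitives like these in place, form-element binding, dialog flow,
and the rest of the multimodal glue could live in script libraries
rather than in the markup itself.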
Received on Thursday, 23 September 2010 17:10:47 UTC