Re: Collection of requirements and use cases

+1

And remember that the XG is on a much more compressed timescale than
VBWG/MMI. Let's be focused and pragmatic...

Dave

2010/9/23 T.V Raman <raman@google.com>

>
> Good job Michael!
>
> Next step -- at this point we've pooled all the requirements of
> the last 10+ years of the MMIWG, plus a few additional ones to
> boot from VBWG.
>
> Now, given that those have not been addressed by a single
> solution in 10+ years, I believe it would be both naive and
> extremely egotistic of this XG to try to address all of them in
> one fell swoop -- we'll be here another 10 years -- during which
> time the requirements will only increase.
>
> I urge everyone to take a deep breath, then proceed in small
> practical steps toward building things that the Web needs today.
>
> Michael Bodell writes:
>  > In order to make more structured progress on addressing all the
>  > requirements and use cases sent to the list, I've collated them into
>  > one comprehensive set, in the order they were received. If anyone has
>  > use cases or requirements that they didn't send yet, or that they
>  > don't see in this list, please send them by Monday the 27th. I've
>  > tried to be exhaustive in order to be complete and fair, and have not
>  > worried if some of the requirements are similar to one another or if
>  > others are exact opposites. I'll work on a more organized
>  > representation of both the additional information sent and this list
>  > of requirements and use cases next week.
>  >
>  > 1.  Web search by voice: Speak a search query, and get search
>  > results. [1]
>  >
>  > 2.  Speech translation: The app works as an interpreter between two
>  > users who speak different languages. [1]
>  >
>  > 3.  Speech-enabled webmail client, e.g. for in-car use. Reads out
>  > e-mails and listens for commands, e.g. "archive", "star", "reply, ok,
>  > let's meet at 2 pm", "forward to bob". [1]
>  >
>  > 4.  Speech shell: Allows multiple commands, most of which take
>  > arguments, some of which are free-form. E.g. "call <number>", "call
>  > <contact>", "calculate <arithmetic expression>", "search for
>  > <query>". [1]
>  >
>  > 5.  Turn-by-turn navigation: Speaks driving instructions, and
>  > accepts spoken commands, e.g. "navigate to <address>", "navigate to
>  > <contact name>", "navigate to <business name>", "reroute", "suspend
>  > navigation". [1]
>  >
>  > 6.  Dialog systems, e.g. flight booking, pizza ordering. [1]
>  >
>  > 7.  Multimodal interaction: Say "I want to go here", and click on a
>  > map. [1]
>  >
>  > 8.  VoiceXML interpreter: Fetches a VoiceXML app using
>  > XMLHttpRequest, and interprets it using JavaScript and DOM. (See the
>  > first sketch after this list.) [1]
>  >
>  > 9.  The HTML+Speech standard must allow specification of the speech
>  > resource (e.g. speech recognizer) to be used for processing of the
>  > audio collected from the user. [2]
>  >
>  > 10.  The ability to switch between grammar-based recognition and
>  > free-form recognition. [3]
>  >
>  > 11.  Ability to specify field relationships. For example, when a
>  > country field is selected, the state field selections change, so the
>  > corresponding grammar/choices should also be changed. [3]
>  >
>  > 12.  The API must notify the web app when a spoken utterance has
>  > been recognized. (Requirements 12-15 are illustrated in the second
>  > sketch after this list.) [4]
>  >
>  > 13.  The API must notify the web app on speech recognition
>  > errors. [4]
>  >
>  > 14.  The API should provide access to a list of speech recognition
>  > hypotheses. [4]
>  >
>  > 15.  The API should allow, but not require, specifying a grammar
>  > for the speech recognizer to use. [4]
>  >
>  > 16.  The API should allow specifying the natural language in which
>  > to perform speech recognition. This will override the language of the
>  > web page. [4]
>  >
>  > 17.  For privacy reasons, the API should not allow web apps access
>  > to raw audio data but only provide recognition results. [4]
>  >
>  > 18.  For privacy reasons, speech recognition should only be started
>  > in response to user action. [4]
>  >
>  > 19.  Web app developers should not have to run their own speech
>  > recognition services. [4]
>  >
>  > 20.  Provide temporal structure of synthesized speech, e.g. to
>  > highlight the word in a visual rendition of the speech, to
>  > synchronize with other modalities in a multimodal presentation, or to
>  > know when to interrupt. (See the third sketch after this list.) [5]
>  >
>  > 21.  Allow streaming for longer stretches of spoken output. [5]
>  >
>  > 22.  Use full SSML features, including gender, language,
>  > pronunciations, etc. [5]
>  >
>  > 23.  Web app developers should not be excluded from running their
>  > own speech recognition services. [6]
>  >
>  > 24.  End users should not be prevented from creating new grammars
>  > or extending existing ones, on both a global and a per-application
>  > basis. [6]
>  >
>  > 25.  End-user extensions should be accessible either from the
>  > desktop or from the cloud. [6]
>  >
>  > 26.  For reasons of privacy, the user should not be forced to store
>  > anything about their speech recognition environment in the cloud. [6]
>  >
>  > 27.  Any public interfaces for creating extensions should be
>  > "speakable". [6]
>  >
>  > 28.  TTS in speech translation: The app works as an interpreter
>  > between two users who speak different languages. [7]
>  >
>  > 29.  TTS in speech-enabled webmail client, e.g. for in-car use.
>  > Reads out e-mails and listens for commands, e.g. "archive", "star",
>  > "reply, ok, let's meet at 2 pm", "forward to bob". [7]
>  >
>  > 30.  TTS in turn-by-turn navigation: Speaks driving instructions,
>  > and accepts spoken commands, e.g. "navigate to <address>", "navigate
>  > to <contact name>", "navigate to <business name>", "reroute",
>  > "suspend navigation". [7]
>  >
>  > 31.  TTS in dialog systems, e.g. flight booking, pizza
>  > ordering. [7]
>  >
>  > 32.  TTS in VoiceXML interpreter: Fetches a VoiceXML app using
>  > XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
>  >
>  > 33.  A developer creating a (multimodal) interface combining speech
>  > input with graphical output needs the ability to provide a consistent
>  > user experience not just for graphical elements but also for
>  > voice. [8]
>  >
>  > 34.  Hello world example. [9]
>  >
>  > 35.  Basic VCR-like text reader example. [9]
>  >
>  > 36.  Free-form collector example. [9]
>  >
>  > 37.  Grammar-based collector example. [9]
>  >
>  > 38.  User-selected recognizer. [10]
>  >
>  > 39.  User-controlled speech parameters. [10]
>  >
>  > 40.  Make it easy to integrate input from different
>  > modalities. [10]
>  >
>  > 41.  Allow an author to specify an application-specific statistical
>  > language model. [10]
>  >
>  > 42.  Make the use of speech optional. [10]
>  >
>  > 43.  Support for completely hands-free operation. [10]
>  >
>  > 44.  Make the standard easy to extend. [10]
>  >
>  > 45.  Selection of the speech engine should be a user setting in the
>  > browser, not a Web developer setting. [11]
>  >
>  > 46.  It should be possible to specify a target TTS engine not only
>  > via the "URI" attribute, but via a more generic "source" attribute,
>  > which can point to a local TTS engine as well. [12]
>  >
>  > 47.  TTS should provide the user, or developer, with finer
>  > granularity of control over the text segments being synthesized. [13]
>  >
>  > 48.  Interacting with multiple input elements. [14]
>  >
>  > 49.  Interacting without visible input elements. [14]
>  >
>  > 50.  Re-recognition. [14]
>  >
>  > 51.  Continuous recognition. [14]
>  >
>  > 52.  Voice activity detection. [14]
>  >
>  > 53.  Minimize user-perceived latency. [14]
>  >
>  > 54.  High-quality default, but application-customizable, speech
>  > recognition graphical user interface. [14]
>  >
>  > 55.  Rich recognition results allowing analysis and complex
>  > expression (i.e., confidence, alternatives, structured output). [14]
>  >
>  > 56.  Ability to specify domain-specific grammars. [14]
>  >
>  > 57.  Web author able to write one speech experience that performs
>  > identically across user agents and/or devices. [14]
>  >
>  > 58.  Synthesis that is synchronized with other media (in
>  > particular, visual display). [14]
>  >
>  > 59.  Ability to effect barge-in (interrupt synthesis). [14]
>  >
>  > 60.  Ability to mitigate false barge-in scenarios. [14]
>  >
>  > 61.  Playback controls (repeat, skip forward, skip backwards, not
>  > just by time but by spoken-language segments like words, sentences,
>  > and paragraphs). [14]
>  >
>  > 62.  A user agent needs to provide clear indication to the user
>  > whenever it is using a microphone to listen to the user. [14]
>  >
>  > 63.  Ability of users to explicitly grant permission for the
>  > browser, or an application, to listen to them. [14]
>  >
>  > 64.  Needs to be a way to have a trust relationship between the
>  > user and whatever processes their utterance. [14]
>  >
>  > 65.  Any user agent should work with any vendor's speech services,
>  > provided they meet specific open protocol requirements. [14]
>  >
>  > 66.  Grammars, TTS and media composition, and recognition results
>  > should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
>  >
>  > 67.  Ability to specify service capabilities and hints. [14]
>  >
>  > 68.  Ability to enable multiple languages/dialects for the same
>  > page. [15]
>  >
>  > 69.  It is critical that the markup support specification of a
>  > network speech resource to be used for recognition or synthesis. [16]
>  >
>  > 70.  End users need a way to adjust properties such as
>  > timeouts. [17]
>  >
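>  > To make a few of these concrete, three minimal sketches follow;
>  > they are illustrations only, not proposed designs. First, for
>  > requirements 8 and 32, a VoiceXML interpreter can fetch a document
>  > with XMLHttpRequest and walk its DOM (app.vxml and interpretForm are
>  > placeholders):
>  >
>  >   var xhr = new XMLHttpRequest();
>  >   xhr.open("GET", "app.vxml", true);  // app.vxml is a placeholder URL
>  >   xhr.onreadystatechange = function () {
>  >     if (xhr.readyState === 4 && xhr.status === 200) {
>  >       // responseXML is a DOM tree, assuming the server sends the
>  >       // VoiceXML document with an XML media type.
>  >       var forms = xhr.responseXML.getElementsByTagName("form");
>  >       for (var i = 0; i < forms.length; i++)
>  >         interpretForm(forms[i]);  // placeholder for the JS interpreter
>  >     }
>  >   };
>  >   xhr.send();
>  >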
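>  > Second, for requirements 12-15, how recognition results might
>  > reach the page. The speech and grammar attributes, the
>  > speechrecognized and speecherror events, and the results list are
>  > all hypothetical names, not an actual API:
>  >
>  >   <!-- the grammar attribute is optional, per requirement 15 -->
>  >   <input type="text" id="q" speech grammar="search.grxml">
>  >
>  >   var q = document.getElementById("q");
>  >   // Requirement 12: notify the app when an utterance is recognized.
>  >   q.addEventListener("speechrecognized", function (e) {
>  >     // Requirement 14: an n-best list of recognition hypotheses.
>  >     for (var i = 0; i < e.results.length; i++)
>  >       console.log(e.results[i].utterance, e.results[i].confidence);
>  >   }, false);
>  >   // Requirement 13: notify the app of recognition errors.
>  >   q.addEventListener("speecherror", function (e) {
>  >     console.log("recognition error: " + e.code);
>  >   }, false);
>  >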
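>  > Third, for requirement 20, how synthesis timing might surface; the
>  > tts element, its boundary event, and highlightWord are again
>  > hypothetical:
>  >
>  >   var reader = document.getElementById("reader");  // hypothetical <tts> element
>  >   // An event per word would let a visual rendition highlight the
>  >   // word being spoken, keep other modalities in sync, and record
>  >   // where a barge-in interrupted the output.
>  >   reader.addEventListener("boundary", function (e) {
>  >     highlightWord(e.charIndex, e.charLength);  // page's own display code
>  >   }, false);
>  >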
>  > References:
>  >
>  > 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
>  > [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and
>  > repeated in
>  > http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
>  >
>  > 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
>  >
>  > 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
>  >
>  > 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
>  >
>  > 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
>  >
>  > 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
>  >
>  > 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
>  > [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
>  >
>  > 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
>  >
>  > 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
>  >
>  > 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
>  >
>  > 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
>  >
>  > 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
>  >
>  > 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
>  >
>  > 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
>  >
>  > 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
>  >
>  > 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
>  >
>  > 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
>  >
>
> --
> Best Regards,
> --raman
>
> Title:  Research Scientist
> Email:  raman@google.com
> WWW:    http://emacspeak.sf.net/raman/
> Google: tv+raman
> GTalk:  raman@google.com
> PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
>
>

Received on Thursday, 23 September 2010 16:49:21 UTC