- From: Olli Pettay <Olli.Pettay@helsinki.fi>
- Date: Thu, 23 Sep 2010 20:10:14 +0300
- To: public-xg-htmlspeech@w3.org
On 09/23/2010 06:22 PM, T.V Raman wrote:
> Good job Michael!
>
> Next step -- at this point we've pooled all the requirements of the
> last 10+ years of the MMIWG, plus a few additional ones to boot from
> the VBWG.
>
> Now, given that those have not been addressed by a single solution in
> 10+ years, I believe it would be both naive and extremely egotistic
> of this XG to try to address all of them at one fell swoop -- we'll
> be here another 10 years -- during which time the requirements will
> only increase.
>
> I urge everyone to take a deep breath, then proceed in small
> practical steps toward building things that the Web needs today.

+1

...which is why I hope we end up with some simple API to control ASR
and TTS. Mini-SALT? If we had that kind of API, script libraries could
then add support for binding ASR/TTS to HTML form elements etc. (A
rough sketch of what I mean is at the bottom of this message.)

-Olli

> Michael Bodell writes:
>> In order to make more structured progress on addressing all the
>> requirements and use cases sent to the list, I've collated them into
>> one comprehensive set in the order they were received. If anyone has
>> use cases or requirements that they didn't send yet, or that they
>> don't see in this list, please send them by Monday the 27th. I've
>> tried to be exhaustive here to be complete and fair, and have not
>> worried at all if some of the requirements are similar to one
>> another or if other requirements are exact opposites. I'll work on a
>> more organized representation of both the additional information
>> sent and this list of requirements and use cases next week.
>>
>> 1. Web search by voice: Speak a search query, and get search
>> results. [1]
>>
>> 2. Speech translation: The app works as an interpreter between two
>> users that speak different languages. [1]
>>
>> 3. Speech-enabled webmail client, e.g. for in-car use. Reads out
>> e-mails and listens for commands, e.g. "archive", "star", "reply,
>> ok, let's meet at 2 pm", "forward to bob". [1]
>>
>> 4. Speech shell: Allows multiple commands, most of which take
>> arguments, some of which are free-form. E.g. "call <number>",
>> "call <contact>", "calculate <arithmetic expression>", "search
>> for <query>". [1]
>>
>> 5. Turn-by-turn navigation: Speaks driving instructions, and
>> accepts spoken commands, e.g. "navigate to <address>", "navigate
>> to <contact name>", "navigate to <business name>", "reroute",
>> "suspend navigation". [1]
>>
>> 6. Dialog systems, e.g. flight booking, pizza ordering. [1]
>>
>> 7. Multimodal interaction: Say "I want to go here", and click on a
>> map. [1]
>>
>> 8. VoiceXML interpreter: Fetches a VoiceXML app using
>> XMLHttpRequest, and interprets it using JavaScript and DOM. [1]
>>
>> 9. The HTML+Speech standard must allow specification of the speech
>> resource (e.g. speech recognizer) to be used for processing of the
>> audio collected from the user. [2]
>>
>> 10. The ability to switch from grammar-based recognition to
>> free-form recognition. [3]
>>
>> 11. Ability to specify field relationships. For example, when a
>> country field is selected, the state field selections change, so
>> the corresponding grammar/choices should also be changed. [3]
>>
>> 12. The API must notify the web app when a spoken utterance has
>> been recognized. [4]
>>
>> 13. The API must notify the web app on speech recognition
>> errors. [4]
>>
>> 14. The API should provide access to a list of speech recognition
>> hypotheses. [4]
>>
>> 15. The API should allow, but not require, specifying a grammar
>> for the speech recognizer to use. [4]
>>
>> 16. The API should allow specifying the natural language in which
>> to perform speech recognition. This will override the language of
>> the web page. [4]
>>
>> 17. For privacy reasons, the API should not allow web apps access
>> to raw audio data but only provide recognition results. [4]
>>
>> 18. For privacy reasons, speech recognition should only be started
>> in response to a user action. [4]
>>
>> 19. Web app developers should not have to run their own speech
>> recognition services. [4]
>>
>> 20. Provide the temporal structure of synthesized speech, e.g. to
>> highlight the current word in a visual rendition of the speech, to
>> synchronize with other modalities in a multimodal presentation, or
>> to know when to interrupt. [5]
>>
>> 21. Allow streaming for longer stretches of spoken output. [5]
>>
>> 22. Use full SSML features including gender, language,
>> pronunciations, etc. [5]
>>
>> 23. Web app developers should not be excluded from running their
>> own speech recognition services. [6]
>>
>> 24. End users should not be prevented from creating new grammars
>> or extending existing ones, on both a global and a per-application
>> basis. [6]
>>
>> 25. End-user extensions should be accessible either from the
>> desktop or from the cloud. [6]
>>
>> 26. For reasons of privacy, the user should not be forced to store
>> anything about their speech recognition environment in the
>> cloud. [6]
>>
>> 27. Any public interfaces for creating extensions should be
>> "speakable". [6]
>>
>> 28. TTS in speech translation: The app works as an interpreter
>> between two users that speak different languages. [7]
>>
>> 29. TTS in a speech-enabled webmail client, e.g. for in-car use.
>> Reads out e-mails and listens for commands, e.g. "archive", "star",
>> "reply, ok, let's meet at 2 pm", "forward to bob". [7]
>>
>> 30. TTS in turn-by-turn navigation: Speaks driving instructions,
>> and accepts spoken commands, e.g. "navigate to <address>",
>> "navigate to <contact name>", "navigate to <business name>",
>> "reroute", "suspend navigation". [7]
>>
>> 31. TTS in dialog systems, e.g. flight booking, pizza
>> ordering. [7]
>>
>> 32. TTS in a VoiceXML interpreter: Fetches a VoiceXML app using
>> XMLHttpRequest, and interprets it using JavaScript and DOM. [7]
>>
>> 33. A developer creating a (multimodal) interface combining speech
>> input with graphical output needs the ability to provide a
>> consistent user experience not just for graphical elements but
>> also for voice. [8]
>>
>> 34. Hello-world example. [9]
>>
>> 35. Basic VCR-like text reader example. [9]
>>
>> 36. Free-form collector example. [9]
>>
>> 37. Grammar-based collector example. [9]
>>
>> 38. User-selected recognizer. [10]
>>
>> 39. User-controlled speech parameters. [10]
>>
>> 40. Make it easy to integrate input from different
>> modalities. [10]
>>
>> 41. Allow an author to specify an application-specific statistical
>> language model. [10]
>>
>> 42. Make the use of speech optional. [10]
>>
>> 43. Support for completely hands-free operation. [10]
>>
>> 44. Make the standard easy to extend. [10]
>>
>> 45. Selection of the speech engine should be a user setting in the
>> browser, not a web developer setting. [11]
>>
>> 46. It should be possible to specify a target TTS engine not only
>> via the "URI" attribute, but via a more generic "source" attribute,
>> which can point to a local TTS engine as well. [12]
>>
>> 47. TTS should provide the user, or developer, with finer
>> granularity of control over the text segments being
>> synthesized. [13]
>>
>> 48. Interacting with multiple input elements. [14]
>>
>> 49. Interacting without visible input elements. [14]
>>
>> 50. Re-recognition. [14]
>>
>> 51. Continuous recognition. [14]
>>
>> 52. Voice activity detection. [14]
>>
>> 53. Minimize user-perceived latency. [14]
>>
>> 54. High-quality default, but application-customizable, speech
>> recognition graphical user interface. [14]
>>
>> 55. Rich recognition results allowing analysis and complex
>> expression (i.e., confidence, alternatives, structured
>> output). [14]
>>
>> 56. Ability to specify domain-specific grammars. [14]
>>
>> 57. Web author able to write one speech experience that performs
>> identically across user agents and/or devices. [14]
>>
>> 58. Synthesis that is synchronized with other media (in
>> particular, visual display). [14]
>>
>> 59. Ability to effect barge-in (interrupt synthesis). [14]
>>
>> 60. Ability to mitigate false barge-in scenarios. [14]
>>
>> 61. Playback controls (repeat, skip forward, skip backwards, not
>> just by time but by spoken-language segments like words,
>> sentences, and paragraphs). [14]
>>
>> 62. A user agent needs to provide clear indication to the user
>> whenever it is using a microphone to listen to the user. [14]
>>
>> 63. Ability of users to explicitly grant permission for the
>> browser, or an application, to listen to them. [14]
>>
>> 64. There needs to be a way to have a trust relationship between
>> the user and whatever processes their utterance. [14]
>>
>> 65. Any user agent should work with any vendor's speech services,
>> provided they meet specific open protocol requirements. [14]
>>
>> 66. Grammars, TTS and media composition, and recognition results
>> should use standard formats (e.g. SRGS, SSML, SMIL, EMMA). [14]
>>
>> 67. Ability to specify service capabilities and hints. [14]
>>
>> 68. Ability to enable multiple languages/dialects for the same
>> page. [15]
>>
>> 69. It is critical that the markup support specification of a
>> network speech resource to be used for recognition or
>> synthesis. [16]
>>
>> 70. End users need a way to adjust properties such as
>> timeouts. [17]
>>
>> References:
>>
>> 1 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0001.html
>> [referencing https://docs.google.com/View?id=dcfg79pz_5dhnp23f5 and
>> repeated in
>> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0043.html]
>>
>> 2 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0007.html
>>
>> 3 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0011.html
>>
>> 4 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0012.html
>>
>> 5 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0014.html
>>
>> 6 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0015.html
>>
>> 7 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0018.html
>> [referencing http://docs.google.com/View?id=dcfg79pz_4gnmp96cz]
>>
>> 8 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0024.html
>>
>> 9 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0029.html
>>
>> 10 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0032.html
>>
>> 11 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0035.html
>>
>> 12 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0041.html
>>
>> 13 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0044.html
>>
>> 14 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0046.html
>>
>> 15 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0047.html
>>
>> 16 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0048.html
>>
>> 17 - http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0049.html
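P.S. To make the "Mini-SALT" idea above a little more concrete, here is
the rough sketch I promised. It is purely illustrative: none of these
names (SpeechSynthesizer, SpeechRecognizer, their properties and event
handlers) exist in any spec or implementation; they are all invented
for this example.

  // Purely hypothetical sketch -- every identifier here is made up.
  // TTS: speak an SSML prompt, track marks, listen when playback ends.
  var tts = new SpeechSynthesizer();           // invented object
  tts.onmark = function (e) {
    // e.name could drive word highlighting etc. (cf. req. 20)
  };
  tts.onend = function () { listen(); };       // fires when playback finishes
  tts.speak('<speak xml:lang="en-US">Which <mark name="m1"/>topping ' +
            'would you like?</speak>');        // SSML input (cf. req. 22)

  // ASR: recognize one utterance against an optional grammar.
  function listen() {
    var reco = new SpeechRecognizer();         // invented object
    reco.grammar = "toppings.grxml";           // optional SRGS grammar (cf. req. 15)
    reco.onresult = function (event) {
      // A script library could bind results like this to arbitrary
      // HTML form elements (cf. req. 12, 14).
      document.getElementById("topping").value = event.results[0].text;
    };
    reco.onerror = function (event) {          // recognition errors (cf. req. 13)
      window.alert("recognition error: " + event.code);
    };
    reco.start();  // called only after an explicit user action (cf. req. 18)
  }

With primitives like these in place, form-element binding, dialog flow,
and the rest of the multimodal glue could live in script libraries
rather than in the markup itself.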
Received on Thursday, 23 September 2010 17:10:47 UTC