Voice on the Web - message from the Voice Browser/Multimodal Interaction WGs

Dear HTML5 Working Group,

The purpose of this message is to initiate discussion between the HTML
Working Group and the W3C Working Groups involved with voice -- the Voice
Browser and Multimodal Interaction Working Groups. Paul Cotton has 
encouraged us to start this discussion on the HTML mailing list in order 
to collect use cases for how voice can be used in HTML applications and to
collect your ideas and requirements about the best ways for HTML authors
to access voice capabilities. 

By "voice" we mean capabilities such as speech recognition, text to speech,
speaker verification (confirming someone's identity through their voice),
audio capture and audio playback, and the ability to coordinate all these
capabilities by means of a dialog.

Some possible voice use cases that occur to us include:
1. form-filling by voice; that is, speaking form values rather than typing
   or selecting them with a mouse
2. initiating a search (for example, a web search, site search or page
   search) by speaking the search terms rather than typing them
3. using text to speech to read portions of a screen (for example, if the
   user's eyes are busy or if the user is illiterate or dyslexic)
4. using voice for general text input on mobile devices with hard-to-use
   keyboards
5. using speaker verification to confirm the user's identity, for example as
   a supplement to a user ID and password
6. combinations of the above, for example, selecting part of the screen and
   saying "read that"

We are very interested in hearing your reactions to these use cases and any
other use cases that you might be thinking about.

An important consideration for voice applications is where the actual speech
technology comes from. Some platforms, such as Windows 7, have speech
recognition built into the OS, and this is true even of some small mobile
devices. Another option is for speech to be built into the browser; in the
past, some browsers (for example, Opera and IE) have included speech
recognition and text to speech. Speech technologies are also
available in the cloud, using either standard protocols like MRCP, or other
services such as the AT&T Speech Mashup or MIT's WAMI. In fact, in typical
voice-only applications the browser itself runs in the network because the
application must be accessible from very limited input devices such as
traditional land-line phones. There are pros and cons to all these
approaches that we would be happy to discuss if there is interest. We are
hoping to get your opinions about which of these are the most critical to
support. 

Finally, regardless of where the speech processing is actually done, we are
also very interested in discussing requirements for different ways authors
could access speech functionality from HTML. Two possibilities, although
there may be others, are declarative markup and JavaScript libraries that
link to speech services.

Kazuyuki Ashimura
W3C Multimodal Interaction & Voice Browser Activity Lead

-- 
Kazuyuki Ashimura / W3C Multimodal & Voice Activity Lead
mailto: ashimura@w3.org
voice: +81.466.49.1170 / fax: +81.466.49.1171

Received on Wednesday, 24 March 2010 18:23:54 UTC