Interacting with WebRTC, the Web Audio API and other external sources from Peter Beverloo on 2012-06-13 (public-speech-api@w3.org from June 2012)

From: Peter Beverloo <beverloo@google.com>
Date: Wed, 13 Jun 2012 15:49:56 +0100
To: public-speech-api@w3.org
Message-ID: <CALt3x6nPDnDni5A-btan38acX1i6T=e_y-VxDgeWh=_6MdVcOg@mail.gmail.com>

Currently, the SpeechRecognition[1] interface defines three methods to
start, stop or abort speech recognition, the source of which will be an
audio input device as controlled by the user agent. Similarly, the
TextToSpeech (TTS) interface defines play, pause and stop, which will
output the generated speech to an output device, again, as controlled by
the user agent.

There are various other media and interaction APIs in development right
now, and I believe it would be good for the Speech API to more tightly
integrate with them. In this e-mail, I'd like to focus on some additional
features for integration with WebRTC and the Web Audio API.

** WebRTC <http://dev.w3.org/2011/webrtc/editor/webrtc.html>

WebRTC provides the ability to interact with the user's microphone and
camera through the getUserMedia() method. As such, an important use-case is
audio (and video) chat between two or more people. Audio is available
through a MediaStream object, which can be re-used to power, for example,
an <audio> element, or be transmitted to other people through a
peer-to-peer connection. It can also feed into the Web Audio API through
an AudioContext's createMediaStreamSource() method.
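
As a rough sketch of that last step: the helper below routes a MediaStream
into a Web Audio graph. The names routeStreamToAnalyser and
analyseMicrophone are made up for illustration, and the usage function
assumes a current browser that exposes the promise-based
navigator.mediaDevices.getUserMedia(); the 2012 drafts used a
callback-based, often vendor-prefixed variant.

```javascript
// Sketch: feed a MediaStream into a Web Audio graph.
// `routeStreamToAnalyser` is an illustrative name, not part of any spec.
function routeStreamToAnalyser(audioContext, stream) {
  // Wrap the stream in a source node the audio graph can consume.
  const source = audioContext.createMediaStreamSource(stream);
  // An AnalyserNode is one example of a downstream processing node.
  const analyser = audioContext.createAnalyser();
  source.connect(analyser);
  return { source, analyser };
}

// Browser-only usage (assumption: promise-based getUserMedia is available).
async function analyseMicrophone() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  return routeStreamToAnalyser(audioContext, stream);
}
```

The stream is passed in as a parameter so the same helper would work for
a remote WebRTC stream as well as for the local microphone.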

** Web Audio API <https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>

The Web Audio API provides the ability to process, analyze, synthesize and
modify audio through JavaScript. It can get its input from media files
fetched through XMLHttpRequest, from media elements such as <audio> and
<video>, and from any other system, including WebRTC, that is able to
provide an audio-based MediaStream.

Since speech recognition and synthesis do not have to be limited to live
input from and output to the user, I'd like to present two new use-cases.

1) Transcripts for (live) communication.

While the specification does not mandate a maximum duration of a speech
input stream, this suggestion is most appropriate for implementations
utilizing a local recognizer. Allowing MediaStreams to be used as an input
for a SpeechRecognition object, for example through a new "inputStream"
property as an alternative to the start, stop and abort methods, would
enable authors to supply external input to be recognized. This may include,
but is not limited to, prerecorded audio files and WebRTC live streams,
both from local and remote parties.
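
To make the idea concrete, here is a minimal sketch of how an author
might use such a property. Note that "inputStream" is only the proposal
made in this e-mail and is not defined by the Speech API draft. The helper
name recognizeFrom and the assumption that assigning the property is
enough to begin recognition are mine as well, so the code feature-detects
and falls back to the existing start() method.

```javascript
// Sketch of the proposed "inputStream" property (hypothetical, not in
// the Speech API draft). `recognizeFrom` is an illustrative helper name.
function recognizeFrom(recognition, stream) {
  if ("inputStream" in recognition) {
    // Assumption: assigning a MediaStream makes the recognizer consume
    // it instead of the user-agent-controlled input device.
    recognition.inputStream = stream;
    return "stream";
  }
  // Existing behaviour: recognize from the default audio input device.
  recognition.start();
  return "device";
}
```

The stream argument could come from getUserMedia(), from a remote party
on a peer-to-peer connection, or from a media element playing a
prerecorded file.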

2) Storing and processing text-to-speech fragments.

Rather than mandating immediate output of the synthesized audio stream, I'd
suggest considering an "outputStream" property on a TextToSpeech object
which provides a MediaStream object. This allows the synthesized stream to
be played through the <audio> element, processed through the Web Audio API
or even stored locally for caching, in case the user is on a device which
is not always connected to the internet (and no local synthesizer is
available). Furthermore, this would allow websites to store the
synthesized audio in a wave file and save it on the server, so it can be
re-used for user agents or other clients which do not provide an
implementation.
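
As an illustration of the storage step only: the sketch below packs mono
PCM samples into a 16-bit wave file. It assumes the samples have already
been captured from the synthesized stream as a Float32Array in the range
[-1, 1] (for example through the Web Audio API). How to obtain them from
the proposed "outputStream" is not specified anywhere, and the name
encodeWav is hypothetical.

```javascript
// Sketch: encode mono float samples as a 16-bit PCM wave file.
// `encodeWav` is an illustrative name; capturing the samples from a
// MediaStream is out of scope here.
function encodeWav(samples, sampleRate) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // format 1 = linear PCM
  view.setUint16(22, 1, true);              // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The resulting ArrayBuffer could then be uploaded to the server or kept in
local storage for offline use.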

The Web platform gains its power from the ability to combine technologies,
and I think it would be great to see the Speech API playing a role in that.

Best,
Peter

[1] http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section

Received on Wednesday, 13 June 2012 14:50:33 UTC