
Re: Interacting with WebRTC, the Web Audio API and other external sources

From: Peter Beverloo <beverloo@google.com>
Date: Thu, 19 Jul 2012 15:38:14 +0100
Message-ID: <CALt3x6mBdV_prNfB6bDig47c-DCXAgKBP+_h2Y8=Pz4M-JXQDg@mail.gmail.com>
To: public-speech-api@w3.org
With all major browser vendors being members of the WebRTC Working Group,
it may actually be worth considering slimming down the APIs and re-using
the interfaces that group will provide.

As an addendum to the quoted proposal:

* Drop the "start", "stop" and "abort" methods from the SpeechRecognition
object in favor of an input MediaStream acquired through getUserMedia()[1].

Alternatively, the three methods could be re-purposed to allow partial or
timed recognition of continuous media streams, rather than recognition of
the whole stream.
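
For illustration, a rough sketch of what the first option could look like
(the "inputStream" attribute is hypothetical and not part of the current
draft; vendor prefixes on getUserMedia are omitted):

  navigator.getUserMedia({ audio: true }, function (stream) {
    var recognition = new SpeechRecognition();
    recognition.inputStream = stream;  // hypothetical: recognize this stream
    // recognition would presumably run for as long as the stream is live
    recognition.onresult = function (event) {
      // handle recognition results as usual
    };
  }, function (error) {
    // the user declined, or no capture device is available
  });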

Best,
Peter

[1]
http://dev.w3.org/2011/webrtc/editor/getusermedia.html#navigatorusermedia


On Wed, Jun 13, 2012 at 3:49 PM, Peter Beverloo <beverloo@google.com> wrote:

> Currently, the SpeechRecognition[1] interface defines three methods to
> start, stop or abort speech recognition; the audio source is an input
> device controlled by the user agent. Similarly, the TextToSpeech (TTS)
> interface defines play, pause and stop, which output the generated speech
> to a device that is, again, controlled by the user agent.
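>
> For reference, a minimal sketch of that model (result handling omitted;
> the capture device is chosen entirely by the user agent):
>
>   var recognition = new SpeechRecognition();
>   recognition.start();  // capture begins from a user-agent-controlled device
>   // ... later ...
>   recognition.stop();   // or recognition.abort();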
>
> There are various other media and interaction APIs in development right
> now, and I believe it would be good for the Speech API to more tightly
> integrate with them. In this e-mail, I'd like to focus on some additional
> features for integration with WebRTC and the Web Audio API.
>
> ** WebRTC <http://dev.w3.org/2011/webrtc/editor/webrtc.html>
>
> WebRTC provides the ability to interact with the user's microphone and
> camera through the getUserMedia() method. As such, an important use-case is
> (video and) audio chatting between two or more people. Audio is available
> through a MediaStream object, which can be re-used to power, for example,
> an <audio> element, be transmitted to other parties through a peer-to-peer
> connection, or be fed into the Web Audio API through an AudioContext's
> createMediaStreamSource() method.
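>
> As a rough sketch of that combination (vendor prefixes omitted):
>
>   navigator.getUserMedia({ audio: true }, function (stream) {
>     var context = new AudioContext();
>     var source = context.createMediaStreamSource(stream);
>     source.connect(context.destination);  // e.g. route the microphone to output
>   }, function (error) {
>     // the user declined, or no capture device is available
>   });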
>
> ** Web Audio API <
> https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>
>
> The Web Audio API provides the ability to process, analyze, synthesize and
> modify audio through JavaScript. It can take its input from media files
> fetched through XMLHttpRequest, from media elements such as <audio> and
> <video>, and from any other system, including WebRTC, that is able to
> provide an audio-carrying MediaStream.
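>
> For instance, an existing <audio> element can be used as the input
> (again only a sketch, prefixes omitted):
>
>   var context = new AudioContext();
>   var element = document.querySelector('audio');
>   var source = context.createMediaElementSource(element);
>   source.connect(context.destination);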
>
> Since speech recognition and synthesis do not have to be limited to live
> input from, and output to, the user, I'd like to present two new use-cases.
>
> 1) Transcripts for (live) communication.
>
> While the specification does not mandate a maximum duration of a speech
> input stream, this suggestion is most appropriate for implementations
> utilizing a local recognizer. Allowing MediaStreams to be used as an input
> for a SpeechRecognition object, for example through a new "inputStream"
> property as an alternative to the start, stop and abort methods, would
> enable authors to supply external input to be recognized. This may include,
> but is not limited to, prerecorded audio files and WebRTC live streams,
> both from local and remote parties.
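>
> To illustrate, a transcript of the remote party in a WebRTC call could
> then be built along these lines (the "inputStream" attribute is the
> proposed addition; everything else follows the current drafts):
>
>   peerConnection.onaddstream = function (event) {
>     var recognition = new SpeechRecognition();
>     recognition.continuous = true;            // keep recognizing the call
>     recognition.inputStream = event.stream;   // proposed attribute
>     recognition.onresult = function (e) {
>       // append the recognized text to the on-screen transcript
>     };
>   };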
>
> 2) Storing and processing text-to-speech fragments.
>
> Rather than mandating immediate output of the synthesized audio stream,
> consideration should be given to an "outputStream" property on a
> TextToSpeech object which provides a MediaStream. This would allow the
> synthesized stream to be played through an <audio> element, processed
> through the Web Audio API, or even stored locally for caching, in case the
> user is on a device which is not always connected to the internet (and no
> local synthesizer is available). Furthermore, it would allow websites to
> store the synthesized audio as a wave file on the server, so it can be
> re-used by user agents or other clients which do not provide an
> implementation.
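>
> To illustrate (the "outputStream" attribute is the proposed addition;
> obtaining and configuring the TextToSpeech object is left schematic):
>
>   // Assume "tts" is a TextToSpeech object set up with text to speak.
>   var stream = tts.outputStream;  // proposed attribute
>
>   // Play the synthesized audio through a regular <audio> element...
>   audioElement.src = URL.createObjectURL(stream);
>
>   // ...or feed it into the Web Audio API for further processing.
>   var source = audioContext.createMediaStreamSource(stream);
>   source.connect(audioContext.destination);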
>
> The Web platform gains its power from the ability to combine technologies,
> and I think it would be great to see the Speech API playing a role in that.
>
> Best,
> Peter
>
> [1]
> http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section
>
Received on Thursday, 19 July 2012 14:38:49 GMT
