RE: Interacting with WebRTC, the Web Audio API and other external sources from Adam Sobieski on 2012-09-15 (public-speech-api@w3.org from September 2012)

From: Adam Sobieski <adamsobieski@hotmail.com>
Date: Sat, 15 Sep 2012 21:36:30 +0000
To: Peter Beverloo <beverloo@google.com>
CC: "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <SNT002-W101C4D67E304FF7A59DBA3CC5970@phx.gbl>
Speech API Community Group,

Greetings. I was reading about recent developments in the WebRTC stack, and I wanted to point out that, as the WebRTC stack becomes available, it could be useful to the Speech API.

WebRTC includes the MediaStream, DataChannel, and PeerConnection interfaces. In addition to video calls and video conferencing, video forums and scenarios that stream content to media repository services become possible. Some technologists are additionally excited about 3D video and microphone array functionality.

Speech recognition can facilitate numerous technologies and features, including generating hypertext transcripts, computers as teleprompters, and other human-computer interaction and user interface topics pertaining to web-based multimedia blogging.

Kind regards,
Adam Sobieski

Date: Thu, 19 Jul 2012 15:38:14 +0100
From: beverloo@google.com
To: public-speech-api@w3.org
Subject: Re: Interacting with WebRTC, the Web Audio API and other external sources

With all major browser vendors being members of the WebRTC working group, it may actually be worth considering slimming down the APIs and re-using the interfaces they will provide.
As an addendum to the quoted proposal:

* Drop the "start", "stop" and "abort" methods from the SpeechRecognition object in favor of an input MediaStream acquired through getUserMedia()[1].

Alternatively, the three methods could be re-purposed to allow partial or timed recognition of continuous media streams, rather than recognition of the whole stream.
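
As a rough illustration of the timed-recognition alternative, the sketch below recognizes only a fixed window of a continuous stream. It rests on assumptions: the "inputStream" property name is hypothetical, start() and stop() are given the re-purposed meaning suggested above, and recognizeWindow and its schedule parameter are names made up for this example.

```javascript
// Sketch only: "inputStream" is a hypothetical property, and start()/stop()
// are used with the re-purposed meaning suggested above (bounding a
// recognition window), not with their currently specified behaviour.
function recognizeWindow(recognition, stream, durationMs, schedule = setTimeout) {
  recognition.inputStream = stream; // hypothetical: feed an existing MediaStream
  recognition.start();              // open the recognition window
  // Close the window after durationMs; the stream itself keeps running.
  return schedule(() => recognition.stop(), durationMs);
}
```

The schedule parameter defaults to setTimeout and exists only so that the timing can be replaced, for example in a test.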
Best,
Peter

[1] http://dev.w3.org/2011/webrtc/editor/getusermedia.html#navigatorusermedia


On Wed, Jun 13, 2012 at 3:49 PM, Peter Beverloo <beverloo@google.com> wrote:

Currently, the SpeechRecognition[1] interface defines three methods to start, stop or abort speech recognition, the source of which will be an audio input device as controlled by the user agent. Similarly, the TextToSpeech (TTS) interface defines play, pause and stop, which will output the generated speech to an output device, again, as controlled by the user agent.




There are various other media and interaction APIs in development right now, and I believe it would be good for the Speech API to more tightly integrate with them. In this e-mail, I'd like to focus on some additional features for integration with WebRTC and the Web Audio API.




** WebRTC <http://dev.w3.org/2011/webrtc/editor/webrtc.html>
WebRTC provides the ability to interact with the user's microphone and camera through the getUserMedia() method. As such, an important use-case is audio (and video) chat between two or more people. Audio is available through a MediaStream object, which can be re-used to power, for example, an <audio> element, or be transmitted to other people through a peer-to-peer connection. It can also integrate with the Web Audio API through an AudioContext's createMediaStreamSource() method.
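
The path from getUserMedia() to the Web Audio API can be sketched roughly as follows. This assumes the promise-based navigator.mediaDevices.getUserMedia() form found in current browsers, which differs from the callback-style method in the 2012 draft; connectMicrophone is a name made up for this example.

```javascript
// Sketch: route microphone audio from getUserMedia() into a Web Audio graph.
// `mediaDevices` is expected to be navigator.mediaDevices and `audioContext`
// an AudioContext; both are passed in so the function uses no hidden globals.
function connectMicrophone(mediaDevices, audioContext) {
  // Ask the user agent for an audio-only MediaStream.
  return mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    // Wrap the stream in a source node that other audio nodes can connect to.
    const source = audioContext.createMediaStreamSource(stream);
    return { stream, source };
  });
}
```

In a browser this would be called as connectMicrophone(navigator.mediaDevices, new AudioContext()), and the user agent would prompt for microphone permission.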



** Web Audio API <https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>
The Web Audio API provides the ability to process, analyze, synthesize and modify audio through JavaScript. It can get its input from media files through XMLHttpRequest, from media elements such as <audio> and <video>, and from any other system able to provide an audio-based MediaStream, which includes WebRTC.



Since speech recognition and synthesis do not have to be limited to live input from and output to the user, I'd like to present two new use-cases.
1) Transcripts for (live) communication.
While the specification does not mandate a maximum duration of a speech input stream, this suggestion is most appropriate for implementations utilizing a local recognizer. Allowing MediaStreams to be used as an input for a SpeechRecognition object, for example through a new "inputStream" property as an alternative to the start, stop and abort methods, would enable authors to supply external input to be recognized. This may include, but is not limited to, prerecorded audio files and WebRTC live streams, both from local and remote parties.
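
A minimal sketch of how the proposed "inputStream" property might be used for transcripts follows. The property is hypothetical and exists only in this proposal; the shape of the result event (resultIndex, results, transcript, isFinal) is assumed to follow the Web Speech API draft, and transcribeStream is a name made up for this example.

```javascript
// Sketch of the proposed "inputStream" property. The property is hypothetical:
// it comes from this proposal, not from any shipped SpeechRecognition object.
function transcribeStream(recognition, stream, onTranscript) {
  // Hypothetical: recognition begins once a stream is supplied, so no
  // start() call is made here.
  recognition.inputStream = stream;
  recognition.onresult = (event) => {
    // Report the top alternative of every result added by this event.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      onTranscript(result[0].transcript, result.isFinal);
    }
  };
  return recognition;
}
```

The stream argument could be a local microphone stream or a remote party's stream from a peer-to-peer connection, which is what would make transcripts of a live call possible.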



2) Storing and processing text-to-speech fragments.
Rather than mandating immediate output of the synthesized audio stream, consider introducing an "outputStream" property on a TextToSpeech object which provides a MediaStream object. This allows the synthesized stream to be played through the <audio> element, processed through the Web Audio API, or even stored locally for caching, in case the user is using a device which is not always connected to the internet (and when no local synthesizer is available). Furthermore, this would allow websites to store the synthesized audio in a wave file and save it on the server, allowing it to be re-used by user agents or other clients which do not provide an implementation.
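
The two uses described here, playback and storage, might look roughly like the sketch below. The "outputStream" property is hypothetical and exists only in this proposal. The srcObject property and the MediaRecorder interface are real browser APIs, but both postdate this thread, and MediaRecorder typically produces a compressed container rather than a wave file. The function names are made up for this example.

```javascript
// Sketch of the proposed "outputStream" property (hypothetical, from this
// proposal). `tts` stands in for a TextToSpeech object exposing a MediaStream.

// Play the synthesized speech through an <audio> element.
function playSynthesized(tts, audioElement) {
  audioElement.srcObject = tts.outputStream;
  return audioElement.play();
}

// Collect the synthesized audio so it can be cached or uploaded later.
// `RecorderClass` is expected to be the browser's MediaRecorder constructor.
function recordSynthesized(tts, RecorderClass, onDone) {
  const recorder = new RecorderClass(tts.outputStream);
  const chunks = [];
  recorder.ondataavailable = (event) => chunks.push(event.data);
  recorder.onstop = () => onDone(chunks); // chunks are Blobs in a browser
  recorder.start();
  return recorder; // the caller invokes stop() once synthesis has finished
}
```

In a browser, recordSynthesized(tts, MediaRecorder, save) would hand the recorded chunks to a hypothetical save callback, which could combine them into a single Blob and send it to the server.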



The Web platform gains its power by the ability to combine technologies, and I think it would be great to see the Speech API playing a role in that.
Best,
Peter



[1] http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section
Received on Saturday, 15 September 2012 21:36:58 UTC