RE: Interacting with WebRTC, the Web Audio API and other external sources

Adam,

I'm participating in the WebRTC work and hope that it can be made useful
to the Speech API.  One problem is that WebRTC relies on UDP, while I
understand from Milan that recognizers do better with TCP.  I don't know
if we'll be able to add TCP to the WebRTC work.  If not, at least we
will make sure that it is possible for the application to access the
user's speech input.  It can then construct its own socket and transmit
it to the ASR engine, if necessary.
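
For illustration, a rough TypeScript sketch of that fallback, assuming
the modern promise-based getUserMedia() and a hypothetical WebSocket
endpoint and wire format for the ASR engine:

  // Capture the user's microphone and relay encoded chunks to a
  // recognizer over the application's own (TCP-based) WebSocket.
  // The endpoint URL and message format are placeholders.
  async function streamSpeechToRecognizer(): Promise<void> {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const socket = new WebSocket('wss://asr.example.com/recognize');

    // MediaRecorder emits encoded audio chunks as they become available.
    const recorder = new MediaRecorder(stream);
    recorder.ondataavailable = (event) => {
      if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
        socket.send(event.data);
      }
    };

    socket.onopen = () => recorder.start(250);  // emit a chunk every 250 ms
    socket.onmessage = (event) => console.log('transcript:', event.data);
  }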

 

-          Jim

 

From: Adam Sobieski [mailto:adamsobieski@hotmail.com] 
Sent: Saturday, September 15, 2012 5:36 PM
To: Peter Beverloo
Cc: public-speech-api@w3.org
Subject: RE: Interacting with WebRTC, the Web Audio API and other
external sources

 

Speech API Community Group,
 
Greetings.  I was reading about recent developments with regard to the
WebRTC stack and wanted to note that, as the WebRTC stack becomes
available, it could be useful to the Speech API.  WebRTC includes the
MediaStream, DataChannel, and PeerConnection interfaces.
 
In addition to video calls and video conferencing, it enables video
forums and scenarios involving streaming content to media repository
services.  Some technologists are also excited about 3D video and
microphone array functionality.
 
Speech recognition can facilitate numerous technologies and features,
including the generation of hypertext transcripts, the use of computers
as teleprompters, and other human-computer interaction and user
interface topics pertaining to web-based multimedia blogging.
 
 
 
Kind regards,
 
Adam Sobieski
 

________________________________

Date: Thu, 19 Jul 2012 15:38:14 +0100
From: beverloo@google.com
To: public-speech-api@w3.org
Subject: Re: Interacting with WebRTC, the Web Audio API and other
external sources

With all major browser vendors being members of the WebRTC working
group, it may actually be worth considering slimming down the APIs and
re-using the interface they'll provide.

 

As an addendum to the quoted proposal:

 

* Drop the "start", "stop" and "abort" methods from the
SpeechRecognition object in favor of an input MediaStream acquired
through getUserMedia()[1].

 

Alternatively, the three methods could be re-purposed to allow
partial/timed recognition of continuous media streams, rather than
recognition of the whole stream.
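
To make the addendum concrete, here is a rough TypeScript sketch of how
the re-purposed methods might be used; the "inputStream" property is
only the suggestion above and does not exist in any draft, and an
unprefixed SpeechRecognition constructor is assumed:

  declare const SpeechRecognition: any;  // vendor-prefixed in today's browsers

  async function recognizeTimedWindow(): Promise<void> {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    const recognition = new SpeechRecognition();
    recognition.inputStream = stream;             // hypothetical property
    recognition.onresult = (event: any) =>
      console.log(event.results[0][0].transcript);

    recognition.start();                          // recognize the continuous stream...
    setTimeout(() => recognition.stop(), 10000);  // ...but only a ten-second slice
  }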

 

Best,

Peter

 

[1]
http://dev.w3.org/2011/webrtc/editor/getusermedia.html#navigatorusermedia

 

On Wed, Jun 13, 2012 at 3:49 PM, Peter Beverloo <beverloo@google.com>
wrote:

Currently, the SpeechRecognition[1] interface defines three methods to
start, stop or abort speech recognition, the source of which will be an
audio input device as controlled by the user agent. Similarly, the
TextToSpeech (TTS) interface defines play, pause and stop, which will
output the generated speech to an output device, again, as controlled by
the user agent.

 

There are various other media and interaction APIs in development right
now, and I believe it would be good for the Speech API to more tightly
integrate with them. In this e-mail, I'd like to focus on some
additional features for integration with WebRTC and the Web Audio API.

 

** WebRTC <http://dev.w3.org/2011/webrtc/editor/webrtc.html>

 

WebRTC provides the ability to interact with the user's microphone and
camera through the getUserMedia() method. As such, an important use-case
is (video and) audio chatting between two or more people. Audio is
available through a MediaStream object, which can be re-used to power,
for example, an <audio> element, be transmitted to other people through
a peer-to-peer connection, or integrate with the Web Audio API through
an AudioContext's createMediaStreamSource() method.
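
As a minimal sketch of that re-use, assuming the modern promise-based
getUserMedia(), one microphone stream can power both an <audio> element
and a Web Audio API graph:

  async function wireUpMicrophone(): Promise<void> {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // Play the captured audio through an <audio> element.
    const audio = document.createElement('audio');
    audio.srcObject = stream;
    audio.play();

    // Feed the same stream into a Web Audio API graph for analysis.
    const context = new AudioContext();
    const source = context.createMediaStreamSource(stream);
    const analyser = context.createAnalyser();
    source.connect(analyser);
  }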

 

** Web Audio API
<https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>

 

The Web Audio API provides the ability to process, analyze, synthesize
and modify audio through JavaScript. It can get its input from media
files retrieved through XMLHttpRequest, from media elements such as
<audio> and <video>, and from any other system, including WebRTC, that
is able to provide an audio-based MediaStream.
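
For instance, a short TypeScript sketch of two of those input paths,
assuming a same-origin file named "speech.wav" and an <audio> element
with id "clip" already on the page:

  const context = new AudioContext();

  // 1) A media file fetched through XMLHttpRequest, decoded into a buffer.
  const xhr = new XMLHttpRequest();
  xhr.open('GET', 'speech.wav');
  xhr.responseType = 'arraybuffer';
  xhr.onload = () => {
    context.decodeAudioData(xhr.response, (buffer) => {
      const fileSource = context.createBufferSource();
      fileSource.buffer = buffer;
      fileSource.connect(context.destination);
      fileSource.start();
    });
  };
  xhr.send();

  // 2) An existing media element routed through the audio graph.
  const element = document.querySelector('audio#clip') as HTMLAudioElement;
  const elementSource = context.createMediaElementSource(element);
  elementSource.connect(context.destination);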

 

Since speech recognition and synthesis do not have to be limited to
live input from and output to the user, I'd like to present two new
use-cases.

 

1) Transcripts for (live) communication.

 

While the specification does not mandate a maximum duration of a speech
input stream, this suggestion is most appropriate for implementations
utilizing a local recognizer. Allowing MediaStreams to be used as an
input for a SpeechRecognition object, for example through a new
"inputStream" property as an alternative to the start, stop and abort
methods, would enable authors to supply external input to be recognized.
This may include, but is not limited to, prerecorded audio files and
WebRTC live streams, both from local and remote parties.
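
A hypothetical TypeScript sketch of this use-case, transcribing the
remote party of a WebRTC call; the "inputStream" property is the
suggestion above and does not exist in the current draft, and an
unprefixed SpeechRecognition constructor is assumed:

  declare const SpeechRecognition: any;

  function transcribeRemoteParty(peerConnection: RTCPeerConnection): void {
    peerConnection.ontrack = (event) => {
      const recognition = new SpeechRecognition();
      recognition.continuous = true;
      recognition.inputStream = event.streams[0];  // hypothetical property
      recognition.onresult = (e: any) => {
        const latest = e.results[e.results.length - 1];
        console.log('transcript:', latest[0].transcript);
      };
    };
  }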

 

2) Storing and processing text-to-speech fragments.

 

Rather than mandating immediate output of the synthesized audio stream,
it would be worth considering an "outputStream" property on a
TextToSpeech object which provides a MediaStream object. This allows the
synthesized stream to be played through the <audio> element, processed
through the Web Audio API, or even stored locally for caching, in case
the user is using a device which is not always connected to the
internet (and when no local recognizer is available). Furthermore, this
would allow websites to store the synthesized audio to a wave file and
save it on the server, allowing it to be re-used for user agents or
other clients which do not provide an implementation.
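
A hypothetical TypeScript sketch of this use-case; the TextToSpeech
constructor and "outputStream" property shown here follow the shape
suggested above and exist in no draft, and MediaRecorder is used to
capture the stream (its container format is browser-dependent, so
producing an actual wave file would need a further conversion step):

  declare const TextToSpeech: any;

  const tts = new TextToSpeech();
  tts.text = 'Hello, world';
  tts.play();

  // Route the synthesized audio through an <audio> element...
  const audio = document.createElement('audio');
  audio.srcObject = tts.outputStream;            // hypothetical property
  audio.play();

  // ...or capture it so it can be cached locally or uploaded for re-use.
  const recorder = new MediaRecorder(tts.outputStream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (event) => chunks.push(event.data);
  recorder.onstop = () => {
    const blob = new Blob(chunks);
    // e.g. store the blob in IndexedDB or POST it to the server.
  };
  recorder.start();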

 

The Web platform gains its power by the ability to combine technologies,
and I think it would be great to see the Speech API playing a role in
that.

 

Best,

Peter

 

[1]
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section

 

Received on Monday, 17 September 2012 12:48:44 UTC