RE: Interacting with WebRTC, the Web Audio API and other external sources from Jim Barnett on 2012-09-18 (public-speech-api@w3.org from September 2012)

From: Jim Barnett <Jim.Barnett@genesyslab.com>
Date: Tue, 18 Sep 2012 05:30:30 -0700
To: "Young, Milan" <Milan.Young@nuance.com>, "Adam Sobieski" <adamsobieski@hotmail.com>, "Peter Beverloo" <beverloo@google.com>
Cc: <public-speech-api@w3.org>
Message-ID: <E17CAD772E76C742B645BD4DC602CD8106B5F93B@NAHALD.us.int.genesyslab.com>
Yes, it's certainly too late for this draft.  As I recall, we have a 'url' property somewhere that is supposed to allow the app to specify the location of the remote recognizer.  Maybe we could put in a note along the lines of: [Need to specify how access to remote recognition is going to work]

- Jim

 

From: Young, Milan [mailto:Milan.Young@nuance.com] 
Sent: Monday, September 17, 2012 6:54 PM
To: Jim Barnett; Adam Sobieski; Peter Beverloo
Cc: public-speech-api@w3.org
Subject: RE: Interacting with WebRTC, the Web Audio API and other external sources

 

Transport problems aside, the idea of integrating with getUserMedia() is a good one.  In addition to bringing unity across standards, it would let us push a good deal of the privacy/consent problems to that subgroup, which is more focused on that task.

 

Unfortunately, I think it's a bit too late for this effort to consider such a relatively major rewrite.  I suggest that we add this to the issue lists that Glen and Hans have offered to compile [1].

 

Speaking of that, I'd appreciate an update on how that is going.  It's been over a month now.  Glen and Hans, any progress to report?

 

[1] http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0026.html

From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com] 
Sent: Monday, September 17, 2012 5:49 AM
To: Adam Sobieski; Peter Beverloo
Cc: public-speech-api@w3.org
Subject: RE: Interacting with WebRTC, the Web Audio API and other external sources

 

Adam,

I'm participating in the WebRTC work and hope that it can be made useful to the SpeechAPI.  One problem is that WebRTC relies on UDP, while I understand from Milan that recognizers do better with TCP.  I don't know if we'll be able to add TCP to the WebRTC work.  If not, at least we will make sure that it is possible for the application to access the user's speech input.  It can then construct its own socket and transmit the audio to the ASR engine, if necessary.
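
As a rough illustration of that last point, the sketch below converts captured audio to 16-bit PCM and hands it to a socket-like object. It is only a sketch under assumptions that are not in the message above: the `BinarySender` interface, the function names, and the choice of 16-bit PCM are hypothetical, and in a browser the transport would most likely be a WebSocket (which runs over TCP) rather than a raw socket.

```typescript
// Convert Web Audio style samples (floats in [-1, 1]) to 16-bit signed PCM.
// The 16-bit format is an assumption; a real ASR engine may expect another encoding.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  samples.forEach((sample, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, sample));
    pcm[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  });
  return pcm;
}

// Hypothetical minimal transport, modeled on WebSocket.send(), which accepts typed arrays.
interface BinarySender {
  send(data: Int16Array): void;
}

// Encode one chunk of captured audio, send it, and return the number of bytes sent.
function sendAudioChunk(socket: BinarySender, samples: Float32Array): number {
  const pcm = floatTo16BitPcm(samples);
  socket.send(pcm);
  return pcm.byteLength;
}
```

The application would call `sendAudioChunk` once per captured buffer; how the recognizer frames or acknowledges those chunks is left open here.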

 

- Jim

 

From: Adam Sobieski [mailto:adamsobieski@hotmail.com] 
Sent: Saturday, September 15, 2012 5:36 PM
To: Peter Beverloo
Cc: public-speech-api@w3.org
Subject: RE: Interacting with WebRTC, the Web Audio API and other external sources

 

Speech API Community Group,
 
Greetings.  I was reading about recent developments in the WebRTC stack and wanted to note that, since it will be available soon, it could be useful to the Speech API.  WebRTC includes MediaStream, DataChannel, and PeerConnection interfaces.

In addition to video calls and video conferencing, video forums and scenarios that stream content to media repository services are possible.  Some technologists are additionally excited about 3D video and microphone array functionality.
 
Speech recognition can facilitate numerous technologies and features including generating hypertext transcripts, computers as teleprompters, and other human-computer interaction and user interface topics pertaining to web-based multimedia blogging.
 
 
 
Kind regards,
 
Adam Sobieski
 

________________________________

Date: Thu, 19 Jul 2012 15:38:14 +0100
From: beverloo@google.com
To: public-speech-api@w3.org
Subject: Re: Interacting with WebRTC, the Web Audio API and other external sources

With all major browser vendors being members of the WebRTC working group, it may actually be worth slimming down the APIs and re-using the interface they'll provide.

 

As an addendum to the quoted proposal:

 

* Drop the "start", "stop" and "abort" methods from the SpeechRecognition object in favor of an input MediaStream acquired through getUserMedia()[1].

 

Alternatively, the three methods could be re-purposed to allow partial or timed recognition of continuous media streams, rather than recognition of the whole stream.

 

Best,

Peter

 

[1] http://dev.w3.org/2011/webrtc/editor/getusermedia.html#navigatorusermedia

 

On Wed, Jun 13, 2012 at 3:49 PM, Peter Beverloo <beverloo@google.com> wrote:

Currently, the SpeechRecognition[1] interface defines three methods to start, stop or abort speech recognition, the source of which will be an audio input device as controlled by the user agent. Similarly, the TextToSpeech (TTS) interface defines play, pause and stop, which will output the generated speech to an output device, again, as controlled by the user agent.

 

There are various other media and interaction APIs in development right now, and I believe it would be good for the Speech API to more tightly integrate with them. In this e-mail, I'd like to focus on some additional features for integration with WebRTC and the Web Audio API.

 

** WebRTC <http://dev.w3.org/2011/webrtc/editor/webrtc.html>

 

WebRTC provides the ability to interact with the user's microphone and camera through the getUserMedia() method. As such, an important use-case is audio (and video) chatting between two or more people. Audio is available through a MediaStream object, which can be re-used to power, for example, an <audio> element, or be transmitted to other people through a peer-to-peer connection. It can also integrate with the Web Audio API through an AudioContext's createMediaStreamSource() method.

 

** Web Audio API <https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>

 

The Web Audio API provides the ability to process, analyze, synthesize and modify audio through JavaScript. It can get its input from media files through XMLHttpRequest, from media elements such as <audio> and <video>, and from any other system, including WebRTC, that is able to provide an audio-based MediaStream.
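
The chain described above (microphone, MediaStream, Web Audio source node) might look roughly like this. The sketch uses the promise-based `navigator.mediaDevices.getUserMedia()` form that browsers later standardized; the `StreamLike`, `MediaDevicesLike` and `AudioContextLike` interfaces and the `microphoneSource` helper are hypothetical stand-ins, introduced so the example type-checks and runs without DOM typings.

```typescript
// Hypothetical structural stand-ins for the browser's MediaStream,
// navigator.mediaDevices and AudioContext.
interface StreamLike {
  readonly id: string;
}

interface MediaDevicesLike {
  getUserMedia(constraints: { audio: boolean }): Promise<StreamLike>;
}

interface AudioContextLike<TNode> {
  createMediaStreamSource(stream: StreamLike): TNode;
}

// Ask for microphone access, then wrap the resulting stream in a
// Web Audio source node that can feed further processing nodes.
async function microphoneSource<TNode>(
  devices: MediaDevicesLike,
  context: AudioContextLike<TNode>,
): Promise<TNode> {
  const stream = await devices.getUserMedia({ audio: true });
  return context.createMediaStreamSource(stream);
}
```

In a browser this would be called as `microphoneSource(navigator.mediaDevices, new AudioContext())`, and the promise rejects if the user declines the permission prompt.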

 

Since speech recognition and synthesis do not have to be limited to live input from and output to the user, I'd like to present two new use-cases.

 

1) Transcripts for (live) communication.

 

While the specification does not mandate a maximum duration of a speech input stream, this suggestion is most appropriate for implementations utilizing a local recognizer. Allowing MediaStreams to be used as an input for a SpeechRecognition object, for example through a new "inputStream" property as an alternative to the start, stop and abort methods, would enable authors to supply external input to be recognized. This may include, but is not limited to, prerecorded audio files and WebRTC live streams, both from local and remote parties.
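
To make the proposal concrete, here is one way an author might use such a property. Nothing here exists in the draft: `inputStream` is the property suggested above, and `StreamRecognizer`, `AudioStreamLike`, the simplified `onresult` callback and `transcribeStream` are hypothetical names chosen for illustration.

```typescript
// Hypothetical stand-in for a MediaStream (live WebRTC audio or a prerecorded file).
interface AudioStreamLike {
  readonly id: string;
}

// Hypothetical recognizer shape: instead of start()/stop()/abort(), the author
// assigns a stream. The string-only onresult callback is a simplification.
interface StreamRecognizer {
  inputStream: AudioStreamLike | null;
  onresult: ((transcript: string) => void) | null;
}

// Attach a stream to the recognizer and collect transcript lines as they arrive.
// The returned array is filled in over time by the recognizer's callbacks.
function transcribeStream(
  recognizer: StreamRecognizer,
  stream: AudioStreamLike,
): string[] {
  const lines: string[] = [];
  // Register the handler before attaching the stream so no result is missed.
  recognizer.onresult = (transcript) => {
    lines.push(transcript);
  };
  recognizer.inputStream = stream;
  return lines;
}
```

The same call would work for a remote party's stream in a call or for a decoded audio file, which is the point of decoupling recognition from the user agent's own input device.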

 

2) Storing and processing text-to-speech fragments.

 

Rather than mandating immediate output of the synthesized audio stream, it would be worth considering an "outputStream" property on a TextToSpeech object, which would provide a MediaStream object. This would allow the synthesized stream to be played through the <audio> element, processed through the Web Audio API, or even stored locally for caching in case the user is using a device which is not always connected to the internet (and when no local recognizer is available). Furthermore, this would allow websites to store the synthesized audio in a wave file and save it on the server, allowing it to be re-used by user agents or other clients which do not provide an implementation.
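
The caching use-case could be sketched as follows. This assumes some way of turning the proposed "outputStream" into stored bytes, represented here by a hypothetical `synthesize` callback; `SpeechCache` and its method names are likewise illustrative and not part of any draft.

```typescript
// Hypothetical callback that synthesizes text and resolves with the recorded
// audio bytes (for example, a wave file captured from the output stream).
type Synthesize = (text: string) => Promise<Uint8Array>;

// Remembers synthesized clips by their source text so that repeated requests,
// including requests made while offline, do not need the synthesizer again.
class SpeechCache {
  private readonly clips = new Map<string, Promise<Uint8Array>>();
  private readonly synthesize: Synthesize;

  constructor(synthesize: Synthesize) {
    this.synthesize = synthesize;
  }

  // Return the cached clip for this text, synthesizing it on first use only.
  get(text: string): Promise<Uint8Array> {
    let clip = this.clips.get(text);
    if (clip === undefined) {
      clip = this.synthesize(text);
      this.clips.set(text, clip);
    }
    return clip;
  }
}
```

A rejected synthesis stays cached in this sketch; a fuller version would evict failed entries and persist the clips beyond the lifetime of the page.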

 

The Web platform gains its power by the ability to combine technologies, and I think it would be great to see the Speech API playing a role in that.

 

Best,

Peter

 

[1] http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section
Received on Tuesday, 18 September 2012 12:30:07 UTC