- From: Glen Shires <gshires@google.com>
- Date: Tue, 18 Sep 2012 14:28:40 -0700
- To: Jim Barnett <Jim.Barnett@genesyslab.com>
- Cc: "Young, Milan" <Milan.Young@nuance.com>, Adam Sobieski <adamsobieski@hotmail.com>, Peter Beverloo <beverloo@google.com>, public-speech-api@w3.org
- Message-ID: <CAEE5bcjGpVkDJxQ92xZ54O0f9zBkbNz9SBNHYNY1AAB3nFgCqg@mail.gmail.com>
I've updated the spec with editor notes for the open issues: grammar
formats, confidence and WebRTC.
https://dvcs.w3.org/hg/speech-api/rev/f3777d4b107e

Milan, regarding an update on the status of the open-issues list, I
believe the spec now contains an editor note for each.

As always, the current draft spec is at:
http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

/Glen Shires

On Tue, Sep 18, 2012 at 5:30 AM, Jim Barnett <Jim.Barnett@genesyslab.com> wrote:

> Yes, it’s certainly too late for this draft. As I recall, we have a ‘url’
> property somewhere that is supposed to allow the app to specify the
> location of the remote recognizer. Maybe we could put in a note along the
> lines of: [Need to specify how access to remote recognition is going to
> work]
>
> - Jim
>
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, September 17, 2012 6:54 PM
> To: Jim Barnett; Adam Sobieski; Peter Beverloo
> Cc: public-speech-api@w3.org
> Subject: RE: Interacting with WebRTC, the Web Audio API and other
> external sources
>
> Transport problems aside, the idea of integrating with getUserMedia() is
> a good one. In addition to bringing unity across standards, we can push a
> good deal of the privacy/consent problems to that subgroup, which is more
> focused on that task.
>
> Unfortunately, I think it’s a bit too late for this effort to consider
> such a relatively major rewrite. I suggest that we add this to the issue
> list that Glen and Hans have offered to compile [1].
>
> Speaking of that, I’d appreciate an update on how that is going. It’s
> been over a month now. Glen and Hans, any progress to report?
>
> [1] http://lists.w3.org/Archives/Public/public-speech-api/2012Aug/0026.html
>
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Monday, September 17, 2012 5:49 AM
> To: Adam Sobieski; Peter Beverloo
> Cc: public-speech-api@w3.org
> Subject: RE: Interacting with WebRTC, the Web Audio API and other
> external sources
>
> Adam,
>
> I’m participating in the WebRTC work and hope that it can be made useful
> to the Speech API. One problem is that WebRTC relies on UDP, while I
> understand from Milan that recognizers do better with TCP. I don’t know
> if we’ll be able to add TCP to the WebRTC work. If not, we will at least
> make sure that it is possible for the application to access the user’s
> speech input. The application can then construct its own socket and
> transmit the audio to the ASR engine, if necessary.
>
> - Jim
>
> From: Adam Sobieski [mailto:adamsobieski@hotmail.com]
> Sent: Saturday, September 15, 2012 5:36 PM
> To: Peter Beverloo
> Cc: public-speech-api@w3.org
> Subject: RE: Interacting with WebRTC, the Web Audio API and other
> external sources
>
> Speech API Community Group,
>
> Greetings. I was reading about recent developments with regard to the
> WebRTC stack, and I wanted to note that, as the WebRTC stack will soon be
> available, it could be useful to the Speech API. WebRTC includes the
> MediaStream, DataChannel, and PeerConnection interfaces.
>
> Beyond video calls and video conferencing, it enables video forums and
> scenarios that stream content to media repository services. Some
> technologists are additionally excited about 3D video and microphone
> array functionality.
>
> Speech recognition can facilitate numerous technologies and features,
> including generated hypertext transcripts, computers as teleprompters,
> and other human-computer interaction and user-interface scenarios
> pertaining to web-based multimedia blogging.
>
> Kind regards,
>
> Adam Sobieski
>
> ------------------------------
>
> Date: Thu, 19 Jul 2012 15:38:14 +0100
> From: beverloo@google.com
> To: public-speech-api@w3.org
> Subject: Re: Interacting with WebRTC, the Web Audio API and other
> external sources
>
> With all major browser vendors being members of the WebRTC working
> group, it may actually be worth considering slimming down the APIs and
> re-using the interfaces they'll provide.
>
> As an addendum to the quoted proposal:
>
> * Drop the "start", "stop" and "abort" methods from the SpeechRecognition
> object in favor of an input MediaStream acquired through getUserMedia()
> [1].
>
> Alternatively, the three methods could be re-purposed to allow
> partial/timed recognition of continuous media streams, rather than
> recognition of the whole stream.
>
> Best,
> Peter
>
> [1] http://dev.w3.org/2011/webrtc/editor/getusermedia.html#navigatorusermedia
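A minimal sketch of how the addendum above might look to page authors,
assuming the hypothetical "inputStream" property proposed in this thread.
Neither the property nor this wiring exists in the current draft, and
vendor prefixes (e.g. webkitGetUserMedia) are omitted:

    // Sketch only: "inputStream" is the property proposed in this thread;
    // it is not part of the draft Web Speech API.
    navigator.getUserMedia({ audio: true }, function (stream) {
      var recognition = new SpeechRecognition();
      recognition.onresult = function (event) {
        console.log(event.results[0][0].transcript);
      };
      // Supplying a MediaStream would take the place of the
      // start()/stop()/abort() calls.
      recognition.inputStream = stream;
    }, function (error) {
      console.error('getUserMedia() failed:', error);
    });

Because the same MediaStream could simultaneously feed a PeerConnection or
an <audio> element, recognition would compose naturally with the rest of
the platform.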
> On Wed, Jun 13, 2012 at 3:49 PM, Peter Beverloo <beverloo@google.com>
> wrote:
>
> Currently, the SpeechRecognition [1] interface defines three methods to
> start, stop or abort speech recognition, the source of which will be an
> audio input device as controlled by the user agent. Similarly, the
> TextToSpeech (TTS) interface defines play, pause and stop, which will
> output the generated speech to an output device, again as controlled by
> the user agent.
>
> There are various other media and interaction APIs in development right
> now, and I believe it would be good for the Speech API to integrate more
> tightly with them. In this e-mail, I'd like to focus on some additional
> features for integration with WebRTC and the Web Audio API.
>
> WebRTC <http://dev.w3.org/2011/webrtc/editor/webrtc.html>
>
> WebRTC provides the ability to interact with the user's microphone and
> camera through the getUserMedia() method. As such, an important use case
> is (video and) audio chatting between two or more people. Audio is
> available through a MediaStream object, which can be re-used to power,
> for example, an <audio> element, or transmitted to other parties through
> a peer-to-peer connection; it can also integrate with the Web Audio API
> through an AudioContext's createMediaStreamSource() method.
>
> Web Audio API
> <https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>
>
> The Web Audio API provides the ability to process, analyze, synthesize
> and modify audio through JavaScript. It can get its input from media
> files through XMLHttpRequest, from media elements such as <audio> and
> <video>, and from any other system that can provide an audio-based
> MediaStream, which includes WebRTC.
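For comparison, the MediaStream-to-Web-Audio hand-off Peter describes is
already expressible, roughly as follows (vendor prefixes such as
webkitAudioContext omitted); a recognizer input could plausibly tap the
same graph:

    // Route a getUserMedia() stream into a Web Audio graph.
    var context = new AudioContext();
    navigator.getUserMedia({ audio: true }, function (stream) {
      var source = context.createMediaStreamSource(stream);
      var analyser = context.createAnalyser();
      source.connect(analyser); // e.g. level metering before recognition
    }, function (error) {
      console.error('getUserMedia() failed:', error);
    });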
> Since speech recognition and synthesis do not have to be limited to live
> input from, and output to, the user, I'd like to present two new use
> cases.
>
> 1) Transcripts for (live) communication.
>
> While the specification does not mandate a maximum duration for a speech
> input stream, this suggestion is most appropriate for implementations
> utilizing a local recognizer. Allowing MediaStreams to be used as input
> for a SpeechRecognition object, for example through a new "inputStream"
> property as an alternative to the start, stop and abort methods, would
> enable authors to supply external input to be recognized. This may
> include, but is not limited to, prerecorded audio files and WebRTC live
> streams, both from local and remote parties.
>
> 2) Storing and processing text-to-speech fragments.
>
> Rather than mandating immediate output of the synthesized audio stream,
> consideration should be given to introducing an "outputStream" property
> on a TextToSpeech object which provides a MediaStream. This would allow
> the synthesized stream to be played through an <audio> element, processed
> through the Web Audio API, or even stored locally for caching, in case
> the user is on a device that is not always connected to the internet (and
> no local synthesizer is available). Furthermore, it would allow websites
> to store the synthesized audio to a wave file and save it on the server,
> so it can be re-used by user agents or other clients that do not provide
> an implementation.
>
> The Web platform gains its power from the ability to combine
> technologies, and I think it would be great to see the Speech API playing
> a role in that.
>
> Best,
> Peter
>
> [1] http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section
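A sketch of use case 2, assuming the hypothetical "outputStream" property
described above; the TextToSpeech constructor and "text" attribute shown
here are likewise assumptions, not part of the draft:

    // Sketch only: "outputStream" is the property proposed above; the
    // TextToSpeech constructor and "text" attribute are also assumed.
    var tts = new TextToSpeech();
    tts.text = 'Welcome back.';
    // Route the synthesized audio into an <audio> element via a
    // MediaStream, instead of (or in addition to) immediate playback.
    var player = document.createElement('audio');
    player.src = URL.createObjectURL(tts.outputStream);
    player.play();

The same stream could instead be fed into a Web Audio graph for
processing, or captured for the offline-caching scenario Peter mentions.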
Received on Tuesday, 18 September 2012 21:29:48 UTC