W3C home > Mailing lists > Public > public-xg-htmlspeech@w3.org > December 2010

RE: UA <=> SS Protocol

From: Young, Milan <Milan.Young@nuance.com>
Date: Tue, 7 Dec 2010 07:47:24 -0800
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD098E0356@SUN-EXCH01.nuance.com>
To: "Satish Sampath" <satish@google.com>, "Marc Schroeder" <marc.schroeder@dfki.de>
Cc: "Robert Brown" <Robert.Brown@microsoft.com>, "Dave Burke" <daveburke@google.com>, <public-xg-htmlspeech@w3.org>
Hello Satish,

 

I'm not familiar with the Device APIs, but from what you've written below,
this may be a good fit.  It's worth keeping in mind as we talk about
protocol requirements, because realistically the ability to implement
plays into what we are willing to require.

 

But before we proceed, we still need to hear from Google on two points:

*         Should our group define the requirements of the protocol?

*         Should our group include a concrete protocol definition in our
recommendation?

 

(Note that whether we factor the protocol definition out into a separate
specification or keep it in our spec is still TBD.  In my opinion, it
should probably remain TBD until we complete the requirements phase.)

 

Can you or some other Google representative please comment?

 

Thanks

 

 

________________________________

From: Satish Sampath [mailto:satish@google.com] 
Sent: Tuesday, December 07, 2010 4:32 AM
To: Marc Schroeder
Cc: Robert Brown; Young, Milan; Dave Burke; public-xg-htmlspeech@w3.org
Subject: Re: UA <=> SS Protocol

 

I've been thinking about the various speech input use cases brought up
in the recent requirements discussion, in particular the
website-specified speech service and the UA <> Speech Service protocol.
From what I can see, the Device API spec
<http://www.whatwg.org/specs/web-apps/current-work/#devices>  by the WHATWG
addresses them nicely, and we should be making use of their work.

 

Here is an example of how a plain 'click-button-to-speak' use case can
be implemented using the Device API:

 

	<device type="media" onchange="startRecording(this.data)">

	<script>
	function startRecording(stream) {
	  var recorder = stream.record();
	  // Record for 5 seconds. Ideally this will be replaced with an
	  // end-pointer.
	  setTimeout(function() {
	    var audioData = recorder.stop();
	    var xhr = new XMLHttpRequest();
	    xhr.open("POST", "http://path-to-your-speech-server", true);
	    xhr.onreadystatechange = function () {
	      if (xhr.readyState != 4) return;
	      window.alert("You spoke: " + xhr.responseText);
	    };
	    xhr.send(audioData);
	  }, 5000);
	}
	</script>
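The five-second timer above is a stand-in for a real end-pointer. As an illustrative sketch only, a simple energy-based endpointer over raw samples might look like the following; the function name, frame size, and threshold are placeholders, and it assumes raw audio samples are already available as an array of floats (e.g. via the kind of API Mozilla's Audio Data work is exploring):

```javascript
// Illustrative energy-based endpointer. Scans fixed-size frames and
// reports an endpoint once `silenceFrames` consecutive frames fall
// below the energy threshold. All parameter values are arbitrary.
function detectEndpoint(samples, frameSize, silenceFrames, threshold) {
  var quietRun = 0;
  for (var start = 0; start + frameSize <= samples.length; start += frameSize) {
    var energy = 0;
    for (var i = start; i < start + frameSize; i++) {
      energy += samples[i] * samples[i];
    }
    energy /= frameSize;  // mean-square energy of this frame
    quietRun = (energy < threshold) ? quietRun + 1 : 0;
    if (quietRun >= silenceFrames) {
      return start + frameSize;  // index where trailing silence was confirmed
    }
  }
  return -1;  // no endpoint yet; keep recording
}
```

A real end-pointer would of course be more sophisticated (adaptive thresholds, zero-crossing rates, etc.), but this is the shape of what could be done in script today if raw samples were exposed.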

 

Some salient points:

1.   With the Device API, you can start or stop capturing audio at any
time from JavaScript.

2.   The audio data is sent to the speech service using the standard
XMLHttpRequest object in JavaScript.

o   This allows vendor-specific parameters to be sent as part of the
POST data or as custom headers on the request.

o   No need to define a new protocol here for the request.

3.   The server response comes back via the standard XMLHttpRequest
object as well.

o   Vendors are free to implement their protocol on top of HTTP.

o   Vendors can provide a JS library which encapsulates all of this for
their speech service.

o   There is enough precedent in this area from the various data APIs.

4.   For streaming out audio while recording, there is a ConnectionPeer
<http://www.whatwg.org/specs/web-apps/current-work/#connectionpeer>
proposal.

o   This is specifically aimed at real-time use cases such as video
chat, video record/upload. Speech input will fit in here well.

o   Audio, text and images can be sent via the same channel in real time
to a server or another peer.

o   Responses can be received in real time as well, making it easy to
implement continuous speech recognition.

5.   The code above records for 5 seconds but ideally there would be an
end-pointer here. This can either be:

o   Implemented as part of the Device API (i.e. we should propose it to
the WHATWG) or

o   Implemented in JavaScript with raw audio samples. The Audio XG
<http://www.w3.org/2005/Incubator/audio/>  is defining an API for that.
I think Olli Pettay is active in that XG as well, and Mozilla has a
related Audio Data API in the works.

6.   Device and Audio APIs are work-in-progress, so we could suggest
requirements to them for enabling our use cases.

o   For example, we could suggest "type=audio" for the Device API.
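To make point 2 concrete, here is one way vendor-specific parameters could ride along with the request. The helper and the parameter names (`lang`, `grammar`) are hypothetical, not from any spec; a vendor could equally use custom headers or fields in the POST body:

```javascript
// Serialize hypothetical vendor parameters into a query string on the
// speech service URL. Parameter names here are purely illustrative.
function buildSpeechServiceUrl(baseUrl, params) {
  var pairs = [];
  for (var key in params) {
    if (params.hasOwnProperty(key)) {
      pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(params[key]));
    }
  }
  return pairs.length ? baseUrl + "?" + pairs.join("&") : baseUrl;
}

// Usage with the earlier recording example (XHR calls shown for context):
//   var xhr = new XMLHttpRequest();
//   xhr.open("POST", buildSpeechServiceUrl("http://path-to-your-speech-server",
//            { lang: "en-US", grammar: "http://example.com/g.grxml" }), true);
//   xhr.setRequestHeader("X-Vendor-Param", "value");  // or custom headers
//   xhr.send(audioData);
```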
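And on point 3, a minimal sketch of the kind of JS library a vendor might ship to encapsulate the round trip. The class name is illustrative, and the injected `xhrFactory` exists only so the sketch can be exercised outside a browser; a real library would simply call `new XMLHttpRequest()`:

```javascript
// Hypothetical vendor library wrapping the XHR round trip behind a
// simple recognize() call.
function SpeechClient(serviceUrl, xhrFactory) {
  this.serviceUrl = serviceUrl;
  this.xhrFactory = xhrFactory || function () { return new XMLHttpRequest(); };
}

SpeechClient.prototype.recognize = function (audioData, onResult) {
  var xhr = this.xhrFactory();
  xhr.open("POST", this.serviceUrl, true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) {
      onResult(xhr.responseText);  // response format is vendor-defined
    }
  };
  xhr.send(audioData);
};
```

The page author would then just call `client.recognize(audioData, callback)` and never touch the wire format at all, which is the encapsulation point 3 is making.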

There is a team at Google working on implementing the Device API for
Chrome/webkit.

 

--

Cheers

Satish
Received on Tuesday, 7 December 2010 15:48:02 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 7 December 2010 15:48:06 GMT