Re: UA <=> SS Protocol

I've been thinking about the various speech input use cases brought up in
the recent requirements discussion, in particular the website-specified
speech service and the UA <> Speech Service protocol. From what I can see,
the Device API spec <http://www.whatwg.org/specs/web-apps/current-work/#devices>
by WHATWG addresses them nicely and we should be making use of their work.

Here is an example of how a plain 'click-button-to-speak' use case can be
implemented using the Device API:

<device type="media" onchange="startRecording(this.data)">

<script>
function startRecording(stream) {
  var recorder = stream.record();
  // Record for 5 seconds. Ideally this will be replaced with an end-pointer.
  setTimeout(function() {
    var audioData = recorder.stop();  // a File holding the recorded audio
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "http://path-to-your-speech-server", true);
    // Register the handler before sending so that no state change is missed.
    xhr.onreadystatechange = function () {
      if (xhr.readyState != 4) return;
      window.alert("You spoke: " + xhr.responseText);
    };
    xhr.send(audioData);
  }, 5000);
}
</script>
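
As a variation on the request above, here is a minimal sketch of how
vendor-specific parameters might ride along with the audio. The helper names
buildSpeechRequest and sendAudio, and any parameter or header names passed to
them, are hypothetical and not part of any spec. For simplicity the sketch
puts the parameters in the query string and in custom headers rather than in
the POST body.

```javascript
// Hypothetical helper: builds the URL and headers for a speech request.
// Nothing here is defined by the Device API; names are illustrative only.
function buildSpeechRequest(baseUrl, params, headers) {
  var pairs = [];
  for (var key in params) {
    if (Object.prototype.hasOwnProperty.call(params, key)) {
      pairs.push(encodeURIComponent(key) + "=" +
                 encodeURIComponent(params[key]));
    }
  }
  var url = baseUrl;
  if (pairs.length > 0) {
    // Append to an existing query string if the base URL already has one.
    url += (baseUrl.indexOf("?") === -1 ? "?" : "&") + pairs.join("&");
  }
  return { url: url, headers: headers || {} };
}

// Hypothetical helper: POSTs the recorded audio with the custom headers set.
function sendAudio(audioData, request, onResult) {
  var xhr = new XMLHttpRequest();
  xhr.open("POST", request.url, true);
  for (var name in request.headers) {
    if (Object.prototype.hasOwnProperty.call(request.headers, name)) {
      xhr.setRequestHeader(name, request.headers[name]);
    }
  }
  xhr.onreadystatechange = function () {
    if (xhr.readyState != 4) return;
    onResult(xhr.responseText);
  };
  xhr.send(audioData);
}
```

A vendor library could wrap these two calls so that page authors only pass
the recorded audio and a callback.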


Some salient points:

   1. With the Device API, you can start or stop capturing audio at any time
   from JavaScript.
   2. The audio data is sent to the speech service using the standard
   XMLHttpRequest object in JavaScript.
      - This allows vendor specific parameters to be sent as part of the
      POST data or custom headers with the request.
      - No need to define a new protocol here for the request.
   3. The server response comes back via the standard XMLHttpRequest object
   as well.
      - Vendors are free to implement their protocol on top of HTTP.
      - Vendors can provide a JS library which encapsulates all of this for
      their speech service.
      - There is enough precedent for this in the various data APIs.
   4. For streaming out audio while recording, there is a ConnectionPeer
   <http://www.whatwg.org/specs/web-apps/current-work/#connectionpeer>
   proposal.
      - This is specifically aimed at real-time use cases such as video
      chat, video record/upload. Speech input will fit in here well.
      - Audio, text and images can be sent via the same channel in real time
      to a server or another peer.
      - Responses can be received in real time as well, making it easy to
      implement continuous speech recognition.
   5. The code above records for 5 seconds but ideally there would be an
   end-pointer here. This can either be:
      - Implemented as part of the Device API (i.e. we should propose it to
      the WHATWG) or
      - Implemented in JavaScript with raw audio samples. The Audio XG
      <http://www.w3.org/2005/Incubator/audio/> is defining an API for that.
      I think Olli Pettay is active in that XG as well and Mozilla has a
      related Audio Data API in the works.
   6. Device and Audio APIs are work-in-progress, so we could suggest
   requirements to them for enabling our use cases.
      - For example, we can suggest "type=audio" for the Device API.
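
To make point 5 a bit more concrete, here is a minimal sketch of an
energy-based end-pointer in plain JavaScript. It assumes the raw samples are
available as a Float32Array with values in [-1, 1], which neither the Device
API nor the Audio XG drafts guarantee. The function name findEndOfSpeech and
the default frame size, silence duration and threshold are illustrative
guesses, not tuned values.

```javascript
// Hypothetical end-pointer: returns the index of the first sample of the
// silence that follows speech, or -1 if no end of speech is found.
function findEndOfSpeech(samples, sampleRate, options) {
  var opts = options || {};
  var frameMs = opts.frameMs || 20;       // analysis frame length (assumed)
  var silenceMs = opts.silenceMs || 500;  // silence needed to end (assumed)
  var threshold = opts.threshold || 0.02; // RMS level treated as speech (assumed)

  var frameLen = Math.floor(sampleRate * frameMs / 1000);
  if (frameLen < 1) return -1;
  var silentFramesNeeded = Math.ceil(silenceMs / frameMs);
  var speechSeen = false;
  var silentRun = 0;

  for (var start = 0; start + frameLen <= samples.length; start += frameLen) {
    // Root-mean-square energy of the current frame.
    var sumSquares = 0;
    for (var i = start; i < start + frameLen; i++) {
      sumSquares += samples[i] * samples[i];
    }
    var rms = Math.sqrt(sumSquares / frameLen);

    if (rms >= threshold) {
      speechSeen = true;
      silentRun = 0;
    } else if (speechSeen) {
      silentRun++;
      if (silentRun >= silentFramesNeeded) {
        // Step back to where the run of silent frames began.
        return start + frameLen - silentRun * frameLen;
      }
    }
  }
  return -1;
}
```

Leading silence is ignored until the first loud frame, so the user can pause
before speaking. A real end-pointer would likely need noise adaptation, which
this sketch leaves out.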

There is a team at Google working on implementing the Device API for
Chrome/WebKit.

--
Cheers
Satish

Received on Tuesday, 7 December 2010 12:32:13 UTC