Remote recognition and TTS

A couple of quick comments on remote resources:  the current proposal is
to add a URL property to ASR/TTS resources, identifying the location of
the remote resources.  If we want to pursue this path, we would have to
define the protocol for the browser to connect to the resources, to
send/receive audio, to get results, etc.  It will certainly be easier if
we can reuse an existing protocol.  The obvious one that comes to mind
is the one that the rtcWeb group is defining (it's a joint effort of the
IETF and the W3C).  rtcWeb is intended to allow browser-to-browser
voice, video, and data communications.  There's a lot of interest in
this group, including participation from major browser vendors, so
there's a good chance that at some point we will see this capability
built into the popular web browsers.  Overall, it will provide a
superset of what we need (as far as I know, we don't need video), so it
would make sense for us to reuse it, rather than asking browser vendors
to support an additional protocol.  (The fact that we will have media
servers at one end of the call, rather than a second browser is not a
problem - as long as the media server speaks the appropriate protocol,
the user's browser will never know the difference.)

 

There is a slight complication, though.  In the current draft of the
IETF spec, the browser does not have the capability to set up a call on
its own - the call must be set up in JavaScript (this is to allow
flexibility in complex situations).  However, as part of call setup, a
PeerConnection object is created, which will contain one or more media
streams.  (See the draft API at
http://www.w3.org/TR/2012/WD-webrtc-20120209/ .  For the call setup
protocol, see http://datatracker.ietf.org/doc/draft-ietf-rtcweb-jsep/ .
Be aware that these are both working drafts.)  So I think it would make
sense for our API to allow the developer to provide a PeerConnection
object when creating an ASR or TTS resource.  If such an object is
provided, the browser must use it to communicate with the remote
resources.  Otherwise it will use its local defaults.  
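
To make the "use it if provided, otherwise fall back" rule concrete,
here is a rough sketch.  Everything in it is hypothetical:
PeerConnectionLike is a stand-in for the draft PeerConnection object,
and SpeechResourceOptions, Transport and selectTransport are names I
made up for illustration; none of them come from either working draft.

```typescript
// Hypothetical stand-in for the draft PeerConnection object; only the
// parts this sketch needs are modelled.
interface PeerConnectionLike {
  audioStreams: string[]; // labels of the audio streams in the call
  dataChannels: string[]; // labels of the data channels in the call
}

// Hypothetical options bag for creating an ASR or TTS resource.
interface SpeechResourceOptions {
  peerConnection?: PeerConnectionLike;
}

// The two ways a resource could end up being served.
type Transport =
  | { kind: "remote"; connection: PeerConnectionLike }
  | { kind: "local-default" };

// If the author supplied a PeerConnection, it must be used; otherwise
// the browser falls back to its local defaults.
function selectTransport(options: SpeechResourceOptions = {}): Transport {
  if (options.peerConnection !== undefined) {
    return { kind: "remote", connection: options.peerConnection };
  }
  return { kind: "local-default" };
}

// Example: ASR goes to a remote media server, TTS stays local.
const asrCall: PeerConnectionLike = {
  audioStreams: ["mic"],
  dataChannels: ["results"],
};
const asrTransport = selectTransport({ peerConnection: asrCall });
const ttsTransport = selectTransport();
```

The point of the sketch is only that the choice is made once, when the
resource is created, based on whether the optional parameter is present.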

 

The ASR and TTS resources would each receive their own PeerConnection
(or one could receive a PeerConnection and the other use the browser
defaults.)  Each PeerConnection should contain an audio stream and a
data channel (for TTS, the data channel is used to pass the text to play
to the resource; for ASR, the data channel is used to return results.)  

 

There will be a bunch of error cases to consider (what if the
PeerConnection lacks a data channel, or has two audio streams, etc.).  I
would think that in most of these cases the browser should signal an
error and reject any subsequent attempt to use the relevant resource.
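
As a sketch of that behaviour, assuming (my assumption, not something
the drafts say) that a valid PeerConnection carries exactly one audio
stream and exactly one data channel:  the names PeerConnectionLike,
findConnectionProblem and RemoteSpeechResource below are hypothetical
and exist only for this illustration.

```typescript
// Hypothetical stand-in for the draft PeerConnection object.
interface PeerConnectionLike {
  audioStreams: string[]; // labels of the audio streams in the call
  dataChannels: string[]; // labels of the data channels in the call
}

// Returns a description of the first problem found, or null if the
// connection has exactly one audio stream and one data channel.
function findConnectionProblem(pc: PeerConnectionLike): string | null {
  if (pc.audioStreams.length !== 1) {
    return `expected exactly one audio stream, found ${pc.audioStreams.length}`;
  }
  if (pc.dataChannels.length !== 1) {
    return `expected exactly one data channel, found ${pc.dataChannels.length}`;
  }
  return null;
}

// A resource that signals the error once, at creation time, and then
// rejects every later attempt to use it.
class RemoteSpeechResource {
  private readonly problem: string | null;

  constructor(pc: PeerConnectionLike, onError: (message: string) => void) {
    this.problem = findConnectionProblem(pc);
    if (this.problem !== null) {
      onError(this.problem);
    }
  }

  start(): void {
    if (this.problem !== null) {
      throw new Error(`resource unusable: ${this.problem}`);
    }
    // A real implementation would begin streaming audio here.
  }
}
```

Whether the error is delivered through a callback, an event, or an
exception is a design choice we would still have to make; the callback
here is just the simplest thing to write down.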

 

On the whole, the fact that rtcWeb requires that the JS author set up
the call will make programming remote resources more complex, but also
much more flexible, than if we counted on the browser to do the job.
But I don't think it adds any complexity to our job of defining the
standard.  We just specify that the optional parameter is a
PeerConnection, rather than a URI, and the rest of the spec doesn't have
to change much (we'd have to handle various error cases in the URI-based
version of the API as well).

 

Anyhow, this subject will need more discussion, but I'd like to get
started on it soon. 

 

- Jim

Received on Friday, 20 April 2012 19:37:01 UTC