Re: about Microphone API from Jerry Carter on 2011-04-08 (public-xg-htmlspeech@w3.org from April 2011)

From: Jerry Carter <jerry@jerrycarter.org>
Date: Thu, 7 Apr 2011 21:47:43 -0400
To: Olli@pettay.fi
Cc: public-xg-htmlspeech@w3.org
Message-Id: <B493605D-8BF5-447B-8B4C-06EE619A9897@jerrycarter.org>
On Apr 7, 2011, at 8:42 PM, Olli Pettay wrote:

> On 04/07/2011 05:26 PM, Jerry Carter wrote:
>> These proposals are certainly in the right direction, but in recent
>> years, I've tended to favor more general media streams such as that
>> proposed here by the Device APIs and Policy Working Group.
>> 
>> <http://dev.w3.org/2009/dap/camera/#captureparam>
> 
> So far all the Stream APIs I've seen have been very general data
> streams.
> I couldn't find any Stream api in the spec you linked.

Poor choice of words on my part.  This is certainly not a stream API in the sense of <https://wiki.mozilla.org/MediaStreamAPI> as the DAP work merely indicates placement (i.e. the capture attribute).  I raised the point about richer media types because so much of the recent discussion has focused solely on audio.

Now allow me to respond to the substance of your original message.

When recognition is initiated, the media stream(s) captured by the device must be connected to the recognition engine.  There will likely be a negotiation process that must be completed before this can happen.  The recognizer might suggest preferred types (e.g. near-field and far-field, 44k PCM).  The device may only be able to offer something less (e.g. 16k single channel PCM).  And the user may want to offer something even less, perhaps to reduce data costs or for privacy (e.g. a user answering a call in her bathrobe may not want to provide full video).  I expect that once the media type is agreed, the media stream will remain available for the duration of the recognition or until some event or agent terminates the stream.  These are pretty simple requirements.
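
The three-way negotiation described above (recognizer preference, device capability, user allowance) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `AudioFormat` and `Offer` types and the `negotiate` function are hypothetical and come from none of the proposals under discussion, and a real negotiation (SDP-based, for instance) would be considerably richer than taking the most restrictive value per field.

```typescript
// Hypothetical types for illustration only; not taken from any spec or proposal.
interface AudioFormat {
  sampleRateHz: number;
  channels: number;
}

interface Offer {
  audio: AudioFormat;
  video: boolean;
}

// Settle on the richest media type that all three parties accept:
// what the recognizer prefers, what the device supports, what the user allows.
// Simplifying assumption: each party's offer is an upper bound per field.
function negotiate(recognizerPrefers: Offer, deviceSupports: Offer, userAllows: Offer): Offer {
  const parties = [recognizerPrefers, deviceSupports, userAllows];
  return {
    audio: {
      sampleRateHz: Math.min(...parties.map((p) => p.audio.sampleRateHz)),
      channels: Math.min(...parties.map((p) => p.audio.channels)),
    },
    // Video is sent only if every party is willing.
    video: parties.every((p) => p.video),
  };
}

const agreed = negotiate(
  { audio: { sampleRateHz: 44100, channels: 2 }, video: true },  // recognizer: 44k, two microphones
  { audio: { sampleRateHz: 16000, channels: 1 }, video: true },  // device: 16k single channel
  { audio: { sampleRateHz: 44100, channels: 2 }, video: false }, // user: declines video
);
console.log(agreed);
```

In this sketch the device limits the audio format and the user's choice removes video, mirroring the examples in the paragraph above.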

The plumbing requirements are simple and could be easily satisfied by the Mozilla MediaStreamAPI, the WhatWG proposal, or RTCStreamAPI.  The first, for instance, proposes getUserMedia and addStream functions.  The big differences are in the area of media type negotiation.  Here the MediaStreamAPI is underspecified.  Both the RTCStreamAPI and the proposal from WhatWG embrace SDP.  Their allowance for ICE, STUN and TURN is quite welcome.  I believe that either RTCStreamAPI or the WhatWG conferencing proposal could provide a solid starting point.

As you rightly note, browsers need to pick some API.  The good news is that I don't believe it really matters which is picked so long as the appropriate negotiation mechanisms exist.  Anything created for audio/video mixing applications or multi-person video conferencing will probably be more than enough.  The XG should be able to make considerable progress simply by assuming the existence of such an API.
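
As a rough illustration of "assuming the existence of such an API", the sketch below models a recognizer that takes an already-negotiated stream as its input, in the spirit of the `Recognizer` interface with a `Stream input` attribute quoted further down. The `MediaStreamLike` interface, the `FakeStream` class, and the `start` method are hypothetical stand-ins invented for this example; no browser or proposal defines them in this form.

```typescript
// Hypothetical stand-in for whichever stream API browsers eventually adopt.
interface MediaStreamLike {
  readonly kind: "audio" | "audio+video";
  readonly active: boolean;
  stop(): void;
}

// Loosely modelled on the quoted Recognizer interface; start() is an assumption.
class Recognizer {
  input: MediaStreamLike | null = null;
  recognizerURI?: string;
  recognizerParams?: string;

  constructor(recognizerURI?: string, recognizerParams?: string) {
    this.recognizerURI = recognizerURI;
    this.recognizerParams = recognizerParams;
  }

  // The recognizer does not care where the stream came from (microphone,
  // remote peer, video soundtrack); it only requires that the stream is live.
  start(): string {
    if (this.input === null || !this.input.active) {
      throw new Error("no active input stream");
    }
    return `recognizing ${this.input.kind} from supplied stream`;
  }
}

// A fake stream so the sketch runs without any browser API.
class FakeStream implements MediaStreamLike {
  readonly kind: "audio" | "audio+video";
  private live = true;

  constructor(kind: "audio" | "audio+video") {
    this.kind = kind;
  }

  get active(): boolean {
    return this.live;
  }

  stop(): void {
    this.live = false;
  }
}

const recognizer = new Recognizer();
recognizer.input = new FakeStream("audio");
console.log(recognizer.start());
```

The point of the design is that capture, permissions, and negotiation all live in the stream API, so the recognizer interface stays small.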

>> Most typically, I would expect only the audio information to be sent
>> and consumed.  Unlike the telephony case, automobiles and mobile
>> devices can often provide audio from multiple microphones which
>> allows for better noise rejection and, more rarely, assists with
>> speaker identification in multi-speaker contexts.  There is certainly
>> value to the video stream, when available, for correlation with
>> facial features.  Various studies have shown reduced error rates from
>> combining facial content with the audio.  And for speaker
>> identification / verification, the advantages of video over
>> audio-only are clear.
>> 
>> Again, most typically, I expect audio information from a single
>> microphone.  But I would not want to exclude richer data sources when
>> available.
>> 
>> -=- Jerry
>> 
>> 
>> On Apr 7, 2011, at 3:19 PM, Olli Pettay wrote:
>> 
>>> Hi,
>>> 
>>> as the last or almost last comment in the conf call there was
>>> something about microphone API.
>>> 
>>> As Dan mentioned there has been lots of work happening in RTC and
>>> related areas. HTML spec (WhatWG) has now a proposal for
>>> audio/video conferencing, but there are also other proposals. One
>>> about audio handling (not about communication) is
>>> https://wiki.mozilla.org/MediaStreamAPI
>>> 
>>> For handling audio and video it seems that all the proposals are
>>> using some kind of Stream object. So, if the recognizer API was
>>> using a Stream as an input, we wouldn't need to care about a microphone
>>> API. This approach would also let us rely on the other specs to
>>> handle many security and privacy related issues. (Of course we'd
>>> need to choose which Stream API to use, but that is a broader
>>> problem atm. Browsers will need to implement just one API, but what
>>> that will look like exactly isn't clear yet.)
>>> 
>>> The API could be, for example, close to
>>> SpeechRequest/SpeechRecognizer, but instead of using the default
>>> microphone, or CaptureAPI, there could be an attribute for the
>>> Stream.
>>> 
>>> [Constructor(in optional DOMString recognizerURI, in optional
>>> DOMString recognizerParams)] interface Recognizer { attribute
>>> Stream input; ....
>>> 
>>> This would allow using all sorts of audio streams, not only
>>> microphone. (For example for Streams from other users via VoIP/RTC,
>>> or audio from a video so that web app could do automatic
>>> subtitling. I know, these examples are something for the future.)
>>> 
>>> 
>>> 
>>> -Olli
>>> 
>>> 
>> 
>> 
> 
>
Received on Friday, 8 April 2011 01:50:27 UTC