
RE: about Microphone API

From: Young, Milan <Milan.Young@nuance.com>
Date: Tue, 12 Apr 2011 13:46:22 -0700
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD0AD08D45@SUN-EXCH01.nuance.com>
To: "Jerry Carter" <jerry@jerrycarter.org>, <Olli@pettay.fi>
Cc: <public-xg-htmlspeech@w3.org>

Hello Jerry,

FPR33 suggests that all browsers and engines would support a particular
codec for interoperability.  From what I recall, we chose this
mandatory-codec approach over negotiation because it was an easier path.

I agree that we need to find a way to piggyback off the media/transport
work of the RTC and others.  My only concern is that their timeline may
not match up with the more aggressive folks in our group.  I'd prefer if
we could get something working before then.

On the last call, you expressed an aversion to an MRCP subset over
WebSockets.  Was that just because you hadn't yet understood the full
context of the proposal, or is there an Achilles heel we have yet to
discover?
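
To make the proposal concrete, here is a hypothetical sketch of what a
request in an MRCP subset over WebSockets might look like.  The framing
below is simplified and invented for illustration (it is loosely modeled
on MRCPv2 but is not spec-accurate), and the channel identifier is made
up:

```javascript
// Hypothetical sketch: frame an MRCP-style RECOGNIZE request as a single
// WebSocket text message.  Header names echo MRCPv2, but this framing is
// a simplification, not the actual MRCPv2 wire format.
function buildRecognize(requestId, grammarUri, contentType) {
  var body = grammarUri;
  return [
    'MRCP/2.0 RECOGNIZE ' + requestId,
    'Channel-Identifier: speechrecog-channel-1', // hypothetical value
    'Content-Type: ' + contentType,
    'Content-Length: ' + body.length,            // byte length for ASCII bodies
    '',
    body
  ].join('\r\n');
}
```

A client would send this as one text frame on the socket (e.g.
`ws.send(buildRecognize(1, 'session:grammar-1', 'text/uri-list'))`) and
then listen for IN-PROGRESS / COMPLETE style events on the same
connection, with audio flowing as binary frames alongside.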


-----Original Message-----
From: public-xg-htmlspeech-request@w3.org
[mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Jerry Carter
Sent: Thursday, April 07, 2011 6:48 PM
To: Olli@pettay.fi
Cc: public-xg-htmlspeech@w3.org
Subject: Re: about Microphone API

On Apr 7, 2011, at 8:42 PM, Olli Pettay wrote:

> On 04/07/2011 05:26 PM, Jerry Carter wrote:
>> These proposals are certainly in the right direction, but in recent
>> years, I've tended to favor more general media streams such as that
>> proposed here by the Device APIs and Policy Working Group.
>> <http://dev.w3.org/2009/dap/camera/#captureparam>
> So far all the Stream APIs I've seen have been very general data
> streams.
> I couldn't find any Stream API in the spec you linked.

Poor choice of words on my part.  This is certainly not a stream API in
the sense of <https://wiki.mozilla.org/MediaStreamAPI> as the DAP work
merely indicates placement (i.e. the capture attribute).  I raised the
point about richer media types because so much of the recent discussion
has focused solely on audio.

Now allow me to respond to the substance of your original message.

When recognition is initiated, the media stream(s) captured by the
device must be connected to the recognition engine.  There will likely
be a negotiation process which must be completed before this can
happen.  The recognizer might suggest preferred types (e.g. near-field
and far-field, 44k PCM).  The device may only be able to offer something
less (e.g. 16k single-channel PCM).  And the user may want to offer
something even less, perhaps to reduce data costs or for privacy (e.g. a
user answering a call in her bathrobe may not want to provide full
video).  I expect that once the media type is agreed, the media stream
will remain available for the duration of the recognition or until some
event or agent terminates the stream.  These are pretty simple
requirements.
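
That three-way agreement (recognizer preference, device capability, user
policy) can be sketched as a small selection function.  Everything here
is hypothetical, including the media-type strings, which just mirror the
examples above:

```javascript
// Hypothetical negotiation sketch: the recognizer lists preferred media
// types in priority order; the device offers what it can capture; the
// user's policy caps what may be sent.  The first preference acceptable
// to all three parties wins.
function negotiate(recognizerPrefs, deviceOffers, userAllowed) {
  for (var i = 0; i < recognizerPrefs.length; i++) {
    var pref = recognizerPrefs[i];
    if (deviceOffers.indexOf(pref) !== -1 &&
        userAllowed.indexOf(pref) !== -1) {
      return pref;
    }
  }
  return null; // no agreement; recognition cannot start
}
```

For example, a recognizer preferring 44k PCM but accepting 16k mono
would settle on `'audio/pcm;rate=16000;channels=1'` against a device
that only offers the latter.  Real negotiation (SDP offer/answer) is of
course richer than a flat list intersection.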

The plumbing requirements are simple and could easily be satisfied by
the Mozilla MediaStreamAPI, the WhatWG proposal, or the RTCStreamAPI.
The first, for instance, proposes getUserMedia and addStream functions.
The big differences are in the area of media type negotiation.  Here the
MediaStreamAPI is underspecified.  Both the RTCStreamAPI and the WhatWG
proposal embrace SDP.  Their allowance for ICE, STUN and TURN is quite
welcome.  I believe that either the RTCStreamAPI or the WhatWG
conferencing proposal could provide a solid starting point.
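
As an illustration of how little plumbing is actually needed, here is a
sketch assuming a Mozilla-style getUserMedia/addStream pair.  Both the
shape of `media.getUserMedia(options, success, error)` and the
`recognizer.addStream(stream)` call are assumptions for illustration;
none of these names is settled in any of the proposals:

```javascript
// Hypothetical plumbing sketch: capture a stream from the user's device
// and hand it to a recognition engine.  The media and recognizer objects
// are passed in so nothing here depends on a particular browser API.
function connectRecognizer(media, recognizer) {
  return new Promise(function (resolve, reject) {
    media.getUserMedia({ audio: true }, function (stream) {
      recognizer.addStream(stream); // connect capture to the engine
      resolve(stream);
    }, reject);
  });
}
```

The interesting part, as noted above, is not this hookup but the
negotiation that decides what the stream actually carries.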

As you rightly note, browsers need to pick some API.  The good news is
that I don't believe it really matters which is picked so long as the
appropriate negotiation mechanisms exist.  Anything created for
audio/video mixing applications or multi-person video conferencing will
probably be more than enough.  The XG should be able to make
considerable progress simply by assuming the existence of such an API.

>> Most typically, I would expect only the audio information to be sent
>> and consumed.  Unlike the telephony case, automobiles and mobile
>> devices can often provide audio from multiple microphones which
>> allows for better noise rejection and, more rarely, assists with
>> speaker identification in multi-speaker contexts.  There is certainly
>> value to the video stream, when available, for correlation with
>> facial features.  Various studies have shown reduced error rates from
>> combining facial content with the audio.  And for speaker
>> identification / verification, the advantages of video over
>> audio-only are clear.
>> Again, most typically, I expect audio information from a single
>> microphone.  But I would not want to exclude richer data sources when
>> available.
>> -=- Jerry
>> On Apr 7, 2011, at 3:19 PM, Olli Pettay wrote:
>>> Hi,
>>> as the last or almost last comment in the conf call there was
>>> something about microphone API.
>>> As Dan mentioned there has been lots of work happening in RTC and
>>> related areas. HTML spec (WhatWG) has now a proposal for
>>> audio/video conferencing, but there are also other proposals. One
>>> about audio handling (not about communication) is
>>> https://wiki.mozilla.org/MediaStreamAPI
>>> For handling audio and video it seems that all the proposals are
>>> using some kind of Stream object.  So, if the recognizer API was
>>> using a Stream as an input, we wouldn't need to care about a
>>> microphone API.  This approach would also let us rely on the other
>>> specs to handle many security and privacy related issues.  (Of
>>> course we'd need to choose which Stream API to use, but that is a
>>> broader problem atm.  Browsers will need to implement just one API,
>>> but what that will look like exactly isn't clear yet.)
>>> The API could be, for example, close to
>>> SpeechRequest/SpeechRecognizer, but instead of using the default
>>> microphone, or CaptureAPI, there could be an attribute for the
>>> Stream:
>>>
>>> [Constructor(in optional DOMString recognizerURI,
>>>              in optional DOMString recognizerParams)]
>>> interface Recognizer {
>>>   attribute Stream input;
>>>   ...
>>> This would allow using all sorts of audio streams, not only the
>>> microphone.  (For example, Streams from other users via VoIP/RTC,
>>> or audio from a video so that a web app could do automatic
>>> subtitling.  I know, these examples are something for the future.)
>>> -Olli
Received on Tuesday, 12 April 2011 20:47:22 UTC
