Re: Requirements for Web audio APIs

On Thu, May 19, 2011 at 2:58 AM, Robert O'Callahan <robert@ocallahan.org> wrote:

> On Thu, Apr 14, 2011 at 10:11 AM, Chris Rogers <crogers@google.com> wrote:
>
>> On Wed, Apr 13, 2011 at 10:44 AM, Robert O'Callahan
>> <robert@ocallahan.org> wrote:
>>
>>
>>> Some of these requirements have arisen very recently.
>>>
>>> 1) Integrate with media capture and peer-to-peer streaming APIs
>>> There's a lot of energy right now around APIs and protocols for real-time
>>> communication in Web browsers, in particular proposed WHATWG APIs for media
>>> capture and peer-to-peer streaming:
>>> http://www.whatwg.org/specs/web-apps/current-work/complete/video-conferencing-and-peer-to-peer-communication.html
>>> Ian Hickson's proposed API creates a "Stream" abstraction representing a
>>> stream of audio and video data. Many use-cases require integration of media
>>> capture and/or peer-to-peer streaming with audio effects processing.
>>>
>>
>> To a small extent, I've been involved with some of the Google engineers
>> working on this.  I would like to make sure the API is coherent with an
>> overall web audio architecture.  I believe it should be possible to design
>> the API in such a way that it's scalable to work with my graph-based
>> proposal (AudioContext and AudioNodes).
>>
>
> Have you made any progress on that?
>

Not yet.  There's an increasing amount of engineering work being done for
WebRTC, people are busy with it, and things are rapidly evolving.  It would
be good to involve Ian Hickson in this broader discussion.



>
> My concern is that having multiple abstractions representing streams of
> media data --- AudioNodes and Streams --- would be redundant.
>

Agreed, this needs to be looked at carefully.  It might be workable if
there were appropriate ways to use them together easily, even if they remain
separate types of objects.  In graphics, for example, there are different
objects such as Image, ImageData, and WebGL textures, which have different
relationships with each other.  I don't know what the right answer is, but
there are probably various reasonable ways to approach the problem.
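Just to sketch the kind of bridge I have in mind (nothing like this is
specified anywhere yet, so the capture call and the createMediaStreamSource()
factory below are purely illustrative assumptions):

    // Sketch only: wiring a captured Stream into the AudioNode graph.
    // getUserMedia() and createMediaStreamSource() are assumptions about how
    // such a bridge could look; nothing here is specified yet.
    const context = new AudioContext();

    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      // Wrap the Stream in a source node so it can feed the processing graph.
      const source = context.createMediaStreamSource(stream);

      // Run it through an effect (a simple low-pass filter) before playback.
      const filter = context.createBiquadFilter();
      filter.type = "lowpass";
      filter.frequency.value = 2000;

      source.connect(filter).connect(context.destination);
    });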



>
>>
>>> 2) Need to handle streams containing synchronized audio and video
>>> Many use-cases require effects to be applied to an audio stream which is
>>> then played back alongside a video track with synchronization. This can
>>> require the video to be delayed, so we need a framework that handles both
>>> audio and video. Also, the WHATWG Stream abstraction contains video as well
>>> as audio, so integrating with it will mean pulling in video.
>>>
>>
>> I assume you mean dealing with latency compensation here?  In other words,
>> some audio processing may create a delay which needs to be compensated for
>> by an equivalent delay in presenting the video stream.
>>
>
> Correct.
>
>
>> This is a topic which came up in my early discussions with Apple, as they
>> were also interested in this.  We talked about having a .latency attribute
>> on every processing node (AudioNode) in the rendering graph.  That way the
>> graph can be queried and the appropriate delay can be factored into the
>> video presentation.  A .latency attribute is also useful for synchronizing
>> two audio streams, each of which may have different latency characteristics.
>>  In modern digital audio workstation software, this kind of compensation is
>> very important.
>>
>
> You seem to be suggesting exposing latency information to Web apps, which
> then adjust the video presentation somehow ... but how? HTML media elements
> have no API that allows the author to introduce extra buffering of video
> output. Even if there was such an API, it would be clumsy to use for this
> purpose and I'm pretty sure the quality of A/V sync would be reduced. Media
> engines currently work hard to make sure that video frames are presented at
> the right moment, based on the audio hardware clock, and a fixed latency
> parameter would interfere with that.
>
> I would like to see an API that integrates video and audio into a single
> processing architecture so that we can get high-quality A/V sync with audio
> processing, and authors don't have to manage latency explicitly.
>

OK, there are ways for the underlying system to automatically infer the
audio processing latency of an individual audio source so that the video
stream can be adjusted internally.  When you have an audio rendering graph,
each "box" (or AudioNode) in the graph can potentially introduce processing
latency, so at the very least the underlying implementation needs to be
aware of that latency.  Internally, it could walk the graph and calculate
the latency from point to point.  I do think that in some cases it could
also be useful to expose this value to script as an attribute, but, as you
suggest, the right synchronization can additionally happen internally.
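As a rough illustration of the kind of calculation I mean, here is a sketch
that assumes a hypothetical .latency attribute on each node and a way to
enumerate a node's inputs (neither is specified anywhere):

    // Sketch only: summing per-node latency along a path through the graph.
    // Assumes a hypothetical .latency attribute (in seconds) on each node and
    // a way to enumerate a node's inputs; the graph is assumed to be acyclic.
    interface LatencyNode {
      latency: number;        // processing delay this node introduces, in seconds
      inputs: LatencyNode[];  // nodes feeding into this one
    }

    // Worst-case latency accumulated from any source up to the given node.
    function latencyTo(node: LatencyNode): number {
      if (node.inputs.length === 0) {
        return node.latency;
      }
      return node.latency + Math.max(...node.inputs.map(latencyTo));
    }

    // The video presentation would then be delayed by latencyTo(destination)
    // to stay in sync with the processed audio.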


>
> Another use-case to think about is the Xbox 360 chat "voice distortion"
> feature: the user's voice is captured via a microphone, and a distortion
> effect is applied before it's sent over the network. Perhaps video is also
> being captured and we want to send it in sync with that processed audio.
> Having authors manually manage latency in that scenario sounds very
> difficult.
>

Yes, perhaps so.  As I mentioned above, there are ways this can be made to
happen automatically by inspecting the rendering graph and calculating the
latency.
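For example, the voice-distortion case might look something like the sketch
below if the capture stream, the processing graph, and the peer connection
were allowed to compose; every one of those bridge points is an assumption at
this stage:

    // Sketch only: microphone -> distortion effect -> peer-to-peer stream.
    // The bridge calls (createMediaStreamSource, createMediaStreamDestination)
    // and the peer-connection wiring are assumptions about how the pieces
    // could compose; the implementation would compensate for the graph's
    // audio latency to keep the video track in sync.
    const ctx = new AudioContext();
    const pc = new RTCPeerConnection();

    navigator.mediaDevices
      .getUserMedia({ audio: true, video: true })
      .then((stream) => {
        // Route the captured audio through a wave-shaping "distortion" node.
        const source = ctx.createMediaStreamSource(stream);
        const distortion = ctx.createWaveShaper();
        distortion.curve = makeDistortionCurve(50); // illustrative helper below

        const processed = ctx.createMediaStreamDestination();
        source.connect(distortion).connect(processed);

        // Send the processed audio plus the original (unprocessed) video.
        for (const track of processed.stream.getAudioTracks()) {
          pc.addTrack(track, processed.stream);
        }
        for (const track of stream.getVideoTracks()) {
          pc.addTrack(track, stream);
        }
      });

    // Simple illustrative transfer curve for the WaveShaperNode.
    function makeDistortionCurve(amount: number) {
      const samples = 44100;
      const curve = new Float32Array(samples);
      for (let i = 0; i < samples; i++) {
        const x = (i * 2) / samples - 1;
        curve[i] = ((3 + amount) * x * 20 * (Math.PI / 180)) /
                   (Math.PI + amount * Math.abs(x));
      }
      return curve;
    }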



>
>  3) Need to handle synchronization of streams from multiple sources
>>> There's ongoing work to define APIs for playing multiple media resources
>>> with synchronization, including a WHATWG proposal:
>>> http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#mediacontroller
>>> Many use-cases require audio effects to be applied to some of those
>>> streams while maintaining synchronization.
>>>
>>
>> I admit that I haven't been closely following this particular proposal.
>>  But, I'll try to present my understanding of the problem as it relates to
>> <audio> and <video> right now.  Both the play() and pause() methods of
>> HTMLMediaElement don't allow a way to specify a time when the event should
>> occur.  Ideally, the web platform would have a high-resolution clock,
>> similar to the Date class, with its getTime() method, but higher-resolution.
>>  This clock can be used as a universal reference time.  Then, for example,
>> the play() method could be extended to something like play(time), where
>> |time| is based on this clock.  That way, multiple <audio> and <video>
>> elements could be synchronized precisely.
>>
>
> That sounds good, but I was thinking of other sorts of problems. Consider
> for example the use-case of a <video> movie with a regular audio track, and
> an auxiliary <audio> element referencing a commentary track, where we apply
> an audio ducking effect to overlay the commentary over the regular audio.
> How would you combine audio from both streams and keep everything in sync
> (including the video), especially in the face of issues such as one of the
> streams temporarily pausing to buffer due to a network glitch?
>

In general this sounds like a very difficult problem to solve: if you had
two <video> streams playing together, either one could pause momentarily due
to a buffer underrun, so each would have to adjust to the other.  And with
more than two, any of them could require adjustments in all of the others.
In any case, in my proposal the <audio> and <video> elements represent audio
sources which can be wired into further effect processing.  I'm not trying
to solve the media streaming synchronization problem you describe, but if
HTMLMediaElement finds solutions to that issue, then <audio> and <video>
sources can enter a Web Audio API effect-processing graph already
synchronized (potentially compensating for any rendering graph latency).
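To make that last point concrete, here is a sketch of how the ducking example
could look once both elements act as sources in the processing graph (the
node names are assumptions, and synchronization between the two elements is
left to HTMLMediaElement itself):

    // Sketch only: ducking a movie's audio under a commentary track.
    // Assumes both media elements can act as sources in the processing graph;
    // keeping the two elements themselves in sync is left to HTMLMediaElement.
    const ctx = new AudioContext();

    const movie = document.querySelector("video")!;      // main A/V track
    const commentary = document.querySelector("audio")!; // commentary track

    const movieSource = ctx.createMediaElementSource(movie);
    const commentarySource = ctx.createMediaElementSource(commentary);

    // The movie's audio passes through a gain node so it can be ducked.
    const movieGain = ctx.createGain();
    movieSource.connect(movieGain).connect(ctx.destination);
    commentarySource.connect(ctx.destination);

    // Duck the movie while the commentary plays, with short ramps.
    commentary.addEventListener("play", () => {
      movieGain.gain.setTargetAtTime(0.3, ctx.currentTime, 0.1);
    });
    for (const evt of ["pause", "ended"]) {
      commentary.addEventListener(evt, () => {
        movieGain.gain.setTargetAtTime(1.0, ctx.currentTime, 0.1);
      });
    }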

Chris

Received on Friday, 20 May 2011 19:17:49 UTC