Re: active speaker information in mixed streams from Bernard Aboba on 2014-04-03 (public-ortc@w3.org from April 2014)

From: Bernard Aboba <Bernard.Aboba@microsoft.com>
Date: Thu, 3 Apr 2014 23:21:01 +0000
To: "public-ortc@w3.org" <public-ortc@w3.org>
CC: Emil Ivov <emcho@jitsi.org>
Message-ID: <c3ee1220914d459e9603ab313f2ef10d@SN2PR03MB031.namprd03.prod.outlook.com>

Since this issue (#27, see: https://github.com/openpeer/ortc/issues/27) was posted in late January, the thread has petered out, so it seemed like a good idea to recap where we are and figure out what the disposition is.

The requirement was to enable an application to obtain the audio levels of contributing sources. This could be used for several purposes:

a) To indicate in the application UI which speakers are active and the levels of the speakers.

b) To help select the video(s) to be shown in enhanced resolution/framerate.

It was noted that for use a) and possibly b), sampling at ~5 Hz (every 200 ms) could be sufficient.

Several ways of solving the problem were discussed. These included:

1. Having an audio mixer provide the level information to the application via the data channel. Disadvantage of this is that most existing audio mixers do not implement a data channel.

2. Forwarding the non-muted audio streams to the application (e.g. implementing an RTP translator for audio rather than a mixer). Advantage is that this saves an audio encode/decode on the RTP translator. Note: within existing implementations remote audio cannot be utilized by the Web Audio API, but presumably this will be fixed at some point.

3. Implementing the mixer-to-client RTP extension (RFC 6465) in the audio mixer and providing the application with a way of obtaining the information.

Of these approaches, only approach #3 requires support in the ORTC API.
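
To make approach #3 concrete, here is a minimal sketch of what decoding the RFC 6465 extension involves. In RFC 6465 the extension data carries one byte per CSRC, in the same order as the packet's CSRC list: the top bit is reserved and the low 7 bits hold the level in -dBov (0 is loudest, 127 is silence). The helper name `decodeCsrcLevels` and the `CsrcLevel` shape are hypothetical and are not part of any proposal in this thread.

```typescript
// Hypothetical shape: one decoded entry per contributing source.
interface CsrcLevel {
  csrc: number;  // CSRC identifier taken from the RTP header
  level: number; // 0..127 in -dBov; 127 means silence
}

// Pairs each CSRC with its level byte from the RFC 6465 extension data.
// Assumes the caller has already located the extension payload in the packet.
function decodeCsrcLevels(csrcs: number[], extData: Uint8Array): CsrcLevel[] {
  const n = Math.min(csrcs.length, extData.length);
  const out: CsrcLevel[] = [];
  for (let i = 0; i < n; i++) {
    // Mask off the reserved top bit; the remaining 7 bits are the level.
    out.push({ csrc: csrcs[i], level: extData[i] & 0x7f });
  }
  return out;
}
```

In a browser this parsing would happen inside the RTP stack, which is why the application needs an API to get at the result.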

Since providing an event on level changes could generate a high event rate (50+ PPS), the basic proposal was for a polling interface:

dictionary RTCRtpContributingSource {
    double       packetTime;
    unsigned int csrc;
    int          audioLevel;
};

partial interface RTCRtpReceiver {
    sequence<RTCRtpContributingSource> getContributingSources();
};
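
As a rough illustration of how an application might use such a polling interface at the ~5 Hz rate mentioned earlier, here is a sketch. It assumes `audioLevel` follows the RFC 6465 convention (-dBov, so lower values are louder); the proposal does not pin that down. `ReceiverLike`, `activeCsrcs`, `startLevelPolling`, and the threshold of 60 are all hypothetical names and values chosen for the example.

```typescript
// Mirrors the proposed dictionary; the audioLevel semantics are an assumption.
interface ContributingSource {
  packetTime: number;
  csrc: number;
  audioLevel: number; // assumed -dBov: 0 is loudest, 127 is silence
}

// Stand-in for an RTCRtpReceiver exposing the proposed method.
interface ReceiverLike {
  getContributingSources(): ContributingSource[];
}

// Returns the CSRCs whose level is louder than the (arbitrary) threshold.
function activeCsrcs(sources: ContributingSource[], silenceThreshold = 60): number[] {
  return sources.filter((s) => s.audioLevel < silenceThreshold).map((s) => s.csrc);
}

// Polls the receiver every intervalMs and reports the active CSRCs.
// Returns a function that stops the polling.
function startLevelPolling(
  receiver: ReceiverLike,
  intervalMs: number,
  onUpdate: (active: number[]) => void,
): () => void {
  const timer = setInterval(() => {
    onUpdate(activeCsrcs(receiver.getContributingSources()));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `startLevelPolling(receiver, 200, updateUi)` would sample at 5 Hz, so the application sets the event rate rather than the incoming packet rate.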

Questions:

A. Is there agreement that this problem should be solved within the ORTC API? (as opposed to solving it via the alternative approaches #1 and #2, which don't require API support)

B. If such an API were to be supported is there someone who would implement it?

C. Assuming that the problem needs to be solved in the ORTC API and there is someone interested in using the solution, is the polling approach described above acceptable?

Note that using audio levels for "dominant speaker identification" isn't necessarily a good idea because a high audio level could represent noise rather than speech, and waiting until an audio level became "dominant" could yield an annoying level of delay (e.g. I interrupt you but my video isn't shown in high resolution until you finish and I am speaking with the highest level). However, using audio levels to determine the "last N speakers" or set of speakers to show in higher resolution could be workable (e.g. as soon as I interrupt you, my video goes to high resolution along with yours).
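
A minimal sketch of the "last N speakers" idea, assuming the application already has some way of deciding which CSRCs were active in each polling interval (for example by thresholding the audio level). The class name `LastNSpeakers` and its method are hypothetical. Each source is stamped with the time it was last active, and the N most recent are kept.

```typescript
// Tracks the N most recently active contributing sources.
class LastNSpeakers {
  private lastActive = new Map<number, number>(); // csrc -> last active time (ms)
  private n: number;

  constructor(n: number) {
    this.n = n;
  }

  // Records the sources active at nowMs and returns the N most recent CSRCs.
  update(activeCsrcs: number[], nowMs: number): number[] {
    for (const csrc of activeCsrcs) {
      this.lastActive.set(csrc, nowMs);
    }
    // Most recently active first; ties keep first-seen order (stable sort).
    return Array.from(this.lastActive.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(0, this.n)
      .map((entry) => entry[0]);
  }
}
```

With N = 2, an interrupting speaker enters the set on the first interval in which they are active, without having to become the loudest source. A real implementation would also need to drop sources that have left the conference, which this sketch does not do.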

In terms of the information to be provided on each contributing source, discussion indicated that it should include the audio level, as well as a timestamp indicating when the source last contributed.

Various aspects of problem b) were also discussed. It was noted that "dominant speaker identification" cannot necessarily be accomplished solely from the audio levels of the contributing sources. For example, one source can be speaking and another contributor may be providing noise because they are not muted even though they are not speaking. The audio could be analyzed on the mixer to determine which speakers should be shown in better resolution/framerate instead of a thumbnail, and then the SFU would reflect the choice in the stream sent to the browser.

Hey all,

I just posted this to the WebRTC list here:

http://lists.w3.org/Archives/Public/public-webrtc/2014Jan/0256.html

But I believe it's a question that is also very much worth resolving for ORTC, so I am also asking it here:

One requirement that we often bump against is the possibility to extract active speaker information from an incoming *mixed* audio stream. Acquiring the CSRC list from RTP would be a good start. Audio levels as per RFC6465 would be even better.

Thoughts?

Emil

https://jitsi.org

Received on Thursday, 3 April 2014 23:21:33 UTC