Re: active speaker information in mixed streams

Hey Bernard,

Thanks for the great summary!

Comments inline:

On 04.04.14, 01:21, Bernard Aboba wrote:
> Since this issue (#27, see:
> https://github.com/openpeer/ortc/issues/27) was posted in late January,
>   the thread has petered out, so it seemed like a good idea to recap
> where we are and figure out what the disposition is.
>
> The requirement was to enable an application to obtain the audio levels
> of contributing sources.   This could be used for several purposes:
>
> a)To indicate in the application UI which speakers are active and the
> levels of the speakers.
>
> b)To help select the video(s) to be shown in enhanced resolution/framerate.
>
> It was noted that for use a) and possibly b), sampling at ~5 Hz (every
> 200 ms) could be sufficient.
>
> Several ways of solving the problem were discussed.  These included:
>
> 1.Having an audio mixer provide the level information to the application
> via the data channel.  Disadvantage of this is that most existing audio
> mixers do not implement a data channel.
>
> 2.Forwarding the non-muted audio streams to the application (e.g.
> implementing an RTP translator for audio rather than a mixer).
> Advantage is saving audio encode/decode on the RTP translator.    Note:
> within existing implementations remote audio cannot be utilized by the
> web audio API, but presumably this will be fixed at some point.
>
> 3.Implementing the mixer-to-client RTP extension (RFC 6465) in the audio
> mixer and providing the application with a way of obtaining the
> information.
>
> Of these approaches, only approach #3 requires support in the ORTC API.

The relayed streams in #2 might already carry SSRC audio levels (RFC 
6464, which happens to be supported by the webrtc.org implementation 
today). Note that this specific extension also carries a VAD flag.
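
For reference, the per-packet payload of that extension is a single 
byte: the most significant bit is the VAD flag and the remaining seven 
bits carry the level in -dBov (0 = loudest, 127 = digital silence). A 
minimal, purely illustrative decoding sketch in TypeScript (not tied to 
any real API):

interface SsrcAudioLevel {
    vad: boolean;       // V flag: the sender believes the packet contains speech
    levelDbov: number;  // 0..127, level expressed in -dBov
}

function parseSsrcAudioLevel(extensionByte: number): SsrcAudioLevel {
    return {
        vad: (extensionByte & 0x80) !== 0,
        levelDbov: extensionByte & 0x7f,
    };
}

So surfacing those values through the API should be cheap for 
implementations that already parse the extension (as webrtc.org does).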

I think making those levels available to the application could be quite 
handy:

1. It greatly reduces complexity for the JS developer, because it may 
well spare them the need to develop, implement, or integrate audio 
processing in their app (and, as you point out yourself, none of the 
existing WebRTC implementations gives them that option today anyway).

2. FWIW it reduces the overall CPU footprint of the conference as VAD 
would only be run once per stream, rather than N*(N-1) times.

> Since providing an event on level changes could generate a high event
> rate (50+ PPS), the basic proposal was for a polling interface:
>
> dictionary RTCRtpContributingSource {
>      double packetTime;
>      unsigned int csrc;
>      int audioLevel;
> };
>
> partial interface RTCRtpReceiver {
>      sequence<RTCRtpContributingSource> getContributingSources();
> };
>
> Questions:
>
> A. Is there agreement that this problem should be solved within the ORTC
> API?

I certainly think this would be very helpful.

> (as opposed to solving it via the alternative approaches #1 and
> #2,  which don’t require API support)
>
> B. If such an API were to be supported is there someone who would
> implement it?
>
> C.Assuming that the problem needs to be solved in the ORTC API and there
> is someone interested in using the solution,

+1

> is the polling approach
> described above acceptable?

* It is acceptable, although I do think an event with configurable 
granularity and a default of ~5 Hz would also be helpful. As I've said 
before, however, polling is better than nothing.

* Given that essentially the same extension exists for both CSRC and 
SSRC, I think we could slightly change the above to something like this:

dictionary RTCRtpSynchronisationSource {
     double packetTime;
     unsigned int csrc;
     int audioLevel;
     boolean vad = false;
};

partial interface RTCRtpReceiver {
     RTCRtpSynchronisationSource getSynchronisationSource();
     sequence<RTCRtpSynchronisationSource> getContributingSources();
};
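
As a rough usage sketch (TypeScript, using the shape from the proposal 
above, none of which is a finalized API; it also assumes audioLevel 
follows the RFC 6465 -dBov convention, where smaller values are louder):

interface ContributingSourceInfo {
    packetTime: number;
    csrc: number;
    audioLevel: number; // -dBov per RFC 6465: smaller value = louder
    vad: boolean;
}

interface ReceiverLike {
    getContributingSources(): ContributingSourceInfo[];
}

function watchActiveSpeaker(receiver: ReceiverLike,
                            onSpeaker: (csrc: number) => void): void {
    setInterval(() => {
        // Only consider sources the mixer flagged as containing speech.
        const speaking = receiver.getContributingSources().filter(s => s.vad);
        if (speaking.length === 0) {
            return;
        }
        // Smaller -dBov value means a louder source.
        const loudest = speaking.reduce((a, b) =>
            a.audioLevel <= b.audioLevel ? a : b);
        onSpeaker(loudest.csrc);
    }, 200); // ~5 Hz, i.e. the granularity discussed above
}

That is roughly all the application code use case a) would need.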

> Note that using audio levels for “dominant speaker identification” isn’t
> necessarily a good idea because a high audio level could represent noise
> rather than speech,

I think it's not necessarily a bad one either. If someone is introducing 
noise then it might make sense to focus on them so that users can take 
action (e.g. request that the offender be muted).

> and waiting until an audio level became “dominant”
> could yield an annoying level of delay (e.g. I interrupt you but my
> video isn’t shown in high resolution until you finish and I am speaking
> with the highest level).

I am sure we can come up with workable approaches here too. The app 
could use a spike in the audio levels as a cue to switch active 
speakers, even if the new levels are not yet louder than those of the 
previous dominant speaker. If that turns out to be only a spike and not 
an actual change, focus could return to the previous speaker.

In the end, even if not perfectly predictive, such behaviour would still 
feel natural and close to what a person would do in real life.
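
A rough sketch of what such a heuristic could look like (again in 
TypeScript, with made-up thresholds and timings, purely to show that it 
fits in a few lines of application code; levels in -dBov, smaller = 
louder):

interface LevelSample {
    csrc: number;
    audioLevel: number; // -dBov: 0 loudest, 127 silence
}

class SpeakerSwitcher {
    private dominant: number | null = null;
    private previous: number | null = null;
    private switchedAt = 0;

    constructor(private spikeDb = 15,     // how much louder counts as a spike
                private holdMs = 2000) {} // how long before a switch "sticks"

    // Called on every poll (e.g. every 200 ms) with the current samples.
    update(samples: LevelSample[], now = Date.now()): number | null {
        if (samples.length === 0) {
            return this.dominant;
        }
        // Once the hold period has passed, the last switch becomes final.
        if (this.previous !== null && now - this.switchedAt >= this.holdMs) {
            this.previous = null;
        }
        const loudest = samples.reduce((a, b) =>
            a.audioLevel <= b.audioLevel ? a : b);
        const current = samples.find(s => s.csrc === this.dominant);

        if (this.dominant === null) {
            this.dominant = loudest.csrc;
            this.switchedAt = now;
        } else if (loudest.csrc !== this.dominant &&
                   (current === undefined ||
                    current.audioLevel - loudest.audioLevel >= this.spikeDb)) {
            // Someone is suddenly much louder than the current speaker:
            // switch tentatively, remembering where we came from.
            this.previous = this.dominant;
            this.dominant = loudest.csrc;
            this.switchedAt = now;
        } else if (this.previous !== null && loudest.csrc === this.previous) {
            // The spike did not last: fall back to the previous speaker.
            this.dominant = this.previous;
            this.previous = null;
        }
        return this.dominant;
    }
}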

> However, using audio levels to determine the
> “last N speakers” or set of speakers to show in higher resolution could
> be workable (e.g. as soon as I interrupt you, my video goes to high
> resolution along with yours).

Agreed

> In terms of the information to be provided on each contributing source,
> discussion indicated the audio level, as well as a timestamp indicating
> when the source last contributed:
>
> Various aspects of problem b) were also discussed.   It was noted that
> “dominant speaker identification” cannot necessarily be accomplished
> solely from the audio levels of the contributing sources.  For example,
> one source can be speaking and another contributor may be providing
> noise because they are not muted even though they are not speaking.  The
> audio could be analyzed on the mixer to determine which speakers should
> be shown in better resolution/framerate instead of a thumbnail, and then
> the SFU would reflect the choice in the stream sent to the browser.

Same comment as above.

Cheers,
Emil

>
> Hey all,
>
>
>
> I just posted this to the WebRTC list here:
>
>
>
> http://lists.w3.org/Archives/Public/public-webrtc/2014Jan/0256.html
>
>
>
> But I believe it's a question that is also very much worth resolving
>
> for ORTC, so I am also asking it here:
>
>
>
> One requirement that we often bump against is the possibility to
>
> extract active speaker information from an incoming *mixed* audio
>
> stream. Acquiring the CSRC list from RTP would be a good start. Audio
>
> levels as per RFC6465 would be even better.
>
>
>
> Thoughts?
>
>
>
> Emil
>
>
>
> --
>
> https://jitsi.org
>

-- 
https://jitsi.org

Received on Monday, 7 April 2014 21:13:12 UTC