Re: active speaker information in mixed streams from Justin Uberti on 2014-01-31 (public-orca@w3.org from January 2014)

From: Justin Uberti <juberti@google.com>
Date: Fri, 31 Jan 2014 14:31:35 -0800
To: Roman Shpount <rshpount@turbobridge.com>
Cc: Peter Thatcher <pthatcher@google.com>, Martin Thomson <martin.thomson@gmail.com>, Emil Ivov <emcho@jitsi.org>, Bernard Aboba <Bernard.Aboba@microsoft.com>, "public-orca@w3.org" <public-orca@w3.org>
Message-ID: <CAOJ7v-2+vZBnaiFHR14vVXKyofwsyB9v7FHOZ29_7o4qi-_BOw@mail.gmail.com>
The notion of 'Nodes' is specific to WebAudio, and adding a CSRC processor
object to be vended from RTCRtpReceiver feels heavy given that this is a
bit of an edge case.

I am OK with either polling via receiver.getContributingSources, or an
event such as receiver.oncontributingsourcesupdate, where the frequency is
configurable but defaults to zero.
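
A rough sketch of what the polling option could look like in application
code. Everything here is an assumption for illustration: the receiver is
taken to expose getContributingSources() returning an array of
{ csrc, audioLevel } objects, larger audioLevel is taken to mean louder,
and pickActiveSpeaker / watchActiveSpeaker are hypothetical helper names,
not part of any proposal.

```javascript
// Hypothetical shape: receiver.getContributingSources() returns an array of
// { csrc, audioLevel } objects. Assumed: larger audioLevel means louder.

// Return the CSRC of the loudest source, or null if there are none.
function pickActiveSpeaker(sources) {
  let loudest = null;
  for (const source of sources) {
    if (loudest === null || source.audioLevel > loudest.audioLevel) {
      loudest = source;
    }
  }
  return loudest === null ? null : loudest.csrc;
}

// Poll the receiver every intervalMs and report only when the loudest
// contributing source changes. Returns a function that stops polling.
function watchActiveSpeaker(receiver, intervalMs, onChange) {
  let current = null;
  const timer = setInterval(() => {
    const next = pickActiveSpeaker(receiver.getContributingSources());
    if (next !== current) {
      current = next;
      onChange(next);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Polling at roughly 5 Hz would then be watchActiveSpeaker(receiver, 200, cb),
and the app stops the poll by calling the returned function. A real
implementation would also want some smoothing to avoid false speaker
switches.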


On Fri, Jan 31, 2014 at 12:52 PM, Roman Shpount <rshpount@turbobridge.com> wrote:

> I just copied the design from WebAudio. The benefit is the ability to
> specify multiple callbacks that are called at different frequencies, and
> the ability to get rid of a callback at will. You should be able to code
> the same thing with your API as well.
>
> _____________
> Roman Shpount
>
>
> On Fri, Jan 31, 2014 at 3:19 PM, Peter Thatcher <pthatcher@google.com> wrote:
>
>> Looks more complicated.  What's the benefit?  The callback-based
>> version of my proposal already allows specifying the frequency, and is
>> simpler.
>>
>>
>> On Thu, Jan 30, 2014 at 10:39 AM, Roman Shpount
>> <rshpount@turbobridge.com> wrote:
>> > How about something like this:
>> >
>> > ContributingSourceProcessorNode createContributingSourceProcessor(
>> >     optional unsigned long interval = 100,
>> >     optional unsigned long maxContributingSources = 16);
>> >
>> > interface ContributingSourceProcessorNode {
>> >     attribute EventHandler onContributingSourceProcess;
>> > };
>> >
>> > dictionary ContributingSource {
>> >   double packetTime;
>> >   unsigned long csrc;
>> >   long audioLevel;
>> > };
>> >
>> > interface ContributingSourceProcessingEvent : Event {
>> >     readonly attribute sequence<ContributingSource> contributingSources;
>> > };
>> >
>> > This way you can create a processor node and specify the frequency with
>> > which it should be called.
>> >
>> > _____________
>> > Roman Shpount
>> >
>> >
>> > On Thu, Jan 30, 2014 at 12:08 PM, Peter Thatcher <pthatcher@google.com>
>> > wrote:
>> >>
>> >> Would it make sense to have an async getter that calls the callback
>> >> function more than once?  For example, to get the current value once,
>> >> call like this:
>> >>
>> >> rtpReceiver.getContributorSources(function(contributorSources) {
>> >>   // Use the contributor sources just once.
>> >> });
>> >>
>> >> And to get called back every 100ms, call like this:
>> >>
>> >> rtpReceiver.getContributorSources(function(contributorSources) {
>> >>   // Use the contributor sources every 100ms.
>> >>   return true;
>> >> }, 100);
>> >>
>> >> And to stop the callback:
>> >>
>> >> rtpReceiver.getContributorSources(function(contributorSources) {
>> >>   if (iAmAllDone) {
>> >>     // I'm all done.  Don't call me anymore.
>> >>     return false;
>> >>   }
>> >>   return true;
>> >> }, 100);
>> >>
>> >>
>> >> That's somewhat halfway between an async getter and an event.  Are
>> >> there any existing HTML5 APIs like that?
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Jan 30, 2014 at 8:21 AM, Martin Thomson
>> >> <martin.thomson@gmail.com> wrote:
>> >> > If it is an event, I think that the api should choose the rate. One
>> >> > event per packet makes little sense. I think that I would run at 5-10
>> >> > updates per second, but that might depend on circumstances.
>> >> >
>> >> > On Jan 30, 2014 6:17 AM, "Emil Ivov" <emcho@jitsi.org> wrote:
>> >> >>
>> >> >> On Thu, Jan 30, 2014 at 2:10 AM, Justin Uberti <juberti@google.com>
>> >> >> wrote:
>> >> >> > As others have mentioned, the event rate here could be very high
>> >> >> > (50+ PPS), and I don't think that resolution is really needed for
>> >> >> > active speaker identification. I have seen systems that work well
>> >> >> > even when sampling this information at ~5 Hz.
>> >> >> >
>> >> >> > As such I am still inclined to leave this as a polling interface
>> >> >> > and allow apps to control the resolution by their poll rate.
>> >> >>
>> >> >> Just to make sure I understand. What is the disadvantage of making
>> >> >> this an event with an application controlled granularity?
>> >> >>
>> >> >> The two main advantages I see to keeping an event-based mechanism
>> >> >> are:
>> >> >>
>> >> >> * streams where levels don't change that often (e.g. muted streams)
>> >> >> would not cause any events, while polls would continue running.
>> >> >> * it is unlikely that people would ever need to only do a single
>> >> >> poll, so there would always be a need for periodicity. It would
>> >> >> therefore be helpful if the API provided the infrastructure for the
>> >> >> most common use case.
>> >> >>
>> >> >> Again, if the choice is between polling and not having access to
>> >> >> these fields at all, then polling it is.
>> >> >>
>> >> >> Emil
>> >> >>
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Jan 29, 2014 at 6:53 AM, Emil Ivov <emcho@jitsi.org> wrote:
>> >> >> >>
>> >> >> >> On Wed, Jan 29, 2014 at 3:14 PM, Bernard Aboba
>> >> >> >> <Bernard.Aboba@microsoft.com> wrote:
>> >> >> >> > Emil said:
>> >> >> >> >
>> >> >> >> > +1. While polling is obviously much better than nothing at all,
>> >> >> >> > having a change event would be quite convenient.
>> >> >> >> >
>> >> >> >> > With regard to energy levels, there are two main use cases:
>> >> >> >> >
>> >> >> >> > 1.  acting on changes of the current speaker (e.g. in order to
>> >> >> >> > upscale their corresponding video and thumbnail everyone else)
>> >> >> >> > 2.  showing energy levels for all participants
>> >> >> >> >
>> >> >> >> > [BA] I believe that the polling proposal could address need #2
>> >> >> >> > by delivering a list of CSRCs as well as an (averaged) level,
>> >> >> >> > but I'm not sure about #1.
>> >> >> >>
>> >> >> >> Yup, agreed.
>> >> >> >>
>> >> >> >> > #1 is about timely dominant speaker identification, presumably
>> >> >> >> > without false speaker switches.
>> >> >> >> >
>> >> >> >> > To do this well, you may need to do more than firing an event
>> >> >> >> > based on changes in a ranked list of speakers based on averaged
>> >> >> >> > levels; better approaches tend to actually process the audio.
>> >> >> >> >
>> >> >> >> > For example, see
>> >> >> >> >
>> >> >> >> > http://webee.technion.ac.il/Sites/People/IsraelCohen/Publications/CSL_2012_Volfin.pdf
>> >> >> >>
>> >> >> >> Right. That's why per-packet hdr extensions carrying the CSRC
>> >> >> >> levels would be the best (and only in the case of mixed streams)
>> >> >> >> way to implement any of the above. So, if we could have events
>> >> >> >> triggered for every new level, then we should be good. Unless I am
>> >> >> >> missing something, this should be covered by Peter's suggested API.
>> >> >> >>
>> >> >> >> Emil
>> >> >> >>
>> >> >> >> --
>> >> >> >> https://jitsi.org
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Emil Ivov, Ph.D.                       67000 Strasbourg,
>> >> >> Project Lead                           France
>> >> >> Jitsi
>> >> >> emcho@jitsi.org                        PHONE: +33.1.77.62.43.30
>> >> >> https://jitsi.org                       FAX:   +33.1.77.62.47.31
>> >> >>
>> >
>> >
>>
>
>
Received on Friday, 31 January 2014 22:32:23 UTC