Name The Bird: An example machine learning use case for WebRTC

Hi,



We would like to propose a more elaborate WebRTC use case with machine learning in the devices to complement the Funny Hat use case.



The intent of this use case is to facilitate a WG discussion about federated learning in WebRTC.nv (machine learning models trained in the devices, with the updates learnt locally in each device shared into a common model) applied to audio and video media streams over the mid to long term, and to illustrate the importance of the WebRTC.nv APIs being easy to use together with other Web APIs that are likely useful in this category of web applications.



We would like to have a chance to present and discuss this topic at the face2face, and propose to add it to the use case list.



Best Regards

Göran and Stefan



Proposed use case: 



A Web game: NameTheBird.com



Participants name the bird that the web app detects on the participating devices; the app responds with appropriate augmentation injected into the returned media stream, triggered by spoken commands or gestures (i.e. not necessarily touch based).



The web application has a site-specific, federated-learning-based classifier for contextual object detection and user intent prediction, as well as media manipulation/stream augmentation that it wants to inject into the media streams of the sending and receiving devices.



The shared classification models are trained on the birds found in the participants' surroundings and on the feedback given by the participants. Each client's local model updates are up-streamed to a shared model server, which pushes updates of the global model back to the clients.
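
To make that update loop concrete, here is a minimal client-side sketch in TypeScript. The endpoint URLs, header names and the flat Float32Array weight format are assumptions made for illustration only; the real protocol would be site specific.

// Minimal sketch of the client side of the shared-model update loop.
// Endpoints, headers and the weight format are illustrative assumptions.
interface ModelWeights {
  version: number;
  values: Float32Array;
}

// Up-stream the locally computed weight update (delta) to the shared model server.
async function pushLocalUpdate(delta: Float32Array, baseVersion: number): Promise<void> {
  await fetch("https://namethebird.example/model/update", {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream",
      "X-Base-Model-Version": String(baseVersion),
    },
    body: delta,
  });
}

// Pull the latest global model the server has aggregated and published.
async function fetchGlobalModel(): Promise<ModelWeights> {
  const response = await fetch("https://namethebird.example/model/global");
  const version = Number(response.headers.get("X-Model-Version") ?? "0");
  const values = new Float32Array(await response.arrayBuffer());
  return { version, values };
}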



Implementation outline:



1) The originating (raw) media stream is cloned into separate copies for inference and for training, denoted the “inference stream” and the “training stream”; the inference stream is also the media stream shared with the peer(s). The cloning can occur at any time during a session.
2) Inference stream: a web-site-specific classifier acts on the raw inference stream, and the result is used to guide a custom encoder in the sending device and to send metadata to the server and peer devices outside the media stream. The encoder adds the appropriate augmentation, e.g. a “name this bird” sign hovering over the enlarged bird in the video case, or an enhanced bird song in the audio case. (A sketch of this path follows after the list.)
3) Training stream: the model in training classifies the raw data and evaluates the classification using user feedback, the feedback loop being web-site specific. The evaluation may be “online” or “offline”, offline meaning the training is done at a later stage on the recorded, encoded media set. (A sketch of this path also follows after the list.)
4) Both the inference and training streams may use payload protection, depending on the trust model for the compute resources of an optional intermediary server-side part of the application.
5) Both the inference and training streams use a transport object for communicating with peers or servers; in some cases the communication can be a site-specific QUIC-based transport solution, in others RTP-based.
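
As an illustration of steps 1-2, the following TypeScript sketch uses APIs available today (getUserMedia, MediaStream.clone(), a canvas for frame sampling and an RTCDataChannel for the out-of-band metadata). classifyFrame() is a placeholder for the site-specific classifier, and the hook into a custom encoder is only approximated since current APIs do not expose it.

// Sketch of steps 1-2: clone the raw stream, send the inference copy to
// peers, run a (placeholder) classifier on sampled frames and ship the
// detection metadata out of band over a data channel.
async function startInferencePath(pc: RTCPeerConnection): Promise<MediaStream> {
  const raw = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });

  // Step 1: clone into "inference stream" (shared with peers) and
  // "training stream" (kept for the training path, see the next sketch).
  const inferenceStream = raw;
  const trainingStream = raw.clone();
  inferenceStream.getTracks().forEach((t) => pc.addTrack(t, inferenceStream));

  // Step 2: metadata to server/peers travels outside the media stream.
  const metadata = pc.createDataChannel("bird-metadata");

  // Sample frames from the inference stream via a hidden <video> + canvas;
  // real hooks into a custom encoder would need WebRTC.nv-level APIs.
  const video = document.createElement("video");
  video.srcObject = inferenceStream;
  video.muted = true;
  await video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;

  setInterval(() => {
    if (video.videoWidth === 0) return; // no frame decoded yet
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);

    const detection = classifyFrame(frame); // site-specific classifier (placeholder)
    if (detection && metadata.readyState === "open") {
      // E.g. bounding box + label so the receiver can render the
      // "name this bird" augmentation over the enlarged bird.
      metadata.send(JSON.stringify(detection));
    }
  }, 200);

  return trainingStream;
}

// Placeholder for the federated-learning-based classifier; not a real API.
declare function classifyFrame(
  frame: ImageData
): { label?: string; box: [number, number, number, number] } | null;
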
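
Similarly, the offline variant of step 3 could be approximated with MediaRecorder; collectUserFeedback() and uploadTrainingExample() below are hypothetical site-specific helpers, shown only to illustrate the shape of the feedback loop.

// Sketch of step 3 (offline variant): record the training stream and pair
// the recording with the user's feedback as the training label.
function startTrainingCapture(trainingStream: MediaStream): void {
  const recorder = new MediaRecorder(trainingStream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => {
    if (e.data.size > 0) chunks.push(e.data);
  };

  recorder.onstop = () => {
    const clip = new Blob(chunks, { type: recorder.mimeType });
    const label = collectUserFeedback();  // e.g. the bird name the user gave
    uploadTrainingExample(clip, label);   // queued for later (offline) training
  };

  recorder.start(1000); // gather encoded media in 1 s chunks
}

// Placeholders for the site-specific feedback loop and training upload.
declare function collectUserFeedback(): string;
declare function uploadTrainingExample(clip: Blob, label: string): void;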
