Name The Bird: An example machine learning use case for WebRTC

Hi,



We would like to propose a more elaborate WebRTC use case with machine learning in the devices to complement the Funny Hat use case.



The intent of this use case is to facilitate a WG discussion about federated learning in WebRTC.nv (machine learning models trained in the devices, with the updates learnt locally in each device shared into a common model) applied to audio and video media streams over the mid to long term, and to illustrate the importance of the WebRTC.nv APIs being easy to use together with other Web APIs that are likely useful in this category of web applications.



We would like to have a chance to present and discuss this topic at the face2face, and propose to add it to the use case list.



Best Regards

Göran and Stefan



Proposed use case: 



A Web game: NameTheBird.com



Participants name the bird that the web app detects on the participating devices; the app responds with appropriate augmentation injected into the returned media stream, triggered by spoken commands or gestures (i.e. not necessarily touch based).



The web application has a site-specific, federated-learning-based classifier for contextual object detection and user intent prediction, as well as media manipulation/stream augmentation that it wants to inject into the media streams of the sending and receiving devices.



The shared classification models are trained on the birds found in the participants' surroundings and on the feedback given by the participants. Each client's local model updates are up-streamed to a shared model server, which pushes updates of the global model back to the clients.
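
To make that update loop concrete, here is a minimal client-side sketch in TypeScript. The endpoint URLs, header names and the flat Float32Array weight format are assumptions made for illustration only; the real protocol would be site specific.

// Minimal sketch of the client side of the shared-model update loop.
// Endpoints, headers and the weight format are illustrative assumptions.
interface ModelWeights {
  version: number;
  values: Float32Array;
}

// Up-stream the locally computed weight update (delta) to the shared model server.
async function pushLocalUpdate(delta: Float32Array, baseVersion: number): Promise<void> {
  await fetch("https://namethebird.example/model/update", {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream",
      "X-Base-Model-Version": String(baseVersion),
    },
    body: delta,
  });
}

// Pull the latest global model the server has aggregated and published.
async function fetchGlobalModel(): Promise<ModelWeights> {
  const response = await fetch("https://namethebird.example/model/global");
  const version = Number(response.headers.get("X-Model-Version") ?? "0");
  const values = new Float32Array(await response.arrayBuffer());
  return { version, values };
}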



Implementation outline:



1) The originating (raw) media stream is cloned into separate copies for inference and for training, denoted the “inference stream” and the “training stream”; the inference stream is also the media stream shared with the peer(s). The cloning can occur at any time during a session.
2) Inference stream: a web-site-specific classifier acts on the raw inference stream, and the result is used to guide a custom encoder in the sending device and to send metadata to the server and peer devices outside the media stream. The encoder adds the appropriate augmentation, e.g. a “name this bird” sign hovering over the enlarged bird in the video case, or an enhanced bird song in the audio case. (A sketch of this path follows after the list.)
3) Training stream: the model in training classifies the raw data and evaluates the classification using user feedback, the feedback loop being web-site specific. The evaluation may be “online” or “offline”, offline meaning the training is done at a later stage on the recorded, encoded media set. (A sketch of this path also follows after the list.)
4) Both the inference and training streams may use payload protection, depending on the trust model for the compute resources of an optional intermediary server-side part of the application.
5) Both the inference and training streams use a transport object for communicating with peers or servers; in some cases the communication can be a site-specific QUIC-based transport solution, in others RTP-based.
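
As an illustration of steps 1-2, the following TypeScript sketch uses APIs available today (getUserMedia, MediaStream.clone(), a canvas for frame sampling and an RTCDataChannel for the out-of-band metadata). classifyFrame() is a placeholder for the site-specific classifier, and the hook into a custom encoder is only approximated since current APIs do not expose it.

// Sketch of steps 1-2: clone the raw stream, send the inference copy to
// peers, run a (placeholder) classifier on sampled frames and ship the
// detection metadata out of band over a data channel.
async function startInferencePath(pc: RTCPeerConnection): Promise<MediaStream> {
  const raw = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });

  // Step 1: clone into "inference stream" (shared with peers) and
  // "training stream" (kept for the training path, see the next sketch).
  const inferenceStream = raw;
  const trainingStream = raw.clone();
  inferenceStream.getTracks().forEach((t) => pc.addTrack(t, inferenceStream));

  // Step 2: metadata to server/peers travels outside the media stream.
  const metadata = pc.createDataChannel("bird-metadata");

  // Sample frames from the inference stream via a hidden <video> + canvas;
  // real hooks into a custom encoder would need WebRTC.nv-level APIs.
  const video = document.createElement("video");
  video.srcObject = inferenceStream;
  video.muted = true;
  await video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;

  setInterval(() => {
    if (video.videoWidth === 0) return; // no frame decoded yet
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);

    const detection = classifyFrame(frame); // site-specific classifier (placeholder)
    if (detection && metadata.readyState === "open") {
      // E.g. bounding box + label so the receiver can render the
      // "name this bird" augmentation over the enlarged bird.
      metadata.send(JSON.stringify(detection));
    }
  }, 200);

  return trainingStream;
}

// Placeholder for the federated-learning-based classifier; not a real API.
declare function classifyFrame(
  frame: ImageData
): { label?: string; box: [number, number, number, number] } | null;
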
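
Similarly, the offline variant of step 3 could be approximated with MediaRecorder; collectUserFeedback() and uploadTrainingExample() below are hypothetical site-specific helpers, shown only to illustrate the shape of the feedback loop.

// Sketch of step 3 (offline variant): record the training stream and pair
// the recording with the user's feedback as the training label.
function startTrainingCapture(trainingStream: MediaStream): void {
  const recorder = new MediaRecorder(trainingStream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => {
    if (e.data.size > 0) chunks.push(e.data);
  };

  recorder.onstop = () => {
    const clip = new Blob(chunks, { type: recorder.mimeType });
    const label = collectUserFeedback();  // e.g. the bird name the user gave
    uploadTrainingExample(clip, label);   // queued for later (offline) training
  };

  recorder.start(1000); // gather encoded media in 1 s chunks
}

// Placeholders for the site-specific feedback loop and training upload.
declare function collectUserFeedback(): string;
declare function uploadTrainingExample(clip: Blob, label: string): void;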
