Re: Use cases / requirements for raw data access functions from youenn fablet on 2018-05-17 (public-webrtc@w3.org from May 2018)

From: youenn fablet <yfablet@apple.com>
Date: Thu, 17 May 2018 10:15:14 -0700
To: Harald Alvestrand <harald@alvestrand.no>
Cc: "public-webrtc@w3.org" <public-webrtc@w3.org>
Message-id: <9121F59B-E416-45C9-A332-771533CEA7A3@apple.com>
Thanks Harald for writing all of this,
Some early feedback below.
 Y

> On May 16, 2018, at 10:06 AM, Harald Alvestrand <harald@alvestrand.no> wrote:
> 
> 
> This is a copy of a document we've been working on in order to collect thoughts about the need for new APIs in WebRTC "TNG".
> 
> It should bring out certain requirements that make some API proposals obvious (or not).
> Please comment!
> 
> (PDF version attached, so that the picture survives)
> 
> 
> Certain things are hard to do in the present WebRTC / MediaStreamTrack API.
> In particular, anything involving manipulation of raw data involves convoluted interfaces that impose burdens of format conversion and/or buffer copying on the user.
> 
> This document sketches the use cases that can be made possible if this access is made a lot easier and with lower overhead.
> 
> For reference, a model of the encoding / decoding pipeline in the communications use case:
> 
> 
> When doing other types of processing, the pipeline stages may be connected elsewhere; for instance, when saving to file (MediaRecorder), the “Encode” step links to a “Storage” step, not “Transport”.
> 
> The “Decode” process will include alignment of media timing with real time (NetEq / jitter buffer); the process from raw data to display will happen “as fast as possible”.
> 
> Raw Image Use Cases
> This set of use cases involves the manipulation of video after it comes from the camera, but before it goes out for transmission, or vice versa.
> Examples of apps that consume raw data from a camera or other source, producing raw data that goes out for processing:
> 
> Funny hats
> Background removal
> In-browser compositing (merge video streams)
> 
> Needed APIs:
> Get raw frames from input device or path
> Insert (processed) raw frames into output device or path


This makes a lot of sense to me.
It would make sense to mirror the capabilities of web audio here:
- The API should be able to process any source (camera, peer connection, probably canvas), which means handling potentially different frame formats.
- The API should be able to produce a source consumable by peer connections and video elements.
- The API should allow doing as much of the processing as possible (ideally all of it) off the main thread.
- The API should allow leveraging existing APIs such as WASM, WebGL...
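
To make the shape of such an API a bit more concrete, here is a minimal sketch of per-frame processing written as plain functions. Everything in it is an assumption for illustration: `RawFrame`, `FrameTransform`, `grayscale` and `pipeline` are hypothetical names, not part of any existing or proposed WebRTC API, and the frame format is assumed to be tightly packed RGBA.

```typescript
// Hypothetical raw frame: tightly packed RGBA, 4 bytes per pixel (assumption).
interface RawFrame {
  width: number;
  height: number;
  timestampMs: number;
  data: Uint8ClampedArray; // expected length: width * height * 4
}

// A processing stage takes one raw frame and returns a processed one.
type FrameTransform = (frame: RawFrame) => RawFrame;

// Example stage: grayscale conversion, standing in for "funny hats",
// background removal, and so on. The input frame is not modified.
function grayscale(frame: RawFrame): RawFrame {
  const out = new Uint8ClampedArray(frame.data.length);
  for (let i = 0; i < frame.data.length; i += 4) {
    // Rec. 601 luma weights applied to R, G, B.
    const y =
      0.299 * frame.data[i] +
      0.587 * frame.data[i + 1] +
      0.114 * frame.data[i + 2];
    out[i] = y;
    out[i + 1] = y;
    out[i + 2] = y;
    out[i + 3] = frame.data[i + 3]; // keep alpha as-is
  }
  return { ...frame, data: out };
}

// Chain several stages into one transform, applied left to right.
function pipeline(...stages: FrameTransform[]): FrameTransform {
  return (frame) => stages.reduce((f, stage) => stage(f), frame);
}
```

Because the stages are pure functions over typed arrays, nothing here depends on the main thread; the same code could in principle run in a worker or be replaced by a WASM or WebGL implementation.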

> 
> Non-Standard Encoders
> This set of tools can be useful for either special types of operations (like detecting face movement and sending only those for backprojection on a model head rather than sending the picture of the face) or for testing out experimental codecs without involving browser changes (such as novel SVC or simulcast strategies).

Given the potential complexity here and below, compelling use cases seem really important to me.
I am not sure experimental codecs meet the bar and require a standard API.
An experiment can always be done using a proprietary API, available to browser extensions for instance.

As for special types of operations like detecting face movement, there might be alternatives using the raw image API:
- Skip frames (say, when no head is detected)
- Generate structured data (e.g. an image descriptor) and send it over a data channel
- Transform an image before encoding/after decoding
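
A rough sketch of how these three alternatives could look from script, assuming a raw image API existed. All names (`RawFrame`, `HeadBox`, `HeadDetector`, `Outcome`, `processFrame`) are hypothetical, and the detector is passed in as a parameter because no particular detection method is implied.

```typescript
// Hypothetical raw frame and detection result types (assumptions).
interface RawFrame {
  width: number;
  height: number;
  data: Uint8ClampedArray;
}

interface HeadBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// The detector is supplied by the application; returns null if no head is found.
type HeadDetector = (frame: RawFrame) => HeadBox | null;

// What the application decides to do with one frame.
type Outcome =
  | { kind: "skip" }                      // drop the frame entirely
  | { kind: "descriptor"; json: string }  // send structured data over a data channel
  | { kind: "frame"; frame: RawFrame };   // hand a (possibly transformed) frame to the encoder

function processFrame(
  frame: RawFrame,
  detect: HeadDetector,
  mode: "descriptor" | "frame",
): Outcome {
  const head = detect(frame);
  if (head === null) {
    return { kind: "skip" };
  }
  if (mode === "descriptor") {
    return { kind: "descriptor", json: JSON.stringify(head) };
  }
  // A real application would transform the frame here before encoding.
  return { kind: "frame", frame };
}
```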

> 
> Needed APIs, send side:
> 
> Get raw frames from input device
> Insert encoded frames on output transmission channel
> Manipulate transmission setup so that normal encoder resources are not needed
> 
> Needed APIs, receive side:
> Signalling access so that one knows what codec has been agreed for use
> Get encoded frames from the input transmission channel
> Insert raw (decoded) frames into output device or path
> 
> Pre/post-transmission processing - Bring Your Own Encryption
> 
> This is the inverse of the situation above: One has a video stream and wishes to encode it into a known codec, but process the data further in some way before sending it.
> The example in the title is one use case.
> The same APIs will also allow the usage of different transmission media (media over the data channel, or media over protobufs over QUIC streams, for instance).
> 

I like this BYO encryption use case.
Note, though, that it does not specifically require access to the encoded frames before doing the encryption.
We could envision an API that provides the encryption parameters (e.g. keys) so that the browser does the encryption by itself.
Of course, it has pros (simple to implement, simple to use) and cons (narrow scope).
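
To illustrate the difference between the two shapes, here is a minimal sketch. The names `EncodedFrame`, `EncryptionParams`, `EncodedFrameTransform` and `xorTransform` are hypothetical and do not correspond to any existing API. The XOR step is only a placeholder so the example can run; it is not secure and stands in for whatever real cipher would be used.

```typescript
// Hypothetical encoded frame: opaque codec payload plus metadata (assumption).
interface EncodedFrame {
  timestampMs: number;
  payload: Uint8Array;
}

// Narrow option: the application only hands key material to the browser,
// which would then do the encryption itself.
interface EncryptionParams {
  key: Uint8Array;
}

// General option: script sees each encoded frame before it is sent.
type EncodedFrameTransform = (frame: EncodedFrame) => EncodedFrame;

// Placeholder transform: XOR with a repeating key. NOT real encryption;
// it only marks where a real cipher would sit in the pipeline.
function xorTransform(params: EncryptionParams): EncodedFrameTransform {
  if (params.key.length === 0) {
    throw new Error("key must not be empty");
  }
  return (frame) => {
    const out = new Uint8Array(frame.payload.length);
    for (let i = 0; i < out.length; i++) {
      out[i] = frame.payload[i] ^ params.key[i % params.key.length];
    }
    return { ...frame, payload: out };
  };
}
```

With the narrow option only `EncryptionParams` would cross the API boundary; with the general option the whole `EncodedFrameTransform` runs in script, which is where the extra power and the extra complexity both come from.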

I am not against adding support for scripting between encoding frames and sending the encoded frames.
It seems like a powerful API.
We must weigh, though, how much ground we gain against how much complexity we add, and how far this addresses actual needs of the community.

Note also that getting the encoded frames, processing them and sending them to the network is currently done off the main thread.
One general concern is that the more we add JavaScript at various points of the pipeline, the more we might decrease the efficiency/stability/interoperability of the realtime pipeline.

> Needed APIs, encode:
> Codec configuration - the stuff that usually happens at offer/answer time
> Getting the encoded frames from the “output” channel
> Inserting the processed encoded frames into the real “output” channel
> Reaction to congestion information from the output channel
> Feeding congestion signals into the encoder 
> 
> Needed APIs, decode:
> Codec configuration information
> Getting the encoded frames from the input transport
> Inserting the processed encoded frames into the input decoding process
> 
> The same APIs are needed for other functions, such as:
> ML-NetEq: Jitter buffer control in other ways than the built-in browser
> This also needs the ability to turn off the built-in jitter buffer, and therefore makes this API have the same timing requirements as dealing with raw data
> ML-FEC: Application-defined strategies for recovering from lost packets.
> Alternative transmission: Using something other than browser’s built-in realtime transport (currently SRTP) to move the media data
> 
> 
> -- 
> Surveillance is pervasive. Go Dark.
> <Raw Data Access - Explainer.pdf>
Received on Thursday, 17 May 2018 17:15:47 UTC