Re: Use cases / requirements for raw data access functions from Justin Uberti on 2018-05-18 (public-webrtc@w3.org from May 2018)

From: Justin Uberti <juberti@google.com>
Date: Thu, 17 May 2018 22:17:28 -0700
To: yfablet@apple.com
Cc: Harald Alvestrand <harald@alvestrand.no>, public-webrtc@w3.org
Message-ID: <CAOJ7v-1bhrKHNyj0bz3m60a_3LQXAYRkZKegqXDu+Yw4zQJtBw@mail.gmail.com>
On Thu, May 17, 2018 at 10:20 AM youenn fablet <yfablet@apple.com> wrote:

> Thanks, Harald, for writing all of this.
> Some early feedback below.
> Y
>
> On May 16, 2018, at 10:06 AM, Harald Alvestrand <harald@alvestrand.no>
> wrote:
>
> *This is a copy of a document we've been working on in order to collect
> thoughts about the need for new APIs in WebRTC "TNG".*
>
> *It should bring out certain requirements that make some API proposals
> obvious (or not). Please comment! (PDF version attached, so that the
> picture survives.)
>
> Certain things are hard to do in the present WebRTC / MediaStreamTrack
> API. In particular, anything involving manipulation of raw data involves
> convoluted interfaces that impose burdens of format conversion and/or
> buffer copying on the user. This document sketches the use cases that can
> be made possible if this access is made a lot easier and with lower
> overhead.
>
> For reference, a model of the encoding / decoding pipeline in the
> communications use case:
>
> [Figure: model of the encoding / decoding pipeline; see the attached PDF]
>
> When doing other types of processing, the pipeline stages may be
> connected elsewhere; for instance, when saving to file (MediaRecorder),
> the “Encode” step links to a “Storage” step, not “Transport”. The
> “Decode” process will include alignment of media timing with real time
> (NetEq / jitter buffer); the process from raw data to display will happen
> “as fast as possible”.
>
> Raw Image Use Cases
>
> This set of use cases involves the manipulation of video after it comes
> from the camera, but before it goes out for transmission, or vice versa.
> Examples of apps that consume raw data from a camera or other source,
> producing raw data that goes out for processing:
> - Funny hats
> - Background removal
> - In-browser compositing (merge video streams)
>
> Needed APIs:
> - Get raw frames from input device or path
> - Insert (processed) raw frames into output device or path*
>
> This makes a lot of sense to me.
> It would make sense to mirror the capabilities of Web Audio here:
> - The API should be able to process any source (camera, peer connection,
> probably canvas, meaning it must handle potentially different frame
> formats)
> - The API should be able to produce a source consumable by peer
> connections and video elements.
> - The API should allow doing as much processing as possible (ideally all
> of it) off the main thread.
> - The API should allow leveraging existing APIs such as WASM, WebGL...
>
> *Non-Standard Encoders
>
> This set of tools can be useful for either special types of operations
> (like detecting face movement and sending only those for backprojection
> on a model head rather than sending the picture of the face) or for
> testing out experimental codecs without involving browser changes (such
> as novel SVC or simulcast strategies).*
>
>
> Given the potential complexity here and below, compelling use cases seem
> really important to me.
> I am not sure experimental codecs meet the bar for requiring a standard
> API. An experiment can always be done using a proprietary API, available
> to browser extensions for instance.
>
> As for special types of operation like detecting face movement, there
> might be alternatives using the raw image API:
> - Skip frames (say, when no head is detected)
> - Generate structured data (an image descriptor, e.g.) and send it over a
> data channel
> - Transform an image before encoding/after decoding
>
> *Needed APIs, send side:
> - Get raw frames from input device
> - Insert encoded frames on output transmission channel
> - Manipulate transmission setup so that normal encoder resources are not
> needed
>
> Needed APIs, receive side:
> - Signalling access so that one knows what codec has been agreed for use
> - Get encoded frames from the input transmission channel
> - Insert raw (decoded) frames into output device or path
>
> Pre/post-transmission processing - Bring Your Own Encryption
>
> This is the inverse of the situation above: one has a video stream and
> wishes to encode it into a known codec, but process the data further in
> some way before sending it. The example in the title is one use case. The
> same APIs will also allow the usage of different transmission media
> (media over the data channel, or media over protobufs over QUIC streams,
> for instance).*
>
>
> I like this BYO encryption use case.
> Note, though, that it does not specifically require access to the
> encoded frames before doing the encryption.
> We could envision an API to provide the encryption parameters (keys,
> e.g.) so that the browser does the encryption by itself.
> Of course, this has pros (simple to implement, simple to use) and cons
> (narrow scope).
>
> I am not against adding support for scripting between encoding frames and
> sending the encoded frames.
> It seems like a powerful API.
> We must weigh, though, how much ground we gain versus how much complexity
> we add, and how well we address the actual needs of the community...
>

It should be noted that mobile platforms currently provide this level of
access via MediaCodec (Android) and VideoToolbox (iOS). But I agree that
having compelling use cases is important.
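For illustration, the "Pre/post-transmission processing" hook discussed above (script sees each encoded frame between the encoder and the transport, and the inverse on receive) can be sketched as follows. This is a minimal sketch under stated assumptions: `EncodedFrame`, `FrameTransform`, `xorTransform` and `runStage` are hypothetical names, not an existing WebRTC API, and the XOR step is a toy placeholder standing in for a real authenticated cipher; it is not encryption.

```typescript
// Hypothetical encoded-frame record handed to script between the encoder
// and the transport (or between the transport and the decoder).
interface EncodedFrame {
  timestamp: number; // media timestamp (units are an assumption)
  keyFrame: boolean;
  data: Uint8Array; // codec bitstream for one frame
}

type FrameTransform = (frame: EncodedFrame) => EncodedFrame;

// Toy reversible transform: XOR the payload with a repeating key.
// Placeholder only: this is NOT real encryption; a real application would
// use an authenticated cipher (e.g. AES-GCM) here instead.
function xorTransform(key: Uint8Array): FrameTransform {
  return (frame) => {
    const out = new Uint8Array(frame.data.length);
    for (let i = 0; i < frame.data.length; i++) {
      out[i] = frame.data[i] ^ key[i % key.length];
    }
    // Metadata passes through untouched; only the payload is transformed.
    return { ...frame, data: out };
  };
}

// Hypothetical pipeline stage: run every frame through the app's transform
// before handing it on (to the transport on send, to the decoder on receive).
function runStage(
  frames: EncodedFrame[],
  transform: FrameTransform,
  next: (f: EncodedFrame) => void,
): void {
  for (const f of frames) next(transform(f));
}

// Usage: transform on the send side, invert on the receive side.
const key = new Uint8Array([0x5a, 0xc3, 0x17]);
const encoded: EncodedFrame[] = [
  { timestamp: 0, keyFrame: true, data: new Uint8Array([1, 2, 3, 4]) },
  { timestamp: 3000, keyFrame: false, data: new Uint8Array([5, 6]) },
];

const onWire: EncodedFrame[] = [];
runStage(encoded, xorTransform(key), (f) => onWire.push(f));

const decoded: EncodedFrame[] = [];
runStage(onWire, xorTransform(key), (f) => decoded.push(f)); // XOR is its own inverse
```

Because only the payload changes, timing and key-frame metadata stay visible to the transport, which is one reason such a hook can sit off the main thread without disturbing congestion control.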

>
> Also note that getting the encoded frames, processing them and
> sending them to the network is currently done off the main thread.
> One general concern is that the more JavaScript we add at various points
> of the pipeline, the more we might decrease the
> efficiency/stability/interoperability of the realtime pipeline.
>

These encoded frames are typically going to go over the network, where
unexpected delays are a fact of life, and so the system is already prepared
to deal with them, e.g., via jitter buffers. (Or, the frames will be
written to a file, and this issue is entirely moot.)

This is in contrast to the "raw image" cases, that will often be operating
in a performance-critical part of the pipeline.
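The point that the receive path already absorbs network delay can be illustrated with a minimal fixed-delay jitter buffer. This is a sketch under stated assumptions: frames carry a sender sequence number and a millisecond media timestamp, and `JitterBuffer`, `push` and `pop` are hypothetical names, not part of any browser API. Real jitter buffers such as NetEq adapt their delay dynamically and do considerably more.

```typescript
// Minimal fixed-delay jitter buffer (illustrative sketch only).
interface Frame {
  seq: number; // sender-assigned sequence number
  mediaTimeMs: number; // capture timestamp in milliseconds (assumed unit)
}

class JitterBuffer {
  private pending: Frame[] = [];
  // Offset between local arrival time and media time, fixed by the first frame.
  private baseOffsetMs: number | null = null;
  private lastReleasedSeq: number | null = null;
  private readonly playoutDelayMs: number;

  constructor(playoutDelayMs: number) {
    this.playoutDelayMs = playoutDelayMs;
  }

  // Called when a frame arrives from the network at local time nowMs.
  // Returns false if the frame arrived too late to be played out in order.
  push(frame: Frame, nowMs: number): boolean {
    if (this.lastReleasedSeq !== null && frame.seq <= this.lastReleasedSeq) {
      return false;
    }
    if (this.baseOffsetMs === null) {
      this.baseOffsetMs = nowMs - frame.mediaTimeMs;
    }
    this.pending.push(frame);
    this.pending.sort((a, b) => a.seq - b.seq); // restore sender order
    return true;
  }

  // Releases, in sequence order, every frame whose playout time has come.
  pop(nowMs: number): Frame[] {
    const base = this.baseOffsetMs;
    if (base === null) return [];
    const ready: Frame[] = [];
    while (this.pending.length > 0) {
      const head = this.pending[0];
      const playoutMs = head.mediaTimeMs + base + this.playoutDelayMs;
      if (playoutMs > nowMs) break; // head not due yet; later frames wait too
      ready.push(head);
      this.pending.shift();
      this.lastReleasedSeq = head.seq;
    }
    return ready;
  }
}

// Usage: frames are sent 33 ms apart, but seq 2 overtakes seq 1 in transit.
const jb = new JitterBuffer(60);
const out: number[] = [];
jb.push({ seq: 0, mediaTimeMs: 0 }, 100);
for (const f of jb.pop(160)) out.push(f.seq);
jb.push({ seq: 2, mediaTimeMs: 66 }, 170);
jb.push({ seq: 1, mediaTimeMs: 33 }, 180); // delayed, but within the 60 ms budget
for (const f of jb.pop(200)) out.push(f.seq);
for (const f of jb.pop(230)) out.push(f.seq);
```

Releasing frames on arrival would have yielded the order 0, 2, 1; the 60 ms playout delay absorbs the reordering at the cost of added latency. A raw-image stage has no such slack, since any time it spends adds directly to the frame's path to the encoder or the display.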

>
> *Needed APIs, encode:
> - Codec configuration - the stuff that usually happens at offer/answer
> time
> - Getting the encoded frames from the “output” channel
> - Inserting the processed encoded frames into the real “output” channel
> - Reaction to congestion information from the output channel
> - Feeding congestion signals into the encoder
>
> Needed APIs, decode:
> - Codec configuration information
> - Getting the encoded frames from the input transport
> - Inserting the processed encoded frames into the input decoding process
>
> The same APIs are needed for other functions, such as:
> - ML-NetEq: jitter buffer control other than the browser's built-in one.
> This also needs the ability to turn off the built-in jitter buffer, and
> therefore gives this API the same timing requirements as dealing with
> raw data.
> - ML-FEC: application-defined strategies for recovering from lost
> packets.
> - Alternative transmission: using something other than the browser’s
> built-in realtime transport (currently SRTP) to move the media data.*
>
> --
> Surveillance is pervasive. Go Dark.
>
> <Raw Data Access - Explainer.pdf>
>
Received on Friday, 18 May 2018 05:18:11 UTC