Encoded data access - Requirements for a new API

I have been trying to distil the essential properties of encoded data
access ("Encoded Insertable Streams"), to figure out why the current
interface is not right and to build a basis for something better.

Current state of my thinking below. Comments welcome.
We will put this up as an agenda topic in Vancouver.

Harald

WebRTC Encoded Data Access - requirements


There are many potential uses of Web applications that have access to
real-time video and audio channels in encoded form (in WebRTC terminology:
between the encoder/decoder and the transport).

Envisioned applications

   - End-to-end encryption (app-controlled) of video and audio streams
   - “SFU in the browser”: Selective forwarding of encoded frames to other
     network entities
   - Alternative transport: Moving frames over mechanisms other than RTP
   - Alternative generators: Generating frames using other mechanisms such
     as WebCodecs rather than WebRTC
   - Alternative consumers: Feeding frames to WebCodecs, MSE-type
     mechanisms or other destinations rather than WebRTC for decoding
   - Integration with MSE-type content protection mechanisms


Shortcomings of Encoded Transform

Today, we have one interface that permits this: the Encoded Transform
interface (also known as Insertable Streams). It is implemented for
workers in Safari Technology Preview and on the main thread in Chrome
(where worker processing can be achieved using Transferable Streams), with
some API differences between the two.

This interface has proved useful for its initial purpose (app-driven
encryption), but has shown itself to be less flexible than desired for
other applications.

In particular:


   - Outgoing processing: Since it does not affect SDP negotiation, the
     format of the media streams after processing can be different from
     what the packetization layer (which is configured using SDP) expects.
   - Incoming processing: Since it does not affect SDP negotiation, there
     is no way to ensure that the processing expected on the sending side
     has been done.
   - Interactions with flow control: If frames change properties after
     outgoing processing, the flow control’s feedback to the codec will be
     wrong. In particular, if the stream is diverted, feedback will say
     “nothing is coming”, and the lower layers may take inappropriate
     actions.
   - Interactions with bandwidth estimation: The encoder will usually match
     its target bitrate to the available bandwidth and cannot take into
     account overhead added after encoding, e.g. by encryption. This
     overhead can be significant, in particular for audio: for example,
     encryption with AES-256-GCM can add a 16-byte authentication tag,
     which is 16% on top of a common input length of 100 bytes.
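
To put numbers on the bandwidth estimation point, here is a minimal
TypeScript sketch of the arithmetic. The 16-byte tag and 100-byte frame
come from the text above; the 50 frames per second (20 ms audio frames) and
the 40 kbps estimate are assumptions chosen for illustration, and the
function names are hypothetical.

```typescript
// Relative size increase when a fixed number of bytes is appended to a frame.
function overheadFraction(payloadBytes: number, addedBytes: number): number {
  if (payloadBytes <= 0) {
    throw new RangeError("payloadBytes must be positive");
  }
  return addedBytes / payloadBytes;
}

// Bitrate the encoder could use so that encoded payload plus per-frame
// overhead still fits within the bandwidth estimate.
function encoderTargetBps(
  availableBps: number,
  framesPerSecond: number,
  addedBytesPerFrame: number,
): number {
  const overheadBps = framesPerSecond * addedBytesPerFrame * 8;
  return Math.max(0, availableBps - overheadBps);
}

const GCM_TAG_BYTES = 16;    // from the text above
const AUDIO_FRAME_BYTES = 100; // from the text above

// 16 bytes on a 100-byte frame is a 16% increase.
const tagOverhead = overheadFraction(AUDIO_FRAME_BYTES, GCM_TAG_BYTES);

// Assumed: 50 frames/s and a 40 kbps estimate. The tag alone costs
// 50 * 16 * 8 = 6400 bps, leaving 33600 bps for the encoder.
const audioTargetBps = encoderTargetBps(40_000, 50, GCM_TAG_BYTES);
```

With Encoded Transform as it stands, nothing tells the encoder about the
6400 bps, so it keeps targeting the full estimate.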


In contrast to these, the Breakout Box API (Media Stream Track Generator /
Processor), which deals with raw media, has been immune to many of these
concerns, since it does not admit of any linkage between the source and the
destination; all control has to be explicit.

Design: Separation of concerns

The above considerations lead to some design principles that should be
followed for a new paradigm of encoded-media processing.


   - There should be minimal coupling required between sources and
     destinations of processing. In particular, requiring that both ends
     are connected to a “PeerConnection”-type object is a complicating
     factor and should be avoided.
   - The information about the format of a frame needs to be carried with
     the frame, not assumed or signaled by out-of-band means. The codec
     descriptions used in WebRTC and Media Capabilities, together with the
     metainformation carried by the Dependency Descriptor, are probably
     sufficient for this purpose.
   - Each frame needs to carry a timestamp. For RTP-related usage, the RTP
     timestamp of incoming frames needs to be preserved; for other usages,
     a timestamp derived from the stream start time plus the frame's
     position relative to the stream start needs to be carried.
   - There needs to be information (“reverse data”) returned from the
     processing of frames (varying available bandwidth due to congestion
     being an important example, but also, for instance, key frame requests
     or loss percentage). The meaning of this may need interpretation by
     the inserted processing element, and the ultimate destination of the
     information is unknown to the downstream element (see “minimal
     coupling” above). This may need to be mediated as a separate interface
     rather than being piggybacked on media processing.
   - The processing element needs to be able to inform its upstream and
     downstream elements of what kinds of data it intends to consume /
     produce ahead of time. This information can be used to configure
     codecs or transports - and in particular to influence SDP negotiation.
     However, experience shows that requiring SDP is not a good idea.
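
To make a few of these requirements concrete, here is a rough TypeScript
sketch of the data shapes they imply. It is not a proposed IDL: every type,
field and function name below is hypothetical, and the 50 frames per second
and 16 bytes per frame in the example are assumptions. It only illustrates
a frame carrying its own format and timestamp, and a processing element
reinterpreting reverse data before passing it upstream.

```typescript
// Hypothetical shapes; none of these names come from an existing spec.

// Format information travels with the frame instead of out of band.
interface FrameFormat {
  mimeType: string;    // e.g. "video/VP8", in the style of WebRTC codec descriptions
  dependsOn: number[]; // IDs of frames this one depends on (Dependency Descriptor style)
}

interface EncodedFrame {
  format: FrameFormat;
  timestampUs: number;   // stream start + position, in microseconds
  rtpTimestamp?: number; // preserved unchanged for frames that arrived over RTP
  data: Uint8Array;
}

// "Reverse data" flowing from downstream to upstream, separate from media.
type ReverseData =
  | { kind: "bandwidth"; availableBps: number }
  | { kind: "keyFrameRequest" }
  | { kind: "loss"; fraction: number };

// Timestamp for non-RTP usage: stream start plus position in the stream.
function timestampFor(streamStartUs: number, positionUs: number): number {
  return streamStartUs + positionUs;
}

// An element that appends bytes to every frame has to reinterpret bandwidth
// feedback before handing it upstream; other messages pass through as-is.
function translateForUpstream(
  msg: ReverseData,
  framesPerSecond: number,
  addedBytesPerFrame: number,
): ReverseData {
  if (msg.kind !== "bandwidth") {
    return msg;
  }
  const overheadBps = framesPerSecond * addedBytesPerFrame * 8;
  return {
    kind: "bandwidth",
    availableBps: Math.max(0, msg.availableBps - overheadBps),
  };
}

// Example: the second 20 ms audio frame of a stream that started at time 0.
const exampleFrame: EncodedFrame = {
  format: { mimeType: "audio/opus", dependsOn: [] },
  timestampUs: timestampFor(0, 20_000),
  data: new Uint8Array(100),
};

// Example: an element adding 16 bytes to each of 50 frames per second
// reports 40000 - 6400 = 33600 bps to whatever sits upstream of it.
const upstreamMsg = translateForUpstream(
  { kind: "bandwidth", availableBps: 40_000 },
  50,
  16,
);
```

Note that translateForUpstream does not know whether its upstream is an
encoder, another processing element or a network relay, which is the
“minimal coupling” point from the first item.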


Sample design

This space is intentionally left blank. We should have a clear idea of what
we want to achieve before we start sketching out the IDL.

Received on Saturday, 27 August 2022 09:50:43 UTC