- From: Harald Alvestrand <hta@google.com>
- Date: Sat, 27 Aug 2022 11:49:14 +0200
- To: "public-webrtc@W3.org" <public-webrtc@w3.org>
- Message-ID: <CAOqqYVEwE2eG=m5CD1=ATRq4Cfi-+zH5N9iBZVGMVYLf86gk+g@mail.gmail.com>
I have been working on distilling the essential properties of the
encoded data access ("Encoded Insertable Streams"), to figure out why the
current interface is not right, and to lay a basis for something better.
Current state of my thinking below. Comments welcome.
We will put this up as an agenda topic in Vancouver.
Harald
WebRTC Encoded Data Access - requirements
There are many potential uses of Web applications that have access to
real-time video and audio channels in encoded form (in WebRTC terminology:
between the encoder/decoder and the transport).
Envisioned applications
- End-to-end encryption (app-controlled) of video and audio streams
- “SFU in the browser”: Selective forwarding of encoded frames to other network entities
- Alternative transport: Moving frames over mechanisms other than RTP
- Alternative generators: Generating frames using other mechanisms such as WebCodecs rather than WebRTC
- Alternative consumers: Feeding frames to WebCodecs, MSE-type mechanisms, or other destinations rather than WebRTC for decoding
- Integration with MSE-type content protection mechanisms
Shortcomings of Encoded Transform
Today, we have one interface that permits this: the Encoded Transform
interface (also known as InsertableStream). It is implemented for
workers in Safari Tech Preview and on the main thread in Chrome
(where worker processing can be achieved using Transferable Streams),
with some API differences between the two.
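To make the discussion concrete, the processing pattern the interface enables can be sketched with a TransformStream. In a real page, the stream pair would come from the Chrome-shape `sender.createEncodedStreams()` (or an `RTCRtpScriptTransform` worker in Safari); here, mock frames stand in for `RTCEncodedVideoFrame`, and a trivial XOR stands in for app-driven encryption, so the pipeline shape is visible on its own. This is a sketch under those assumptions, not a faithful implementation.

```javascript
// Sketch of the Encoded Transform processing pattern. A trivial
// "transform" XORs every payload byte with a fixed key byte, standing
// in for app-driven encryption of the encoded frame data.
const KEY = 0x2a; // illustrative key byte, not a real cipher

const transform = new TransformStream({
  transform(frame, controller) {
    const bytes = new Uint8Array(frame.data);
    for (let i = 0; i < bytes.length; i++) bytes[i] ^= KEY;
    frame.data = bytes.buffer; // write the modified payload back
    controller.enqueue(frame); // pass the frame downstream
  },
});

// Mock source and sink; in a real page this would be:
//   const { readable, writable } = sender.createEncodedStreams();
//   readable.pipeThrough(transform).pipeTo(writable);
async function run(frames) {
  const out = [];
  const source = new ReadableStream({
    start(controller) {
      for (const f of frames) controller.enqueue(f);
      controller.close();
    },
  });
  const sink = new WritableStream({ write(f) { out.push(f); } });
  await source.pipeThrough(transform).pipeTo(sink);
  return out;
}
```

Note that nothing in this pipeline tells SDP negotiation, flow control, or bandwidth estimation that the payload has changed, which is exactly the class of problems listed below.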
This interface has proved useful for its initial purpose (app-driven
encryption), but has shown itself to be less flexible than desired for
other applications.
In particular:
- Outgoing processing: Since it does not affect SDP negotiation, the format of the media streams after processing can be different from what the packetization layer (which is configured using SDP) expects.
- Incoming processing: Since it does not affect SDP negotiation, there is no way to ensure that the processing expected on the sending side has been done.
- Interactions with flow control: If frames change properties after outgoing processing, the flow control’s feedback to the codec will be wrong. In particular, if the stream is diverted, the feedback will say “nothing is coming”, and the lower layers may take inappropriate actions.
- Interactions with bandwidth estimation: The encoder will usually match the target bitrate to the available bandwidth, and cannot take into account overhead added e.g. by encryption. This overhead can be significant, in particular for audio: encryption with GCM-256 adds a 16-byte authentication tag to a common input length of around 100 bytes.
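The audio figure above can be checked with simple arithmetic; the payload sizes in this sketch are illustrative, but they show why a fixed-size tag hurts small audio frames far more than large video frames.

```javascript
// Relative overhead of a fixed 16-byte GCM authentication tag.
// The tag size is constant regardless of payload size, so the
// relative cost grows as payloads shrink.
const TAG_BYTES = 16;

function overheadPercent(payloadBytes) {
  return (TAG_BYTES * 100) / payloadBytes;
}

// A 100-byte audio frame pays 16% extra;
// a 10 kB video keyframe pays only about 0.16%.
const audio = overheadPercent(100); // → 16
const video = overheadPercent(10000);
```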
In contrast to these, the Breakout Box API (Media Stream Track Generator /
Processor), which deals with raw media, has been immune to many of these
concerns, since it does not admit of any linkage between the source and the
destination; all control has to be explicit.
Design: Separation of concerns
The above considerations lead to some design principles that should be
followed for a new paradigm of encoded-media processing.
- There should be minimal coupling required between sources and destinations of processing. In particular, requiring that both ends be connected to a “PeerConnection”-type object is a complicating factor and should be avoided.
- The information about the format of a frame needs to be carried with the frame, not assumed or signaled by out-of-band means. The codec descriptions used in WebRTC and Media Capabilities, plus the metainformation carried by the Dependency Descriptor, are probably sufficient for this purpose.
- Each frame needs to carry a timestamp. For RTP-related usage, the RTP timestamp of incoming frames needs to be preserved; for other usages, a timestamp derived from the stream start plus the position relative to the stream start needs to be carried.
- There needs to be information (“reverse data”) returned from the processing of frames (varying available bandwidth due to congestion being an important example, but also, for instance, key frame requests or loss percentage). The meaning of this may need interpretation by the inserted processing element, and the ultimate destination of the information is unknown to the downstream element (see “minimal coupling” above). This may need to be mediated as a separate interface rather than being piggybacked on media processing.
- The processing element needs to be able to inform its upstream and downstream elements ahead of time of what kinds of data it intends to consume / produce. This information can be used to configure codecs or transports, and in particular to influence SDP negotiation. However, experience shows that requiring SDP is not a good idea.
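The timestamp principle above can be sketched as a small helper. The field names (`rtpTimestamp`, `captureTimeMs`) and the 90 kHz clock rate (the conventional RTP video clock) are assumptions for illustration, not a proposed API.

```javascript
// Derive a per-frame timestamp as the principle requires: preserve the
// RTP timestamp when one exists; otherwise synthesize one from the
// stream start plus the frame's position relative to that start.
// 90 kHz is the conventional RTP video clock; audio would use the
// codec's sample rate instead. Field names are hypothetical.
const VIDEO_CLOCK_RATE = 90000;

function frameTimestamp(frame, streamStartMs, clockRate = VIDEO_CLOCK_RATE) {
  if (frame.rtpTimestamp !== undefined) {
    return frame.rtpTimestamp; // RTP-related usage: preserve as-is
  }
  // Other usages: stream start + relative position, in clock-rate ticks.
  const elapsedMs = frame.captureTimeMs - streamStartMs;
  return Math.round((elapsedMs * clockRate) / 1000);
}
```

For example, a non-RTP frame captured 33 ms after the stream start would get a timestamp of 2970 ticks at the assumed 90 kHz clock.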
Sample design
This space is intentionally left blank. We should have a clear idea of what
we want to achieve before we start sketching out the IDL.
Received on Saturday, 27 August 2022 09:50:43 UTC