- From: Harald Alvestrand <hta@google.com>
- Date: Sat, 27 Aug 2022 11:49:14 +0200
- To: "public-webrtc@W3.org" <public-webrtc@w3.org>
- Message-ID: <CAOqqYVEwE2eG=m5CD1=ATRq4Cfi-+zH5N9iBZVGMVYLf86gk+g@mail.gmail.com>
I have been working on trying to distill the essential properties of encoded data access ("Encoded Insertable Streams"), to figure out why the current interface is not right, and to lay a basis for something better. The current state of my thinking is below. Comments welcome. We will put this up as an agenda topic in Vancouver.

Harald

WebRTC Encoded Data Access - requirements

There are many potential uses for Web applications that have access to real-time video and audio channels in encoded form (in WebRTC terminology: between the encoder/decoder and the transport).

Envisioned applications

- End-to-end encryption (app-controlled) of video and audio streams
- “SFU in the browser”: selective forwarding of encoded frames to other network entities
- Alternative transport: moving frames over mechanisms other than RTP
- Alternative generators: generating frames using other mechanisms, such as WebCodecs, rather than WebRTC
- Alternative consumers: feeding frames to WebCodecs, MSE-type mechanisms or other destinations rather than WebRTC for decoding
- Integration with MSE-type content protection mechanisms

Shortcomings of Encoded Transform

Today, we have one interface that permits this: the Encoded Transform interface (also known as Insertable Streams). It is implemented for workers in Safari Technology Preview and on the main thread in Chrome (where worker processing can be achieved using transferable streams), with some API differences between the two. This interface has proved useful for its initial purpose (app-driven encryption), but has shown itself to be less flexible than desired for other applications. In particular:

- Outgoing processing: Since it does not affect SDP negotiation, the format of the media streams after processing can differ from what the packetization layer (which is configured using SDP) expects.
- Incoming processing: Since it does not affect SDP negotiation, there is no way to ensure that the processing expected on the sending side has actually been done.
- Interactions with flow control: If frames change properties after outgoing processing, the flow control’s feedback to the codec will be wrong. In particular, if the stream is diverted, feedback will say “nothing is coming”, and the lower layers may take inappropriate actions.
- Interactions with bandwidth estimation: The encoder will usually match the target bitrate to the available bandwidth, and cannot take into account overhead added e.g. by encryption. This overhead can be significant, in particular for audio: for example, encryption with GCM-256 can add a 16-byte authentication tag to a common input length of 100 bytes.

In contrast to these, the Breakout Box API (MediaStreamTrackGenerator / MediaStreamTrackProcessor), which deals with raw media, has been immune to many of these concerns, since it does not admit of any linkage between the source and the destination; all control has to be explicit.

Design: Separation of concerns

The above considerations lead to some design principles that should be followed for a new paradigm of encoded-media processing.

- There should be minimal coupling required between sources and destinations of processing. In particular, requiring that both ends are connected to a “PeerConnection”-type object is a complexifying factor and needs to be avoided.
- The information about the format of a frame needs to be carried with the frame, not assumed or signaled by out-of-band means. The codec descriptions used in WebRTC and Media Capabilities, plus the metainformation carried by the Dependency Descriptor, are probably sufficient for this purpose.
- Each frame needs to carry a timestamp. For RTP-related usage, the RTP timestamp of incoming frames needs to be preserved; for other usages, a timestamp derived from the stream start plus the position relative to the stream start needs to be carried.
- There needs to be information (“reverse data”) returned from the processing of frames: varying available bandwidth due to congestion is an important example, but so are key frame requests and loss percentages. The meaning of this data may need interpretation by the inserted processing element, and the ultimate destination of the information is unknown to the downstream element (see “minimal coupling” above). This may need to be mediated by a separate interface rather than being piggybacked on media processing.
- The processing element needs to be able to inform its upstream and downstream elements, ahead of time, of what kinds of data it intends to consume and produce. This information can be used to configure codecs or transports - and in particular to influence SDP negotiation. However, experience shows that requiring SDP is not a good idea.

Sample design

This space is intentionally left blank. We should have a clear idea of what we want to achieve before we start sketching out the IDL.
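[Editor's illustration - not part of the original mail, and deliberately not an attempt to fill in the blank sample design above.] The “reverse data” and “minimal coupling” requirements can be made concrete with a toy sketch. Every name here (FeedbackChannel, post, subscribe, the message shapes, the 16 kbit/s overhead figure) is invented for illustration only:

```javascript
// Toy sketch of "reverse data": feedback (bandwidth estimates, key frame
// requests) flows in the opposite direction of the media, over a channel
// separate from the frame streams. None of these names are proposed IDL.
class FeedbackChannel {
  constructor() {
    this.handlers = new Set();
  }
  // A downstream element (e.g. the transport) reports events upstream.
  post(message) {
    for (const handler of this.handlers) handler(message);
  }
  // An upstream element (e.g. the encoder's controller) listens for them.
  subscribe(handler) {
    this.handlers.add(handler);
  }
}

const feedback = new FeedbackChannel();
const upstreamEvents = [];

// The inserted processing element mediates the feedback: for instance, an
// encrypting transform that adds a 16-byte tag per ~100-byte audio frame
// subtracts its own (made-up) bitrate overhead from the bandwidth target
// before the encoder sees it, while passing other messages through.
const ENCRYPTION_OVERHEAD_BPS = 16_000; // hypothetical figure
feedback.subscribe((msg) => {
  if (msg.type === "bandwidth") {
    upstreamEvents.push({
      type: "bandwidth",
      bitsPerSecond: msg.bitsPerSecond - ENCRYPTION_OVERHEAD_BPS,
    });
  } else {
    upstreamEvents.push(msg); // e.g. key frame requests pass unchanged
  }
});

feedback.post({ type: "bandwidth", bitsPerSecond: 300_000 });
feedback.post({ type: "keyframe-request" });
```

The point of keeping the channel separate from the frame streams is that neither end needs to know the other's ultimate source or destination - the processing element alone decides how to interpret and translate each message.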
Received on Saturday, 27 August 2022 09:50:43 UTC