- From: Peter Thatcher <pthatcher@google.com>
- Date: Wed, 23 May 2018 09:34:40 -0700
- To: Sergio Garcia Murillo <sergio.garcia.murillo@gmail.com>
- Cc: public-webrtc@w3.org
- Message-ID: <CAJrXDUGWgBZmLA4Hrvbd_Q6nQawDVb0K3S7=uZ-Ky4CMKeY=YA@mail.gmail.com>
On Wed, May 23, 2018 at 4:04 AM Sergio Garcia Murillo <sergio.garcia.murillo@gmail.com> wrote:

> On 22/05/2018 23:21, Peter Thatcher wrote:
>
> I made a diagram of my thinking of the different possible API layers. Here it is:
>
> [image: RTP object stack.png]
>
> IMHO the jitter buffer should not be coupled with the decoder, as it is an RTP thing,

The jitter buffer and decoder are not RTP things. They can work with any transport (they can work with media over QUIC or SCTP, for example).

Although, I should be clear, I'm talking about the *frame* jitter buffer, not the *packet* buffer. For audio, there is no distinction between the two. But for video, there are different parts. I include the packet buffer in the (de)packetizer, or the RtpFrameTransport, not in the frame buffer. The RTP packet buffer is, indeed, RTP-specific. But the frame buffer isn't.

> if you are going to feed full frames to the decoder, it has to be on the RtpFrameTransport (why "frame"?), or better, split it into two different components.

The packet buffer, yes. The frame buffer, no. It's better that the frames coming out of the frame transport are out of order, and that the code for dealing with out-of-order frames is common across transports and more united with the decoding than with the transport.

For video, it's fine if there are two different components (a frame buffer and a decoder). In fact, I'd be in favor of that if we can make sure the performance is good and it doesn't add too much extra complexity to the API.

However, it's not that simple for audio. Doing so is theoretically possible for audio, but the audio jitter buffer implementation used by all but one WebRTC-capable browser (NetEq) has the jitter buffer and decoder tightly coupled. Decoupling them would be a very large amount of work, assuming it's possible at all while retaining the same level of performance/quality. Plus, it's a nice conceptual model: you put in out-of-order encoded frames and you get out in-order decoded frames.

As for the name "frame", that's what it's called everywhere in the WebRTC implementation and how everyone talks about it. Do you have a suggestion for a better name for a collection of pixels or a chunk of audio samples?
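To make that conceptual model concrete, here's a rough TypeScript sketch of the reordering contract. Every name in it is a placeholder invented for illustration, not proposed API, and it ignores the playout-delay and loss handling a real jitter buffer does:

// Minimal sketch of the model above: out-of-order encoded frames go in,
// in-order decoded frames come out. All names (EncodedFrame, FrameBuffer,
// decode) are hypothetical placeholders, not proposed API.

interface EncodedFrame {
  sequence: number;     // frame order assigned by the sender
  data: Uint8Array;     // depacketized, encoded payload
}

interface DecodedFrame {
  sequence: number;
  media: Uint8Array;    // stand-in for decoded pixels or audio samples
}

class FrameBuffer {
  private pending = new Map<number, EncodedFrame>();
  private nextSequence = 0;

  constructor(private decode: (frame: EncodedFrame) => DecodedFrame,
              private onDecoded: (frame: DecodedFrame) => void) {}

  // Frames may arrive in any order (straight off an RtpFrameTransport, or off
  // QUIC/SCTP); each one is held until all earlier frames have been decoded.
  putFrame(frame: EncodedFrame): void {
    this.pending.set(frame.sequence, frame);
    let next = this.pending.get(this.nextSequence);
    while (next !== undefined) {
      this.pending.delete(this.nextSequence);
      this.nextSequence += 1;
      this.onDecoded(this.decode(next));
      next = this.pending.get(this.nextSequence);
    }
  }
}

The point is just the shape of the contract: whatever hands you frames (RTP, QUIC, SCTP) can deliver them as they complete, and the reordering lives with the decoding rather than with the transport.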
> This highlights the question: what happens to RtpParameters? Some are per-transport, some are per-stream, some are per-encoder (which is different than per-stream), and some don't make sense any more.
>
> Per-transport:
> - Payload type <-> codec mapping
> - RTCP reduced size
> - Header extension ID <-> URI mapping
>
> Per-stream:
> - PTs used for PT-based demux
> - priority
> - mid
> - cname
> - ssrcs

> How about simulcast and rids?

Yeah, simulcast/SVC is a rathole of a topic I was trying to avoid dragging into this email thread.

> Per-encoder:
> - dtx
> - ptime
> - codec parameters
> - bitrate
> - maxFramerate
> - scaleResolutionDownBy
>
> Don't make sense:
> - active (just stop the encoder)
> - degradation preference (just change the encoder directly)

> Not sure if I want to get rid of the degradation preference, as currently the encoder is affected by both the bandwidth reported by the transport layer and CPU utilization.

Yeah, I could see a degradation preference on the encoder. Good point.

> Would it make sense to extend the degradation preference to also cover that scenario? Also, I am not sure if I want to get rid of the backchannel between the encoder and the transport and have to do it manually.

This is something the WG can discuss.

I think the app has to be able to control it, because the rules for bitrate allocation between audio, multiple streams of video, simulcast, and SVC just get too complex, and we should let the app decide directly how to allocate bits. However, I do see your point that easy things should be easy, so perhaps there's an automatic default behavior that works well. On the other hand, I'm in favor of low-level APIs with libraries on top of them. Wiring a BWE value into an encoder "manually" is not very hard.

> Best regards
> sergio
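For what it's worth, here's roughly what that manual wiring could look like from the app. The onbandwidthestimate event and setTargetBitrate method are invented for the sketch; nothing here is existing API:

// Rough sketch only: the event and setter names below are assumptions made
// up for illustration, not real API.

interface BandwidthEstimate {
  availableBitrate: number;  // estimated available send bitrate, in bits/s
}

interface EncoderLike {
  setTargetBitrate(bitsPerSecond: number): void;
}

interface TransportLike {
  onbandwidthestimate: ((estimate: BandwidthEstimate) => void) | null;
}

// App-chosen allocation policy: reserve up to 64 kbps for audio, give video
// whatever is left. This is exactly the kind of rule the app could own.
function wireBweToEncoders(transport: TransportLike,
                           audio: EncoderLike,
                           video: EncoderLike): void {
  transport.onbandwidthestimate = (estimate) => {
    const audioBitrate = Math.min(64_000, estimate.availableBitrate);
    audio.setTargetBitrate(audioBitrate);
    video.setTargetBitrate(Math.max(0, estimate.availableBitrate - audioBitrate));
  };
}

A library could wrap a default policy like this so the easy case stays easy, while an app doing simulcast or SVC can replace it with its own allocation rules.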
Attachments
- image/png attachment: RTP_object_stack.png
Received on Wednesday, 23 May 2018 16:35:23 UTC