Re: Raw data API - 6 - Encoders/decoders

On Wed, Jun 13, 2018 at 2:10 AM Sergio Garcia Murillo <
sergio.garcia.murillo@gmail.com> wrote:

> Hi Peter,
>
> I think this proposal fails to address some of the concerns on the
> previous emails on the subject and does not cover some of the proposed
> use cases.
>
> The encoders take a "track in; encoded frame out", so it is not possible
> to implement the "funny hat" use cases, as there is no way to provide
> raw images/audio to the encoder. It could be done by creating a "RawTrack"
> wrapper that you could feed the raw data to, but there is a missing
> piece in the puzzle.


That's true.  If we come up with a way to do raw processing, we'd need to
add "raw frame in; encoded frame out".  But that's just one method away if
we need it (.encodedFrame(rawFrame)).   Seems like a minor thing to add if
we decide to cover the funny hat use case.
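
Something like this, just to spell it out (a sketch only; appDrawsFunnyHat()
is a stand-in for whatever produces the raw frames, and the method name is
the hypothetical one above, not part of the proposal below):

    // Sketch only: a hypothetical per-frame entry point on the encoder.
    const encoder = new VideoEncoder();
    encoder.onencodedvideo = (encodedFrame) => {
      // hand the encoded frame to the RtpSender from emails #3/#4
    };
    encoder.encodedFrame(appDrawsFunnyHat());  // raw frame in; encoded frame out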


>

> Also, with the "encoded frame out" we are forcing everyone to deal with
> the encoded frames in JS, even if they just want to do the "basic
> standard pipelining". There have been several concerns raised about the
> performance/viability of this approach, so I don't feel it is wise to
> force this model to everyone. As I have said previously I am not against
> allowing the "frame out" mode, but I am against it being the only and
> default mode of operations.
>

Yes, that's a valid concern.  But the same concern applies any time the app
gets into the media flow, such as when it does processing for funny hats or
adds encryption.  It might take some implementation experience to figure out
where the real perf issues are.


>
> What I am missing is an optimized/simple streaming mode that does not
> require the encoded/raw frames to be handled individually in order to
> work. This gets back to my proposal of considering the whatwg streams as
> a viable implementation option that would cover all the use cases without
> the potential performance/lag issues of "frame mode", while still allowing
> direct frame access and manipulation.
>

If the app can't process things per-frame, then you might as well just go
back to having an RtpSender like in ORTC that doesn't require the app to do
anything on the media path.
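
For comparison, that ORTC-style path keeps the browser in charge of the whole
pipeline; roughly (parameter details omitted):

    // ORTC-style: the app never touches frames, it just wires track to transport.
    const sender = new RTCRtpSender(track, dtlsTransport);
    sender.send(rtpParameters);  // codec/encoding choices, but no per-frame work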


>
> Using the whatwg-like api, it could be possible to do
>
> source.pipeThrough(funnyHatsWebWorker)
>              .pipeThrough(encoder)
>              .pipeThrough(rtpPacketizer)
>              .pipeThrough(rtpSender)
>              .pipeTo(rtpTransport)
>
> As you can see, it allows raw modification of the stream (if required),
> but it also allows the browser to optimize the pipelining between the
> encoder and the transport without requiring going into the main js thread
> at any time.
>

But that requires every piece to be implemented by the browser, and I don't
see how that buys us much vs. an ORTC-style RtpSender.   It would be much
better if we could find a way to make wasm/js performant in the media
path.
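
For reference, the only app js/wasm in that pipeline would be an app-supplied
TransformStream, which is exactly the part whose performance I'm worried
about.  Ignoring how it gets moved into the worker, that stage would look
roughly like this (drawHatOn is a stand-in for the app's raw-frame
processing):

    // Sketch: the app-supplied stage of the pipeline above; everything
    // downstream (encoder, packetizer, transport) stays in the browser.
    const funnyHatsWebWorker = new TransformStream({
      transform(rawFrame, controller) {
        controller.enqueue(drawHatOn(rawFrame));
      }
    });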


>
> Even the discussion of where the jitter buffer should be placed would be
> elegantly solved:
>
> rtpReceiver.pipeThrough(rtpDepacketizer)
>                      .pipeThrough(jitterBuffer)
>                      .pipeTo(decoder)
>

As mentioned in previous emails, there is a large cost to splitting the
decoders and jitter buffers, and I don't see much benefit, if any.  I like
having separate components, but that particular split seems useless.
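
To make the comparison concrete, with the jitter buffer inside the decoder
the receive path stays very small (a sketch; receiveEncodedFrame() is a
stand-in for however the RtpReceiver from emails #3/#4 hands encoded frames
to the app):

    // Sketch only: frames can arrive out of order; the decoder reorders
    // internally, so there is no separate jitter buffer object to manage.
    const decoder = new AudioDecoder();
    const audioElement = document.querySelector('audio');
    audioElement.srcObject = new MediaStream([decoder.decodedAudio]);
    receiveEncodedFrame((frame) => decoder.decodeAudio(frame));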


>
> Again, I am not saying that whatwg streams are THE solution, but this is
> definitively the kind of API i would like to get in webrtc nv.
>

It would be easy to make the encoders and decoders use WHATWG streams
instead of events.  I just don't see the benefit of having an encoder
stream tied to a transport stream with no app in between except plugging it
together and then hoping that it will be performant because we expect a
sufficiently smart browser implementation.
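
For example, something like this (sketch only; an encodedAudio
ReadableStream attribute isn't in the proposal below, it's just the
streams-flavored version of onencodedaudio):

    // Sketch: same encoder as below, but encoded frames come out as a
    // WHATWG ReadableStream instead of events.
    encoder.start(track, { frameLength: 20, bitrate: 32000 });
    encoder.encodedAudio.pipeTo(new WritableStream({
      write(frame) {
        // the app sees each EncodedAudioFrame here before the transport
      }
    }));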


>
> Best regards
> Sergio
>
> On 13/06/2018 8:59, Peter Thatcher wrote:
> > Emails #3 and #4 of Harald's recent set of 5 emails covered how to get
> > encoded data in and out of RtpSender/RtpReceiver.  And that can work fine
> > if you do the encode and decode in wasm/js.   But what if you want the
> > browser to handle the codecs, or to provide hardware codecs?
> >
> > There's one more piece to the puzzle: an API for encoders and
> > decoders.  So here is email #6 (which Harald asked me to write)
> > describing how those would look.
> >
> >
> > Basically an encoder is "track in; encoded frame out" and a decoder is
> > "encoded frame in; track out".  An encoded frame is the encoded bytes
> > of the pixels of a raw video frame at a particular point in time or
> > the encoded bytes of the samples of a raw audio "frame" over a range
> > of time.
> >
> > While the app doesn't feed the raw frames directly from the track to
> > the encoder  (or from the decoder to the track), it does have direct
> > control over how the encoder encodes and can change it at any time.
> >
> > Here is how the objects could look:
> >
> > interface AudioEncoder {
> >   // Can be called any time to change parameters
> >   void start(MediaStreamTrack rawAudio, AudioEncodeParameters
> > encodeParameters);
> >   void stop();
> >   attribute EventHandler onencodedaudio;  // of EncodedAudioFrame
> > }
> >
> > dictionary AudioEncodeParameters {
> >   unsigned long frameLength;  // aka ptime, in ms
> >   unsigned long bitrate;
> >   // ...
> > }
> >
> > dictionary EncodedAudioFrame {
> >   // Start timestamp in the samplerate clock
> >   unsigned long startSampleIndex;
> >   unsigned long sampleCount;
> >   unsigned long channelCount;
> >   CodecType codecType;
> >   ByteArray encodedData;
> > }
> >
> > interface AudioDecoder {
> >   void decodeAudio(EncodedAudioFrame frame);
> >   readonly attribute MediaStreamTrack decodedAudio;
> > }
> >
> > interface VideoEncoder {
> >   // Can be called any time to change parameters
> >   void start(MediaStreamTrack rawVideo, VideoEncodeParameters
> > encodeParameters);
> >   void stop();
> >   attribute EventHandler onencodedvideo;  // of EncodedVideoFrame
> > }
> >
> > dictionary VideoEncodeParameters {
> >   unsigned long bitrate;
> >   boolean generateKeyFrame;
> >   // TODO: SVC/simulcast, resolutionScale, framerateScale, ...
> >   // ...
> > }
> >
> > dictionary EncodedVideoFrame {
> >   unsigned short width;
> >   unsigned short height;
> >   unsigned short rotationDegrees;
> >   unsigned long timestampMs;
> >   CodecType codecType;
> >   ByteArray encodedData;
> > }
> >
> > interface VideoDecoder {
> >   void decodeVideo(EncodedVideoFrame frame);
> >   readonly attribute MediaStreamTrack decodedVideo;
> > }
> >
> >
> > If you're paying attention, you may be wondering the following:
> >
> > 1.  Where is the jitter buffer?  Answer: it's in the decoder.  The
> > decoder can take out-of-order encoded frames and produce an in-order
> > track.  This is much simpler than exposing separate jitter buffer
> > and decoder objects.
> >
> > 2.  What about SVC/simulcast?  There are a few ways we could go about
> > it, depending on what we want "encoder" and "encoded frame" to mean (1
> > layer or many?).  I'm sure we'll cover that in the f2f.
> >
> >
> >
> >
>
>
>

Received on Thursday, 14 June 2018 02:36:02 UTC