
Re: Raw data API - 6 - Encoders/decoders

From: Sergio Garcia Murillo <sergio.garcia.murillo@gmail.com>
Date: Wed, 13 Jun 2018 11:10:02 +0200
To: public-webrtc@w3.org
Message-ID: <6606c80f-3239-5292-2aa5-627c5efd6bac@gmail.com>

Hi Peter,

I think this proposal fails to address some of the concerns raised in 
the previous emails on this subject, and it does not cover some of the 
proposed use cases.

The encoders take a "track in; encoded frame out" approach, so it is 
not possible to implement the "funny hat" use cases, as there is no way 
to provide raw images/audio to the encoder. It could be done by 
creating a "RawTrack" wrapper that you could feed the raw data to, but 
that is a missing piece in the puzzle.
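
To make it concrete, such a wrapper would need a shape roughly like the 
following (entirely hypothetical; nothing like it exists today, and the 
encoder it feeds is the one proposed in the email below):

    // Hypothetical only: a writable wrapper that turns app-provided raw
    // frames into a MediaStreamTrack the proposed encoder could consume.
    const rawVideo = new RawTrack({ kind: 'video' });  // RawTrack does not exist
    rawVideo.writeFrame(myProcessedFrame);             // hypothetical method
    videoEncoder.start(rawVideo.track, { bitrate: 1000000 });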

Also, with the "encoded frame out" model we are forcing everyone to 
deal with the encoded frames in JS, even if they just want to do the 
"basic standard pipelining". There have been several concerns raised 
about the performance/viability of this approach, so I don't feel it is 
wise to force this model on everyone. As I have said previously, I am 
not against allowing the "frame out" mode, but I am against it being 
the only and default mode of operation.

What I am missing is an optimized/simple streaming mode that does not 
require the encoded/raw frames to be handled individually in order to 
work. This gets back to my proposal of considering the whatwg streams 
as a viable implementation option that would cover all the use cases 
without the potential performance/lag issues of "frame mode", while 
still allowing direct frame access and manipulation.

Using a whatwg-like API, it could be possible to do:

source.pipeThrough(funnyHatsWebWorker)
             .pipeThrough(encoder)
             .pipeThrough(rtpPacketizer)
             .pipeThrough(rtpSender)
             .pipeTo(rtpTransport)

As you can see, it allows raw modification of the stream (if required), 
but it also allows the browser to optimize the pipelining between the 
encoder and the transport without ever going through the main JS thread.
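
For example, the funnyHatsWebWorker stage in that chain could be a 
plain whatwg TransformStream (a sketch only; the shape of the raw 
frames flowing through it is an assumption, not anything specified 
today):

    // Sketch: a transform stage that receives each raw frame, optionally
    // manipulates its pixels, and passes it on down the pipe.
    const funnyHatsWebWorker = new TransformStream({
      transform(frame, controller) {
        drawFunnyHat(frame);        // hypothetical pixel manipulation
        controller.enqueue(frame);  // pass the (possibly modified) frame on
      }
    });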

Even the discussion of where the jitter buffer should be placed would be 
elegantly solved:

rtpReceiver.pipeThrough(rtpDepacketizer)
                     .pipeThrough(jitterBuffer)
                     .pipeTo(decoder)
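
The jitterBuffer stage above could likewise be a TransformStream. A 
grossly simplified reordering sketch (a real jitter buffer also paces 
on time; the seq field on depacketized frames is an assumption):

    let next = null;
    const held = new Map();
    const jitterBuffer = new TransformStream({
      transform(frame, controller) {
        if (next === null) next = frame.seq;  // start from the first seq seen
        held.set(frame.seq, frame);           // frame.seq is hypothetical
        while (held.has(next)) {              // release contiguous frames in order
          controller.enqueue(held.get(next));
          held.delete(next++);
        }
      }
    });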

Again, I am not saying that whatwg streams are THE solution, but this 
is definitely the kind of API I would like to get in WebRTC NV.

Best regards
Sergio

On 13/06/2018 8:59, Peter Thatcher wrote:
> Emails #3 and #4 of Harald's recent set of 5 emails covered how to get 
> encoded data in and out of RtpSender/RtpReceiver.  And that could work 
> fine if you do the encode and decode in wasm/js.  But what if you want 
> the browser to handle the codecs, or provide hardware codecs?
>
> There's one more piece to the puzzle: an API for encoders and 
> decoders.  So here is email #6 (which Harald asked me to write) 
> describing how those would look.
>
>
> Basically an encoder is "track in; encoded frame out" and a decoder is 
> "encoded frame in; track out".  An encoded frame is the encoded bytes 
> of the pixels of a raw video frame at a particular point in time or 
> the encoded bytes of the samples of a raw audio "frame" over a range 
> of time.
>
> While the app doesn't feed the raw frames directly from the track to 
> the encoder  (or from the decoder to the track), it does have direct 
> control over how the encoder encodes and can change it at any time.
>
> Here is how the objects could look:
>
> interface AudioEncoder {
>   // Can be called any time to change parameters
>   void start(MediaStreamTrack rawAudio,
>              AudioEncodeParameters encodeParameters);
>   void stop();
>   attribute eventhandler onencodedaudio;  // of EncodedAudioFrame
> }
>
> dictionary AudioEncodeParameters {
>   unsigned long frameLength;  // aka ptime, in ms
>   unsigned long bitrate;
>   // ...
> }
>
> dictionary EncodedAudioFrame {
>   // Start timestamp in the samplerate clock
>   unsigned long startSampleIndex;
>   unsigned int sampleCount;
>   unsigned int channelCount;
>   CodecType codecType;
>   ByteArray encodedData;
> }
>
> interface AudioDecoder {
>   void decodeAudio(EncodedAudioFrame frame);
>   readonly attribute MediaStreamTrack decodedAudio;
> }
>
> interface VideoEncoder {
>   // Can be called any time to change parameters
>   void start(MediaStreamTrack rawVideo,
>              VideoEncodeParameters encodeParameters);
>   void stop();
>   attribute eventhandler onencodedvideo;  // of EncodedVideoFrame
> }
>
> dictionary VideoEncodeParameters {
>   unsigned long bitrate;
>   boolean generateKeyFrame;
>   // TODO: SVC/simulcast, resolutionScale, framerateScale, ...
>   // ...
> }
>
> dictionary EncodedVideoFrame {
>   unsigned short width;
>   unsigned short height;
>   unsigned short rotationDegrees;
>   unsigned long timestampMs;
>   CodecType codecType;
>   ByteArray encodedData;
> }
>
> interface VideoDecoder {
>   void decodeVideo(EncodedVideoFrame frame);
>   readonly attribute MediaStreamTrack decodedVideo;
> }
>
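> A minimal sketch of how these objects might be wired together (the
> constructors, the send/receive hooks, and the cameraTrack/remoteVideo
> placeholders are assumptions; otherwise only members from the IDL above
> are used):
>
>   const encoder = new VideoEncoder();
>   encoder.onencodedvideo = (frame) => sendEncodedVideo(frame);  // hypothetical send hook
>   encoder.start(cameraTrack, {bitrate: 1000000});
>   // Parameters can be changed at any time by calling start() again:
>   encoder.start(cameraTrack, {bitrate: 500000, generateKeyFrame: true});
>
>   const decoder = new VideoDecoder();
>   remoteVideo.srcObject = new MediaStream([decoder.decodedVideo]);
>   onEncodedVideoReceived((frame) => decoder.decodeVideo(frame));  // hypothetical receive hook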
>
> If you're paying attention, you may be wondering the following:
>
> 1.  Where is the jitter buffer?  Answer: it's in the decoder.  The 
> decoder can take out-of-order encoded frames and produce an in-order 
> track.  This is much simpler than exposing separate jitter buffer and 
> decoder objects.
>
> 2.  What about SVC/simulcast?  There are a few ways we could go about 
> it, depending on what we want "encoder" and "encoded frame" to mean (1 
> layer or many?).  I'm sure we'll cover that at the f2f.
>
>
>
>
Received on Wednesday, 13 June 2018 09:09:32 UTC
