Some thoughts on simulcast/layered coding support in ORCA API

Erik asked me for my thoughts on how scalable video coding (and simulcast) should be handled in the ORTC API, so I thought I would post them here.  

Before reading this message, I would recommend that people look at the following draft as a point of departure, since it discusses requirements and use cases: 
http://tools.ietf.org/html/draft-garcia-simulcast-and-layered-video-webrtc

While I'm largely in agreement with what the document says about temporal scaling, there are probably some disagreements relating to spatial scaling and/or spatial simulcast. 

In general, I agree with the small set of requirements relating to temporal scaling (and think they can be reduced even further).  

   o  REQ-8.  It must be possible to configure the number of temporal
      layers (1 to 4).  This should be the only mandatory parameter when
      enabling temporal scalability.

[BA] IMHO, it is useful to be able to retrieve from the browser the maximum number of temporal layers 
supported for send/receive, so as to be able to signal this to the peer if necessary.   I also think that
an application should be able to set the maximum number of layers sent/received.  However, 
the application doesn't need to control the number of layers sent or received on an ongoing basis, 
since this can be handled by the browser with no API controls.  

In practice the layer add/drops can happen very quickly (several times a second) and will be based on 
congestion state, so application control is not feasible and could even be dangerous. 
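
To make this concrete, here is a minimal TypeScript sketch of the kind of surface I have in mind.  Every name
in it (the capability/settings shapes and the two functions) is an illustrative assumption, not proposed API:
the application reads a maximum, optionally caps it, and otherwise stays out of the per-frame layer decisions.

    // Hypothetical capability/settings surface for temporal scalability.
    interface TemporalCapabilities {
      maxTemporalLayersSend: number;     // e.g. 3
      maxTemporalLayersReceive: number;  // e.g. 4
    }
    interface TemporalSettings {
      // Upper bound requested by the application; the browser may use
      // fewer layers at any moment based on congestion state.
      maxTemporalLayers?: number;
    }
    declare function getTemporalCapabilities(kind: "video"): TemporalCapabilities;
    declare function applyTemporalSettings(s: TemporalSettings): void;

    const caps = getTemporalCapabilities("video");
    // ...signal caps to the peer if necessary, then set a cap and leave the
    // ongoing layer add/drop entirely to the browser:
    applyTemporalSettings({ maxTemporalLayers: Math.min(2, caps.maxTemporalLayersSend) });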

   o  REQ-9.  It must be possible to configure the bitrate, frame rate
      decimation factor and membership of frames to layers for each 
      temporal layer of the VP8 stream. 

[BA] I would suggest that only frame rate is fundamental here.  The other parameters are codec specific. 
In particular, it seems useful for an application to be able to retrieve the frame rate configuration that
the browser supports so it could signal this.  However, the application may not necessarily be able to
control the frame rate of each layer.  For example, in a given implementation the application might
discover that the base frame rate is 7.5 frames/second, and the extensions are 7.5 and 15 frames/second. Take
it or leave it!

For temporal scaling, the base layer frame rate is the most important parameter, and logically will determine 
the frame rates of extension layers, which are typically designed to allow multiplicative increases in frame rate.  
So there is not really an infinite degree of flexibility here, and you don't want to give the application so much
rope it can hang itself. 

Allowing each extension layer to have frame rate determined independently could result in configuration 
requests that a given implementation might not actually be able to carry out and that could play havoc with 
congestion control.  
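
To illustrate why the base frame rate is the interesting knob, here is a small sketch (hypothetical names,
not API) of a temporal ladder in which each extension layer doubles the cumulative frame rate, matching the
7.5/15/30 example above.  The application would discover such a ladder rather than dictate it.

    // Hypothetical read-only description of a temporal layer ladder in
    // which each extension layer doubles the cumulative frame rate.
    interface TemporalLayerInfo {
      layer: number;          // 0 = base layer
      addedFps: number;
      cumulativeFps: number;
    }

    function describeLadder(baseFps: number, layers: number): TemporalLayerInfo[] {
      const ladder: TemporalLayerInfo[] = [];
      let cumulative = 0;
      for (let i = 0; i < layers; i++) {
        const added = i === 0 ? baseFps : cumulative;  // doubling pattern
        cumulative += added;
        ladder.push({ layer: i, addedFps: added, cumulativeFps: cumulative });
      }
      return ladder;
    }

    // describeLadder(7.5, 3) -> base 7.5 fps, extensions adding 7.5 and 15,
    // for cumulative rates of 7.5, 15 and 30 frames/second.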

The question which I suspect will kick off more debate is how to handle spatial simulcast and/or layering. 

A major use case is described in Section 2.4 "increasing video quality", where the application will want to switch
from a thumbnail to a larger resolution, potentially because the active speaker changed or for some other reason.

I agree that this is a real scenario but also would caution that there are lots of situations where the resolution will
be changed by the browser without application control.  In reality, supported spatial resolutions are typically not 
infinitely variable.  It makes no sense for an application to change the aspect ratio frequently, for example -- that is 
disconcerting to the user.   To avoid doing this, a set of resolutions with the same aspect ratio may be supported,
allowing the resolution to change while the window size may not change at all (just the quality).

For example, the active speaker might be 640 by 320 and then, because of a lack of available bandwidth, a lower 
resolution simulcast of 320 by 160 might be selected by the MANE.   This wouldn't necessarily imply demotion to
a thumbnail, just a decrease in resolution made necessary by an increase in congestion.  Therefore this could be
a decision made entirely by the sender. 
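
As a sketch of that constraint (invented types, nothing normative): the sender exposes a small ladder of
same-aspect-ratio resolutions, and the congestion-driven downswitch simply moves to the next rung rather
than to an arbitrary size.

    // Hypothetical fixed resolution ladder with a constant aspect ratio.
    interface Resolution { width: number; height: number; }

    const ladder: Resolution[] = [
      { width: 640, height: 320 },
      { width: 320, height: 160 },
      { width: 160, height: 80 },
    ];

    // Pick the largest rung whose estimated bitrate fits the available
    // bandwidth; estimateKbps stands in for whatever rate model is used.
    function selectRung(availableKbps: number,
                        estimateKbps: (r: Resolution) => number): Resolution {
      for (const r of ladder) {
        if (estimateKbps(r) <= availableKbps) return r;
      }
      return ladder[ladder.length - 1];  // fall back to the smallest rung
    }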

IMHO, an API should allow the application to retrieve the maximum number of simulcasts to be sent/received so 
this can be conveyed in signaling.   It should also be able to decide how many streams to send, and how many it
could receive.  However, it should be understood that, within those parameters, in practice the mixer will make the 
decision about which simulcast stream it sends based on bandwidth availability (which could change very quickly).  
So while the receiver should be able to pause/resume simulcast streams, this doesn't necessarily imply an ongoing
burden of receiver control.  
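
A minimal sketch of that envelope (all names invented, not proposed API): the application reads the maximums,
conveys them in signaling, and picks how many streams to actually send and receive; which of those streams the
mixer forwards at any instant stays out of the API.

    // Hypothetical simulcast envelope.
    interface SimulcastCapabilities {
      maxSimulcastStreamsSend: number;
      maxSimulcastStreamsReceive: number;
    }
    declare function getSimulcastCapabilities(kind: "video"): SimulcastCapabilities;
    declare function setSimulcastStreamCount(send: number, receive: number): void;

    const simCaps = getSimulcastCapabilities("video");
    // ...convey simCaps in signaling, then choose counts within the envelope:
    setSimulcastStreamCount(Math.min(2, simCaps.maxSimulcastStreamsSend),
                            Math.min(3, simCaps.maxSimulcastStreamsReceive));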

Now for a bit of personal bias as to how the control should be exercised in terms of protocol functionality. 

I prefer RTCP control of simulcast/layered coding (such as via stream pause/resume) to signaling in most cases
since this allows much faster control.   For simulcast, the RTCP pause/resume message can refer to the SSRC 
to be paused/resumed, since simulcast streams have unique SSRCs.  With layered coding this is trickier, since only 
in Multi-SSRC Transport (MST) is there a unique SSRC per layer.  So this is one of several arguments in favor of MST. 
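
For illustration, a sketch of SSRC-keyed pause/resume (the function names and the RTCP message behind them are
placeholders, not spec): with simulcast the SSRC identifies the stream, and with MST it also identifies the
layer, which is what makes the same mechanism usable for layered coding.

    // Hypothetical pause/resume keyed by SSRC.
    declare function sendPauseRequest(ssrc: number): void;   // e.g. an RTCP feedback message
    declare function sendResumeRequest(ssrc: number): void;

    // With MST, a layered encoding can be exposed as a layer -> SSRC map:
    const layerSsrcs = new Map<number, number>([
      [0, 0x1111],  // base layer
      [1, 0x2222],  // first enhancement layer
      [2, 0x3333],  // second enhancement layer
    ]);

    // Drop the top enhancement layer quickly, with no SDP renegotiation.
    const topLayer = Math.max(...Array.from(layerSsrcs.keys()));
    sendPauseRequest(layerSsrcs.get(topLayer)!);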

For what it's worth, here is my opinion of the requirements relating to spatial simulcast/layered coding. 
As noted below, I would prefer parsimony in terms of the API functionality.  

   o  REQ-1.  It must be possible to enable and configure the scalable
      video coding before initiating a peer connection.

   o  REQ-2.  It must be possible to enable and configure the scalable
      video coding before answering a peer connection.

   o  REQ-5.  It must be possible to configure the number of simulcasted
      streams.

[BA] I would support being able to retrieve the maximum number of simulcasts that a browser can send/receive, as well as being able to set a maximum number
of streams to send/receive. 

   o  REQ-3.  It must be possible to enable/disable and re-configure the
      scalable video coding to update a peer connection.

[BA] As noted earlier, I would support pause/resume functionality, and if you're only sending a base layer or a single stream, then I think we've satisfied this requirement, haven't we?

   o  REQ-6.  It must be possible to configure the minimum and maximum
      bitrate of each simulcasted stream.

[BA] Because the bitrate can vary based on motion, frame rate and resolution, it probably isn't a good parameter for use in an API.  So I'd focus on frame rate for temporal scaling and 
resolution for spatial simulcast and layering. 

   o  REQ-7.  It must be possible to configure the resolution of each
      simulcasted stream.

[BA]  Some amount of configuration does make sense to me, but it's worthwhile to keep the practical constraints in mind. Typically the simulcast resolutions will be within the same aspect ratio and may be auto selected by the sender.  So maybe allow retrieval of the allowable resolutions for send/receive and then select among them based on the maximum number to be sent/received. 
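
Roughly what I have in mind, as a hedged sketch with made-up names: retrieve the allowable resolutions (already
constrained by the browser to a consistent aspect ratio) and take the top few, up to the maximum number of streams.

    // Hypothetical selection among browser-provided simulcast resolutions.
    interface Resolution { width: number; height: number; }
    declare function getAllowableSimulcastResolutions(kind: "video"): Resolution[];

    function chooseSimulcastResolutions(maxStreams: number): Resolution[] {
      return getAllowableSimulcastResolutions("video")
        .slice()
        .sort((a, b) => b.width * b.height - a.width * a.height)  // largest first
        .slice(0, maxStreams);
    }

    // e.g. chooseSimulcastResolutions(2) might yield 640x320 and 320x160,
    // leaving the moment-to-moment choice between them to the sender/MANE.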

   Requirements regarding RTP usage:

   o  REQ-10.  Congestion control must be supported for all the
      simulcasted streams between the configured boundaries (min/max
      bitrate).

[BA] I agree that congestion control should be built into the browser, but I expect its operation not to be under the control of the application. 


   o  REQ-11.  Transmission of simulcasted streams must be signaled and
      negotiated in the SDP and transmitted in RTP sessions, making use
      of existing standard attributes
      [I-D.westerlund-avtcore-multistream-and-simulcast].

[BA] I disagree that simulcast or layering changes need to be signaled.  Several simulcast/layering implementations do not do this.  All that may need to be signaled is the maximum operating envelope. 
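
To show what I mean by "maximum operating envelope", here is a hypothetical object that could be carried over
whatever signaling channel the application uses; the field names are invented and nothing here implies particular
SDP attributes or a particular signaling protocol.

    // Hypothetical operating envelope, signaled once.  Within it the sender
    // and mixer may add/drop layers and switch streams without further signaling.
    interface OperatingEnvelope {
      maxSimulcastStreams: number;
      maxTemporalLayers: number;
      resolutions: { width: number; height: number }[];  // same aspect ratio
      baseFramesPerSecond: number;
    }

    const envelope: OperatingEnvelope = {
      maxSimulcastStreams: 2,
      maxTemporalLayers: 3,
      resolutions: [{ width: 640, height: 320 }, { width: 320, height: 160 }],
      baseFramesPerSecond: 7.5,
    };
    // send JSON.stringify(envelope) over the application's signaling channel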

   o  REQ-12.  Any endpoint should be prepared to receive VP8 multi-
      layered encoded video not requiring out of band negotiation in
      SDP.

[BA] Not all browsers will necessarily support simulcast or layered coding so we can't require that all endpoints support this for any given codec.  So being able to retrieve the envelope of support is useful. 
