Re: Some thoughts on simulcast/layered coding support in ORCA API

This is a very useful analysis.  I think that the challenge will be to
distill this down into something that is both usable and flexible
enough to meet the use cases.

Here are my initial thoughts:

Rather than provide a single configuration for each codec, the browser
offers a set of configurations that it supports.  This probably
includes a simple unlayered variant along with any number of layered
configurations.  As Bernard observes, and as the UCIF layering
profiles document corroborates, this is likely to be a fairly narrowly
bounded set.
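
As a sketch of how such a set might be surfaced (the object and
method names here are purely illustrative, not from any draft):

// "sender" stands in for whatever object ends up owning send capabilities
var configs = sender.getSupportedConfigurations("video/H264");
// expected result: a short list, e.g. one simple unlayered
// configuration plus a couple of layered ones shaped like the
// example further down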

Each layering configuration starts with a base layer, with dependent
layers included in a child element.  Each dependent layer contains a
description of what it adds: 5 extra frames per second, an increase
in size to 720p, or an improvement in quality (SNR).  This can
continue recursively.

The advantage of this is that, at the sender, applications can choose
a layered configuration and remove layers from the description, which
also removes those layers from the encoding.  Thus, a browser can
offer a layered configuration and the application can choose a subset
of the layers, or even just the base layer.
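
A sketch of that pruning, reusing configs from the sketch above and
assuming configurations are plain objects shaped like the example
below (setConfiguration() is a hypothetical setter):

var config = configs[1];  // a layered configuration offered by the browser
// keep the base layer and the temporal extension; drop spatial (and quality)
config.dependentLayers = config.dependentLayers.filter(function (layer) {
  return layer.scalability === "temporal";
});
sender.setConfiguration(config);  // hypothetical setter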

As Bernard observes, once a layering configuration is chosen, the
browser may need to drop layers temporarily in response to changing
bandwidth availability.  This is one area where layered coding really
shines.

Of course, each layer description will have to include some way of
identifying the layer, which might depend on several factors.  I'm
going to assume that we are using SSRC in this case.  Worst case, the
receiver has to perform some decoding to determine which layer is
which.

e.g.,

{
  type: "video/H264", framerate: 15, clockRate: 90000,
  fmtp: { ... },
  width: 360, height: 640, bitrate: 256000,
  dependentLayers: [
    { scalability: "temporal", framerate: 15, dependentLayers: [] },
    { scalability: "spatial", width: 720, height: 1280, ,
dependentLayers: [ { scalability: "quality" } ] },
  ]
}
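
Following the SSRC assumption above, each layer entry might also
carry an identifier; something like this (the ssrc field is my
addition, nothing settled):

{
  type: "video/H264", ssrc: 12345, ...,
  dependentLayers: [
    { scalability: "temporal", framerate: 15, ssrc: 12346, dependentLayers: [] },
    ...
  ]
}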

Of course, there's a limit to what a browser can act upon.

Simulcast is an easier problem: just send multiple scaled copies of
the same track.
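
In these terms, simulcast might be no more than a list of independent
configurations for the same track; a sketch (resolutions and bitrates
invented for illustration):

[
  { type: "video/H264", width: 360, height: 640, bitrate: 256000 },
  { type: "video/H264", width: 180, height: 320, bitrate: 96000 }
]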

The real trick is determining what the browser is able to use.
Browsers don't have unlimited encoding capacity, nor is there infinite
bandwidth.

Just as a browser is able to advertise multiple supported profiles, it
can also be given multiple options to choose from.  This is ideal if
there aren't external constraints on usage, because it allows the
browser to choose a codec profile that suits the current network and
encoder conditions.  Given a choice of multiple codec profiles, I
expect the browser to select the best one it can send within what it
understands to be the current constraints (or perhaps the first, which
has the advantage of being more deterministic).
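
To make that concrete, the application might hand the browser an
ordered list, with order expressing preference (send() and the config
names are hypothetical):

// offer several profiles, best first; the browser picks one it can sustain
sender.send(track, [
  h264LayeredConfig,  // first choice: layered H.264, if capacity allows
  h264SimpleConfig,   // fallback: plain unlayered H.264
  vp8SimpleConfig     // last resort
]);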

On 31 October 2013 22:45, Bernard Aboba <Bernard.Aboba@microsoft.com> wrote:
> Erik asked me for my thoughts on how scalable video coding (and simulcast) should be handled in the ORTC API, so I thought I would post some thoughts.
>
> Before reading this message, I would recommend that people look at the following draft as a point of departure, since it discusses requirements and use cases:
> http://tools.ietf.org/html/draft-garcia-simulcast-and-layered-video-webrtc
>
> While I'm largely in agreement with what the document says about temporal scaling, there are probably some disagreements relating to spatial scaling and/or spatial simulcast.
>
> In general, I agree with the small set of requirements relating to temporal scaling (and think they can be reduced even further).
>
>    o  REQ-8.  It must be possible to configure the number of temporal
>       layers (1 to 4).  This should be the only mandatory parameter when
>       enabling temporal scalability.
>
> [BA] IMHO, it is useful to be able to retrieve from the browser the maximum number of temporal layers
> supported for send/receive, so as to be able to signal this to the peer if necessary.  I also think that
> an application should be able to set the maximum number of layers sent/received.  However,
> the application doesn't need to control the number of layers sent or received on an ongoing basis,
> since this can be handled by the browser with no API controls.
>
> In practice the layer add/drops can happen very quickly (several times a second) and will be based on
> congestion state, so application control is not feasible and could even be dangerous.
>
>    o  REQ-9.  It must be possible to configure the bitrate, frame rate
>       decimation factor and membership of frames to layers for each
>       temporal layer of the VP8 stream.
>
> [BA] I would suggest that only frame rate is fundamental here.  The other parameters are codec specific.
> In particular, it seems useful for an application to be able to retrieve the frame rate configuration that
> the browser supports so it could signal this.  However, the application may not necessarily be able to
> control the frame rate of each layer.  For example, in a given implementation the application might
> discover that the base frame rate is 7.5 frames/second, and extensions are 7.5 and 15 frames/second.  Take
> it or leave it!
>
> For temporal scaling, the base layer frame rate is the most important parameter, and logically will determine
> the frame rates of extension layers, which are typically designed to allow multiplicative increases in frame rate.
> So there is not really an infinite degree of flexibility here, and you don't want to give the application so much
> rope it can hang itself.
>
> Allowing each extension layer to have its frame rate determined independently could result in configuration
> requests that a given implementation might not actually be able to carry out and that could play havoc with
> congestion control.
>
> The question which I suspect will kick off more debate is how to handle spatial simulcast and/or layering.
>
> A major use case is described in Section 2.4 "increasing video quality", where the application will want to switch
> from a thumbnail to a larger resolution, potentially because the active speaker changed or for some other reason.
>
> I agree that this is a real scenario but also would caution that there are lots of situations where the resolution will
> be changed by the browser without application control.  In reality, supported spatial resolutions are typically not
> infinitely variable.  It makes no sense for an application to change the aspect ratio frequently, for example -- that is
> disconcerting to the user.   To avoid doing this, a set of resolutions with the same aspect ratio may be supported,
> allowing the resolution to change while the window size may not change at all (just the quality).
>
> For example, the active speaker might be 640 by 320 and then, because of a lack of available bandwidth, a lower
> resolution simulcast of 320 by 160 might be selected by the MANE.   This wouldn't necessarily imply demotion to
> a thumbnail, just a decrease in resolution made necessary by an increase in congestion.  Therefore this could be
> a decision made entirely by the sender.
>
> IMHO, an API should allow the application to retrieve the  maximum number of simulcasts to be sent/received so
> this can be conveyed in signaling.   It should also be able to decide how many streams to send, and how many it
> could receive.  However, it should be understood that within those parameters in practice the mixer will make the
> decision about which simulcast stream it sends based on bandwidth availability (which could change very quickly).
> So while the receiver should be able to pause/resume simulcast streams this doesn't necessarily imply an ongoing
> burden of receiver control.
>
> Now for a bit of personal bias as to how the control should be exercised in terms of protocol functionality.
>
> I prefer RTCP control of simulcast/layered coding (such as via stream pause/resume) to signaling in most cases
> since this allows much faster control.   For simulcast, the RTCP pause/resume message can refer to the SSRC
> to be paused/resumed since simulcasts have unique SSRCs.  Within layered coding, this is trickier, since only
> in Multi-SSRC Transport (MST) is there a unique SSRC per layer.  So this is one of several arguments in favor of MST.
>
> For what it's worth, here is my opinion of the requirements relating to spatial simulcast/layered coding.
> As noted below, I would prefer parsimony in terms of the API functionality.
>
>    o  REQ-1.  It must be possible to enable and configure the scalable
>       video coding before initiating a peer connection.
>
>    o  REQ-2.  It must be possible to enable and configure the scalable
>       video coding before answering a peer connection.
>
>    o  REQ-5.  It must be possible to configure the number of simulcasted
>       streams.
>
> [BA] I would support being able to retrieve the maximum number of simulcasts that a browser can send/receive, as well as being
> able to set a maximum number of streams to send/receive.
>
>    o  REQ-3.  It must be possible to enable/disable and re-configure the
>       scalable video coding to update a peer connection.
>
> [BA] As noted earlier, I would support pause/resume functionality, and if you're only sending a base layer or a single stream, then I think we've satisfied this requirement, haven't we?
>
>    o  REQ-6.  It must be possible to configure the minimum and maximum
>       bitrate of each simulcasted stream.
>
> [BA] Because the bitrate can vary based on motion, framerate and resolution, it probably isn't a good parameter for use in an API.  So I'd focus on framerate for temporal scaling and
> resolution for spatial simulcast and layering.
>
>    o  REQ-7.  It must be possible to configure the resolution of each
>       simulcasted stream.
>
> [BA]  Some amount of configuration does make sense to me, but it's worthwhile to keep the practical constraints in mind. Typically the simulcast resolutions will be within the same aspect ratio and may be auto selected by the sender.  So maybe allow retrieval of the allowable resolutions for send/receive and then select among them based on the maximum number to be sent/received.
>
>    Requirements regarding RTP usage:
>
>    o  REQ-10.  Congestion control must be supported for all the
>       simulcasted streams between the configured boundaries (min/max
>       bitrate).
>
> [BA] I agree that congestion control should be built into the browser, but I expect its operation not to be under the control of the application.
>
>
>    o  REQ-11.  Transmission of simulcasted streams must be signaled and
>       negotiated in the SDP and transmitted in RTP sessions, making use
>       of existing standard attributes
>       [I-D.westerlund-avtcore-multistream-and-simulcast].
>
> [BA] I disagree that simulcast or layering changes need to be signaled.  Several simulcast/layering implementations do not do this.  All that may need to be signaled is the maximum operating envelope.
>
>    o  REQ-12.  Any endpoint should be prepared to receive VP8 multi-
>       layered encoded video not requiring out of band negotiation in
>       SDP.
>
> [BA] Not all browsers will necessarily support simulcast or layered coding so we can't require that all endpoints support this for any given codec.  So being able to retrieve the envelope of support is useful.

Received on Friday, 1 November 2013 16:43:21 UTC