Re: A proposal for how we would use the SDP that comes out of the MMUSIC interim

On 10/9/2015 1:56 PM, Peter Thatcher wrote:
>
> On Fri, Oct 9, 2015 at 9:14 AM, Byron Campen <docfaraday@gmail.com 
> <mailto:docfaraday@gmail.com>> wrote:
>
>
>>
>>     dictionary RTCRtpEncodingParameters {
>>       double scale;  // Resolution scale
>>       unsigned long rsid;  // RTP Source Stream ID
>>       // ... the rest as-is
>>     }
>        I am skeptical that a resolution scale is the right tool for
>     "full/postage stamp", which is the primary use case for simulcast.
>
>     A conferencing service is probably going to want to define the
>     postage stamp as a fixed resolution (and probably framerate), not
>     a scale of the full resolution that can slide around.
>
>
> Ultimately, it's the client-side Javascript in control of what gets 
> sent to the server.  The big question has always been: does the JS 
> specify a fixed resolution (or height or width) or a relative one?
> All of the discussions we've had in the past in the WebRTC working
> group about this have always ended up in favor of a relative one, not
> a fixed one.  If the JS wants to send a specific resolution, it can
> control that on the track, not via the RtpEncodingParameters or the SDP.
>
> As for what conferencing services want, the one I'm very familiar with
> wants a resolution scale.  So at least the desire for a fixed
> resolution isn't universal.  And, as I already mentioned, services
> that do want a fixed resolution can send a fixed resolution from the
> JS via track controls.

My position is that the JS (which comes from the conferencing 
server/provider!) should be in control of the resolutions, in the end.  
If a camera provides CIF (352x288), do you really want to end up scaling 
up the "thumbnail" version from 88x72?  (Assuming the layout you 
mentioned: full, 1/2, and 1/4 scale from the CIF source.)  And if the 
source is running HD, your layers would be transmitted at 1920x1080, 
960x540, and 480x270 (the last of which is likely a fair bit larger than 
"thumbnail" in most conferencing services).  
You can work around the problem by monitoring the source (modulo that it 
can change resolution depending on various factors, especially things 
like a window capture), but that adds more than a bit of complexity: 
pre-roll the source to get the resolution and/or renegotiate 
immediately, add a continual monitor in a hidden <video> element to pick 
up on resolution changes, more limits in getUserMedia constraints with 
additional fallback paths, etc.
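To make the arithmetic concrete, here is a small sketch (plain JavaScript, not any proposed WebRTC API; the helper name and scale list are mine) of what purely relative scales produce from different sources:

```javascript
// Hypothetical helper, not an API: compute the per-layer dimensions that
// purely relative scales would produce from a given source resolution.
function layerSizes(srcWidth, srcHeight, scales) {
  return scales.map(function (s) {
    return { width: Math.round(srcWidth / s), height: Math.round(srcHeight / s) };
  });
}

// CIF source (352x288) with the full/half/quarter layout:
var cif = layerSizes(352, 288, [1, 2, 4]);
// -> 352x288, 176x144, 88x72: the "thumbnail" is only 88x72.

// HD source (1920x1080) with the same relative layout:
var hd = layerSizes(1920, 1080, [1, 2, 4]);
// -> 1920x1080, 960x540, 480x270: the "thumbnail" is 480x270.
```

The same scale list lands at very different absolute sizes depending on the source, which is exactly the problem with a scale-only knob.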

There are several ways to lay out conferencing services: one is a bunch 
of fixed sizes, another is adaptive to window size and participant count 
(and there are more!).  Services that don't have a "gallery of 
thumbnails" may have a fixed area in which to display people (active 
talker plus last N talkers, perhaps).  There is an advantage to shipping 
a constant resolution to each: you can avoid client-side 
scale-ups/downs (and wasted bits on the scale-down, or ugly images on 
the scale-up).  (Though of the two, a moderate scale-up isn't so bad.)  
The takeaway here is: don't assume your conferencing service is "the" 
way the feature will be used.

As for track scaling to allow for fixed sizes: that does give the JS 
control.  If we can clone the tracks, apply scales to each at the track 
level, and then attach them for the separate layers instead of using 
"resolutionScale": great.  However, will codecs be able to take 
advantage of producing N encodings from one source with track cloning 
and scaling?  Likely not.  You need N senders fed from one track (and 
the ability to specify that, or otherwise determine they have the same 
input).  So an API based on cloning plus track scaling will preempt the 
ability to use a multi-encoding codec (and also lock in a bunch of 
extra CPU/power use, etc.).
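A toy model (my own framing, not a real or proposed API) of why the API shape matters to the codec:

```javascript
// Toy model, not a real API: how many independent encoder instances each
// API shape forces for N simulcast layers.
function encoderInstances(apiModel, layerCount) {
  if (apiModel === "clonedTracks") {
    // Each cloned, scaled track feeds its own sender and its own encoder;
    // the codec can't see that the layers share a source, so no work is
    // shared across layers.
    return layerCount;
  }
  if (apiModel === "singleSource") {
    // N encodings known to be fed from one track can all be handed to a
    // multi-encoding-capable codec in one pass.
    return 1;
  }
  throw new Error("unknown API model: " + apiModel);
}
```

With three layers, the cloning model forces three full encoder runs where a multi-encoding codec could do one; that cost is what the cloning API would lock in.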

> But even if we did say "RTCRtpEncodingParameters should have a 
> .maxWidth and a .maxHeight", which I doubt we will, that's somewhat 
> orthogonal to this proposal.

Ok by me - let's just not optimize for a particular application instance.

>>     And here's the *subset* of the SDP from MMUSIC we could use in
>>     the offer (obviously subject to change based on the results of
>>     the interim):
>>
>>     m=video ...
>>     ...
>>     a=rsid send 1
>>     a=rsid send 2
>>     a=rsid send 3
>>     a=simulcast rsids=1,2,3
>        The semantics of this are pretty unclear; what does each of
>     these rids mean? You can say that it is "application dependent" I
>     suppose, but the implementers of conferencing servers are going to
>     want something a little more concrete than that.
>
>
> If the JS wants to send more information about what semantics it is 
> giving to each encoding/rsid, it is more than capable of doing so in 
> its signalling to the server.  We don't need to put all signalling 
> into SDP.
> We may choose, for convenience of the JS, to put a minor amount of 
> signalling in the SDP, like we put the track ID into the SDP.  If so, 
> what you're really advocating for is an RSID that's a string instead 
> of an int:

A side-channel in the non-SDP signaling (or elsewhere) certainly is fine 
for anything the JS wants both sides to know.  The question is whether 
the SDP as defined is useful in *any* context outside of WebRTC as-is.  
Generally, it should be.  So we need to care about: a) does the draft in 
question work as a definition separate from WebRTC?  (The answer doesn't 
*have* to be yes, but likely should be - see comment 22.)  b) What 
features from this draft will WebRTC endpoints use?  c) What happens 
when a WebRTC endpoint using this draft talks to a non-WebRTC endpoint - 
how hard is the conversion/translation/etc.?  For example, perhaps, 
Vidyo or a Vidyo-like service.

So this may or may not meet the bar set by the questions just above.  If 
it doesn't, I think it shouldn't need much to meet it.

-- 
Randell Jesup -- rjesup a t mozilla d o t com
Please please please don't email randell-ietf@jesup.org!  Way too much spam

Received on Sunday, 11 October 2015 09:23:25 UTC