Re: CHANGE: Use a JS Object as an argument to getUserMedia from Timothy B. Terriberry on 2011-10-05 (public-webrtc@w3.org from October 2011)

From: Timothy B. Terriberry <tterriberry@mozilla.com>
Date: Wed, 05 Oct 2011 11:59:43 -0700
CC: "public-webrtc@w3.org" <public-webrtc@w3.org>
Message-ID: <4E8CA91F.70108@mozilla.com>

Adam Bergkvist wrote:
> empty (no tracks), and tracks would have to be added later. I think it
> would simplify things (e.g. MediaStream playback and sending with
> PeerConnection) if a MediaStream is immutable with regards to its track
> list.

I'm not sure this is really a problem: the request indicates whether you 
asked for audio and/or video, and tracks can be pre-created that simply 
don't reach the appropriate ready state until the user gives consent (if 
ever). You can still argue about whether you want the user to be able to 
consent to "just audio" or "just video" when you asked for both, and 
what should be done in that case. I'll let Anant tackle that issue.

The issue of tracklist mutability, however, is one I've brought up 
before, and was discussed a little bit on the W3C call today, without 
reaching any conclusions. Let me try to summarize things so we can move 
towards a resolution.


In attempting to define exactly how a MediaStream and a MediaStreamTrack 
relate to the underlying RTP concepts, it has been proposed that each 
MediaStreamTrack corresponds to a single SSRC. The SSRC namespace only 
guarantees uniqueness within an RTP session, but for the sake of 
argument I'm going to assume any use of the same SSRC in different 
sessions is intentional, for things like FEC or layered codecs, which I 
expect would still map to a single track. It has also been proposed that 
all the MediaStreamTracks correspond to the same CNAME, but not 
necessarily that all MediaStreamTracks with the same CNAME belong to the 
same MediaStream.
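
To make the proposed mapping concrete, here is a minimal sketch of it as a data model. The types TrackModel and StreamModel and both helper functions are hypothetical names made up for illustration; none of them come from the draft API or from RTP itself.

```typescript
// Illustrative model only; these types are hypothetical, not a browser API.
interface TrackModel {
  ssrc: number;            // proposal: one SSRC per MediaStreamTrack
  cname: string;           // RTCP CNAME of the sender
  kind: "audio" | "video";
}

interface StreamModel {
  label: string;
  tracks: TrackModel[];
}

// Proposed rule: all tracks inside one stream carry the same CNAME.
function hasSingleCname(stream: StreamModel): boolean {
  const first = stream.tracks[0];
  return stream.tracks.every((t) => first !== undefined && t.cname === first.cname);
}

// The converse is not required, so one CNAME may map to several streams.
function streamLabelsByCname(streams: StreamModel[]): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const s of streams) {
    for (const t of s.tracks) {
      const labels = out[t.cname] || (out[t.cname] = []);
      if (labels.indexOf(s.label) === -1) labels.push(s.label);
    }
  }
  return out;
}
```

Two single-track streams with the same CNAME both pass hasSingleCname, yet streamLabelsByCname lists them under one key, which is exactly the case the question below is about.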

For the purposes of this discussion, when I say "synchronization", I 
mean the actual presentation of timestamped audio and video at the 
proper times. I am assuming that things like clock drift, time
stretching, and shrinking (i.e., the jitter buffer part) are handled
internally by the browser, which can see the CNAME for all tracks.


So, I'll rephrase my original question (from 9/22), which I don't think 
was ever answered, in slightly more concrete terms: What happens when a 
remote participant, currently sending only audio, adds a video track 
with the same CNAME?


I see a few possibilities:

1) Add it as a new MediaStreamTrack to the existing MediaStream 
containing the audio.

We don't have any API for callbacks to indicate this has happened. As 
Adam pointed out on the call, this also complicates things if, for 
example, that MediaStream is being fed into another PeerConnection 
(where the far end may not support that media type), or even another 
local consumer (e.g., MediaStream.record(): the container in use may not 
allow a stream of a new type to be added partway through the recording).
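
As a rough sketch of why option 1 is awkward for consumers, assume a stream whose track list can grow and a recorder whose layout is fixed when it starts. MutableStreamModel, onTrackAdded, and FixedLayoutRecorder are all hypothetical; in particular, onTrackAdded stands in for the callback that does not exist in the API today.

```typescript
// Hypothetical sketch: neither class is a real or proposed browser API.
type Kind = "audio" | "video";

class MutableStreamModel {
  private kinds: Kind[];
  private listeners: Array<(kind: Kind) => void> = [];

  constructor(initial: Kind[]) {
    this.kinds = initial.slice();
  }

  currentKinds(): Kind[] {
    return this.kinds.slice();
  }

  // Stand-in for the callback that option 1 would need.
  onTrackAdded(cb: (kind: Kind) => void): void {
    this.listeners.push(cb);
  }

  addTrack(kind: Kind): void {
    this.kinds.push(kind);
    this.listeners.forEach((cb) => cb(kind));
  }
}

// A consumer whose layout is fixed when it starts, like a container format
// that cannot accept a new stream type partway through a recording.
class FixedLayoutRecorder {
  readonly rejected: Kind[] = [];
  private accepted: Kind[];

  constructor(stream: MutableStreamModel) {
    this.accepted = stream.currentKinds();
    stream.onTrackAdded((kind) => {
      if (this.accepted.indexOf(kind) === -1) this.rejected.push(kind);
    });
  }
}
```

Starting the recorder on an audio-only stream and then calling addTrack("video") leaves "video" in rejected: the consumer learns about the new track but has nowhere to put it.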


2) Add it as a new MediaStream containing just the new MediaStreamTrack 
corresponding to the video.

This leaves the receiving side with two MediaStream objects containing 
different tracks with the same CNAME, which must be synchronized 
manually, e.g., by feeding them, as blocking inputs, into a single 
ProcessedMediaStream from roc's MediaStream Processing API, or, 
depending on how you want to define the semantics, possibly just 
creating a new MediaStream containing both tracks. We haven't really 
talked about how mixing tracks from different MediaStreams affects 
synchronization, but I strongly recommend looking at the MediaStream 
Processing API, and its attempts to prevent the same media source from 
playing out at two different rates. In either case, this means that your 
local processing graph is now different depending on how you set up the 
call. There is also currently no API that indicates that these two 
MediaStreams share the same CNAME, so you don't have any way of knowing 
you need to do this.


3) Remove the old MediaStream and add a new MediaStream containing both 
the old MediaStreamTrack corresponding to the audio and the new 
MediaStreamTrack corresponding to the video.

When you do this, you can either
a) literally use the same MediaStreamTrack object used in the old 
MediaStream, or
b) create a new MediaStreamTrack object for the old audio track in the 
new MediaStream.

I think 3a would mean, for example, that if you ignored the callback and 
continued to use the old MediaStream object, then the audio would 
continue playing through it. That leaves you with an object that _acts_
as if it were one of the currently active remote streams, but is not
actually in the PeerConnection's list of remote streams. It also means 
it may not be synchronized with the new track, unless you do something 
to enforce that synchronization (e.g., switch to using the new 
MediaStream object).

3b, on the other hand, leaves you with the problem of synchronizing the 
transition from the old track to the new track. Unless you can respond
to the callback and reshuffle your media graph _immediately_ (the next
stable state may be too late), you may introduce gaps between the media
ceasing to flow from the old track and starting to flow from the new one.
Unless you (and the browser implementation) are very careful, you may 
also lose any internal buffered state (e.g., packets that were received 
and decoded, but only partially played out).

Keep in mind that it's sometimes necessary for the browser to rewind and 
re-process an internal buffer (e.g., to reduce the latency of volume 
changes taking effect, or any other effects processing you can imagine). 
That doesn't make these hand-off issues any easier.
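
The difference between 3a and 3b comes down to object identity, which a small model can show. This is only a sketch under my own assumptions: TrackObj, StreamObj, ConnectionModel, and the replace3a/replace3b methods are hypothetical names, and "ended" is a simplified stand-in for whatever ready state the track would really report.

```typescript
// Hypothetical models; not the PeerConnection API.
class TrackObj {
  ended = false;
  constructor(readonly kind: "audio" | "video") {}
}

class StreamObj {
  constructor(readonly tracks: TrackObj[]) {}

  hasLiveMedia(): boolean {
    return this.tracks.some((t) => !t.ended);
  }
}

class ConnectionModel {
  remoteStreams: StreamObj[] = [];

  private swap(oldStream: StreamObj, next: StreamObj): void {
    this.remoteStreams = this.remoteStreams.filter((s) => s !== oldStream);
    this.remoteStreams.push(next);
  }

  // Option 3a: the new stream reuses the very same track objects.
  replace3a(oldStream: StreamObj, added: TrackObj): StreamObj {
    const next = new StreamObj(oldStream.tracks.concat([added]));
    this.swap(oldStream, next);
    return next;
  }

  // Option 3b: the new stream gets fresh track objects; the old ones end.
  replace3b(oldStream: StreamObj, added: TrackObj): StreamObj {
    const fresh = oldStream.tracks.map((t) => new TrackObj(t.kind));
    oldStream.tracks.forEach((t) => { t.ended = true; });
    const next = new StreamObj(fresh.concat([added]));
    this.swap(oldStream, next);
    return next;
  }
}
```

After replace3a the old stream is gone from remoteStreams but hasLiveMedia() is still true on it; after replace3b the old stream goes silent and everything downstream has to be rewired to the new objects.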


This synchronization/gap problem applies at the application layer to 
both options 2 and 3 equally. I.e., if you're doing any non-trivial 
processing (in a ProcessedMediaStream or otherwise), you'll have to be 
very careful not to introduce these problems when swapping in a new 
MediaStream object, either constructed by the user to enforce 
synchronization in 2 or constructed by the API to enforce same-CNAME 
semantics in 3. In 3b you'll have them at the browser layer as well.

They're compounded in both 3a and 3b by the fact that the remove 
callback is separate from the add callback. If you don't know an add 
callback is coming, you may continue processing things right after the 
remove, introducing these gaps. 3a may be slightly better in this
regard, as the media will keep playing if you can somehow divine that
you should ignore the remove callback, but is still not without issues.
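
To put a number on the gap, consider a naive consumer that stops output on the remove callback and restarts on the add callback. The event shape and the silentMs helper below are hypothetical, and the timestamps are made up purely for illustration.

```typescript
// Hypothetical event log; not a real callback signature.
type StreamEvent = { atMs: number; type: "remove" | "add" };

// Total time a naive consumer spends silent: it stops on "remove"
// and only resumes when the following "add" arrives.
function silentMs(events: StreamEvent[]): number {
  let stoppedAt: number | null = null;
  let total = 0;
  for (const e of events) {
    if (e.type === "remove" && stoppedAt === null) {
      stoppedAt = e.atMs;
    } else if (e.type === "add" && stoppedAt !== null) {
      total += e.atMs - stoppedAt;
      stoppedAt = null;
    }
  }
  return total;
}
```

Whatever delay separates the two callbacks shows up directly as silence, unless the consumer has some way of knowing that an add is about to follow.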


Option 3 also doesn't side-step the "no API to indicate CNAME" problem 
entirely, as we may still run into that issue if audio and video have to 
be part of separate RTP sessions.


So, that's as far as I've thought through these things right now. What 
do others think?
Received on Wednesday, 5 October 2011 19:00:09 UTC