Re: Mozilla/Cisco API Proposal

On Mon, 11 Jul 2011, Anant Narayanan wrote:
> > 
> > I considered doing that, but it seems like the common case just gets quite
> > a bit more complicated:
> > 
> >     navigator.getUserMedia('audio,video');
> > 
> > ...vs:
> > 
> >     navigator.getUserMedia({"audio":{},"video":{}});
> I was hoping that the call would get audio & video by default. So, with
> callbacks:
> navigator.getMediaStream({}, onsuccess, onerror);

So how do you say you only want audio?

I'm not inherently against an object/dictionary approach here; indeed 
it's been suggested before. I just want to make sure whatever we have is 
intuitive and simple, and I don't really see how to do that in this case 
using an object/dictionary approach.

> > The string is extensible as well. :-)
> True, but not in the same way JS objects are. In particular, we'll have 
> to come up with an entirely new string structure, and I'm hoping to 
> avoid that!

No, I mean, the string syntax that getUserMedia() is defined as taking is 
already defined in an extensible, forward-compatible way. You first split 
on commas to get the individual tracks that the script wants to get, and 
then you split on spaces to get more data. So for example, if the argument 
is the string "audio, video user" this indicates that the script wants 
both audio and a user-facing video stream ("face-side camera").
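
To make that concrete, here is a rough sketch of the splitting described 
above. It is not spec text: the function name parseMediaOptions and the 
{kind, hints} return shape are made up for illustration, and I'm assuming 
that track types other than "audio" and "video" are simply skipped.

```javascript
// Illustrative only: parseMediaOptions and its return shape are hypothetical.
function parseMediaOptions(options) {
  const tracks = [];
  // First split on commas to get the individual tracks requested.
  for (const part of options.split(",")) {
    // Then split on spaces to get the track type and any extra data.
    const tokens = part.trim().split(/\s+/).filter((t) => t.length > 0);
    if (tokens.length === 0) continue; // empty entry, e.g. a trailing comma
    const [kind, ...hints] = tokens;
    // Assumption: track types this sketch doesn't know about are ignored.
    if (kind !== "audio" && kind !== "video") continue;
    tracks.push({ kind, hints });
  }
  return tracks;
}

console.log(parseMediaOptions("audio, video user"));
```

The point is just that anything after the first token of each entry is 
free-form data that can grow over time without changing the overall 
syntax.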

We can extend this further quite easily, e.g.:

   getUserMedia('video 30fps 640x480');

...to get a video-only stream at VGA resolution and 30 FPS.

I'm certainly open to other ways of conveying this, such as the proposed
object/dictionary approach; all I'm saying is we need to make sure that 
whatever solution we pick is a good design. I'm not convinced that it's 
obvious that an object/dictionary approach is inherently better.

> > One of the differences is that your proposal allows the author to set 
> > things like the quality of the audio. It's not clear to me what the 
> > use case is for that. Can you elaborate on that? It seems like you'd 
> > just want the MediaStream to represent the best possible quality and 
> > just let the UA downsample as needed. (When recording it might make 
> > sense to specify a format; I haven't done anything on that front 
> > because I've no idea what formats are going to be common to all 
> > implementations.)
> We should at least allow those parameters that can be specified across 
> multiple implementations. For example, quality is a number between 0.0 
> and 1.0, and that might mean different things for different codecs, and 
> that's okay.

It's not clear to me that there should even be a codec involved at this 
point. What you get from getUserMedia() need never actually be encoded, in 
particular if you consider the case I mentioned earlier of just taking an 
audio stream and piping it straight to the speaker -- it's quite possible 
that on some platforms, this can be done entirely in hardware without the 
audio ever being encoded at all.

I think there's a place for codec discussions, including quality control, 
etc, but I'm not sure getUserMediaStream() (or whatever we call it) is the 
place to put it. I think it belongs more in whatever interface we use for 
converting a stream to a binary blob or file.

> Using a JS object also means that UA's can simply ignore properties they 
> don't understand.

The string mechanism already ignores the features that the UA doesn't 
understand, so that's not unique to objects.

> The webapp author may choose to specify nothing, in which case we 
> automatically give out the best quality and the best framerate. 
> Resolution is trickier, since we won't know what sinks the mediastream 
> will be attached to, but some sane defaults can be made to work.

Resolution is an example of why it might not make sense to be giving too 
many options at this level. Consider a case where a stream is plugged into 
a local <video> at 100x200 and into a PeerConnection that goes to a remote 
host which then has it plugged into a <video> at 1000x2000. You really 
want the camera to switch to a higher resolution, rather than have it be 
fixed at whatever resolution the author happened to come up with when he 
wrote the code -- especially given the tendency of Web authors to specify 
arbitrary values when prompted for defaults.

> > Interesting. Basically you're saying you would like a way for a peer 
> > to start an SDP offer/answer exchange where one of the streams offered 
> > by the other peer has been zeroed out? (Currently there's no way for a 
> > peer to tell the other peer to stop sending something.)
> Yes.
> > How should this be notified on the other side?
> I believe it is possible in RTP to ask the other side to stop sending 
> something. If not, we could always just send our own UDP message.

I mean, how should it be exposed in the API?

> > Should it be possible for the other side to just restart sending the 
> > stream?
> I don't think so. If a peer explicitly set the readyState of a remote 
> stream to BLOCKED it means they don't want data. The other side could, 
> of course, send a completely new stream if it wishes to.

That's what I meant by restarting the sending of the stream.

It's not clear to me what the use case is here. Can you elaborate on why 
the API should support this natively instead of requiring that authors 
implement this themselves using their signalling channel? (The latter is 
pretty trivial to do, so it's not clear to me that this really simplifies 

> > > 	- Inputs (sources) and Outputs (sinks) are implied and thus not
> > > exposed. Assigning a stream to another object (or variable) implies adding
> > > a
> > > sink.
> > 
> > Not sure what you mean here.
> document.getElementById("somevideoelement").stream = myMediaStream;
> sets the video element to be an output for myMediaStream. The 
> MediaStream does not have an interface to find out all its inputs and 
> outputs. I don't think this part differs much from your proposal, I 
> mentioned it because it came up earlier :)

Ah, ok.

I think we're better off using URL.getObjectURL() for this kind of thing 
rather than having an explicit .stream attribute, since the latter would 
really mess with the <video> resource selection algorithm.

> > > 	- We added a BLOCKED state in addition to LIVE and ENDED, to allow a
> > > peer to say "I do not want data from this stream right now, but I may
> > > later" -
> > > eg. call hold.
> > 
> > How is this distinguished from the stop() scenario at the SDP level?
> stop() at SDP is initiated when a stream is ENDED, as mentioned before we'll
> have to come up with a new mechanism (or use an existing RTP mechanism) to
> implement BLOCKED.

Ah, ok. For interop with existing ICE stacks (e.g. SIP phones) it seems it 
would be massively beneficial if we could stick to the currently specified 
and implemented set of ICE features, at least for our initial work. :-)

> > The reason I didn't expose a way for a peer to tell another peer to 
> > stop sending media (temporarily or permanently) is that I figured 
> > authors would just implement that using their own signalling channel. 
> > Instead of A send media to B, then B use ICE/SDP to stop the media, 
> > then B use ICE/SDP to resume the media, you would have A send media to 
> > B, then B tell A via the author's signalling channel to stop sending 
> > the media, then A would stop sending the media to B, and later A would 
> > resume in a similar manner.
> That's certainly another way to do it. If B wants to temporarily stop a 
> stream from A, it could tell A out of band and A could set its local 
> stream to state BLOCKED. Either case, we'd have to implement the BLOCKED 
> state to support this.

Well the API in the WHATWG spec already supports the equivalent of 
"BLOCKED", you just disable the tracks. No need for a separate state at 
the stream level.
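
As a rough sketch of what I mean, combining this with the author's own 
signalling channel: the message shape ({type: "hold"} / {type: "resume"}) 
and the function name are hypothetical, and I'm assuming the stream exposes 
an iterable "tracks" list whose entries have a boolean "enabled" flag. The 
stream below is a plain stand-in object so the sketch runs outside a 
browser.

```javascript
// Hypothetical handler run by peer A when peer B asks, via the author's own
// signalling channel, to pause or resume the media A is sending.
function handleHoldMessage(stream, message) {
  if (message.type !== "hold" && message.type !== "resume") {
    return false; // not a message this handler understands
  }
  const enabled = message.type === "resume";
  // Disabling every track is the equivalent of "BLOCKED" for the stream.
  for (const track of stream.tracks) {
    track.enabled = enabled;
  }
  return true;
}

// Stand-in for a real stream object (assumed shape, for demonstration only).
const localStream = {
  tracks: [
    { kind: "audio", enabled: true },
    { kind: "video", enabled: true },
  ],
};

handleHoldMessage(localStream, { type: "hold" });
console.log(localStream.tracks.map((t) => t.enabled));
```

Nothing here needs a new state on the stream or anything new at the 
ICE/SDP level.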

> > I mostly didn't do that because "MediaStreamTrackList" is a long 
> > interface name. It also allows us to reuse StreamTrack for Stream 
> > later, if roc gets his way. :-)
> Why a separate List object and not simply an array of Tracks?

StreamTrackList is a typedef for an array of StreamTracks.

> > > 	- Added a MediaStreamStackHints object to allow JS content to 
> > > specify details about the media it wants to transport, this is to 
> > > allow the platform (user-agent) to select an appropriate codec.
> > > 	- Sometimes, the ideal codec cannot be found until after the RTP 
> > > connection is established, so we added an onTypeChanged event. A new 
> > > MediaStreamTrack will be created with the new codec (if it was 
> > > changed).
> > 
> > It's not clear to me why the author would ever care about this. What 
> > are the use cases here?
> The author *needn't* care about it (simply don't provide the hints) but 
> can if they want to. Sometimes you're transmitting fast moving images, 
> other times you're transmitting a slideshow (where you want each slide 
> to be of high quality, but very low frame rate). Only the application 
> can know this, and it'd be good for the platform to optimize for it.

It seems like the user agent is actually in a better position to know 
this than the author. Shouldn't this just be automatic? At least in the 
initial implementation. It would suck if we boxed ourselves into an API 
where authors are required to guide implementations through things like 
this, given how rarely authors get this kind of thing right...

> > > 	- StreamTrack.kind was renamed to MediaStreamTrack.type and takes 
> > > an IANA media string to allow for more flexibility and to allow 
> > > specifying a codec.
> > 
> > This makes StreamTrack inconsistent with VideoTrack, AudioTrack, and 
> > TextTrack, which I think we should avoid.
> We were proposing that the types be from the list here:
> It certainly includes types for text (subtitles), audio as well as video.

I mean the name "type" would be inconsistent. The other interfaces use 
"kind" for this concept.

Also, "type" is generally used for a full MIME type, not just a 

> Tracks are just data of a certain type, so we don't have separate 
> objects for each kind.

We do, for <video>.

> That being said, if there's already a spec that we should inter-operate 
> with; that's reasonable. Where can I find more info on VideoTrack, 
> AudioTrack and TextTrack? Have these been implemented by any UA's?

They're specified in the same spec as StreamTrack:

TextTrack probably has implementations by now, the other two are newer 
(they were designed with StreamTrack).

> > > 4. PeerConnection:
> > > 	- Renamed signalingCallback ->  sendSignal, signalingMessage ->
> > > receivedSignal
> > 
> > A "signal" is generally a Unix thing, quite different from the signalling
> > channel. ICE calls this the "signalling channel", which is why I think we
> > should use this term.
> Fair enough. How about sendSignalingMessage() and 
> receivedSignalingMessage()? Perhaps too long :-)

Well sendSignalingMessage's name doesn't matter, since it's a callback.

receivedSignalingMessage() is what I originally wanted to call 
signalingMessage(), but it seemed overly long, yeah. I don't feel strongly 
on this so if people would rather have it called  
receivedSignalingMessage() that's fine by me.

> > I'm not sure the "addLocalStream()" change is better either; after 
> > all, it's quite possible to add a remote stream, e.g. to send the 
> > stream onto yet another user. Maybe addSendingStream() or just 
> > sendStream() and stopSendingStream() would be clearer?
> Ah, adding a remote stream to pass it onto another peer, I had not 
> considered. Mainly the renaming was done to clarify that when you add a 
> local stream with addStream, the streamAdded callback would not be 
> invoked (since that happens only when remote streams are added).

I wouldn't worry too much about that. Authors will figure it out quickly 
enough based on the examples they're copying when learning the API. In 
particular, when it comes to a decision between a long verbose but clear 
name and a shorter mnemonic name that isn't quite as clear, I think we 
should lean towards the latter: while authors will spend a short while 
learning the APIs (being helped by the longer names), they're going to 
spend far longer using them after learning them (being helped by the 
shorter names).

> > > 	- We added a LISTENING state (this means the PeerConnection can 
> > > call accept() to open an incoming connection, the state is entered 
> > > into by calling listen()), and added an open() call to allow a peer 
> > > to explicitly "dial out".
> > 
> > Generally speaking this is an antipattern, IMHO. We learnt with 
> > XMLHttpRequest that this kind of design leads to a rather confusing 
> > situation where you have to support many more state transitions, and 
> > it leads to very confusing bugs. This is why I designed the 
> > PeerConnection() object to have a constructor and to automatically 
> > determine if it was sending or receiving (and gracefully handle the 
> > situation where both happen at once). It makes the authoring 
> > experience much easier.
> I'm all for simplifying the API as much as possible, if there's a way 
> for us to fulfill all the use cases we have in mind. Without an explicit 
> LISTENING state, how would you handle receiving a call on another 
> browser tab, while you are currently in a call?

Whether you're on a call or not doesn't affect new calls. It's a new ICE 
transaction each time, and ICE transactions don't interfere with each 
other.

The way you start a call with the PeerConnection API is that you create a 
PeerConnection object, and then don't call signalingMessage(). The API 
will then call the callback with the initial SDP offer, to send via the 
signaling channel to the other peer.

The way you listen for a call with the PeerConnection API is that when you 
receive that initial SDP offer from the initiating peer, you create a 
PeerConnection object and immediately call signalingMessage() on it with 
the SDP offer.

If these get out of sync, the normal ICE mechanism resolves any conflicts. 
(For example, both sides can start the connection; the ICE agent will then 
figure out what's going on. This is all defined by ICE already.)

There's thus no need for an explicit open() or listen().
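
Here is a sketch of the application-side bookkeeping this implies, 
including the "second call while already on a call" case. Everything named 
here is hypothetical (makeCallManager, fakeConnection, the peer IDs); 
createConnection stands in for constructing a PeerConnection with a 
signaling callback, and the fake connection merely records what it is 
given so that the sketch runs outside a browser.

```javascript
// Hypothetical helper: one connection object per remote peer.
function makeCallManager(createConnection, sendToPeer) {
  const connections = new Map(); // peerId -> connection object

  function connectionFor(peerId) {
    if (!connections.has(peerId)) {
      // The callback is how the connection hands us outgoing signaling
      // messages (such as the initial SDP offer) to relay to the peer.
      const pc = createConnection((msg) => sendToPeer(peerId, msg));
      connections.set(peerId, pc);
    }
    return connections.get(peerId);
  }

  return {
    // Starting a call: create the object and do NOT call signalingMessage().
    call(peerId) {
      return connectionFor(peerId);
    },
    // Incoming message: create the object if needed, then pass it straight in.
    receive(peerId, message) {
      const pc = connectionFor(peerId);
      pc.signalingMessage(message);
      return pc;
    },
    size() {
      return connections.size;
    },
  };
}

// Stand-in for a real connection; it only records the messages it receives.
function fakeConnection(signalingCallback) {
  return {
    received: [],
    signalingCallback,
    signalingMessage(message) {
      this.received.push(message);
    },
  };
}

const manager = makeCallManager(fakeConnection, () => {});
manager.call("bob");                    // we dial out to bob
manager.receive("alice", "SDP offer");  // alice calls us while we're busy
console.log(manager.size());
```

Each peer gets its own object, so an incoming call while another is in 
progress is just one more entry in the map.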

> I would certainly like for the user to be able to put this one on hold 
> and switch to the other one.

That should work fine.

> In general, I think the common theme for your comments is to make things 
> as easy for the web developer as possible. I agree, in general, but for 
> an API at this level we should go for maximum flexibility that gives as 
> much power to the web application as we possibly can.

I strongly disagree. It's all the more important for an API such as this 
one to make things as simple as possible for authors, at least in the 
first version. We can't know what power authors want until we give them 
something. There is a huge risk, especially with an API for something as 
complicated as this, in overdelivering in the first iteration: we might 
end up severely constrained due to initial decisions.

> Programming for it may not be a cake-walk, but that's OK (Network programming
> *is* hard!).

It doesn't have to be. The hard stuff here should all be done by the 
browser. There's no reason we need to make this hard.

I would say that we should measure the success of this effort by how easy 
it is for Web authors to add video conferencing features to their sites. 
If we make it something only experts can do, we will have failed.

Programming this _should_ be a cake walk. It's not ok for it not to be.

> However, consider that today, practically nobody uses the DOM API 
> directly, everyone is building webapps with jQuery or some other fancy 
> JS toolkit.

That's a failure of the DOM API, one that we are trying to resolve over 
time. Let's not repeat that mistake.

> CSS is hard to understand and write, that's why we have things like

Lots of people write CSS directly. We should make CSS easier so that 
things like lesscss are not necessary.

> Let's face it, we're not going to get rid of these cross-browser JS 
> libraries because there's always bound to be (minor, at the least) 
> differences in implementation of any spec.

On the contrary. The whole point of efforts such as the WHATWG 
specifications, the new DOM Core specification, the CSS 2.1 specification, 
etc, is that we can, should, must write specifications that are precise 
enough that they can and will be interoperably implemented.

> But, in a couple years if we discover that we can't write this totally 
> awesome MMORPG-in-the-browser that allows players to talk to each other 
> while playing because of API limitations, well that would be not so 
> awesome :-)

We can always extend the API later. Indeed, I fully expect that we will 
extend this API for years. That's the best way to design APIs: start 
simple, ensure browsers implement it interoperably, and iterate quickly.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
                          U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 12 July 2011 03:22:47 UTC