RE: Mozilla/Cisco API Proposal from Ian Hickson on 2011-07-14 (public-webrtc@w3.org from July 2011)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 14 Jul 2011 01:33:34 +0000 (UTC)
To: "public-webrtc@w3.org" <public-webrtc@w3.org>
Message-ID: <Pine.LNX.4.64.1107132240580.2079@ps20323.dreamhostps.com>
On Mon, 11 Jul 2011, Timothy B. Terriberry wrote:
> > > Perhaps one example is the sort of thing described by 
> > > MediaStreamTrackHints in the proposal. The Opus audio codec from the 
> > > IETF standardization effort can switch between separate "voip" and 
> > > "audio" coding modes. The script setting up the connection may have 
> > > context information about which of these are more appropriate for 
> > > its [...]
> > 
> > This is why the Web uses a declarative high-level model. It lets us 
> > make the Web better in the future.
> 
> I would argue the "voip" versus "audio" modes _are_ high-level 
> declarations.

Agreed. I was referring more to proposals which gave bitrates, frequency 
responses, resolutions, framerates, specific codecs, etc.

Specifically which modes we offer, and in what contexts we offer them, is 
a different matter. Currently the only use case I've really seen is video 
conferencing, for which we probably don't need to give any modes.


On Tue, 12 Jul 2011, Cullen Jennings wrote:
> 
> Just a side note, the best way to do DTMF (RFC 4733) is not as an audio 
> tone but as a separate codec for transferring the data that keys were 
> pressed.

I'm all for supporting DTMF. It seems that the main use case would be 
communicating with legacy systems (on the Web, one would presumably use a 
native AJAX app or some such and only use audio/video conferencing to 
speak to an actual human), so it seems one of our requirements here has to 
be interoperability with the POTS, and thus with legacy SIP systems.

How do legacy SIP systems support DTMF? We presumably have to do it the 
same way.


> 1) How you want to handle music vs spoken voice is totally different. For 
> spoken voice, you want to filter out background noise, fan hum, etc 
> while for music you typically want to capture all the nuances.

Agreed. Do we have any use cases that involve music?


> 2) Battery life. My iphone might be capable of doing awesome H.265 HD 
> video but with a battery life of 45 minutes due to no hardware 
> acceleration. However, with h.264 in the right mode it might get 10x 
> that time. And if I was in an application where HD provided no value, I 
> might want to use smaller resolution. So the browsers might always want 
> to provide the "best" it can but "best" gets more complicated when 
> trading off battery life. I used 264/265 in this example but you will 
> have the same problem with VP8/VP9. Of course the default simple 
> examples should make it so a user does not have to think about this. And 
> the API should be designed such that people don't override with lame 
> defaults. But I think advanced applications need to have some influence 
> over the definition of "best".

Agreed, but that's the kind of thing we find out after deploying the first 
revision.

Whatever we do here is going to have to be continually maintained and 
improved for years to come. It's not like we get just one shot at making a 
P2P video conferencing system and once we have it we can never change it. 
On the contrary. It's a long-term investment with continuous improvement.

To allow us to move fast and iterate, we have to start with the bare 
minimum, and then add what people want. Almost by definition, advanced 
applications aren't going to be completely catered for in our first 
attempt. :-)


On Tue, 12 Jul 2011, Cullen Jennings wrote:
> 
> There are ways in SDP that are used by current phones to say, I'm not 
> sending a particular RTP stream but allow it to be restarted again in 
> the future. If we can figure out what sort of API we want at the high 
> level, I think I can show how to map this on to existing SDP.

Do you have any documentation on this? I'd be happy to support that in the 
spec.


On Tue, 12 Jul 2011, Cullen Jennings wrote:
> 
> Ian, I apologize for asking this again because I remember seeing that 
> you had posted an answer to the question I am about to ask somewhere 
> else but I can't find it. In the general case of using the API outside 
> browsers, or even in browsers, how does one solve the race condition 
> that happens after creation of the object and before installing the 
> onIncomgStream callback and the arrival of the first incoming stream? 
> Are they queued ?

The short answer is yes, they are queued.

The long answer is that Web browsers use an event loop. Scripts execute as 
a task in the event loop. Events are (generally -- there are exceptions) 
fired as tasks in the event loop. So while a script is running, events 
don't fire.

See:
   http://www.whatwg.org/specs/web-apps/current-work/complete/webappapis.html#event-loops
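
To illustrate why the race cannot happen, here is a minimal sketch of a toy event loop. ToyEventLoop and ToyConnection are hypothetical names invented for this illustration; they are not part of any browser API or of the PeerConnection spec. The sketch assumes only what is said above: scripts run as tasks, and events are queued as tasks.

```javascript
// Toy model of an event loop: tasks run one at a time, in queue order.
class ToyEventLoop {
  constructor() {
    this.tasks = [];
  }
  queueTask(fn) {
    this.tasks.push(fn);
  }
  run() {
    while (this.tasks.length > 0) {
      const task = this.tasks.shift();
      task();
    }
  }
}

// Hypothetical connection object (an assumption for illustration only).
class ToyConnection {
  constructor(loop) {
    this.onstream = null;
    // A stream "arrives" immediately, but the event is queued as a task
    // instead of being fired synchronously.
    loop.queueTask(() => {
      if (this.onstream) this.onstream('stream-1');
    });
  }
}

function runDemo() {
  const loop = new ToyEventLoop();
  const received = [];
  // This task plays the role of the author's script.
  loop.queueTask(() => {
    const conn = new ToyConnection(loop);
    // The handler is installed after creation, but within the same task,
    // so the queued event cannot fire before this line runs.
    conn.onstream = (stream) => received.push(stream);
  });
  loop.run();
  return received;
}
```

In this model the handler always sees the first stream, because the event task can only run after the script task has finished.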


> I also think there are issues around efficiency of signaling. If the JS 
> is going to add several local streams, if you do offer / answer after 
> each stream is added, you end up with a pretty bad delay for overall 
> setup time.

Valid point. I've updated the spec so that it's clear that added and 
removed streams are all processed together each time the UA reaches a 
stable state (to a first approximation, this means when the script ends.)
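
As a rough sketch of the batching idea, the following toy class collects stream changes and hands them over in one go. BatchingNegotiator, reachStableState() and the sendOffer callback are names assumed for this illustration; they are not taken from the spec, which defines the actual algorithm.

```javascript
// Hypothetical model: stream changes are recorded, not negotiated one by one.
class BatchingNegotiator {
  constructor(sendOffer) {
    this.sendOffer = sendOffer; // called once per batch of changes
    this.pending = [];
  }
  addStream(id) {
    this.pending.push({ op: 'add', id });
  }
  removeStream(id) {
    this.pending.push({ op: 'remove', id });
  }
  // Stands in for "the UA reaches a stable state", roughly when the
  // script ends. In a browser this would be triggered by the UA itself.
  reachStableState() {
    if (this.pending.length === 0) return;
    const changes = this.pending;
    this.pending = [];
    this.sendOffer(changes);
  }
}

function runBatchDemo() {
  const offers = [];
  const negotiator = new BatchingNegotiator((changes) => offers.push(changes));
  negotiator.addStream('camera');
  negotiator.addStream('microphone');
  negotiator.addStream('screen');
  negotiator.reachStableState();
  return offers;
}
```

Under these assumptions, three addStream() calls in one script produce a single offer carrying all three changes, so setup does not pay for three round trips.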


On Tue, 12 Jul 2011, Cullen Jennings wrote:
> On Jul 11, 2011, at 20:22 , Ian Hickson wrote:
> > 
> > Whether you're on a call or not doesn't affect new calls. It's a new 
> > ICE transaction each time, and ICE transactions don't interfere with 
> > each other.
> 
> mostly agree though they do interfere with each other's pacing

Sure.


> > The way you start a call with the PeerConnection API is that you 
> > create a PeerConnection object, and then don't call 
> > signalingMessage(). The API will then call the callback with the 
> > initial SDP offer, to send via the signaling channel to the other 
> > peer.
> 
> I'm sort of wondering how long you have to not call the signalingMessage 
> before it sends the SDP offer?

The next time the event loop spins (i.e. when the script ends).

Please see the spec for the specific details:

http://www.whatwg.org/specs/web-apps/current-work/complete.html#dom-peerconnection


> > The way you listen for a call with the PeerConnection API is that when 
> > you receive that initial SDP offer from the initiating peer, you 
> > create a PeerConnection object and immediately call signalingMessage() 
> > on it with the SDP offer.
> > 
> > If these get out of sync, the normal ICE mechanism resolves any 
> > conflicts. (For example, both sides can start the connection; the ICE 
> > agent will then figure out what's going on. This is all defined by ICE 
> > already.)
> > 
> > There's thus no need for an explicit open() or listen().
> 
> There's two layers of signaling going on here. The ICE and the SDP. If 
> both sides simultaneously send an offer to the other side, I don't think 
> ICE sorts out what happens next.

As far as I can tell, this is an ICE role conflict, which ICE handles 
fine. If it's not, could you elaborate on what the difference is between 
the case you are concerned about and an ICE role conflict? (Ideally with 
examples, so that I can compare it to what I thought the spec said.)


On Tue, 12 Jul 2011, Anant Narayanan wrote:
> On 7/11/11 8:22 PM, Ian Hickson wrote:
> > On Mon, 11 Jul 2011, Anant Narayanan wrote:
> > > 
> > > navigator.getMediaStream({}, onsuccess, onerror);
> > 
> > So how do you say you only want audio?
> 
> navigator.getMediaStream({"audio":{}}, onsuccess, onerror);
> 
> No properties is different from 1 or more, but I can see your point it 
> does seem a bit clunky.

Yes, having {} mean the same as {'audio':{},'video':{}} but mean something 
different than {'audio':{}} or {'video':{}} seems a bit unintuitive. :-)


> > No, I mean, the string syntax that getUserMedia() is defined as taking 
> > is already defined in an extensible, forward-compatible way. You first 
> > split on commas to get the individual tracks that the script wants to 
> > get, and then you split on spaces to get more data. So for example, if 
> > the argument is the string "audio, video user" this indicates that the 
> > script wants both audio and a user-facing video stream ("face-side 
> > camera").
> 
> On a separate note, we removed the 'user'/'environment' options because 
> we wanted to come up with better nomenclature for those terms. I 
> originally suggested 'front'/'back', but they seem too restrictive too.

I'm certainly open to better terms (front/back don't work; the front 
camera on a DSLR is very different than the front camera on a phone, hence 
the user/environment names). But this is a pretty important feature.


> > We can extend this further quite easily, e.g.:
> > 
> >     getUserMedia('video 30fps 640x480');
> > 
> > ...to get a video-only stream at VGA-resolution and 30 FPS.
> 
> What about the order of arguments separated by spaces?

What about it?
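
The comma-then-space splitting described in this exchange can be sketched as follows. parseMediaOptions and the { kind, hints } result shape are assumptions made for this illustration; the spec defines the real parsing rules and which tokens are recognised.

```javascript
// Split on commas to get the requested tracks, then on whitespace to get
// the track kind followed by any extra tokens. Nothing is validated here.
function parseMediaOptions(options) {
  return options
    .split(',')
    .map((part) => part.trim().split(/\s+/).filter((token) => token.length > 0))
    .filter((tokens) => tokens.length > 0)
    .map(([kind, ...hints]) => ({ kind, hints }));
}
```

For example, parseMediaOptions('audio, video user') yields one entry of kind 'audio' with no hints and one of kind 'video' with the hint 'user'. A string such as 'video 30fps 640x480' parses the same way, with the extra tokens kept in the order given.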


> > > We should at-least allow those parameters that can be specified 
> > > across multiple implementations. For example, quality is a number 
> > > between 0.0 and 1.0, and that might mean different things for 
> > > different codecs, and that's okay.
> > 
> > It's not clear to me that there should even be a codec involved at 
> > this point. What you get from getUserMedia() need never actually be 
> > encoded, in particular if you consider the case I mentioned earlier of 
> > just taking an audio stream and piping it straight to the speaker -- 
> > it's quite possible that on some platforms, this can be done entirely 
> > in hardware without the audio ever being encoded at all.
> 
> In what cases would you not encode data that comes from the user's 
> hardware? Is there a use-case for sending to the speaker what the user 
> just spoke into the microphone?

That's what an amplifier is, no? Audio monitors are common too. Old-style 
telephones had a local audio loop which was lost in the transition to 
mobile phones that people may wish to reimplement, too.

The same situation exists with video. In a video-conference, there is 
usually a local video display. My understanding is that this is sometimes 
implemented as a hardware-level feature (video going straight from the 
camera to the video card). It would be very sad to preclude this kind of 
thing, requiring that the video be compressed and decompressed just to 
show a local loop.


> I can see one use-case for video, doing local face recognition for login 
> to a website, so data is painted on a canvas straight from hardware 
> without any conversion. I agree that codecs need not be involved at all, 
> something like 'quality' is pretty generic enough.

Quality in particular seems like something ripe for authors to 
misunderstand. "Why of course I want the best quality!"; it works great on 
the local test setup, then a user on dial-up tries it and it's a disaster. 
Or alternatively, "Let's put the quality setting way down, because I want 
dial-up users to find this works great"; followed by a user in an intranet 
trying to video conference with someone else in the same building with 
terabit ethernet but not being able to get good quality.

Better to let the system autonegotiate the quality, IMHO. At least at 
first, until we have a better understanding of what Web authors do with 
this stuff.


> > Resolution is an example of why it might not make sense to be giving 
> > too many options at this level. Consider a case where a stream is 
> > plugged into a local <video> at 100x200 and into a PeerConnection that 
> > goes to a remote host which then has it plugged into a <video> at 
> > 1000x2000. You really want the camera to switch to a higher 
> > resolution, rather than have it be fixed at whatever resolution the 
> > author happened to come up with when he wrote the code -- especially 
> > given the tendency of Web authors to specify arbitrary values when 
> > prompted for defaults.
> 
> We're not prompting them to provide values, and the behavior you specify 
> will be the one that authors gets if they don't specify anything on 
> either end.
> 
> Is your fear that if we allow the API to configure things, webapp 
> authors will use them even if they don't need to?

Yes. We've seen this over and over on the Web. The classic example is 
addEventListener(), which lets you opt for a capture listener or a bubble 
listener -- zillions of people use capture for no particular reason, even 
though the semantics they want are those of a bubble listener.
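
For readers unfamiliar with the distinction, here is a deliberately simplified toy model of the two listener kinds. makeNode, addListener and dispatch are invented for this sketch and are not the DOM API; real dispatch has further rules (stopping propagation, handling at the target, and so on).

```javascript
// Each toy node keeps separate lists of capture and bubble listeners.
function makeNode(name) {
  return { name, capture: [], bubble: [] };
}

function addListener(node, listener, useCapture) {
  (useCapture ? node.capture : node.bubble).push(listener);
}

// path is ordered from the outermost ancestor down to the target.
function dispatch(path) {
  // Capture phase: outermost ancestor first, target last.
  for (const node of path) {
    node.capture.forEach((listener) => listener(node));
  }
  // Bubble phase: target first, outermost ancestor last.
  for (const node of [...path].reverse()) {
    node.bubble.forEach((listener) => listener(node));
  }
}

function runDispatchDemo() {
  const order = [];
  const root = makeNode('document');
  const button = makeNode('button');
  addListener(root, () => order.push('document (capture)'), true);
  addListener(root, () => order.push('document (bubble)'), false);
  addListener(button, () => order.push('button (bubble)'), false);
  dispatch([root, button]);
  return order;
}
```

In this model a capture listener on an ancestor runs before the target's own listeners, which is rarely what an author who just wants to react to a click has in mind.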


> > > > How should this be notified on the other side?
> > > 
> > > I believe it is possible in RTP to ask the other side to stop 
> > > sending something. If not, we could always just send our own UDP 
> > > message.
> > 
> > I mean, how should it be exposed in the API?
> 
> Perhaps an event on the MediaStream at the other end, this part is not 
> fleshed out fully yet.

I think in the short term we're probably best just not providing this 
feature, since it doesn't add anything that authors can't do themselves 
already (with the API as it stands today).


> > > > Should it be possible for the other side to just restart sending the
> > > > stream?
> ...
> > > I don't think so. If a peer explicitly set the readyState of a remote
> > > stream to BLOCKED it means they don't want data. The other side could of
> > > course, send a completely new stream if it wishes to.
> ...
> > It's not clear to me what the use case is here. Can you elaborate on why
> > the API should support this natively instead of requiring that authors
> > implement this themselves using their signalling channel? (The latter is
> > pretty trivial to do, so it's not clear to me that this really simplifies
> > anything.)
> 
> Just from an ease-of-programming standpoint, if we can support it we should.

I strongly disagree with this approach. There's lots of stuff we _can_ 
support. If we try to support everything we can support, we'll just have a 
lot of bugs. :-)

Best to just do a few things well, IMHO, and let authors do the rest. 
We'll never be able to guess at everything authors might want to do.


> Using the out-of-band signalling channel is certainly not as trivial as
> setting one property on the stream or track object.

It's more than just setting one property -- you still have to detect when 
the property has been set on the other side to update the UI accordingly, 
etc. I'm not at all convinced that it's significantly easier than just 
doing it via the signalling channel.
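
As a rough sketch of what doing it via the signalling channel might look like, the following uses an in-memory channel pair. createChannelPair, CallUi and the { type: 'hold', held } message format are all assumptions made for this illustration; a real application would carry the message over whatever channel it already uses for the SDP strings.

```javascript
// Hypothetical in-memory stand-in for an application's signalling channel.
function createChannelPair() {
  const a = { onmessage: null, send: null };
  const b = { onmessage: null, send: null };
  a.send = (text) => { if (b.onmessage) b.onmessage(text); };
  b.send = (text) => { if (a.onmessage) a.onmessage(text); };
  return [a, b];
}

// Tracks hold state on each side so the UI can be updated accordingly.
class CallUi {
  constructor(channel) {
    this.channel = channel;
    this.localHold = false;
    this.remoteHold = false;
    channel.onmessage = (text) => {
      const message = JSON.parse(text);
      if (message.type === 'hold') {
        this.remoteHold = message.held;
      }
    };
  }
  setHold(held) {
    this.localHold = held;
    // The application would also stop or resume its own media here.
    this.channel.send(JSON.stringify({ type: 'hold', held }));
  }
}
```

Either way the receiving side needs code that notices the change and updates the UI; here that is the onmessage handler.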


> I also suspect that it will be a pretty common scenario where the far 
> end wants to temporarily block a particular track or stream. Call hold 
> is the main example I had in mind.

I can't recall the last time I used call hold. I don't think I've _ever_ 
used call hold on the Web. Are we sure it's a critical feature?

Is it not something we can delay until a later version to see if people 
actually want it? (Since it's possible for authors to implement, we'll be 
able to tell how common a desire it is by looking at deployments.)


> I'm fine with getObjectURL(); since it means we can just use the 
> existing src attribute. Do you intend for the URL that is returned by 
> that function to be UA specific or something that is standardized?

createObjectURL() is part of the File API:

   http://dev.w3.org/2006/webapi/FileAPI/#dfn-createObjectURL

(Sorry, I got the wrong name in the thread earlier.)

How it's extended to support streams is defined here:

   http://www.whatwg.org/specs/web-apps/current-work/complete/video-conferencing-and-peer-to-peer-communication.html#dom-url


> > > The author *needn't* care about it (simply don't provide the hints) 
> > > but can if they want to. Sometimes you're transmitting fast moving 
> > > images, other times you're transmitting a slideshow (where you want 
> > > each slide to be of high quality, but very low frame rate). Only the 
> > > application can know this, and it'd be good for the platform to 
> > > optimize for it.
> > 
> > It seems like the user agent is actually in a better position to know 
> > this than the author. Shouldn't this just be automatic? At least in 
> > the initial implementation. It would suck if we boxed ourselves into 
> > an API where authors are required to guide implementations through 
> > things like this, given how rarely authors get this kind of thing 
> > right...
> 
> I don't understand, how could the UA know what kind of information is 
> being transmitted?

Well presumably distinguishing fast-moving content from a static slide 
show is a solved problem. :-)


> Are you suggesting dynamic image analysis of some sort that results in 
> adaptive codec switching based on changes in the input stream?

Yes, at least for the simple cases you describe (fast-moving vs static).


> The use-case is something like slideshare which presents videos of a 
> talk along with the slide deck, where the website explicitly knows which 
> streams are those of slides and which of the presenter, and can tell the 
> UA which stream is which.

For the case of slides, it seems that by far the better solution would be 
to send the slides out-of-band and display them at their native 
resolution, rather than send them via video.


> > > Fair enough. How about sendSignalingMessage() and 
> > > receivedSignalingMessage()? Perhaps too long :-)
> > 
> > Well sendSignalingMessage's name doesn't matter, since it's a 
> > callback.
> > 
> > receivedSignalingMessage() is what I originally wanted to call 
> > signalingMessage(), but it seemed overly long, yeah. I don't feel 
> > strongly on this so if people would rather have it called 
> > receivedSignalingMessage() that's fine by me.
> 
> It just wasn't clear to me at first read what the purpose of 
> signalingChannel was and I had to read it a couple more times to 
> understand. That's the only reason I wanted it renamed :)

I renamed it.

(Also renamed StreamTrack as you suggested.)


> The other reason I like explicit open() and listen() is that it makes 
> clear which side is calling who, and listen() has the potential to give 
> us presence ("I'm now ready to receive calls").

You have to do some signaling over the signaling channel to set up ICE 
anyway, so there's not really a concept of "listening" without an initial 
SDP offer.

I think it's pretty clear who's calling whom in the current API, though, 
since the caller has to create a PeerConnection and then gets a string to 
send via the callback, whereas the listener first receives a string out of 
the blue and has to create a PeerConnection to pass it the string.

Caller:

   var p = new PeerConnection('', callback);
   function callback(s) {
     // send s
   }

Receiver:

   // received s somehow
   var p = new PeerConnection('', callback);
   p.processSignalingMessage(s);


> I completely agree, I also consider it a failure if we don't make our 
> API simple enough for authors. I'm not suggesting that we make it hard 
> in any way at all; and all our proposed configuration options are not 
> required to be specified by authors. In the very simplest case (see our 
> 'A' calls 'B' example) there's hardly anything specified by the author, 
> and the UA chooses the best options. But they're there for more 
> sophisticated webapps, if needed.

I think maybe where we disagree is that I see optional features as 
complexity that affects even the programmers who don't need them.

When you go to learn a feature, you immediately see all the complexity. If 
there's a lot of it, you get scared away from it.

Take the earlier example of addEventListener(). The third argument of this 
method is an advanced feature that should almost always be set to false. 
Yet look at how the method is explained in tutorials:

   https://developer.mozilla.org/en/DOM/element.addEventListener
   http://www.javascriptkit.com/domref/windowmethods.shtml

As a new author, you immediately see all three arguments.

The same applies to other features. Tutorials tend to explain everything. 
So the more features we have, the more authors will either be scared by 
the feature, or the more they'll guess at what they should do (and guesses 
are rarely correct).

Eventually, I'm all for adding lots of features. But we should only add 
features that authors have clearly indicated they need, IMHO. This means 
starting small, and iterating in concert with implementation and author 
usage.


> > The whole point of efforts such as the WHATWG specifications, the new 
> > DOM Core specification, the CSS 2.1 specification, etc, is that we 
> > can, should, must write specifications that are precise enough that 
> > they can and will be interoperably implemented.
> 
> APIs only get us halfway there, a precise specification meant to be 
> entirely interoperable should then also include information on which 
> codecs are used etc.

Yes. We should absolutely specify that too. Unfortunately there's no 
solution everyone is willing to implement as far as codecs go, but that's 
an unfortunate exception. It's not an example to follow. :-)


> The <video> specification, for instance, has a very elegant and simple 
> API; however, we don't see mass adoption because of disagreements on 
> codecs. Web developers who do want to use <video> end up using something 
> like videojs.com for multiple fallbacks based on the UAs of their users.

Yes. It's a terribly bad situation. Unfortunately, there is currently no 
solution.


> I'm fully on board for starting simple and iterating quickly :-) If we 
> think we can add new capabilities in the future without breaking 
> compatibility with the APIs we started with, that would be great.

We can definitely add features later. :-)


> A good way to frame the discussion would be to take concrete use-cases 
> and see if our API supports it. If not, what can be the simplest way to 
> enable that use-case? Or perhaps we decide to not work on that use-case 
> for the current iteration and come back to it later, works for me!

Agreed.

The main use cases I've been considering are:

 - 1:1 Web video conferencing, like what Facebook recently launched, with 
   the server providing discovery and presence.

 - 1:1 audio telecommunication from the Web to a SIP device, with the help 
   of a gateway server for presence and call setup.

 - P2P gaming (data only).

Obviously 1:many and many:many video conferencing (such as what Google+ 
recently launched in trial) are interesting too; I mainly didn't look at 
those since I couldn't see anything in ICE that supported them natively, 
and at a higher level they can be approximated as either every node 
connecting to every other node, or every node connecting to a server that 
repeats the video back out.
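
A back-of-the-envelope sketch of how the two approximations scale, assuming n participants and one connection per pair of directly connected endpoints. The function names are invented for this illustration.

```javascript
// Full mesh: every node connects to every other node.
function meshConnections(n) {
  return (n * (n - 1)) / 2;
}

// In a mesh, each node sends its own video to every other node.
function meshUploadsPerNode(n) {
  return n - 1;
}

// Star: every node connects once to a server that repeats the video out.
function starConnections(n) {
  return n;
}

// In a star, each node sends its own video once, to the server.
function starUploadsPerNode(n) {
  return 1;
}
```

With four participants the mesh needs six connections and three uploads per node, while the star needs four connections and one upload per node, at the cost of running a server.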

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 14 July 2011 01:33:58 UTC