Re: Mozilla/Cisco API Proposal

On Jul 13, 2011, at 18:33 , Ian Hickson wrote:

> 
> On Mon, 11 Jul 2011, Timothy B. Terriberry wrote:
>>>> Perhaps one example is the sort of thing described by 
>>>> MediaStreamTrackHints in the proposal. The Opus audio codec from the 
>>>> IETF standardization effort can switch between separate "voip" and 
>>>> "audio" coding modes. The script setting up the connection may have 
>>>> context information about which of these are more appropriate for 
>>>> its [...]
>>> 
>>> This is why the Web uses a declarative high-level model. It lets us 
>>> make the Web better in the future.
>> 
>> I would argue the "voip" versus "audio" modes _are_ high-level 
>> declarations.
> 
> Agreed. I was referring more to proposals which gave bitrates, frequency 
> responses, resolutions, framerates, specific codecs, etc.
> 
> Specifically which modes we offer, and in what contexts we offer them, is 
> a different matter. Currently the only use case I've really seen is video 
> conferencing, for which we probably don't need to give any modes.

In video conferencing there is often more than one video stream, and when there is, some streams are typically more important than others. For some conferences a high-quality version of the presentation material is more important than the view of the presenter; for others it is the opposite. I understand your position that presentations are better not done as video, and though I agree with you, the practical reality is that PowerPoint decks with animation are widely used and video becomes about the only way to handle them. When people say presentation, they are often really talking about application and desktop sharing.
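
As a sketch of the kind of hint I mean (the hint names here are purely illustrative, loosely along the lines of the MediaStreamTrackHints idea in the proposal, and not from any spec):

    // p is a PeerConnection; the two streams would come from capture or
    // screen sharing. The second argument is a hypothetical hints object.
    p.addStream(presenterVideo, { video: { importance: "low" } });
    p.addStream(slidesVideo,    { video: { importance: "high",
                                           motion: "static" } });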


> 
> 
> On Tue, 12 Jul 2011, Cullen Jennings wrote:
>> 
>> Just a side note, the best way to do DTMF (RFC 4733) is not as an audio 
>> tone but as a separate codec for transferring the data about which keys 
>> were pressed.
> 
> I'm all for supporting DTMF. It seems that the main use case would be 
> communicating with legacy systems (on the Web, one would presumably use a 
> native AJAX app or some such and only use audio/video conferencing to 
> speak to an actual human), so it seems one of our requirements here has to 
> be interoperability with the POTS, and thus with legacy SIP systems.
> 
> How do legacy SIP systems support DTMF? We presumably have to do it the 
> same way.

I see the use case for DTMF as pretty much only interaction with legacy systems, for things like voicemail or when you call FedEx or other IVRs. The approach that works best is RFC 4733. The short summary is that it sends the digits over the RTP stream and sets up a separate codec for DTMF. It's not really a codec in the traditional sense; it just sends an RTP packet that indicates the user pressed a "7" or whatever. I imagine we need some API to indicate that the user wants to send a DTMF digit, and then the browser can just generate the right RTP packet and send it. I don't think there is a use case where we need to be able to set the volume, but there are use cases for duration. Many IVRs require what is called a long #, where you have to hold the # key down for a few seconds. 
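
As a rough illustration only (sendDTMF() is a hypothetical method, not something in any current draft; p is an already-connected PeerConnection):

    // the browser would translate each call into an RFC 4733
    // telephone-event packet on the established RTP stream
    p.sendDTMF('7');          // an ordinary key press
    p.sendDTMF('#', 3000);    // a "long #": hold the key for about 3 seconds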


> 
> 
>> 1) How you want to handle music vs spoken voice is totally different. For 
>> spoken voice, you want to filter out background noise, fan hum, etc 
>> while for music you typically want to capture all the nuances.
> 
> Agreed. Do we have any use cases that involve music?

I'll let the game people speak up. Oh right, they are not here. Guess not :-) 

> 
> 
>> 2) Battery life. My iPhone might be capable of doing awesome H.265 HD 
>> video but with a battery life of 45 minutes due to no hardware 
>> acceleration. However, with H.264 in the right mode it might get 10x 
>> that time. And if I was in an application where HD provided no value, I 
>> might want to use smaller resolution. So the browsers might always want 
>> to provide the "best" it can but "best" gets more complicated when 
>> trading off battery life. I used 264/265 in this example but you will 
>> have the same problem with VP8/VP9. Of course the default simple 
>> examples should make it so a user does not have to think about this. And 
>> the API should be designed such that people don't override with lame 
>> defaults. But I think advanced applications need to have some influence 
>> over the definition of "best".
> 
> Agreed, but that's the kind of thing we find out after deploying the first 
> revision.

I understand your overall point, but this is something we have already learned from what is already deployed. 

> 
> Whatever we do here is going to have to be continually maintained and 
> improved for years to come. It's not like we get just one shot at making a 
> P2P video conferencing system and once we have it we can never change it. 
> On the contrary. It's a long-term investment with continuous improvement.

Totally agree. That's why it's important to make sure we have the right extensibility points in place to start with. I'm not saying we need to think of every way that "best" might ever be defined on day one, but we do need to make sure there is a clear point where we can add the things we learn in the future. 

> 
> To allow us to move fast and iterate, we have to start with the bare 
> minimum, and then add what people want. Almost by definition, advanced 
> applications aren't going to be completely catered for in our first 
> attempt. :-)

Sure, I understand. I also understand how slowly users actually move to new browsers. We need to strike the right balance. 

> 
> 
> On Tue, 12 Jul 2011, Cullen Jennings wrote:
>> 
>> There are ways in SDP that are used by current phones to say, I'm not 
>> sending a particular RTP stream but allow it to be restarted again in 
>> the future. If we can figure out what sort of API we want at the high 
>> level, I think I can show how to map this on to existing SDP.
> 
> Do you have any documentation on this? I'd be happy to support that in the 
> spec.
> 
> 
> On Tue, 12 Jul 2011, Cullen Jennings wrote:
>> 
>> Ian, I apologize for asking this again because I remember seeing that 
>> you had posted an answer to the question I am about to ask somewhere 
>> else but I can't find it. In the general case of using the API outside 
>> browsers, or even in browsers, how does one solve the race condition 
>> that happens after creation of the object and before installing the 
>> onIncomingStream callback and the arrival of the first incoming stream? 
>> Are they queued ?
> 
> The short answer is yes, they are queued.
> 
> The long answer is that Web browsers use an event loop. Scripts execute as 
> a task in the event loop. Events are (generally -- there are exceptions) 
> fired as tasks in the event loop. So while a script is running, events 
> don't fire.
> 
> See:
>   http://www.whatwg.org/specs/web-apps/current-work/complete/webappapis.html#event-loops

Thanks - I get that now, and that sorts out a bunch of questions. 
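
For example (a sketch only; onIncomingStream follows the question above and sendToSignallingChannel is a placeholder for the application's signalling code, neither taken from the spec), this cannot race, because the incoming-stream event is queued as a task and will not fire while the script is still running:

    var p = new PeerConnection('', sendToSignallingChannel);
    // safe: no event can fire between construction and this assignment,
    // since events are dispatched as tasks on the event loop
    p.onIncomingStream = function (evt) {
      // handle the remote stream
    };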

> 
> 
>> I also think there are issues around efficiency of signaling. If the JS 
>> is going to add several local streams, if you do offer / answer after 
>> each stream is added, you end up with a pretty bad delay for overall 
>> setup time.
> 
> Valid point. I've updated the spec so that it's clear that added and 
> removed streams are all processed together each time the UA reaches a 
> stable state (to a first approximation, this means when the script ends.)
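
For example, with that change, something like this (a sketch; addStream() is as in the current draft, and the stream and callback names are placeholders) should result in a single offer covering both streams rather than one offer per addStream() call:

    var p = new PeerConnection('', sendToSignallingChannel);
    p.addStream(localAudioStream);   // queued
    p.addStream(localVideoStream);   // queued
    // both additions are processed together once the script returns and
    // the UA reaches a stable state, so only one offer/answer round trip
    // is needed
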
> 
> 
> On Tue, 12 Jul 2011, Cullen Jennings wrote:
>> On Jul 11, 2011, at 20:22 , Ian Hickson wrote:
>>> 
>>> Whether you're on a call or not doesn't affect new calls. It's a new 
> ICE transaction each time, and ICE transactions don't interfere with 
>>> each other.
>> 
>> mostly agree, though they do interfere with each other's pacing
> 
> Sure.
> 
> 
>>> The way you start a call with the PeerConnection API is that you 
>>> create a PeerConnection object, and then don't call 
>>> signalingMessage(). The API will then call the callback with the 
>>> initial SDP offer, to send via the signaling channel to the other 
>>> peer.
>> 
>> I'm sort of wondering how long you have to not call the signalingMessage 
>> before it sends the SDP offer?
> 
> The next time the event loop spins (i.e. when the script ends).
> 
> Please see the spec for the specific details:
> 
> http://www.whatwg.org/specs/web-apps/current-work/complete.html#dom-peerconnection
> 
> 
>>> The way you listen for a call with the PeerConnection API is that when 
>>> you receive that initial SDP offer from the initiating peer, you 
>>> create a PeerConnection object and immediately call signalingMessage() 
>>> on it with the SDP offer.
>>> 
>>> If these get out of sync, the normal ICE mechanism resolves any 
>>> conflicts. (For example, both sides can start the connection; the ICE 
>>> agent will then figure out what's going on. This is all defined by ICE 
>>> already.)
>>> 
>>> There's thus no need for an explicit open() or listen().
>> 
>> There's two layers of signaling going on here. The ICE and the SDP. If 
>> both sides simultaneously send an offer to the other side, I don't think 
>> ICE sorts out what happens next.
> 
> As far as I can tell, this is an ICE role conflict, which ICE handles 
> fine. If it's not, could you elaborate on what the difference is between 
> the case you are concerned about and an ICE role conflict? (Ideally with 
> examples, so that I can compare it to what I thought the spec said.)

The ICE spec is very confusing - not a great spec. My original comment was about whether we needed explicit open() or listen(), and I'm going to go rethink that now that I understand the event model better. But to explain a bit about 1-to-many with ICE ...

Say A wants to connect to B and C. Let's say A is not behind any NAT, has a single IP, and has no TURN server. So all A gets is a single local port as its only candidate. A gathers candidates and then starts ICE connections with B and C using the same candidates it gathered. B and C both do the usual ICE thing, and A ends up connected to both B and C, using the same local port to talk to each of them. A would have to keep track of the remote IP addresses to sort out the incoming packets, or, when using SRTP, keep track of the MKI if that was in use. 

ICE does not support 1-to-many in the sense of using multicast to let A send one packet that reaches both B and C. 
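
In API terms (a sketch only; the callbacks are placeholders for whatever signalling the application uses), that is simply two PeerConnection objects, and therefore two independent ICE sessions, even if the UA reuses the same local port under the hood:

    var toB = new PeerConnection('', function (s) { /* send s to B */ });
    var toC = new PeerConnection('', function (s) { /* send s to C */ });
    // nothing here is multicast: A still sends a separate copy of every
    // packet to B and to C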

> 
> 
> On Tue, 12 Jul 2011, Anant Narayanan wrote:
>> On 7/11/11 8:22 PM, Ian Hickson wrote:
>>> On Mon, 11 Jul 2011, Anant Narayanan wrote:
>>>> 
>>>> navigator.getMediaStream({}, onsuccess, onerror);
>>> 
>>> So how do you say you only want audio?
>> 
>> navigator.getMediaStream({"audio":{}}, onsuccess, onerror);
>> 
>> No properties is different from 1 or more, but I can see your point; it 
>> does seem a bit clunky.
> 
> Yes, having {} mean the same as {'audio':{},'video':{}} but mean something 
> different than {'audio':{}} or {'video':{}} seems a bit unintuitive. :-)
> 
> 
>>> No, I mean, the string syntax that getUserMedia() is defined as taking 
>>> is already defined in an extensible, forward-compatible way. You first 
>>> split on commas to get the individual tracks that the script wants to 
>>> get, and then you split on spaces to get more data. So for example, if 
>>> the argument is the string "audio, video user" this indicates that the 
>>> script wants both audio and a user-facing video stream ("face-side 
>>> camera").
>> 
>> On a separate note, we removed the 'user'/'environment' options because 
>> we wanted to come up with better nomenclature for those terms. I 
>> originally suggested 'front'/'back', but those seem too restrictive as well.
> 
> I'm certainly open to better terms (front/back don't work; the front 
> camera on a DSLR is very different than the front camera on a phone, hence 
> the user/environment names). But this is a pretty important feature.
> 
> 
>>> We can extend this further quite easily, e.g.:
>>> 
>>>    getUserMedia('video 30fps 640x480');
>>> 
>>> ...to get a video-only stream at VGA-resolution and 30 FPS.
>> 
>> What about the order of arguments separated by spaces?
> 
> What about it?
> 
> 
>>>> We should at-least allow those parameters that can be specified 
>>>> across multiple implementations. For example, quality is a number 
>>>> between 0.0 and 1.0, and that might mean different things for 
>>>> different codecs, and that's okay.
>>> 
>>> It's not clear to me that there should even be a codec involved at 
>>> this point. What you get from getUserMedia() need never actually be 
>>> encoded, in particular if you consider the case I mentioned earlier of 
>>> just taking an audio stream and piping it straight to the speaker -- 
>>> it's quite possible that on some platforms, this can be done entirely 
>>> in hardware without the audio ever being encoded at all.
>> 
>> In what cases would you not encode data that comes from the user's 
>> hardware? Is there a use-case for sending to the speaker what the user 
>> just spoke into the microphone?
> 
> That's what an amplifier is, no? Audio monitors are common too. Old-style 
> telephones had a local audio loop which was lost in the transition to 
> mobile phones that people may wish to reimplement, too.

Most modern digital phones still produce sidetone; it is very confusing to users not to have it at the right level - they start yelling into their phones. 

> 
> The same situation exists with video. In a video-conference, there is 
> usually a local video display. My understanding is that this is sometimes 
> implemented as a hardware-level feature (video going straight from the 
> camera to the video card). It would be very sad to preclude this kind of 
> thing, requiring that the video be compressed and decompressed just to 
> show a local loop.
> 
> 
>> I can see one use-case for video, doing local face recognition for login 
>> to a website, so data is painted on a canvas straight from hardware 
>> without any conversion. I agree that codecs need not be involved at all, 
>> something like 'quality' is generic enough.
> 
> Quality in particular seems like something ripe for authors to 
> misunderstand. "Why of course I want the best quality!"; it works great on 
> the local test setup, then a user on dial-up tries it and it's a disaster. 
> Or alternatively, "Let's put the quality setting way down, because I want 
> dial-up users to find this works great"; followed by a user in an intranet 
> trying to video conference with someone else in the same building with 
> terabit ethernet but not being able to get good quality.
> 
> Better to let the system autonegotiate the quality, IMHO. At least at 
> first, until we have a better understanding of what Web authors do with 
> this stuff.
> 
> 
>>> Resolution is an example of why it might not make sense to be giving 
>>> too many options at this level. Consider a case where a stream is 
>>> plugged into a local<video> at 100x200 and into a PeerConnection that 
>>> goes to a remote host which then has it plugged into a<video> at 
>>> 1000x2000. You really want the camera to switch to a higher 
>>> resolution, rather than have it be fixed at whatever resolution the 
>>> author happened to come up with when he wrote the code -- especially 
>>> given the tendency of Web authors to specify arbitrary values when 
>>> prompted for defaults.
>> 
>> We're not prompting them to provide values, and the behavior you specify 
>> will be the one that authors get if they don't specify anything on 
>> either end.
>> 
>> Is your fear that if we allow the API to configure things, webapp 
>> authors will use them even if they don't need to?
> 
> Yes. We've seen this over and over on the Web. The classic example is 
> addEventListener(), which lets you opt for a capture listener or a bubble 
> listener -- zillions of people use capture for no particular reason, even 
> though the semantics they want are those of a bubble listener.
> 
> 
>>>>> How should this be notified on the other side?
>>>> 
>>>> I believe it is possible in RTP to ask the other side to stop 
>>>> sending something. If not, we could always just send our own UDP 
>>>> message.
>>> 
>>> I mean, how should it be exposed in the API?
>> 
>> Perhaps an event on the MediaStream at the other end, this part is not 
>> fleshed out fully yet.
> 
> I think on the short term we're probably best just not providing this 
> feature, since it doesn't add anything that authors can't do themselves 
> already (with the API as it stands today).
> 
> 
>>>>> Should it be possible for the other side to just restart sending the
>>>>> stream?
>> ...
>>>> I don't think so. If a peer explicitly set the readyState of a remote
>>>> stream to BLOCKED it means they don't want data. The other side could of
>>>> course, send a completely new stream if it wishes to.
>> ...
>>> It's not clear to me what the use case is here. Can you elaborate on why
>>> the API should support this natively instead of requiring that authors
>>> implement this themselves using their signalling channel? (The latter is
>>> pretty trivial to do, so it's not clear to me that this really simplifies
>>> anything.)
>> 
>> Just from an ease-of-programming standpoint, if we can support it we should.
> 
> I strongly disagree with this approach. There's lots of stuff we _can_ 
> support. If we try to support everything we can support, we'll just have a 
> lot of bugs. :-)
> 
> Best to just do a few things well, IMHO, and let authors do the rest. 
> We'll never be able to guess at everything authors might want to do.
> 
> 
>> Using the out-of-band signalling channel is certainly not as trivial as
>> setting one property on the stream or track object.
> 
> It's more than just setting one property -- you still have to detect when 
> the property has been set on the other side to update the UI accordingly, 
> etc. I'm not at all convinced that it's significantly easier than just 
> doing it via the signalling channel.
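
To make that concrete (a sketch only; sendSignalling() and onSignallingMessage() are hypothetical application-level helpers for whatever signalling channel the page already has, not part of any proposed API), "hold" done entirely by the application looks roughly like:

    function holdCall() {
      sendSignalling(JSON.stringify({ type: "hold" }));
      // also pause local rendering of the remote stream and update the UI
    }

    onSignallingMessage(function (raw) {
      var msg = JSON.parse(raw);
      if (msg.type === "hold") {
        // stop rendering / stop sending on this side and update the UI
      }
    });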
> 
> 
>> I also suspect that it will be a pretty common scenario where the far 
>> end wants to temporarily block a particular track or stream. Call hold 
>> is the main example I had in mind.
> 
> I can't recall the last time I used call hold. I don't think I've _ever_ 
> used call hold on the Web. Are we sure it's a critical feature?
> 
> Is it not something we can delay until a later version to see if people 
> actually want it? (Since it's possible for authors to implement, we'll be 
> able to tell how common a desire it is by looking at deployments.)
> 
> 
>> I'm fine with getObjectURL(); since it means we can just use the 
>> existing src attribute. Do you intend for the URL that is returned by 
>> that function to be UA specific or something that is standardized?
> 
> createObjectURL() is part of the File API:
> 
>   http://dev.w3.org/2006/webapi/FileAPI/#dfn-createObjectURL
> 
> (Sorry, I got the wrong name in the thread earlier.)
> 
> How it's extended to support streams is defined here:
> 
>   http://www.whatwg.org/specs/web-apps/current-work/complete/video-conferencing-and-peer-to-peer-communication.html#dom-url
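
As a sketch of the simple wiring (assuming the getUserMedia() string form and the stream extension to createObjectURL() described above; the callback bodies are placeholders), a local preview looks roughly like:

    navigator.getUserMedia('video user', function (stream) {
      var v = document.querySelector('video');
      v.src = URL.createObjectURL(stream);  // no encode/decode needed for
      v.play();                             // a purely local preview
    }, function (error) {
      // the user declined or no camera was available
    });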
> 
> 
>>>> The author *needn't* care about it (simply don't provide the hints) 
>>>> but can if they want to. Sometimes you're transmitting fast moving 
>>>> images, other times you're transmitting a slideshow (where you want 
>>>> each slide to be of high quality, but very low frame rate). Only the 
>>>> application can know this, and it'd be good for the platform to 
>>>> optimize for it.
>>> 
>>> It seems like the user agent is actually in a better position to know 
>>> this than the author. Shouldn't this just be automatic? At least in 
>>> the initial implementation. It would suck if we boxed ourselves into 
>>> an API where authors are required to guide implementations through 
>>> things like this, given how rarely authors get this kind of thing 
>>> right...
>> 
>> I don't understand, how could the UA know what kind of information is 
>> being transmitted?
> 
> Well presumably distinguishing fast-moving content from a static slide 
> show is a solved problem. :-)
> 
> 
>> Are you suggesting dynamic image analysis or some sort that results in 
>> adaptive codec switching based on changes in the input stream?
> 
> Yes, at least for the simple cases you describe (fast-moving vs static).
> 
> 
>> The use-case is something like slideshare which presents videos of a 
>> talk along with the slide deck, where the website explicitly knows which 
>> streams are those of slides and which of the presenter, and can tell the 
>> UA which stream is which.
> 
> For the case of slides, it seems that by far the better solution would be 
> to send the slides out-of-band and display them at their native 
> resolution, rather than send them via video.
> 
> 
>>>> Fair enough. How about sendSignalingMessage() and 
>>>> receivedSignalingMessage()? Perhaps too long :-)
>>> 
>>> Well sendSignalingMessage's name doesn't matter, since it's a 
>>> callback.
>>> 
>>> receivedSignalingMessage() is what I originally wanted to call 
>>> signalingMessage(), but it seemed overly long, yeah. I don't feel 
>>> strongly on this so if people would rather have it called 
>>> receivedSignalingMessage() that's fine by me.
>> 
>> It just wasn't clear to me at first read what the purpose of 
>> signalingChannel was and I had to read it a couple more times to 
>> understand. That's the only reason I wanted it renamed :)
> 
> I renamed it.
> 
> (Also renamed StreamTrack as you suggested.)
> 
> 
>> The other reason I like explicit open() and listen() is that it makes 
>> clear which side is calling who, and listen() has the potential to give 
>> us presence ("I'm now ready to receive calls").
> 
> You have to do some signaling over the signaling channel to set up ICE 
> anyway, so there's not really a concept of "listening" without an initial 
> SDP offer.
> 
> I think it's pretty clear who's calling whom in the current API, though, 
> since the caller has to create a PeerConnection and then gets a string to 
> send via the callback, whereas the listener first receives a string out of 
> the blue and has to create a PeerConnection to pass it the string.
> 
> Caller:
> 
>   var p = new PeerConnection('', callback);
>   function callback(s) {
>     // send s
>   }
> 
> Receiver:
> 
>   // received s somehow
>   var p = new PeerConnection('', callback);
>   p.processSignalingMessage(s);
> 
> 
>> I completely agree, I also consider it a failure if we don't make our 
>> API simple enough for authors. I'm not suggesting that we make it hard 
>> in any way at all; and all our proposed configuration options are not 
>> required to be specified by authors. In the very simplest case (see our 
>> 'A' calls 'B' example) there's hardly anything specified by the author, 
>> and the UA chooses the best options. But they're there for more 
>> sophisticated webapps, if needed.
> 
> I think maybe where we disagree is that I see optional features as 
> complexity that affects even the programmers who don't need them.
> 
> When you go to learn a feature, you immediately see all the complexity. If 
> there's a lot of it, you get scared away from it.
> 
> Take the earlier example of addEventListener(). The third argument of this 
> method is an advanced feature that should almost always be set to false. 
> Yet look at how the method is explained in tutorials:
> 
>   https://developer.mozilla.org/en/DOM/element.addEventListener
>   http://www.javascriptkit.com/domref/windowmethods.shtml
> 
> As a new author, you immediately see all three arguments.
> 
> The same applies to other features. Tutorials tend to explain everything. 
> So the more features we have, the more authors will either be scared by 
> the feature, or the more they'll guess at what they should do (and guesses 
> are rarely correct).
> 
> Eventually, I'm all for adding lots of features. But we should only add 
> features that authors have clearly indicated they need, IMHO. This means 
> starting small, and iterating in concert with implementation and author 
> usage.
> 
> 
>>> The whole point of efforts such as the WHATWG specifications, the new 
>>> DOM Core specification, the CSS 2.1 specification, etc, is that we 
>>> can, should, must write specifications that are precise enough that 
>>> they can and will be interoperably implemented.
>> 
>> APIs only get us halfway there, a precise specification meant to be 
>> entirely interoperable should then also include information on which 
>> codecs are used etc.
> 
> Yes. We should absolutely specify that too. Unfortunately there's no 
> solution everyone is willing to implement as far as codecs go, but that's 
> an unfortunate exception. It's not an example to follow. :-)
> 
> 
>> The <video> specification, for instance, has a very elegant and simple 
>> API; however, we don't see mass adoption because of disagreements on 
>> codecs. Web developers who do want to use <video> end up using something 
>> like videojs.com for multiple fallbacks based on the UAs of their users.
> 
> Yes. It's a terribly bad situation. Unfortunately, there is currently no 
> solution.
> 
> 
>> I'm fully on board for starting simple and iterating quickly :-) If we 
>> think we can add new capabilities in the future without breaking 
>> compatibility with the APIs we started with, that would be great.
> 
> We can definitely add features later. :-)
> 
> 
>> A good way to frame the discussion would be to take concrete use-cases 
>> and see if our API supports it. If not, what can be the simplest way to 
>> enable that use-case? Or perhaps we decide to not work on that use-case 
>> for the current iteration and come back to it later, works for me!
> 
> Agreed.
> 
> The main use cases I've been considering are:
> 
> - 1:1 Web video conferencing, like what Facebook recently launched, with 
>   the server providing discovery and presence.
> 
> - 1:1 audio telecommunication from the Web to a SIP device, with the help 
>   of a gateway server for presence and call setup.
> 
> - P2P gaming (data only).
> 
> Obviously 1:many and many:many video conferencing (such as what Google+ 
> recently launched in trial) are interesting too; I mainly didn't look at 
> those since I couldn't see anything in ICE that supported them natively, 
> and at a higher level they can be approximated as either every node 
> connecting to every other node, or every node connecting to a server that 
> repeats the video back out.
> 
> -- 
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
> 


Cullen Jennings
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
