Re: Mozilla/Cisco API Proposal from Anant Narayanan on 2011-07-12 (public-webrtc@w3.org from July 2011)

From: Anant Narayanan <anant@mozilla.com>
Date: Tue, 12 Jul 2011 09:17:40 -0700
To: Ian Hickson <ian@hixie.ch>
CC: public-webrtc@w3.org
Message-ID: <4E1C73A4.9050506@mozilla.com>
On 7/11/11 8:22 PM, Ian Hickson wrote:
> On Mon, 11 Jul 2011, Anant Narayanan wrote:
>>
>> navigator.getMediaStream({}, onsuccess, onerror);
>
> So how do you say you only want audio?

navigator.getMediaStream({"audio":{}}, onsuccess, onerror);

No properties is different from 1 or more, but I can see your point it 
does seem a bit clunky.

> No, I mean, the string syntax that getUserMedia() is defined as taking is
> already defined in an extensible, forward-compatible way. You first split
> on commas to get the individual tracks that the script wants to get, and
> then you split on spaces to get more data. So for example, if the argument
> is the string "audio, video user" this indicates that the script wants
> both audio and a user-facing video stream ("face-side camera").

On a separate note, we removed the 'user'/'environment' options because 
we wanted to come up with better nomenclature for those terms. I 
originally suggested 'front'/'back', but they seem to restrictive too.

> We can extend this further quite easily, e.g.:
>
>     getUserMedia('video 30fps 640x480');
>
> ...to get a video-only stream at VGA-resolution and 30 FPS.

What about the order of arguments separated by spaces?

>> We should at-least allow those parameters that can be specified across
>> multiple implementations. For example, quality is a number between 0.0
>> and 1.0, and that might mean different things for different codecs, and
>> that's okay.
>
> It's not clear to me that there should even be a codec involved at this
> point. What you get from getUserMedia() need never actually be encoded, in
> particular if you consider the case I mentioned earlier of just taking an
> audio stream and piping it straight to the speaker -- it's quite possible
> that on some platforms, this can be done entirely in hardware without the
> audio ever being encoded at all.

In what cases would you not encode data that comes from the user's 
hardware? Is there a use-case for sending to the speaker what the user 
just spoke into the microphone?

I can see one use-case for video, doing local face recognition for login 
to a website, so data is painted on a canvas straight from hardware 
without any conversion. I agree that codecs need not be involved at all, 
something like 'quality' is pretty generic enough.

> I think there's a place for codec discussions, including quality control,
> etc, but I'm not sure getUserMediaStream() (or whatever we call it) is the
> place to put it. I think it belongs more in whatever interface we use for
> converting a stream to a binary blob or file.

Ah, that makes sense. I agree, we can move these parameters into the 
MediaStreamRecorder or other equivalents.

> Resolution is an example of why it might not make sense to be giving too
> many options at this level. Consider a case where a stream is plugged into
> a local<video>  at 100x200 and into a PeerConnection that goes to a remote
> host which then has it plugged into a<video>  at 1000x2000. You really
> want the camera to switch to a higher resolution, rather than have it be
> fixed at whatever resolution the author happened to come up with when he
> wrote the code -- especially given the tendency of Web authors to specify
> arbitrary values when prompted for defaults.

We're not prompting them to provide values, and the behavior you specify 
will be the one that authors gets if they don't specify anything on 
either end.

Is your fear that if we allow the API to configure things, webapp 
authors will use them even if they don't need to?

>>> How should this be notified on the other side?
>>
>> I believe it is possible in RTP to ask the other side to stop sending
>> something. If not, we could always just send our own UDP message.
>
> I mean, how should it be exposed in the API?

Perhaps an event on the MediaStream at the other end, this part is not 
fleshed out fully yet.

>>> Should it be possible for the other side to just restart sending the
>>> stream?
...
>> I don't think so. If a peer explicitly set the readyState of a remote
>> stream to BLOCKED it means they don't want data. The other side could of
>> course, send a completely new stream if it wishes to.
...
> It's not clear to me what the use case is here. Can you elaborate on why
> the API should support this natively instead of requiring that authors
> implement this themselves using their signalling channel? (The latter is
> pretty trivial to do, so it's not clear to me that this really simplifies
> anything.)

Just from an ease-of-programming standpoint, if we can support it we 
should. Using the out of band of signalling channel is certainly not as 
trivial as setting one property on the stream or track object. I also 
suspect that it will be a pretty common scenario where the far end wants 
to temporarily block a particular track or stream. Call hold is the main 
example I had in mind.

>> document.getElementById("somevideoelement").stream = myMediaStream;
>>
>> sets the video element to be an output for myMediaStream. The
>> MediaStream does not have an interface to find out all its inputs and
>> outputs. I don't think this part differs much from your proposal, I
>> mentioned it because it came up earlier :)
>
> Ah, ok.
>
> I think we're better off using URL.getObjectURL() for this kind of thing
> rather than having an explicit .stream attribute, since the latter would
> really mess with the<video>  resource selection algorithm.

I'm fine with getObjectURL(); since it means we can just use the 
existing src attribute. Do you intend for the URL that is returned by 
that function to be UA specific or something that is standardized?

>> The author *needn't* care about it (simply don't provide the hints) but
>> can if they want to. Sometimes you're transmitting fast moving images,
>> other times you're transmitting a slideshow (where you want each slide
>> to be of high quality, but very low frame rate). Only the application
>> can know this, and it'd be good for the platform to optimize for it.
>
> It seems like the user agent is actually in a better position to know
> this than the author. Shouldn't this just be automatic? At least in the
> initial implementation. It would suck if we boxed ourselves into an API
> where authors are required to guide implementations through things like
> this, given how rarely authors get this kind of thing right...

I don't understand, how could the UA know what kind of information is 
being transmitted? Are you suggesting dynamic image analysis or some 
sort that results in adaptive codec switching based on changes in the 
input stream? I don't think we're quite there yet in terms of 
technology, but I could be mistaken.

The use-case is something like slideshare which presents videos of a 
talk along with the slide deck, were the website explicitly knows which 
streams are those of slides and which of the presenter, and can tell the 
UA which stream is which.

>> We were proposing that the types be from the list here:
>> http://www.iana.org/assignments/media-types/index.html
>>
>> It certainly includes types for text (subtitles), audio as well as video.
>
> I mean the name "type" would be inconsistent. The other interfaces use
> "kind" for this concept.

I'm fine with renaming it to 'kind'.

>> That being said, if there's already a spec that we should inter-operate
>> with; that's reasonable. Where can I find more info on VideoTrack,
>> AudioTrack and TextTrack? Have these been implemented by any UA's?
>
> They're specified in the same spec as StreamTrack:
>
>     http://whatwg.org/c
>
> TextTrack probably has implementations by now, the other two are newer
> (they were designed with StreamTrack).

Using the existing track definitions sounds good to me, we might have to 
add some other types though (DTMF is the main one).

>> Fair enough. How about sendSignalingMessage() and
>> receivedSignalingMessage()? Perhaps too long :-)
>
> Well sendSignalingMessage's name doesn't matter, since it's a callback.
>
> receivedSignalingMessage() is what I originally wanted to call
> signalingMessage(), but it seemed overly long, yeah. I don't feel strongly
> on this so if people would rather have it called
> receivedSignalingMessage() that's fine by me.

It just wasn't clear to me at first read what the purpose of 
signalingChannel was and I had to read it a couple more times to 
understand. That's the only reason I wanted it renamed :)

>> Ah, adding a remote stream to pass it onto another peer, I had not
>> considered. Mainly the renaming was done to clarify that when you add a
>> local stream with addStream, the streamAdded callback would not be
>> invoked (since that happens only when remote streams are added).
>
> I wouldn't worry too much about that. Authors will figure it out quickly
> enough based on the examples they're copying when learning the API. In
> particular, when it comes to a decision between a long verbose but clear
> name and a shorter mnemonic name that isn't quite as clear, I think we
> should lean towards the latter: while authors will spend a short while
> learning the APIs (being helped by the longer names), they're going to
> spend far longer using them after learning them (being helped by the
> shorter names).

Fair enough.

> Whether you're on a call or not doesn't affect new calls. It's a new ICE
> transaction each time, and ICE transactions don't interfere wich each
> other.
>
> The way you start a call with the PeerConnection API is that you create a
> PeerConnection object, and then don't call signalingMessage(). The API
> will then call the callback with the initial SDP offer, to send via the
> signaling channel to the other peer.
>
> The way you listen for a call with the PeerConnection API is that when you
> receive that initial SDP offer from the initiating peer, you create a
> PeerConnection object and immediately call signalingMessage() on it with
> the SDP offer.
>
> If these get out of sync, the normal ICE mechanism resolves any conflicts.
> (For example, both sides can start the connection; the ICE agent will then
> figure out what's going on. This is all defined by ICE already.)
>
> There's thus no need for an explicit open() or listen().

Ah, yes I can see how this can be made to work. I'll wait for more 
details on if we can have multiple peers connect to a single 'server' 
(over the same PeerConnection on the server-side) and if this is even 
possible in ICE. That was the original reason we added explicit listen 
and open calls (see the hockey game viewer example in our proposal).

The other reason I like explicit open() and listen() is that it makes 
clear which side is calling who, and listen() has the potential to give 
us presence ("I'm now ready to receive calls").

>> In general, I think the common theme for your comments is to make things
>> as easy for the web developer as possible. I agree, in general, but for
>> an API at this level we should go for maximum flexibility that gives as
>> much power to the web application as we possibly can.
>
> I strongly disagree. It's all the more important for an API such as this
> one to make things as simple as possible for authors, at least in the
> first version. We can't know what power authors want until we give them
> something. There is a huge risk, especially with an API for something as
> complicated as this, in overdelivering in the first iteration: we might
> end up severely constrained due to initial decisions.
...
> It doesn't have to be. The hard stuff here should all be done by the
> browser. There's no reason we need to make this hard.
>
> I would say that we should measure the success of this effort by how easy
> it is for Web authors to add video conferencing features to their sites.
> If we make it something only experts can do, we will have failed.

I completely agree, I also consider it a failure if we don't make our 
API simple enough for authors. I'm not suggesting that we make it hard 
in any way at all; and all our proposed configuration options are not 
required to be specified by authors. In the very simplest case (see our 
'A' calls 'B' example) there's hardly anything specified by the author, 
and the UA chooses the best options. But they're there for more 
sophisticated webapps, if needed.

> Programming this _should_ be a cake walk. It's not ok for it not to be.

+1.

>> However, consider that today, practically nobody uses the DOM API
>> directly, everyone is building webapps with jQuery or some other fancy
>> JS toolkit.
>
> That's a failure of the DOM API, one that we are trying to resolve over
> time. Let's not repeat that mistake.
...
>> CSS is hard to understand and write, that's why we have things like
>> http://lesscss.org/.
>
> Lots of people write CSS directly. We should make CSS easier so that
> things like lesscss are not necessary.

I don't see the emergence of toolkits like jQuery as a failure for the 
web at all. It just means the web framework is quite powerful and allows 
for a lot of flexibility, and often times there's a conflict between 
simplicity and power. There's always room to make things simpler at any 
layer of the stack, but not always to add new capabilities.

>> Let's face it, we're not going to get rid of these cross-browser JS
>> libraries because there's always bound to be (minor, at the least)
>> differences in implementation of any spec.
>
> On the contrary. The whole point of efforts such as the WHATWG
> specifications, the new DOM Core specification, the CSS 2.1 specification,
> etc, is that we can, should, must write specifications that are precise
> enough that they can and will be interoperably implemented.

APIs only get us halfway there, a precise specification meant to be 
entirely interoperable should then also include information on which 
codecs are used etc. The <video> specification, for instance, has a very 
elegant and simple API; however, we don't see mass adoption because of 
disagreements on codecs. Web developers who do want to use <video> end 
up using something like videojs.com for multiple fallbacks based on the 
UAs of their users.

I do hope that in this particular case the situation is different. But 
my gut feeling is that we won't be able to get rid of cross-platform JS 
libraries, in the end web developers will end up using them for a 
variety of reasons. And this gives us a little more leeway when defining 
the base standard.

>> But, in a couple years if we discover that we can't write this totally
>> awesome MMORPG-in-the-browser that allows players to talk to each other
>> while playing because of API limitations, well that would be not so
>> awesome :-)
>
> We can always extend the API later. Indeed, I fully expect that we will
> extend this API for years. That's the best way to design APIs: start
> simple, ensure browsers implement it interoperably, and iterate quickly.

I'm fully on board for starting simple and iterating quickly :-) If we 
think we can add new capabilities in the future without breaking 
compatibility with the APIs we started with, that would be great.

A good way to frame the discussion would be to take concrete use-cases 
and see if our API supports it. If not, what can be the simplest way to 
enable that use-case? Or perhaps we decide to not work on that use-case 
for the current iteration and come back to it later, works for me!

Regards,
-Anant
Received on Tuesday, 12 July 2011 16:18:42 UTC