Re: On a particular design meme

On Mon, Aug 27, 2012 at 11:53 AM, Matthew Kaufman
<matthew.kaufman@skype.net> wrote:
> Harald (as chair): I would like to kindly request that you make the same request (as you made below of Martin Thomson) of Eric Rescorla regarding his "Initial notes on MS proposal" posted to the list on 7 August 2012 as a link to an HTTP-accessible document.
>
> Responding to his comments in this forum is inconvenient at best with the original material not part of the record.
>
> Matthew Kaufman

No problem... I've copied and pasted the HTML below. I've tried to recover
some of the formatting, but there's some inevitable lossage in the HTML-ASCII
conversion.

-Ekr

P.S. I'm of course happy to copy stuff to the list, but fwiw blog entries on
EG are generally pretty stable once published and I try to explicitly
call out any non-trivial changes that get made after publication.

P.P.S. There were some comments on the post. Would you like me to
fwd those as well?


> ps. I have copied ekr directly as I suspect that not providing the comments as an attachment was simply an oversight, and chair action won't actually be required.

Somewhere between oversight and laziness :)



EXECUTIVE SUMMARY
Yesterday, Microsoft published their CU-RTC-Web WebRTC API proposal as
an alternative to the existing W3C WebRTC API being implemented in
Chrome and Firefox. Microsoft's proposal is a "low-level API" proposal
which basically exposes a bunch of media- and transport-level
primitives to the JavaScript Web application, which is expected to
stitch them together into a complete calling system. By contrast to
the current "mid-level" API, the Microsoft API moves a lot of
complexity from the browser to the JavaScript but the authors argue
that this makes it more powerful and flexible. I don't find these
arguments that convincing, however: a lot of them seem fairly abstract
and rhetorical and when we get down to concrete use cases, the
examples Microsoft gives seem like things that could easily be done
within the existing framework. So, while it's clear that the Microsoft
proposal is a lot more work for the application developer, it's a lot
less clear that it's sufficiently more powerful to justify that
additional complexity.

Microsoft's arguments for the superiority of this API fall into three
major categories:

* JSEP doesn't match with "key Web tenets"; i.e., it doesn't match the
  Web/HTML5 style.
* It allows the development of applications that would otherwise be
  difficult to develop with the existing W3C API.
* It will be easier to make it interoperate with existing VoIP endpoints.

Like any all-new design, this API has the significant advantage (which
the authors don't mention) of architectural cleanliness. The existing
API is a compromise between a number of different architectural
notions and like any hybrid proposal has points of ugliness where
those proposals come into contact with each other (especially in the
area of SDP.) However, when we actually look at functionality rather
than elegance, the advantages of an all-new design---not only one
which is largely not based on preexisting technologies but one which
involves discarding most of the existing work on WebRTC itself---start
to look fairly thin.

Looking at the three claims listed above: the first seems more
rhetorical than factual. It's certainly true that in the early years
of the Web designers strove to keep state out of the Web browser, but
that hasn't been the case with rich Web applications for quite some
time. To the contrary, many modern HTML5 technologies (localstore,
WebSockets, HSTS, WebGL) are about pushing state onto the browser from
the server.

The interoperability argument is similarly weakly supported. Given
that JSEP is based on existing VoIP technologies, it seems likely that
it is easier to make it interoperate with existing endpoints since
it's not first necessary to implement those technologies (principally
SDP) in JavaScript before you can even try to interoperate. The idea
here seems to be that it will be easier to accommodate existing
noncompliant endpoints if you can adapt your Web application on the
fly, but given the significant entry barrier to interoperating at all,
this seems like an argument that needs rather more support than MS has
currently offered.

Finally, with regard to the question of the flexibility/JavaScript
complexity tradeoff, it's somewhat distressing that the specific
applications that Microsoft cites (baby monitoring, security cameras,
etc.) are so pedestrian and easily handled by JSEP. This isn't of
course to say that there aren't applications which we can't currently
envision which JSEP would handle badly, but it rather undercuts this
argument if the only examples you cite in support of a new design are
those which are easily handled by the old one.

None of this is to say that CU-RTC-Web wouldn't be better in some
respects than JSEP. Obviously, any design has tradeoffs and as I said
above, it's always appealing to throw all that annoying legacy stuff
away and start fresh. However, that also comes with a lot of costs and
before we consider that we really need to have a far better picture of
what benefits other than elegance starting over would bring to the
table.

BACKGROUND
More or less everyone agrees about the basic objectives of the WebRTC
effort: to bring real-time communications (i.e., audio, video, and
direct data) to browsers. Specifically, the idea is that Web
applications should be able to use these capabilities directly. This
sort of functionality was of course already available either via
generic plugins such as Flash or via specific plugins such as Google
Talk, but the idea here was to have a standardized API that was built
into browsers.

In spite of this agreement about objectives, from the beginning there
was debate about the style of API that was appropriate, and in
particular how much of the complexity should be in the browser and how
much in the JavaScript. The initial proposals broke down into two main
flavors:

* High-level APIs -- essentially a softphone in the browser. The Web
  application would request the creation of a call (perhaps with some
  settings as to what kinds of media it wanted) and then each browser
  would emit standardized signaling messages which the Web application
  would arrange to transit to the other browser. The original WHATWG
  HTML5/PeerConnection spec was of this type.

* Low-level APIs -- an API which exposed a bunch of primitive media and
  transport capabilities to the JavaScript. A browser that implemented
  this sort of API couldn't really do much by itself. Instead, you
  would need to write something like a softphone in JavaScript,
  including implementing the media negotiation, all the signaling
  state machinery, etc. Matthew Kaufman from Microsoft was one of the
  primary proponents of this design.

After a lot of debate, the WG ultimately rejected both of these and
settled on a protocol called JavaScript Session Establishment Protocol
(JSEP), which is probably best described as a mid-level API. That
design, embodied in the current specifications [
http://tools.ietf.org/html/draft-ietf-rtcweb-jsep-01
http://dev.w3.org/2011/webrtc/editor/webrtc.html], keeps the transport
establishment and media negotiation in the browser but moves a fair
amount of the session establishment state machine into the
JavaScript. While it doesn't standardize signaling, it also has a
natural mapping to a simple signaling protocol as well as to SIP and
Jingle, the two dominant standardized calling protocols. The idea is
supposed to be that it's simple to write a basic application (indeed,
a large number of such simple demonstration apps have been written)
but that it's also possible to exercise advanced features by
manipulating the various data structures emitted by the browser. This
is obviously something of a compromise between the first two classes
of proposals.

The decision to follow this trajectory was made somewhere around six
months ago and at this point Google has a fairly mature JSEP
implementation available in Chrome Canary while Mozilla has a less
mature implementation which you could compile yourself but hasn't been
released in any public build.

Yesterday, Microsoft made a new proposal, called CU-RTC-Web. See the
blog post and the specification.

Below is an initial, high-level analysis of this proposal.

Disclaimer: I have been heavily involved with both the IETF and W3C
working groups in this area and have contributed significant chunks of
code to both the Chrome and Firefox implementations. I am also
currently consulting for Mozilla on their implementation. However, the
comments here are my own and don't necessarily represent those of any
other organization.

WHAT IS MICROSOFT PROPOSING?
What Microsoft is proposing is effectively a straight low level API.

There are a lot of different API points, and I don't plan to discuss
the API in much detail, but it's helpful to talk about the API some to
get a flavor of what's required to use it.

RealTimeMediaStream -- each RealTimeMediaStream represents a single
    flow of media (i.e., audio or video).

RealTimeMediaDescription -- a set of parameters for the
    RealTimeMediaStream.

RealTimeTransport -- a transport channel which a RealTimeMediaStream
    can run over.

RealTimePort -- a transport endpoint which can be paired with a
    RealTimePort on the other side to form a RealTimeTransport.

In order to set up an audio, video, or audio-video session, then, the
JS has to do something like the following:

1. Acquire local media streams on each browser via the getUserMedia()
API, thus getting some set of MediaStreamTracks.
2. Create RealTimePorts on each browser for all the local network
addresses as well as for whatever media relays are available/
required.
3. Communicate the coordinates for the RealTimePorts from each browser
to the other.
4. On each browser, run ICE connectivity checks for all combinations
of remote and local RealTimePorts.
5. Select a subset of the working remote/local RealTimePort pairs and
establish RealTimeTransports based on those pairs. (This might be one
or might be more than one depending on the number of media flows,
level of multiplexing, and the level of redundancy required).
6. Determine a common set of media capabilities and codecs between
each browser, select a specific set of media parameters, and create
matching RealTimeMediaDescriptions on each browser based on those
parameters.
7. Create RealTimeMediaStreams by combining RealTimeTransports,
RealTimeMediaDescriptions, and MediaStreamTracks.
8. Attach the remote RealTimeMediaStreams to some local display method
(such as an audio or video tag).
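
To give a flavor of what step 4 alone implies, here is a minimal
TypeScript sketch of building the list of local/remote combinations to
check. The PortCoordinates shape, its field names, and the addresses
are assumptions made up for illustration; they are not taken from the
CU-RTC-Web specification, and a real ICE agent would also prioritize,
prune, and pace the checks.

```typescript
// Hypothetical description of one transport endpoint. A real
// RealTimePort carries more than this; the fields are illustrative only.
interface PortCoordinates {
  address: string;
  port: number;
  kind: "host" | "relay";
}

interface CandidatePair {
  local: PortCoordinates;
  remote: PortCoordinates;
}

// Step 4: pair every local port with every remote port.
function formCandidatePairs(
  locals: PortCoordinates[],
  remotes: PortCoordinates[]
): CandidatePair[] {
  const pairs: CandidatePair[] = [];
  locals.forEach((local) => {
    remotes.forEach((remote) => {
      pairs.push({ local: local, remote: remote });
    });
  });
  return pairs;
}

// Made-up coordinates (documentation address ranges).
const localPorts: PortCoordinates[] = [
  { address: "192.0.2.10", port: 50000, kind: "host" },
  { address: "198.51.100.7", port: 3478, kind: "relay" },
];
const remotePorts: PortCoordinates[] = [
  { address: "203.0.113.5", port: 50002, kind: "host" },
  { address: "203.0.113.9", port: 50004, kind: "host" },
  { address: "198.51.100.8", port: 3478, kind: "relay" },
];

const candidatePairs = formCandidatePairs(localPorts, remotePorts);
console.log(candidatePairs.length + " connectivity checks to run");
```

Even in this toy case two local ports and three remote ports yield six
checks, and the application still has to run them, pick winners
(step 5), and deal with failures.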

For comparison, in JSEP you would do something like:

1. Acquire local media streams on each browser via the getUserMedia()
API, thus getting some set of MediaStreamTracks.
2. Create a PeerConnection() and call AddStream() for each of the local streams.
3. Create an offer on one browser, send it to the other side, create an
answer on the other side and send it back to the offering browser. In
the simplest case, this just involves making some API calls with no
arguments and passing the results to the other side.
4. The PeerConnection fires callbacks announcing remote media streams
which you attach to some local display method.


As should be clear, the CU-RTC-Web proposal requires significantly
more complex JavaScript, and in particular requires that JavaScript to
be a lot smarter about what it's doing. In a JSEP-style API, the Web
programmer can be pretty ignorant about things like codecs and
transport protocols, unless he wants to do something fancy, but with
CU-RTC-Web, he needs to understand a lot of stuff to make things work
at all. In some ways, JSEP is a much better fit for the traditional
Web approach of having simple default behaviors which fit a lot of
cases but which can then be customized, albeit in ways that are
sometimes a bit clunky.

Note that it's not like this complexity doesn't exist in JSEP, it's
just been pushed into the browser so that the user doesn't have to see
it. As discussed below, Microsoft's argument is that this simplicity
in the JavaScript comes at a price in terms of flexibility and
robustness, and that libraries will be developed (think jQuery) to
give the average Web programmer a simple experience, so that they
won't have to accept a lot of complexity themselves. However, since
those libraries don't exist, it seems kind of unclear how well that's
going to work.

ARGUMENTS FOR MICROSOFT'S PROPOSAL
Microsoft's proposal and the associated blog post make a number of
major arguments for why it is a superior choice (the proposal just
came out today so there haven't really been any public arguments for
why it's worse). Combining the blog posts, you would get something
like this:

* That the current specification violates "fit with key web tenets",
  specifically that it's not stateless and that you can only make
  changes when in specific states. Also, that it depends on the SDP
  offer/answer model.

* That it doesn't allow a "customizable response to changing network
  quality".

* That it doesn't support "real-world interoperability" with existing
  equipment.

* That it's too tied to specific media formats and codecs.

* That JSEP requires a Web application to do some frankly inconvenient
  stuff if it wants to do something that the API doesn't have explicit
  support for.

* That it's inflexible and/or brittle with respect to new applications
  and in particular that it's difficult to implement some specific
  "innovative" applications with JSEP.

Below we examine each of these arguments in turn.


FITTING WITH "WEB TENETS"
MS writes:

   Honoring key Web tenets-The Web favors stateless interactions which
   do not saddle either party of a data exchange with the
   responsibility to remember what the other did or expects. Doing
   otherwise is a recipe for extreme brittleness in implementations;
   it also raises considerably the development cost which reduces the
   reach of the standard itself.

This sounds rhetorically good, but I'm not sure how accurate it
is. First, the idea that the Web is "stateless" feels fairly
anachronistic in an era where more and more state is migrating from
the server. To pick two examples, WebSockets involves forming a fairly
long-term stateful two-way channel between the browser and the server,
and localstore/localdb allow the server to persist data
semi-permanently on the browser. Indeed, CU-RTC-Web requires forming a
nontrivial amount of state on the browser in the form of the
RealTimePorts, which represent actual resource reservations that
cannot be reliably reconstructed if (for instance) the page reloads. I
think the idea here is supposed to be that this is "soft state", in
that it can be kept on the server and just reimposed on the browser at
refresh time, but as the RealTimePorts example shows, it's not clear
that this is the case. Similar comments apply to the state of the
audio and video devices which are inherently controlled by the
browser.

Moreover, it's never been true that neither party in the data exchange
was "saddled" with remembering what the other did; rather, it used to
be the case that most state sat on the server, and indeed, that's
where the CU-RTC-Web proposal keeps it. This is the first time we have
really built a Web-based peer-to-peer app. Pretty much all previous
applications have been client-server applications, so it's hard to
know what idioms are appropriate in a peer-to-peer case.

I'm a little puzzled by the argument about "development cost"; there
are two kinds of development cost here: that to browser implementors
and that to Web application programmers. The MS proposal puts more of
that cost on Web programmers whereas JSEP puts more of the cost on
browser implementors. One would ordinarily think that as long as the
standard wasn't too difficult for browser implementors to develop at
all, then pushing complexity away from Web programmers would tend to
increase the reach of the standard. One could of course argue that
this standard is too complicated for browser implementors to implement
at all, but the existing state of Google and Mozilla's implementations
would seem to belie that claim.

Finally, given that the original WHATWG draft had even more state in
the browser (as noted above, it was basically a high-level API), it's
a little odd to hear that Ian Hickson is out of touch with the "key
Web tenets".

CUSTOMIZABLE RESPONSE TO CHANGING NETWORK QUALITY
The CU-RTC-Web proposal writes:

   Real time media applications have to run on networks with a wide
   range of capabilities varying in terms of bandwidth, latency, and
   noise. Likewise these characteristics can change while an
   application is running. Developers should be able to control how
   the user experience adapts to fluctuations in communication
   quality. For example, when communication quality degrades, the
   developer may prefer to favor the video channel, favor the audio
   channel, or suspend the app until acceptable quality is
   restored. An effective protocol and API will have to arm developers
   with the tools to tailor such answers to the exact needs of the
   moment, while minimizing the complexity of the resulting API
   surface.

It's certainly true that it's desirable to be able to respond to
changing network conditions, but it's a lot less clear that the
CU-RTC-Web API actually offers a useful response to such changes. In
general, the browser is going to know a lot more about what the
bandwidth/quality tradeoff of a given codec is going to be than most
JavaScript applications will, and so it seems at least plausible that
you're going to do better with a small number of policies (audio is
more important than video, video is more important than audio, etc.)
than you would by having the JS try to make fine-grained decisions
about what it wants to do. It's worth noting that the actual
"customizable" policies that are proposed here seem pretty simple. The
idea seems to be not that you would impose policy on the browser but
rather that since you need to implement all the negotiation logic
anyway, you get to implement whatever policy you want.

Moreover, there's a real concern that this sort of adaptation will
have to happen in two places: as MS points out, this kind of network
variability is really common and so applications have to handle
it. Unless you want to force every JS calling application in the
universe to include adaptation logic, the browser will need some
(potentially configurable and/or disableable) logic. It's worth asking
whether whatever logic you would write in JS is really going to be
enough better to justify this design.
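
As a rough illustration of what "a small number of policies" could
mean, here is a TypeScript sketch of a two-policy bandwidth split.
Everything in it is hypothetical: the policy names, the function, and
especially the bitrate thresholds are invented for the example and
appear in neither JSEP nor CU-RTC-Web. The thresholds are exactly the
sort of codec knowledge the browser has and most applications don't.

```typescript
// Hypothetical policy knob; not part of either proposal.
type AdaptationPolicy = "prefer-audio" | "prefer-video";

interface BitrateSplit {
  audioKbps: number;
  videoKbps: number;
}

// Illustrative numbers only. Real codecs have their own quality curves.
const AUDIO_FLOOR_KBPS = 16;
const AUDIO_CEILING_KBPS = 64;
const VIDEO_FLOOR_KBPS = 100;

function splitBandwidth(
  availableKbps: number,
  policy: AdaptationPolicy
): BitrateSplit {
  if (policy === "prefer-audio") {
    // Give audio everything it can use, then see what is left for video.
    const fullAudio = Math.min(AUDIO_CEILING_KBPS, availableKbps);
    const leftover = availableKbps - fullAudio;
    // Drop video entirely rather than send an unusable trickle.
    return {
      audioKbps: fullAudio,
      videoKbps: leftover >= VIDEO_FLOOR_KBPS ? leftover : 0,
    };
  }
  // prefer-video: hold audio at its floor and give video the remainder.
  const minimalAudio = Math.min(AUDIO_FLOOR_KBPS, availableKbps);
  return { audioKbps: minimalAudio, videoKbps: availableKbps - minimalAudio };
}

console.log(splitBandwidth(500, "prefer-audio"));
console.log(splitBandwidth(500, "prefer-video"));
```

The point of the sketch is how little the application has to say when
the policy lives in the browser: one enumerated value, rather than a
hand-rolled rate controller shipped with every calling page.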


REAL-WORLD INTEROPERABILITY
In their blog post today, MS writes about JSEP:

   it shows no signs of offering real world interoperability with
   existing VoIP phones, and mobile phones, from behind firewalls and
   across routers and instead focuses on video communication between
   web browsers under ideal conditions. It does not allow an
   application to control how media is transmitted on the network.

I wish this argument had been elaborated more, since it seems like
CU-RTC-Web is less focused on interoperability, not more. In
particular, since JSEP is based on existing technologies such as SDP
and ICE, it's relatively easy to build Web applications which gateway
JSEP to SIP or Jingle signaling (indeed, relatively simple prototypes
of these already exist). By contrast, gatewaying CU-RTC-Web signaling
to either of these protocols would require developing an entire SDP
stack, which is precisely the piece that the MS guys are implicitly
arguing is expensive.

Based on Matthew Kaufman's mailing list postings, his concern seems to
be that there are existing endpoints which don't implement some of the
specifications required by WebRTC (principally ICE, which is used to
set up the network transport channels) correctly, and that it will be
easier to interoperate with them if your ICE implementation is written
in JavaScript and downloaded by the application rather than in C++ and
baked into the browser. This isn't a crazy theory, but I think there
are serious open questions about whether it is correct. The basic
problem is that it's actually quite hard to write a good ICE stack
(though easy to write a bad one). The browser vendors have the
resources to do a good job here, but it's less clear that random JS
toolkits that people download will actually do that good a job
(especially if they are simultaneously trying to compensate for broken
legacy equipment). The result of having everyone write their own ICE
stack might be good but it might also lead to a landscape where
cross-Web application interop is basically impossible (or where there
are islands of noninteroperable de facto standards based on popular
toolkits or even popular toolkit versions).

A lot of people's instincts here seem to be based on an environment
where updating the software on people's machines was hard but updating
one's Web site was easy. But about half of the population of
browsers (Chrome and Firefox) do rapid auto-updates, so they actually
are generally fairly modern. By contrast, Web applications often use
downrev versions of their JS libraries (I wish I had survey data here
but it's easy to see just by opening up a JS debugger on your favorite
sites). It's not at all clear that the "JS is easy to upgrade/native is
hard" dynamic holds up any more.


TOO TIED TO SPECIFIC MEDIA FORMATS AND CODECS
The proposal says:

   A successful standard cannot be tied to individual codecs, data
   formats or scenarios. They may soon be supplanted by newer
   versions, which would make such a tightly coupled standard obsolete
   just as quickly. The right approach is instead to to support
   multiple media formats and to bring the bulk of the logic to the
   application layer, enabling developers to innovate.

I can't make much sense of this at all. JSEP, like the standards that
it is based on, is agnostic about the media formats and codecs that
are used. There's certainly nothing in JSEP that requires you to use
VP8 for your video codec, Opus for your audio codec, or anything
else. Rather, two conformant JSEP implementations will converge on a
common subset of interoperable formats. This should happen
automatically without Web application intervention.
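
The convergence described here is essentially a list intersection. A
minimal TypeScript sketch, assuming each side is reduced to a list of
codec names in preference order (the function name and the two
capability lists are made up for illustration, and real negotiation
also compares clock rates and format parameters):

```typescript
// Keep the codecs both sides support, in the offerer's preference order.
// Simplification: names are compared exactly, although SDP encoding
// names are case-insensitive.
function negotiateCodecs(
  offered: string[],
  supportedByAnswerer: string[]
): string[] {
  return offered.filter(
    (codec) => supportedByAnswerer.indexOf(codec) !== -1
  );
}

// Hypothetical capability lists for two browsers.
const browserA: string[] = ["opus", "G722", "PCMU"];
const browserB: string[] = ["PCMU", "opus", "iLBC"];

const agreedCodecs = negotiateCodecs(browserA, browserB);
console.log(agreedCodecs.join(", "));
```

Nothing in this step involves the Web application. When the two lists
are disjoint the result is empty, which is the failure that a
mandatory-to-implement codec is meant to rule out.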

Arguably, in fact, CU-RTC-Web is *more* tied to a given codec because
the codec negotiation logic is implemented either on the server or in
the JavaScript. If a browser adds support for a new codec, the Web
application needs to detect that and somehow know how to prioritize it
against existing known codecs. By contrast, when the browser
manufacturer adds a new codec, he knows how it performs compared to
existing codecs and can adjust his negotiation algorithms
accordingly. Moreover, as discussed below, JSEP provides (somewhat
clumsy) mechanisms for the user to override the browser's default
choices. These mechanisms could probably be made better within the
JSEP architecture.

Based on Matthew Kaufman's interview with Janko Roettgers
[http://gigaom.com/2012/08/06/microsoft-webrtc-w3c/], it seems like
this may actually be about the proposal to have a mandatory-to-implement
video codec (the leading candidates seem to be H.264 or
VP8). Obviously, there have been a lot of arguments about whether such
a mandatory codec is required (the standard argument in favor of it is
that then you know that any two implementations have at least one
codec in common), but this isn't really a matter of "tightly coupling"
the codec to the standard. To the contrary, if we mandated VP8 today
and then next week decided to mandate H.264 it would be a one-line
change in the specification. In any case, this doesn't seem like a
structural argument about JSEP versus CU-RTC-Web. Indeed, if IETF and
W3C decided to ditch JSEP and go with CU-RTC-Web, it seems likely that
this wouldn't affect the question of mandatory codecs at all.


THE INCONVENIENCE OF SDP EDITING
Probably the strongest point that the MS authors make is that if the
API doesn't explicitly support doing something, the situation is kind
of gross:

   In particular, the negotiation model of the API relies on the SDP
   offer/answer model, which forces applications to parse and generate
   SDP in order to effect a change in browser behavior. An application
   is forced to only perform certain changes when the browser is in
   specific states, which further constrains options and increases
   complexity. Furthermore, the set of permitted transformations to
   SDP are constrained in non-obvious and undiscoverable ways, forcing
   applications to resort to trial-and-error and/or browser-specific
   code. All of this added complexity is an unnecessary burden on
   applications with little or no benefit in return.

What this is about is that in JSEP you call CreateOffer() on a
PeerConnection in order to get an SDP offer. This doesn't actually
change the PeerConnection state to accommodate the new offer; instead,
you call SetLocalDescription() to install the offer. This gives the
Web application the opportunity to apply its own preferences by
editing the offer. For instance, it might delete a line containing a
codec that it didn't want to use. Obviously, this requires a lot of
knowledge of SDP in the application, which is irritating to say the
least, for the reasons in the quote above.
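
For concreteness, here is roughly what "delete a line containing a
codec" turns into. This is a TypeScript sketch of the editing an
application would do between CreateOffer() and SetLocalDescription();
the removeCodec helper and the trimmed sample offer are made up for
illustration, and the sketch assumes payload type numbers are unique
across the whole SDP, which a multi-section offer does not guarantee.

```typescript
// Remove every trace of one codec from an SDP blob: its rtpmap, fmtp,
// and rtcp-fb attribute lines, plus its payload type on the m= line.
// Assumption: a payload type number means the same codec in every
// media section. Removing the last remaining format would leave an
// invalid m= line; that case is not handled here.
function removeCodec(sdp: string, codecName: string): string {
  const lines: string[] = sdp.split("\r\n");
  const wanted: string = codecName.toLowerCase();

  // Pass 1: find payload types mapped to the codec,
  // e.g. "a=rtpmap:0 PCMU/8000" maps payload type 0 to PCMU.
  const payloadTypes: string[] = [];
  lines.forEach((line) => {
    const match = /^a=rtpmap:(\d+) ([^\/]+)\//.exec(line);
    if (match !== null && match[2].toLowerCase() === wanted) {
      payloadTypes.push(match[1]);
    }
  });
  const isDropped = (pt: string): boolean => payloadTypes.indexOf(pt) !== -1;

  // Pass 2: rebuild the SDP without those payload types.
  const kept: string[] = [];
  lines.forEach((line) => {
    const attr = /^a=(?:rtpmap|fmtp|rtcp-fb):(\d+) /.exec(line);
    if (attr !== null && isDropped(attr[1])) {
      return; // attribute line belonging to the removed codec
    }
    if (line.indexOf("m=") === 0) {
      // "m=<media> <port> <proto> <fmt> ...": only fields after the
      // first three are payload types.
      const fields: string[] = line.split(" ");
      const formats = fields.slice(3).filter((pt) => !isDropped(pt));
      kept.push(fields.slice(0, 3).concat(formats).join(" "));
      return;
    }
    kept.push(line);
  });
  return kept.join("\r\n");
}

// A trimmed, hypothetical audio-only offer.
const offerSdp: string = [
  "v=0",
  "o=- 0 0 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 RTP/SAVPF 111 0 8",
  "c=IN IP4 0.0.0.0",
  "a=rtpmap:111 opus/48000/2",
  "a=fmtp:111 minptime=10",
  "a=rtpmap:0 PCMU/8000",
  "a=rtpmap:8 PCMA/8000",
  "",
].join("\r\n");

console.log(removeCodec(offerSdp, "PCMU"));
```

Even this simple case needs two passes and some knowledge of which
fields on the m= line are payload types, which is the irritation the
MS authors are pointing at.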

The major mitigating factor is that the W3C/IETF WG members intend to
allow most common manipulations to be made through explicit settings
parameters, so that only really advanced applications need to know
anything about SDP at all. Obviously opinions vary about how good a
job they have done, and of course it's possible to write libraries
that would make this sort of manipulation easier. It's worth noting
that there has been some discussion of extending the W3C APIs to have
an explicit API for manipulating SDP objects rather than just editing
the string versions (perhaps by borrowing some of the primitives in
CU-RTC-Web). Such a change would make some things easier while not
really representing a fundamental change to the JSEP model. However,
it's not clear if there are enough SDP-editing tasks to make this
project worthwhile.

With that said, in order to have CU-RTC-Web interoperate with
existing SIP endpoints at all you would need to know far more about
SDP than would be required to do most anticipated transformations in a
JSEP environment, so it's not like CU-RTC-Web frees you from SDP if
you care about interoperability with existing equipment.

SUPPORT FOR NEW/INNOVATIVE APPLICATIONS
Finally, the MSFT authors argue that CU-RTC-Web is more flexible
and/or less brittle than JSEP:

   On the other hand, implementing innovative, real-world applications
   like security consoles, audio streaming services or baby monitoring
   through this API would be unwieldy, assuming it could be made to
   work at all. A Web RTC standard must equip developers with the
   ability to implement all scenarios, even those we haven't thought
   of.

Obviously the last sentence is true, but the first sentence provides
scant support for the claim that CU-RTC-Web fulfills this requirement
better than JSEP. The particular applications cited here, namely audio
streaming, security consoles, and baby monitoring, seem not only
doable with JSEP, but straightforward. In particular, security
consoles and baby monitoring just look like one-way audio and/or video
calls from some camera somewhere. This seems like a trivial subset of
the most basic JSEP functionality. Audio streaming is, if anything,
even easier. Audio streaming from servers already exists without any
WebRTC functionality at all, in the form of the audio tag, and audio
streaming from client to server can be achieved with the combination
of getUserMedia and WebSockets. Even if you decided that you wanted to
use UDP rather than WebSockets, audio streaming is just a one-way
audio call, so it's hard to see that this is a problem.

In e-mail to the W3C WebRTC mailing list, Matthew Kaufman mentions the
use case of handling page reload:

   An example would be recovery from call setup in the face of a
   browser page reload... a case where the state of the browser must
   be reinitialized, leading to edge cases where it becomes impossible
   with JSEP for a developer to write Javascript that behaves properly
   in all cases (because without an offer one cannot generate an
   answer, and once an offer has been generated one must not generate
   another offer until the first offer has been answered, but in
   either case there is no longer sufficient information as to how to
   proceed).

This use case, often called "rehydration", has been studied a fair bit
and it's not entirely clear that there is a convenient solution with
JSEP. However, the problem isn't the offer/answer state, which is
actually easily handled, but rather the ICE and cryptographic state,
which are just as troublesome with CU-RTC-Web as they are with JSEP
[for a variety of technical reasons, you can't just reuse the previous
settings here.] So, while rehydration is an issue, it's not clear that
CU-RTC-Web makes matters any easier.
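
The offer/answer rules in the quoted e-mail amount to a very small
state machine. A minimal TypeScript sketch, in which the state and
event names are chosen for illustration rather than quoted from the
drafts, and provisional answers and rollback are left out:

```typescript
type SignalingState = "stable" | "have-local-offer" | "have-remote-offer";
type SignalingEvent =
  | "create-offer"
  | "receive-offer"
  | "create-answer"
  | "receive-answer";

// Apply one offer/answer event, rejecting anything the rules forbid:
// no answer without an offer, no second offer while one is outstanding.
function nextState(
  state: SignalingState,
  event: SignalingEvent
): SignalingState {
  if (state === "stable" && event === "create-offer") {
    return "have-local-offer";
  }
  if (state === "stable" && event === "receive-offer") {
    return "have-remote-offer";
  }
  if (state === "have-local-offer" && event === "receive-answer") {
    return "stable";
  }
  if (state === "have-remote-offer" && event === "create-answer") {
    return "stable";
  }
  throw new Error("cannot " + event + " while in state " + state);
}

// One complete exchange from the offerer's point of view.
let callState: SignalingState = "stable";
callState = nextState(callState, "create-offer");
callState = nextState(callState, "receive-answer");
console.log(callState);
```

This much state is a single enumerated value, so it is cheap to keep
on the server and hand back after a reload. The ICE and cryptographic
state has no equally compact representation, which is why that is
where the real difficulty lies.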

This argument, which should be the strongest of MS's arguments, feels
rather like the weakest. Given how much effort has already gone into
JSEP, both in terms of standards and implementation, if we're going to
replace it with something else that something else should do something
that JSEP can't, not just have a more attractive API. If MS can't come
up with any use cases that JSEP can't accomplish, and if in fact the
use cases they list are arguably more convenient with JSEP than with
CU-RTC-Web, then that seems like a fairly strong argument that we
should stick with JSEP, not one that we should replace it.

What I'd like to see Microsoft do here is describe some applications
that are really a lot easier with CU-RTC-Web than they are with
JSEP. Depending on the details, this might be a more or less
convincing argument, but without some examples, it's pretty hard to
see what considerations other than aesthetic would drive us towards
CU-RTC-Web.


Acknowledgement
Thanks to Cullen Jennings, Randell Jesup, Maire Reavy, and Tim
Terriberry for early comments on this draft.

Received on Monday, 27 August 2012 19:17:40 UTC