[whatwg] PeerConnection feedback

On Tue, 29 Mar 2011, Robert O'Callahan wrote:
> Ian Hickson wrote:
> > 
> > I agree that (on the long term) we should support stream filters on 
> > streams, but I'm not sure I understand <video>'s role in this. 
> > Wouldn't it be more efficient to have something that takes a Stream on 
> > one side and outputs a Stream on the other, possibly running some 
> > native code or JS in the middle?
> 
> We could.
> 
> I'm trying to figure out how this is going to fit in with audio APIs. 
> Chris Rogers from Google is proposing a graph-based audio API to the W3C 
> Audio incubator group which would overlap considerably with a Stream 
> processing API like you're suggesting (although in his proposal 
> processing nodes, not streams, are first-class).

Indeed. I think it would make sense to have nodes in this graph that could 
take Streams as input, or output the resulting data as Streams.
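
Roughly, I'm imagining something along these lines; every name in this 
sketch is invented for illustration and not part of any current proposal:

   // Hypothetical graph that accepts a Stream, processes it, and hands
   // the result back out as a Stream again.
   var graph  = new AudioProcessingGraph();
   var input  = graph.createStreamSourceNode(incomingStream);  // Stream in
   var filter = graph.createFilterNode('lowpass', 800);        // processing
   var output = graph.createStreamDestinationNode();           // Stream out
   input.connect(filter);
   filter.connect(output);
   peerConnection.addStream(output.stream);  // e.g. send the result onward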


> A fundamental problem here is that HTML media elements have the 
> functionality of both sources and sinks.

Indeed. Unfortunately, at the time that we were designing <video>, the 
later needs of multitrack video and video conferencing were not completely 
clear. If we could go back, I think it would make sense to split the part 
of <video> that does network traffic from the part that does rendering and 
UI control, if only so that they could now be split further for video 
conferencing and multitrack. Sadly, that's not really an option.


> You want to see <video> and <audio> only as sinks which accept streams. 
> But in that case, if we create an audio processing API using Streams, 
> we'll need a way to download stream data for processing that doesn't use 
> <audio> and <video>, which means we'll need to replicate <source> elements, 
> the type attribute, networkState, readyState, possibly the 'loop' 
> attribute... should we introduce a new object or element that provides 
> those APIs? How much can be shared with <video> and <audio>? Should we 
> be trying to share? (In Chris Rogers' proposal, <audio> elements are 
> used as sources, not sinks.)

I think at this point we should probably just make media elements (<video> 
and <audio>) support being used both as sources and as sinks. They'll just 
be a little overweight when used just for one of those purposes.

Basically I'm suggesting viewing media elements like this:

   URL to network resource
   URL to Stream object
   URL to Blob object
   |
   |   ----------------------------
   +-> :SINK                SOURCE: -+
       ------------. T .-----------  |
                   |   |             |
                   |   |            Input for
                   |   |            Audio API
                   |   |
                   \   /
                    \ /
                     V
                  DISPLAY
                    AND
                SOUND BOARD

It's a source that happens to have built-in monitor output. Or a sink that 
happens to have a monitor output port. Depending on how you want to see it.
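
In code, the sink half of that picture might look something like this, 
roughly following the getUserMedia() examples in the spec; the source half 
is exactly the part that still needs an API, so the commented-out lines 
use an invented method name purely for illustration:

   // Media element as sink: mint a URL for the Stream and point the
   // <video> at it, per the "URL to Stream object" arrow above.
   var video = document.querySelector('video');
   navigator.getUserMedia('video', function (stream) {
     video.src = URL.createObjectURL(stream);
     video.play();
   });

   // Media element as source: read its output back out as a Stream.
   // "captureStream" is an invented name; no such method exists today.
   // var outgoing = video.captureStream();
   // peerConnection.addStream(outgoing);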


On Tue, 29 Mar 2011, Harald Alvestrand wrote:
>
> A lot of firewalls (including Google's, I believe) drop the subsequent 
> part of fragmented UDP packets, because it's impossible to apply 
> firewall rules to fragments without keeping track of all fragmented UDP 
> packets that are in the process of being transmitted (and keeping track 
> would open the firewalls to an obvious resource exhaustion attack).
> 
> This has made UDP packets larger than the MTU pretty useless.

So I guess the question is: do we want to limit the input to a fixed value 
that is the lowest MTU in use (576 bytes, per IPv4), or dynamically and 
regularly determine what the lowest possible MTU is?

The former has a major advantage: if an application works in one 
environment, you know it'll work elsewhere, because the maximum packet 
size won't change. This is a serious concern on the Web, where authors tend 
to do limited testing and thus often fail to handle rare edge cases well.

The latter has a major disadvantage: the path MTU might change, meaning we 
might start dropping data if we don't keep trying to determine the Path 
MTU. Also, it's really hard to determine the Path MTU in practice.

For now I've gone with the IPv4 "minimum maximum" of 576 minus overhead, 
leaving 504 bytes for user data per packet. It seems small, but I don't 
know how much data people normally send along these low-latency unreliable 
channels.
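
For example, a page that wants to stay under that limit could check the 
encoded size before sending. This sketch assumes the draft's send() method 
on PeerConnection; the byte counting is just one way to measure the UTF-8 
length of a string:

   var MAX_PAYLOAD = 504;  // 576 minus overhead, as above

   function sendGameState(peerConnection, text) {
     // Count the UTF-8 bytes the string will occupy on the wire.
     var bytes = unescape(encodeURIComponent(text)).length;
     if (bytes > MAX_PAYLOAD) {
       throw new Error('packet too big: ' + bytes + ' bytes');
     }
     peerConnection.send(text);
   }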

However, if people want to instead have the minimum be dynamically 
determined, I'm open to that too. I think the best way to approach that 
would be to have UAs implement it as an experimental extension at first, 
and for us to get implementation experience on how well it works. If 
anyone is interested in doing that, I'm happy to work with them to find a 
way to do it that doesn't interfere with UAs that don't yet implement 
that extension.


On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> On 03/29/11 03:00, Ian Hickson wrote:
> > On Wed, 23 Mar 2011, Harald Alvestrand wrote:
> > > >
> > > > Is there really an advantage to not using SRTP and reusing the RTP 
> > > > format for the data messages?
> >
> > Could you elaborate on how (S)RTP would be used for this? I'm all in 
> favour of deferring as much of this to existing protocols as possible, 
> > but RTP seemed like massive overkill for sending game status packets.
>
> If "data" was defined as an RTP codec ("application/packets?"), SRTP 
> could be applied to the packets.
>
> It would impose a 12-byte header in front of the packet and the 
> recommended authentication tag at the end, but would ensure that we 
> could use exactly the same procedure for key exchange

We already use SDP for key exchange for the data stream.


> multiplexing of multiple data streams on the same channel using SSRC, 

I don't follow. What benefit would that have?


> and procedures for identifying the stream in SDP (if we continue to use 
> SDP) - I believe SDP implicitly assumes that all the streams it 
> describes are RTP streams.

That doesn't seem to be the case, but I could be misinterpreting SDP. 
Currently, the HTML spec includes instructions on how to identify the 
stream in SDP; if those instructions are meaningless due to a 
misunderstanding of SDP then we should fix it (and in that case, it might 
indeed make a lot of sense to use RTP to carry this data).


> I've been told that defining RTP packetization formats for a codec needs 
> to be done carefully, so I don't think this is a full specification, but 
> it seems that the overhead of doing so is on the same order of magnitude 
> as the currently proposed solution, and the security properties then 
> become very similar to the properties for media streams.

There are very big differences in the security considerations for media 
data and the security considerations for the data stream. In particular, 
the media data can't be generated by the author in any meaningful way, 
whereas the data is entirely under author control. I don't think it is 
safe to assume that the security properties that we have for media streams 
necessarily work for data streams.


On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> > > > 
> > > > Recording any of these requires much more specification than just
> > > > "record here".
> >
> > Could you elaborate on what else needs specifying?
>
> One thing I remember from an API design talk I viewed: "An ability to 
> record to a file means that the file format is part of your API."

Indeed.


> For instance, for audio recording, it's likely that you want control 
> over whether the resulting file is in Ogg Vorbis format or in MP3 
> format; for video, it's likely that you may want to specify that it will 
> be stored using the VP8 video codec, the Vorbis audio codec and the 
> Matroska container format. These desires have to be communicated to the 
> underlying audio/video engine, so that the proper transforms can be 
> inserted into the processing stream

Yes, we will absolutely need to add these features in due course. Exactly 
what we should add is something we have to determine from implementation 
experience.


> and I think they have to be communicated across this interface; since 
> the output of these operations is a blob without any inherent type 
> information, the caller has to already know which format the media is 
> in.

Depending on the use case and on implementation trajectories, this isn't a 
given. For example, if all UAs end up implementing one of two 
codec/container combinations and we don't expose any controls, it may be 
that the first few bytes of the output file are in practice enough to 
fully identify the format used.

Note also that Blob does have a MIME type, so even without looking at the 
data itself, or at the UA string, it may be possible to get a general idea 
of the container and maybe even codecs.
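
For instance, a page could check the Blob's type attribute first and fall 
back to sniffing the first few bytes of the recording. The magic numbers 
below are the standard Matroska/EBML and Ogg signatures; the rest is just 
a sketch:

   function identifyRecording(blob, callback) {
     if (blob.type) {
       callback(blob.type);    // e.g. "video/webm"
       return;
     }
     var reader = new FileReader();
     reader.onload = function () {
       var b = new Uint8Array(reader.result);
       if (b[0] == 0x1A && b[1] == 0x45 && b[2] == 0xDF && b[3] == 0xA3) {
         callback('matroska');  // EBML header (Matroska/WebM)
       } else if (b[0] == 0x4F && b[1] == 0x67 &&
                  b[2] == 0x67 && b[3] == 0x53) {
         callback('ogg');       // "OggS"
       } else {
         callback('unknown');
       }
     };
     reader.readAsArrayBuffer(blob.slice(0, 4));
   }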


On Wed, 30 Mar 2011, Stefan Håkansson LK wrote:
>
> This is absolutely correct, and it is not only about codecs or container 
> formats. Maybe you need to supply info like audio sampling rate, video 
> frame rate, video resolution, ... There was an input on this already 
> last November: 
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-November/029069.html

Indeed. The situation hasn't changed since then:

   http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-February/030484.html


On Tue, 29 Mar 2011, Stefan Håkansson LK wrote:
> > > > > The web application must be able to define the media format to 
> > > > > be used for the streams sent to a peer.
> > > > 
> > > > Shouldn't this be automatic and renegotiated dynamically via SDP 
> > > > offer/answer?
> > >
> > > Yes, this should be (re)negotiated via SDP, but what is unclear is 
> > > how the SDP is populated based on the application's preferences.
> > 
> > Why would the Web application have any say on this? Surely the user 
> > agent is in a better position to know what to negotiate, since it will 
> > be doing the encoding and decoding itself.
>
> The best format of the coded media being streamed from UA a to UA b 
> depends on a lot of factors. An obvious one is that the codec used is 
> supported by both UAs.... As you say much of it can be handled without 
> any involvement from the application.
> 
> But let's say that the app in UA a does "addStream". The application in 
> UA b (the same application as in UA a) has two <video> elements, one 
> using a large display size, one using a small size. The UAs don't know 
> in which element the stream will be rendered at this stage (that will be 
> known first when the app in UA b connects the stream to one of the 
> elements at "onaddstream"), so I don't understand how the UAs can select 
> a suitable video resolution without the application giving some input. 
> (Once the stream is being rendered in an element the situation is 
> different - then UA b has knowledge about the rendering and could 
> somehow inform UA a.)

I had assumed that the video would at first be sent with some more or less 
arbitrary dimensions (maybe the native ones), and that the receiving UA 
would then renegotiate the dimensions once the stream was being displayed 
somewhere. Since the page can let the user change the <video> size 
dynamically, it seems the UA would likely need to be able to do that kind 
of dynamic update anyway.


On Thu, 31 Mar 2011, Lachlan Hunt wrote:
>
> When getUserMedia() is invoked with unknown options, the spec currently 
> implicitly requires a PERMISSION_DENIED error to be thrown.
> 
> e.g. navigator.getUserMedia("foo");
> 
> In this case, the option for "foo" is unknown.  Presumably, this would 
> fall under platform limitations, and would thus jump from step 11 to the 
> failure case, and throw a permission denied error.
> 
> We are wondering if this is the most ideal error to throw in this case, 
> as opposed to introducing a more logical NOT_SUPPORTED error, and if it 
> might be useful to authors to distinguish these cases?
> 
> We assume, however, that if the author requests "audio,foo", and the 
> user grants access to audio, then the success callback would be invoked, 
> despite the unknown option for "foo".

Good point. I've updated the spec to fire NOT_SUPPORTED_ERR if there's no 
known value.
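
Assuming the new code is surfaced to the error callback (and that the 
error object exposes both constants), a script could tell the two failure 
modes apart along these lines; the helper functions are made up for 
illustration:

   navigator.getUserMedia('audio, foo',
     function (stream) {
       // "audio" was recognised and granted; "foo" was simply ignored.
       startCall(stream);             // hypothetical helper
     },
     function (error) {
       if (error.code == error.NOT_SUPPORTED_ERR) {
         // e.g. navigator.getUserMedia('foo'): no recognised option at all
         reportMissingFeature();      // hypothetical helper
       } else if (error.code == error.PERMISSION_DENIED) {
         reportAccessDenied();        // hypothetical helper
       }
     });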


On Fri, 8 Apr 2011, Harald Alvestrand wrote:
> 
> The current (April 8) version of section 9.4 says that the config string for a
> PeerConnection object is this:
> ---------------------------
> The allowed formats for this string are:
> 
> "TYPE 203.0.113.2:3478"
> Indicates a specific IP address and port for the server.
> 
> "TYPE relay.example.net:3478"
> Indicates a specific host and port for the server; the user agent will look up
> the IP address in DNS.
> 
> "TYPE example.net"
> Indicates a specific domain for the server; the user agent will look up the IP
> address and port in DNS.
> 
> The "TYPE" is one of:
> 
> STUN
> Indicates a STUN server
> STUNS
> Indicates a STUN server that is to be contacted using a TLS session.
> TURN
> Indicates a TURN server
> TURNS
> Indicates a TURN server that is to be contacted using a TLS session.
> -------------------------------
> I believe this is insufficient, for a number of reasons:
> - For future extensibility, new forms of init data need to be passed without
> invalidating old implementations. This indicates that a serialized JSON object
> with a few keys of defined meaning is a better basic structure than a format
> string with no format identifier.

The format is already defined in a forward-compatible manner. 
Specifically, UAs are currently required to ignore everything past the 
first line feed character. In a future version, we could extend this API 
by simply including additional data after the linefeed.
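
In other words, a UA today would take the first line of the string below 
and ignore the rest; the second line is an invented example of the kind of 
extension a future version could add after the line feed (the two-argument 
constructor follows the current draft, and sendToSignallingChannel() is a 
hypothetical helper):

   var config = 'STUN stun.example.net:3478\n' +
                'EXAMPLE-FUTURE-OPTION foo=bar';  // ignored by current UAs

   var pc = new PeerConnection(config, function (message) {
     // Relay signalling messages to the other peer however the
     // application already does it.
     sendToSignallingChannel(message);
   });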


> - For use with STUN and TURN, we need to support the case where we need a STUN
> server and a TURN server, and they're different.

TURN servers are STUN servers, at least according to the relevant RFCs, as 
far as I can tell. Can you elaborate on which TURN servers do not 
implement STUN, or explain the use cases for having different TURN and 
STUN servers? This is an area where I am most definitely not an expert, so 
any information here would be quite helpful.


> - The method of DNS lookup is not specified. In particular, it is not
> specified whether SRV records are looked up or not.

This seems to be entirely specified. Please ensure that you are reading 
the normative conformance criteria for user agents, and not the 
non-normative authoring advice, which is only a brief overview.


> - We have no evaluation that shows that we'll never need the unencrypted 
> TCP version of STUN or TURN, or that we need to support the encrypted 
> STUN version. We should either support all formats that the spec can 
> generate, or we should get a reasonable survey of implementors on what 
> they think is needed.

If anyone has any data on this, that would indeed be quite useful.


On Fri, 8 Apr 2011, Harald Alvestrand wrote:
>
> BTW, I haven't been on this list that long... if anyone has advice on 
> whether such discussions are better as buganizer threads or as whatwg 
> mailing list threads, please give it!

Discussion is best on the mailing list. In general Bugzilla is best for 
straightforward bugs rather than design discussions.


On Fri, 8 Apr 2011, Glenn Maynard wrote:
> 
> FWIW, I thought the block-of-text configuration string was peculiar and 
> unlike anything else in the platform.  I agree that using a 
> configuration object (of some kind) makes more sense.

An object wouldn't work very well, as it would add extra steps in the case 
where someone just wants to transmit the configuration information to the 
client as data. Using JSON strings as input, as Harald suggested, could 
work, but seems overly verbose for such simple data.
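
For example, with the string form a server-provided configuration can be 
handed straight to the constructor as it arrives; an object-based form 
would add a parse step on top of that. The fetching code here is only 
illustrative, and signalingCallback is assumed to be defined elsewhere:

   var xhr = new XMLHttpRequest();
   xhr.open('GET', '/peerconnection-config', true);
   xhr.onload = function () {
     // The response body is the configuration string, used as-is.
     var pc = new PeerConnection(xhr.responseText, signalingCallback);
   };
   xhr.send();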

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
