W3C home > Mailing lists > Public > whatwg@whatwg.org > April 2011

[whatwg] PeerConnection feedback

From: Justin Uberti <juberti@google.com>
Date: Mon, 11 Apr 2011 23:17:30 -0700
Message-ID: <BANLkTi=o3or93Hg-0tHamFYha6QpriTcgg@mail.gmail.com>

On Mon, Apr 11, 2011 at 7:09 PM, Ian Hickson <ian at hixie.ch> wrote:

> On Tue, 29 Mar 2011, Robert O'Callahan wrote:
> > Ian Hickson wrote:
> > >
> > > I agree that (on the long term) we should support stream filters on
> > > streams, but I'm not sure I understand <video>'s role in this.
> > > Wouldn't it be more efficient to have something that takes a Stream on
> > > one side and outputs a Stream on the other, possibly running some
> > > native code or JS in the middle?
> >
> > We could.
> >
> > I'm trying to figure out how this is going to fit in with audio APIs.
> > Chris Rogers from Google is proposing a graph-based audio API to the W3C
> > Audio incubator group which would overlap considerably with a Stream
> > processing API like you're suggesting (although in his proposal
> > processing nodes, not streams, are first-class).
>
> Indeed. I think it would make sense to have nodes in this graph that could
> take Streams as input, or output the resulting data as Streams.
>
>
> > A fundamental problem here is that HTML media elements have the
> > functionality of both sources and sinks.
>
> Indeed. Unfortunately, at the time that we were designing <video>, the
> later needs of multitrack video and video conferencing were not completely
> clear. If we could go back, I think it would make sense to split the part
> of <video> that does network traffic and the part of <video> that does
> rendering and UI control from each other, if only to make it possible
> now to have them be split further for video conferencing and multitrack.
> Sadly, that's not really an option.
>
>
> > You want to see <video> and <audio> only as sinks which accept streams.
> > But in that case, if we create an audio processing API using Streams,
> > we'll need a way to download stream data for processing that doesn't use
> > <audio> and <video>, which means we'll need to replicate <src> elements,
> > the type attribute, networkstates, readystates, possibly the 'loop'
> > attribute... should we introduce a new object or element that provides
> > those APIs? How much can be shared with <video> and <audio>? Should we
> > be trying to share? (In Chris Rogers' proposal, <audio> elements are
> > used as sources, not sinks.)
>
> I think at this point we should probably just make media elements (<video>
> and <audio>) support being used both as sources and as sinks. They'll just
> be a little overweight when used just for one of those purposes.
>
> Basically I'm suggesting viewing media elements like this:
>
>   URL to network resource
>   URL to Stream object
>   URL to Blob object
>   |
>   |   ----------------------------
>   +-> :SINK                SOURCE: -+
>       ------------. T .-----------  |
>                   |   |             |
>                   |   |            Input for
>                   |   |            Audio API
>                   |   |
>                   \   /
>                    \ /
>                     V
>                  DISPLAY
>                    AND
>                SOUND BOARD
>
> It's a source that happens to have built-in monitor output. Or a sink that
> happens to have a monitor output port. Depending on how you want to see it.
>
>
> On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> >
> > A lot of firewalls (including Google's, I believe) drop the subsequent
> > part of fragmented UDP packets, because it's impossible to apply
> > firewall rules to fragments without keeping track of all fragmented UDP
> > packets that are in the process of being transmitted (and keeping track
> > would open the firewalls to an obvious resource exhaustion attack).
> >
> > This has made UDP packets larger than the MTU pretty useless.
>
> So I guess the question is do we want to limit the input to a fixed value
> that is the lowest used MTU (576 bytes per IPv4), or dynamically and
> regularly determine what the lowest possible MTU is?
>
> The former has a major advantage: if an application works in one
> environment, you know it'll work elsewhere, because the maximum packet
> size won't change. This is a serious concern on the Web, where authors tend
> to do limited testing and thus often fail to handle rare edge cases well.
>
> The latter has a major disadvantage: the path MTU might change, meaning we
> might start dropping data if we don't keep trying to determine the Path
> MTU. Also, it's really hard to determine the Path MTU in practice.
>
> For now I've gone with the IPv4 "minimum maximum" of 576 minus overhead,
> leaving 504 bytes for user data per packet. It seems small, but I don't
> know how much data people normally send along these low-latency unreliable
> channels.
>
> However, if people want to instead have the minimum be dynamically
> determined, I'm open to that too. I think the best way to approach that
> would be to have UAs implement it as an experimental extension at first,
> and for us to get implementation experience on how well it works. If
> anyone is interested in doing that I'm happy to work with them to work out
> a way to do this that doesn't interfere with UAs that don't yet implement
> that extension.
>
>
In practice, applications assume that the minimum MTU is 1280 (the minimum
IPv6 MTU), and limit payloads to about 1200 bytes so that with framing they
will fit into a 1280-byte MTU. Going down to 576 would significantly
increase the packetization overhead.
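To make the budget concrete, here is the arithmetic behind those numbers (a sketch; the header sizes are the standard fixed values, but the exact framing overhead on top of UDP depends on the transport actually used):

```python
# Packet budget arithmetic for the MTU discussion above.
IPV6_MIN_MTU = 1280        # minimum link MTU required by IPv6
IPV4_MIN_REASSEMBLY = 576  # IPv4 "minimum maximum" reassembly size

IPV6_HEADER = 40
IPV4_HEADER = 20
UDP_HEADER = 8

# What applications commonly assume: fit inside the IPv6 minimum MTU.
# The remaining budget still has to carry any extra framing (e.g. SRTP
# or TURN), which is why payloads are held to roughly 1200 bytes.
ipv6_budget = IPV6_MIN_MTU - IPV6_HEADER - UDP_HEADER         # 1232 bytes
ipv4_budget = IPV4_MIN_REASSEMBLY - IPV4_HEADER - UDP_HEADER  # 548 bytes

print(ipv6_budget, ipv4_budget)
```

Dropping from a ~1200-byte payload to ~504 bytes more than doubles the per-packet header cost for the same amount of user data.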


> On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> > On 03/29/11 03:00, Ian Hickson wrote:
> > > On Wed, 23 Mar 2011, Harald Alvestrand wrote:
> > > > >
> > > > > Is there really an advantage to not using SRTP and reusing the RTP
> > > > > format for the data messages?
> > >
> > > Could you elaborate on how (S)RTP would be used for this? I'm all in
> > > favour of defering as much of this to existing protocols as possible,
> > > but RTP seemed like massive overkill for sending game status packets.
> >
> > If "data" was defined as an RTP codec ("application/packets?"), SRTP
> > could be applied to the packets.
> >
> > It would impose a 12-byte header in front of the packet and the
> > recommended authentication tag at the end, but would ensure that we
> > could use exactly the same procedure for key exchange
>
> We already use SDP for key exchange for the data stream.
>



>
> > multiplexing of multiple data streams on the same channel using SSRC,
>
> I don't follow. What benefit would that have?
>

If you are in a conference that has 10 participants, you don't want to have
to set up a new transport for each participant. Instead, SSRC provides an
excellent way to multiplex multiple media streams over a single RTP session
(and network transport).
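The demultiplexing is cheap because the SSRC sits at a fixed offset in the 12-byte RTP header (per RFC 3550). A minimal sketch of routing packets from one transport to per-participant handlers (the `demux` helper and its dict-of-lists shape are illustrative, not from any spec):

```python
import struct

def rtp_ssrc(packet: bytes) -> int:
    """Extract the SSRC from the fixed 12-byte RTP header (RFC 3550).

    Byte layout: V/P/X/CC (1), M/PT (1), sequence number (2),
    timestamp (4), SSRC (4) -- all fields big-endian.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    return struct.unpack_from("!I", packet, 8)[0]

def demux(packets):
    """Group packets arriving on one transport by sender (SSRC)."""
    streams = {}
    for pkt in packets:
        streams.setdefault(rtp_ssrc(pkt), []).append(pkt)
    return streams
```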

>
>
> > and procedures for identifying the stream in SDP (if we continue to use
> > SDP) - I believe SDP implicitly assumes that all the streams it
> > describes are RTP streams.
>
> That doesn't seem to be the case, but I could be misinterpreting SDP.
> Currently, the HTML spec includes instructions on how to identify the
> stream in SDP; if those instructions are meaningless due to a
> misunderstanding of SDP then we should fix it (and in that case, it might
> indeed make a lot of sense to use RTP to carry this data).
>
>
> > I've been told that defining RTP packetization formats for a codec needs
> > to be done carefully, so I don't think this is a full specification, but
> > it seems that the overhead of doing so is on the same order of magnitude
> > as the currently proposed solution, and the security properties then
> > become very similar to the properties for media streams.
>
> There are very big differences in the security considerations for media
> data and the security considerations for the data stream. In particular,
> the media data can't be generated by the author in any meaningful way,
> whereas the data is entirely under author control. I don't think it is
> safe to assume that the security properties that we have for media streams
> necessarily work for data streams.
>
>
> On Tue, 29 Mar 2011, Harald Alvestrand wrote:
> > > > >
> > > > > Recording any of these requires much more specification than just
> > > > > "record here".
> > >
> > > Could you elaborate on what else needs specifying?
> >
> > One thing I remember from an API design talk I viewed: "An ability to
> > record to a file means that the file format is part of your API."
>
> Indeed.
>
>
> > For instance, for audio recording, it's likely that you want control
> > over whether the resulting file is in Ogg Vorbis format or in MP3
> > format; for video, it's likely that you may want to specify that it will
> > be stored using the VP8 video codec, the Vorbis audio codec and the
> > Matroska container format. These desires have to be communicated to the
> > underlying audio/video engine, so that the proper transforms can be
> > inserted into the processing stream
>
> Yes, we will absolutely need to add these features in due course. Exactly
> what we should add is something we have to determine from implementation
> experience.
>
>
> > and I think they have to be communicated across this interface; since
> > the output of these operations is a blob without any inherent type
> > information, the caller has to already know which format the media is
> > in.
>
> Depending on the use case and on implementation trajectories, this isn't a
> given. For example, if all UAs end up implementing one of two
> codec/container combinations and we don't expose any controls, it may be
> that the first few bytes of the output file are in practice enough to
> fully identify the format used.
>
> Note also that Blob does have a MIME type, so even without looking at the
> data itself, or at the UA string, it may be possible to get a general idea
> of the container and maybe even codecs.
>
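Right - the leading bytes are often enough. A sketch of that kind of sniffing for the two combinations mentioned above (illustrative only; real sniffing rules are more involved than a four-byte prefix check):

```python
def sniff_container(data: bytes) -> str:
    """Guess a media container from its first bytes."""
    if data[:4] == b"\x1a\x45\xdf\xa3":  # EBML header: Matroska / WebM
        return "matroska"
    if data[:4] == b"OggS":              # Ogg page capture pattern
        return "ogg"
    return "unknown"
```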
>
> On Wed, 30 Mar 2011, Stefan Håkansson LK wrote:
> >
> > This is absolutely correct, and it is not only about codecs or container
> > formats. Maybe you need to supply info like audio sampling rate, video
> > frame rate, video resolution, ... There was an input on this already
> > last November:
> >
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2010-November/029069.html
>
> Indeed. The situation hasn't changed since then:
>
>
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-February/030484.html
>
>
> On Tue, 29 Mar 2011, Stefan Håkansson LK wrote:
> > > > > > The web application must be able to define the media format to
> > > > > > be used for the streams sent to a peer.
> > > > >
> > > > > Shouldn't this be automatic and renegotiated dynamically via SDP
> > > > > offer/answer?
> > > >
> > > > Yes, this should be (re)negotiated via SDP, but what is unclear is
> > > > how the SDP is populated based on the application's preferences.
> > >
> > > Why would the Web application have any say on this? Surely the user
> > > agent is in a better position to know what to negotiate, since it will
> > > be doing the encoding and decoding itself.
> >
> > The best format of the coded media being streamed from UA a to UA b
> > depends on a lot of factors. An obvious one is that the codec used is
> > supported by both UAs.... As you say much of it can be handled without
> > any involvement from the application.
> >
> > But let's say that the app in UA a does "addStream". The application in
> > UA b (the same application as in UA a) has two <video> elements, one
> > using a large display size, one using a small size. The UAs don't know
> > in which element the stream will be rendered at this stage (that will be
> > known first when the app in UA b connects the stream to one of the
> > elements at "onaddstream"), so I don't understand how the UAs can select
> > a suitable video resolution without the application giving some input.
> > (Once the stream is being rendered in an element the situation is
> > different - then UA b has knowledge about the rendering and could
> > somehow inform UA a.)
>
> I had assumed that the video would at first be sent with some more or less
> arbitrary dimensions (maybe the native ones), and that the receiving UA
> would then renegotiate the dimensions once the stream was being displayed
> somewhere. Since the page can let the user change the <video> size
> dynamically, it seems the UA would likely need to be able to do that kind
> of dynamic update anyway.
>
>
> On Thu, 31 Mar 2011, Lachlan Hunt wrote:
> >
> > When getUserMedia() is invoked with unknown options, the spec currently
> > implicitly requires a PERMISSION_DENIED error to be thrown.
> >
> > e.g. navigator.getUserMedia("foo");
> >
> > In this case, the option for "foo" is unknown.  Presumably, this would
> > fall under platform limitations, and would thus jump from step 11 to the
> > failure case, and throw a permission denied error.
> >
> > We are wondering if this is the most ideal error to throw in this case,
> > as opposed to introducing a more logical NOT_SUPPORTED error, and if it
> > might be useful to authors to distinguish these cases?
> >
> > We assume, however, that if the author requests "audio,foo", and the
> > user grants access to audio, then the success callback would be invoked,
> > despite the unknown option for "foo".
>
> Good point. I've updated the spec to fire NOT_SUPPORTED_ERR if there's no
> known value.
>
>
> On Fri, 8 Apr 2011, Harald Alvestrand wrote:
> >
> > The current (April 8) version of section 9.4 says that the config string
> > for a PeerConnection object is this:
> > ---------------------------
> > The allowed formats for this string are:
> >
> > "TYPE 203.0.113.2:3478"
> > Indicates a specific IP address and port for the server.
> >
> > "TYPE relay.example.net:3478"
> > Indicates a specific host and port for the server; the user agent will
> > look up the IP address in DNS.
> >
> > "TYPE example.net"
> > Indicates a specific domain for the server; the user agent will look up
> > the IP address and port in DNS.
> >
> > The "TYPE" is one of:
> >
> > STUN
> > Indicates a STUN server
> > STUNS
> > Indicates a STUN server that is to be contacted using a TLS session.
> > TURN
> > Indicates a TURN server
> > TURNS
> > Indicates a TURN server that is to be contacted using a TLS session.
> > -------------------------------
> > I believe this is insufficient, for a number of reasons:
> > - For future extensibility, new forms of init data need to be passed
> > without invalidating old implementations. This indicates that a
> > serialized JSON object with a few keys of defined meaning is a better
> > basic structure than a format string with no format identifier.
>
> The format is already defined in a forward-compatible manner.
> Specifically, UAs are currently required to ignore everything past the
> first line feed character. In a future version, we could extend this API
> by simply including additional data after the linefeed.
>
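For what it's worth, that forward-compatibility rule is easy to state in code. A sketch of a parser following the behavior described above - only the first line is significant, everything after the first line feed is reserved for future extensions (the function name and return shape are mine, not from the spec):

```python
def parse_config(config: str):
    """Parse a PeerConnection configuration string.

    Per the described rule, ignore everything past the first line feed,
    so future versions can append extension data without breaking old
    implementations.
    """
    first_line = config.split("\n", 1)[0]
    type_, _, server = first_line.partition(" ")
    if type_ not in ("STUN", "STUNS", "TURN", "TURNS"):
        raise ValueError("unknown server type: " + type_)
    return type_, server
```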
>
> > - For use with STUN and TURN, we need to support the case where we need
> > a STUN server and a TURN server, and they're different.
>
> TURN servers are STUN servers, at least according to the relevant RFCs, as
> far as I can tell. Can you elaborate on which TURN servers do not
> implement STUN, or explain the use cases for having different TURN and
> STUN servers? This is an area where I am most definitely not an expert, so
> any information here would be quite helpful.
>
>
> > - The method of DNS lookup is not specified. In particular, it is not
> > specified whether SRV records are looked up or not.
>
> This seems to be entirely specified. Please ensure that you are reading
> the normative conformance criteria for user agents, and not the
> non-normative authoring advice, which is only a brief overview.
>
>
> > - We have no evaluation that shows that we'll never need the unencrypted
> > TCP version of STUN or TURN, or that we need to support the encrypted
> > STUN version. We should either support all formats that the spec can
> > generate, or we should get a reasonable survey of implementors on what
> > they think is needed.
>
> If anyone has any data on this, that would indeed be quite useful.
>
>
> On Fri, 8 Apr 2011, Harald Alvestrand wrote:
> >
> > BTW, I haven't been on this list that long... if anyone has advice on
> > whether such discussions are better as buganizer threads or as whatwg
> > mailing list threads, please give it!
>
> Discussion is best on the mailing list. In general Bugzilla is best for
> straight-forward bugs rather than design discussions.
>
>
> On Fri, 8 Apr 2011, Glenn Maynard wrote:
> >
> > FWIW, I thought the block-of-text configuration string was peculiar and
> > unlike anything else in the platform.  I agree that using a
> > configuration object (of some kind) makes more sense.
>
> An object wouldn't work very well as it would add additional steps in the
> case where someone just wants to transmit the configuration information to
> the client as data. Using JSON strings as input as Harald suggested could
> work, but seems overly verbose for such simple data.
>

I have a feeling that this configuration information will only start off
simple.
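To illustrate the trade-off being weighed here (the JSON key names are hypothetical, not from any proposal): the string form is terser today, but the object form has room to grow without a parsing convention.

```python
import json

# The current string form versus a possible JSON form of the same config.
as_string = "TURNS turn.example.net:3478"
as_json = json.dumps(
    {"type": "TURNS", "host": "turn.example.net", "port": 3478}
)

# The JSON form is longer, but new keys can be added later without
# invalidating existing parsers.
print(len(as_string), len(as_json))
```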

>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 11 April 2011 23:17:30 GMT
