- From: Ian Hickson <ian@hixie.ch>
- Date: Sat, 3 Dec 2011 00:00:50 +0000 (UTC)
I include below, for posterity, some feedback to which I will not be replying, as it relates to the PeerConnection and media streams section of the specification which has since been moved to the WebRTC working group at the W3C. I encourage anyone who is interested in that particular topic to follow the aforementioned group.

On Tue, 26 Jul 2011, Mark Callow wrote:
> On 26/07/2011 14:30, Ian Hickson wrote:
> > On Thu, 14 Jul 2011 04:09:40 +0530, Ian Hickson <ian at hixie.ch> wrote:
> > > > > > Another question is flash. As far as I have seen, there seems to be no option to specify whether the camera needs to use flash or not. Is this decision left up to the device? (If someone is making an app which is just clicking a picture of the person, then it would be nice to have the camera use flash in low light conditions).
> > > > > getUserMedia() returns a video stream, so it wouldn't use a flash.
> > > Wouldn't it make sense to have a provision for flash separately then? I think a lot of apps would like just a picture instead of video, and in those cases, flash would be required. Maybe a separate provision in the spec which defines whether to use flash, and if so, for how many milliseconds. Is that doable?
> There is a lot more that could be done than simply triggering the flash. See /The Frankencamera: An Experimental Platform for Computational Photography/ <http://graphics.stanford.edu/papers/fcam/> and The FCAM API <http://fcam.garage.maemo.org/>.

On Tue, 26 Jul 2011, Tommy Widenflycht wrote:
> On Tue, Jul 26, 2011 at 07:30, Ian Hickson <ian at hixie.ch> wrote:
> > > If you send two MediaStream objects constructed from the same LocalMediaStream over a PeerConnection there needs to be a way to separate them on the receiving side.
> > What's the use case for sending the same feed twice?
> There's no proper use case as such but the spec allows this.
>
> > > I also think it is a bit unfortunate that we now have a 'label' property on the track objects that means something else than the 'label' property on MediaStream, perhaps 'description' would be a more suitable name for the former.
> > In what sense do they mean different things? I don't understand the problem here. Can you elaborate?
> label on a MediaStream is a unique identifier, while the label on a MediaStreamTrack is just a description like "Logitech Vision Pro", "Line In" or "Built-in Mic". I too find this a bit odd.
>
> [...]
>
> If I may make an analogy to the real world: plumbing.
>
> Each fork of a MediaStream is a new joint in the pipe, my suggestion introduces a tap at each joint. No matter how you open and close the tap at the end (or middle); if any previous tap is closed there's nothing coming through. The spec currently removes and adds the entire pipe after the changed joint.
>
> > > Also some follow-up questions regarding the new TrackLists:
> > > What should happen when a track fails? Should the entire stream fail, the MSTrack silently be removed or the MSTrack disassociated with the track (and thus becoming a do-nothing object)?
> > What do you mean by "fails"?
> Yanking the USB cable to the camera for example. This should imho stop the MS, not just silently send black video.
>
> > > What should happen when a stream with two or more video tracks is associated to a <video> tag? Just render the first enabled one?
> > Same as if you had a regular video file with multiple tracks.
> And that is? Sorry, this might be written down somewhere and I have missed it.
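For reference, here is a minimal sketch of how the two questions at the end of this exchange (a failing device, and a stream carrying more than one video track) surface in the MediaStream API that eventually shipped, which is not the 2011 draft under discussion. The function name is illustrative only.

    // Sketch against the shipped MediaStream API, not the 2011 draft discussed above.
    async function watchCamera(videoElement: HTMLVideoElement): Promise<void> {
      const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });

      // A <video> element renders one video track from the stream; it is not an
      // error to hand it a stream with several.
      videoElement.srcObject = stream;
      await videoElement.play();

      for (const track of stream.getVideoTracks()) {
        // "Yanking the USB cable" surfaces as an 'ended' event on the affected track
        // rather than the stream silently going black.
        track.addEventListener('ended', () => {
          console.log(`video track "${track.label}" ended (device lost or permission revoked)`);
        });
      }
    }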
On Thu, 28 Jul 2011, Stefan Håkansson LK wrote:
> > On Tue, Jul 26, 2011 at 07:30, Ian Hickson <ian at hixie.ch> wrote:
> > > > If you send two MediaStream objects constructed from the same LocalMediaStream over a PeerConnection there needs to be a way to separate them on the receiving side.
> > > What's the use case for sending the same feed twice?
> > There's no proper use case as such but the spec allows this.
> The question is how serious a problem this is. If you want to fork, and make both (all) versions available at the peer, would you not transmit the full stream and fork at the receiving end for efficiency reasons? And if you really want to fork at the sender, one way to separate them is to use one PeerConnection per fork.

On Tue, 2 Aug 2011, Per-Erik Brodin wrote:
> On 2011-07-26 07:30, Ian Hickson wrote:
> > On Tue, 19 Jul 2011, Per-Erik Brodin wrote:
> > > Perhaps now that there is no longer any relation to tracks on the media elements we could also change Track to something else, maybe Component. I have had people complaining to me that Track is not really a good name here.
> > I'm happy to change the name if there's a better one. I'm not sure Component is any better than Track though.
> OK, let's keep Track until someone comes up with a better name then.
>
> > > Good. Could we still keep audio and video in separate lists though? It makes it easier to check the number of audio or video components and you can avoid loops that have to check the kind for each iteration if you only want to operate on one media type.
> > Well in most (almost all?) cases, there'll be at most one audio track and at most one video track, which is why I didn't put them in separate lists. What use cases did you have in mind where there would be enough tracks that it would be better for them to be separate lists?
> Yes, you're right, but even with zero or one track it's more convenient to have them separate because that way you can more easily check if the stream contains any audio and/or video tracks and check the number of tracks of each kind. I also think it will be problematic if we would like to add another kind at a later stage if all tracks are in the same list since people will make assumptions that audio and video are the only kinds.
>
> > > I also think that it would be easier to construct new MediaStream objects from individual components rather than temporarily disabling the ones you do not want to copy to the new MediaStream object and then re-enabling them again afterwards.
> > Re-enabling them afterwards would re-include them in the copies, too.
> Why is this needed? If a new MediaStream object is constructed from another MediaStream I think it would be simpler to just let that be a clone of the stream with all tracks present (with the enabled/disabled states independently set).
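For comparison, the track-composition model argued for here is roughly what later shipped. A minimal sketch, assuming the modern MediaStream constructor and track accessors; the function names are illustrative only.

    // Audio and video tracks are exposed as separate lists, and a new MediaStream
    // can be composed from individual tracks instead of toggling 'enabled' flags
    // on the original stream.
    function audioOnlyCopy(source: MediaStream): MediaStream {
      console.log(`audio tracks: ${source.getAudioTracks().length}, ` +
                  `video tracks: ${source.getVideoTracks().length}`);

      // The original stream and its enabled/disabled state are left untouched.
      return new MediaStream(source.getAudioTracks());
    }

    // A full copy whose tracks no longer share enabled/disabled state with the source.
    function independentCopy(source: MediaStream): MediaStream {
      return source.clone();
    }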
> > The main use case here is temporarily disabling a video or audio track in a video conference. I don't understand how your proposal would work for that. Can you elaborate?
> A new MediaStream object is created from the video track of a LocalMediaStream to be used as self-view. The LocalMediaStream can then be sent over PeerConnection and the video track disabled without affecting the MediaStream being played back locally in the self-view. In addition, my proposal opens up for additional use cases that require combining tracks from different streams, such as recording a conversation (a number of audio tracks from various streams, local and remote combined to a single stream).
>
> > > It is also unclear to me what happens to a LocalMediaStream object that is currently being consumed in that case.
> > Not sure what you mean. Can you elaborate?
> I was under the impression that, if a stream of audio and video is being sent to one peer and then another peer joins but only audio should be sent, then video would have to be temporarily disabled in the first stream in order to construct a new MediaStream object containing only the audio track. Again, it would be simpler to construct a new MediaStream object from just the audio track and send that.
>
> > > Why should the label be the same as the parent on the newly constructed MediaStream object?
> > The label identifies the source of the media. It's the same source, so, same label.
> I agree, but usually you have more than one source in a MediaStream and if you construct a new MediaStream from it which doesn't contain all of the sources from the parent I don't think the label should be the same. By the way, what happens if you call getUserMedia() twice and get the same set of sources both times, do you get the same label then? What if the user selects different sources the second time?
>
> > > If you send two MediaStream objects constructed from the same LocalMediaStream over a PeerConnection there needs to be a way to separate them on the receiving side.
> > What's the use case for sending the same feed twice?
> If the labels are the same then that should indicate that it's essentially the same stream and there should be no need to send it twice. If the streams are not composed of the same underlying sources then you may want to send them both and the labels should differ.
>
> > > I also think it is a bit unfortunate that we now have a 'label' property on the track objects that means something else than the 'label' property on MediaStream, perhaps 'description' would be a more suitable name for the former.
> > In what sense do they mean different things? I don't understand the problem here. Can you elaborate?
> As Tommy pointed out, label on MediaStream is an identifier for the stream whereas label on MediaStreamTrack is a description of the source.
>
> > > > The current design is just the result of needing to define what happens when you call getRecordedData() twice in a row. Could you elaborate on what API you think we should have?
> > > What I am thinking of is something similar to what was proposed in http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-March/030921.html
> > That doesn't answer the question of what happens if you call stop() twice.
> Nothing will happen the second time since recording has already stopped.
>
> > (Also, having to call a method and hook an event so that you can read an attribute seems like a rather round-about way of getting data. Is calling a method with a callback not simpler?)
> When the event has been fired you can read the attribute whenever you want to get the blob, how many times you want. I prefer that over having stop() take a callback argument.
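For reference, the recording question was eventually answered by a separate, event-driven MediaRecorder interface rather than a getRecordedData() method on the stream: stop() takes no callback and the data arrives via events, much as proposed above. A minimal sketch; the helper name and the fixed-duration timer are illustrative only.

    // Record a stream for a fixed duration and resolve with the resulting Blob.
    function recordClip(stream: MediaStream, durationMs: number): Promise<Blob> {
      return new Promise((resolve) => {
        const recorder = new MediaRecorder(stream);
        const chunks: Blob[] = [];

        // Recorded data is delivered in one or more 'dataavailable' events.
        recorder.ondataavailable = (e) => chunks.push(e.data);

        // Once 'stop' has fired, the chunks can be assembled (and re-read) at any time.
        recorder.onstop = () => resolve(new Blob(chunks, { type: recorder.mimeType }));

        recorder.start();
        setTimeout(() => recorder.stop(), durationMs);
      });
    }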
> > Quota doesn't seem particularly important here. It's not like you can really do lasting damage. It would just be a DOS attack, like creating a Web page with an infinite number of 10000x10000 canvases. We can just let the "hardware limitation" clause handle it.
> In a video blog recording application it would be nice to be able to present to the user how much more can be recorded and not just handle it as a hardware limitation, since that could mean dropping the entire recording.
>
> > > I was not saying that it would not be possible to keep track of which blob: URLs that point to blobs and which point to streams just that we want to avoid doing that in the early stage of the media engine selection. In my opinion a stream is quite the opposite of a blob (unknown, perhaps infinite length vs. fixed length) so when printing the URLs for debugging purposes it would also be much nicer to have two different protocol schemes. If I remember correctly the discussions leading up to the renaming of createBlobURL to createObjectURL assumed that there would be stream: URLs.
> > You wouldn't be able to remove that logic, since http: URLs would still have the same needs. You can have finite and infinite http: resources, just like you can have finite and infinite blob: resources. I don't really see the problem here. Indeed, with blob:, it's trivial to find out if the resource is finite or not; with http: you might not know until the whole finite resource is downloaded.
> > If there is something I'm missing here please do let me know.
> The differentiation is not between finite and infinite resources but rather between playback media resources and conversational media resources. blob: and http: are both handled by the playback media engine whereas stream: is handled by the conversational media engine. We would like to be able to determine which engine to use by simply looking at the URL.
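As it turned out, neither a stream: scheme nor blob: URLs for live streams survived: object URLs stayed with static Blobs, and streams are attached to media elements directly, so the two "engines" are selected by API shape rather than by URL scheme. A minimal sketch of the shipped behaviour; the function names are illustrative only.

    // Finite, playback-style resource: mint a blob: URL for it.
    function attachRecording(video: HTMLVideoElement, recording: Blob): void {
      const url = URL.createObjectURL(recording);
      video.src = url;
      // Release the URL-to-object mapping once the element has the data.
      video.addEventListener('loadeddata', () => URL.revokeObjectURL(url), { once: true });
    }

    // Live, conversational resource: no URL at all.
    // (createObjectURL(MediaStream) existed in browsers for a while but was removed.)
    function attachLiveStream(video: HTMLVideoElement, stream: MediaStream): void {
      video.srcObject = stream;
    }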
> > > > > PeerConnection is an EventTarget but it still uses a callback for the signaling messages and this mixture of events and callbacks is a bit awkward in my opinion. If you would like to change the function that handles signaling messages after calling the constructor you would have to wrap a function call inside the callback to the actual signal handling function, instead of just (re-)setting an onsignal (or whatever) attribute listener (the event could reuse the MessageEvent interface).
> > > > When would you change the callback?
> > > If you would like to send the signaling messages peer-to-peer over the data channel, once it is established.
> > That seems like a disaster waiting to happen. The UDP data channel is unreliable, the signaling channel has to be reliable. Worse, the UDP data channel might go down at any second, and then the user agent would try to re-establish it using the signaling channel.
> You can provide a reliable channel on top of the unreliable channel and monitor the PeerConnection state so that you know when to fall back to server-relayed signaling. One reason to do this would be to improve the signaling latency which can be of importance in applications that, for example, trigger format renegotiation due to change in video display size.
>
> > > > - It's easy to not register a callback, which makes no sense. There's literally never a use for creating a PeerConnection without a signaling channel, as far as I can tell, so making it easier to create one without a callback than with seems like a bad design.
> > > For example, creating an EventSource without registering any listener for incoming events equally does not make sense.
> > Actually, it does. One operation mode for EventSource is to have events with different names, each triggering a different event listener.
> An EventSource without any event listener seems rather useless to me. Even if you can assign multiple handlers for events with different names, all those handlers could still be provided as arguments to the constructor, right? That would ensure that nobody can create an EventSource without registering at least one event listener.
>
> > > > > There is a potential problem in the exchange of SDPs in that glare conditions can occur if both peers add streams simultaneously, in which case there will be two different outstanding offers that none of the peers are allowed to respond to according to the SDP offer-answer model. Instead of using one SDP session for all media as the specification suggests, we are handling the offer-answer for each stream separately to avoid such conditions.
> > > > Why isn't this handled by the ICE role conflict processing rules? It seems like simultaneous ICE restarts would be trivially resolvable by just following the rules in the ICE spec. Am I missing something?
> > > This problem is not related to ICE but rather to the SDP offer-answer model which is separate from the ICE processing. The problem is that SDP offer-answer does not allow you to respond to an offer when you have an outstanding offer for the same set of streams.
> > As far as I can tell, your interpretation is incorrect. This is entirely related to ICE, and ICE, as far as I can tell, defines this exact case in its role conflict resolution.
> > The only time this can happen is if you have both ends do an ICE restart at exactly the same time. The offer from each ICE agent will be received by the other as if it was the response, and thus there will be a role conflict and the ICE role conflict resolution process will kick in. No?
> No, an ICE role conflict is not the same thing as a glare condition in SDP offer-answer.
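For reference, the shipped RTCPeerConnection API settled this by leaving signaling entirely to the application: the object raises events and the page supplies its own reliable channel. A minimal sketch, assuming a hypothetical WebSocket signaling endpoint and an ad-hoc JSON message format (both illustrative, not part of any specification).

    const signaling = new WebSocket('wss://example.com/signaling'); // hypothetical endpoint
    const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.example.org' }] });

    // Renegotiation is event-driven rather than tied to a constructor callback.
    pc.onnegotiationneeded = async () => {
      await pc.setLocalDescription(await pc.createOffer());
      signaling.send(JSON.stringify({ description: pc.localDescription }));
    };

    pc.onicecandidate = (e) => {
      if (e.candidate) signaling.send(JSON.stringify({ candidate: e.candidate }));
    };

    signaling.onmessage = async (msg) => {
      const { description, candidate } = JSON.parse(msg.data);
      if (description) {
        // Offer/answer glare is the application's problem (or is avoided by assigning
        // offerer/answerer roles up front); here we naively answer any incoming offer.
        await pc.setRemoteDescription(description);
        if (description.type === 'offer') {
          await pc.setLocalDescription(await pc.createAnswer());
          signaling.send(JSON.stringify({ description: pc.localDescription }));
        }
      } else if (candidate) {
        await pc.addIceCandidate(candidate);
      }
    };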
On Wed, 27 Jul 2011, Rob Manson wrote:
> This is definitely not intended as criticism of any of the work going on. It's intended as constructive feedback that hopefully provides clarification on a key use case and its supporting requirements.
>
>         "Access to live/raw audio and video stream data from both local and remote sources in a consistent way"
>
> I've spent quite a bit of time trying to follow a clear thread of requirements/solutions that provide API access to raw stream data (e.g. audio, video, etc.). But I'm a bit concerned this is falling in the gap between the DAP and RTC WGs. If this is not the case then please point me to the relevant docs and I'll happily get back in my box 8)
>
> Here's how the thread seems to flow at the moment based on public documents.
>
> On the DAP page [1] the mission states:
>         "the Device APIs and Policy Working Group is to create client-side APIs that enable the development of Web Applications and Web Widgets that interact with device services such as Calendar, Contacts, Camera, etc"
>
> So it seems clear that this is the place to start. Further down that page the "HTML Media Capture" and "Media Capture" APIs are listed.
>
> HTML Media Capture (camera/microphone interactions through HTML forms) initially seems like a good candidate, however the intro in the latest PWD [2] clearly states:
>         "Providing streaming access to these capabilities is outside of the scope of this specification."
> Followed by a NOTE that states:
>         "The Working Group is investigating the opportunity to specify streaming access via the proposed <device> element."
> The link on the "proposed <device> element" [3] links to a "no longer maintained" document that then redirects to the top level of the whatwg "current work" page [4]. On that page the most relevant link is the video conferencing and peer-to-peer communication section [5]. More about that further below.
>
> So back to the DAP page to explore the other Media Capture API (programmatic access to camera/microphone) [1] and its latest PWD [6].
>
> The abstract states:
>         "This specification defines an Application Programming Interface (API) that provides access to the audio, image and video capture capabilities of the device."
> And the introduction states:
>         "The Capture API defines a high-level interface for accessing the microphone and camera of a hosting device. It completes the HTML Form Based Media Capturing specification [HTMLMEDIACAPTURE] with a programmatic access to start a parametrized capture process."
> So it seems clear that this is not related to streams in any way either.
>
> The Notes column for this API on the DAP page [1] also states:
>         "Programmatic API that completes the form based approach
>         Need to check if still interest in this
>         How does it relate with the Web RTC Working Group?"
> Is there an updated position on this?
>
> So if you then head over to the WebRTC WG's charter [7] it states:
>         "...to define client-side APIs to enable Real-Time Communications in Web browsers.
>         These APIs should enable building applications that can be run inside a browser, requiring no extra downloads or plugins, that allow communication between parties using audio, video and supplementary real-time communication, without having to use intervening servers..."
> So this is clearly focused upon peer-to-peer communication "between" systems and the stream related access is naturally just treated as an ancillary requirement. The scope section then states:
>         "Enabling real-time communications between Web browsers require the following client-side technologies to be available:
>         - API functions to explore device capabilities, e.g. camera, microphone, speakers (currently in scope for the Device APIs & Policy Working Group)
>         - API functions to capture media from local devices (camera and microphone) (currently in scope for the Device APIs & Policy Working Group)
>         - API functions for encoding and other processing of those media streams,
>         - API functions for establishing direct peer-to-peer connections, including firewall/NAT traversal
>         - API functions for decoding and processing (including echo cancelling, stream synchronization and a number of other functions) of those streams at the incoming end,
>         - Delivery to the user of those media streams via local screens and audio output devices (partially covered with HTML5)"
> So this is where I really start to feel the gap growing. The DAP is pointing to the RTC, saying it is not sure whether its Camera/Microphone APIs are being superseded by the work in the RTC...and the RTC then points back to say it will be relying on work in the DAP. However the RTC's Recommended Track Deliverables list does include:
>         "Media Stream Functions, Audio Stream Functions and Video Stream Functions"
>
> So then it's back to the whatwg MediaStream and LocalMediaStream current work [8]. Following this through you end up back at the <audio> and <video> media element with some brief discussion about media data [9].
>
> Currently the only API that I'm aware of that allows live access to the audio data through the <audio> tag is the relatively proprietary Mozilla Audio Data API [10].
>
> And while the video stream data can be accessed by rendering each frame into a canvas 2d graphics context and then using getImageData to extract and manipulate it from there [11], this seems more like a workaround than an elegantly designed solution.
>
> As I said above, this is not intended as a criticism of the work that the DAP WG, WebRTC WG or WHATWG are doing. It's intended as constructive feedback to highlight that the important use case of "Access to live/raw audio and video stream data from both local and remote sources" appears to be falling in the gaps between the groups.
>
> From my perspective this is a critical use case for many advanced web apps that will help bring them in line with what's possible in the native single vendor stack based apps at the moment (e.g. iPhone & Android). And it's also critical for the advancement of web standards based AR applications and other computer vision, hearing and signal processing applications.
>
> I understand that a lot of these specifications I've covered are in very formative stages and that requirements and PWDs are just being drafted as I write. And that's exactly why I'm raising this as a single and consolidated perspective that spans all these groups. I hope this goes some way towards "Access to live/raw audio and video stream data from both local and remote sources" being treated as an essential and core use case that binds together the work of all these groups. With a clear vision for this and a little consolidated work I think this will then also open up a wide range of other app opportunities that we haven't even thought of yet. But at the moment it really feels like this is being treated as an assumed requirement and could end up as a poorly formed second class bundle of semi-related API hooks.
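For reference, the two access paths described above look roughly like this today: per-frame pixel access via the canvas workaround of [11], and time-domain audio samples via what became the Web Audio API, the standardised successor to the Audio Data API of [10]. A minimal sketch; the function names are illustrative only.

    // Copy the current video frame into a canvas and read back its RGBA bytes.
    function samplePixels(video: HTMLVideoElement): ImageData {
      const canvas = document.createElement('canvas');
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      const ctx = canvas.getContext('2d')!;
      ctx.drawImage(video, 0, 0);
      return ctx.getImageData(0, 0, canvas.width, canvas.height);
    }

    // Tap a MediaStream's audio and return a function that yields the latest samples.
    function sampleAudio(stream: MediaStream): () => Float32Array {
      const audioCtx = new AudioContext();
      const source = audioCtx.createMediaStreamSource(stream);
      const analyser = audioCtx.createAnalyser();
      source.connect(analyser);
      const buffer = new Float32Array(analyser.fftSize);
      return () => {
        analyser.getFloatTimeDomainData(buffer);
        return buffer;
      };
    }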
> For this use case I'd really like these clear requirements to be supported:
> - access the raw stream data for both audio and video in similar ways
> - access the raw stream data from both remote and local streams in similar ways
> - ability to inject new data or the transformed original data back into streams and presented audio/video tags in a consistent way
> - all of this be optimised for performance to meet the demands of live signal processing
>
> PS: I've also cc'd in the mozilla dev list as I think this directly relates to the current "booting to the web" thread [12]
>
> [1] http://www.w3.org/2009/dap/
> [2] http://www.w3.org/TR/2011/WD-html-media-capture-20110414/#introduction
> [3] http://dev.w3.org/html5/html-device/
> [4] http://www.whatwg.org/specs/web-apps/current-work/complete/#devices
> [5] http://www.whatwg.org/specs/web-apps/current-work/complete/#auto-toc-9
> [6] http://www.w3.org/TR/2010/WD-media-capture-api-20100928/
> [7] http://www.w3.org/2011/04/webrtc-charter.html
> [8] http://www.whatwg.org/specs/web-apps/current-work/complete/video-conferencing-and-peer-to-peer-communication.html#mediastream
> [9] http://www.whatwg.org/specs/web-apps/current-work/complete/the-iframe-element.html#media-data
> [10] https://wiki.mozilla.org/Audio_Data_API
> [11] https://developer.mozilla.org/En/Manipulating_video_using_canvas
> [12] http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/7668a9d46a43e482#

On Fri, 12 Aug 2011, Darin Fisher wrote:
> Putting implementation details aside, I agree that it is a bit unfortunate to refer to a stream as a blob. So far, blobs have always referred to static, fixed-size things.
>
> This function was originally named createBlobURL, but it was renamed createObjectURL precisely because we imagined it being useful to pass things that were not blobs to it. It seems reasonable that passing a Foo object to createObjectURL might mint a different URL type than what we would mint for a Bar object.
>
> It could also be the case that using blob: for referring to Blobs was unfortunate. Maybe we do not really need separate URL schemes for static, fixed size things and streams.
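As a footnote to the createObjectURL point above: the function did end up minting blob: URLs for more than one kind of object, static Blobs and, later, MediaSource objects, and only GET-like dereferencing is defined for the result. A minimal sketch; the fetch of the text blob is purely illustrative.

    const textBlob = new Blob(['hello'], { type: 'text/plain' });
    const blobUrl = URL.createObjectURL(textBlob);   // e.g. "blob:https://example.org/..."

    const mediaSource = new MediaSource();           // open-ended, stream-like object
    const msUrl = URL.createObjectURL(mediaSource);  // same minting function, also a blob: URL
    console.log(msUrl);

    // Dereferencing a blob: URL is GET-like and nothing else.
    fetch(blobUrl)
      .then((r) => r.text())
      .then((text) => {
        console.log(text);             // "hello"
        URL.revokeObjectURL(blobUrl);  // the mapping is explicitly revocable
      });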
On Mon, 15 Aug 2011, Harald Alvestrand wrote:
> Back in ancient history (late 90s, I think), when I wrote the first version of stuff that eventually merged into RFC 4395, "New URI schemes", I thought the set of operations an URI supported was pretty important.
>
> In fact the text of RFC 4395 says:
>
>    2.4. Definition of Operations
>
>    As part of the definition of how a URI identifies a resource, a URI scheme definition SHOULD define the applicable set of operations that may be performed on a resource using the URI as its identifier. A model for this is HTTP; an HTTP resource can be operated on by GET, POST, PUT, and a number of other operations available through the HTTP protocol. The URI scheme definition should describe all well-defined operations on the URI identifier, and what they are supposed to do.
>
>    Some URI schemes don't fit into the "information access" paradigm of URIs. For example, "telnet" provides location information for initiating a bi-directional data stream to a remote host; the only operation defined is to initiate the connection. In any case, the operations appropriate for a URI scheme should be documented.
>
>    Note: It is perfectly valid to say that "no operation apart from GET is defined for this URI". It is also valid to say that "there's only one operation defined for this URI, and it's not very GET-like". The important point is that what is defined on this scheme is described.
>
> So if that consideration is still of concern, the next question is of course "are there operations that make sense for a stream that don't make sense for (current uses of) blob:, or vice versa"?
>
> If "blob:" was intended to mean "reference to internal object, hand it to APIs, the APIs will tell you if they don't like them", that consideration may not be that important.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'