Re: [Bug 23220] Add 'zoom' constraint to VideoStreamTrack from Rob Manson on 2013-10-02 (public-media-capture@w3.org from October 2013)

From: Rob Manson <roBman@mob-labs.com>
Date: Wed, 02 Oct 2013 15:48:23 +1000
To: public-media-capture@w3.org
Message-ID: <524BB3A7.2090109@mob-labs.com>

Hi Martin,

thanks for the feedback.

> I don't think that this is the right way to present this information.

Fair enough.  Is it just the representation/diagram or the whole concept?

Personally I think it would be hard to defend an argument that this 
overall growth of binary data streams isn't happening and that it's not 
changing the overall needs imposed on browser architectures.


> This isn't just because of the problems that Harald found with the
> Stream <=> MediaStream analogy.

Harald had some good points about the names/language.  But I didn't get 
the feeling he was questioning the overall discussion.  Harald, if I 
misread this, please call it out.


> The fundamental problem is that you are mixing primitives (MediaStream/Track)
> with sources and sinks (like gUM, RTCPeerConnection, processing nodes, audio
> and video tags). At least for media, there should be primitives on either side
> of the processing and RTC nodes (RTCPeerConnection, web audio). And a
> similar set of nodes would isolate byte streams from media streams,
> the MediaRecorder being one example of this.

Well, first of all let me say that this is an abstraction...and that all 
abstractions are lies 8)  And of course I've drawn this from a specific 
web developer's perspective to highlight a specific communication 
objective...which I think is creating a little impedance mismatch 
between us.

But let me walk through your feedback example, as I'm not sure I agree 
with your statement and perhaps I'm misunderstanding something.  If you 
can help me understand that, it would be great.

You said I'm mixing primitives with sources and sinks.  But if we just 
walk through the very top-most flow in the diagram I see this.

On the left we have a "camera" or "screen" (I agree this is a "source").

And then next on the right we have the gUM API which allows us to access 
these "cameras" and "screens".

And then gUM passes me a MediaStream object in the success callback.  In 
this case it's a localStream.  And in the PeerConnection example below 
that it may be a remoteStream.  But either way it's a MediaStream[2].

Next on the right we have the processing pipelines we've been 
documenting and experimenting with[1].  For the Video/Canvas example we 
connect the MediaStream to a HTMLVideoElement .src (in your mental model 
this is the actual "sink") and then dump that onto a canvas using 
.drawImage(video,...).  And then we extract that frame as an ImageData 
object (which is really just a wrapper for a Typed Array in .data) using 
.getImageData().

And underneath that is also an ArrayBuffer which we can use to minimise 
copying.
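
A minimal sketch of the frame-extraction step described above, written in
TypeScript. The interfaces FrameData and FrameContext and the helpers
grabFrame and pixelWords are hypothetical names introduced only for this
illustration; they are structural stand-ins modelled on the canvas 2D
context's drawImage() and getImageData() methods and on the ImageData
object, so the sketch does not need a browser to type-check. It assumes
the video element is already playing the MediaStream.

```typescript
// Stand-in for ImageData: RGBA bytes in .data, four per pixel.
interface FrameData {
  data: Uint8ClampedArray;
  width: number;
  height: number;
}

// Stand-in for the two canvas 2D context methods the pipeline uses.
// V is whatever the context can draw (a video element in the browser).
interface FrameContext<V> {
  drawImage(source: V, dx: number, dy: number, dw: number, dh: number): void;
  getImageData(sx: number, sy: number, sw: number, sh: number): FrameData;
}

// Hypothetical helper: dump the current video frame onto the canvas,
// then read the pixels back out as a FrameData object.
function grabFrame<V>(
  ctx: FrameContext<V>,
  video: V,
  width: number,
  height: number
): FrameData {
  ctx.drawImage(video, 0, 0, width, height);
  return ctx.getImageData(0, 0, width, height);
}

// Hypothetical helper: view the ArrayBuffer underneath .data as one
// 32-bit word per pixel. This creates a second view on the same
// buffer, so no pixel bytes are copied.
function pixelWords(frame: FrameData): Uint32Array {
  const { buffer, byteOffset, byteLength } = frame.data;
  return new Uint32Array(buffer, byteOffset, byteLength / 4);
}
```

Because pixelWords() returns a view rather than a copy, a write through
the Uint32Array is visible in frame.data and vice versa, which is the
"minimise copying" point made above.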

And then this pipeline may choose to display this content out to the 
user either in this Canvas (as another "sink"?) or in any other 
context (e.g. we often use that data to render WebGL overlays)...or may 
even just send the extracted features etc. over the network (would this 
be a "sink"?).

But in the end, for me it's the "Display" that is the true "sink"...just 
like the "camera" or "screen" is the real "source" and not the gUM API 
or the MediaStream itself.


First, do you see that any of this is incorrect?

And second, can you suggest some way I could communicate this type of 
relationship more clearly?


I admit that if you keep focused on the source/sink distinction[3] when 
looking at this diagram then on the right hand side you could feel it 
was not cleanly abstracted.  But just ending at an <img>, <video> or 
<audio> tag is not really the end so for me it's not really the final 
"sink" (as described above).

But from the other web developers and browser implementors I've 
discussed this with it seems this really clearly describes how I can 
flow data from a camera/screen (etc.) through various required steps 
right through until I render the results on a "Display".  And we can 
then profile performance and bottlenecks within each of the elements on 
the diagram.  And this is all I have really been trying to 
capture/communicate as this is what we are really wrestling with.
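
One rough way to profile each element of such a pipeline is to wrap every
stage in a timer and accumulate the elapsed time per stage name. The
sketch below is only an illustration: timeStage and the stage names
"extract" and "process" are hypothetical, the two stages are stand-ins
for real pipeline steps, and Date.now() is used purely because it is
available everywhere (it only has millisecond resolution).

```typescript
// Hypothetical helper: run one pipeline stage, add its elapsed time
// (in milliseconds) to the running total for that stage name, and
// return the stage's result unchanged.
function timeStage<T>(
  name: string,
  timings: Map<string, number>,
  stage: () => T
): T {
  const start = Date.now();
  try {
    return stage();
  } finally {
    const previous = timings.get(name) ?? 0;
    timings.set(name, previous + (Date.now() - start));
  }
}

// Stand-in stages: allocate a 640x480 RGBA frame, then sum its bytes.
const timings = new Map<string, number>();
const pixels = timeStage("extract", timings, () => new Uint8ClampedArray(640 * 480 * 4));
const total = timeStage("process", timings, () => {
  let sum = 0;
  for (let i = 0; i < pixels.length; i++) {
    sum += pixels[i];
  }
  return sum;
});
```

After a number of frames, the totals in the map give a first indication
of which element of the diagram is the bottleneck.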


BTW: If you think the examples I have been presenting are pushing the 
use of binary streams of data and image processing too far...then you 
should have seen the paper I just saw presented at ISMAR13 where the front 
camera was used to track the user's gaze while the back camera was used 
to track the real world scene.  This just doubled the size of these data 
streams and the compute resources required...but delivers a massive 
benefit in terms of User Experience for AR.  And this is just scratching 
the surface.

roBman

[1] 
https://github.com/buildar/getting_started_with_webrtc/#image_processing_pipelinehtml
[2] 
http://www.w3.org/TR/mediacapture-streams/#idl-def-NavigatorUserMediaSuccessCallback
[3] 
http://www.w3.org/TR/mediacapture-streams/#the-model-sources-sinks-constraints-and-states
Received on Wednesday, 2 October 2013 05:48:52 UTC