Feedback on the MediaStream Capture Post-Processing scenarios

Hi,

Here's some feedback on, and questions about, the MediaStream Capture 
Scenarios[1] from an Augmented Web[2] perspective.  So I guess this is 
for Travis, but as always all answers and comments are welcome 8)


3.3 Find the ball assignment (media processing and recording)[3]
-----------------------------------------------------------------
   "Alice is finishing up a college on-line course on image
    processing..."

I think it's definitely important to include image processing scenarios 
in this document; however, I don't think this scenario captures how 
critical image processing will be for the Augmented Web.  A more 
pragmatic example that people might relate to more closely would be 
"QR code scanning".  So instead of "detecting a blue ball", it could be 
"detecting a QR code".  There are existing libraries that can be used 
for this[4].
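
As a rough, hedged sketch only: a library like jsqrcode[4] could slot 
straight into the canvas pipeline described further down.  The 
qrcode.callback / qrcode.decode() usage below is just my reading of 
that library's API, so treat the names as assumptions.

   // Sketch only: assumes a <canvas id="qr-canvas"> that already holds
   // the current camera frame (see the pipeline sketches further down).
   qrcode.callback = function (result) {
     console.log('Decoded QR payload: ' + result);
   };
   qrcode.decode();  // as I read it, this decodes the "qr-canvas" canvas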


3.n
---
I would like to propose the addition of a number of other 
stream-processing-based scenarios to flesh out this area further.
Here's a list:
- QR/barcode scanning
- pitch detection
- voice commands
- head/gesture tracking
- facial recognition
- fiducial marker tracking
- natural feature tracking


8.5 Pre-processing vs 8.6 Post-processing
-----------------------------------------
The pre/post distinction seems to be based on two types as described 
here[5].

   a. realtime
      pre is before the stream is connected to a sink (e.g. <video>
      element) and post is after.

   b. recorded
      pre is before the stream is captured "to a known MIME format" and
      post is after.

However, I'm not sure this distinction has strictly been applied to the 
content in those sections.  Or am I misunderstanding this distinction?

For example, 8.5.1 example 3 is "Face-recognition and gesture 
detection".  Surely face/gesture detection and face recognition could 
only be done in post for realtime, and in both pre and post for 
recorded.  Based on the six-item list in "8.6.1 Web platform 
post-processing toolbox" it's hard to see how "face-recognition" could 
be done without connecting the video stream to a sink <video> element. 
So for realtime (i.e. not recorded) this would really be 
post-processing, wouldn't it? (i.e. realtime after being connected to 
a sink).

Perhaps the goals of using this distinction here could be met in a 
simpler way?


Media Capture vs Recording
--------------------------
In 2. Concepts and Definitions "Media Capture" is defined as "obtaining 
a stream of data from a device" and "Recording" is defined as "capture 
of media under application control and in a specific, known, format".

It's a little confusing that the second part of this ("Recording") uses 
the word "capture", which also appears in the name of the first part 
("Media Capture").

Plus, I'm not sure this distinction is completely clear either.

a. With the current image stream processing pipeline you connect a 
stream to a <video> element, draw that into a <canvas>, and then 
extract the ImageData from there in a loop driven by 
requestAnimationFrame() or setTimeout().
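
A minimal sketch of (a), assuming the variable "stream" comes from 
getUserMedia() and that processFrame() is a hypothetical function that 
consumes the ImageData:

   // MediaStream -> <video> -> <canvas> -> ImageData, polled per frame.
   var video = document.createElement('video');
   video.src = URL.createObjectURL(stream);
   video.play();

   var canvas = document.createElement('canvas');
   var ctx = canvas.getContext('2d');

   function tick() {
     if (video.videoWidth > 0) {
       canvas.width = video.videoWidth;
       canvas.height = video.videoHeight;
       ctx.drawImage(video, 0, 0);
       processFrame(ctx.getImageData(0, 0, canvas.width, canvas.height));
     }
     requestAnimationFrame(tick);
   }
   requestAnimationFrame(tick);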

b. With the Mediastream Image Capture API you extract a video track 
from a stream, use that track to create an ImageCapture object, and 
then call getFrame() on it to extract the ImageData, again in a loop 
driven by requestAnimationFrame() or setTimeout().
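
A rough sketch of (b), based on my reading of the Image Capture draft. 
The onframegrab handler and event.imageData attribute names are my 
assumptions about the draft and may not be exact; processFrame() is the 
same hypothetical consumer as above:

   // Extract a video track, wrap it in an ImageCapture object and poll
   // getFrame() from a requestAnimationFrame() loop.
   var track = stream.getVideoTracks()[0];
   var imageCapture = new ImageCapture(track);

   imageCapture.onframegrab = function (event) {  // name is an assumption
     processFrame(event.imageData);               // hypothetical consumer
   };

   function grabLoop() {
     imageCapture.getFrame();
     requestAnimationFrame(grabLoop);
   }
   requestAnimationFrame(grabLoop);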

c. With the MediaStream Recording API you connect a stream to a 
MediaRecorder object and call start() with a timeslice to extract a 
Blob of data at regular intervals.
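
And a sketch of (c), again assuming "stream" comes from getUserMedia(); 
processBlob() is a hypothetical consumer and the 1000 ms timeslice is 
arbitrary:

   // MediaRecorder delivers a Blob per timeslice via dataavailable
   // events, so the data can be consumed while recording continues.
   var recorder = new MediaRecorder(stream);

   recorder.ondataavailable = function (event) {
     processBlob(event.data);  // event.data is a Blob
   };

   recorder.start(1000);       // ask for a Blob roughly every 1000 ms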

But for all three of these pipelines, including the "Recording" 
example, the frame data can be accessed before the "capture" is 
complete.  So even "Recording" can behave like "realtime" from a 
data-processing perspective.


8.6.2 Time sensitivity and performance
--------------------------------------

   "Some post-processing scenarios are time-sensitive—especially those
    scenarios that involve processing large amounts of data while the
    user waits."

I think real-time applications are the most time-sensitive.  For 
example, face recognition or gesture tracking needs to be fast and 
responsive with little or no lag; otherwise, at best, it can feel like 
the user interface is swimming.


Numbering?
----------
I think that sections 4, 5 and 6 should really be moved in one level so 
they become 3.4, 3.5 and 3.6, and all their children should move in as 
well.



I hope this feedback is clear and useful.  I know it's a little long, 
so if you'd like me to break any of it out into separate email messages 
just let me know.

roBman



[1] https://dvcs.w3.org/hg/dap/raw-file/tip/media-stream-capture/scenarios.html
[2] http://www.w3.org/community/ar/
[3] https://dvcs.w3.org/hg/dap/raw-file/tip/media-stream-capture/scenarios.html#find-the-ball-assignment-media-processing-and-recording
[4] https://github.com/LazarSoft/jsqrcode
[5] https://dvcs.w3.org/hg/dap/raw-file/tip/media-stream-capture/scenarios.html#post-processing
