[whatwg] Access to live/raw audio and video stream data from both local and remote sources from Rob Manson on 2011-07-27 (public-whatwg-archive@w3.org from July 2011)

From: Rob Manson <roBman@mob-labs.com>
Date: Wed, 27 Jul 2011 10:56:11 +1000
Message-ID: <1311728171.2937.90113.camel@robslapu>
Hi,

sorry for posting across multiple groups, but I hope you'll see from my
comments below that this is really needed. 

This is definitely not intended as criticism of any of the work going
on.  It's intended as constructive feedback that hopefully provides
clarification on a key use case and its supporting requirements.

        "Access to live/raw audio and video stream data from both local
        and remote sources in a consistent way"

I've spent quite a bit of time trying to follow a clear thread of
requirements/solutions that provide API access to raw stream data (e.g.
audio, video, etc.).  But I'm a bit concerned this is falling in the gap
between the DAP and RTC WGs.  If this is not the case then please point
me to the relevant docs and I'll happily get back in my box 8)

Here's how the thread seems to flow at the moment based on public
documents.

On the DAP page [1] the mission states:
        "the Device APIs and Policy Working Group is to create
        client-side APIs that enable the development of Web Applications
        and Web Widgets that interact with devices services such as
        Calendar, Contacts, Camera, etc"

So it seems clear that this is the place to start.  Further down that
page the "HTML Media Capture" and "Media Capture" APIs are listed.

HTML Media Capture (camera/microphone interactions through HTML forms)
initially seems like a good candidate, however the intro in the latest
PWD [2] clearly states:
        "Providing streaming access to these capabilities is outside of
        the scope of this specification."

Followed by a NOTE that states:
        "The Working Group is investigating the opportunity to specify
        streaming access via the proposed <device> element."
        
The link on the "proposed <device> element" [3] links to a "no longer
maintained" document that then redirects to the top level of the whatwg
"current work" page [4].  On that page the most relevant link is the
video conferencing and peer-to-peer communication section [5].  More
about that further below.

So back to the DAP page to explore the other Media Capture API
(programmatic access to camera/microphone) [1] and its latest PWD [6].
The abstract states:
        "This specification defines an Application Programming Interface
        (API) that provides access to the audio, image and video capture
        capabilities of the device."

And the introduction states:
        "The Capture API defines a high-level interface for accessing
        the microphone and camera of a hosting device. It completes the
        HTML Form Based Media Capturing specification [HTMLMEDIACAPTURE]
        with a programmatic access to start a parametrized capture
        process."
        
So it seems clear that this is not related to streams in any way either.

The Notes column for this API on the DAP page [1] also states:
        "Programmatic API that completes the form based approach
        Need to check if still interest in this
        How does it relate with the Web RTC Working Group?"

Is there an updated position on this?

So if you then head over to the WebRTC WG's charter [7] it states:
        "...to define client-side APIs to enable Real-Time
        Communications in Web browsers.
        
        These APIs should enable building applications that can be run
        inside a browser, requiring no extra downloads or plugins, that
        allow communication between parties using audio, video and
        supplementary real-time communication, without having to use
        intervening servers..."
        
So this is clearly focused upon peer-to-peer communication "between"
systems and the stream related access is naturally just treated as an
ancillary requirement.  The scope section then states:
        "Enabling real-time communications between Web browsers require
        the following client-side technologies to be available:
        
        - API functions to explore device capabilities, e.g. camera,
        microphone, speakers (currently in scope for the Device APIs &
        Policy Working Group)
        - API functions to capture media from local devices (camera and
        microphone) (currently in scope for the Device APIs & Policy
        Working Group)
        - API functions for encoding and other processing of those media
        streams,
        - API functions for establishing direct peer-to-peer
        connections, including firewall/NAT traversal
        - API functions for decoding and processing (including echo
        cancelling, stream synchronization and a number of other
        functions) of those streams at the incoming end,
        - Delivery to the user of those media streams via local screens
        and audio output devices (partially covered with HTML5)"
        
So this is where I really start to feel the gap growing.  The DAP is
pointing to RTC, saying it's not sure whether its Camera/Microphone APIs
are being superseded by the work in the RTC...and the RTC then points
back to say it will be relying on work in the DAP.  However, the RTC's
Recommended Track Deliverables list does include:
        "Media Stream Functions, Audio Stream Functions and Video Stream
        Functions"

So then it's back to the whatwg MediaStream and LocalMediaStream current
work [8].  Following this through you end up back at the <audio> and
<video> media elements with some brief discussion about media data [9].

Currently the only API that I'm aware of that allows live access to the
audio data through the <audio> tag is the relatively proprietary Mozilla
Audio Data API [10].

And while the video stream data can be accessed by rendering each frame
into a canvas 2d graphics context and then using getImageData to extract
and manipulate it from there [11], this seems more like a workaround
than an elegantly designed solution.
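
To make the canvas workaround concrete, here is a minimal sketch of the
per-frame processing step.  It assumes the pixels arrive as the RGBA byte
buffer that getImageData returns (4 bytes per pixel); the FrameLike type
and the toGreyscale function are illustrative names, not part of any
spec, and the browser loop is only shown as a comment.

```typescript
// Same shape as the ImageData object returned by getImageData():
// RGBA bytes, 4 per pixel, in row-major order.
interface FrameLike {
  width: number;
  height: number;
  data: Uint8ClampedArray;
}

// Convert a frame to greyscale in place using the Rec. 601 luma weights.
function toGreyscale(frame: FrameLike): FrameLike {
  const px = frame.data;
  for (let i = 0; i < px.length; i += 4) {
    const luma = 0.299 * px[i] + 0.587 * px[i + 1] + 0.114 * px[i + 2];
    // Uint8ClampedArray rounds and clamps on assignment.
    px[i] = luma;
    px[i + 1] = luma;
    px[i + 2] = luma;
    // px[i + 3] (alpha) is left untouched.
  }
  return frame;
}

// In a browser, the per-frame loop would look roughly like this (not run here):
//   ctx.drawImage(video, 0, 0, w, h);
//   const frame = ctx.getImageData(0, 0, w, h);
//   toGreyscale(frame);
//   ctx.putImageData(frame, 0, 0);
```

Every frame makes a round trip through the canvas before any processing
can start, which is the indirection described above.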
 
As I said above, this is not intended as a criticism of the work that
the DAP WG, WebRTC WG or WHATWG are doing.  It's intended as
constructive feedback to highlight that the important use case of
"Access to live/raw audio and video stream data from both local and
remote sources" appears to be falling in the gaps between the groups. 

From my perspective this is a critical use case for many advanced web
apps that will help bring them in line with what's possible in the
native, single-vendor-stack apps at the moment (e.g. iPhone &
Android).  And it's also critical for the advancement of
web-standards-based AR applications and other computer vision, hearing
and signal processing applications.

I understand that a lot of these specifications I've covered are in very
formative stages and that requirements and PWDs are just being drafted
as I write.  And that's exactly why I'm raising this as a single and
consolidated perspective that spans all these groups.  I hope this goes
some way towards "Access to live/raw audio and video stream data from
both local and remote sources" being treated as an essential and core
use case that binds together the work of all these groups.  With a clear
vision for this and a little consolidated work I think this will then
also open up a wide range of other app opportunities that we haven't
even thought of yet.  But at the moment it really feels like this is
being treated as an assumed requirement and could end up as a poorly
formed, second-class bundle of semi-related API hooks.

For this use case I'd really like these clear requirements to be
supported:
- access the raw stream data for both audio and video in similar ways
- access the raw stream data from both remote and local streams in
similar ways
- ability to inject new data or the transformed original data back into
streams and presented audio/video tags in a consistent way
- all of this to be optimised for performance to meet the demands of live
signal processing
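
As a rough illustration only, the first three requirements could be read
as one small interface shared by every stream.  Everything in this sketch
is hypothetical: RawStream, RawChunk, transform, onData and push are
made-up names that do not appear in any DAP, WebRTC or WHATWG draft, and
the sketch says nothing about the performance requirement.

```typescript
// HYPOTHETICAL sketch: none of these names exist in any DAP, WebRTC or WHATWG draft.
type Kind = "audio" | "video";
type Origin = "local" | "remote";

interface RawChunk<T> {
  kind: Kind;
  origin: Origin;
  timestampMs: number;
  data: T; // e.g. Float32Array samples or RGBA bytes
}

type Transform<T> = (chunk: RawChunk<T>) => RawChunk<T>;
type Sink<T> = (chunk: RawChunk<T>) => void;

class RawStream<T> {
  kind: Kind;
  origin: Origin;
  private transforms: Transform<T>[] = [];
  private sinks: Sink<T>[] = [];

  constructor(kind: Kind, origin: Origin) {
    this.kind = kind;
    this.origin = origin;
  }

  // Register a step that rewrites raw data before it is presented.
  transform(fn: Transform<T>): this {
    this.transforms.push(fn);
    return this;
  }

  // Register a consumer (standing in for an <audio>/<video> element).
  onData(sink: Sink<T>): this {
    this.sinks.push(sink);
    return this;
  }

  // Feed raw data in; it passes through every transform, then reaches every sink.
  push(data: T, timestampMs: number): void {
    let chunk: RawChunk<T> = { kind: this.kind, origin: this.origin, timestampMs, data };
    for (const fn of this.transforms) chunk = fn(chunk);
    for (const sink of this.sinks) sink(chunk);
  }
}

// The same calls serve a local audio feed and a remote video feed.
const mic = new RawStream<Float32Array>("audio", "local");
mic.transform((c) => ({ ...c, data: c.data.map((s) => s * 0.5) })); // halve the volume

const remoteVideo = new RawStream<Uint8ClampedArray>("video", "remote");
// Invert every byte (alpha included, for brevity).
remoteVideo.transform((c) => ({ ...c, data: c.data.map((v) => 255 - v) }));
```

The point of the sketch is only that audio and video, local and remote,
go through identical calls, and that transformed data flows back out to
whatever is presenting the stream.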

roBman

PS: I've also cc'd in the mozilla dev list as I think this directly
relates to the current "booting to the web" thread [12]


[1] http://www.w3.org/2009/dap/
[2] http://www.w3.org/TR/2011/WD-html-media-capture-20110414/#introduction
[3] http://dev.w3.org/html5/html-device/ 
[4] http://www.whatwg.org/specs/web-apps/current-work/complete/#devices 
[5] http://www.whatwg.org/specs/web-apps/current-work/complete/#auto-toc-9
[6] http://www.w3.org/TR/2010/WD-media-capture-api-20100928/
[7] http://www.w3.org/2011/04/webrtc-charter.html
[8] http://www.whatwg.org/specs/web-apps/current-work/complete/video-conferencing-and-peer-to-peer-communication.html#mediastream 
[9] http://www.whatwg.org/specs/web-apps/current-work/complete/the-iframe-element.html#media-data
[10] https://wiki.mozilla.org/Audio_Data_API
[11] https://developer.mozilla.org/En/Manipulating_video_using_canvas
[12] http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/7668a9d46a43e482#
Received on Tuesday, 26 July 2011 17:56:11 UTC