- From: Mercurial notifier <cvsmail@w3.org>
- Date: Tue, 06 Dec 2011 01:33:48 +0000
- To: public-dap-commits@w3.org
changeset: 36:d21e515ff4f5 tag: tip user: tleithea date: Mon Dec 05 17:32:19 2011 -0800 files: media-stream-capture/scenarios.html description: First Draft of Scenarios Document (including a bunch of commentary and issues) diff -r 5185030da020 -r d21e515ff4f5 media-stream-capture/scenarios.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/media-stream-capture/scenarios.html Mon Dec 05 17:32:19 2011 -0800 @@ -0,0 +1,863 @@ +<!DOCTYPE html> +<html> + <head> + <title>MediaStream Capture Scenarios</title> + <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/> + <script type="text/javascript" src='http://dev.w3.org/2009/dap/ReSpec.js/js/respec.js' class='remove'></script> + <script type="text/javascript" src='http://dev.w3.org/2009/dap/ReSpec.js/js/sh_main.min.js' class='remove'></script> + <script type="text/javascript" class='remove'> + var respecConfig = { + specStatus: "CG-NOTE", + editors: [{ + name: "Travis Leithead", + company: "Microsoft Corp.", + url: "mailto:travis.leithead@microsoft.com?subject=MediaStream Capture Scenarios Feedback", + companyURL: "http://www.microsoft.com"}], + previousPublishDate: null, + noIDLIn: true, + }; + </script> + <script type="text/javascript" src='http://dev.w3.org/2009/dap/common/config.js' class='remove'></script> + <style type="text/css"> + /* ReSpec.js CSS optimizations (Richard Tibbett) - cut-n-paste :) */ + div.example { + border-top: 1px solid #ff4500; + border-bottom: 1px solid #ff4500; + background: #fff; + padding: 1em; + font-size: 0.9em; + margin-top: 1em; + } + div.example::before { + content: "Example"; + display: block; + width: 150px; + background: #ff4500; + color: #fff; + font-family: initial; + padding: 3px; + padding-left: 5px; + font-weight: bold; + margin: -1em 0 1em -1em; + } + + /* Clean up pre.idl */ + pre.idl::before { + font-size:0.9em; + } + + /* Add better spacing to sections */ + section, .section { + margin-bottom: 2em; + } + + /* Reduce note & issue render size */ + .note, .issue { + font-size:0.8em; + } + + /* Add addition spacing to <ol> and <ul> for rule definition */ + ol.rule li, ul.rule li { + padding:0.2em; + } + </style> + </head> + + <body> + <section id='abstract'> + <p> + This document collates the target scenarios for the Media Capture task force. Scenarios represent + the set of expected functionality that may be achieved by the use of the MediaStream Capture API. A set of + un-supported scenarios may also be documented here. + </p> + <p>This document builds on the assumption that the mechanism for obtaining fundamental access to local media + capture device(s) is <code>navigator.getUserMedia</code> (name/behavior subject to this task force), and that + the vehicle for delivery of the content from the local media capture device(s) is a <code>MediaStream</code>. + Hence the title of this note. + </p> + </section> + + <section id='sotd'> + <p> + This document will eventually represent the consensus of the media capture task force on the set of scenarios + supported by the MediaStream Capture API. If you wish to make comments regarding this document, please + send them to <a href="mailto:public-media-capture@w3.org">public-media-capture@w3.org</a> ( + <a href="mailto:public-media-capture-request@w3.org?subject=subscribe">subscribe</a>, + <a href="http://lists.w3.org/Archives/Public/public-media-capture/">archives</a>). 
+ </p> + </section> + + <section class="informative"> + <h2>Introduction</h2> + <p> + One of the goals of the joint task force between the Device and Policy working group and the Web Real Time + Communications working groups is to bring media capture scenarios from both groups together into one unified + API that can address all relevant use cases. + </p> + <p> + The capture scenarios from WebRTC are primarily driven from real-time-communication-based scenarios, such as + the recording of live chats, teleconferences, and other media streamed from over the network from potentially + multiple sources. + </p> + <p> + The capture scenarios from DAP are primarily driven from "local" capture scenarios related to providing access + to a user agent's camera and related experiences. + </p> + <p> + Both groups include overlapping chartered deliverables in this space. Namely in DAP, + <a href="http://www.w3.org/2009/05/DeviceAPICharter">the charter specifies a recommendation-track deliverable</a>: + <ul> + <li> + <dt>Camera API</dt> + <dd>an API to manage a device's camera e.g. to take a picture</dd> + </li> + </ul> + </p> + <p> + And <a href="http://www.w3.org/2011/04/webrtc-charter.html">WebRTC's charter scope</a> describes enabling + real-time communications between web browsers that will require specific client-side technologies: + <ul> + <li>API functions to explore device capabilities, e.g. camera, microphone, speakers (currently in scope + for the <a href="http://www.w3.org/2009/dap/">Device APIs & Policy Working Group</a>)</li> + <li>API functions to capture media from local devices (camera and microphone) (currently in scope for the + <a href="http://www.w3.org/2009/dap/">Device APIs & Policy Working Group</a>)</li> + <li>API functions for encoding and other processing of those media streams,</li> + <li>API functions for decoding and processing (including echo cancelling, stream synchronization and a + number of other functions) of those streams at the incoming end,</li> + <li>Delivery to the user of those media streams via local screens and audio output devices (partially + covered with HTML5)</li> + </ul> + </p> + <p> + Note, that the scenarios described in this document specifically exclude peer-to-peer and networking scenarios + that do not overlap with local capture scenarios, as these are not considered in-scope for this task force. + </p> + <p> + Also excluded are scenarios that involve declarative capture scenarios, such as those where media capture can be + obtained and submitted to a server entirely without the use of script. Such scenarios generally involve the use + of a UA-specific app or mode for interacting with the recording device, altering settings and completing the + capture. Such scenarios are currently captured by the DAP working group's <a href="http://dev.w3.org/2009/dap/camera/">HTML Media Capture</a> + specification. + </p> + <p> + The scenarios contained in this document are specific to scenarios in which web applications require direct access + to the capture device, its settings, and the recording mechanism and output. Such scenarios have been deemed + crucial to building applications that can create a site-specific look-and-feel to the user's interaction with the + capture device, as well as utilize advanced functionality that may not be available in a declarative model. + </p> + </section> + + <!-- Travis: No conformance section necessary? 
+ + <section id='conformance'> + <p> + This specification defines conformance criteria that apply to a single product: the + <dfn id="ua">user agent</dfn> that implements the interfaces that it contains. + </p> + <p> + Implementations that use ECMAScript to implement the APIs defined in this specification must implement + them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification + [[!WEBIDL]], as this specification uses that specification and terminology. + </p> + <p> + A conforming implementation is required to implement all fields defined in this specification. + </p> + + <section> + <h2>Terminology</h2> + <p> + The terms <dfn>document base URL</dfn>, <dfn>browsing context</dfn>, <dfn>event handler attribute</dfn>, + <dfn>event handler event type</dfn>, <dfn>task</dfn>, <dfn>task source</dfn> and <dfn>task queues</dfn> + are defined by the HTML5 specification [[!HTML5]]. + </p> + <p> + The <a>task source</a> used by this specification is the <dfn>device task source</dfn>. + </p> + <p> + To <dfn>dispatch a <code>success</code> event</dfn> means that an event with the name + <code>success</code>, which does not bubble and is not cancellable, and which uses the + <code>Event</code> interface, is to be dispatched at the <a>ContactFindCB</a> object. + </p> + <p> + To <dfn>dispatch an <code>error</code> event</dfn> means that an event with the name + <code>error</code>, which does not bubble and is not cancellable, and which uses the <code>Event</code> + interface, is to be dispatched at the <a>ContactErrorCB</a> object. + </p> + </section> + </section> + --> + + <section> + <h2>Concepts and Definitions</h2> + <p> + This section describes some terminology and concepts that frame an understanding of the scenarios that + follow. It is helpful to have a common understanding of some core concepts to ensure that the scenarios + are interpreted uniformly. + </p> + <dl> + <dt>Stream</dt> + <dd>A stream including the implied derivative + <code><a href="http://dev.w3.org/2011/webrtc/editor/webrtc.html#introduction">MediaStream</a></code>, + can be conceptually understood as a tube or conduit between a source (the stream's generator) and a + destination (the sink). Streams don't generally include any type of significant buffer, that is, content + pushed into the stream from a source does not collect into any buffer for later collection. Rather, content + is simply dropped on the floor if the stream is not connected to a sink. This document assumes the + non-buffered view of streams as previously described. + </dd> + <dt><code>MediaStream</code> vs "media stream"</dt> + <dd>In some cases, I use these two terms interchangeably; my usage of the term "media stream" is intended as + a generalization of the more specific <code>MediaStream</code> interface as currently defined in the + WebRTC spec.</dd> + <dt><code>MediaStream</code> format</dt> + <dd>As stated in the WebRTC specification, the content flowing through a <code>MediaStream</code> is not in + any particular underlying format:</dd> + <dd><blockquote>[The data from a <code>MediaStream</code> object does not necessarily have a canonical binary form; for + example, it could just be "the video currently coming from the user's video camera". 
This allows user agents + to manipulate media streams in whatever fashion is most suitable on the user's platform.]</blockquote></dd> + <dd>This document reinforces that view, especially when dealing with recording of the <code>MediaStream</code>'s content + and the potential interaction with the <a href="http://dvcs.w3.org/hg/webapps/raw-file/tip/StreamAPI/Overview.htm">Streams API</a>. + </dd> + <dt>Virtualized device</dt> + <dd>Device virtualization (in my simplistic view) is the process of abstracting the settings for a device such + that code interacts with the virtualized layer, rather than with the actual device itself. Audio devices are + commonly virtualized. This allows many applications to use the audio device at the same time and apply + different audio settings like volume independently of each other. It also allows audio to be interleaved on + top of each other in the final output to the device. In some operating systems, such as Windows, a webcam's + video source is not virtualized, meaning that only one application can have control over the device at any + one time. In order for an app to use the webcam, either another app already using the webcam must yield it up + or the new app must "steal" the camera from the previous app. An API could be exposed from a device that + changes the device configuration in such a way that prevents that device from being virtualized--for example, + if a "zoom" setting were applied to a webcam device. Changing the zoom level on the device itself would affect + all potential virtualized versions of the device, and therefore defeat the virtualization.</dd> + </dl> + </section> + + <section> + <h2>Media Capture Scenarios</h2> + + <section> + <h3>Stream initialization</h3> + <p>A web application must be able to initiate a request for access to the user's webcam(s) and/or microphone(s). + Additionally, the web application should be able to "hint" at specific device characteristics that are desired by + the particular usage scenario of the application. User consent is required before obtaining access to the requested + stream.</p> + <p>When the media capture devices have been obtained (after user consent), the associated stream should be active + and populated with the appropriate devices (likely in the form of tracks to re-use an existing + <code>MediaStream</code> concept). The active capture devices will be configured according to user preference; the + user may have an opportunity to configure the initial state of the devices, select specific devices, and/or elect + to enable/disable a subset of the requested devices at the point of consent or beyond (the user remains in control). + </p> + <section> + <h4>Privacy</h4> + <p>Specific information about a given webcam and/or microphone must not be available until after the user has + granted consent. Otherwise "drive-by" fingerprinting of a UA's devices and characteristics can be obtained without + the user's knowledge—a privacy issue.</p> + </section> + + <p>The <code>navigator.getUserMedia</code> API fulfills these scenarios today.</p> + + <section> + <h4>Issues</h4> + <ul> + <li>What are the privacy/fingerprinting implications of the current "error" callback? Is it sufficiently "scary" + to warrant a change? Consider the following: + <ul> + <li>If the user doesn’t have a webcam/mic, and the developer requests it, a UA would be expected to invoke + the error callback immediately.</li> + <li>If the user does have a webcam/mic, and the developer requests it, a UA would be expected to prompt for + access.
If the user denies access, then the error callback is invoked.</li> + <li>Depending on the timing of the invocation of the error callback, scripts can still profile whether the + UA does or does not have a given device capability.</li> + </ul> + </li> + <li>In the case of a user with multiple video and/or audio capture devices, what specific permission is expected to + be granted for the "video" and "audio" options presented to <code>getUserMedia</code>? For example, does "video" + permission mean that the user grants permission to any and all video capture devices? Similarly with "audio"? Is + it a specific device only, and if so, which one? Given the privacy point above, my recommendation is that "video" + permission represents permission to all possible video capture devices present on the user's device, therefore + enabling switching scenarios (among video devices) to be possible without re-acquiring user consent. Same for + "audio" and combinations of the two. + </li> + <li>When a user has only one of two requested device capabilities (for example only "audio" but not "video", and both + "audio" and "video" are requested), should access be granted without the video or should the request fail? + </li> + </ul> + </section> + </section> + + <section> + <h3>Stream re-initialization</h3> + + <p>After requesting (and presumably gaining access to media capture devices) it is entirely possible for one or more of + the requested devices to stop or fail (for example, if a video device is claimed by another application, or if the user + unplugs a capture device or physically turns it off, or if the UA shuts down the device arbitrarily to conserve battery + power). In such a scenario it should be reasonably simple for the application to be notified of the situation, and for + the application to re-request access to the stream. + </p> + <p>Today, the <code>MediaStream</code> offers a single <code>ended</code> event. This could be sufficient for this + scenario. + </p> + <p>Additional information might also be useful either in terms of <code>MediaStream</code> state such as an error object, + or additional events like an <code>error</code> event (or both). + </p> + + <section> + <h4>Issues</h4> + <ul> + <li>How shall the stream be re-acquired efficiently? Is it merely a matter of re-requesting the entire + <code>MediaStream</code>, or can an "ended" mediastream be quickly revived? Reviving a local media stream makes + more sense in the context of the stream representing a set of device states, than it does when the stream + represents a network source. + </li> + <li>What's the expected interaction model with regard to user-consent? For example, if the re-initialization + request is for the same device(s), will the user be prompted for consent again? + </li> + <li>How can tug-of-war scenarios be avoided between two web applications both attempting to gain access to a + non-virtualized device at the same time? + </li> + </ul> + </section> + </section> + + <section> + <h3>Preview a stream</h3> + <p>The application should be able to connect a media stream (representing active media capture device(s) to a sink + in order to "see" the content flowing through the stream. In nearly all digital capture scenarios, "previewing" + the stream before initiating the capture is essential to the user in order to "compose" the shot (for example, + digital cameras have a preview screen before a picture or video is captured; even in non-digital photography, the + viewfinder acts as the "preview"). 
This is particularly important for visual media, but also for non-visual media + like audio. + </p> + <p>Note that media streams connected to a preview output sink are not in a "recording" state, as the media stream has + no default buffer (see the <a>Stream</a> definition in section 2). Content conceptually "within" the media stream + is streaming from the capture source device to the preview sink, after which the content is dropped (not + saved). + </p> + <p>The application should be able to effect changes to the media capture device(s) settings via the media stream + and see those changes take effect in the preview. + </p> + <p>Today, the <code>MediaStream</code> object can be connected to several "preview" sinks in HTML5, including the + <code>video</code> and <code>audio</code> elements. (This support should also extend to the <code>source</code> + elements of each as well.) The connection is accomplished via <code>URL.createObjectURL</code>. + </p> + <p>These concepts are fully supported by the current WebRTC specification.</p> + <section> + <h4>Issues</h4> + <ul> + <li>Audio tag preview is somewhat problematic because of the acoustic feedback problem (interference that results + when a microphone input picks up the output of a nearby speaker, creating a loop). There are + software solutions that attempt to automatically compensate for these types of feedback problems. However, it + may not be appropriate to require all implementations to support such an acoustic feedback prevention + algorithm. Therefore, audio preview could be turned off by default and only enabled by specific opt-in. + Could implementations without acoustic feedback prevention simply decline to enable the opt-in? + </li> + <li>It makes a lot of sense for a 1:1 association between the source and sink of a media stream; for example, + one media stream to one video element in HTML5. It is less clear what the value might be of supporting 1:many + media stream sinks—for example, it could be a significant performance load on the system to preview a media + stream in multiple video elements at once. Implementation feedback here would be valuable. It would also be + important to understand the scenario that requires a 1:many viewing of a single media stream. + </li> + <li>Are there any use cases for stopping or re-starting the preview (exclusively) that are sufficiently different + from the following scenarios? + <ul> + <li>Stopping/re-starting the device(s)—at the source of the media stream.</li> + <li>Assigning/clearing the URL from media stream sinks.</li> + <li>createObjectURL/revokeObjectURL – for controlling the [subsequent] connections to the media stream sink + via a URL. + </li> + </ul> + </li> + </ul> + </section> + </section> + + <section> + <h3>Stopping local devices</h3> + <p>End-users need to feel in control of their devices. Likewise, it is expected that developers using a media stream + capture API will want to provide a mechanism for users to stop their in-use device(s) via the software (rather than + using hardware on/off buttons, which may not always be available). + </p> + <p>Stopping or ending a media stream source device(s) in this context implies that the media stream source device(s) + cannot be re-started. This is a distinct scenario from simply "muting" the video/audio tracks of a given media stream. + </p> + <p>The current WebRTC draft describes a <code>stop</code> API on a <code>LocalMediaStream</code> interface, whose + purpose is to stop the media stream at its source.
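+        The sketch below shows how acquisition, preview, and stopping fit together under these assumptions; the exact
+        <code>getUserMedia</code> signature was still in flux at the time of writing, so the options argument, the error
+        object, and the element IDs shown here are illustrative rather than settled API.
+      </p>
+      <pre class="sh_javascript">
+navigator.getUserMedia({ audio: true, video: true }, function (stream) {
+  // Preview: connect the LocalMediaStream to a &lt;video&gt; sink.
+  var preview = document.querySelector('video#preview');
+  preview.src = URL.createObjectURL(stream);
+  preview.play();
+
+  // Stop: end the capture devices at their source (they cannot be re-started).
+  document.querySelector('#stop').onclick = function () {
+    stream.stop();
+  };
+}, function (error) {
+  // No matching device, consent denied, device claimed by another app, etc.
+});</pre>
+      <p>Note that <code>stop()</code> ends all of the stream's source devices at once; see the issue below about stopping
+        a single device.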
+ </p> + <section> + <h4>Issues</h4> + <ul> + <li>Is there a scenario where end-users will want to stop just a single device, rather than all devices participating + in the current media stream? + </li> + </ul> + </section> + </section> + + <section> + <h3>Pre-processing</h3> + <p>Pre-processing scenarios are a bucket of scenarios that perform processing on the "raw" or "internal" characteristics + of the media stream for the purpose of reporting information that would otherwise require processing of a known + format (i.e., at the media stream sink—like Canvas, or via recording and post-processing), significant + computationally-expensive scripting, etc. + </p> + <p>Pre-processing scenarios will require the UAs to provide an implementation (which may be non-trivial). This is + required because the media stream has no internal format upon which a script-based implementation could be derived + (and I believe advocating for the specification of such a format is unwise). + </p> + <p>Pre-processing scenarios provide information that is generally needed <i>before</i> a stream need be connected to a + sink or recorded. + </p> + <p>Pre-processing scenarios apply to both real-time-communication and local capture scenarios. Therefore, the + specification of various pre-processing requirements may likely fall outside the scope of this task force. However, + they are included here for scenario-completeness and to help ensure that a media capture API design takes them into + account. + </p> + <section> + <h4>Examples</h4> + <ul> + <li>Audio end-pointing. As described in <a href="http://lists.w3.org/Archives/Public/www-archive/2011Mar/att-0001/microsoft-api-draft-final.html">a + speech API proposal</a>, audio end-pointing allows for the detection of noise, speech, or silence and raises events + when these audio states change. End-pointing is necessary for scenarios that programmatically determine when to + start and stop recording an audio stream for purposes of hands-free speech commands, dictation, and a variety of + other speech and accessibility-related scenarios. The proposal linked above describes these scenarios in better + detail. Audio end-pointing would be required as a pre-processing scenario because it is a prerequisite to + starting/stopping a recorder of the media stream itself. + </li> + <li>Volume leveling/automatic gain control. The ability to automatically detect changes in audio loudness and adjust + the input volume such that the output volume remains constant. These scenarios are useful in a variety of + heterogeneous audio environments such as teleconferences, live broadcasting involving commercials, etc. + Configuration options for volume/gain control of a media stream source device are also useful, and are explored + later on. + </li> + <li>Video face-recognition and gesture detection. These scenarios are the visual analog to the previously described + audio end-pointing scenarios. Face-recognition is useful in a variety of contexts from identifying faces in family + photographs, to serving as part of an identity management system for system access. Likewise, gesture recognition + can act as an input mechanism for a computer. + </li> + </ul> + </section> + <section> + <h4>Issues</h4> + <ul> + <li>In general the set of audio pre-processing scenarios is much more constrained than the set of possible visual + pre-processing scenarios. 
Due to the large set of visual pre-processing scenarios (which could also be implemented + by scenario-specific post-processing in most cases), we may recommend that visual-related pre-processing + scenarios be excluded from the scope of our task force. + </li> + <li>The challenge of specifying pre-processing scenarios will be identifying what specific information should be + conveyed by the platform at a level that serves the widest variety of scenarios. For example, + audio end-pointing could be specified in high-level terms of firing events when specific words of a given language + are identified, or could be as low-level as reporting when there is silence/background noise and when there's not. + Not every scenario will be served by whatever API is designed; therefore, this group might choose to + evaluate which scenarios (if any) are worth including in the first version of the API. + </li> + </ul> + </section> + </section> + + <section> + <h3>Post-processing</h3> + <p>Post-processing scenarios are the group of scenarios that can be completed after either:</p> + <ol> + <li>Connecting the media stream to a sink (such as the <code>video</code> or <code>audio</code> elements)</li> + <li>Recording the media stream to a known format (MIME type)</li> + </ol> + <p>Post-processing scenarios will continue to expand and grow as the web platform matures and gains capabilities. + The key to understanding the available post-processing scenarios is to understand the other facets of the web + platform that are available for use. + </p> + <section> + <h4>Web platform post-processing toolbox</h4> + <p>The common post-processing capabilities for media stream scenarios are built on a relatively small set of web + platform capabilities: + </p> + <ul> + <li>HTML5 <a href="http://dev.w3.org/html5/spec/Overview.html#the-video-element"><code>video</code></a> and + <a href="http://dev.w3.org/html5/spec/Overview.html#the-audio-element"><code>audio</code></a> tags. These elements are natural + candidates for media stream output sinks. Additionally, they provide an API (see + <a href="http://dev.w3.org/html5/spec/Overview.html#htmlmediaelement">HTMLMediaElement</a>) for interacting with + the source content. Note: in some cases, these elements are not well-specified for stream-type sources—this task + force may need to drive some stream-source requirements into HTML5. + </li> + <li>HTML5 <a href="http://dev.w3.org/html5/spec/Overview.html#the-canvas-element"><code>canvas</code></a> element + and the <a href="http://dev.w3.org/html5/2dcontext/">Canvas 2D context</a>. The <code>canvas</code> element employs + a fairly extensive 2D drawing API and will soon be extended with audio capabilities as well (<b>RichT, can you + provide a link?</b>). Canvas' drawing API allows for drawing frames from a <code>video</code> element, which is + the link between the media capture sink and the effects made possible via Canvas. + </li> + <li><a href="http://dev.w3.org/2006/webapi/FileAPI/">File API</a> and + <a href="http://www.w3.org/TR/file-writer-api/">File API Writer</a>. The File API provides various methods for + reading and writing to binary formats. The fundamental container for these binary files is the <code>Blob</code>, + which, put simply, is a read-only structure with a MIME type and a length. The File API integrates with many other + web APIs such that the <code>Blob</code> can be used uniformly across the entire web platform.
For example, + <code>XMLHttpRequest</code>, form submission in HTML, message passing between documents and web workers + (<code>postMessage</code>), and Indexed DB all support <code>Blob</code> use. + </li> + <li><a href="http://dvcs.w3.org/hg/webapps/raw-file/tip/StreamAPI/Overview.htm">Stream API</a>. A new addition to + the WebApps WG, the <code>Stream</code> is another general-purpose binary container. The primary differences + between a <code>Stream</code> and a <code>Blob</code> is that the <code>Stream</code> is read-once, and has no + length. The Stream API includes a mechanism to buffer from a <code>Stream</code> into a <code>Blob</code>, and + thus all <code>Stream</code> scenarios are a super-set of <code>Blob</code> scenarios. + </li> + <li>JavaScript <a href="http://wiki.ecmascript.org/doku.php?id=strawman:typed_arrays">TypedArrays</a>. Especially + useful for post-processing scenarios, TypedArrays allow JavaScript code to crack-open a binary file + (<code>Blob</code>) and read/write its contents using the numerical data types already provided by JavaScript. + There's a cool explanation and example of TypedArrays + <a href="http://blogs.msdn.com/b/ie/archive/2011/12/01/working-with-binary-data-using-typed-arrays.aspx">here</a>. + </li> + </ul> + </section> + <p>Of course, post-processing scenarios made possible after sending a media stream or recorded media stream to a + server are unlimited. + </p> + <section> + <h4>Time sensitivity and performance</h4> + <p>Some post-processing scenarios are time-sensitive—especially those scenarios that involve processing large + amounts of data while the user waits. Other post-processing scenarios s are long-running and can have a performance + benefit if started before the end of the media stream segment is known. For example, a low-pass filter on a video. + </p> + <p>These scenarios generally take two approaches:</p> + <ol> + <li>Extract samples (video frames/audio clips) from a media stream sink and process each sample. Note that this + approach is vulnerable to sample loss (gaps between samples) if post-processing is too slow. + </li> + <li>Record the media stream and extract samples from the recorded native format. Note that this approach requires + significant understanding of the recorded native format. + </li> + </ol> + <p>Both approaches are valid for different types of scenarios.</p> + <p>The first approach is the technique described in the current WebRTC specification for the "take a picture" + scenario. + </p> + <p>The second approach is somewhat problematic from a time-sensitivity/performance perspective given that the + recorded content is only provided via a <code>Blob</code> today. A more natural fit for post-processing scenarios + that are time-or-performance sensitive is to supply a <code>Stream</code> as output from a recorder. + Thus time-or-performance sensitive post-processing applications can immediately start processing the [unfinished] + recording, and non-sensitive applications can use the Stream API's <code>StreamReader</code> to eventually pack + the full <code>Stream</code> into a <code>Blob</code>. + </p> + </section> + <section> + <h4>Examples</h4> + <ul> + <li>Image quality manipulation. If you copy the image data to a canvas element you can then get a data URI or + blob where you can specify the desired encoding and quality e.g. + <pre class="sh_javascript"> +canvas.toDataURL('image/jpeg', 0.6); +// or +canvas.toBlob(function(blob) {}, 'image/jpeg', 0.2);</pre> + </li> + <li>Image rotation. 
If you copy the image data to a canvas element and then obtain its 2D context you can then + call rotate() on that context object to rotate the displayed 'image'. You can then obtain the manipulated image + back via toDataURL or toBlob as above if you want to generate a file-like object that you can then pass around as + required. + </li> + <li>Image scaling. Thumbnails or web image formatting can be done by scaling down the captured image to a common + width/height and reduce the output quality. + </li> + <li>Speech-to-text. Post processing on a recorded audio format can be done to perform client-side speech + recognition and conversion to text. Note, that speech recognition algorithms are generally done on the server for + time-sensitive or performance reasons. + </li> + </ul> + </section> + <p>This task force should evaluate whether some extremely common post-processing scenarios should be included as + pre-processing features. + </p> + </section> + + <section> + <h3>Device Selection</h3> + <p>A particular user agent may have zero or more devices that provide the capability of audio or video capture. In + consumer scenarios, this is typically a webcam with a microphone (which may or may not be combined), and a "line-in" + and or microphone audio jack. The enthusiast users (e.g., recording enthusiasts), may have many more available + devices. + </p> + <p>Device selection in this section is not about the selection of audio vs. video capabilities, but about selection + of multiple devices within a given "audio" or "video" category (i.e., "kind"). The term "device" and "available + devices" used in this section refers to one or a collection of devices of a kind (e.g., that provide a common + capability, such as a set of devices that all provide "video"). + </p> + <p>Providing a mechanism for code to reliably enumerate the set of available devices enables programmatic control + over device selection. Device selection is important in a number of scenarios. For example, the user selected the + wrong camera (initially) and wants to change the media stream over to another camera. In another example, the + developer wants to select the device with the highest resolution for recording. + </p> + <p>Depending on how stream initialization is managed in the consent user experience, device selection may or may not + be a part of the UX. If not, then it becomes even more important to be able to change device selection after media + stream initialization. The requirements of the user-consent experience will likely be out of scope for this task force. + </p> + <section> + <h4>Privacy</h4> + <ul> + <li>As mentioned in the "Stream initialization" section, exposing the set of available devices before media stream + consent is given leads to privacy issues. Therefore, the device selection API should only be available after consent. + </li> + <li>Device selection should not be available for the set of devices within a given category/kind (e.g., "audio" + devices) for which user consent was not granted. + </li> + </ul> + </section> + <p>A selected device should provide some state information that identifies itself as "selected" (so that the set of + current device(s) in use can be programmatically determined). This is important because some relevant device information + cannot be surfaced via an API, and correct device selection can only be made by selecting a device, connecting a sink, + and providing the user a method for changing the device. 
For example, with multiple USB-attached webcams, there's no + reliable mechanism to describe how each device is oriented (front/back/left/right) with respect to the user. + </p> + <p>Device selection should be built on a mechanism for exposing device capabilities, which inform the developer of which device to + select. In order for the developer to make an informed decision about which device to select, the developer's code would + need to make some sort of comparison between devices—such a comparison should be done based on device capabilities rather + than a guess, hint, or special identifier (see related issue below). + </p> + <p>Recording capabilities are an important decision-making point for media capture scenarios. However, recording capabilities + are not directly correlated with individual devices, and as such should not be mixed with the device capabilities. For + example, the capability of recording audio in AAC vs. MP3 is not correlated with a given audio device, and therefore not a + decision-making factor for device selection. + </p> + <p>The current WebRTC spec does not provide an API for discovering the available devices, nor a mechanism for selection. + </p> + <section> + <h4>Issues</h4> + <ul> + <li>The specification should provide guidance on what set of devices is to be made available—should it be the set of + potential devices, or the set of "currently available" devices (which I recommend, since the non-available devices can't + be utilized by the developer's code, thus it doesn't make much sense to include them). + </li> + <li>A device selection API should expose devices by capability rather than by identity. Selecting by device identity is a poor practice + because it leads to device-dependent testing code (for example, if "Name Brand Device", then…) similar to the problems that + exist today on the web as a result of user-agent detection. A better model is to enable selection based on capabilities. + Additionally, knowing the GUID or hardware name is not helpful to web developers as part of a scenario other than device + identification (perhaps for purposes of providing device-specific help/troubleshooting, for example). + </li> + </ul> + </section> + </section> + + <section> + <h3>Change user-selected device capabilities</h3> + <p>In addition to selecting a device based on its capabilities, individual media capture devices may support multiple modes of + operation. For example, a webcam often supports a variety of resolutions which may be suitable for various scenarios (previewing + or recording a sample whose destination is a web server over a slow network connection, recording archival HD video for storing + locally). An audio device may have a gain control, allowing a developer to build a UI for an audio blender (varying the gain on + multiple audio source devices until the desired blend is achieved). + </p> + <p>A media capture API should support a mechanism to configure a particular device dynamically to suit the expected scenario. + Changes to the device should be reflected in the related media stream(s) themselves. + </p> + <p>Changes to device capabilities should be made in such a way that they are virtualized to the window that is + consuming the API (see definition of "virtual device"). For example, if two applications are using a device, changes to the + device's configuration in one window should not affect the other window.
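+        The purely hypothetical sketch below illustrates the asynchronous, virtualized request/notification pattern this
+        section argues for; every name in it is invented for illustration and none appears in the current drafts.
+      </p>
+      <pre class="sh_javascript">
+// Hypothetical only: "videoTracks", request() and the change notification are
+// invented names used to show the request/notification pattern, not real API.
+var camera = stream.videoTracks[0];
+
+camera.onchange = function (evt) {
+  // Notification: the requested change has actually been applied to the stream.
+  console.log(evt.name + ' is now ' + evt.value);
+};
+
+// Request, not command: the UA may take time to apply it, or may decline it
+// entirely if the capability cannot be virtualized for this window.
+camera.request('zoom', 2.0);</pre>
+      <p>The paragraphs that follow spell out why such changes should be modeled as requests rather than commands.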
+ </p> + <p>Changes to a device capability should be made in the form of requests (async operations rather than synchronous commands). + Change requests allow a device time to make the necessary internal changes, which may take a relatively long time without + blocking other script. Additionally, script code can be written to change device characteristics without careful error-detection + (because devices without the ability to change the given characteristic would not need to throw an exception synchronously). + Finally, a request model makes sense even in RTC scenarios, if one party of the teleconference, wants to issue a request that + another party mute their device (for example). The device change request can be propagated over the <code>PeerConnection</code> + to the sender asynchronously. + </p> + <p>In parallel, changes to a device's configuration should provide a notification when the change is made. This allows web + developer code to monitor the status of a media stream's devices and report statistics and state information without polling the + device (especially when the monitoring code is separate from the author's device-control code). This is also essential when the + change requests are asynchronous; to allow the developer to know at which point the requested change has been made in the media + stream (in order to perform synchronization, or start/stop a recording, for example). + </p> + <p>The current WebRTC spec only provides the "enabled" (on/off) capability for devices (where a device may be equated to a particular + track object). + </p> + <section> + <h4>Issues</h4> + <ul> + <li>If changing a particular device capability cannot be virtualized, this media capture task force should consider whether that + dynamic capability should be exposed to the web platform, and if so, what the usage policy around multiple access to that + capability should be. + </li> + <li>The specifics of what happens to a recording-in-progress when device behavior is changed must be described in the spec. + </li> + </ul> + </section> + </section> + + <section> + <h3>Multiple active devices</h3> + <p>In some scenarios, users may want to initiate capture from multiple devices at one time in multiple media streams. For example, + in a home-security monitoring scenario, a user agent may want to capture 10 unique video streams representing various locations being + monitored. The user may want to capture all 10 of these videos into one recording, or record all 10 individually (or some + combination thereof). + </p> + <section> + <h4>Issues</h4> + <ul> + <li>Given that device selection should be restricted to only the "kind" of devices for which the user has granted consent, detection + of multiple capture devices could only be done after a media stream was obtained. An API would therefore want to have a way of + exposing the set of <i>all devices</i> available for use. That API could facilitate both switching to the given device in the + current media stream, or some mechanism for creating a new media stream by activating a set of devices. By associating a track + object with a device, this can be accomplished via <code>new MediaStream(tracks)</code> providing the desired tracks/devices used + to create the new media stream. The constructor algorithm is modified to activate a track/device that is not "enabled". + </li> + <li>For many user agents (including mobile devices) preview of more than one media stream at a time can lead to performance problems. 
+ In many user agents, recording of more than one media stream can also lead to performance problems (dedicated encoding hardware + generally supports the media stream recording scenario, and the hardware can only handle one stream at a time). Especially for + recordings, an API should be designed such that it is not easy to accidentally start multiple recordings at once. + </li> + </ul> + </section> + </section> + + <section> + <h3>Recording a media stream</h3> + <p>In its most basic form, recording a media stream is simply the process of converting the media stream into a known format. There's + also an expectation that the recording will end within a reasonable time-frame (since local buffer space is not unlimited). + </p> + <p>Local media stream recordings are common in a variety of sharing scenarios such as: + </p> + <ul> + <li>record a video and upload to a video sharing site</li> + <li>record a picture for my user profile picture in a given web app</li> + <li>record audio for a translation site</li> + <li>record a video chat/conference</li> + </ul> + <p>There are other offline scenarios that are equally compelling, such as usage in native-camera-style apps, or web-based recording + studios (where tracks are recorded and later mixed). + </p> + <p>The core functionality that supports most recording scenarios is a simple start/stop recording pair. + </p> + <p>Ongoing recordings should report progress to enable developers to build UIs that pass this progress notification along to users. + </p> + <p>Recording API should be designed to gracefully handle changes to the media stream, and should also report (and perhaps even + attempt to recover from) failures at the media stream source during recording. + </p> + <p>Uses of the recorded information is covered in the Post-processing scenarios described previously. An additional usage is the + possibility of default save locations. For example, by default a UA may store temporary recordings (those recordings that are + in-progress) in a temp (hidden) folder. It may be desirable to be able to specify (or hint) at an alternate default recording + location such as the users's common file location for videos or pictures. + </p> + <section> + <h4>DVR Scenarios</h4> + <p>Increasingly in the digital age, the ability to pause, rewind, and "go live" for streamed content is an expected scenario. + While this scenario applies mostly to real-time communication scenarios (and not to local capture scenarios), it is worth + mentioning for completeness. + </p> + <p>The ability to quickly "rewind" can be useful, especially in video conference scenarios, when you may want to quickly go + back and hear something you just missed. In these scenarios, you either started a recording from the beginning of the conference + and you want to seek back to a specific time, or you were only streaming it (not saving it) but you allowed yourself some amount + of buffer in order to review the last X minutes of video. + </p> + <p>To support these scenarios, buffers must be introduced (because the media stream is not implicitly buffered for this scenario). + In the pre-recorded case, a full recording is in progress, and as long as the UA can access previous parts of the recording + (without terminating the recording) then this scenario could be possible. + </p> + <p>In the streaming case, the only way to support this scenario is to add a [configurable] buffer directly into the media stream + itself. 
Given the complexities of this approach and the relatively limited scenarios, adding a buffer capability to a media stream + object is not recommended. + </p> + <p>Note that most streaming scenarios (where DVR is supported) are made possible exclusively on the server to avoid accumulating + large amounts of data (i.e., the buffer) on the client. Content protection also tends to require this limitation. + </p> + </section> + <section> + <h4>Issues</h4> + <ul> + <li>There are few (if any) scenarios that require support for overlapping recordings of a single media stream. Note, that the + current <code>record</code> API supports overlapping recordings by simply calling <code>record()</code> twice. In the case of + separate media streams (see previous section) overlapping recording makes sense. In either case, initiating multiple recordings + should not be so easy so as to be accidental. + </li> + </ul> + </section> + </section> + + <section> + <h3>Selection of recording method</h3> + <p>All post-processing scenarios for recorded data require a known [standard] format. It is therefore crucial that the media capture + API provide a mechanism to specify the recording format. It is also important to be able to discover if a given format is supported. + </p> + <p>Most scenarios in which the recorded data is sent to the server for upload also have restrictions on the type of data that the server + expects (one size doesn't fit all). + </p> + <p>It should not be possible to change recording on-the-fly without consequences (i.e., a stop and/or re-start or failure). It is + recommended that the mechanism for specifying a recording format not make it too easy to change the format (e.g., setting the format + as a property may not be the best design). + </p> + <section> + <h4>Format detection</h4> + <ul> + <li>If we wish to re-use existing web platform concepts for format capability detection, the HTML5 <code>HTMLMediaElement</code> + supports an API called <code>canPlayType</code> which allows developer to probe the given UA for support of specific MIME types that + can be played by <code>audio</code> and <code>video</code> elements. A recording format checker could use this same approach. + </li> + </ul> + </section> + </section> + + <section> + <h3>Programmatic activation of camera app</h3> + <p>As mentioned in the introduction, declarative use of a capture device is out-of-scope. However, there are some potentially interesting + uses of a hybrid programmatic/declarative model, where the configuration of a particular media stream is done exclusively via the user + (as provided by some UA-specific settings UX), but the fine-grained control over the stream as well as the recording of the stream is + handled programmatically. + </p> + <p>In particular, if the developer doesn't want to guess the user's preferred settings, or if there are specific settings that may not be + available via the media capture API standard, they could be exposed in this manner. + </p> + </section> + + <section> + <h3>Take a picture</h3> + <p>A common usage scenario of local device capture is to simply "take a picture". The hardware and optics of many camera-devices often + support video in addition to photos, but can be set into a specific "camera mode" where the possible recording resolutions are + significantly larger than their maximum video resolution. 
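+        Absent such a mode, the technique available today is the still-frame approach described under post-processing:
+        grab a frame from a video sink via canvas and encode it. A minimal sketch follows (element names are illustrative).
+      </p>
+      <pre class="sh_javascript">
+// Sketch: copy the current preview frame into a canvas, then encode it.
+function takePicture() {
+  var video = document.querySelector('video#preview'); // previewing a MediaStream
+  var canvas = document.createElement('canvas');
+  canvas.width = video.videoWidth;
+  canvas.height = video.videoHeight;
+  canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);
+  return canvas.toDataURL('image/jpeg', 0.9); // or canvas.toBlob(...)
+}</pre>
+      <p>The resulting image can be no larger than the video resolution of the stream.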
+ </p> + <p>The advantage of having a photo mode is the ability to capture these very high-resolution images (versus the post-processing scenarios + that are possible with still-frames from a video source). + </p> + <p>Recording a picture is strongly tied to the "video" capability because a video preview is often an important component of setting up + the scene and getting the right shot. + </p> + <p>Because photo capabilities are somewhat different from those of regular video capabilities, devices that support a specific "photo" + mode should likely provide their "photo" capabilities separately from their "video" capabilities. + </p> + <p>Many of the considerations that apply to recording also apply to taking a picture. + </p> + <section> + <h4>Issues</h4> + <ul> + <li>What are the implications of the device mode switch for video recordings that are in progress? Will there be a pause? Can this + problem be avoided? + </li> + <li>Should a "photo mode" be a type of user media that can be requested via <code>getUserMedia</code>? + </li> + </ul> + </section> + </section> + + <section> + <h3>Picture tracks</h3> + <p>Another common scenario for media streams is to share photos via a video stream. For example, a user may want to select a photo and + attach the photo to an active media stream in order to share that photo via the stream. In another example, the photo can be used as a + type of "video mute" where the photo can be sent in place of the active video stream when a video track is "disabled". + </p> + <section> + <h4>Issues</h4> + <ul> + <li>It may be desirable to specify a photo/static image as a track type in order to allow it to be toggled on/off with a video track. + On the other hand, the sharing scenario could be fulfilled by simply providing an API to supply a photo for the video track "mute" + option (assuming that there's not a scenario that involves creating a parallel media stream that has both the photo track and the current + live video track active at once; such a use case could be satisfied by using two media streams instead). + </li> + </ul> + </section> + </section> + + <section> + <h3>Caption Tracks</h3> + <p>The HTML5 <code>HTMLMediaElement</code> now has the ability to display captions and other "text tracks". While not directly applicable to + local media stream scenarios (caption support is generally done out-of-band from the original capture), it could be something worth adding in + order to integrate with HTML5 videos when the source is a PeerConnection where real-time captioning is being performed and needs to be displayed. + </p> + </section> + + </section> + </body> +</html>
Received on Tuesday, 6 December 2011 01:33:57 UTC