- From: Mercurial notifier <cvsmail@w3.org>
- Date: Tue, 06 Dec 2011 01:33:48 +0000
- To: public-dap-commits@w3.org
changeset: 36:d21e515ff4f5 tag: tip user: tleithea date: Mon Dec 05 17:32:19 2011 -0800 files: media-stream-capture/scenarios.html description: First Draft of Scenarios Document (including a bunch of commentary and issues) diff -r 5185030da020 -r d21e515ff4f5 media-stream-capture/scenarios.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/media-stream-capture/scenarios.html Mon Dec 05 17:32:19 2011 -0800 @@ -0,0 +1,863 @@ +<!DOCTYPE html> +<html> + <head> + <title>MediaStream Capture Scenarios</title> + <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/> + <script type="text/javascript" src='http://dev.w3.org/2009/dap/ReSpec.js/js/respec.js' class='remove'></script> + <script type="text/javascript" src='http://dev.w3.org/2009/dap/ReSpec.js/js/sh_main.min.js' class='remove'></script> + <script type="text/javascript" class='remove'> + var respecConfig = { + specStatus: "CG-NOTE", + editors: [{ + name: "Travis Leithead", + company: "Microsoft Corp.", + url: "mailto:travis.leithead@microsoft.com?subject=MediaStream Capture Scenarios Feedback", + companyURL: "http://www.microsoft.com"}], + previousPublishDate: null, + noIDLIn: true, + }; + </script> + <script type="text/javascript" src='http://dev.w3.org/2009/dap/common/config.js' class='remove'></script> + <style type="text/css"> + /* ReSpec.js CSS optimizations (Richard Tibbett) - cut-n-paste :) */ + div.example { + border-top: 1px solid #ff4500; + border-bottom: 1px solid #ff4500; + background: #fff; + padding: 1em; + font-size: 0.9em; + margin-top: 1em; + } + div.example::before { + content: "Example"; + display: block; + width: 150px; + background: #ff4500; + color: #fff; + font-family: initial; + padding: 3px; + padding-left: 5px; + font-weight: bold; + margin: -1em 0 1em -1em; + } + + /* Clean up pre.idl */ + pre.idl::before { + font-size:0.9em; + } + + /* Add better spacing to sections */ + section, .section { + margin-bottom: 2em; + } + + /* Reduce note & issue render size */ + .note, .issue { + font-size:0.8em; + } + + /* Add addition spacing to <ol> and <ul> for rule definition */ + ol.rule li, ul.rule li { + padding:0.2em; + } + </style> + </head> + + <body> + <section id='abstract'> + <p> + This document collates the target scenarios for the Media Capture task force. Scenarios represent + the set of expected functionality that may be achieved by the use of the MediaStream Capture API. A set of + un-supported scenarios may also be documented here. + </p> + <p>This document builds on the assumption that the mechanism for obtaining fundamental access to local media + capture device(s) is <code>navigator.getUserMedia</code> (name/behavior subject to this task force), and that + the vehicle for delivery of the content from the local media capture device(s) is a <code>MediaStream</code>. + Hence the title of this note. + </p> + </section> + + <section id='sotd'> + <p> + This document will eventually represent the consensus of the media capture task force on the set of scenarios + supported by the MediaStream Capture API. If you wish to make comments regarding this document, please + send them to <a href="mailto:public-media-capture@w3.org">public-media-capture@w3.org</a> ( + <a href="mailto:public-media-capture-request@w3.org?subject=subscribe">subscribe</a>, + <a href="http://lists.w3.org/Archives/Public/public-media-capture/">archives</a>). 
+ </p> + </section> + + <section class="informative"> + <h2>Introduction</h2> + <p> + One of the goals of the joint task force between the Device and Policy working group and the Web Real Time + Communications working groups is to bring media capture scenarios from both groups together into one unified + API that can address all relevant use cases. + </p> + <p> + The capture scenarios from WebRTC are primarily driven from real-time-communication-based scenarios, such as + the recording of live chats, teleconferences, and other media streamed from over the network from potentially + multiple sources. + </p> + <p> + The capture scenarios from DAP are primarily driven from "local" capture scenarios related to providing access + to a user agent's camera and related experiences. + </p> + <p> + Both groups include overlapping chartered deliverables in this space. Namely in DAP, + <a href="http://www.w3.org/2009/05/DeviceAPICharter">the charter specifies a recommendation-track deliverable</a>: + <ul> + <li> + <dt>Camera API</dt> + <dd>an API to manage a device's camera e.g. to take a picture</dd> + </li> + </ul> + </p> + <p> + And <a href="http://www.w3.org/2011/04/webrtc-charter.html">WebRTC's charter scope</a> describes enabling + real-time communications between web browsers that will require specific client-side technologies: + <ul> + <li>API functions to explore device capabilities, e.g. camera, microphone, speakers (currently in scope + for the <a href="http://www.w3.org/2009/dap/">Device APIs & Policy Working Group</a>)</li> + <li>API functions to capture media from local devices (camera and microphone) (currently in scope for the + <a href="http://www.w3.org/2009/dap/">Device APIs & Policy Working Group</a>)</li> + <li>API functions for encoding and other processing of those media streams,</li> + <li>API functions for decoding and processing (including echo cancelling, stream synchronization and a + number of other functions) of those streams at the incoming end,</li> + <li>Delivery to the user of those media streams via local screens and audio output devices (partially + covered with HTML5)</li> + </ul> + </p> + <p> + Note, that the scenarios described in this document specifically exclude peer-to-peer and networking scenarios + that do not overlap with local capture scenarios, as these are not considered in-scope for this task force. + </p> + <p> + Also excluded are scenarios that involve declarative capture scenarios, such as those where media capture can be + obtained and submitted to a server entirely without the use of script. Such scenarios generally involve the use + of a UA-specific app or mode for interacting with the recording device, altering settings and completing the + capture. Such scenarios are currently captured by the DAP working group's <a href="http://dev.w3.org/2009/dap/camera/">HTML Media Capture</a> + specification. + </p> + <p> + The scenarios contained in this document are specific to scenarios in which web applications require direct access + to the capture device, its settings, and the recording mechanism and output. Such scenarios have been deemed + crucial to building applications that can create a site-specific look-and-feel to the user's interaction with the + capture device, as well as utilize advanced functionality that may not be available in a declarative model. + </p> + </section> + + <!-- Travis: No conformance section necessary? 
+ + <section id='conformance'> + <p> + This specification defines conformance criteria that apply to a single product: the + <dfn id="ua">user agent</dfn> that implements the interfaces that it contains. + </p> + <p> + Implementations that use ECMAScript to implement the APIs defined in this specification must implement + them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification + [[!WEBIDL]], as this specification uses that specification and terminology. + </p> + <p> + A conforming implementation is required to implement all fields defined in this specification. + </p> + + <section> + <h2>Terminology</h2> + <p> + The terms <dfn>document base URL</dfn>, <dfn>browsing context</dfn>, <dfn>event handler attribute</dfn>, + <dfn>event handler event type</dfn>, <dfn>task</dfn>, <dfn>task source</dfn> and <dfn>task queues</dfn> + are defined by the HTML5 specification [[!HTML5]]. + </p> + <p> + The <a>task source</a> used by this specification is the <dfn>device task source</dfn>. + </p> + <p> + To <dfn>dispatch a <code>success</code> event</dfn> means that an event with the name + <code>success</code>, which does not bubble and is not cancellable, and which uses the + <code>Event</code> interface, is to be dispatched at the <a>ContactFindCB</a> object. + </p> + <p> + To <dfn>dispatch an <code>error</code> event</dfn> means that an event with the name + <code>error</code>, which does not bubble and is not cancellable, and which uses the <code>Event</code> + interface, is to be dispatched at the <a>ContactErrorCB</a> object. + </p> + </section> + </section> + --> + + <section> + <h2>Concepts and Definitions</h2> + <p> + This section describes some terminology and concepts that frame an understanding of the scenarios that + follow. It is helpful to have a common understanding of some core concepts to ensure that the scenarios + are interpreted uniformly. + </p> + <dl> + <dt>Stream</dt> + <dd>A stream including the implied derivative + <code><a href="http://dev.w3.org/2011/webrtc/editor/webrtc.html#introduction">MediaStream</a></code>, + can be conceptually understood as a tube or conduit between a source (the stream's generator) and a + destination (the sink). Streams don't generally include any type of significant buffer, that is, content + pushed into the stream from a source does not collect into any buffer for later collection. Rather, content + is simply dropped on the floor if the stream is not connected to a sink. This document assumes the + non-buffered view of streams as previously described. + </dd> + <dt><code>MediaStream</code> vs "media stream"</dt> + <dd>In some cases, I use these two terms interchangeably; my usage of the term "media stream" is intended as + a generalization of the more specific <code>MediaStream</code> interface as currently defined in the + WebRTC spec.</dd> + <dt><code>MediaStream</code> format</dt> + <dd>As stated in the WebRTC specification, the content flowing through a <code>MediaStream</code> is not in + any particular underlying format:</dd> + <dd><blockquote>[The data from a <code>MediaStream</code> object does not necessarily have a canonical binary form; for + example, it could just be "the video currently coming from the user's video camera". 
This allows user agents + to manipulate media streams in whatever fashion is most suitable on the user's platform.]</blockquote></dd> + <dd>This document reinforces that view, especially when dealing with recording of the <code>MediaStream</code>'s content + and the potential interaction with the <a href="http://dvcs.w3.org/hg/webapps/raw-file/tip/StreamAPI/Overview.htm">Streams API</a>. + </dd> + <dt>Virtualized device</dt> + <dd>Device virtualization (in my simplistic view) is the process of abstracting the settings for a device such + that code interacts with the virtualized layer, rather than with the actual device itself. Audio devices are + commonly virtualized. This allows many applications to use the audio device at the same time and apply + different audio settings like volume independently of each other. It also allows audio to be interleaved on + top of each other in the final output to the device. In some operating systems, such as Windows, a webcam's + video source is not virtualized, meaning that only one application can have control over the device at any + one time. In order for an app to use the webcam, either another app already using the webcam must yield it up + or the new app must "steal" the camera from the previous app. An API could be exposed from a device that + changes the device configuration in such a way that prevents that device from being virtualized--for example, + if a "zoom" setting were applied to a webcam device. Changing the zoom level on the device itself would affect + all potential virtualized versions of the device, and therefore defeat the virtualization.</dd> + </dl> + </section> + + <section> + <h2>Media Capture Scenarios</h2> + + <section> + <h3>Stream initialization</h3> + <p>A web application must be able to initiate a request for access to the user's webcam(s) and/or microphone(s). + Additionally, the web application should be able to "hint" at specific device characteristics that are desired by + the particular usage scenario of the application. User consent is required before obtaining access to the requested + stream.</p> + <p>When the media capture devices have been obtained (after user consent), the associated stream should be active + and populated with the appropriate devices (likely in the form of tracks to re-use an existing + <code>MediaStream</code> concept). The active capture devices will be configured according to user preference; the + user may have an opportunity to configure the initial state of the devices, select specific devices, and/or elect + to enable/disable a subset of the requested devices at the point of consent or beyond (the user remains in control). + </p> + <section> + <h4>Privacy</h4> + <p>Specific information about a given webcam and/or microphone must not be available until after the user has + granted consent. Otherwise "drive-by" fingerprinting of a UA's devices and characteristics can be obtained without + the user's knowledge—a privacy issue.</p> + </section> + + <p>The <code>navigator.getUserMedia</code> API fulfills these scenarios today.</p> + + <section> + <h4>Issues</h4> + <ul> + <li>What are the privacy/fingerprinting implications of the current "error" callback? Is it sufficiently "scary" + to warrant a change? Consider the following: + <ul> + <li>If the user doesn’t have a webcam/mic, and the developer requests it, a UA would be expected to invoke + the error callback immediately.</li> + <li>If the user does have a webcam/mic, and the developer requests it, a UA would be expected to prompt for + access.
If the user denies access, then the error callback is invoked.</li> + <li>Depending on the timing of the invocation of the error callback, scripts can still profile whether the + UA does or does not have a given device capability.</li> + </ul> + </li> + <li>In the case of a user with multiple video and/or audio capture devices, what specific permission is expected to + be granted for the "video" and "audio" options presented to <code>getUserMedia</code>? For example, does "video" + permission mean that the user grants permission to any and all video capture devices? Similarly with "audio"? Is + it a specific device only, and if so, which one? Given the privacy point above, my recommendation is that "video" + permission represents permission to all possible video capture devices present on the user's device, therefore + enabling switching scenarios (among video devices) to be possible without re-acquiring user consent. Same for + "audio" and combinations of the two. + </li> + <li>When a user has only one of two requested device capabilities (for example only "audio" but not "video", and both + "audio" and "video" are requested), should access be granted without the video or should the request fail? + </li> + </ul> + </section> + </section> + + <section> + <h3>Stream re-initialization</h3> + + <p>After requesting (and presumably gaining access to media capture devices) it is entirely possible for one or more of + the requested devices to stop or fail (for example, if a video device is claimed by another application, or if the user + unplugs a capture device or physically turns it off, or if the UA shuts down the device arbitrarily to conserve battery + power). In such a scenario it should be reasonably simple for the application to be notified of the situation, and for + the application to re-request access to the stream. + </p> + <p>Today, the <code>MediaStream</code> offers a single <code>ended</code> event. This could be sufficient for this + scenario. + </p> + <p>Additional information might also be useful either in terms of <code>MediaStream</code> state such as an error object, + or additional events like an <code>error</code> event (or both). + </p> + + <section> + <h4>Issues</h4> + <ul> + <li>How shall the stream be re-acquired efficiently? Is it merely a matter of re-requesting the entire + <code>MediaStream</code>, or can an "ended" mediastream be quickly revived? Reviving a local media stream makes + more sense in the context of the stream representing a set of device states, than it does when the stream + represents a network source. + </li> + <li>What's the expected interaction model with regard to user-consent? For example, if the re-initialization + request is for the same device(s), will the user be prompted for consent again? + </li> + <li>How can tug-of-war scenarios be avoided between two web applications both attempting to gain access to a + non-virtualized device at the same time? + </li> + </ul> + </section> + </section> + + <section> + <h3>Preview a stream</h3> + <p>The application should be able to connect a media stream (representing active media capture device(s) to a sink + in order to "see" the content flowing through the stream. In nearly all digital capture scenarios, "previewing" + the stream before initiating the capture is essential to the user in order to "compose" the shot (for example, + digital cameras have a preview screen before a picture or video is captured; even in non-digital photography, the + viewfinder acts as the "preview"). 
This is particularly important for visual media, but also for non-visual media + like audio. + </p> + <p>Note that media streams connected to a preview output sink are not in a "recording" state, as the media stream has + no default buffer (see the <a>Stream</a> definition in section 2). Content conceptually "within" the media stream + is streaming from the capture source device to the preview sink, after which the content is dropped (not + saved). + </p> + <p>The application should be able to effect changes to the media capture device(s) settings via the media stream + and see those changes take effect in the preview. + </p> + <p>Today, the <code>MediaStream</code> object can be connected to several "preview" sinks in HTML5, including the + <code>video</code> and <code>audio</code> elements. (This support should also extend to the <code>source</code> + elements of each as well.) The connection is accomplished via <code>URL.createObjectURL</code>. + </p> + <p>These concepts are fully supported by the current WebRTC specification.</p> + <section> + <h4>Issues</h4> + <ul> + <li>Audio tag preview is somewhat problematic because of the acoustic feedback problem (interference that results + when a microphone input picks up the output of a nearby speaker, creating a loop). There are + software solutions that attempt to automatically compensate for these types of feedback problems. However, it + may not be appropriate to require all implementations to support such an acoustic feedback prevention + algorithm. Therefore, audio preview could be turned off by default and only enabled by specific opt-in. + Could implementations without acoustic feedback prevention simply decline to enable the opt-in? + </li> + <li>It makes a lot of sense for a 1:1 association between the source and sink of a media stream; for example, + one media stream to one video element in HTML5. It is less clear what the value might be of supporting 1:many + media stream sinks—for example, it could be a significant performance load on the system to preview a media + stream in multiple video elements at once. Implementation feedback here would be valuable. It would also be + important to understand the scenario that requires a 1:many viewing of a single media stream. + </li> + <li>Are there any use cases for stopping or re-starting the preview (exclusively) that are sufficiently different + from the following scenarios? + <ul> + <li>Stopping/re-starting the device(s)—at the source of the media stream.</li> + <li>Assigning/clearing the URL from media stream sinks.</li> + <li>createObjectURL/revokeObjectURL – for controlling the [subsequent] connections to the media stream sink + via a URL. + </li> + </ul> + </li> + </ul> + </section> + </section> + + <section> + <h3>Stopping local devices</h3> + <p>End-users need to feel in control of their devices. Likewise, it is expected that developers using a media stream + capture API will want to provide a mechanism for users to stop their in-use device(s) via the software (rather than + using hardware on/off buttons, which may not always be available). + </p> + <p>Stopping or ending a media stream source device(s) in this context implies that the media stream source device(s) + cannot be re-started. This is a distinct scenario from simply "muting" the video/audio tracks of a given media stream. + </p> + <p>The current WebRTC draft describes a <code>stop</code> API on a <code>LocalMediaStream</code> interface, whose + purpose is to stop the media stream at its source.
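+        The sketch below shows how acquisition, preview, and stopping fit together under these assumptions; the exact
+        <code>getUserMedia</code> signature was still in flux at the time of writing, so the options argument, the error
+        object, and the element IDs shown here are illustrative rather than settled API.
+      </p>
+      <pre class="sh_javascript">
+navigator.getUserMedia({ audio: true, video: true }, function (stream) {
+  // Preview: connect the LocalMediaStream to a &lt;video&gt; sink.
+  var preview = document.querySelector('video#preview');
+  preview.src = URL.createObjectURL(stream);
+  preview.play();
+
+  // Stop: end the capture devices at their source (they cannot be re-started).
+  document.querySelector('#stop').onclick = function () {
+    stream.stop();
+  };
+}, function (error) {
+  // No matching device, consent denied, device claimed by another app, etc.
+});</pre>
+      <p>Note that <code>stop()</code> ends all of the stream's source devices at once; see the issue below about stopping
+        a single device.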
+ </p> + <section> + <h4>Issues</h4> + <ul> + <li>Is there a scenario where end-users will want to stop just a single device, rather than all devices participating + in the current media stream? + </li> + </ul> + </section> + </section> + + <section> + <h3>Pre-processing</h3> + <p>Pre-processing scenarios are a bucket of scenarios that perform processing on the "raw" or "internal" characteristics + of the media stream for the purpose of reporting information that would otherwise require processing of a known + format (i.e., at the media stream sink—like Canvas, or via recording and post-processing), significant + computationally-expensive scripting, etc. + </p> + <p>Pre-processing scenarios will require the UAs to provide an implementation (which may be non-trivial). This is + required because the media stream has no internal format upon which a script-based implementation could be derived + (and I believe advocating for the specification of such a format is unwise). + </p> + <p>Pre-processing scenarios provide information that is generally needed <i>before</i> a stream need be connected to a + sink or recorded. + </p> + <p>Pre-processing scenarios apply to both real-time-communication and local capture scenarios. Therefore, the + specification of various pre-processing requirements may likely fall outside the scope of this task force. However, + they are included here for scenario-completeness and to help ensure that a media capture API design takes them into + account. + </p> + <section> + <h4>Examples</h4> + <ul> + <li>Audio end-pointing. As described in <a href="http://lists.w3.org/Archives/Public/www-archive/2011Mar/att-0001/microsoft-api-draft-final.html">a + speech API proposal</a>, audio end-pointing allows for the detection of noise, speech, or silence and raises events + when these audio states change. End-pointing is necessary for scenarios that programmatically determine when to + start and stop recording an audio stream for purposes of hands-free speech commands, dictation, and a variety of + other speech and accessibility-related scenarios. The proposal linked above describes these scenarios in better + detail. Audio end-pointing would be required as a pre-processing scenario because it is a prerequisite to + starting/stopping a recorder of the media stream itself. + </li> + <li>Volume leveling/automatic gain control. The ability to automatically detect changes in audio loudness and adjust + the input volume such that the output volume remains constant. These scenarios are useful in a variety of + heterogeneous audio environments such as teleconferences, live broadcasting involving commercials, etc. + Configuration options for volume/gain control of a media stream source device are also useful, and are explored + later on. + </li> + <li>Video face-recognition and gesture detection. These scenarios are the visual analog to the previously described + audio end-pointing scenarios. Face-recognition is useful in a variety of contexts from identifying faces in family + photographs, to serving as part of an identity management system for system access. Likewise, gesture recognition + can act as an input mechanism for a computer. + </li> + </ul> + </section> + <section> + <h4>Issues</h4> + <ul> + <li>In general the set of audio pre-processing scenarios is much more constrained than the set of possible visual + pre-processing scenarios. 
Due to the large set of visual pre-processing scenarios (which could also be implemented + by scenario-specific post-processing in most cases), we may recommend that visual-related pre-processing + scenarios be excluded from the scope of our task force. + </li> + <li>The challenge of specifying pre-processing scenarios will be identifying what specific information should be + conveyed by the platform at a level that serves the widest variety of scenarios. For example, + audio end-pointing could be specified in high-level terms of firing events when specific words of a given language + are identified, or could be as low-level as reporting when there is silence/background noise and when there's not. + Not every scenario will be served by whatever API is designed; therefore, this group might choose to + evaluate which scenarios (if any) are worth including in the first version of the API. + </li> + </ul> + </section> + </section> + + <section> + <h3>Post-processing</h3> + <p>Post-processing scenarios are the group of scenarios that can be completed after either:</p> + <ol> + <li>Connecting the media stream to a sink (such as the <code>video</code> or <code>audio</code> elements)</li> + <li>Recording the media stream to a known format (MIME type)</li> + </ol> + <p>Post-processing scenarios will continue to expand and grow as the web platform matures and gains capabilities. + The key to understanding the available post-processing scenarios is to understand the other facets of the web + platform that are available for use. + </p> + <section> + <h4>Web platform post-processing toolbox</h4> + <p>The common post-processing capabilities for media stream scenarios are built on a relatively small set of web + platform capabilities: + </p> + <ul> + <li>HTML5 <a href="http://dev.w3.org/html5/spec/Overview.html#the-video-element"><code>video</code></a> and + <a href="http://dev.w3.org/html5/spec/Overview.html#the-audio-element"><code>audio</code></a> tags. These elements are natural + candidates for media stream output sinks. Additionally, they provide an API (see + <a href="http://dev.w3.org/html5/spec/Overview.html#htmlmediaelement">HTMLMediaElement</a>) for interacting with + the source content. Note: in some cases, these elements are not well-specified for stream-type sources—this task + force may need to drive some stream-source requirements into HTML5. + </li> + <li>HTML5 <a href="http://dev.w3.org/html5/spec/Overview.html#the-canvas-element"><code>canvas</code></a> element + and the <a href="http://dev.w3.org/html5/2dcontext/">Canvas 2D context</a>. The <code>canvas</code> element employs + a fairly extensive 2D drawing API and will soon be extended with audio capabilities as well (<b>RichT, can you + provide a link?</b>). Canvas' drawing API allows for drawing frames from a <code>video</code> element, which is + the link between the media capture sink and the effects made possible via Canvas. + </li> + <li><a href="http://dev.w3.org/2006/webapi/FileAPI/">File API</a> and + <a href="http://www.w3.org/TR/file-writer-api/">File API Writer</a>. The File API provides various methods for + reading and writing to binary formats. The fundamental container for these binary files is the <code>Blob</code>, + which, put simply, is a read-only structure with a MIME type and a length. The File API integrates with many other + web APIs such that the <code>Blob</code> can be used uniformly across the entire web platform.
For example, + <code>XMLHttpRequest</code>, form submission in HTML, message passing between documents and web workers + (<code>postMessage</code>), and Indexed DB all support <code>Blob</code> use. + </li> + <li><a href="http://dvcs.w3.org/hg/webapps/raw-file/tip/StreamAPI/Overview.htm">Stream API</a>. A new addition to + the WebApps WG, the <code>Stream</code> is another general-purpose binary container. The primary differences + between a <code>Stream</code> and a <code>Blob</code> is that the <code>Stream</code> is read-once, and has no + length. The Stream API includes a mechanism to buffer from a <code>Stream</code> into a <code>Blob</code>, and + thus all <code>Stream</code> scenarios are a super-set of <code>Blob</code> scenarios. + </li> + <li>JavaScript <a href="http://wiki.ecmascript.org/doku.php?id=strawman:typed_arrays">TypedArrays</a>. Especially + useful for post-processing scenarios, TypedArrays allow JavaScript code to crack-open a binary file + (<code>Blob</code>) and read/write its contents using the numerical data types already provided by JavaScript. + There's a cool explanation and example of TypedArrays + <a href="http://blogs.msdn.com/b/ie/archive/2011/12/01/working-with-binary-data-using-typed-arrays.aspx">here</a>. + </li> + </ul> + </section> + <p>Of course, post-processing scenarios made possible after sending a media stream or recorded media stream to a + server are unlimited. + </p> + <section> + <h4>Time sensitivity and performance</h4> + <p>Some post-processing scenarios are time-sensitive—especially those scenarios that involve processing large + amounts of data while the user waits. Other post-processing scenarios s are long-running and can have a performance + benefit if started before the end of the media stream segment is known. For example, a low-pass filter on a video. + </p> + <p>These scenarios generally take two approaches:</p> + <ol> + <li>Extract samples (video frames/audio clips) from a media stream sink and process each sample. Note that this + approach is vulnerable to sample loss (gaps between samples) if post-processing is too slow. + </li> + <li>Record the media stream and extract samples from the recorded native format. Note that this approach requires + significant understanding of the recorded native format. + </li> + </ol> + <p>Both approaches are valid for different types of scenarios.</p> + <p>The first approach is the technique described in the current WebRTC specification for the "take a picture" + scenario. + </p> + <p>The second approach is somewhat problematic from a time-sensitivity/performance perspective given that the + recorded content is only provided via a <code>Blob</code> today. A more natural fit for post-processing scenarios + that are time-or-performance sensitive is to supply a <code>Stream</code> as output from a recorder. + Thus time-or-performance sensitive post-processing applications can immediately start processing the [unfinished] + recording, and non-sensitive applications can use the Stream API's <code>StreamReader</code> to eventually pack + the full <code>Stream</code> into a <code>Blob</code>. + </p> + </section> + <section> + <h4>Examples</h4> + <ul> + <li>Image quality manipulation. If you copy the image data to a canvas element you can then get a data URI or + blob where you can specify the desired encoding and quality e.g. + <pre class="sh_javascript"> +canvas.toDataURL('image/jpeg', 0.6); +// or +canvas.toBlob(function(blob) {}, 'image/jpeg', 0.2);</pre> + </li> + <li>Image rotation. 
If you copy the image data to a canvas element and then obtain its 2D context you can then + call rotate() on that context object to rotate the displayed 'image'. You can then obtain the manipulated image + back via toDataURL or toBlob as above if you want to generate a file-like object that you can then pass around as + required. + </li> + <li>Image scaling. Thumbnails or web image formatting can be done by scaling down the captured image to a common + width/height and reduce the output quality. + </li> + <li>Speech-to-text. Post processing on a recorded audio format can be done to perform client-side speech + recognition and conversion to text. Note, that speech recognition algorithms are generally done on the server for + time-sensitive or performance reasons. + </li> + </ul> + </section> + <p>This task force should evaluate whether some extremely common post-processing scenarios should be included as + pre-processing features. + </p> + </section> + + <section> + <h3>Device Selection</h3> + <p>A particular user agent may have zero or more devices that provide the capability of audio or video capture. In + consumer scenarios, this is typically a webcam with a microphone (which may or may not be combined), and a "line-in" + and or microphone audio jack. The enthusiast users (e.g., recording enthusiasts), may have many more available + devices. + </p> + <p>Device selection in this section is not about the selection of audio vs. video capabilities, but about selection + of multiple devices within a given "audio" or "video" category (i.e., "kind"). The term "device" and "available + devices" used in this section refers to one or a collection of devices of a kind (e.g., that provide a common + capability, such as a set of devices that all provide "video"). + </p> + <p>Providing a mechanism for code to reliably enumerate the set of available devices enables programmatic control + over device selection. Device selection is important in a number of scenarios. For example, the user selected the + wrong camera (initially) and wants to change the media stream over to another camera. In another example, the + developer wants to select the device with the highest resolution for recording. + </p> + <p>Depending on how stream initialization is managed in the consent user experience, device selection may or may not + be a part of the UX. If not, then it becomes even more important to be able to change device selection after media + stream initialization. The requirements of the user-consent experience will likely be out of scope for this task force. + </p> + <section> + <h4>Privacy</h4> + <ul> + <li>As mentioned in the "Stream initialization" section, exposing the set of available devices before media stream + consent is given leads to privacy issues. Therefore, the device selection API should only be available after consent. + </li> + <li>Device selection should not be available for the set of devices within a given category/kind (e.g., "audio" + devices) for which user consent was not granted. + </li> + </ul> + </section> + <p>A selected device should provide some state information that identifies itself as "selected" (so that the set of + current device(s) in use can be programmatically determined). This is important because some relevant device information + cannot be surfaced via an API, and correct device selection can only be made by selecting a device, connecting a sink, + and providing the user a method for changing the device. 
For example, with multiple USB-attached webcams, there's no + reliable mechanism to describe how each device is oriented (front/back/left/right) with respect to the user. + </p> + <p>Device selection should be built on a mechanism for exposing device capabilities, which inform the developer of which device to + select. In order for the developer to make an informed decision about which device to select, the developer's code would + need to make some sort of comparison between devices—such a comparison should be done based on device capabilities rather + than a guess, hint, or special identifier (see related issue below). + </p> + <p>Recording capabilities are an important decision-making point for media capture scenarios. However, recording capabilities + are not directly correlated with individual devices, and as such should not be mixed with the device capabilities. For + example, the capability of recording audio in AAC vs. MP3 is not correlated with a given audio device, and therefore not a + decision-making factor for device selection. + </p> + <p>The current WebRTC spec does not provide an API for discovering the available devices, nor a mechanism for selection. + </p> + <section> + <h4>Issues</h4> + <ul> + <li>The specification should provide guidance on what set of devices is to be made available—should it be the set of + potential devices, or the set of "currently available" devices (which I recommend, since the non-available devices can't + be utilized by the developer's code, thus it doesn't make much sense to include them). + </li> + <li>A device selection API should expose devices by capability rather than by identity. Selecting by device identity is a poor practice + because it leads to device-dependent testing code (for example, if "Name Brand Device", then…) similar to the problems that + exist today on the web as a result of user-agent detection. A better model is to enable selection based on capabilities. + Additionally, knowing the GUID or hardware name is not helpful to web developers as part of a scenario other than device + identification (perhaps for purposes of providing device-specific help/troubleshooting, for example). + </li> + </ul> + </section> + </section> + + <section> + <h3>Change user-selected device capabilities</h3> + <p>In addition to selecting a device based on its capabilities, individual media capture devices may support multiple modes of + operation. For example, a webcam often supports a variety of resolutions which may be suitable for various scenarios (previewing + or recording a sample whose destination is a web server over a slow network connection, recording archival HD video for storing + locally). An audio device may have a gain control, allowing a developer to build a UI for an audio blender (varying the gain on + multiple audio source devices until the desired blend is achieved). + </p> + <p>A media capture API should support a mechanism to configure a particular device dynamically to suit the expected scenario. + Changes to the device should be reflected in the related media stream(s) themselves. + </p> + <p>Changes to device capabilities should be made in such a way that they are virtualized to the window that is + consuming the API (see definition of "virtual device"). For example, if two applications are using a device, changes to the + device's configuration in one window should not affect the other window.
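+        The purely hypothetical sketch below illustrates the asynchronous, virtualized request/notification pattern this
+        section argues for; every name in it is invented for illustration and none appears in the current drafts.
+      </p>
+      <pre class="sh_javascript">
+// Hypothetical only: "videoTracks", request() and the change notification are
+// invented names used to show the request/notification pattern, not real API.
+var camera = stream.videoTracks[0];
+
+camera.onchange = function (evt) {
+  // Notification: the requested change has actually been applied to the stream.
+  console.log(evt.name + ' is now ' + evt.value);
+};
+
+// Request, not command: the UA may take time to apply it, or may decline it
+// entirely if the capability cannot be virtualized for this window.
+camera.request('zoom', 2.0);</pre>
+      <p>The paragraphs that follow spell out why such changes should be modeled as requests rather than commands.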
+ </p> + <p>Changes to a device capability should be made in the form of requests (async operations rather than synchronous commands). + Change requests allow a device time to make the necessary internal changes, which may take a relatively long time without + blocking other script. Additionally, script code can be written to change device characteristics without careful error-detection + (because devices without the ability to change the given characteristic would not need to throw an exception synchronously). + Finally, a request model makes sense even in RTC scenarios, if one party of the teleconference, wants to issue a request that + another party mute their device (for example). The device change request can be propagated over the <code>PeerConnection</code> + to the sender asynchronously. + </p> + <p>In parallel, changes to a device's configuration should provide a notification when the change is made. This allows web + developer code to monitor the status of a media stream's devices and report statistics and state information without polling the + device (especially when the monitoring code is separate from the author's device-control code). This is also essential when the + change requests are asynchronous; to allow the developer to know at which point the requested change has been made in the media + stream (in order to perform synchronization, or start/stop a recording, for example). + </p> + <p>The current WebRTC spec only provides the "enabled" (on/off) capability for devices (where a device may be equated to a particular + track object). + </p> + <section> + <h4>Issues</h4> + <ul> + <li>If changing a particular device capability cannot be virtualized, this media capture task force should consider whether that + dynamic capability should be exposed to the web platform, and if so, what the usage policy around multiple access to that + capability should be. + </li> + <li>The specifics of what happens to a recording-in-progress when device behavior is changed must be described in the spec. + </li> + </ul> + </section> + </section> + + <section> + <h3>Multiple active devices</h3> + <p>In some scenarios, users may want to initiate capture from multiple devices at one time in multiple media streams. For example, + in a home-security monitoring scenario, a user agent may want to capture 10 unique video streams representing various locations being + monitored. The user may want to capture all 10 of these videos into one recording, or record all 10 individually (or some + combination thereof). + </p> + <section> + <h4>Issues</h4> + <ul> + <li>Given that device selection should be restricted to only the "kind" of devices for which the user has granted consent, detection + of multiple capture devices could only be done after a media stream was obtained. An API would therefore want to have a way of + exposing the set of <i>all devices</i> available for use. That API could facilitate both switching to the given device in the + current media stream, or some mechanism for creating a new media stream by activating a set of devices. By associating a track + object with a device, this can be accomplished via <code>new MediaStream(tracks)</code> providing the desired tracks/devices used + to create the new media stream. The constructor algorithm is modified to activate a track/device that is not "enabled". + </li> + <li>For many user agents (including mobile devices) preview of more than one media stream at a time can lead to performance problems. 
+ In many user agents, recording of more than one media stream can also lead to performance problems (dedicated encoding hardware + generally supports the media stream recording scenario, and the hardware can only handle one stream at a time). Especially for + recordings, an API should be designed such that it is not easy to accidentally start multiple recordings at once. + </li> + </ul> + </section> + </section> + + <section> + <h3>Recording a media stream</h3> + <p>In its most basic form, recording a media stream is simply the process of converting the media stream into a known format. There's + also an expectation that the recording will end within a reasonable time-frame (since local buffer space is not unlimited). + </p> + <p>Local media stream recordings are common in a variety of sharing scenarios such as: + </p> + <ul> + <li>record a video and upload to a video sharing site</li> + <li>record a picture for my user profile picture in a given web app</li> + <li>record audio for a translation site</li> + <li>record a video chat/conference</li> + </ul> + <p>There are other offline scenarios that are equally compelling, such as usage in native-camera-style apps, or web-based recording + studios (where tracks are recorded and later mixed). + </p> + <p>The core functionality that supports most recording scenarios is a simple start/stop recording pair. + </p> + <p>Ongoing recordings should report progress to enable developers to build UIs that pass this progress notification along to users. + </p> + <p>Recording API should be designed to gracefully handle changes to the media stream, and should also report (and perhaps even + attempt to recover from) failures at the media stream source during recording. + </p> + <p>Uses of the recorded information is covered in the Post-processing scenarios described previously. An additional usage is the + possibility of default save locations. For example, by default a UA may store temporary recordings (those recordings that are + in-progress) in a temp (hidden) folder. It may be desirable to be able to specify (or hint) at an alternate default recording + location such as the users's common file location for videos or pictures. + </p> + <section> + <h4>DVR Scenarios</h4> + <p>Increasingly in the digital age, the ability to pause, rewind, and "go live" for streamed content is an expected scenario. + While this scenario applies mostly to real-time communication scenarios (and not to local capture scenarios), it is worth + mentioning for completeness. + </p> + <p>The ability to quickly "rewind" can be useful, especially in video conference scenarios, when you may want to quickly go + back and hear something you just missed. In these scenarios, you either started a recording from the beginning of the conference + and you want to seek back to a specific time, or you were only streaming it (not saving it) but you allowed yourself some amount + of buffer in order to review the last X minutes of video. + </p> + <p>To support these scenarios, buffers must be introduced (because the media stream is not implicitly buffered for this scenario). + In the pre-recorded case, a full recording is in progress, and as long as the UA can access previous parts of the recording + (without terminating the recording) then this scenario could be possible. + </p> + <p>In the streaming case, the only way to support this scenario is to add a [configurable] buffer directly into the media stream + itself. 
Given the complexities of this approach and the relatively limited scenarios, adding a buffer capability to a media stream + object is not recommended. + </p> + <p>Note that most streaming scenarios (where DVR is supported) are made possible exclusively on the server to avoid accumulating + large amounts of data (i.e., the buffer) on the client. Content protection also tends to require this limitation. + </p> + </section> + <section> + <h4>Issues</h4> + <ul> + <li>There are few (if any) scenarios that require support for overlapping recordings of a single media stream. Note, that the + current <code>record</code> API supports overlapping recordings by simply calling <code>record()</code> twice. In the case of + separate media streams (see previous section) overlapping recording makes sense. In either case, initiating multiple recordings + should not be so easy so as to be accidental. + </li> + </ul> + </section> + </section> + + <section> + <h3>Selection of recording method</h3> + <p>All post-processing scenarios for recorded data require a known [standard] format. It is therefore crucial that the media capture + API provide a mechanism to specify the recording format. It is also important to be able to discover if a given format is supported. + </p> + <p>Most scenarios in which the recorded data is sent to the server for upload also have restrictions on the type of data that the server + expects (one size doesn't fit all). + </p> + <p>It should not be possible to change recording on-the-fly without consequences (i.e., a stop and/or re-start or failure). It is + recommended that the mechanism for specifying a recording format not make it too easy to change the format (e.g., setting the format + as a property may not be the best design). + </p> + <section> + <h4>Format detection</h4> + <ul> + <li>If we wish to re-use existing web platform concepts for format capability detection, the HTML5 <code>HTMLMediaElement</code> + supports an API called <code>canPlayType</code> which allows developer to probe the given UA for support of specific MIME types that + can be played by <code>audio</code> and <code>video</code> elements. A recording format checker could use this same approach. + </li> + </ul> + </section> + </section> + + <section> + <h3>Programmatic activation of camera app</h3> + <p>As mentioned in the introduction, declarative use of a capture device is out-of-scope. However, there are some potentially interesting + uses of a hybrid programmatic/declarative model, where the configuration of a particular media stream is done exclusively via the user + (as provided by some UA-specific settings UX), but the fine-grained control over the stream as well as the recording of the stream is + handled programmatically. + </p> + <p>In particular, if the developer doesn't want to guess the user's preferred settings, or if there are specific settings that may not be + available via the media capture API standard, they could be exposed in this manner. + </p> + </section> + + <section> + <h3>Take a picture</h3> + <p>A common usage scenario of local device capture is to simply "take a picture". The hardware and optics of many camera-devices often + support video in addition to photos, but can be set into a specific "camera mode" where the possible recording resolutions are + significantly larger than their maximum video resolution. 
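+        Absent such a mode, the technique available today is the still-frame approach described under post-processing:
+        grab a frame from a video sink via canvas and encode it. A minimal sketch follows (element names are illustrative).
+      </p>
+      <pre class="sh_javascript">
+// Sketch: copy the current preview frame into a canvas, then encode it.
+function takePicture() {
+  var video = document.querySelector('video#preview'); // previewing a MediaStream
+  var canvas = document.createElement('canvas');
+  canvas.width = video.videoWidth;
+  canvas.height = video.videoHeight;
+  canvas.getContext('2d').drawImage(video, 0, 0, canvas.width, canvas.height);
+  return canvas.toDataURL('image/jpeg', 0.9); // or canvas.toBlob(...)
+}</pre>
+      <p>The resulting image can be no larger than the video resolution of the stream.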
+ </p> + <p>The advantage of having a photo mode is the ability to capture these very high-resolution images (versus the post-processing scenarios + that are possible with still-frames from a video source). + </p> + <p>Recording a picture is strongly tied to the "video" capability because a video preview is often an important component of setting up + the scene and getting the right shot. + </p> + <p>Because photo capabilities are somewhat different from those of regular video capabilities, devices that support a specific "photo" + mode should likely provide their "photo" capabilities separately from their "video" capabilities. + </p> + <p>Many of the considerations that apply to recording also apply to taking a picture. + </p> + <section> + <h4>Issues</h4> + <ul> + <li>What are the implications of the device mode switch for video recordings that are in progress? Will there be a pause? Can this + problem be avoided? + </li> + <li>Should a "photo mode" be a type of user media that can be requested via <code>getUserMedia</code>? + </li> + </ul> + </section> + </section> + + <section> + <h3>Picture tracks</h3> + <p>Another common scenario for media streams is to share photos via a video stream. For example, a user may want to select a photo and + attach the photo to an active media stream in order to share that photo via the stream. In another example, the photo can be used as a + type of "video mute" where the photo can be sent in place of the active video stream when a video track is "disabled". + </p> + <section> + <h4>Issues</h4> + <ul> + <li>It may be desirable to specify a photo/static image as a track type in order to allow it to be toggled on/off with a video track. + On the other hand, the sharing scenario could be fulfilled by simply providing an API to supply a photo for the video track "mute" + option (assuming that there's not a scenario that involves creating a parallel media stream that has both the photo track and the current + live video track active at once; such a use case could be satisfied by using two media streams instead). + </li> + </ul> + </section> + </section> + + <section> + <h3>Caption Tracks</h3> + <p>The HTML5 <code>HTMLMediaElement</code> now has the ability to display captions and other "text tracks". While not directly applicable to + local media stream scenarios (caption support is generally done out-of-band from the original capture), it could be something worth adding in + order to integrate with HTML5 videos when the source is a PeerConnection where real-time captioning is being performed and needs to be displayed. + </p> + </section> + + </section> + </body> +</html>
Received on Tuesday, 6 December 2011 01:33:57 UTC