RE: updates to requirements document

Timothy, in your post below you brought up RTP transport several times.  While I agree that RTP is a good solution for many use cases, it's not a good fit for all, and the network translation scenario you put forward is a perfect example.

My argument centers on two points:
  * Accurate translation is usually more important than timely translation.  Sure, there is a spectrum where some users might be willing to trade a few points of accuracy to avoid batch processing, but I suspect that as long as responses arrive in the 1-10 second range, the primary driver is accuracy.
  * The correlation between accuracy and audio quality is strong.  You mentioned that recognizers perform better on raw audio, and there is some truth to that.  But at any particular bandwidth (which is the real metric of concern), modern codecs like Opus and Speex will easily outperform raw audio: a 16 kHz, 16-bit mono capture costs 256 kbit/s uncompressed, while Opus delivers good speech quality at a small fraction of that.  Also consider that the effect of dropped packets on an unreliable transport can be devastating.

Given this, it's clear that reliable transports are the right way to address the use case.  The question is how to reflect this in the requirements.  Randell mentioned WebRTC over TCP, and I agree that could be made to work.  But when I have raised that sort of use case in the past, the response was poor.  Basically, those folks argue that if developers want reliable transport, there are options available to them already.  True, but those options are not presently linked with device capture streams, which is what this group is about.

How would you feel about a more generic requirement like "The Application must have some means of streaming media using reliable transports"?  That requirement could be treated as orthogonal to the JS-access requirement that Travis recently posted.
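To make it concrete, here is roughly the shape I'm picturing.  This is only a sketch: the recorder-style object that hands the application encoded chunks is hypothetical (it's exactly the piece that doesn't exist today), and the server URL is made up.

  // Hypothetical sketch: capture audio, get encoded chunks from an
  // assumed recorder-style API, and stream them over a reliable
  // transport (WebSocket, i.e. TCP).
  navigator.getUserMedia({ audio: true }, function (stream) {
    var socket = new WebSocket('wss://translator.example.com/audio');
    var recorder = new StreamRecorder(stream);    // hypothetical API
    recorder.ondataavailable = function (event) {
      socket.send(event.data);                    // encoded chunk, delivered reliably
    };
    recorder.start(250);                          // emit a chunk every 250 ms
  }, function (error) { console.error(error); });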

Thanks



-----Original Message-----
From: Timothy B. Terriberry [mailto:tterriberry@mozilla.com] 
Sent: Wednesday, July 11, 2012 10:30 AM
To: public-media-capture@w3.org
Subject: Re: updates to requirements document

Young, Milan wrote:
> I believe this newly proposed requirement **is** tied to existing 
> material in the spec.  Section 5.10 reads:
> 
> "Local media stream captures are common in a variety of sharing 
> scenarios such as:
> 
> capture a video and upload to a video sharing site
> 
> capture a picture for my user profile picture in a given web app
> 
> capture audio for a translation site
> 
> capture a video chat/conference"
> 
> I'd argue that perhaps the first two and definitely the third scenario 
> require the application layer to have access to the media.

1) What you really want is not ex post facto access to the encoded form of data from a camera, but a general method of encoding a stream. As soon as you want to do any processing on the client side (even something as simple as cropping or scaling), you're going to want to re-encode before uploading. At that point, I have no idea what this requirement has to do with capture. It applies equally to a MediaStream from any source.

In practice in WebRTC, the encoding actually happens right before the data goes to the network, and the process is intimately tied to the real-time nature of RTP and the constraints of the network. An "encoded representation of the media" doesn't exist before that point. You could satisfy this use case in some (non-ideal) form today by doing what Randell suggests (using WebRTC and capturing the RTP stream, a la SIPREC). That at least wouldn't require any additional spec work.
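For concreteness, a minimal sketch of that approach, using the callback-style API of the day; the signaling helper is hypothetical, and the server that actually records the RTP is assumed, not shown:

  // Sketch: send the capture over WebRTC so a server can record the
  // RTP stream (SIPREC-style).
  var pc = new RTCPeerConnection({ iceServers: [] });
  navigator.getUserMedia({ audio: true, video: true }, function (stream) {
    pc.addStream(stream);                 // media is encoded as it enters RTP
    pc.createOffer(function (offer) {
      pc.setLocalDescription(offer);
      sendOfferToRecorder(offer);         // hypothetical signaling helper
    }, function (error) { console.error(error); });
  }, function (error) { console.error(error); });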

2) For the image capture case, you almost certainly don't want an encoded video stream; you want to encode an image. There's already a way to do this (via the Canvas APIs).
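For example, assuming the capture stream is already playing in a <video> element, something along these lines grabs and encodes a still frame:

  // Grab a still frame from a playing <video> element and encode it.
  var video = document.getElementById('preview');
  var canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext('2d').drawImage(video, 0, 0);
  var picture = canvas.toDataURL('image/png'); // encoded image, ready to upload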

3) For translation (which implies speech recognition), a) if you're doing this on the client side, you want access to the _uncompressed_ media, not the compressed form. Every re-compression step only makes your job harder, and b) if you're doing this on the server side, then latency becomes very important, and the RTP recording suggested in point 1) is actually what you want, not some offline storage format.
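For the client-side case, the Web Audio API is the natural way to get at the raw samples. A sketch (node names have shifted between drafts, so treat the exact spellings as approximate):

  // Tap uncompressed PCM from a capture stream with Web Audio.
  var ctx = new AudioContext();
  navigator.getUserMedia({ audio: true }, function (stream) {
    var source = ctx.createMediaStreamSource(stream);
    var tap = ctx.createScriptProcessor(4096, 1, 1);
    tap.onaudioprocess = function (e) {
      var pcm = e.inputBuffer.getChannelData(0); // Float32 samples, no codec in the way
      // ...feed the recognizer here...
    };
    source.connect(tap);
    tap.connect(ctx.destination);
  }, function (error) { console.error(error); });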

4) Again, if you want to record this on the server, you want access to the RTP (preferably at the conference mixer, assuming there is one). No need for a browser API for that case. If you want to record it on the client, you want the general encoding API outlined in 1), but again this has nothing to do with media capture (as in camera/microphone access).

From the scenarios outlined above, I'm still looking for where the MediaSource API (which "extends HTMLMediaElement to allow JavaScript to generate media streams for playback") becomes at all relevant. Please clue me in.
