RE: terminology (was: updates to requirements document) from Young, Milan on 2012-07-13 (public-media-capture@w3.org from July 2012)

From: Young, Milan <Milan.Young@nuance.com>
Date: Fri, 13 Jul 2012 21:53:53 +0000
To: Jim Barnett <Jim.Barnett@genesyslab.com>, Travis Leithead <travis.leithead@microsoft.com>, Stefan Hakansson LK <stefan.lk.hakansson@ericsson.com>, "public-media-capture@w3.org" <public-media-capture@w3.org>
Message-ID: <B236B24082A4094A85003E8FFB8DDC3C1A4799CE@SOM-EXCH04.nuance.com>
Jim's description is essentially correct, but I would like to offer one point of clarification around the term "real time".  Most of the time that term is used in the context of RTP, which implies that the consumer is only interested in the packets that fall within the playback window.

The functionality we need to address the audio transcription scenario (put forward in section 5.10) is a different kind of "real time".  I had referred to it as sending the audio "as capture is in progress".  I suspect the size of the buffers is still roughly the same (e.g. 50ms), but with the requirement that they are delivered on a reliable transport.  Most of the time I'd expect delivery to happen within 5s, but if it takes a bit longer on a congested network, for this use case, it's worth the wait.
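
To make the contrast between the two senses of "real time" concrete, here is a minimal TypeScript sketch. The AudioChunk type, both consumer functions, the 200 ms playback window, and the arrival times are hypothetical and chosen only for illustration; none of them come from WebRTC or from any proposed API. The same arrival times are fed to both consumers purely for comparison; a real in-order reliable transport would also hold back the chunks queued behind a late one.

```typescript
// Hypothetical types and values for illustration; none of these names come from a spec.
interface AudioChunk {
  seq: number;          // position of the chunk within the capture
  capturedAtMs: number; // when the chunk was captured
  arrivedAtMs: number;  // when the chunk reached the consumer
}

// RTP-style consumer: a chunk is only useful if it arrives inside the playback window.
function playbackWindowConsumer(chunks: AudioChunk[], windowMs: number): number[] {
  return chunks
    .filter((c) => c.arrivedAtMs - c.capturedAtMs <= windowMs)
    .map((c) => c.seq);
}

// Transcription-style consumer: every chunk is kept, in capture order, however late it is.
function reliableConsumer(chunks: AudioChunk[]): number[] {
  return chunks
    .slice()
    .sort((a, b) => a.seq - b.seq)
    .map((c) => c.seq);
}

const CHUNK_MS = 50; // buffer size mentioned in the message above
const arrivals: AudioChunk[] = [
  { seq: 0, capturedAtMs: 0 * CHUNK_MS, arrivedAtMs: 40 },
  { seq: 1, capturedAtMs: 1 * CHUNK_MS, arrivedAtMs: 6050 }, // held up for 6 s by congestion
  { seq: 2, capturedAtMs: 2 * CHUNK_MS, arrivedAtMs: 150 },
];

const played = playbackWindowConsumer(arrivals, 200); // chunk 1 is too late and is skipped
const transcribed = reliableConsumer(arrivals);       // chunk 1 is late but still delivered
console.log(played, transcribed);
```

The playback-style consumer gives up on the delayed chunk, while the transcription-style consumer accepts the wait so that the recognizer sees the complete audio.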

It might be best to push this class of delivery down to WebRTC.  But if that cannot be arranged, then giving the JS layer the encoded samples seems like a reasonable alternative.

Thanks


-----Original Message-----
From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com] 
Sent: Friday, July 13, 2012 12:40 PM
To: Travis Leithead; Stefan Hakansson LK; public-media-capture@w3.org
Subject: RE: terminology (was: updates to requirements document)

I think that Milan has a different use case in mind than either Stefan or Travis is thinking of.  He wants to do speech recognition on the audio.  For that he needs to capture the audio in chunks in real time (can't wait till the user is done speaking).  He also wants to select a different, reliable transport - not UDP.  The app will be grabbing buffers of audio data and shipping them off to a remote recognition engine.  I'll let Milan explain in detail, but it is rather different from recording to a file.  (It's also very useful - Genesys and other contact center companies will be interested in using WebRTC for this, since it lets us build speech recognition into a web page.)

- Jim

-----Original Message-----
From: Travis Leithead [mailto:travis.leithead@microsoft.com]
Sent: Friday, July 13, 2012 2:59 PM
To: Jim Barnett; Stefan Hakansson LK; public-media-capture@w3.org
Subject: RE: terminology (was: updates to requirements document)

Likewise, "record" and "capture" are synonyms to me too. In general, it seems like there are some other words we could use to be more precise, since we might be having misunderstandings based on terminology, which would be unfortunate.

My understanding of the original proposal for recording (see http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-3) was that you could call a record() API to start _encoding_ the camera/mic's raw data into some binary format. Here I think the words "capture" and "record" both seem to refer to this process. At some point in the future you could call getRecordedData() (see http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-5) which would then asynchronously create a Blob object containing the encoded binary data in some known format (blob.type would indicate the MIME type for whatever encoding the UA decided to use -- there was no control or hint mechanism available via the API for encoded format selection). I believe the returned Blob was supposed to be a "complete" file, meaning that its encoding contained a definitive start and end point, and was *not* a binary slice of some larger file. In other words, the returned Blob could be played directly in the HTML audio or video tag, or saved to a file system for storage, or sent over XHR to a server.

So, when you mentioned the word "chunks" below, were you referring to the idea of calling getRecordedData() multiple times (assuming that each subsequent call reset the start-point of the next recording--which is in fact *not* how that API was specified)? Rather than "chunks" I think of these as completely separate "capture" sessions--they are complete captures from end-to-end.

When I think of "chunks" I think of incomplete segments of the larger encoded in-progress capture. The point at which the larger encoded data buffer is sliced (to make a "chunk") might be arbitrary or not. I think that is something we can discuss. If it's arbitrary, then the JavaScript processing the raw encoded "chunks" must understand the format well enough to know when there's not enough data available to correctly process a chunk, or where to stop. This is similar to how the HTML parser handles incoming bits from the wire before it determines what a page's encoding is. If we decide that the chunks must be sliced at more "appropriate" places, then the UAs must in turn implement this same logic given an understanding of the encoding in use. As an implementor, it seems like it would be much faster to just dump raw bits out of a slice arbitrarily (perhaps as quickly as possible after encoding) and let the JavaScript code deal with how to interpret them. In this case, the returned data should probably be in a TypedArray of some form.
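
As a rough illustration of what "let the JavaScript code deal with it" could involve, the TypeScript sketch below reassembles complete units from arbitrarily sliced chunks. The framing (a 2-byte big-endian length followed by the payload) and the FrameAssembler class are hypothetical stand-ins invented for this example; real codecs and containers mark their boundaries differently, and nothing here is taken from a specification.

```typescript
// Hypothetical framing for illustration only: each frame is a 2-byte big-endian
// payload length followed by the payload bytes. Real encodings differ.
class FrameAssembler {
  // Bytes received so far that do not yet form a complete frame.
  private pending = new Uint8Array(0);

  // Accepts an arbitrarily sliced chunk and returns the frames it completes.
  push(chunk: Uint8Array): Uint8Array[] {
    // Append the new chunk to whatever was left over from earlier calls.
    const merged = new Uint8Array(this.pending.length + chunk.length);
    merged.set(this.pending, 0);
    merged.set(chunk, this.pending.length);

    const view = new DataView(merged.buffer);
    const frames: Uint8Array[] = [];
    let offset = 0;
    while (merged.length - offset >= 2) {
      const payloadLength = view.getUint16(offset); // big-endian by default
      if (merged.length - offset - 2 < payloadLength) {
        break; // not enough data yet; wait for the next chunk
      }
      frames.push(merged.slice(offset + 2, offset + 2 + payloadLength));
      offset += 2 + payloadLength;
    }
    this.pending = merged.slice(offset);
    return frames;
  }
}

// Two frames, [1, 2, 3] and [9], cut at arbitrary points into three chunks.
const assembler = new FrameAssembler();
const first = assembler.push(new Uint8Array([0, 3, 1]));  // frame 1 still incomplete
const second = assembler.push(new Uint8Array([2, 3, 0])); // completes frame 1, half a header left over
const third = assembler.push(new Uint8Array([1, 9]));     // completes frame 2
console.log(first.length, second.length, third.length);
```

If the UA instead sliced at "appropriate" places, this buffering logic would move out of the page script and into the UA, which is the trade-off described above.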



> -----Original Message-----
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Friday, July 13, 2012 6:16 AM
> To: Stefan Hakansson LK; public-media-capture@w3.org
> Subject: RE: updates to requirements document
> 
> Stefan,
> 
>   English is my native language and I don't know the difference
> between 'capture' and 'record' either.  The requirements doc used
> 'capture' so I kept it, and introduced 'record' because that's the
> term I normally use.
> If we can agree on a single term to use, I'll gladly update the spec.
> 
> 
> - Jim
> 
> -----Original Message-----
> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
> Sent: Friday, July 13, 2012 9:06 AM
> To: public-media-capture@w3.org
> Subject: Re: updates to requirements document
> 
> Milan,
> 
> isn't your core proposal that we should have a requirement that allows
> recording of audio (and it would apply to video as well I guess) to
> files, i.e. some kind of continuous chunked recording?
> 
> I think that would make sense (and that was how the original,
> underspecified, recording function worked IIRC), and that those chunks
> would be possible to use as source in the MediaSource API proposal
> (even if my number one priority would be that those files would be
> possible to use as a source to the audio/video elements).
> 
> I do not understand why we would add words about "encoded" and so on 
> though. We don't use that kind of language in any other req, why here?
> 
> Stefan
> 
> PS English is not my native language, I would be very glad if someone
> could explain the difference between "capture" and "record" for me - I
> must admit I do not know the difference. Ideally I would like one word
> meaning something like "using a mike/cam to start producing data" and
> another one for "storing that data to a file".
> 
> 
> On 07/11/2012 06:04 PM, Young, Milan wrote:
> > Sorry if I'm missing context, but is there a counter proposal or are
> > you just warning us that this is a long haul?
> >
> > Thanks
> >
> > -----Original Message-----
> > From: Timothy B. Terriberry [mailto:tterriberry@mozilla.com]
> > Sent: Wednesday, July 11, 2012 8:50 AM
> > To: public-media-capture@w3.org
> > Subject: Re: updates to requirements document
> >
> > Randell Jesup wrote:
> >> And...  Defining the associated control information needed for
> >> decoding is a significant task, especially as it would need to be
> >> codec-agnostic.  (Which from the conversation I think you realize.)
> >> This also is an API that I believe we at Mozilla (or some of us)
> >> disagree with (though I'm not the person primarily following this;
> >> I think Robert O'Callahan and Tim Terriberry are).
> >
> > More than just codec-agnostic. It would have to be a) flexible
> > enough to support all the formats people care about (already
> > challenging by itself) while b) well-defined enough to be
> > re-implementable by every vendor in a compatible way. This leaves
> > you quite a fine needle to thread.
> >
> > I don't want people to under-estimate how much work is involved here.
> >
> >
> 
> 
> 
>
Received on Friday, 13 July 2012 21:54:20 UTC