- From: Stefan Hakansson LK <stefan.lk.hakansson@ericsson.com>
- Date: Wed, 25 Jul 2012 18:31:22 +0200
- To: "Young, Milan" <Milan.Young@nuance.com>
- CC: Jim Barnett <Jim.Barnett@genesyslab.com>, Travis Leithead <travis.leithead@microsoft.com>, "public-media-capture@w3.org" <public-media-capture@w3.org>
On 07/25/2012 06:21 PM, Young, Milan wrote:
> The use case for translation is already established in the document.
> If we can't agree on a method for addressing that use case over
> email, then we should simply add a general requirement as a
> placeholder. I suggest:
>
> "The UA must expose capabilities for transmitting audio suitable for
> live speech recognition."
>
> Objections?

More of a question right now: what happened to the API that (if I
understand correctly) was proposed for this purpose? It is part of the
final report of the Speech XG
(http://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech-20111206/),
and I think I heard rumors of a variant of it being proposed for
implementation.

Br,
Stefan

> -----Original Message-----
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Monday, July 23, 2012 7:15 AM
> To: Young, Milan; Stefan Hakansson LK; Travis Leithead
> Cc: public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> I think that the speech recognition use case is important, so I agree
> with Milan. If we can't agree on something via email, we should add
> this as a topic for the next F2F.
>
> - Jim
>
> By the way, are there any other comments on the requirements doc? I'm
> ready to make more changes.
>
> -----Original Message-----
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, July 23, 2012 10:12 AM
> To: Stefan Hakansson LK; Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> I'd like to keep this discussion active. Are folks in agreement with
> what I've written below? If not, is there a planned F2F where we
> could add this to an agenda?
> Thanks
>
> -----Original Message-----
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, July 16, 2012 8:40 AM
> To: Stefan Hakansson LK; Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> Perhaps we're dealing with different use cases, but for the
> translation scenario, requiring the application layer to poll the UA
> for complete audio snippets is not optimal. This would tend both to
> produce irregular intervals and to add significant overhead to the
> encoding.
>
> I suggest that it would be better to use something like the WebAudio
> API proposal, namely an interface where the UA pushes blobs to the JS
> layer at a fixed interval. The data in the blobs would be complete
> from an encoding perspective, but only directly playable when
> prepended with all previous blobs in the session.
>
> I also suggest that the application layer should be given the ability
> to select from available codecs at the start of the capture session.
> If this is too complicated, then we should specify that the UA SHOULD
> prefer codecs optimized for voice, since that would be the most
> common audio type originating from the desktop microphone.
>
> Thanks
>
> -----Original Message-----
> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
> Sent: Monday, July 16, 2012 12:24 AM
> To: Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: Re: terminology (was: updates to requirements document)
>
> On 07/13/2012 08:58 PM, Travis Leithead wrote:
>> Likewise, "record" and "capture" are synonyms to me too. In
>> general, it seems like there are some other words we could use to
>> be more precise, since we might be having misunderstandings based
>> on terminology, which would be unfortunate.
>
> I would like that. I would like one word for "enabling the mike/cam
> to start producing samples". This would correspond to what
> "getUserMedia" does.
> And another for storing those samples to a file.
>
>> My understanding of the original proposal for recording (see
>> http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-3) was that
>> you could call a record() API to start _encoding_ the camera/mic's
>> raw data into some binary format. Here I think the words "capture"
>> and "record" both seem to refer to this process. At some point in
>> the future you could call getRecordedData() (see
>> http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-5), which
>> would then asynchronously create a Blob object containing the
>> encoded binary data in some known format (blob.type would indicate
>> the MIME type for the encoding, whatever the UA decided to use --
>> there was no control or hint mechanism available via the API for
>> encoded-format selection). I believe the returned Blob was supposed
>> to be a "complete" file, meaning that its encoding contained a
>> definitive start and end point, and was *not* a binary slice of some
>> larger file. In other words, the returned Blob could be played
>> directly in the HTML audio or video tag, saved to a file system for
>> storage, or sent over XHR to a server.
>>
>> So, when you mentioned the word "chunks" below, were you referring
>> to the idea of calling getRecordedData() multiple times (assuming
>> that each subsequent call reset the start point of the next
>> recording -- which is actually *not* how that API was specified)?
>> Rather than "chunks" I think of these as completely separate
>> "capture" sessions -- they are complete captures from end to end.
>
> I must admit I had not thought this through in detail. I had in mind
> something that would allow you to continuously record, but spit out
> the result in smaller segments ("chunks"). I had not thought about
> how the application should act to get that done.
>
>> When I think of "chunks" I think of incomplete segments of the
>> larger encoded in-progress capture.
>> The point at which the larger encoded data buffer is sliced (to
>> make a "chunk") might be arbitrary or not; I think that is
>> something we can discuss. If it's arbitrary, then the JavaScript
>> processing the raw encoded "chunks" must understand the format well
>> enough to know when there's not enough data available to correctly
>> process a chunk, or where to stop. This is similar to how the HTML
>> parser handles incoming bits from the wire before it determines
>> what a page's encoding is. If we decide that the chunks must be
>> sliced at more "appropriate" places, then the UAs must in turn
>> implement this same logic given an understanding of the encoding in
>> use. As an implementor, it seems like it would be much faster to
>> just dump raw bits out of a slice arbitrarily (perhaps as quickly
>> as possible after encoding) and let the JavaScript code deal with
>> how to interpret them. In this case, the returned data should
>> probably be in a TypedArray of some form.
>>
>>> -----Original Message-----
>>> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
>>> Sent: Friday, July 13, 2012 6:16 AM
>>> To: Stefan Hakansson LK; public-media-capture@w3.org
>>> Subject: RE: updates to requirements document
>>>
>>> Stefan,
>>>
>>> English is my native language and I don't know the difference
>>> between 'capture' and 'record' either. The requirements doc used
>>> 'capture' so I kept it, and introduced 'record' because that's the
>>> term I normally use. If we can agree on a single term to use, I'll
>>> gladly update the spec.
>>>
>>> - Jim
>>>
>>> -----Original Message-----
>>> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
>>> Sent: Friday, July 13, 2012 9:06 AM
>>> To: public-media-capture@w3.org
>>> Subject: Re: updates to requirements document
>>>
>>> Milan,
>>>
>>> isn't your core proposal that we should have a requirement that
>>> allows recording of audio (and I guess it would apply to video as
>>> well) to a file, i.e. some kind of continuous chunked recording?
>>>
>>> I think that would make sense (and that was how the original,
>>> underspecified, recording function worked IIRC), and that those
>>> chunks would be possible to use as a source in the MediaSource API
>>> proposal (even if my number one priority would be that those files
>>> would be possible to use as a source to the audio/video elements).
>>>
>>> I do not understand why we would add words about "encoded" and so
>>> on, though. We don't use that kind of language in any other req;
>>> why here?
>>>
>>> Stefan
>>>
>>> PS English is not my native language; I would be very glad if
>>> someone could explain the difference between "capture" and
>>> "record" for me -- I must admit I do not know the difference.
>>> Ideally I would like one word meaning something like "using a
>>> mike/cam to start producing data" and another one for "storing
>>> that data to a file".
>>>
>>> On 07/11/2012 06:04 PM, Young, Milan wrote:
>>>> Sorry if I'm missing context, but is there a counter proposal, or
>>>> are you just warning us that this is a long haul?
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Timothy B. Terriberry [mailto:tterriberry@mozilla.com]
>>>> Sent: Wednesday, July 11, 2012 8:50 AM
>>>> To: public-media-capture@w3.org
>>>> Subject: Re: updates to requirements document
>>>>
>>>> Randell Jesup wrote:
>>>>> And... Defining the associated control information needed for
>>>>> decoding is a significant task, especially as it would need to
>>>>> be codec-agnostic. (Which from the conversation I think you
>>>>> realize.) This also is an API that I believe we at Mozilla (or
>>>>> some of us) disagree with (though I'm not the person primarily
>>>>> following this; I think Robert O'Callahan and Tim Terriberry
>>>>> are).
>>>>
>>>> More than just codec-agnostic.
>>>> It would have to be a) flexible enough to support all the formats
>>>> people care about (already challenging by itself) while b)
>>>> well-defined enough to be re-implementable by every vendor in a
>>>> compatible way. This leaves you quite a fine needle to thread.
>>>>
>>>> I don't want people to under-estimate how much work is involved
>>>> here.
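[Editor's note: for readers following the chunked-delivery discussion
above, here is a minimal sketch in plain JavaScript of the model Milan
describes -- the UA pushing a chunk at a fixed interval, where each
chunk is valid encoder output but is only playable when prepended with
every earlier chunk in the session. The class and method names are
hypothetical illustrations, not any API proposed in this thread.]

```javascript
// Sketch of the fixed-interval "pushed blob" model discussed above.
// Hypothetical names; chunks are modeled as Uint8Arrays of encoded
// bytes rather than real Blob objects so the logic is self-contained.

class ChunkedRecording {
  constructor() {
    this.chunks = []; // one Uint8Array per delivery interval
  }

  // Called by the (simulated) UA once per timeslice. A chunk on its
  // own is not a complete file -- it continues the session's stream.
  push(chunk) {
    this.chunks.push(chunk);
  }

  // A playable snapshot is the concatenation of all chunks so far,
  // i.e. each chunk prepended with all previous chunks.
  snapshot() {
    const total = this.chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    return out;
  }
}

// Simulate three delivery intervals.
const rec = new ChunkedRecording();
rec.push(new Uint8Array([1, 2]));    // e.g. header plus first samples
rec.push(new Uint8Array([3, 4, 5])); // continuation: not standalone
rec.push(new Uint8Array([6]));

console.log(Array.from(rec.snapshot())); // [ 1, 2, 3, 4, 5, 6 ]
```

A real UA implementation would emit actual Blobs and fire an event per
interval; the point of the sketch is only the prepend-to-play
accumulation rule, which avoids the polling and re-encoding overhead
Milan objects to.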
Received on Wednesday, 25 July 2012 16:31:52 UTC