- From: Young, Milan <Milan.Young@nuance.com>
- Date: Wed, 25 Jul 2012 17:03:14 +0000
- To: Stefan Hakansson LK <stefan.lk.hakansson@ericsson.com>
- CC: Jim Barnett <Jim.Barnett@genesyslab.com>, Travis Leithead <travis.leithead@microsoft.com>, "public-media-capture@w3.org" <public-media-capture@w3.org>
The SpeechXG report was a combination document. It contained a proposed API for interfacing with a speech engine, and a method for transmitting the audio and control commands to a remote engine. Google recently took a subset of the API and used it as a starting point for a Community Group specification [1]. There has been no progress towards a remote engine interface.

[1] http://www.w3.org/community/speech-api/

-----Original Message-----
From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
Sent: Wednesday, July 25, 2012 9:31 AM
To: Young, Milan
Cc: Jim Barnett; Travis Leithead; public-media-capture@w3.org
Subject: Re: terminology (was: updates to requirements document)

On 07/25/2012 06:21 PM, Young, Milan wrote:
> The use case for translation is already established in the document.
> If we can't agree on a method for addressing that use case over email,
> then we should simply add a general requirement as a placeholder. I
> suggest:
>
> "The UA must expose capabilities for transmitting audio suitable for
> live speech recognition."
>
> Objections?

More of a question right now: what happened to the API that was proposed
for this purpose (if I understand correctly)? It is part of the final
report of the Speech XG
(http://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech-20111206/),
and I think I heard rumors of a variant of it being proposed for
implementation.

Br,
Stefan

>
> -----Original Message-----
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Monday, July 23, 2012 7:15 AM
> To: Young, Milan; Stefan Hakansson LK; Travis Leithead
> Cc: public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> I think that the speech recognition use case is important, so I agree
> with Milan. If we can't agree on something via email, we should add
> this as a topic for the next F2F.
>
> - Jim
>
> By the way, are there any other comments on the requirements doc? I'm
> ready to make more changes.
> -----Original Message-----
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, July 23, 2012 10:12 AM
> To: Stefan Hakansson LK; Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> I'd like to keep this discussion active. Are folks in agreement with
> what I've written below? If not, is there a planned F2F where we could
> add this to an agenda?
>
> Thanks
>
> -----Original Message-----
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, July 16, 2012 8:40 AM
> To: Stefan Hakansson LK; Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> Perhaps we're dealing with different use cases, but for the
> translation scenario, requiring the application layer to poll the UA
> for complete audio snippets is not optimal. It would tend both to
> produce irregular intervals and to add significant overhead to the
> encoding.
>
> I suggest that it would be better to use something like the WebAudio
> API proposal: an interface where the UA pushes blobs to the JS layer
> at a fixed interval. The data in the blobs would be complete from an
> encoding perspective, but only directly playable when prepended with
> all previous blobs in the session.
>
> I also suggest that the application layer should be given the ability
> to select from the available codecs at the start of the capture
> session. If this is too complicated, then we should specify that the
> UA SHOULD prefer codecs optimized for voice, since that would be the
> most common audio type originating from a desktop microphone.
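[Editorial note: the push model quoted above can be sketched concretely. The following is a hypothetical illustration only; the class and method names are invented and do not come from any W3C proposal. It models a session that pushes encoded chunks to the application at a fixed interval, where each chunk is valid encoder output but is only playable when prepended with everything emitted before it.]

```javascript
// Hypothetical sketch (names invented): a capture session that pushes
// encoded chunks to the application, rather than being polled for
// complete snippets. Chunk N is assumed to be decodable only when
// prepended with chunks 0..N-1, as in the proposal quoted above.
class ChunkedCaptureSession {
  constructor(ondata) {
    this.ondata = ondata; // callback receiving each pushed chunk
    this.chunks = [];     // everything emitted so far in this session
  }

  // Stand-in for the UA's encoder emitting a fixed-interval chunk.
  push(chunk) {
    this.chunks.push(chunk);
    this.ondata(chunk);
  }

  // A directly playable unit is the concatenation of all chunks so far.
  playableSoFar() {
    return Buffer.concat(this.chunks);
  }
}

// Simulated use: the application streams each chunk to a recognizer as
// it arrives, instead of polling the UA for complete audio snippets.
const sent = [];
const session = new ChunkedCaptureSession((chunk) => sent.push(chunk));
session.push(Buffer.from('HEAD')); // e.g. container/codec header
session.push(Buffer.from('aaa'));
session.push(Buffer.from('bbb'));

console.log(sent.length);                        // 3 chunks delivered
console.log(session.playableSoFar().toString()); // "HEADaaabbb"
```

For comparison, the MediaRecorder API that was later standardized exposes a similar push pattern through its `dataavailable` event and `timeslice` argument.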
> Thanks
>
> -----Original Message-----
> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
> Sent: Monday, July 16, 2012 12:24 AM
> To: Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: Re: terminology (was: updates to requirements document)
>
> On 07/13/2012 08:58 PM, Travis Leithead wrote:
>> Likewise, "record" and "capture" are synonyms to me too. In general,
>> it seems like there are some other words we could use to be more
>> precise, since we might be having misunderstandings based on
>> terminology, which would be unfortunate.
>
> I would like that. I would like one word for "enabling the mike/cam to
> start producing samples" (this would correspond to what "getUserMedia"
> does), and another for storing those samples to a file.
>
>> My understanding of the original proposal for recording (see
>> http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-3) was that you
>> could call a record() API to start _encoding_ the camera/mic's raw
>> data into some binary format. Here I think the words "capture" and
>> "record" both seem to refer to this process. At some point in the
>> future you could call getRecordedData() (see
>> http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-5), which would
>> then asynchronously create a Blob object containing the encoded
>> binary data in some known format (blob.type would indicate the MIME
>> type for the encoding, whatever the UA decided to use -- there was no
>> control or hint mechanism available via the API for encoded format
>> selection). I believe the returned Blob was supposed to be a
>> "complete" file, meaning that its encoding contained a definitive
>> start and end point, and was *not* a binary slice of some larger
>> file. In other words, the returned Blob could be played directly in
>> the HTML audio or video tag, or saved to a file system for storage,
>> or sent over XHR to a server.
>>
>> So, when you mentioned the word "chunks" below, were you referring to
>> the idea of calling getRecordedData() multiple times (assuming that
>> each subsequent call reset the start-point of the next recording --
>> which is actually *not* how that API was specified, in fact)? Rather
>> than "chunks" I think of these as completely separate "capture"
>> sessions -- they are complete captures from end to end.
>
> I must admit I had not thought this through in detail. I had in mind
> something that would allow you to continuously record, but spit out
> the result in smaller segments ("chunks"). I had not thought about how
> the application should act to get that done.
>
>> When I think of "chunks" I think of incomplete segments of the larger
>> encoded in-progress capture. The point at which the larger encoded
>> data buffer is sliced (to make a "chunk") might be arbitrary or not;
>> I think that is something we can discuss. If it's arbitrary, then the
>> JavaScript processing the raw encoded "chunks" must understand the
>> format well enough to know when there's not enough data available to
>> correctly process a chunk, or where to stop. This is similar to how
>> the HTML parser handles incoming bits from the wire before it
>> determines what a page's encoding is. If we decide that the chunks
>> must be sliced at more "appropriate" places, then the UAs must in
>> turn implement this same logic, given an understanding of the
>> encoding in use. As an implementor, it seems like it would be much
>> faster to just dump raw bits out of a slice arbitrarily (perhaps as
>> quickly as possible after encoding) and let the JavaScript code deal
>> with how to interpret them. In this case, the returned data should
>> probably be in a TypedArray of some form.
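[Editorial note: the "arbitrary slicing" option discussed above puts the reassembly burden on the script. The sketch below is a hypothetical illustration of that consumer-side buffering logic; a real consumer would need to understand the actual codec, so a toy length-prefixed framing (one length byte followed by the payload) stands in for that format knowledge here.]

```javascript
// Hypothetical sketch: the UA dumps arbitrary slices of the encoded
// stream, and the script buffers them until enough data is available
// to process a complete unit -- the situation described above when
// chunks are sliced at arbitrary points.
class ChunkReassembler {
  constructor() {
    this.pending = Buffer.alloc(0); // bytes received but not yet parseable
    this.frames = [];               // complete frames recovered so far
  }

  // Accept an arbitrarily sliced chunk and extract whatever complete
  // frames it completes; any remainder waits for the next chunk.
  feed(slice) {
    this.pending = Buffer.concat([this.pending, slice]);
    while (this.pending.length >= 1) {
      const len = this.pending[0];               // toy 1-byte length prefix
      if (this.pending.length < 1 + len) break;  // frame not complete yet
      this.frames.push(this.pending.subarray(1, 1 + len).toString());
      this.pending = this.pending.subarray(1 + len);
    }
  }
}

// Two frames ("hi", "there") sliced at arbitrary points, as a UA might
// emit them straight out of the encoder.
const r = new ChunkReassembler();
r.feed(Buffer.from([2, 0x68]));             // "hi" split mid-frame
r.feed(Buffer.from([0x69, 5, 0x74, 0x68])); // rest of "hi", start of "there"
r.feed(Buffer.from([0x65, 0x72, 0x65]));    // rest of "there"

console.log(r.frames); // [ 'hi', 'there' ]
```

The alternative discussed above -- slicing only at "appropriate" places -- would move this same boundary-finding logic from the script into the UA.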
>>
>>> -----Original Message-----
>>> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
>>> Sent: Friday, July 13, 2012 6:16 AM
>>> To: Stefan Hakansson LK; public-media-capture@w3.org
>>> Subject: RE: updates to requirements document
>>>
>>> Stefan,
>>>
>>> English is my native language and I don't know the difference
>>> between 'capture' and 'record' either. The requirements doc used
>>> 'capture' so I kept it, and introduced 'record' because that's the
>>> term I normally use. If we can agree on a single term to use, I'll
>>> gladly update the spec.
>>>
>>> - Jim
>>>
>>> -----Original Message-----
>>> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
>>> Sent: Friday, July 13, 2012 9:06 AM
>>> To: public-media-capture@w3.org
>>> Subject: Re: updates to requirements document
>>>
>>> Milan,
>>>
>>> isn't your core proposal that we should have a requirement that
>>> allows recording of audio (and it would apply to video as well, I
>>> guess) to a file, i.e. some kind of continuous chunked recording?
>>>
>>> I think that would make sense (and that was how the original,
>>> underspecified, recording function worked IIRC), and that those
>>> chunks would be possible to use as a source in the MediaSource API
>>> proposal (even if my number one priority would be that those files
>>> could be used as a source for the audio/video elements).
>>>
>>> I do not understand why we would add words about "encoded" and so
>>> on, though. We don't use that kind of language in any other req, so
>>> why here?
>>>
>>> Stefan
>>>
>>> PS English is not my native language; I would be very glad if
>>> someone could explain the difference between "capture" and "record"
>>> for me -- I must admit I do not know the difference. Ideally I would
>>> like one word meaning something like "using a mike/cam to start
>>> producing data" and another one for "storing that data to a file".
>>>
>>> On 07/11/2012 06:04 PM, Young, Milan wrote:
>>>> Sorry if I'm missing context, but is there a counter proposal, or
>>>> are you just warning us that this is a long haul?
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Timothy B. Terriberry [mailto:tterriberry@mozilla.com]
>>>> Sent: Wednesday, July 11, 2012 8:50 AM
>>>> To: public-media-capture@w3.org
>>>> Subject: Re: updates to requirements document
>>>>
>>>> Randell Jesup wrote:
>>>>> And... Defining the associated control information needed for
>>>>> decoding is a significant task, especially as it would need to be
>>>>> codec-agnostic. (Which from the conversation I think you realize.)
>>>>> This also is an API that I believe we at Mozilla (or some of us)
>>>>> disagree with (though I'm not the person primarily following this;
>>>>> I think Robert O'Callahan and Tim Terriberry are).
>>>>
>>>> More than just codec-agnostic. It would have to be a) flexible
>>>> enough to support all the formats people care about (already
>>>> challenging by itself) while b) well-defined enough to be
>>>> re-implementable by every vendor in a compatible way. This leaves
>>>> you quite a fine needle to thread.
>>>>
>>>> I don't want people to underestimate how much work is involved
>>>> here.
Received on Wednesday, 25 July 2012 17:03:47 UTC