- From: Young, Milan <Milan.Young@nuance.com>
- Date: Wed, 25 Jul 2012 17:03:14 +0000
- To: Stefan Hakansson LK <stefan.lk.hakansson@ericsson.com>
- CC: Jim Barnett <Jim.Barnett@genesyslab.com>, Travis Leithead <travis.leithead@microsoft.com>, "public-media-capture@w3.org" <public-media-capture@w3.org>
The SpeechXG report was a combination document. It contained a proposed API for interfacing with a speech engine, and a method for transmitting the audio and control commands to a remote engine. Google recently took a subset of the API and used it as a starting point for a Community Group specification [1]. There has been no progress towards a remote engine interface.

[1] http://www.w3.org/community/speech-api/

-----Original Message-----
From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
Sent: Wednesday, July 25, 2012 9:31 AM
To: Young, Milan
Cc: Jim Barnett; Travis Leithead; public-media-capture@w3.org
Subject: Re: terminology (was: updates to requirements document)

On 07/25/2012 06:21 PM, Young, Milan wrote:
> The use case for translation is already established in the document.
> If we can't agree on a method for addressing that use case over email,
> then we should simply add a general requirement as a placeholder. I
> suggest:
>
> "The UA must expose capabilities for transmitting audio suitable for
> live speech recognition."
>
> Objections?

More of a question right now: what happened to the API that was proposed
for this purpose (if I understand correctly)? It is part of the final
report of the Speech XG
(http://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech-20111206/),
and I think I heard rumors of a variant of it being proposed for
implementation.

Br,
Stefan

>
> -----Original Message-----
> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
> Sent: Monday, July 23, 2012 7:15 AM
> To: Young, Milan; Stefan Hakansson LK; Travis Leithead
> Cc: public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> I think that the speech recognition use case is important, so I agree
> with Milan. If we can't agree on something via email, we should add
> this as a topic for the next F2F.
>
> - Jim
>
> By the way, are there any other comments on the requirements doc? I'm
> ready to make more changes.
> -----Original Message-----
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, July 23, 2012 10:12 AM
> To: Stefan Hakansson LK; Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> I'd like to keep this discussion active. Are folks in agreement with
> what I've written below? If not, is there a planned F2F where we could
> add this to an agenda?
>
> Thanks
>
> -----Original Message-----
> From: Young, Milan [mailto:Milan.Young@nuance.com]
> Sent: Monday, July 16, 2012 8:40 AM
> To: Stefan Hakansson LK; Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: RE: terminology (was: updates to requirements document)
>
> Perhaps we're dealing with different use cases, but for the
> translation scenario, requiring the application layer to poll the UA
> for complete audio snippets is not optimal. It would tend both to
> produce irregular intervals and to add significant overhead to the
> encoding.
>
> I suggest that it would be better to use something like the WebAudio
> API proposal: an interface where the UA pushes blobs to the JS layer
> at a fixed interval. The data in the blobs would be complete from an
> encoding perspective, but only directly playable when prepended with
> all previous blobs in the session.
>
> I also suggest that the application layer should be given the ability
> to select from the available codecs at the start of the capture
> session. If this is too complicated, then we should specify that the
> UA SHOULD prefer codecs optimized for voice, since that would be the
> most common audio type originating from a desktop microphone.
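[Editorial note: the push model quoted above can be sketched concretely. The following is a hypothetical illustration only; the class and method names are invented and do not come from any W3C proposal. It models a session that pushes encoded chunks to the application at a fixed interval, where each chunk is valid encoder output but is only playable when prepended with everything emitted before it.]

```javascript
// Hypothetical sketch (names invented): a capture session that pushes
// encoded chunks to the application, rather than being polled for
// complete snippets. Chunk N is assumed to be decodable only when
// prepended with chunks 0..N-1, as in the proposal quoted above.
class ChunkedCaptureSession {
  constructor(ondata) {
    this.ondata = ondata; // callback receiving each pushed chunk
    this.chunks = [];     // everything emitted so far in this session
  }

  // Stand-in for the UA's encoder emitting a fixed-interval chunk.
  push(chunk) {
    this.chunks.push(chunk);
    this.ondata(chunk);
  }

  // A directly playable unit is the concatenation of all chunks so far.
  playableSoFar() {
    return Buffer.concat(this.chunks);
  }
}

// Simulated use: the application streams each chunk to a recognizer as
// it arrives, instead of polling the UA for complete audio snippets.
const sent = [];
const session = new ChunkedCaptureSession((chunk) => sent.push(chunk));
session.push(Buffer.from('HEAD')); // e.g. container/codec header
session.push(Buffer.from('aaa'));
session.push(Buffer.from('bbb'));

console.log(sent.length);                        // 3 chunks delivered
console.log(session.playableSoFar().toString()); // "HEADaaabbb"
```

For comparison, the MediaRecorder API that was later standardized exposes a similar push pattern through its `dataavailable` event and `timeslice` argument.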
> Thanks
>
> -----Original Message-----
> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
> Sent: Monday, July 16, 2012 12:24 AM
> To: Travis Leithead
> Cc: Jim Barnett; public-media-capture@w3.org
> Subject: Re: terminology (was: updates to requirements document)
>
> On 07/13/2012 08:58 PM, Travis Leithead wrote:
>> Likewise, "record" and "capture" are synonyms to me too. In general,
>> it seems like there are some other words we could use to be more
>> precise, since we might be having misunderstandings based on
>> terminology, which would be unfortunate.
>
> I would like that. I would like one word for "enabling the mike/cam to
> start producing samples" (this would correspond to what "getUserMedia"
> does), and another for storing those samples to a file.
>
>> My understanding of the original proposal for recording (see
>> http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-3) was that you
>> could call a record() API to start _encoding_ the camera/mic's raw
>> data into some binary format. Here I think the words "capture" and
>> "record" both seem to refer to this process. At some point in the
>> future you could call getRecordedData() (see
>> http://www.w3.org/TR/2011/WD-webrtc-20111027/#methods-5), which would
>> then asynchronously create a Blob object containing the encoded
>> binary data in some known format (blob.type would indicate the MIME
>> type for the encoding, whatever the UA decided to use -- there was no
>> control or hint mechanism available via the API for encoded format
>> selection). I believe the returned Blob was supposed to be a
>> "complete" file, meaning that its encoding contained a definitive
>> start and end point, and was *not* a binary slice of some larger
>> file. In other words, the returned Blob could be played directly in
>> the HTML audio or video tag, or saved to a file system for storage,
>> or sent over XHR to a server.
>>
>> So, when you mentioned the word "chunks" below, were you referring to
>> the idea of calling getRecordedData() multiple times (assuming that
>> each subsequent call reset the start-point of the next recording --
>> which is actually *not* how that API was specified, in fact)? Rather
>> than "chunks" I think of these as completely separate "capture"
>> sessions -- they are complete captures from end to end.
>
> I must admit I had not thought this through in detail. I had in mind
> something that would allow you to continuously record, but spit out
> the result in smaller segments ("chunks"). I had not thought about how
> the application should act to get that done.
>
>> When I think of "chunks" I think of incomplete segments of the larger
>> encoded in-progress capture. The point at which the larger encoded
>> data buffer is sliced (to make a "chunk") might be arbitrary or not;
>> I think that is something we can discuss. If it's arbitrary, then the
>> JavaScript processing the raw encoded "chunks" must understand the
>> format well enough to know when there's not enough data available to
>> correctly process a chunk, or where to stop. This is similar to how
>> the HTML parser handles incoming bits from the wire before it
>> determines what a page's encoding is. If we decide that the chunks
>> must be sliced at more "appropriate" places, then the UAs must in
>> turn implement this same logic, given an understanding of the
>> encoding in use. As an implementor, it seems like it would be much
>> faster to just dump raw bits out of a slice arbitrarily (perhaps as
>> quickly as possible after encoding) and let the JavaScript code deal
>> with how to interpret them. In this case, the returned data should
>> probably be in a TypedArray of some form.
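[Editorial note: the "arbitrary slicing" option discussed above puts the reassembly burden on the script. The sketch below is a hypothetical illustration of that consumer-side buffering logic; a real consumer would need to understand the actual codec, so a toy length-prefixed framing (one length byte followed by the payload) stands in for that format knowledge here.]

```javascript
// Hypothetical sketch: the UA dumps arbitrary slices of the encoded
// stream, and the script buffers them until enough data is available
// to process a complete unit -- the situation described above when
// chunks are sliced at arbitrary points.
class ChunkReassembler {
  constructor() {
    this.pending = Buffer.alloc(0); // bytes received but not yet parseable
    this.frames = [];               // complete frames recovered so far
  }

  // Accept an arbitrarily sliced chunk and extract whatever complete
  // frames it completes; any remainder waits for the next chunk.
  feed(slice) {
    this.pending = Buffer.concat([this.pending, slice]);
    while (this.pending.length >= 1) {
      const len = this.pending[0];               // toy 1-byte length prefix
      if (this.pending.length < 1 + len) break;  // frame not complete yet
      this.frames.push(this.pending.subarray(1, 1 + len).toString());
      this.pending = this.pending.subarray(1 + len);
    }
  }
}

// Two frames ("hi", "there") sliced at arbitrary points, as a UA might
// emit them straight out of the encoder.
const r = new ChunkReassembler();
r.feed(Buffer.from([2, 0x68]));             // "hi" split mid-frame
r.feed(Buffer.from([0x69, 5, 0x74, 0x68])); // rest of "hi", start of "there"
r.feed(Buffer.from([0x65, 0x72, 0x65]));    // rest of "there"

console.log(r.frames); // [ 'hi', 'there' ]
```

The alternative discussed above -- slicing only at "appropriate" places -- would move this same boundary-finding logic from the script into the UA.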
>>
>>> -----Original Message-----
>>> From: Jim Barnett [mailto:Jim.Barnett@genesyslab.com]
>>> Sent: Friday, July 13, 2012 6:16 AM
>>> To: Stefan Hakansson LK; public-media-capture@w3.org
>>> Subject: RE: updates to requirements document
>>>
>>> Stefan,
>>>
>>> English is my native language and I don't know the difference
>>> between 'capture' and 'record' either. The requirements doc used
>>> 'capture' so I kept it, and introduced 'record' because that's the
>>> term I normally use. If we can agree on a single term to use, I'll
>>> gladly update the spec.
>>>
>>> - Jim
>>>
>>> -----Original Message-----
>>> From: Stefan Hakansson LK [mailto:stefan.lk.hakansson@ericsson.com]
>>> Sent: Friday, July 13, 2012 9:06 AM
>>> To: public-media-capture@w3.org
>>> Subject: Re: updates to requirements document
>>>
>>> Milan,
>>>
>>> isn't your core proposal that we should have a requirement that
>>> allows recording of audio (and it would apply to video as well, I
>>> guess) to a file, i.e. some kind of continuous chunked recording?
>>>
>>> I think that would make sense (and that was how the original,
>>> underspecified, recording function worked IIRC), and that those
>>> chunks would be possible to use as a source in the MediaSource API
>>> proposal (even if my number one priority would be that those files
>>> could be used as a source for the audio/video elements).
>>>
>>> I do not understand why we would add words about "encoded" and so
>>> on, though. We don't use that kind of language in any other req, so
>>> why here?
>>>
>>> Stefan
>>>
>>> PS English is not my native language; I would be very glad if
>>> someone could explain the difference between "capture" and "record"
>>> for me -- I must admit I do not know the difference. Ideally I would
>>> like one word meaning something like "using a mike/cam to start
>>> producing data" and another one for "storing that data to a file".
>>>
>>> On 07/11/2012 06:04 PM, Young, Milan wrote:
>>>> Sorry if I'm missing context, but is there a counter proposal, or
>>>> are you just warning us that this is a long haul?
>>>>
>>>> Thanks
>>>>
>>>> -----Original Message-----
>>>> From: Timothy B. Terriberry [mailto:tterriberry@mozilla.com]
>>>> Sent: Wednesday, July 11, 2012 8:50 AM
>>>> To: public-media-capture@w3.org
>>>> Subject: Re: updates to requirements document
>>>>
>>>> Randell Jesup wrote:
>>>>> And... Defining the associated control information needed for
>>>>> decoding is a significant task, especially as it would need to be
>>>>> codec-agnostic. (Which from the conversation I think you realize.)
>>>>> This also is an API that I believe we at Mozilla (or some of us)
>>>>> disagree with (though I'm not the person primarily following this;
>>>>> I think Robert O'Callahan and Tim Terriberry are).
>>>>
>>>> More than just codec-agnostic. It would have to be a) flexible
>>>> enough to support all the formats people care about (already
>>>> challenging by itself) while b) well-defined enough to be
>>>> re-implementable by every vendor in a compatible way. This leaves
>>>> you quite a fine needle to thread.
>>>>
>>>> I don't want people to underestimate how much work is involved
>>>> here.
Received on Wednesday, 25 July 2012 17:03:47 UTC