
Re: approaches to recording

From: Cullen Jennings (fluffy) <fluffy@cisco.com>
Date: Sun, 28 Oct 2012 06:49:48 +0000
To: "Young, Milan" <Milan.Young@nuance.com>
CC: Dan Burnett <dburnett@voxeo.com>, Jim Barnett <Jim.Barnett@genesyslab.com>, Harald Alvestrand <harald@alvestrand.no>, "public-media-capture@w3.org" <public-media-capture@w3.org>
Message-ID: <C5E08FE080ACFD4DAE31E4BDBF944EB1118A6EDD@xmb-aln-x02.cisco.com>

There has been discussion at times of a protocol where the extraction of the classic speech reco feature vectors is done at the client and those vectors are then sent to the speech reco engine on the server side. This lets the features be computed from audio that has never been compressed, pushes the heavy signal-processing load to the client, keeps all the proprietary speech reco logic on the server side, and keeps the bandwidth needed between the two very low. 
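For concreteness, here is a minimal sketch of the kind of client-side front end described above. It is not from this thread, and a real deployment would compute MFCC or filterbank feature vectors per frame (as in the ETSI distributed speech recognition front ends); per-frame log energy is used here only to keep the example short. The frame/hop sizes are typical recognizer defaults, not anything mandated by a spec:

```python
import math

def frame_features(samples, rate=16000, frame_ms=25, hop_ms=10):
    """Toy client-side feature extractor: one log-energy value per frame.

    A real recognizer front end would emit an MFCC or filterbank vector
    per frame; the point here is the data shape -- one small feature
    vector every 10 ms hop instead of a raw PCM stream, which is what
    keeps the client-to-server bandwidth low.
    """
    frame_len = rate * frame_ms // 1000   # 400 samples at 16 kHz
    hop = rate * hop_ms // 1000           # 160 samples at 16 kHz
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0) on silence
    return feats

# One second of a 440 Hz tone sampled at 16 kHz
audio = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
feats = frame_features(audio)
# 98 scalar features stand in for 16000 PCM samples of this second of audio
```

Even in this toy form the bandwidth argument is visible: the feature stream is two orders of magnitude smaller than the uncompressed audio it summarizes, and it was computed from that uncompressed audio, not from a lossy codec's output.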


On Oct 18, 2012, at 2:22 , "Young, Milan" <Milan.Young@nuance.com> wrote:

> From: Dan Burnett [mailto:dburnett@voxeo.com] 
> Sent: Wednesday, October 17, 2012 10:26 AM
> To: Jim Barnett
> Cc: Harald Alvestrand; public-media-capture@w3.org
> Subject: Re: approaches to recording
>  
> I actually have been convinced for a while now that a new IETF protocol would be needed to best suit the needs of the speech recognition and synthesis industry.  The characteristics are very different from what humans need.
> [Milan] Does your request for a new “protocol” center on alternate transport/framing over the existing WebRTC methodology, or is this an entirely new scheme?
>  
>  
> Humans deal okay with gaps but need a steady-speed playout.
>  
> Speech recognition engines don't deal well with gaps but can handle big bursts of data at once.  Additionally, they don't need the data in real time, but they do need to receive the final sample very close to when the person said it.  Real-time delivery is helpful because the processor can work as the data is streamed, but the worst case is having to wait at the end for the last sample, because that delay adds directly to how soon the recognizer can respond.
> [Milan] I agree waiting for the last sample is not ideal, but I’m unclear how this would be addressed.
>  
>  
> -- dan
>  
> On Oct 12, 2012, at 11:39 AM, Jim Barnett wrote:
> 
> 
> Harald,
> The lack of real-time delivery is not normally an issue for speech recognition systems, because they run many times faster than real time, and can catch up quickly once the data is available.  So if the delays are short enough, the user will not perceive them.  And if the delays are longer, well… then speech recognition will take a long time.  People are used to stuff being slow on the internet, aren’t they?
>  
> -          Jim
>  
> From: Harald Alvestrand [mailto:harald@alvestrand.no] 
> Sent: Friday, October 12, 2012 11:35 AM
> To: public-media-capture@w3.org
> Subject: Re: approaches to recording
>  
> On 10/11/2012 12:50 AM, Jim Barnett wrote:
> I just want to observe that lossless streaming is what we (= the contact center and speech industry) want for  talking to a speech recognition system.  It would be ideal if PeerConnection supported it.  Failing that, it would be nice if the Recorder supported it,  but in a pinch we figure that we can use the track-level API to deliver buffers of speech data and let the JS code set up the TCP/IP connection. 
>  
> Of course lossless streaming (truly guaranteed delivery) implies non-real-time streaming (or, more formally, having to deal with the possibility that delivery will be delayed beyond real-time), given that the Internet is a lossy medium.
> 
> To another thread: Yes, having the constructor for the recorder take a MIME type parameter would imply that you set the codec to be used. I think we all agree that the data coming out of a recording interface is encoded.
> 
>            Harald
> 
>  
Received on Sunday, 28 October 2012 06:50:17 GMT
