RE: approaches to recording from Young, Milan on 2012-10-18 (public-media-capture@w3.org from October 2012)

From: Young, Milan <Milan.Young@nuance.com>
Date: Thu, 18 Oct 2012 00:22:39 +0000
To: Dan Burnett <dburnett@voxeo.com>, Jim Barnett <Jim.Barnett@genesyslab.com>
CC: Harald Alvestrand <harald@alvestrand.no>, "public-media-capture@w3.org" <public-media-capture@w3.org>
Message-ID: <B236B24082A4094A85003E8FFB8DDC3C1A4ABDDF@SOM-EXCH04.nuance.com>

From: Dan Burnett [mailto:dburnett@voxeo.com]
Sent: Wednesday, October 17, 2012 10:26 AM
To: Jim Barnett
Cc: Harald Alvestrand; public-media-capture@w3.org
Subject: Re: approaches to recording

I actually have been convinced for a while now that a new IETF protocol would be needed to best suit the needs of the speech recognition and synthesis industry.  The characteristics are very different from what humans need.
[Milan] Does your request for a new "protocol" center on alternate transport/framing over the existing WebRTC methodology, or is this an entirely new scheme?

Humans deal okay with gaps but need a steady-speed play out.

Speech recognition engines don't deal well with gaps but can handle big bursts of data at once.  Additionally, they don't need the data in real-time but do need to receive the final sample very close to when the person said it.  Real-time is helpful because the processor can be working as the data is streamed, but the worst thing is to have to wait at the end for the last sample, because that delay adds on to when the recognizer can respond.
[Milan] I agree waiting for the last sample is not ideal, but I'm unclear how this would be addressed.

-- dan

On Oct 12, 2012, at 11:39 AM, Jim Barnett wrote:

Harald,
The lack of real-time delivery is not normally an issue for speech recognition systems, because they run many times faster than real time, and can catch up quickly once the data is available.  So if the delays are short enough, the user will not perceive them.  And if the delays are longer, well... then speech recognition will take a long time.  People are used to stuff being slow on the internet, aren't they?

- Jim
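
The catch-up argument above, together with Dan's point about the final sample, can be illustrated with a small sketch. This is a toy model and not anything proposed in the thread: the function name, the chunk size, the 10x speed-up, and the arrival times are all assumptions chosen for illustration. It assumes a recognizer that processes chunks in order and can only start a chunk once it has arrived.

```python
def recognition_finish_time(arrivals, chunk_seconds, speedup):
    """Return the time (seconds) at which the recognizer finishes the last chunk.

    arrivals:      arrival time of each audio chunk, in order (assumed values).
    chunk_seconds: duration of audio carried by each chunk.
    speedup:       how many times faster than real time the recognizer runs
                   (an assumed figure, not a measurement).
    """
    cost = chunk_seconds / speedup  # processing time needed per chunk
    done = 0.0
    for arrival in arrivals:
        # A chunk cannot start before it arrives or before the previous one is done.
        done = max(done, arrival) + cost
    return done


# Ten 0.5 s chunks, each available as soon as it has been captured.
on_time = [0.5 * (i + 1) for i in range(10)]

# Same audio, but chunks 4-7 are held up mid-stream and arrive in one burst at 4.0 s.
stalled = on_time[:3] + [4.0] * 4 + on_time[7:]

# Same audio, but only the final chunk is delayed, by one second.
late_tail = on_time[:-1] + [6.0]

for name, arrivals in [("on_time", on_time), ("stalled", stalled), ("late_tail", late_tail)]:
    print(name, round(recognition_finish_time(arrivals, 0.5, 10), 3))
```

Under these assumptions the mid-stream stall costs nothing at the end, because the recognizer works through the burst before the last chunk arrives. A delay on the final chunk, by contrast, is added in full to the response time.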

From: Harald Alvestrand [mailto:harald@alvestrand.no]
Sent: Friday, October 12, 2012 11:35 AM
To: public-media-capture@w3.org
Subject: Re: approaches to recording

On 10/11/2012 12:50 AM, Jim Barnett wrote:
I just want to observe that lossless streaming is what we (= the contact center and speech industry) want for talking to a speech recognition system.  It would be ideal if PeerConnection supported it.  Failing that, it would be nice if the Recorder supported it, but in a pinch we figure that we can use the track-level API to deliver buffers of speech data and let the JS code set up the TCP/IP connection.

Of course lossless streaming (truly guaranteed delivery) implies non-real-time streaming (or, more formally, having to deal with the possibility that delivery will be delayed beyond real-time), given that the Internet is a lossy medium.
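
A minimal sketch of why guaranteed delivery cannot also promise real-time delivery follows. The model is deliberately crude and every number in it is an assumption: a fixed one-way delay, a fixed retransmission timeout, a hand-picked loss pattern, and a hypothetical "budget" standing in for the real-time deadline. It also ignores in-order delivery, which on a real TCP connection would hold up later packets behind a retransmitted one.

```python
def arrival_times(send_times, losses, one_way=0.05, rto=0.2):
    """Arrival time of each packet when every loss costs one retransmission timeout.

    send_times: time each packet is first sent (seconds).
    losses:     number of times each packet is lost before getting through (assumed).
    one_way:    assumed one-way network delay in seconds.
    rto:        assumed retransmission timeout in seconds.
    """
    return [t + n * rto + one_way for t, n in zip(send_times, losses)]


def late_packets(send_times, arrivals, budget=0.15):
    """Indices of packets whose delivery delay exceeds the real-time budget."""
    return [i for i, (t, a) in enumerate(zip(send_times, arrivals)) if a - t > budget]


# Five packets sent 20 ms apart; packet 2 is lost once and packet 4 twice.
send_times = [0.02 * i for i in range(5)]
losses = [0, 0, 1, 0, 2]

arrivals = arrival_times(send_times, losses)
late = late_packets(send_times, arrivals)
print(late)
```

Every packet is eventually delivered, so nothing is lost, but the retransmitted ones miss the assumed deadline. That is the trade-off described above: the data is complete, but some of it arrives later than real time.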

To another thread: Yes, having the constructor for the recorder take a MIME type parameter would imply that you set the codec to be used. I think we all agree that the data coming out of a recording interface is encoded.

           Harald

Received on Thursday, 18 October 2012 00:23:08 UTC