RE: Voice Recognition Profiles

From: Shires, Glen <glen.shires@intel.com>
Date: Fri, 11 Nov 2005 14:34:03 -0500
Message-ID: <E00D1670E16BF849B254F7A0B8EED7BA0350C28C@pysmsx401.amr.corp.intel.com>
To: <www-voice@w3.org>

A simple way to address the different technologies used for profiles is to store the voice samples as plain audio. For example, one could standardize on a common training text (e.g. a few paragraphs in the public domain) and ask users to make a high-quality recording of it in a standardized way; this audio could then be used as training input to virtually any speech recognition system. As an example, the recording could be standardized at a 24 kHz sampling rate and 16 bits/sample, stored in a specific lossless format, and recorded through a specified near-field microphone. Each speech recognition system could then process this audio to match the input characteristics of its own system, for example by mimicking the properties of its microphone and environment and by re-sampling to a different sampling rate.

Thus, the same audio samples could be used for training virtually any speech recognition system. For example, they could be recorded on a PC using a standardized application, then uploaded to a central web-site and downloaded by other devices that you use.
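
As a rough illustration of the recording format and the re-sampling step described above, here is a minimal Python sketch using only the standard library. The 24 kHz rate and 16 bits/sample come from the example figures in this message; the WAV container, the mono channel count, and all function and constant names are assumptions made for illustration, not part of any existing standard.

```python
import io
import math
import struct
import wave

# Assumed profile format. Rate and sample width follow the example
# figures in the message; mono and WAV are illustrative assumptions.
PROFILE_RATE_HZ = 24000
PROFILE_SAMPLE_WIDTH = 2  # bytes, i.e. 16 bits/sample
PROFILE_CHANNELS = 1


def write_profile_wav(fileobj, samples):
    """Write 16-bit integer samples as uncompressed (lossless) PCM WAV."""
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(PROFILE_CHANNELS)
        w.setsampwidth(PROFILE_SAMPLE_WIDTH)
        w.setframerate(PROFILE_RATE_HZ)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))


def read_profile_wav(fileobj):
    """Read a WAV, reject it if it is not in the assumed profile format."""
    with wave.open(fileobj, "rb") as w:
        actual = (w.getnchannels(), w.getsampwidth(), w.getframerate())
        expected = (PROFILE_CHANNELS, PROFILE_SAMPLE_WIDTH, PROFILE_RATE_HZ)
        if actual != expected:
            raise ValueError("recording does not match the profile format")
        raw = w.readframes(w.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if not samples:
        return []
    last = len(samples) - 1
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(int(round(samples[lo] * (1 - frac) + samples[hi] * frac)))
    return out


if __name__ == "__main__":
    # One second of a 440 Hz tone stands in for a real voice recording.
    tone = [int(10000 * math.sin(2 * math.pi * 440 * n / PROFILE_RATE_HZ))
            for n in range(PROFILE_RATE_HZ)]
    buf = io.BytesIO()
    write_profile_wav(buf, tone)
    buf.seek(0)
    recording = read_profile_wav(buf)
    telephony = resample_linear(recording, PROFILE_RATE_HZ, 8000)
    print(len(recording), len(telephony))
```

A real system that downsamples (say, to an 8 kHz telephony rate) would apply a low-pass filter before reducing the rate to avoid aliasing; the linear interpolation here only shows where that step would sit. Mimicking a microphone or acoustic environment is not attempted in this sketch.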

This is my personal opinion,
Glen Shires

From: www-voice-request@w3.org [mailto:www-voice-request@w3.org] On Behalf Of Baggia Paolo
Sent: Friday, November 11, 2005 2:53 AM
To: B.K. DeLong
Cc: Baggia Paolo; www-voice@w3.org
Subject: Re: Voice Recognition Profiles

Dear ..,

I'd like to give you some more information on the background
of your proposals. 

There are at least two broad classes of ASR:
- telephony ASR
- dictation ASR

The former does not require any kind of training, because it is
designed to be used by all possible speakers of a given language,
so the ASR uses a general acoustic model trained on a large
population of speakers.

Conversely, the latter is for personal use, so training is used to
improve performance for a given speaker. Even in this field, the very
long training sessions of the past (reading predefined sentences) have
been reduced: current dictation ASR uses general acoustic models as a
baseline, so less training is needed.

For telephony ASR there are approaches that adapt the acoustic
models online to improve performance for the actual speaker. This is
done during the course of the speech interaction, without the need
for an explicit training phase.

A second aspect is that it is very premature to speak of a
Voice Recognition Profile today. The technologies all differ,
so it is almost impossible to have a standard profile, but your
idea is good in principle.

This is my personal opinion,
Paolo Baggia, Loquendo.

Voice Recognition Profiles

From: B.K. DeLong 
Date: Fri, 28 Oct 2005 08:26:32 -0400
To: www-voice@w3.org 

I'm not sure if this is the right place to discuss this - I looked 
through the archives of this list and several TRs from the Voice 
activity and didn't really find anything to answer my question.

Have any efforts been made to make a standard for voice recognition 
training profiles? Is "training" even necessary any more for voice 
recognition systems?

So when I load up a voice recognition program, I am told to read 
several lines or paragraphs of text so it can match the text content 
with my voice. For every program I try, I have to retrain it all over 
again. In theory, if I move from my computer to my car and try to 
activate my GPS system by voice, it needs to be trained. If I go to 
an ATM or drive-thru where one can automatically order by voice, I 
need to spend several minutes correcting the system until I'm 
connected with a human operator.

Why not create a standard profile for voice recognition that all 
voice-recognition applications can use? That way, when I come to a 
new system I need to "train", I just type in my SSN or some other UID 
which tells the system to pull my VRP (Voice Recognition Profile), 
out of a centralized directory service, allowing me to immediately 
use the system.

In theory, each time I access a new service, whatever actions I take 
and corrections I make in the process, would be noted in the file for 
the next time I access a service - a live, constantly-growing, 
learning profile.

Does such a standard or technology effort exist?

B.K. DeLong
+1.617.797.8471 (Note new number)

http://www.brain-stream.com Play.
http://www.bostonredcross.org Volunteer.
http://www.the-leaky-cauldron.org Potter.
http://www.hackerfoundation.org Future.
http://www.wkdelong.org Son.

PGP Fingerprint:
38D4 D4D4 5819 8667 DFD5 A62D AF61 15FF 297D 67FE


Received on Friday, 11 November 2005 19:34:24 UTC