LDC Office: New Corpus from the LDC




------- Forwarded Message

Posted-Date: Mon, 17 Jul 1995 14:31:45 EDT
Approved-By:  tei-l <U59467@UICVM.BITNET>
Message-Id:  <9507171831.AA17759@unagi.cis.upenn.edu>
Date:         Tue, 18 Jul 1995 09:56:21 CDT
Reply-To: LDC Office <ldc@unagi.cis.upenn.edu>
Sender: Text Encoding Initiative public discussion list
              <TEI-L@uicvm.cc.uic.edu>
From: LDC Office <ldc@unagi.cis.upenn.edu>
Subject:      New Corpus from the LDC
To: Multiple recipients of list TEI-L <TEI-L@uicvm.cc.uic.edu>
content-length: 2392

                  Announcing a NEW RELEASE from the
                     LINGUISTIC DATA CONSORTIUM:


                   PHONEBOOK: NYNEX Isolated Words


PhoneBook is a phonetically-rich, isolated-word, telephone-speech
database, created because of (1) the lack of available
large-vocabulary isolated-word data, (2) the anticipated continued
importance of isolated-word and keyword-spotting technology to
speech-recognition-based applications over the telephone, and
(3) findings that continuous-speech training data is inferior to
isolated-word training data for isolated-word recognition.

The goal of PhoneBook is to serve as a large database of American
English word utterances incorporating all phonemes in as many
segmental/stress contexts as are likely to produce coarticulatory
variations, while also spanning a variety of talkers and telephone
transmission characteristics.  We anticipate that it will be useful
in ways analogous to TIMIT/NTIMIT.

The core section of PhoneBook consists of a total of 93,667
isolated-word utterances, totaling 23 hours of speech.  This breaks
down to 7979 distinct words, each said by an average of 11.7 talkers,
with 1358 talkers each saying up to 75 words.  All data were
collected in 8-bit mu-law digital form directly from a T1 telephone
line.  Talkers were adult native speakers of American English chosen
to be demographically representative of the U.S.
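
As a rough illustration of the figures above, the following Python
sketch recomputes the per-word and per-talker averages and shows one
way to expand 8-bit mu-law bytes to linear samples using the standard
G.711 expansion.  The sample rate of 8000 Hz is an assumption (it is
the usual rate for a T1 voice channel but is not stated in this
announcement), and the names used here are hypothetical.  The sketch
operates on raw sample bytes only; the file format of the distributed
data is not described here and is not handled.

```python
from typing import List

# Figures quoted in the announcement.
UTTERANCES = 93_667
DISTINCT_WORDS = 7_979
TALKERS = 1_358
HOURS = 23

# Assumption (not stated in the announcement): 8000 samples per second,
# the usual rate for a T1 voice channel, at one mu-law byte per sample.
ASSUMED_SAMPLE_RATE_HZ = 8000

talkers_per_word = UTTERANCES / DISTINCT_WORDS      # announcement says 11.7
words_per_talker = UTTERANCES / TALKERS             # upper limit was 75
seconds_per_utterance = HOURS * 3600 / UTTERANCES
approx_bytes = HOURS * 3600 * ASSUMED_SAMPLE_RATE_HZ

MULAW_BIAS = 0x84  # bias constant used by the G.711 mu-law expansion


def mulaw_to_linear(byte_value: int) -> int:
    """Expand one G.711 mu-law byte to a signed linear sample."""
    u = ~byte_value & 0xFF                # mu-law bytes are stored inverted
    mantissa = u & 0x0F                   # low 4 bits
    exponent = (u & 0x70) >> 4            # 3-bit segment number
    magnitude = ((mantissa << 3) + MULAW_BIAS) << exponent
    magnitude -= MULAW_BIAS
    return -magnitude if (u & 0x80) else magnitude


def decode_mulaw(raw: bytes) -> List[int]:
    """Decode a buffer of raw mu-law bytes to a list of linear samples."""
    return [mulaw_to_linear(b) for b in raw]


if __name__ == "__main__":
    print(f"talkers per word:      {talkers_per_word:.1f}")
    print(f"words per talker:      {words_per_talker:.1f}")
    print(f"seconds per utterance: {seconds_per_utterance:.2f}")
    print(f"approx. raw size:      {approx_bytes / 1e6:.0f} MB")
    print(decode_mulaw(bytes([0xFF, 0x80, 0x00])))
```

Under the assumed sample rate, the size estimate covers the raw
samples of the core section only; any headers or additional material
would add to it.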

Given the large set of talkers being recruited for the PhoneBook
database, it made sense to exploit the opportunity to collect
additional utterances.  We chose spontaneous numerical utterances
because of widespread interest in them and the need for very large
numbers of talkers for research into spontaneous-speech effects.  We
restricted these to just three spontaneous digit sequences and one
money amount, as the lists for the core of PhoneBook have been
designed to approach the limit of reasonable duration for a caller's
session.  As a result, PhoneBook contains a total of 5105 spontaneous
utterances.
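
The spontaneous-utterance total can be put in context with a short
calculation.  This sketch assumes that the 1358 talkers of the core
section are the same callers who were asked for the spontaneous items,
and that each was prompted for all four; neither point is stated
explicitly here, so the result is only an upper bound for comparison.

```python
# Figures quoted in the announcement.
TALKERS = 1_358
SPONTANEOUS_UTTERANCES = 5_105

# Three spontaneous digit sequences plus one money amount.
# Assumption: every talker was prompted for all four items.
PROMPTS_PER_SESSION = 3 + 1

max_possible = TALKERS * PROMPTS_PER_SESSION
coverage = SPONTANEOUS_UTTERANCES / max_possible

if __name__ == "__main__":
    print(f"upper bound: {max_possible} utterances")
    print(f"collected:   {SPONTANEOUS_UTTERANCES} ({coverage:.0%} of bound)")
```

The announcement does not say why the collected total falls short of
this bound.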

Questions and orders for PHONEBOOK should be sent to
ldc@unagi.cis.upenn.edu.

Please be aware that we have changed our World Wide Web URL.  The new
URL for the LDC home page is:

        http://www.cis.upenn.edu/~ldc

Our ftp address has not changed.  LDC is at ftp.cis.upenn.edu under
pub/ldc.  When using anonymous ftp, enter your computer id or
"anonymous" when asked for a password.

------- End of Forwarded Message