- From: James Salsman <j.salsman@bovik.org>
- Date: Thu, 17 Aug 2000 22:12:52 -0700 (PDT)
- To: "Phil" <philshinn@mediaone.net>, lenzo@cs.cmu.edu, rkm@cs.cmu.edu
- Cc: <www-voice@w3.org>
On www-voice@w3.org, "Phil" wrote:
>
> Suppose someone wanted the phonemic or phonetic output of a speech
> recognizer, rather than the orthography. Is this included anywhere in the
> latest spec? I couldn't find it. Also, how about access to various
> acoustic parameters like segmental pitch, amplitude and duration?

That information is crucial for educational applications. For an
illustration of the reasons it is so important, please see Figures 2
and 3 of this article:

  http://polyglot.cal.msu.edu/llt/vol2num2/article3/

Trying to make educational software based on automatic speech
recognition that can't identify phoneme segment boundaries is as
pointless as trying to make word processors that can't wrap lines on
word boundaries. But that hasn't stopped dozens of such products from
being released to a generally underwhelmed public.

Sadly, phoneme segmentation time alignments, and for that matter
access to any aspect of the speech waveforms provided to automatic
recognition systems, are not mentioned in W3C documents at all.

There is some interesting historical background on this subject.
Microsoft is the only major speech recognition technology vendor that
has ever supported phoneme segmentation (in their v4.0a SSDK, released
in 1998, using the ISResGraphEx interface). However, they have since
purchased Entropic Labs, Inc. and eliminated its HTK product line,
which was the only other commercial product that provided phoneme
segmentation information, and Microsoft removed all access to
segmentation results in their new SSDK and SAPI version 5.0, released
less than a month ago. Dragon, IBM, Sun, and all the other speech
recognition vendors have never provided the necessary information,
even though it is easily available from the Viterbi beam-search HMM
recognition routines that they all use.

Lernout & Hauspie has announced a product line specifically for
language learning applications; it is in alpha now, if not already in
beta, and is supposed to be fully released by the end of the year.
I've seen some preliminary marketing reports on L&H's language
learning application kit, but after asking all of these companies for
such features ever since Mike Rozak announced Microsoft's SAPI at the
1994 AVIOS conference in San Jose, I'll believe it when I see it
shipping. The good news is that those products should be available for
all of the languages that L&H supports: at least twelve different
languages, I believe.

The CMU Sphinx II and III recognition systems have both recently been
released as open source under a commerce-friendly BSD-style license,
including phoneme models, and both are capable of phoneme segmentation
time alignment, though Sphinx III is said to do a much better job of
it. You can find them both at:

  http://sourceforge.net/projects/cmusphinx/

Sphinx II is in the "File Releases" area, and Sphinx III is in the
"CVS Repository."

I would like to ask Kevin Lenzo and/or M. K. Ravishankar to follow up,
if they would, with an example of using Sphinx III to do phoneme
segmentation alignments, as I've not yet been able to determine how
this is done. Kevin, Ravi: would either of you please explain how to
do that? Thank you both for the hard work you've put into the open
source Sphinx systems.

Cheers,
James
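P.S. To make concrete what I mean by segmentation falling "easily" out
of the Viterbi search, here is a toy forced-alignment sketch in Python.
It is not taken from any vendor's SDK: the phone set, feature values,
and one-state-per-phone single-Gaussian models are all made up for
illustration (real recognizers use multi-state triphone HMMs with
mixture emissions), but the dynamic program and backtrace are the same
in principle. Given a known phone sequence and per-frame acoustic
scores, it recovers the maximum-likelihood phone boundaries.

  import math

  def log_gauss(x, mean, var):
      # Log-likelihood of a 1-D Gaussian; a stand-in for a real
      # acoustic model's emission probability.
      return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

  def force_align(frames, phones, models):
      # Return [(phone, first_frame, last_frame)] maximizing the total
      # log-likelihood of a strict left-to-right pass through `phones`.
      # Toy code: assumes at least one frame per phone.
      T, N = len(frames), len(phones)
      NEG = float("-inf")
      # score[t][j]: best log-likelihood of frames[0..t] ending in phone j
      score = [[NEG] * N for _ in range(T)]
      back = [[0] * N for _ in range(T)]  # phone index at frame t-1
      score[0][0] = log_gauss(frames[0], *models[phones[0]])
      for t in range(1, T):
          for j in range(N):
              stay = score[t - 1][j]                         # remain in phone j
              enter = score[t - 1][j - 1] if j > 0 else NEG  # advance from j-1
              best = max(stay, enter)
              if best > NEG:
                  score[t][j] = best + log_gauss(frames[t], *models[phones[j]])
                  back[t][j] = j if stay >= enter else j - 1
      # Trace back from the last frame of the last phone to recover the
      # frame-level phone path; this is the segmentation information.
      j, path = N - 1, [N - 1]
      for t in range(T - 1, 0, -1):
          j = back[t][j]
          path.append(j)
      path.reverse()
      # Collapse the frame-level path into (phone, start, end) segments.
      segments, start = [], 0
      for t in range(1, T + 1):
          if t == T or path[t] != path[t - 1]:
              segments.append((phones[path[t - 1]], start, t - 1))
              start = t
      return segments

  # Made-up example: ten frames of a 1-D "feature" aligned to 'S IH T'.
  models = {"S": (0.0, 1.0), "IH": (3.0, 1.0), "T": (6.0, 1.0)}
  frames = [0.1, -0.2, 0.3, 2.8, 3.1, 2.9, 3.2, 5.9, 6.1, 6.0]
  for phone, s, e in force_align(frames, ["S", "IH", "T"], models):
      print("%s: frames %d-%d" % (phone, s, e))

The point is that a real decoder already keeps exactly these kinds of
backpointers during its beam search, so phoneme-level time alignments
are a byproduct it computes anyway; exposing them in an API is a
reporting decision, not extra computation.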
Received on Friday, 18 August 2000 01:13:21 UTC