- From: James Salsman <j.salsman@bovik.org>
- Date: Thu, 17 Aug 2000 22:12:52 -0700 (PDT)
- To: "Phil" <philshinn@mediaone.net>, lenzo@cs.cmu.edu, rkm@cs.cmu.edu
- Cc: <www-voice@w3.org>
On www-voice@w3.org, "Phil" wrote:
>
> Suppose someone wanted the phonemic or phonetic output of a speech
> recognizer, rather than the orthography. Is this included anywhere in the
> latest spec? I couldn't find it. Also, how about access to various
> acoustic parameters like segmental pitch, amplitude and duration?

That information is crucial for educational applications. For an
illustration of the reasons it is so important, please see Figures 2
and 3 of this article:

  http://polyglot.cal.msu.edu/llt/vol2num2/article3/

Trying to make educational software based on automatic speech
recognition that can't identify phoneme segment boundaries is as
pointless as trying to make word processors that can't wrap lines on
word boundaries. But that hasn't stopped dozens of such products from
being released to a generally underwhelmed public.

Sadly, phoneme segmentation time alignments, and for that matter
access to any aspect of the speech waveforms provided to automatic
recognition systems, are not mentioned in W3C documents at all.

There is some interesting historical background on this subject.
Microsoft is the only major speech recognition technology vendor that
has ever supported phoneme segmentation (in their v4.0a SSDK, released
in 1998, using the ISResGraphEx interface). However, they have since
purchased Entropic Labs, Inc. and eliminated its HTK product line,
which was the only other commercial product that provided phoneme
segmentation information, and Microsoft removed all access to
segmentation results in their new SSDK and SAPI version 5.0, released
less than a month ago. Dragon, IBM, Sun, and all the other speech
recognition vendors have never provided the necessary information,
even though it is easily available from the Viterbi beam-search HMM
recognition routines that they all use.

Lernout & Hauspie has announced a product line specifically for
language learning applications; it is in alpha now, if not already in
beta, and is supposed to be fully released by the end of the year.
I've seen some preliminary marketing reports on L&H's language
learning application kit, but after asking all of these companies for
such features ever since Mike Rozak announced Microsoft's SAPI at the
1994 AVIOS conference in San Jose, I'll believe it when I see it
shipping. The good news is that those products should be available for
all of the languages that L&H supports: at least twelve different
languages, I believe.

The CMU Sphinx II and III recognition systems have both recently been
released as open source under a commerce-friendly BSD-style license,
including phoneme models, and both are capable of phoneme segmentation
time alignment, though Sphinx III is said to do a much better job of
it. You can find them both at:

  http://sourceforge.net/projects/cmusphinx/

Sphinx II is in the "File Releases" area, and Sphinx III is in the
"CVS Repository."

I would like to ask Kevin Lenzo and/or M. K. Ravishankar to follow up,
if they would, with an example of using Sphinx III to do phoneme
segmentation alignments, as I've not yet been able to determine how
this is done. Kevin, Ravi: would either of you please explain how to
do that? Thank you both for the hard work you've put into the open
source Sphinx systems.

Cheers,
James
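P.S. To make concrete what I mean by segmentation falling "easily" out
of the Viterbi search, here is a toy forced-alignment sketch in Python.
It is not taken from any vendor's SDK: the phone set, feature values,
and one-state-per-phone single-Gaussian models are all made up for
illustration (real recognizers use multi-state triphone HMMs with
mixture emissions), but the dynamic program and backtrace are the same
in principle. Given a known phone sequence and per-frame acoustic
scores, it recovers the maximum-likelihood phone boundaries.

  import math

  def log_gauss(x, mean, var):
      # Log-likelihood of a 1-D Gaussian; a stand-in for a real
      # acoustic model's emission probability.
      return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

  def force_align(frames, phones, models):
      # Return [(phone, first_frame, last_frame)] maximizing the total
      # log-likelihood of a strict left-to-right pass through `phones`.
      # Toy code: assumes at least one frame per phone.
      T, N = len(frames), len(phones)
      NEG = float("-inf")
      # score[t][j]: best log-likelihood of frames[0..t] ending in phone j
      score = [[NEG] * N for _ in range(T)]
      back = [[0] * N for _ in range(T)]  # phone index at frame t-1
      score[0][0] = log_gauss(frames[0], *models[phones[0]])
      for t in range(1, T):
          for j in range(N):
              stay = score[t - 1][j]                         # remain in phone j
              enter = score[t - 1][j - 1] if j > 0 else NEG  # advance from j-1
              best = max(stay, enter)
              if best > NEG:
                  score[t][j] = best + log_gauss(frames[t], *models[phones[j]])
                  back[t][j] = j if stay >= enter else j - 1
      # Trace back from the last frame of the last phone to recover the
      # frame-level phone path; this is the segmentation information.
      j, path = N - 1, [N - 1]
      for t in range(T - 1, 0, -1):
          j = back[t][j]
          path.append(j)
      path.reverse()
      # Collapse the frame-level path into (phone, start, end) segments.
      segments, start = [], 0
      for t in range(1, T + 1):
          if t == T or path[t] != path[t - 1]:
              segments.append((phones[path[t - 1]], start, t - 1))
              start = t
      return segments

  # Made-up example: ten frames of a 1-D "feature" aligned to 'S IH T'.
  models = {"S": (0.0, 1.0), "IH": (3.0, 1.0), "T": (6.0, 1.0)}
  frames = [0.1, -0.2, 0.3, 2.8, 3.1, 2.9, 3.2, 5.9, 6.1, 6.0]
  for phone, s, e in force_align(frames, ["S", "IH", "T"], models):
      print("%s: frames %d-%d" % (phone, s, e))

The point is that a real decoder already keeps exactly these kinds of
backpointers during its beam search, so phoneme-level time alignments
are a byproduct it computes anyway; exposing them in an API is a
reporting decision, not extra computation.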
Received on Friday, 18 August 2000 01:13:21 UTC