Re: OT: TTS Engine from Doug Schepers on 2004-04-08 (www-voice@w3.org from April to June 2004)

From: Doug Schepers <doug@schepers.cc>
Date: Wed, 7 Apr 2004 21:26:49 -0400
To: <turner@atr.net>, "RJ Auburn" <rj@voxeo.com>
Cc: <www-voice@w3.org>
Message-ID: <010a01c41d08$92c42b10$6501a8c0@Raven>

Hi, Turner-

I happen to be a student of Linguistics at the University of North
Carolina-Chapel Hill (note, however, that I'm more interested in syntax and
cognitive linguistics than phonetics or phonemics). Recently a "quiet room"
was built there, and I may have access to it over the summer. There may or
may not be recording equipment suitable for the task there. I do know people
with some prosumer recording equipment which may also be suitable, or we
could just use a nice PC and Praat. ;) I would have to confirm access to
these resources, but I will look into it.

I have a fairly pleasant speaking voice, and I'd be willing to devote my
time to a project such as this. I may also be able to recruit voice talent
among some singer or actor friends of mine, but certainly no guarantees
there. It would be nice to have a female voice, too. I suppose that an
advantage to my doing it would be that I would definitely not charge for my
time, and I'd be available again for resampling as needed. Caveat: I'm not a
professional announcer.

As you said, the real question is: what would we need to record? I'm fairly
clueless here, being a complete newbie to the subject. All the consonants
and vowels in various environments, like stop-initial/stop-final,
glide-initial, etc.? Maybe there's a comprehensive list of the sounds
needed... that would be a great help.
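
As a rough illustration of what such a list might look like, here is a
sketch only. It assumes a diphone-style concatenative approach (nobody in
this thread has settled on that), and the phoneme set is a toy, hypothetical
subset, not a real English inventory. It simply enumerates every
sound-to-sound transition that would need at least one recording:

```python
from itertools import product

# Toy, hypothetical phoneme subset (assumption: NOT a complete inventory).
STOPS = ["p", "t", "k"]
GLIDES = ["w", "j"]
VOWELS = ["i", "a", "u"]
SILENCE = "_"  # marks a word or utterance boundary


def diphone_inventory(phonemes):
    """Return every ordered pair of units, i.e. each transition to record."""
    units = [SILENCE] + list(phonemes)
    # Silence-to-silence carries no speech, so it is left out.
    return [
        (a, b)
        for a, b in product(units, repeat=2)
        if (a, b) != (SILENCE, SILENCE)
    ]


phonemes = STOPS + GLIDES + VOWELS
inventory = diphone_inventory(phonemes)
print(len(inventory), "transitions to cover")
```

The count grows with the square of the inventory size, so even a rough
phoneme list would let us estimate how much material there is to record.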

I suspect that I am not qualified to head up a project like this, but I
would be willing to follow someone else's lead. Hopefully it could leverage
existing Open Source software, such as Festival (if that's compatible with
non-formant TTS) or other packages. While my own desires are for a C/C++
solution, there's nothing to stop someone from doing a Java port as well,
since that is a good cross-platform language.

Another consideration is mobile devices. I think that any solution should
consider how it might be used on smaller devices, whether the code is
clientside or served wirelessly, or some clever combination of the two that
optimizes the bandwidth/local processor/local storage triangle.

Obviously, such a plugin should be as standards-driven as possible. SSML 1.0
and VoiceXML 2.0 are new recommendations, right? Who knows, maybe we could
even use those, somehow. ;)

Regards-
-Doug the Dreamer

Turner Rentz, III wrote:
|
| The TTS problem roughly divides up into formant vs. non formant.
| Formant speech is generally free.
|
| The barrier to Synthesized non-formant TTS is basically the studio
| time of the voice talent required to produce it, plus software
| development.
|
| Hosting is definitely out of the question. Licenses are
| sold on a per port basis, so you'd have to license
| a big port number just to be free for everyone else
| then quality of service would drop because we're public
| internet UDP.
|
|
| We should go for our own TTS development here
| if we're going for open source, and just get one of us
| with a nice voice to do the studio time.
|
|
| Question: What is the minimum time in the studio we need to be able
| to accurately reproduce the major elements of prosody?
|
| (my guess is about 40 hours. how do you arrive at numbers like this?)
|
Received on Wednesday, 7 April 2004 21:27:00 UTC