T.E.O.'s Draft--Cascading Speech Style Sheets (txt)

I'm sorry, but this proposal from Europe is a *joke*! (apologies if I sound
rude --that is not the intent)

We're talking about a style sheet specification --not a speech synthesizer.
I'm completely bemused by the assertion at the end, which states:
"not many people can afford expensive devices so we are making a simple one".

JuanJo Miguez writes:
 > T.E.O.'s Draft--Cascading Speech Style Sheets
 >                K.U. Leuven
 > Ing. to be Juan Jose Miguez Iglesias  mailto:Juanjo.Miguez@KULeuven.ac.be
 > ir. Filip Evenepoel                   mailto:Filip.Evenepoel@KULeuven.ac.be
 > ir. Bart Bauwens                      mailto:Bart.Bauwens@KULeuven.ac.be
 > Prof.dr.ir Jan Engelen                mailto:Jan.Engelen@KULeuven.ac.be
 > Prof.ing Antonio S. Pena from the E.T.S.I.Telecomunication of Vigo (Spain)
 > -------------------
 > The T.E.O. group at the Katholieke Universiteit Leuven in Belgium 
 > believes that the best way to include speech within CSS is to keep it 
 > simple and general, so that it is easy to use. We agree with T.V. Raman's
 > initial draft:
 >  (http://www.eit.com/msgid/199602130050.QAA10031@labrador.mv.us.adobe.com)
 > that it is very interesting to include speech in CSS, but we don't want 
 > to make it very complicated. Many people don't even know what decibels
 > are, most current speech synthesizers are mono, and it's easier to give
 > some features numeric values (in a more theoretical way; these values 
 > will then be mapped to the real values for each synthesizer). You can see
 > this page in HTML with your browser at the URL:
 > http://www.esat.kuleuven.ac.be/~juanjo/csss1.html
 > We have defined the set of properties for Cascading Speech Style Sheets 
 > in the same way as the CSS1 working draft:
 > Speech
 > ------
 >     Volume
 >     	Value: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
 > 	Initial: 0
 > 	Applies to: All elements
 > 	Example: volume: 5
 > 	The default value is 0 because normally there 
 > 	will be no sound; if any other value is specified, 
 > 	the speech synthesizer will start working. There are many sets of 
 > 	values in the volume range (and for all the other properties), 
 > 	depending on which speech synthesizer you use, so these theoretical
 > 	values will be mapped onto the real values used by the synthesizer.
 > 	We think this way is easier than Raman's, where users need to 
 > 	know what decibels are in order to write their own style sheet. In 
 > 	fact very few people know about this (engineers, physicists and so on).
 > 	To make it easy we let people choose from a small set of values
 > 	(0 to 10) that will be mapped by experts to the real values in the
 > 	synthesizer. 
 >     Speed
 >         Value: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
 >         Initial: UA specific
 >         Applies to: All elements
 >         Example: speed: 6
 > 	Some users (especially blind people) prefer very fast 
 > 	speech because they have very good hearing, so they can 
 > 	read web pages very quickly. That is why we chose this 
 >         large range. Of course "speed: 0" is not allowed because you could 
 > 	not hear anything. 
 >     Voice-type
 >         Value: | child1 | child2 | male1 | male2 | female1 | female2 |
 >         Initial: UA specific
 >         Applies to: All elements
 >         Example: voice-type: female1
 > 	This sets the physical features of the articulating 
 > 	voice. For example, the voices of a boy, a woman and a man sound 
 > 	different, which is the reason for this property.
 >     Pitch
 >         Value: | 1 | 2 | 3 | 4 | 5 | 6 |
 >         Initial: UA specific
 >         Applies to: All elements
 >         Example: pitch: 4
 >         This is a small range for the average frequency (F0). The same 
 > 	person (the same voice type) can speak, on average, at a lower or 
 > 	higher pitch, which gives the impression of a different voice. If we 
 > 	combine "pitch" and "voice-type", for example: 
 > 	if voice-type=child1,F0=1 (low voice)--> real average frequency:150Hz 
 > 	if voice-type=child1,F0=6 (high voice)-> real average frequency:350Hz 
 > 	if voice-type=male2, F0=1 (low voice)--> real average frequency: 50Hz 
 > 	if voice-type=male2, F0=6 (high voice)-> real average frequency:150Hz 
 > 	All these voices sound different. We have a large range of different
 > 	voices because F0 (pitch frequency) is mapped to different values
 > 	of real frequency depending on the voice-type. That's why 6 
 > 	possible values of pitch are enough to make a simple definition with
 > 	36 different voices. 
 > 	When users want to write their personal CSSS, they can try any of the
 > 	available values, and it will work because they will be mapped to real
 > 	and typical values. With Raman's specification someone could try
 > 	an average-pitch of 5 Hertz, but it will sound bad. We prefer to let 
 > 	people choose a relative number rather than an exact and perhaps wrong
 > 	value of average pitch.   
 >     Prosidy
 >         Value: | on | off |
 >         Initial: on
 >         Applies to: All elements
 >         Example: prosidy: off
 > 	With prosody activated the synthesizer applies intonation (the 
 > 	evolution of F0 over time), which can sound hard, soft, angry,
 > 	questioning... With "prosidy: off" the result will be 
 > 	like the voice of a robot (blind people prefer this kind of voice, 
 > 	and also listening to very fast speech) 
 >     Language
 >         Value: as defined in ISO 639 (Codes for the representation of 
 > 	names of languages)
 >         Initial: en
 >         Applies to: All elements
 >         Example: language: fr
 > 	You can specify any language because the way to pronounce the same 
 > 	message differs between countries (e.g. fr, nl, es, en...). 
 > 	For example the Apollo II (a multilingual speech synthesizer) 
 > 	supports 7 languages (Russian, English, French, Spanish...). The 
 > 	default value is English because it's the most used language on 
 > 	the web, and although many languages are not supported now and 
 > 	perhaps never will be, it's better to include all of them than 
 > 	only a small part.
 > 	We try to produce understandable speech, but we think that it's
 > 	difficult to make a speech synthesizer speak all the dialects
 > 	of all the world's countries, as Raman suggests in his draft. It
 > 	could be possible, but not many people could afford it. We are just
 > 	trying to make it easy for the end user, with the devices that
 > 	are now most widely used, so that this can be working soon, because
 > 	many people (blind or impaired people) need it very much, as soon
 > 	as possible.
 > This is a DRAFT. We have discussed it, and now it is your turn to say whether 
 > you like it as it is or would like to discuss some features. I hope 
 > you will tell us what you think about it. Thank you!
 > Kath. Universiteit Leuven--Dept.Electrotechniek (ESAT), T.E.O.
 > mailto:Juanjo.Miguez@KULeuven.ac.be
 > ----------------------------------------------------------------
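
For readers following the quoted draft, here is a minimal sketch of the
pitch/voice-type mapping it describes. The four endpoint frequencies are
the ones given in the draft's own examples. Everything else is an
assumption made for illustration: the name pitch_to_hz, the table
PITCH_RANGE_HZ, and the linear interpolation between the endpoints (the
draft does not say how levels 2 to 5 are spaced). Voice types for which
the draft gives no frequencies are left out rather than guessed.

```python
# Endpoint average frequencies (Hz) for pitch levels 1 and 6, copied from
# the examples in the quoted draft. Only child1 and male2 are given there.
PITCH_RANGE_HZ = {
    "child1": (150.0, 350.0),
    "male2": (50.0, 150.0),
}

PITCH_MIN, PITCH_MAX = 1, 6  # range of the draft's "pitch" property


def pitch_to_hz(voice_type: str, pitch: int) -> float:
    """Map an abstract pitch level (1..6) to an average F0 in Hz.

    Linear interpolation between the two endpoints is an assumption;
    the draft only states the values at levels 1 and 6.
    """
    if voice_type not in PITCH_RANGE_HZ:
        raise ValueError(f"no frequency range known for {voice_type!r}")
    if not PITCH_MIN <= pitch <= PITCH_MAX:
        raise ValueError(f"pitch must be in {PITCH_MIN}..{PITCH_MAX}")
    low, high = PITCH_RANGE_HZ[voice_type]
    fraction = (pitch - PITCH_MIN) / (PITCH_MAX - PITCH_MIN)
    return low + fraction * (high - low)
```

Under these assumptions, pitch_to_hz("child1", 1) and
pitch_to_hz("male2", 6) both come out at 150 Hz, so the same real
frequency can correspond to different abstract levels depending on the
voice type. The style sheet author never sees Hertz values at all.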


Best Regards,

      Adobe Systems                 Tel: 1 (415) 962 3945   (B-1 115)
      Advanced Technology Group     Fax: 1 (415) 962 6063 
      1585 Charleston Road          Email: raman@adobe.com 
      Mountain View, CA 94039 -7900  raman@cs.cornell.edu
      http://www-atg/People/Raman.html (Internal To Adobe)
      http://www.cs.cornell.edu/Info/People/raman/raman.html  (Cornell)

Disclaimer: The opinions expressed are my own and in no way should be taken
            as representative of my employer, Adobe Systems Inc.
