T.E.O.'s Draft--Cascading Speech Style Sheets
               K.U. Leuven

Ing. to be Juan Jose Miguez Iglesias  mailto:Juanjo.Miguez@KULeuven.ac.be
ir. Filip Evenepoel                   mailto:Filip.Evenepoel@KULeuven.ac.be
ir. Bart Bauwens                      mailto:Bart.Bauwens@KULeuven.ac.be
Prof.dr.ir Jan Engelen                mailto:Jan.Engelen@KULeuven.ac.be
Prof.ing Antonio S. Pena from the E.T.S.I.Telecomunication of Vigo (Spain)


The T.E.O. group at the Katholieke Universiteit Leuven in Belgium
believes that the best way to include speech within CSS is to keep it
simple and general, so that it is easy to use. We agree with T.V. Raman's
Initial Draft that it is very interesting to include speech in CSS, but
we do not want to make it very complicated. Many people do not even know
what decibels are, most current speech synthesizers are mono, and it is
easier to set some features with plain numbers (more abstract values that
are then mapped to the real values of each synthesizer). You can view
this page in HTML with your browser at the URL:


We have defined the set of properties for Cascading Speech Style Sheets
in the same format as the CSS1 Working Draft:

volume
	Value: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
	Initial: 0
	Applies to: All elements
	Example: volume: 5

	The default value is 0 because normally there will be no sound;
	when any other value is specified, the speech synthesizer starts
	working. Each speech synthesizer has its own range of volume values
	(the same holds for all the other properties), so these theoretical
	values will be mapped onto the real values used by the synthesizer.
	We think this is easier than Raman's approach, where users must
	know what decibels are in order to write their own style sheets. In
	fact very few people know this (engineers, physicists and so on).
	To keep it easy, we let people choose from a set of ten values
	that experts will map to the real values used by each synthesizer.
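
	The mapping from a theoretical value to a real synthesizer value
	could be sketched as below. This is only an illustration: the draft
	does not say how the mapping is done, so the linear interpolation,
	the function name map_level and the 0 to 255 native range are all
	assumptions.

```python
def map_level(level, real_min, real_max, lowest=0, highest=10):
    """Linearly map an abstract level onto a synthesizer's real range.

    The linear rule is an assumption; a real mapping would be chosen
    by whoever knows the particular synthesizer.
    """
    if not isinstance(level, int) or not lowest <= level <= highest:
        raise ValueError(f"level must be an integer from {lowest} to {highest}")
    fraction = (level - lowest) / (highest - lowest)
    return real_min + fraction * (real_max - real_min)


# Hypothetical synthesizer whose native volume setting runs from 0 to 255.
print(map_level(5, 0, 255))   # 127.5
print(map_level(0, 0, 255))   # 0.0 (the initial value: no sound)
```

	The same helper would serve for "speed" by passing lowest=1, since
	"speed: 0" is not allowed.
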

speed
	Value: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
	Initial: UA specific
	Applies to: All elements
	Example: speed: 6

	Some users (especially blind people) prefer very fast speech
	because their hearing is very good, so they can read web pages
	very quickly. That is why we chose such a wide range. Of course
	"speed: 0" is not allowed, because nothing would be heard.

voice-type
	Value: | child1 | child2 | male1 | male2 | female1 | female2 |
	Initial: UA specific
	Applies to: All elements
	Example: voice-type: female1

	This sets the physical features of the articulating voice. The
	voices of a boy, a woman and a man, for example, all sound
	different, which is why this property exists.

pitch
	Value: | 1 | 2 | 3 | 4 | 5 | 6 |
	Initial: UA specific
	Applies to: All elements
	Example: pitch: 4

	This is a small range for the average fundamental frequency (F0).
	The same person (the same voice type) can speak, on average, at a
	lower or a higher pitch, which gives the impression of a different
	voice. "pitch" and "voice-type" combine, for example, as follows:
	if voice-type=child1, F0=1 (low voice)  --> real average frequency: 150 Hz
	if voice-type=child1, F0=6 (high voice) --> real average frequency: 350 Hz
	if voice-type=male2,  F0=1 (low voice)  --> real average frequency:  50 Hz
	if voice-type=male2,  F0=6 (high voice) --> real average frequency: 150 Hz
	All these voices sound different. We get a wide range of voices
	because F0 (the pitch frequency) is mapped to different real
	frequencies depending on the voice-type. That is why 6 possible
	values of pitch are enough for a simple definition with 36
	different voices (6 voice types x 6 pitch values).
	When users want to write their personal CSSS, they can try any of
	the available values, and it will work because the values are mapped
	to real, typical values. With Raman's specification someone could try
	an average-pitch of 5 Hertz, which would sound bad. We prefer to let
	people choose a relative number rather than an exact, and perhaps
	wrong, average pitch.
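
	The combination of "voice-type" and "pitch" described above could
	be sketched as follows. Only the four end points (child1 and male2
	at pitch 1 and 6) come from the examples in this draft; the linear
	spacing of the values in between, and the names F0_RANGE_HZ and
	real_pitch_hz, are assumptions made for illustration.

```python
# (lowest, highest) real average frequency in Hz for pitch 1 and pitch 6.
# These two entries are the examples given in the draft; the other voice
# types would need figures for the synthesizer in use, so they are omitted.
F0_RANGE_HZ = {
    "child1": (150.0, 350.0),
    "male2": (50.0, 150.0),
}


def real_pitch_hz(voice_type, pitch):
    """Map a relative pitch (1..6) to Hz for the given voice type.

    Intermediate values are spaced linearly, which is an assumption.
    """
    if pitch not in range(1, 7):
        raise ValueError("pitch must be an integer from 1 to 6")
    if voice_type not in F0_RANGE_HZ:
        raise ValueError(f"no frequency range known for {voice_type!r}")
    low, high = F0_RANGE_HZ[voice_type]
    return low + (pitch - 1) * (high - low) / 5


print(real_pitch_hz("child1", 1))  # 150.0
print(real_pitch_hz("child1", 6))  # 350.0
print(real_pitch_hz("male2", 4))   # 110.0
```

	Whatever value the user picks, the result stays inside the range of
	the chosen voice, so a setting such as 5 Hertz cannot occur.
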

prosidy
	Value: | on | off |
	Initial: on
	Applies to: All elements
	Example: prosidy: off

	With prosidy activated, the synthesizer produces intonation (the
	evolution of F0 over time), which can make the speech sound hard,
	soft, angry, questioning and so on. With "prosidy: off" the result
	sounds like the voice of a robot (blind people prefer this kind of
	voice, as well as very fast speech).

language
	Value: a code defined in ISO 639 (Code for the representation of
	names of languages)
	Initial: en
	Applies to: All elements
	Example: language: fr

	You can specify any language, because the way the same message is
	pronounced differs from one language to another (e.g. fr, nl, es,
	en...). For example, the Apollo II (a multilingual speech
	synthesizer) supports 7 languages (Russian, English, French,
	Spanish...). The default value is English because it is the most
	used language on the web. Although many languages are not supported
	now, and perhaps never will be, it is better to include all of them
	than only a small part.
	We are trying to produce understandable speech, but we think it is
	difficult to make a speech synthesizer speak all the dialects of all
	the world's countries, as Raman suggests in his draft. It may be
	possible, but not many people could afford it. We simply want to
	make things easy for the final user with the devices that are most
	widely used today, so that this can be working soon, because many
	people (blind or impaired people) need it very much and as soon as
	possible.
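
	As a rough illustration of how a user agent might check a personal
	CSSS against the properties above, here is a minimal sketch. The
	value sets and initial values are the ones listed in this draft; the
	";" separator between declarations is borrowed from CSS1 and is an
	assumption here, as are the names ALLOWED, INITIAL and
	parse_declarations. Language codes are checked for shape only (two
	lowercase letters, as in the examples), not against ISO 639 itself.

```python
ALLOWED = {
    "volume": [str(n) for n in range(0, 11)],
    "speed": [str(n) for n in range(1, 11)],
    "voice-type": ["child1", "child2", "male1", "male2", "female1", "female2"],
    "pitch": [str(n) for n in range(1, 7)],
    "prosidy": ["on", "off"],
}

# Properties whose initial value is "UA specific" are left unset here.
INITIAL = {"volume": "0", "prosidy": "on", "language": "en"}


def parse_declarations(text):
    """Return the speech settings after applying 'name: value' pairs."""
    settings = dict(INITIAL)
    for declaration in text.split(";"):
        if not declaration.strip():
            continue  # tolerate a trailing ";"
        name, separator, value = declaration.partition(":")
        name, value = name.strip(), value.strip()
        if not separator:
            raise ValueError(f"missing ':' in {declaration.strip()!r}")
        if name == "language":
            # Shape check only: two lowercase ASCII letters.
            if not (len(value) == 2 and value.isascii()
                    and value.isalpha() and value.islower()):
                raise ValueError(f"bad language code {value!r}")
        elif name not in ALLOWED:
            raise ValueError(f"unknown property {name!r}")
        elif value not in ALLOWED[name]:
            raise ValueError(f"value {value!r} not allowed for {name!r}")
        settings[name] = value
    return settings


print(parse_declarations("volume: 5; speed: 6; language: fr"))
```

	A declaration such as "speed: 0" is rejected, which matches the
	rule given for that property.
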

This is a DRAFT. We have discussed it, and now it is your turn to say
whether you like it as it is or would like to discuss some features. We
hope you will tell us what you think about it. Thank you!

Kath. Universiteit Leuven--Dept.Electrotechniek (ESAT), T.E.O.