Re: Seeking speech synthesizer min/max capabilities. from Jon Gunderson on 2000-07-03 (w3c-wai-ua@w3.org from July to September 2000)

From: Jon Gunderson <jongund@uiuc.edu>
Date: Mon, 03 Jul 2000 09:54:38 -0500
To: Ian Jacobs <ij@w3.org>, unagi69@concentric.net, janina@afb.net, pav@oce.nl, kerscher@mail.montana.com
Cc: w3c-wai-ua@w3.org
Message-Id: <4.3.1.2.20000703094801.00c24170@staff.uiuc.edu>

These are the parameters that the SAPI speech engine supports.  The actual
values available for each parameter, as I have said before, are queried by
the software using the APIs provided by SAPI.  The following is taken from
the "Control Tags" section of the Microsoft SAPI software development kit
(SDK).

  DirectTextToSpeech API

--------------------------------------------------------------------------------

Control Tags
This section describes the text-to-speech control tags that can be embedded 
in the source text to improve the prosody of text-to-speech translation:

Syntax
Conventions
Chr
Com
Ctx (some speech engines may not support this tag)
Dem (new for SAPI 4.0 -- some speech engines may not support this tag)
Emp (some speech engines may not support this tag)
Eng
Mrk
Pau
Pit
Pra (new for SAPI 4.0 -- some speech engines may not support this tag)
Prn (some speech engines may not support this tag)
Pro (some speech engines may not support this tag)
Prt (some speech engines may not support this tag)
RmS (new for SAPI 4.0 -- some speech engines may not support this tag)
RmW (new for SAPI 4.0 -- some speech engines may not support this tag)
RPit (new for SAPI 4.0 -- some speech engines may not support this tag)
RPrn (new for SAPI 4.0 -- some speech engines may not support this tag)
RSpd (new for SAPI 4.0 -- some speech engines may not support this tag)
Rst (new for SAPI 4.0)
Spd
Vce (some speech engines may not support this tag)
Vol
A text-to-speech engine can usually translate individual words to speech 
successfully. However, as soon as the engine speaks a sentence, the 
perceived quality of its translation decreases because the engine cannot 
correctly synthesize human prosody -- the inflection, accent, and timing of 
human speech.

The prosody of translated speech can be improved by using text-to-speech 
control tags to better simulate human speech. Control tags can be embedded 
in the source text passed to an engine with the ITTSCentral::TextData 
member function, or they can be inserted into the text that is currently 
playing by calling the ITTSCentral::Inject member function. This allows an 
application to alter prosody of text as it is spoken, without having to 
reconstruct the text-to-speech buffers.

This section describes control tags that you can use to alter the prosody 
of text translated into speech. All tags are optional except the bookmark 
(Mrk) tag, which must be supported.

Syntax
Text-to-speech control tags follow these general rules of syntax:

All tags begin and end with a backslash character (\).
The backslash character is not allowed within a tag.
An odd number of backslash characters in tagged text produces undefined
behavior in the engine.
Tags are case-insensitive. For example, \vce\ is the same as \VCE\.
Tags are white-space dependent. For example, \Rst\ is not the same as
\ Rst \.
To include a backslash character in tagged text, but outside a tag, use a 
double backslash (\\).

If the application has the tagged text bit on and wishes to speak a 
filename, such as "c:\windows\system\test.txt", then it should double up 
the backslashes (e.g., "c:\\windows\\system\\test.txt").
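
The doubling rule above can be sketched in a few lines. This is only an
illustration in Python (SAPI itself is a COM API); the helper name
escape_for_tagged_text is hypothetical and not part of the SDK.

```python
def escape_for_tagged_text(text: str) -> str:
    """Double every backslash so literal text is not parsed as a control tag."""
    return text.replace("\\", "\\\\")


# A filename that should be spoken literally while tagged text is enabled.
path = r"c:\windows\system\test.txt"
print(escape_for_tagged_text(path))
```

An application would apply this only to literal text, before adding any
control tags of its own.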

Samples:

\ ctx="e-mail" \ - Should be parsed as an "e-mail" tag.
\ctx="e-mail" - Should be ignored since it's an unclosed tag at the end of 
a document.
\\ctx="e-mail"\\ - Speaks "backslash c t x equals e-mail backslash"
\\\\\\\\ctx="e-mail"\\\\\\\\ - Speaks "backslash backslash backslash 
backslash c t x equals e-mail backslash backslash backslash backslash"
\ctm="e-mail"\ - Ignored because it's an unknown tag.

When the text-to-speech engine encounters a tag it does not understand, the 
tag is ignored.
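
To make the parsing rules concrete, here is a rough Python sketch of a
scanner that separates spoken text from tags. It is not how any real engine
is implemented; the function name split_tagged_text and the KNOWN_TAGS set
are hypothetical. Tag names are matched exactly (case-insensitively), so how
spaces inside a tag are treated is an assumption that real engines may not
share.

```python
import re

# Tag names from the list in this section, lowercased for comparison.
KNOWN_TAGS = {
    "chr", "com", "ctx", "dem", "emp", "eng", "mrk", "pau", "pit", "pra", "prn",
    "pro", "prt", "rms", "rmw", "rpit", "rprn", "rspd", "rst", "spd", "vce", "vol",
}


def split_tagged_text(text):
    """Return (spoken_text, tags) for a string of tagged text."""
    spoken, tags = [], []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            spoken.append(text[i])
            i += 1
        elif text.startswith("\\\\", i):
            spoken.append("\\")  # a double backslash is one literal backslash
            i += 2
        else:
            end = text.find("\\", i + 1)
            if end == -1:
                break  # unclosed tag at the end of the text: ignored
            body = text[i + 1:end]
            name = re.split(r"[=;:]", body, maxsplit=1)[0].lower()
            if name in KNOWN_TAGS:  # unknown tags are silently dropped
                tags.append(body)
            i = end + 1
    return "".join(spoken), tags


print(split_tagged_text('Check \\ctx="e-mail"\\jongund@uiuc.edu now'))
```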

Tags are persistent from one call to the ITTSCentral::TextData member 
function to another. For example, if the \ctx="e-mail"\ tag is passed to an 
engine that supports that tag, the engine stays in the "e-mail" context 
until another tag changes the context.

Conventions
This section uses the following typographic conventions.

Example  Description
Chr Bold type indicates speech-inflection keywords.
string Italic indicates placeholders for information you supply, such as a 
character or context string.
[[option]] Double square brackets indicate items that are optional.
[[option...]] Three dots (an ellipsis) following an item indicate that more 
items having the same form may appear.
"C" Quotation marks are required to delimit strings.

Chr
\Chr=string[[,string...]]\
Sets the character of the voice.

Example: \Chr="Angry"\

Parameter Description
string String that specifies the characteristics of the voice.

The default characteristic is "Normal," which produces a normal tone of 
voice. Although the Chr tag is less specific than setting the inflection, 
stress, attack, and whispering qualities individually, it is easier to use 
and allows the engine more flexibility and intelligence in its response.

Several characteristics can be specified in the same tag, separated by 
commas. Depending on its capabilities, an engine may not support all of the 
characteristics listed here, or it may support additional characteristics. 
Some common characteristics are the following:


"Angry"
"Business"
"Calm"
"Depressed"
"Excited"
"Falsetto"
"Happy"
"Loud"
"Monotone"
"Perky"
"Quiet"
"Sarcastic"
"Scared"
"Shout"
"Tense"
"Whisper"
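
As an illustration of the \Chr=string[[,string...]]\ syntax, the following
Python sketch builds a Chr tag from one or more characteristics. The helper
name chr_tag is hypothetical; whether an engine honors a given
characteristic still has to be discovered at run time.

```python
def chr_tag(*characteristics: str) -> str:
    """Build a Chr tag such as \\Chr="Calm","Quiet"\\ from plain strings."""
    if not characteristics:
        raise ValueError("at least one characteristic is required")
    for c in characteristics:
        # Backslashes are not allowed inside a tag; quotes delimit strings.
        if "\\" in c or '"' in c:
            raise ValueError(f"invalid characteristic: {c!r}")
    return "\\Chr=" + ",".join(f'"{c}"' for c in characteristics) + "\\"


print(chr_tag("Angry"))
print(chr_tag("Calm", "Quiet"))
```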

Com
\Com=string\
Embeds a comment in the text. Comments are not translated into speech.

Example: \com="This is a comment."\

Parameter Description
string Text of the comment.

Ctx
\Ctx=string\
Note:   Some speech engines may not support this tag.
Sets the context for the text that follows, which determines how symbols 
are spoken.

Example: \ctx="Unknown"\

Parameter Description
string String that specifies the context.

This parameter can be one of these strings:

Context string  Description
"Address" Addresses and/or phone numbers
"C" Code in the C or C++ programming language
"Document" Text document
"E-Mail" Electronic mail
"Numbers" Numbers, dates, times, and so on
"Spreadsheet" Spreadsheet document
"Unknown" Context is unknown (default)
"Normal" Normal/default speech mode.

Dem
\Dem\
Note:   Some speech engines may not support this tag.
De-emphasizes the next word.

Emp
\Emp\
Note:   Some speech engines may not support this tag.
Emphasizes the next word to be spoken.

Example: the \Emp\truth, the \Emp\whole truth, and nothing \Emp\but the truth.

Eng
\Eng;GUID:command\
\Eng:command\

Embeds an engine-specific command that affects only the engine with the 
specified globally unique identifier (GUID). Subsequent engine-specific 
tags for that engine can omit the GUID until an engine-specific tag for a 
different engine is used.

Parameter Description
GUID Globally unique identifier of the engine.
command Engine-specific command.

Mrk
\Mrk=number\
Indicates a bookmark in the text.

Example: \Mrk=75000\

Parameter Description
number Number of the bookmark.

When the text-to-speech engine encounters this tag, it notifies the 
application by calling the ITTSBufNotifySink::BookMark member function.

Bookmarks have a DWORD range (specified in the dwMarkNum parameter of the 
ITTSBufNotifySink::BookMark member function), and the Mrk tag accepts a 
decimal representation. Bookmark number zero (\Mrk=0\) is reserved; the 
ITTSBufNotifySink::BookMark member function is not called for bookmark 
number zero.

Bookmark tags are inserted directly into the text sent to the engine when 
calling the ITTSCentral::TextData member function.
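
A small Python sketch of how an application might number bookmarks before
handing text to the engine. The helper names mrk_tag and bookmark_sentences
are hypothetical. The sketch assumes the sentences are already escaped, and
it rejects zero because no notification is delivered for that bookmark.

```python
DWORD_MAX = 0xFFFFFFFF  # bookmarks have a DWORD (32-bit unsigned) range


def mrk_tag(number: int) -> str:
    """Build a \\Mrk=number\\ tag, rejecting the reserved value 0."""
    if not 1 <= number <= DWORD_MAX:
        raise ValueError(f"bookmark number out of range: {number}")
    return f"\\Mrk={number}\\"


def bookmark_sentences(sentences, first=1):
    """Prefix each sentence with a bookmark numbered first, first+1, ..."""
    return " ".join(mrk_tag(first + i) + s for i, s in enumerate(sentences))


print(bookmark_sentences(["Hello.", "World."]))
```

A user agent could map each bookmark number back to a position in the
document, for example to highlight the sentence currently being spoken.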

Pau
\Pau=number\
Pauses speech for the specified number of milliseconds.

Example: \Pau=1000\

Parameter Description
number Number of milliseconds to pause.

Pit
\Pit=number\
Sets the baseline pitch of the text-to-speech mode to the specified value 
in hertz.

Example: \Pit=90\

Parameter Description
number Pitch, in hertz.

The actual pitch fluctuates above and below this baseline. Embedding a Pit 
tag in text is the same as calling the ITTSAttributes::PitchSet member 
function.

Pra
\Pra=value\
Note:   Some speech engines may not support this tag.
Sets the pitch range.

Parameter Description
value Pitch range in Hz.

Prn
\Prn=text=pronunciation\
Note:   Some speech engines may not support this tag.
Indicates how to pronounce text by passing the phonetic equivalent to the 
engine.

Example: \Prn=tomato=tomaato\

Parameter Description
text Text to pronounce.
pronunciation Phonetic equivalent for pronouncing the text.

Using the Prn tag without specifying the word pronunciation (\Prn=text\) 
will undo the changes to the pronunciation; that is, it will return the 
word to the original pronunciation.

This tag is valid only for engines that support the International Phonetic 
Alphabet. The engine should continue to use this pronunciation for the 
current text-to-speech mode and should store the pronunciation in its 
lexicon for later use. If a lexicon entry already exists for a particular 
word, the Prn tag should be ignored for that word.

Pro
\Pro=number\
Note:   Some speech engines may not support this tag.
Activates and deactivates prosodic rules, which affect pitch, speaking 
rate, and volume of words independently of control tags embedded in the 
text. Prosodic rules are applied by the engine.

Example: \Pro=0\

Parameter Description
number Number that indicates whether to activate or deactivate prosodic 
rules. Setting number to 1, the default, activates prosodic rules. Setting 
number to 0 deactivates prosodic rules.

Prosody does not have a corresponding ITTSAttributes interface as speed and 
pitch do. If the engine supports control of prosody, use the Pro tag.

Prt
\Prt=string\
Note:   Some speech engines may not support this tag.
Indicates the part of speech of the next word.

Example: \prt="Abbr"\

Parameter Description
string String that indicates the part of speech.

This parameter can be one of these strings:

String  Description
"Abbr" Abbreviation
"Adj" Adjective
"Adv" Adverb
"Card" Cardinal number
"Conj" Conjunction
"Cont" Contraction
"Det" Determiner
"Interj" Interjection
"N" Noun
"Ord" Ordinal number
"Prep" Preposition
"Pron" Pronoun
"Prop" Proper noun
"Punct" Punctuation
"Quant" Quantifier
"V" Verb

For more information about the parts of speech, see VOICEPARTOFSPEECH.

RmS
\RmS=number\
Note:   Some speech engines may not support this tag.
Sets reading mode to spelling out each letter of each word, or turns it 
off. Not all engines support this tag, so an application can get more 
consistent results by normalizing the text itself and putting spaces 
between letters.

Parameter Description
number Number that indicates whether to spell out each letter of each word. 
Setting number to 1 sets reading mode to spelling out each letter of each 
word. Setting number to 0, the default, turns it off.
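
The fallback mentioned above (normalizing the text instead of relying on
RmS) might look like this in Python. The function name spell_out is
hypothetical, and the sketch does not preserve word boundaries.

```python
def spell_out(text: str) -> str:
    """Put a space between every letter so the engine reads letter by letter."""
    return " ".join(" ".join(word) for word in text.split())


print(spell_out("SAPI SDK"))
```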

RmW
\RmW=number\
Note:   Some speech engines may not support this tag.
Sets reading mode to leaving audible pauses between each word, or turns it 
off. Not all engines support this tag, so an application can get more 
consistent results by normalizing the text itself and putting periods 
between words.

Parameter Description
number Number that indicates whether to leave audible pauses between each 
word. Setting number to 1 sets reading mode to leave audible pauses between 
each word. Setting number to 0, the default, turns it off.
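
The fallback mentioned above (putting periods between words instead of
relying on RmW) could be sketched as follows in Python. The function name
pause_between_words is hypothetical, and the sketch assumes the input has no
punctuation of its own.

```python
def pause_between_words(text: str) -> str:
    """Separate words with periods so the engine pauses after each one."""
    return ". ".join(text.split())


print(pause_between_words("read this slowly"))
```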

RPit
\RPit=value\
Note:   Some speech engines may not support this tag.
Sets the relative pitch.

Parameter Description
value Relative pitch. 100 is the default/original for the voice.

RPrn
\RPrn=value\
Note:   Some speech engines may not support this tag.
Sets the relative pitch range.

Parameter Description
value Relative pitch range. 100 is the default/original for the voice.

RSpd
\RSpd=value\
Note:   Some speech engines may not support this tag.
Sets the relative speed. 100 is the default/original for the voice.

Parameter Description
value Relative speed. 100 is the default/original for the voice.

Rst
\Rst\
Resets the engine to the default settings for the current mode, as though 
the mode had just been re-created.

Spd
\Spd=number\
Sets the baseline average talking speed of the text-to-speech mode to the 
specified number of words per minute.

Example: \Spd=120\

Parameter Description
number Baseline average talking speed, in words per minute.

Embedding a Spd tag in text is the same as calling the 
ITTSAttributes::SpeedSet member function.

Vce
\Vce=charact=value[[,charact=value]]\
Note:   Some speech engines may not support this tag.
Instructs the engine to change its speaking voice to one that has the 
specified characteristics. The voice characteristics change as though a new 
mode object were created (using that voice) and used. The pitch, speed, 
volume, etc. revert to the defaults for the new voice. If 
ITTSCentral::ModeGet is called, it will reflect the new mode.

Parameter Description
charact One of the characteristics listed in the table below.
value String that specifies the characteristic.

Characteristics are specified in the order of importance. The engine 
selects a voice that most closely matches the characteristics specified at 
the beginning of the list.

The charact=value argument can be any of the following:

Language=language. Requests that the engine speak in the specified language.
Accent=accent. Requests that the engine speak in the specified accent. For 
example, if Language="English" and Accent="French", the engine will speak 
English with a French accent.
Dialect=dialect. Requests that the engine speak in the specified dialect.
Gender=gender. Specifies the gender of the voice: "Male", "Female", or 
"Neutral".
Speaker=speakername. Specifies the name of the voice, or NULL if the name 
is unimportant.
Age=age. Specifies the age of the voice, which can be one of the values 
shown below.
Style=style. Specifies the personality of the voice, for example "Business", 
"Casual", "Computer", "Excited", or "Singsong".

Age string  Description
"Baby" About 1 year old
"Toddler" About 3 years old
"Child" About 6 years old
"Adolescent" About 14 years old
"Adult" Between 20 and 60 years old
"Elderly" Over 60 years old
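
Because the order of characteristics matters, a Vce tag is best built from
an ordered list of pairs. The Python sketch below does that. The helper name
vce_tag is hypothetical, and quoting each value is an assumption based on
the conventions stated earlier in this section.

```python
VCE_KEYS = {"language", "accent", "dialect", "gender", "speaker", "age", "style"}


def vce_tag(pairs) -> str:
    """Build a Vce tag from (characteristic, value) pairs, most important first."""
    if not pairs:
        raise ValueError("at least one characteristic is required")
    parts = []
    for key, value in pairs:
        if key.lower() not in VCE_KEYS:
            raise ValueError(f"unknown characteristic: {key!r}")
        if "\\" in value or '"' in value:
            raise ValueError(f"invalid value: {value!r}")
        parts.append(f'{key}="{value}"')
    return "\\Vce=" + ",".join(parts) + "\\"


# English with a French accent, language being the more important request.
print(vce_tag([("Language", "English"), ("Accent", "French")]))
```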

Vol
\Vol=number\
Sets the baseline speaking volume for the text-to-speech mode.

Example: \Vol=32768\

Parameter Description
number Baseline speaking volume.

The volume level is a linear range from 0 for absolute silence to 65535 for 
the maximum monaural volume. The default is 65535.

If you specify a value greater than 65535, the engine assumes that you want 
to set the left and right channels separately and converts the value to a 
double word, using the low word for the left channel and the high word for 
the right channel. For example, a value of 65536 (0x00010000) has a low word 
of 0 and a high word of 1, so it sets the left channel to the minimum 
baseline speaking volume and the right channel to a level of 1.

Embedding a Vol tag in text is similar to calling the 
ITTSAttributes::VolumeSet member function.
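
The word-splitting rule can be checked with a little arithmetic. The Python
sketch below applies the rule literally (low word for the left channel, high
word for the right); the function names are hypothetical. Treating values up
to 65535 as the same level on both channels is an assumption.

```python
def decode_volume(value: int):
    """Return (left, right) channel levels for a Vol value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError(f"volume out of DWORD range: {value}")
    if value <= 0xFFFF:
        return value, value  # monaural: assumed to apply to both channels
    return value & 0xFFFF, value >> 16  # low word = left, high word = right


def encode_stereo(left: int, right: int) -> int:
    """Pack separate channel levels into one value (right must be nonzero,
    otherwise the result falls back into the monaural range)."""
    if not (0 <= left <= 0xFFFF and 1 <= right <= 0xFFFF):
        raise ValueError("left must be 0..65535 and right must be 1..65535")
    return (right << 16) | left


print(decode_volume(32768))
print(decode_volume(65536))
print(decode_volume(encode_stereo(65535, 32768)))
```

Applied literally, 65536 decodes to a left level of 0 and a right level of 1.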

© 1995-1998 Microsoft Corporation. All rights reserved.

Jon Gunderson, Ph.D., ATP
Coordinator of Assistive Communication and Information Technology
Division of Rehabilitation - Education Services
MC-574
College of Applied Life Studies
University of Illinois at Urbana/Champaign
1207 S. Oak Street, Champaign, IL  61820

Voice: (217) 244-5870
Fax: (217) 333-0248

E-mail: jongund@uiuc.edu

WWW: http://www.staff.uiuc.edu/~jongund
WWW: http://www.w3.org/wai/ua
Received on Monday, 3 July 2000 10:55:13 UTC