- From: Jon Gunderson <jongund@uiuc.edu>
- Date: Mon, 03 Jul 2000 09:54:38 -0500
- To: Ian Jacobs <ij@w3.org>, unagi69@concentric.net, janina@afb.net, pav@oce.nl, kerscher@mail.montana.com
- Cc: w3c-wai-ua@w3.org
These are the parameters that the SAPI speech engine supports. As I have said before, the actual values available for each parameter are queried by the software using the APIs provided by SAPI. The following is taken from the Microsoft SAPI software development kit (SDK), from the section called "Control Tags".

DirectTextToSpeech API
--------------------------------------------------------------------------------
Control Tags

This section describes the text-to-speech control tags that can be embedded in the source text to improve the prosody of text-to-speech translation:

  Syntax
  Conventions
  Chr
  Com
  Ctx (some speech engines may not support this tag)
  Dem (new for SAPI 4.0 -- some speech engines may not support this tag)
  Emp (some speech engines may not support this tag)
  Eng
  Mrk
  Pau
  Pit
  Pra (new for SAPI 4.0 -- some speech engines may not support this tag)
  Prn (some speech engines may not support this tag)
  Pro (some speech engines may not support this tag)
  Prt (some speech engines may not support this tag)
  RmS (new for SAPI 4.0 -- some speech engines may not support this tag)
  RmW (new for SAPI 4.0 -- some speech engines may not support this tag)
  RPit (new for SAPI 4.0 -- some speech engines may not support this tag)
  RPrn (new for SAPI 4.0 -- some speech engines may not support this tag)
  RSpd (new for SAPI 4.0 -- some speech engines may not support this tag)
  Rst (new for SAPI 4.0)
  Spd
  Vce (some speech engines may not support this tag)
  Vol

A text-to-speech engine can usually translate individual words to speech successfully. However, as soon as the engine speaks a sentence, the perceived quality of its translation decreases because the engine cannot correctly synthesize human prosody -- the inflection, accent, and timing of human speech. The prosody of translated speech can be improved by using text-to-speech control tags to better simulate human speech.
Control tags can be embedded in the source text passed to an engine with the ITTSCentral::TextData member function, or they can be inserted into the text that is currently playing by calling the ITTSCentral::Inject member function. This allows an application to alter prosody of text as it is spoken, without having to reconstruct the text-to-speech buffers.

This section describes control tags that you can use to alter the prosody of text translated into speech. All tags are optional except the bookmark (Mrk) tag, which must be supported.

Syntax

Text-to-speech control tags follow these general rules of syntax:

- All tags begin and end with a backslash character (\).
- The backslash character is not allowed within a tag.
- An odd number of backslash characters in tagged text produce undefined behavior in the engine.
- Tags are case-insensitive. For example, \vce\ is the same as \VCE\.
- Tags are white-space dependent. For example, \Rst\ is not the same as \ Rst \.
- To include a backslash character in tagged text, but outside a tag, use a double backslash (\\). If the application has the tagged text bit on and wishes to speak a filename, such as "c:\windows\system\test.txt", then it should double up the backslashes (e.g., "c:\\windows\\system\\test.txt").

Samples:

  \ ctx="e-mail" \ - Should be parsed as an "e-mail" tag.
  \ctx="e-mail" - Should be ignored since it's an unclosed tag at the end of a document.
  \\ctx="e-mail"\\ - Speaks "back-slash c t x equals e-mail back-slash"
  \\\\\\\\ctx="e-mail"\\\\\\\\ - Speaks "backslash backslash backslash backslash c t x equals e-mail backslash backslash backslash backslash"
  \ctm="e-mail"\ - Ignored because it's an unknown tag.

When the text-to-speech engine encounters a tag it does not understand, the tag is ignored.

Tags are persistent from one call to the ITTSCentral::TextData member function to another.
For example, if the \ctx="e-mail"\ tag is passed to an engine that supports that tag, the engine stays in the "e-mail" context until another tag changes the context.

Conventions

This section uses the following typographic conventions.

  Example          Description
  Chr              Bold type indicates speech-inflection keywords.
  string           Italic indicates placeholders for information you supply, such as a character or context string.
  [[option]]       Double square brackets indicate items that are optional.
  [[option...]]    Three dots (an ellipsis) following an item indicate that more items having the same form may appear.
  "C"              Quotation marks are required to delimit strings.

Chr

  \Chr=string[[,string...]]\

Sets the character of the voice.

Example: \Chr="Angry"\

  Parameter   Description
  string      String that specifies the characteristics of the voice. The default characteristic is "Normal," which produces a normal tone of voice.

Although the Chr tag is less specific than setting the inflection, stress, attack, and whispering qualities individually, it is easier to use and allows the engine more flexibility and intelligence in its response. Several characteristics can be specified in the same tag, separated by commas.

Depending on its capabilities, an engine may not support all of the characteristics listed here, or it may support additional characteristics. Some common characteristics are the following:

  "Angry"      "Business"   "Calm"       "Depressed"
  "Excited"    "Falsetto"   "Happy"      "Loud"
  "Monotone"   "Perky"      "Quiet"      "Sarcastic"
  "Scared"     "Shout"      "Tense"      "Whisper"

Com

  \Com=string\

Embeds a comment in the text. Comments are not translated into speech.

Example: \com="This is a comment."\

  Parameter   Description
  string      Text of the comment.

Ctx

  \Ctx=string\

Note: Some speech engines may not support this tag.

Sets the context for the text that follows, which determines how symbols are spoken.

Example: \ctx="Unknown"\

  Parameter   Description
  string      String that specifies the context.
This parameter can be one of these strings:

  Context string   Description
  "Address"        Addresses and/or phone numbers
  "C"              Code in the C or C++ programming language
  "Document"       Text document
  "E-Mail"         Electronic mail
  "Numbers"        Numbers, dates, times, and so on
  "Spreadsheet"    Spreadsheet document
  "Unknown"        Context is unknown (default)
  "Normal"         Normal/default speech mode.

Dem

  \Dem\

Note: Some speech engines may not support this tag.

De-emphasizes the next word.

Emp

  \Emp\

Note: Some speech engines may not support this tag.

Emphasizes the next word to be spoken.

Example: the \Emp\truth, the \Emp\whole truth, and nothing \Emp\but the truth.

Eng

  \Eng;GUID:command\
  \Eng:command\

Embeds an engine-specific command that affects only the engine with the specified globally unique identifier (GUID). Subsequent engine-specific tags for that engine can omit the GUID until an engine-specific tag for a different engine is used.

  Parameter   Description
  GUID        Globally unique identifier of the engine.
  command     Engine-specific command.

Mrk

  \Mrk=number\

Indicates a bookmark in the text.

Example: \Mrk=75000\

  Parameter   Description
  number      Number of the bookmark.

When the text-to-speech engine encounters this tag, it notifies the application by calling the ITTSBufNotifySink::BookMark member function. Bookmarks have a DWORD range (specified in the dwMarkNum parameter of the ITTSBufNotifySink::BookMark member function), and the Mrk tag accepts a decimal representation.

Bookmark number zero (\Mrk=0\) is reserved; an ITTSBufNotifySink::BookMark notification is not sent for bookmark number zero.

Bookmark tags are inserted directly into the text sent to the engine when calling the ITTSCentral::TextData member function.

Pau

  \Pau=number\

Pauses speech for the specified number of milliseconds.

Example: \Pau=1000\

  Parameter   Description
  number      Number of milliseconds to pause.

Pit

  \Pit=number\

Sets the baseline pitch of the text-to-speech mode to the specified value in hertz.
Example: \Pit=90\

  Parameter   Description
  number      Pitch, in hertz. The actual pitch fluctuates above and below this baseline.

Embedding a Pit tag in text is the same as calling the ITTSAttributes::PitchSet member function.

Pra

  \Pra=value\

Note: Some speech engines may not support this tag.

Sets the pitch range.

  Parameter   Description
  value       Pitch range in Hz.

Prn

  \Prn=text=pronunciation\

Note: Some speech engines may not support this tag.

Indicates how to pronounce text by passing the phonetic equivalent to the engine.

Example: \Prn=tomato=tomaato\

  Parameter       Description
  text            Text to pronounce.
  pronunciation   Phonetic equivalent for pronouncing the text.

Using the Prn tag without specifying the word pronunciation (\Prn=text\) will undo the changes to the pronunciation; that is, it will return the word to the original pronunciation.

This tag is valid only for engines that support the International Phonetic Alphabet. The engine should continue to use this pronunciation for the current text-to-speech mode and should store the pronunciation in its lexicon for later use. If a lexicon entry already exists for a particular word, the Prn tag should be ignored for that word.

Pro

  \Pro=number\

Note: Some speech engines may not support this tag.

Activates and deactivates prosodic rules, which affect pitch, speaking rate, and volume of words independently of control tags embedded in the text. Prosodic rules are applied by the engine.

Example: \Pro=0\

  Parameter   Description
  number      Number that indicates whether to activate or deactivate prosodic rules. Setting number to 1, the default, activates prosodic rules. Setting number to 0 deactivates prosodic rules.

Prosody does not have a corresponding ITTSAttributes interface as speed and pitch do. If the engine supports control of prosody, use the Pro tag.

Prt

  \Prt=string\

Note: Some speech engines may not support this tag.

Indicates the part of speech of the next word.
Example: \prt="Abbr"\

  Parameter   Description
  string      String that indicates the part of speech.

This parameter can be one of these strings:

  String     Description
  "Abbr"     Abbreviation
  "Adj"      Adjective
  "Adv"      Adverb
  "Card"     Cardinal number
  "Conj"     Conjunction
  "Cont"     Contraction
  "Det"      Determiner
  "Interj"   Interjection
  "N"        Noun
  "Ord"      Ordinal number
  "Prep"     Preposition
  "Pron"     Pronoun
  "Prop"     Proper noun
  "Punct"    Punctuation
  "Quant"    Quantifier
  "V"        Verb

For more information about the parts of speech, see VOICEPARTOFSPEECH.

RmS

  \RmS=number\

Note: Some speech engines may not support this tag.

Sets reading mode to spelling out each letter of each word, or turns it off. Not all engines support this tag, so an application can get more consistent results by normalizing the text itself and putting spaces between letters.

  Parameter   Description
  number      Number that indicates whether to spell out each letter of each word. Setting number to 1 sets reading mode to spelling out each letter of each word. Setting number to 0, the default, turns it off.

RmW

  \RmW=number\

Note: Some speech engines may not support this tag.

Sets reading mode to leaving audible pauses between each word, or turns it off. Not all engines support this tag, so an application can get more consistent results by normalizing the text itself and putting periods between words.

  Parameter   Description
  number      Number that indicates whether to leave audible pauses between each word. Setting number to 1 sets reading mode to leave audible pauses between each word. Setting number to 0, the default, turns it off.

RPit

  \RPit=value\

Note: Some speech engines may not support this tag.

Sets the relative pitch.

  Parameter   Description
  value       Relative pitch. 100 is the default/original for the voice.

RPrn

  \RPrn=value\

Note: Some speech engines may not support this tag.

Sets the relative pitch range.

  Parameter   Description
  value       Relative pitch range. 100 is the default/original for the voice.
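
The RmS and RmW descriptions above suggest that an application normalize the text itself when an engine does not support those tags. A minimal sketch of that fallback in Python follows. The function names are hypothetical and not part of SAPI, and the exact spacing is an assumption: the SDK text only says "spaces between letters" and "periods between words". The last helper doubles backslashes, as the Syntax section requires for literal backslashes in tagged text.

```python
def spell_out(text):
    # Hypothetical fallback for \RmS=1\: put a space between the letters
    # of each word. Word boundaries are not marked specially here.
    return " ".join(" ".join(word) for word in text.split())


def pause_between_words(text):
    # Hypothetical fallback for \RmW=1\: put a period between words.
    return ". ".join(text.split())


def escape_backslashes(text):
    # Double each backslash so the engine speaks it instead of reading it
    # as a tag delimiter. Apply this after spell_out, not before, so the
    # doubled pair is not split apart by the inserted spaces.
    return text.replace("\\", "\\\\")


if __name__ == "__main__":
    print(spell_out("SAPI tag"))
    print(pause_between_words("one two three"))
    print(escape_backslashes(spell_out("c:\\temp")))
```

The resulting strings would then be passed to the engine as ordinary text in place of the \RmS=1\ or \RmW=1\ tag.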
RSpd

  \RSpd=value\

Note: Some speech engines may not support this tag.

Sets the relative speed. 100 is the default/original for the voice.

  Parameter   Description
  value       Relative speed. 100 is the default/original for the voice.

Rst

  \Rst\

Resets the engine to the default settings for the current mode, as though the mode had just been re-created.

Spd

  \Spd=number\

Sets the baseline average talking speed of the text-to-speech mode to the specified number of words per minute.

Example: \Spd=120\

  Parameter   Description
  number      Baseline average talking speed, in words per minute.

Embedding a Spd tag in text is the same as calling the ITTSAttributes::SpeedSet member function.

Vce

  \Vce=charact=value[[,charact=value]]\

Note: Some speech engines may not support this tag.

Instructs the engine to change its speaking voice to one that has the specified characteristics. The voice characteristics change as though a new mode object were created (using that voice) and used. The pitch, speed, volume, etc. revert to the defaults for the new voice. If ITTSCentral::ModeGet is called, it will reflect the new mode.

  Parameter   Description
  charact     One of the characteristics listed in the table below.
  value       String that specifies the characteristic.

Characteristics are specified in the order of importance. The engine selects a voice that most closely matches the characteristics specified at the beginning of the list.

The charact=value argument can be any of the following:

- Language=language. Requests the engine speak in the specified language.
- Accent=accent. Requests the engine speak in the specified accent. For example, if language="English" and Accent="French" the engine will speak English with a French accent.
- Dialect=dialect. Requests the engine speak in the specified dialect.
- Gender=gender. Specifies the gender of the voice: "Male", "Female", or "Neutral".
- Speaker=speakername. Specifies the name of the voice, or NULL if the name is unimportant.
- Age=age. Specifies the age of the voice, which can be one of the values shown below.
- Style=style. The personality of the voice -- for example: "Business", "Casual", "Computer", "Excited", or "Singsong".

  Age string     Description
  "Baby"         About 1 year old
  "Toddler"      About 3 years old
  "Child"        About 6 years old
  "Adolescent"   About 14 years old
  "Adult"        Between 20 and 60 years old
  "Elderly"      Over 60 years old

Vol

  \Vol=number\

Sets the baseline speaking volume for the text-to-speech mode.

Example: \Vol=32768\

  Parameter   Description
  number      Baseline speaking volume.

The volume level is a linear range from 0 for absolute silence to 65535 for the maximum monaural volume. The default is 65535.

If you specify a value greater than 65535, the engine assumes that you want to set the left and right channels separately and converts the value to a double word, using the low word for the left channel and the high word for the right channel. For example, a value of 65536 sets the left channel to the maximum baseline speaking volume and the right channel to the minimum.

Embedding a Vol tag in text is similar to calling the ITTSAttributes::VolumeSet member function.

© 1995-1998 Microsoft Corporation. All rights reserved.

Jon Gunderson, Ph.D., ATP
Coordinator of Assistive Communication and Information Technology
Division of Rehabilitation - Education Services
MC-574
College of Applied Life Studies
University of Illinois at Urbana/Champaign
1207 S. Oak Street, Champaign, IL 61820

Voice: (217) 244-5870
Fax: (217) 333-0248
E-mail: jongund@uiuc.edu
WWW: http://www.staff.uiuc.edu/~jongund
WWW: http://www.w3.org/wai/ua
Received on Monday, 3 July 2000 10:55:13 UTC