- From: Evan Kirshenbaum <evan@poirot.hpl.hp.com>
- Date: Tue, 13 Feb 1996 18:16:26 -0800
- To: raman@mv.us.adobe.com (Raman T. V.), www-style@w3.org
- Cc: szilles@mv.us.adobe.com, jking@mv.us.adobe.com, wmperry@spry.com
> Here is a first-cut at a draft specification for speech stylesheets.

Good first cut. I do [of course :-)] have some suggestions.

First off, a caveat: while I have a fair bit of experience in language
design, I have almost none in auditory or speech systems.

My main observation is that you have a lot of attributes specified very
precisely and numerically. One of the strengths that I see in CSS is
that it allows the author to specify the values using meaningful
symbols, which allows the user to customize their browser to map onto
desired interpretations. As a simple example, I certainly don't want to
listen to a page whose designer has specified the volume in decibels
without knowing whether I am listening to it through headphones or
playing it to a lecture hall. I would rather have them tell me that it
is louder or softer relative to some baseline volume which I get to set.

On the same thrust, you occasionally talk in terms of free-form strings
which the browser will interpret (as for the specification of voice).
This will only work if there is some relatively widely agreed upon
standard for naming the resource, as there is for fonts. You are
generally better off coming up with a set of values specified by the
standard and using a URL (or system-dependent string) as a fallback to
point to a description of the resource.

Finally, you have several places in which you allow "device-specific"
values. This is generally dangerous, especially as different devices may
assign different meanings to the various values. If you must allow this
(and I would recommend against it in favor of being a little lenient in
allowing people to play with adding attributes), I'd make sure that the
code identifies the device that the attribute and value are to be
interpreted with respect to.

On to the specifics:

- For volume, I certainly wouldn't specify a concrete number of
  decibels. (And if you must allow this, at least force the author to
  suffix "dB".)
  I'd go more with a set analogous to that used for font-size: very
  soft, soft, normal, loud, very loud. For relative values, I'd probably
  allow [much] louder/softer.

- For voice-family, you have the problem (I assume) that there aren't
  any good standards for names. As with font-family, I'd define a few
  that can be assumed. My recommendation would be
  male/female-[adult/child/elder]-[<n>] where the optional trailing
  index can be used to contrast similar voices (male-child-1 with
  male-child-2). If there is a way to describe a voice, it should
  probably be allowable to point at it by URL or name. As with
  font-family, it should be possible to specify a list of such values,
  with the browser picking the first that it understands.

- For speech-rate, I'd append "wpm" to the number. I'd also allow (and
  recommend the use of): very slow, slow, normal, fast, very fast. For
  relative values, I'd add [much] slower/faster.

- For average-pitch, I'd append "Hz" to the number. I'd also allow (and
  recommend the use of): soprano, alto, tenor, baritone, bass (and
  perhaps a couple of others), as well as [much] higher/lower.

- For pitch-range, I'd add something like: monotone, normal, animated
  (and if this is the place to add it: whisper, scream, shriek, etc.)
  and possibly [much] more/less animated.

- For stress, the notion that some elements of a sentence get primary,
  secondary, or tertiary stress is hard to map onto elements. For
  relative stress of elements with respect to the surrounding context,
  perhaps: destressed, unstressed, [weakly/highly] stressed, with [much]
  more/less stressed as the relative. Perhaps the attribute should be
  changed to "emphasis".

- For richness, I'd try to select a set of canonical symbolic values.

- For speech-other, either drop entirely or make the value be a list of
  triples, with the name of the device (or schema) encoded as well.
- For pause-before-pause (which should probably change to simply
  pause-before), etc., add "ms" after the number, and add: none, very
  short, short, medium, long, very long, as well as [much]
  shorter/longer.

- For pronunciation-mode, you need to define at least a first cut at a
  canonical set (which not all browsers need be able to understand).

- language, country, and dialect are all combined in a single value
  according to RFC 1766, which is used as the value of the LANG
  attribute in the HTML internationalization draft and the value of the
  Content-Language header in MIME (and therefore HTTP). It probably
  doesn't hurt to have a single attribute with an RFC 1766 value, but
  the information should already be available to the browser, and I'm
  not sure what the appropriate behavior should be if the element's LANG
  attribute and its style sheet's language attribute disagree.

- For the various non-speech cues, I would recommend highly against
  talking about file names. URLs are probably best, but a good base set
  of assumable effects is probably a good idea.

- For these cues (especially for during-sound), you probably want to be
  able to specify a "cue-volume", as it will probably want to be
  different from the speech volume.

----
Evan Kirshenbaum                 +------------------------------------
HP Laboratories                  |The plural of "anecdote"
1501 Page Mill Road, Building 1U |is not "data"
Palo Alto, CA 94304

kirshenbaum@hpl.hp.com
(415)857-7572

http://www.hpl.hp.com/personal/Evan_Kirshenbaum/
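
As a rough illustration of two of the suggestions above (symbolic volume
values resolved against a baseline the listener sets, and a voice-family
list from which the browser picks the first entry it understands), here
is a minimal Python sketch. Everything in it is an assumption made for
illustration and none of it comes from the draft: the function names
resolve_volume and pick_voice, the 6 dB step between keywords, and the
choice to measure relative values from the enclosing element's volume.

```python
import re

# Hypothetical keyword scale: each keyword is an offset, in steps, from
# the listener's own baseline. The step size is an assumption.
VOLUME_STEPS = {
    "very soft": -2,
    "soft": -1,
    "normal": 0,
    "loud": 1,
    "very loud": 2,
}

# Hypothetical relative keywords, measured from the enclosing element.
RELATIVE_STEPS = {
    "much softer": -2,
    "softer": -1,
    "louder": 1,
    "much louder": 2,
}


def resolve_volume(value, user_baseline_db, parent_db=None, step_db=6.0):
    """Map a symbolic or explicit volume value onto decibels."""
    value = value.strip().lower()
    if value in VOLUME_STEPS:
        # Absolute keywords are anchored to the baseline the listener chose.
        return user_baseline_db + VOLUME_STEPS[value] * step_db
    if value in RELATIVE_STEPS:
        # Relative keywords shift the enclosing element's volume, falling
        # back to the listener's baseline when there is no parent.
        base = user_baseline_db if parent_db is None else parent_db
        return base + RELATIVE_STEPS[value] * step_db
    # Explicit numbers are accepted only with a "dB" suffix.
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)db", value)
    if match:
        return float(match.group(1))
    raise ValueError(f"unrecognized volume value: {value!r}")


def pick_voice(candidates, installed):
    """Return the first listed voice the browser understands, else None."""
    for name in candidates:
        if name in installed:
            return name
    return None
```

With a listener baseline of 60 dB, resolve_volume("loud", 60.0) gives a
value one step above that baseline, whatever the page author had in
mind, while a bare "55" with no unit is rejected. Likewise,
pick_voice(["male-child-2", "male-child-1"], {"male-child-1"}) skips the
voice the browser lacks and settles on the one it has.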
Received on Tuesday, 13 February 1996 21:16:51 UTC