Re: Initial Draft --Cascaded Speech Style Sheets from Evan Kirshenbaum on 1996-02-14 (www-style@w3.org from February 1996)

From: Evan Kirshenbaum <evan@poirot.hpl.hp.com>
Date: Tue, 13 Feb 1996 18:16:26 -0800
To: raman@mv.us.adobe.com (Raman T. V.), www-style@w3.org
Cc: szilles@mv.us.adobe.com, jking@mv.us.adobe.com, wmperry@spry.com
Message-Id: <9602131816.ZM16119@poirot.hpl.hp.com>
> Here is a first-cut at a draft specification for speech stylesheets.

Good first cut.  I do [of course :-)] have some suggestions.

First off, a caveat: while I have a fair bit of experience in language
design, I have almost none in auditory or speech systems.

My main observation is that you have a lot of attributes specified
very precisely and numerically.  One of the strengths that I see in
CSS is that it allows the author to specify the values using
meaningful symbols, which allows the user to customize their browser
to map onto desired interpretations.  As a simple example, I certainly
don't want to listen to a page whose designer has specified the volume
in decibels without knowing whether I am listening to it through
headphones or playing it to a lecture hall.  I would rather have them
tell me that it is louder or softer relative to some baseline volume
which I get to set.
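
To make the baseline idea concrete, here is a rough sketch in Python. The keyword names and the decibel offsets are purely illustrative assumptions (nothing here comes from the draft); the point is only that the author supplies a symbol and the listener supplies the baseline it is resolved against.

```python
# Illustrative only: the keywords and the dB offsets below are assumptions
# made up for this sketch, not values taken from any specification.
VOLUME_OFFSETS_DB = {
    "soft": -6.0,
    "normal": 0.0,
    "loud": 6.0,
}


def resolve_volume(keyword, user_baseline_db):
    """Map an author's symbolic volume onto the listener's own baseline."""
    if keyword not in VOLUME_OFFSETS_DB:
        raise ValueError(f"unknown volume keyword: {keyword!r}")
    return user_baseline_db + VOLUME_OFFSETS_DB[keyword]


# The same style sheet value yields different output levels for a
# headphone listener and for a lecture hall, because each sets a baseline.
headphones = resolve_volume("loud", 40.0)
lecture_hall = resolve_volume("loud", 70.0)
```

With this arrangement the page designer never needs to know the listening environment.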

Along the same lines, you occasionally talk in terms of free-form
strings which the browser will interpret (as for the specification of
voice).  This will only work if there is some relatively widely agreed
upon standard for naming the resource, as there is for fonts.  You are
generally better off coming up with a set of values specified by the
standard and using a URL (or system-dependent string) as a fallback to
point to a description of the resource.

Finally, you have several places in which you allow "device-specific"
values.  This is generally dangerous, especially as different devices
may assign different meanings to the various values.  If you must
allow this (and I would recommend against it in favor of being a
little lenient in allowing people to play with adding attributes), I'd
make sure that the code identifies the device that the attribute and
value are to be interpreted with respect to.

On to the specifics:

- For volume, I certainly wouldn't specify a concrete number of
  decibels.  (And if you must allow this, at least force the author to
  suffix "dB".)  I'd go more with a set analogous to that used for
  font-size: very soft, soft, normal, loud, very loud.  For relative
  values, I'd probably allow [much] louder/softer.

- For voice-family, you have the problem (I assume) that there aren't
  any good standards for names.  As with font-family, I'd define a few
  that can be assumed.  My recommendation would be

     male/female-[adult/child/elder]-[<n>]

  where the optional trailing index can be used to contrast similar
  voices (male-child-1 with male-child-2).  If there is a way to
  describe a voice, it should probably be allowable to point at it by
  URL or name.  As with font-family, it should be possible to specify
  a list of such values, with the browser picking the first that it
  understands.

- For speech-rate, I'd append "wpm" to the number.  I'd also allow
  (and recommend the use of): very slow, slow, normal, fast, very
  fast.  For relative values, I'd add [much] slower/faster.

- For average-pitch, I'd append "Hz" to the number.  I'd also allow
  (and recommend the use of): soprano, alto, tenor, baritone, bass
  (and perhaps a couple of others), as well as [much] higher/lower.

- For pitch-range, I'd add something like: monotone, normal, animated
  (and if this is the place to add it: whisper, scream, shriek, etc.)
  and possibly [much] more/less animated.

- For stress, the notion that some elements of a sentence get primary,
  secondary, or tertiary stress is hard to map onto elements.  For
  relative stress of elements with respect to the surrounding context,
  perhaps: destressed, unstressed, [weakly/highly] stressed, with
  [much] more/less stressed as the relative.  Perhaps the attribute
  should be changed to "emphasis".

- For richness, I'd try to select a set of canonical symbolic values.

- For speech-other, either drop entirely or make the value be a list
  of triples, with the name of the device (or schema) encoded as well.

- For pause-before-pause (which should probably change to simply
  pause-before), etc., add "ms" after the number, and add: none, very
  short, short, medium, long, very long, as well as [much]
  shorter/longer.

- For pronunciation-mode, you need to define at least a first cut at a
  canonical set (which not all browsers need be able to understand).

- language, country, and dialect are all combined in a single value
  according to RFC 1766, which is used as the value of the LANG
  attribute in the HTML internationalization draft and the value of
  the Content-Language header in MIME (and therefore HTTP).  It
  probably doesn't hurt to have a single attribute with an rfc1766
  value, but the information should already be available to the
  browser, and I'm not sure what the appropriate behavior should be if
  the element's LANG attribute and its style sheet's language
  attribute disagree.

- for the various non-speech cues, I would strongly recommend against
  talking about file names.  URLs are probably best, but a good base
  set of assumable effects is probably a good idea.

- for these cues (especially for during-sound), you probably want to
  be able to specify a "cue-volume", as it will probably want to be
  different from the speech volume.
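
Two of the suggestions above (unit-suffixed numbers alongside symbolic keywords, and a voice-family list from which the browser picks the first entry it understands) can be sketched roughly as follows. This is Python, and everything in it is an assumption for illustration: the function names are hypothetical, the installed voice names "Paul" and "Betty" are invented, and I have read both bracketed parts of the generic voice pattern as optional.

```python
import re

# Keyword set taken from the speech-rate suggestion above.
RATE_KEYWORDS = {"very slow", "slow", "normal", "fast", "very fast"}

# A number immediately followed by a unit suffix, e.g. "180wpm" or "250ms".
_NUMBER_WITH_UNIT = re.compile(r"^(\d+(?:\.\d+)?)\s*([A-Za-z]+)$")

# Generic names of the form male/female-[adult/child/elder]-[<n>].
# Assumption: both the age part and the trailing index are optional.
_GENERIC_VOICE = re.compile(r"^(male|female)(?:-(adult|child|elder))?(?:-(\d+))?$")


def parse_value(text, unit, keywords):
    """Accept either a symbolic keyword or a number carrying the given unit."""
    text = text.strip()
    if text in keywords:
        return ("keyword", text)
    match = _NUMBER_WITH_UNIT.match(text)
    if match and match.group(2) == unit:
        return ("number", float(match.group(1)))
    raise ValueError(f"expected a keyword or a number suffixed with {unit!r}: {text!r}")


def pick_voice(candidates, installed):
    """Return the first listed voice the browser understands, else None.

    A voice is "understood" here if it is installed by name or if it
    matches the generic pattern, which every browser is assumed to support.
    """
    for name in candidates:
        if name in installed or _GENERIC_VOICE.match(name):
            return name
    return None


rate = parse_value("180wpm", "wpm", RATE_KEYWORDS)
voice = pick_voice(["Paul", "male-adult-1"], installed={"Betty"})
```

Here a bare "180" is rejected because it lacks its unit, and the voice list falls through the unknown "Paul" to the generic "male-adult-1".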


----
Evan Kirshenbaum                       +------------------------------------
    HP Laboratories                    |The plural of "anecdote"
    1501 Page Mill Road, Building 1U   |is not "data"
    Palo Alto, CA  94304

    kirshenbaum@hpl.hp.com
    (415)857-7572

    http://www.hpl.hp.com/personal/Evan_Kirshenbaum/
Received on Tuesday, 13 February 1996 21:16:51 UTC