CSS Speech: Updated StyleSheet Specification

Here is a revised version of the cascaded speech stylesheet based on feedback
from the net community.
Thanks to everyone on this list for all the useful feedback and
help in tracking down problems with my initial draft.

--Raman

<html>
  <head>
<!--$Id: speech-spec.html,v 1.5 1996/02/26 20:12:25 raman Exp $
Description: Cascading Style Sheets For Aural Presentations
Author: T. V. Raman <raman@adobe.com>
Keywords: Speech, Audio, Rendering Styles -->
<title>Style Sheets For Producing Aural  Renderings</title>
<Author> T. V. Raman <br>
<A HREF="mailto:raman@adobe.com">Raman@adobe.com</a>
</author>
</head>
<body>


<h1>Style Sheets For Producing Spoken Renderings</h1>

This document defines style-sheet extensions that add property-value
definitions specific to aural renderings.
This initial specification attempts to define properties that are general
while allowing browser implementors maximal flexibility in
exploiting the features provided by different auditory displays.  As the
functionality provided by such displays becomes standardized, this
specification will evolve to encompass the features they provide.

<P>

Note that <em>speech</em> style-sheets play a dual role. They specify how a
document should be rendered aurally to a user who is functionally blind,
i.e. is not currently looking at a visual display,
and they may also specify how a visual rendering should be augmented with sound
cues to provide a truly multimodal rendering.
Examples of situations where a user is <em>functionally blind</em>
with respect to the computer display include:
<UL>
<LI> A user wishing to read a document while driving.
<LI> Users engaged in other eyes-busy tasks.
<LI>  A visually impaired user.
</UL>

<H2>Design Philosophy</H2>

A simple-minded approach would dictate that an aural browser use the
information present in the standard stylesheet to convey the same
information aurally.
This would only fit the scenario of producing a faithful aural presentation of
a WWW document for someone who cannot see the visual display.
Such an aural rendering is not desirable in general because the decisions made
by a visual rendering system, such as line and page breaks, are irrelevant to
the listener. Aurally rendering the visual layout is not adequate to convey
structural information.
<P>

We adopt the more sophisticated solution of defining a separate (possibly
cascaded) speech style-sheet  so as to:

<UL>
<LI> Recognize that the aural rendering is essentially independent of the visual rendering.
<LI> Allow orthogonal aural and visual views.
<LI> Allow complementary aural and visual views.
<LI> Allow future browsers to optionally implement both aural
     and visual views to produce truly <em>multimodal</em> documents.
</UL>
<P>

This said, an auditory browser is free to use the information provided by the
standard visual stylesheet to augment the aural rendering where necessary.
Thus, when rendering a well-written document that uses the emphasis tag to mark emphasized
phrases, such an aural browser would use the speech properties specified for
emphasis in the speech stylesheet.
However, if a document uses layout-specific tags such as &lt;IT&gt;,
an aural browser can fall back on a default rendering that maps specific
speech properties to the visual layout tags.
In general, the speech stylesheet will not attempt to specify the mapping
between visual layout tags and speech properties, instead leaving it to
specific browser implementations to decide how such tags are rendered.


<H2>Aural Properties</H2>

In the following, we enumerate each property along with its possible
values. Explanatory paragraphs describe how a browser might use such
properties and their possible effect. The syntax used in the speech
style-sheet is the same as that defined in CSS1; hence, this document does
not explicitly define the syntax. For all purposes, this document should be
considered as an appendix to (or part of) the CSS1 specification.
<P>

The aural properties listed below allow
designers to exploit the capabilities of a wide range of auditory displays.
Implementors using simpler audio output devices are free to map
properties specified by a style-sheet to audio properties that are available
on a particular device.
We provide this flexibility to allow a rich collection of aural renderings.
The field of <em>audio formatting</em> is relatively new; see
<A HREF="http://www.cs.cornell.edu/Info/People/raman/current/phd-thesis/index.html">AsTeR --Audio System For Technical Readings</A>
for research defining some of the key notions in this area.
Also see Janet Cahn's Master's thesis entitled
<em>Generating Expression in Synthesized Speech</em> (Copyright MIT 1990)
for additional examples of varying speech synthesis parameters to produce
interesting effects.
<P>

Restricting the style sheet specification language to the constraints of
lower quality devices would throttle research in this field.

<H3>Speech Properties</H3>

Speech properties specify the voice characteristics to be used when rendering
specific document elements.

<DL>
<DT> :volume
<DD> Level [0 -- 10] or number (NNNdb) (decibels)
     or [soft | medium | loud]<P>

     The volume of the speaker, specified as a numeric level, in decibels, or
     using the keywords soft, medium or loud.
     A volume specified as a level is mapped by the UA implementation
     to an appropriate device setting, with a level of 5 interpreted as "medium".
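     <P>
     As a rough illustration of how a UA might resolve a :volume value, here
     is a minimal sketch. The function name, the linear scale, and the levels
     chosen for <em>soft</em> and <em>loud</em> are assumptions made for this
     example; the text above only fixes level 5 as "medium". The decibel form
     is deliberately left out because its mapping is device-specific.

```python
# Hypothetical resolution of a :volume value to a device gain in 0.0..1.0.
# Only "level 5 means medium" comes from the draft; the linear scale and the
# levels picked for "soft" and "loud" are assumptions for illustration.
KEYWORD_LEVELS = {"soft": 2, "medium": 5, "loud": 8}


def volume_to_gain(value):
    """Resolve a keyword or an integer level 0..10 to a gain in 0.0..1.0."""
    level = KEYWORD_LEVELS.get(value, value)
    if not isinstance(level, int) or level not in range(0, 11):
        raise ValueError("unsupported volume value: %r" % (value,))
    return level / 10.0


print(volume_to_gain("medium"))  # same gain as level 5
print(volume_to_gain(10))        # loudest level
```

     A real device would replace the division by ten with whatever scale its
     own volume control uses.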
     
<DT> :left-volume
<DD> number 1--100 (percentage)  <P>

     Specifies the speaker volume for the left-channel.
     Devices not supporting stereo output may ignore this setting. 

<DT> :right-volume
<DD> number 1--100 (percentage)  <P>

     Specifies the speaker volume for the right-channel.
     Devices not supporting stereo output may ignore this setting. 


<DT> :voice-family
<DD> string<P>

     Analogous to the :font-family property.
     This specifies the kind of voice to be used, and can be something generic
     such as <em>male</em>  or something more specific such as
     <em>comedian</em>
     or something very specific such as <em>paul</em>.
     We recommend the same approach as used for :font-family: the
     style sheet provides a list of possible values ranging from most to least
     specific, and the browser picks the most specific voice that it
     can find on the output device in use.
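     <P>
     The selection rule just described can be sketched as follows. The
     function name, the set of installed voices, and the fallback name
     <em>default</em> are hypothetical and serve only to illustrate the rule.

```python
def pick_voice(requested, available, default="default"):
    """Return the first requested voice that the device offers.

    `requested` is ordered from most to least specific, as in the style
    sheet; `available` is the set of voices the output device provides.
    """
    for name in requested:
        if name in available:
            return name
    return default


# The style sheet asks for "paul", then "comedian", then any male voice.
installed = {"male", "female"}
print(pick_voice(["paul", "comedian", "male"], installed))
```

     Because the list is scanned in order, a device that does offer
     <em>paul</em> would select it ahead of the generic <em>male</em> voice.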

<DT> :speech-rate
<DD>Level [ -- 10] or number (NNNwpm) (words per minute)
      or [slow | medium | fast]<P>

     Specifies the speaking rate. 
     If specified as a level, 5 is interpreted as medium.
<DT> :average-pitch
<DD>  Level [1 -- 10] or number (NNNhz) (hertz)
    <P>

     Specifies the average pitch of the speaking voice in hertz (hz).  The average
pitch is the fundamental frequency of the speaking voice.  Lowering it
typically produces a deeper voice --increasing it produces a higher pitched
voice.  Listen to <A
HREF="http://www.cs.cornell.edu/Info/People/raman/aster/sec-02.html">AsTeR
rendering superscripts and subscripts</A> for an example of this effect.

<DT>  :pitch-range
<DD> number (percentage variation 0--200)
     <P>

     Specifies the variation in average pitch. A pitch range of 0 produces a
     flat, monotone voice. A pitch range of 100 produces normal inflection.
     Pitch ranges greater than 100 produce animated voices.

     <P>
     Less sophisticated speech output devices provide only a simple prosody
     setting that acts as a toggle, setting this value to either 0 or 100.
     
<DT>  :stress 
<DD> number (0--100)<P>

     Specifies the level of stress (assertiveness or emphasis) of the speaking
     voice.  English is a <strong>stressed</strong> language, and different
     parts of a sentence are assigned primary, secondary or tertiary
     stress. The value of property :stress controls the amount of inflection that
     results from these stress markers.  Different speech devices may require
     the setting of one or more device-specific parameters to achieve this
     effect.  <P>

     Increasing the value of this property results in
     the speech being more strongly inflected; the resulting voice sounds
     excited or animated.  It is in a sense dual to
     property <em>:pitch-range</em>
     and is provided to allow developers to exploit higher-end auditory displays.
     
<DT>  :richness
<DD> number (0--100)<P>

     Specifies the richness (brightness) of the speaking voice.
     Different speech devices may require the setting of one or more
     device-specific parameters to achieve this effect.
     <P>
     
The effect of increasing richness is to produce a voice that <em>carries</em>
     --reducing richness produces a soft, mellifluous voice.  For an example
     of continuously reducing richness listen to <A
     HREF="http://www.cs.cornell.edu/Info/People/raman/aster/math-examples/sec4-ex1.au">AsTeR
     rendering a continued fraction</A>.

     <P>
Note: In the above example of a continued fraction the voice also grows more
     animated --this is a result of increasing the value of property :stress.
<DT> :speech-other
<DD> List  of name value pairs. <P>

Allows implementors to experiment with features available on specific speech
     devices.  The use of this property is device-specific, but it is provided as
     an <em>escape mechanism</em> since auditory displays are not yet as
     standardized as their visual counterparts.  Implementors are encouraged
     to use this property only where absolutely necessary.  In many cases, the
     desired effect can be abstracted using the properties defined earlier and
     having the device-specific component of the browser map a single abstract
     property to a collection of device-specific properties.
<P>
In general, we expect document-specific style sheets to avoid this
     escape mechanism completely to ensure that documents remain device-independent.
     User-specific and UA-specific local stylesheets may choose to use this
     facility to enhance the presentation.
     
</DL>


<H3>Miscellaneous Speech Settings</H3>

In addition to specifying voice properties, a speech style-sheet also
specifies auxiliary information such as the amount of pause to insert before
or after rendering document elements.<P>

Pause can be used to great effect in conveying structural information.
Experience with AsTeR (Audio System For Technical Readings) has shown that
small amounts of pause --5 to 20 milliseconds-- can prove perceptually
significant and aid in the grouping of mathematical subexpressions.  Listen to
<A HREF="http://www.cs.cornell.edu/Info/People/raman/aster/sec-01.html">AsTeR
rendering simple fractions</A>,
where pauses are used effectively to convey grouping.
<DL>
<DT> :pause-before
<DD> number (milliseconds) Amount of pause. (analogous to white space.)<P>

     Specifies the number of milliseconds  of  silence  to insert
     <em>before</em> rendering a document element.
In situations  where the <em>:pause-before</em> <strong>intersects</strong>
     the <em>:pause-after</em> of the preceding document element,   we compute
     the amount of pause to insert in a manner similar to that used to compute
     the amount of intervening whitespace in producing visual renderings. 
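     <P>
     The draft does not spell out this computation. One plausible reading,
     borrowed from the way CSS1 collapses adjoining vertical margins, is to
     use the larger of the two pauses rather than their sum. The sketch below
     assumes that reading; the function names are hypothetical.

```python
def collapsed_pause(pause_after_ms, pause_before_ms):
    """Silence between two adjacent elements, assuming margin-style collapsing.

    Assumption: the larger of the two pauses wins, by analogy with CSS1
    vertical margins. The draft itself does not state the rule.
    """
    return max(pause_after_ms, pause_before_ms)


def gaps(elements):
    """Pauses between consecutive (pause_before_ms, pause_after_ms) pairs."""
    return [collapsed_pause(prev[1], nxt[0])
            for prev, nxt in zip(elements, elements[1:])]


# Three elements in document order, each given as (pause-before, pause-after).
print(gaps([(0, 20), (5, 10), (15, 0)]))
```

     Under a purely additive rule the first gap would be 25 milliseconds
     instead of 20, which may matter given how small the perceptually
     significant pauses are.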
     
     <DT> :pause-after
<DD> number (milliseconds) Amount of pause. (analogous to white space.)<P>

     Specifies the number  of milliseconds of  silence  to insert
     <em>after</em> rendering a document element.

     <DT> :pause-around
<DD> number (milliseconds) Amount of pause. (analogous to white space.)<P>

     Specifies the number  of milliseconds of  silence  to insert
      <em>before</em> and <em>after</em> rendering a document element.
     Though this effect can be achieved by using <em>:pause-before</em> and
     <em>:pause-after</em> in conjunction,
     style-sheet designers are encouraged to use <em>:pause-around</em>  where
     appropriate since it makes the intent clearer.
     <strong> Perhaps :before :after and :around should be modifiers so they
     can be generally applied to other property settings?</strong>
<DT> :pronunciation-mode
<DD> string<P>

     Specify the pronunciation mode to be used when speaking a document
     element. Pronunciation modes can include
     <UL>
     <LI> Speak all punctuation marks.
     <LI> Speak only some punctuation marks.
          In this case, the rule for handling punctuation marks is
          specified by providing a value for property
          :punctuation-marks-to-skip or :punctuation-marks-to-speak.

     <LI> Speak contents as a date.
     <LI> Speak contents as a time string.
     </UL>

     The set of values for this property is left open so that designers can
     exploit all features available in a specific device.
     Style-sheet designers can specify a list of values for
     a particular option in a manner analogous to that described for
     :voice-family. Browsers are expected to choose the most
     specific setting available on the current output device.
     Thus, for property :speak-time, a style sheet could specify
     <q>:speak-military-time</q>, <q>:speak-am-pm</q>, etc.
     <P>
     <P>

     The device-specific component of a browser is expected to map those
     values that it does not understand to a suitable default. Alternatively,
     the device-specific component of the browser may choose to transform the
     contents of the document element to a form that is suitable to be
     rendered by the specific device.
To give an example:

     <P>
Consider the value <em>date-string</em>.
     Given a content string of the form <em>Jan 1, 1996</em>
     an aural browser could:
     <UL>
     <LI> Ignore property :pronunciation-mode.
     <LI> Send the content string directly to
          a smart speech device capable of switching to a <q>speak date</q>
          mode.
     <LI> Apply an appropriate transform --in this example, change Jan to January--
          when communicating with a less sophisticated output device.
     </UL>
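     <P>
     The third option, transforming the content for a simpler device, might
     look like the sketch below. The function name and the abbreviation table
     are assumptions for illustration; a real browser would need a more
     careful date parser.

```python
# Hypothetical expansion table; the draft only mentions Jan -> January.
MONTHS = {
    "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
    "Jun": "June", "Jul": "July", "Aug": "August", "Sep": "September",
    "Oct": "October", "Nov": "November", "Dec": "December",
}


def expand_date(text):
    """Replace abbreviated month names so a simple synthesizer reads them."""
    words = text.split()
    # A trailing period on an abbreviation ("Jan.") is dropped on expansion.
    return " ".join(MONTHS.get(word.rstrip("."), word) for word in words)


print(expand_date("Jan 1, 1996"))
```

     Words that are not in the table, including the day and year, pass through
     unchanged.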
<DT> :language
<DD> string<P>

     Language to use when rendering the contents of the document element.
     Specified by using the appropriate ISO encoding for international
     languages.

<DT> :country
<DD> string<P>

     Specified using ISO encoding for specifying country codes.
     Can be used in conjunction with :language to specify British or American
     English. (See property :dialect below for variations in speaking style
     within a country.)
     This property will be useful for multilingual speech devices capable of
     switching between languages.
     
<DT> :dialect
<DD> string<P>

     Specifies the dialect to be used, e.g. american-mid-western-english.

     
</DL>

<H3>Non-Speech Auditory Cues</H3>

Non-speech sounds can be used to produce <em>auditory icons</em>.
Such auditory icons serve to augment the aural rendering and provide succinct
cues.
<P>

<DL>
<DT> :before-sound
<DD> URL.  <P>

     Specifies a file containing sound data. The sound is played
     <em>before</em> rendering the document element to produce an auditory
     icon. An optional :cue-volume can specify a volume scaling to be applied
     to the sound before playing it.

<DT> :after-sound
<DD> URL.  <P>

     Specifies a file containing sound data. The sound is played
     <em>after</em> rendering the document element to produce an auditory
     icon. An optional :cue-volume can specify a volume scaling to be applied
     to the sound before playing it.

<DT> :around-sound
<DD> URL.  <P>

     Specifies a file containing sound data. The sound is played both
     <em>before</em> and <em>after</em> rendering the document element to
     produce an auditory icon. An optional :cue-volume can specify a volume
     scaling to be applied to the sound before playing it.

<DT> :during-sound
<DD> URL.  <P>

     Specifies a file containing sound data. The sound is played repeatedly
     <em>during</em> rendering the document element to produce an auditory
     icon that provides an aural backdrop.
</DL>


<H3>Advanced Settings </H3>

In the future, auditory displays may want to exploit spatial audio for
producing  rich aural layout.
Spatial audio --a digital signal processing technique that involves convolving
sound data with appropriate filters to produce spatially located sounds-- can
be used to make sounds <em>appear</em> to originate  from different points  in
the listener's auditory space and is popularly referred to as
<q>three-dimensional sound</q>.

<DL>
<DT> :spatial-audio
<DD> :azimuth number :elevation number <P>

     Azimuth and elevation are specified in degrees and together specify the
     point in auditory space from which the sound appears to originate. <P>
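     <P>
     To make the two angles concrete, the sketch below converts them to a unit
     direction vector. The draft does not fix a coordinate convention, so the
     one used here is an assumption: azimuth 0 is straight ahead and increases
     toward the listener's right, elevation 0 is ear level and increases
     upward. The function name is hypothetical.

```python
import math


def direction(azimuth_deg, elevation_deg):
    """Unit vector (right, forward, up) for an apparent sound source.

    Assumed convention, not taken from the draft: azimuth 0 is straight
    ahead and grows to the right; elevation 0 is ear level and grows upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    right = math.cos(el) * math.sin(az)
    forward = math.cos(el) * math.cos(az)
    up = math.sin(el)
    return (right, forward, up)


print(direction(0, 0))   # straight ahead
print(direction(90, 0))  # directly to the right, up to rounding error
```

     A spatial audio engine would then select or interpolate its filters for
     that direction; that step is device-specific and is not shown here.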
</DL>


<hr>
<address><A href="mailto:raman@adobe.com">Email: raman@adobe.com</a></address>
<!-- hhmts start -->
Last modified: Mon Feb 26 12:12:19 1996
<!-- hhmts end -->
</body>
 </html>

-- 



Best Regards,
____________________________________________________________________________
--raman

      Adobe Systems                 Tel: 1 (415) 962 3945   (B-1 115)
      Advanced Technology Group     Fax: 1 (415) 962 6063 
      1585 Charleston Road          Email: raman@adobe.com 
      Mountain View, CA 94039 -7900  raman@cs.cornell.edu
      http://www-atg/People/Raman.html (Internal To Adobe)
      http://www.cs.cornell.edu/Info/People/raman/raman.html  (Cornell)

Disclaimer: The opinions expressed are my own and in no way should be taken
            as representative of my employer, Adobe Systems Inc.
____________________________________________________________________________

Received on Wednesday, 28 February 1996 11:55:12 UTC