multimodal style

This is a proposal for a multimodal styling language, as opposed to a
linear combination of visual/audio/whatever styling languages.  I suggest
a few natural multimodal attributes, and discuss why and how to encourage 
their use.  I also suggest allowing attribute values to be in a fixed 
range of numbers, which both simplifies the styling language and 
minimizes the dependence on English.

I would be very happy to hear comments from any interested reader.  This
proposal is publicly available at
http://www.physics.mcgill.ca/WWW/seibert/style/mms.html.  I am also
temporarily storing the audio style sheet proposal of T.V. Raman at
http://www.physics.mcgill.ca/WWW/seibert/style/raman.html, so that this
manuscript is also publicly available.


David Seibert


<title>Multimodal document styling</title>


<h1>Encouraging the production of stylish and accessible documents</h1>

  <li><a href="#int">Introduction</a>
  <li><a href="#des">Design goals</a>
  <li><a href="#uni">Unimodal attributes</a>
    <li><a href="#aud">Audio</a>
    <li><a href="#vis">Visual</a>
  <li><a href="#mul">Multimodal attributes</a>
    <li><a href="#def">Definitions, proposed values, and mnemonics</a>
    <li><a href="#oth">Other values</a>
  <li><a href="#sta">Standardization and language independence</a>
  <li><a href="#ind">Independent specification of attributes</a>
  <li><a href="#enc">Encouraging authors to produce multimodal documents</a>
  <li><a href="#pre">Precise specification of multimodal attributes</a>
  <li><a href="#sum">Summary</a>

<h2><a name="int">Introduction</a></h2>

<a href="http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html">HTML</a> 
(HyperText Markup Language), the standard markup language of the World
Wide Web, is an 
<a href="http://www.w3.org/hypertext/WWW/MarkUp/SGML/">SGML</a> 
(Standard Generalized Markup Language) document type.  Rules have been
specified to transform HTML tags to a set of "canonical" elements in
accordance with SDA (SGML Document Access) standards, so that HTML
documents can easily be presented using Braille, large print, audio,
or any other type of display.  The reason that the ICADD (International
Committee on Accessible Document Design) recommends the definition of
mappings to a standard tag set by DTD (Document Type Definition) authors
is that this practice will "minimize the burden on writers and editors
of understanding the requirements of markup for Braille, large print and
voice synthesized delivery" 
[<a href="gopher://gopher.mic.ucla.edu:4334/11/ICADD/info">Y. Rubinsky,
"Description of the ICADD Mechanism"</a>].

The use of <a href="http://www.w3.org/hypertext/WWW/Style/">HTML
styling languages</a>, such as
<a href="http://occam.sjf.novell.com:8080/docs/dsssl-o/do951212.htm">DSSSL
(Document Style Semantics and Specification Language) Online</a> and 
<a href="http://www.w3.org/pub/WWW/Style/css/">CSS (Cascading Style 
Sheets)</a>, has been suggested as a simple way for Web publishers to 
control the presentation of HTML documents.  The hope is that this
mechanism will encourage publishers and software vendors to use HTML
rather than creating their own DTDs using SGML, or inventing new HTML
markup tags.  Some advantages of continued HTML use are:
  <li>Cross-platform development is facilitated by having a
  well-established, simple standard.
  <li>The dependence of presentation on proprietary software is 
  minimized because the meanings of markup tags are relatively fixed.
  <li>Web publishers are less likely to create HTML documents
  that cannot be presented well in alternative formats.
  <li>The relatively small set of allowed markup tags allows HTML
  display software to be simpler than SGML processing software.
However, documents created using customized styles will not be presented
well in all display modes, unless the style designer spends time creating
the proper specification for each mode.  This extra work would again make
it less likely that web authors would create documents and styles that
can be presented well in all formats.

Current styling language proposals concentrate on giving authors control
of the visual presentation of text and images, for the most part ignoring
the possibility of alternative formats.  HTML style sheets for audio
presentation have been proposed by 
<a href="http://www.physics.mcgill.ca/WWW/seibert/style/raman.html">T.V. 
Raman</a>
and by the <a href="http://www.esat.kuleuven.ac.be/~juanjo/csss1.html">
TEO group</a> at the Katholieke Universiteit Leuven, but in these 
schemes the audio controls are totally independent of visual controls.  
Little attention has been paid to the fact that, in most cases, authors
use markup to express an idea, so that audio and visual presentations
(along with presentations in any other modes) are related because they
represent the same idea, just as visual presentations of markup in 
different languages are related by the semantics of the tags.  

In this document, I propose to create a multimodal HTML styling
language, in which visual, audio, and other style descriptions are
integrated as much as is practical.  The purpose of this unification is
to make it as easy as possible to produce better web documents for
people with disabilities, by reducing the work for the author or style
designer to enrich a document or style for all display modes.  I
suggest design goals for such a styling language and discuss means to
implement those goals.  I give concrete examples for five multimodal
presentation attributes that can be derived from visual and audio 
attributes.  Finally, I discuss how to combine multimodal and unimodal
attributes to create a styling language that not only allows but
encourages authors to produce multimodal documents and styles.

<h2><a name="des">Design goals</a></h2>

A well-designed multimodal styling language should
  <li>contain a fairly complete set of attributes, so that authors can
      specify a wide range of properties for visual, audio, and other 
      presentation modes.
  <li>contain multimodal attributes that allow authors to simultaneously
      specify properties for multiple presentation modes.
  <li>standardize attribute values as much as possible, to make it
      easier for casual authors to use the styling language.
  <li>reduce language dependence by minimizing the use of English.
  <li>allow authors to specify visual and audio properties 
      independently if they wish to do so, for maximal control over
      presentation.
  <li>encourage authors to use multimodal attributes (that control
      presentation in more than one mode), which will produce
      documents that can be presented well in any mode, rather than 
      unimodal attributes (that control presentation in a single 
      mode).
  <li>allow precise physical specification of presentation properties
      when feasible.

<h2><a name="uni">Unimodal attributes</a></h2>

The first design goal, providing a wide range of attributes, can be met
by simply combining current proposals for visual and audio style
attributes.  In
this section, I give the names of some proposed attributes and their 
natural language and numerical values (without actual or implied 
units).  The definitions are usually fairly obvious; when they are 
not, readers should refer to the proposals for visual and audio 
style sheets.  I do not discuss physical values for these attributes,
as these cannot be translated as simply to values of multimodal 
attributes as can the less precise (but more intuitive) natural
language or numerical values.

<h4><a name="aud">Audio</a></h4>

I list here all attributes described by
<a href="http://www.physics.mcgill.ca/WWW/seibert/style/raman.html">Raman</a>, 
with the exception of speech-other, which is suggested for
experimental purposes, and spatial-audio, which is suggested
for possible use in the future.  The attributes proposed by the 
<a href="http://www.esat.kuleuven.ac.be/~juanjo/csss1.html">TEO group</a>
are a subset of those proposed by Raman, so they are also included below.  
The emphasized attributes are those that can be naturally combined with
visual attributes.  For simplicity, I use the 
<a href="http://www.w3.org/pub/WWW/Style/css/">CSS</a>
syntax, although the proposal could be written using the notation of 
either CSS or 
<a href="http://occam.sjf.novell.com:8080/docs/dsssl-o/do951212.htm">DSSSL
Online</a>.

<p> <table>

<tr> <th align=left>Attribute name:</th> 
     <th align=left>Natural language and numerical values</th> </tr> <p>
<tr> <td><em>volume</em>:</td> <td>soft | medium | loud | [0-10]</td> </tr> <p>
<tr> <td>[left | right]-volume:</td> <td>&lt;none&gt;</td> </tr> <p>
<tr> <td>voice-family:</td> <td>&lt;string&gt; (name)</td> </tr> <p>
<tr> <td>speech-rate:</td> <td>slow | medium | fast | [1-10]</td> </tr> <p>
<tr> <td>average-pitch:</td> <td>[1-10]</td> </tr> <p>
<tr> <td><em>pitch-range</em>:</td> <td>[0-200]</td> </tr> <p>
<tr> <td><em>stress</em>:</td> <td>[0-100]</td> </tr> <p>
<tr> <td>richness:</td> <td>[0-100]</td> </tr> <p>
<tr> <td><em>pause-[before | after | around]</em>:</td> 
     <td>&lt;none&gt;</td> </tr> <p>
<tr> <td>pronunciation-mode:</td> <td>&lt;string&gt;</td> </tr> <p>
<tr> <td><em>language:</em></td> <td>&lt;string&gt;</td> </tr> <p>
<tr> <td><em>country:</em></td> <td>&lt;string&gt;</td> </tr> 
<tr> <td>dialect:</td> <td>&lt;string&gt; (name)</td> </tr> <p>
<tr> <td><em>[before | after | during]-sound</em>:</td> 
     <td>&lt;uri&gt;</td> </tr> 
</table> </p>

<h4><a name="vis">Visual</a></h4>

I will not list the full range of visual attributes that can be 
controlled by proposed HTML style sheets.  Instead, I give only
the attributes that are naturally linked with audio attributes.  
I use the nomenclature of 
<a href="http://www.w3.org/pub/WWW/Style/css/">CSS</a>, although 
these attributes can be equally well expressed using the terminology of
<a href="http://occam.sjf.novell.com:8080/docs/dsssl-o/do951212.htm">DSSSL
Online</a>.

<p> <table>

<tr> <th align=left>Attribute name:</th> 
     <th align=left>Natural language and numerical values</th> </tr> <p>
<tr> <td>font-size:</td> <td>xx-small | x-small | small | medium | large | 
     x-large | xx-large</td> </tr> <p>
<tr> <td>font-style:</td> <td>normal | italic | oblique | small-caps | 
     [ italic | oblique ] small-caps</td> </tr> <p>
<tr> <td>font-weight:</td> <td>extra-light | light | demi-light | medium | 
     demi-bold | bold | extra-bold</td> </tr> <p>
<tr> <td>padding:</td> <td>auto</td> </tr> <p>
<tr> <td>background:</td> <td>transparent | &lt;uri&gt;</td> </tr>
</table> </p>

<h2><a name="mul">Multimodal attributes</a></h2>

Here I give an example of the solution to the second design goal by
proposing a set of multimodal attributes designed for simultaneous
control of visual and audio displays.
In a number of cases, visual and audio attributes given above can be 
expressed by a common meaning.  In these cases, the visual and audio 
attributes can be combined in a natural manner to produce multimodal 
attributes.  I propose the multimodal style attributes listed in the 
following table and defined below. 

<p> <table>

<tr> <th align=left>Multimodal attribute:</th> 
     <th align=left>Audio name,</th> 
     <th align=left>Visual name</th> </tr> <p>
<tr> <td>size:</td> <td>volume,</td> <td>font-size</td> </tr> <p>
<tr> <td>range:</td> <td>pitch-range,</td> <td>font-style</td> </tr> <p>
<tr> <td>weight:</td> <td>stress,</td> <td>font-weight</td> </tr> <p>
<tr> <td>space:</td> <td>pause-[before | after | around],</td> 
     <td>padding</td> </tr> <p>
<tr> <td>background:</td> <td>[before | after | during]-sound,</td> 
     <td>background</td> </tr> 
</table> </p>
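
The table can be read as a lookup from each multimodal attribute to the unimodal attribute it stands for in each mode.  A minimal Python sketch of that reading follows.  The dictionary layout and the function name unimodal_name are illustrative assumptions and are not part of any style sheet proposal.

```python
# Lookup transcribed from the table of multimodal attributes.
# The data layout and the function name are hypothetical, for illustration.
# The fourth attribute is keyed as "space", the name used in its definition.
MULTIMODAL = {
    "size": {"audio": "volume", "visual": "font-size"},
    "range": {"audio": "pitch-range", "visual": "font-style"},
    "weight": {"audio": "stress", "visual": "font-weight"},
    "space": {"audio": "pause-[before | after | around]", "visual": "padding"},
    "background": {"audio": "[before | after | during]-sound",
                   "visual": "background"},
}


def unimodal_name(attribute, mode):
    """Return the unimodal attribute that `attribute` controls in `mode`."""
    return MULTIMODAL[attribute][mode]
```

For example, unimodal_name("weight", "audio") gives "stress", so a speech browser reading a weight declaration would know which of its own controls to adjust.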

<h4><a name="def">Definitions, proposed values, and mnemonics</a></h4>


<dt>size: 1 | 2 | 3 | 4 | 5 | 6 | 7
<dd>The relationship here is obvious - larger text, louder speech, and
higher numbers will <em>usually</em> be associated.  If they are not, 
authors should use a suitable combination of unimodal style attributes, 
but if they are, authors will minimize their work by using the multimodal 
forms.  Possible mnemonics for the values (from musical notation): 
pianissimo | piano | mezzopiano | mezzo | mezzoforte | forte | fortissimo.

<dt>range: 1 | 2 | 3 | 4 | 5 | 6 | 7
<dd>Here again the relation is fairly obvious if you consider how printed 
words are normally spoken (e.g., "It's not <em>really</em> important 
...").  The mapping is a bit trickier, mainly because voices are so much 
more expressive in this regard than print.  Probably values 1-4 would map 
to normal type, and 5-7 would map to italics or oblique type.  Possible
mnemonics (could use work): dead | dull | boring | normal | happy | 
excited | wild.

<dt>weight: 1 | 2 | 3 | 4 | 5 | 6 | 7
<dd>Stress and font-weight are again fairly naturally related, and
the mapping from numbers to current natural language values is obvious.
Possible mnemonics (more or less from boxing): feather | light | 
midlight | middle | midheavy | heavy | superheavy.

<dt>space: 1 | 2 | 3 | 4 | 5 | 6 | 7 
    {above/right/below/left specified following 
    <a href="http://www.w3.org/pub/WWW/Style/css/">CSS</a>}
<dd>Here the attribute values should be tied to the visual presentation,
which is richer because printed spaces are two-dimensional while 
audio spaces can only be one-dimensional.  Space should be tied 
to the visual attribute of padding or margin;  I picked padding, but
I think that either could be chosen.  Possible mnemonics: none (a bit
counter-intuitive at 1) | narrower | narrow | normal | wide | wider |

<dt>background: &lt;uri&gt;
<dd>Here you just save a little time, but again the meanings match so it
makes sense to allow authors to simultaneously specify audio and visual
backgrounds.  The allowed values are the same, so the presentation
software must interpret the URI, but that is trivial - visual backgrounds
go with visual presentations, audio with audio, and so on.  Maybe
style sheets should also provide visual before- and after-cues, to go
along with the audio cues that
<a href="http://www.physics.mcgill.ca/WWW/seibert/style/raman.html">Raman</a>
proposes.
Once one is allowed, the other would follow naturally in the same way that
background can be used naturally for both audio and visual presentation
without the need for any extra notation.
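
The value mappings described in these definitions can be sketched in Python as follows.  The keyword scales and numerical ranges are copied from the unimodal attribute tables, and the split of range into normal (1-4) and italic (5-7) follows the tentative suggestion above.  The linear scaling used for the audio side is purely an assumption made for illustration, since no audio mapping is proposed here, and the function names are hypothetical.

```python
# Keyword scales copied from the visual attribute table; index 0 is level 1.
FONT_SIZE = ["xx-small", "x-small", "small", "medium",
             "large", "x-large", "xx-large"]
FONT_WEIGHT = ["extra-light", "light", "demi-light", "medium",
               "demi-bold", "bold", "extra-bold"]

# Numerical ranges copied from the audio attribute table.
AUDIO_RANGE = {
    "size": ("volume", 0, 10),
    "range": ("pitch-range", 0, 200),
    "weight": ("stress", 0, 100),
}


def check_level(level):
    """Reject anything that is not an integer from 1 to 7."""
    if not (isinstance(level, int) and 1 <= level <= 7):
        raise ValueError(f"level must be an integer from 1 to 7, got {level!r}")


def visual_value(attribute, level):
    """Map a multimodal level to the matching visual keyword."""
    check_level(level)
    if attribute == "size":
        return FONT_SIZE[level - 1]
    if attribute == "weight":
        return FONT_WEIGHT[level - 1]
    if attribute == "range":
        # Tentative split from the text: 1-4 normal type, 5-7 italic type.
        return "normal" if level <= 4 else "italic"
    raise KeyError(attribute)


def audio_value(attribute, level):
    """Map a level to (audio attribute, number); ASSUMED linear scaling."""
    check_level(level)
    name, low, high = AUDIO_RANGE[attribute]
    return name, low + (level - 1) * (high - low) / 6
```

Under these assumptions the default level 4 lands on the keyword "medium" for both size and weight, and on the midpoint of each audio range.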


<h4><a name="oth">Other values</a></h4>

Other values can (and generally should) be allowed for most of these
attributes.  Physical values obviously should be allowed, to give
authors detailed control over document formats; however, the allowed
values were selected to be as useful and intuitive as possible, to 
encourage casual authors to use them rather than physical values.  They
should be granular enough to give good control, but not so granular as
to be confusing.  The mnemonics could also be allowed values, although
I am not sure that I would recommend this in general.

<h2><a name="sta">Standardization and language independence</a></h2> 

I simultaneously address the third and fourth design goals, 
standardization of allowed values and language independence, by
allowing numbers, e.g. [1-7], to be used to represent the 
values of multimodal attributes for which this procedure seems to be
intuitively reasonable, in their natural order (smallest=1, 
normal=4, largest=7).  This is similar to the practices proposed by 
<a href="http://www.physics.mcgill.ca/WWW/seibert/style/raman.html">Raman</a>
and the <a href="http://www.esat.kuleuven.ac.be/~juanjo/csss1.html">TEO 
group</a>; the only significant difference is the proposal to use the
same range of numbers for all attributes.  I allow 7 values because 
that seems to provide a reasonable amount of granularity for most 
attributes.  Using the same range for 
most attributes makes it simpler for authors to remember the allowed 
values and their meanings (e.g., 4 is always the default), so I 
suggest that the range remain the same across attributes if possible,
even if a different global range is preferred.

Using numbers gives relative language independence because numerical
notation is more widespread than any language.  If numbers are allowed,
authors can learn the definitions of the numbers in whatever language 
they prefer.  There will still be some difficulty for authors
who are not familiar with Arabic numerals, but this could be dealt with
simply by also allowing numerals from other scripts with the same
meanings; because numbers are well defined, translation is trivial.
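
As an illustration of how little work such translation requires, Python's built-in int() already accepts decimal digits from any Unicode script.  The sketch below relies on that behaviour; the function name parse_level and the error message are hypothetical.

```python
def parse_level(text):
    """Parse a style value written with decimal digits of any script."""
    level = int(text.strip())  # int() accepts Unicode decimal digits
    if not 1 <= level <= 7:
        raise ValueError(f"level must be between 1 and 7, got {level}")
    return level
```

With this function, parse_level("4"), parse_level("\u0664") (ARABIC-INDIC DIGIT FOUR), and parse_level("\u096a") (DEVANAGARI DIGIT FOUR) all yield the same integer, so a style processor would not need separate code for each script.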

Because it simultaneously solves two design problems, the practice of
using a standard numerical range as allowed values would also be an
advantageous practice for general styling language design.
An additional minor benefit is also obtained: because each integer is 
represented by a single character, the amount of typing needed to create
style descriptions is reduced.

<h2><a name="ind">Independent specification of attributes</a></h2>

The fifth design goal, allowing independent specifications for audio
and visual presentations, is also easily met.  Under 
<a href="http://www.w3.org/pub/WWW/Style/css/">CSS</a>, the use of 
multimodal attributes would <em>not</em> preclude the specification of 
refinements to any single mode of presentation.  Rather, authors should 
first specify the document style as accurately as possible using 
multimodal attributes, and then add further refinements through 
modifications of unimodal attributes, so that the document is presented 
well to all users.
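
The order of operations suggested here, multimodal first and unimodal refinements afterwards, can be sketched as follows.  This is a simplified model that assumes later declarations simply override earlier ones; it ignores the rest of the CSS cascade.  The volume scale is a placeholder (a rounded linear spread over the range 0-10), and all names in the code are hypothetical.

```python
FONT_SIZE = ["xx-small", "x-small", "small", "medium",
             "large", "x-large", "xx-large"]
# Placeholder volume scale for levels 1..7; NOT a proposed mapping.
VOLUME = [0, 2, 3, 5, 7, 8, 10]

# What the multimodal attribute "size" expands to in each mode.
SIZE_EXPANSION = {"visual": ("font-size", FONT_SIZE),
                  "audio": ("volume", VOLUME)}
# Which mode each unimodal attribute belongs to.
UNIMODAL_MODE = {"font-size": "visual", "volume": "audio"}


def resolve(mode, declarations):
    """Apply (attribute, value) pairs in order for one presentation mode."""
    style = {}
    for attribute, value in declarations:
        if attribute == "size":
            # Multimodal declaration: derive the value for this mode.
            name, scale = SIZE_EXPANSION[mode]
            style[name] = scale[int(value) - 1]
        elif UNIMODAL_MODE.get(attribute) == mode:
            # Unimodal refinement: overrides any derived value.
            style[attribute] = value
        # Declarations for other modes are ignored.
    return style


sheet = [("size", "5"), ("font-size", "x-large")]
visual = resolve("visual", sheet)   # refinement wins for visual display
audio = resolve("audio", sheet)     # audio keeps the derived volume
```

The visual refinement changes only the visual result; the audio presentation still receives a sensible value from the single multimodal declaration.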

<h2><a name="enc">Encouraging authors to produce multimodal
documents</a></h2>

My sixth design goal is to encourage authors to use the multimodal
attributes provided by the styling language.  Establishing and
standardizing multimodal attributes is necessary to enable authors to
easily produce rich web documents for multimodal display, but it is not
sufficient to ensure that authors regularly produce rich documents and
styles suitable for multimodal presentation.  Additional steps should be
taken to encourage authors to create customized multimodal style
descriptions in place of unimodal descriptions.  For example, in a
well-designed styling language, authors should be pushed to use
multimodal attributes as the first step of designing styles, in
preference to unimodal attributes.

There will be some resistance to using multimodal attributes, as
many authors, regardless of their level of experience, are accustomed to
working primarily with unimodal (usually visual) attributes.  To
counteract this resistance, styling language designers should not give
an overcomplete set of attributes, i.e., all unimodal <em>and</em>
multimodal attributes.  Instead, for each group of unimodal attributes
that combines to form a multimodal attribute, the multimodal attribute
should replace the richest of the unimodal attributes (so that authors
are likely to need less refinement of unimodal attributes).  For the
attributes proposed above, this scheme could be implemented as follows.


<dt><a name="size">size</a>
  <dd>could probably replace either attribute 
  reasonably well.  It should probably replace the visual attribute, 
  font-size, as authors are probably more likely to prescribe visual
  style than audio style.  In this case, the name would be a bit less
  intuitive than before, but this could be an advantage as it would serve
  to remind the author that the attribute is multimodal rather than
  unimodal.

<dt>range
  <dd>would replace the audio attribute, pitch-range, which
  has more granularity and therefore carries more information than the
  visual attribute, font-style.  The association with an audio property 
  might discourage visual authors from using this attribute, however,
  especially because the two attributes overlap in meaning but are not
  equivalent, so I suggest a replacement of 
  <a href="#font-style">font-style</a> as discussed below.

<dt>weight
  <dd>would replace the visual attribute, font-weight,
  following the case for <a href="#size">size</a>.

<dt>space
  <dd>would again replace the visual attribute, padding, as in
  the cases above (although padding may be a better name for this
  attribute).

<dt>background
  <dd>would replace both attributes.  The use of multiple
  URIs of different types should be allowed, as presentation software can
  tell easily which to use from the context (e.g., use a visual
  background for visually displayed text, not an audio background).


In addition to the new multimodal attributes, two changes to the 
<a href="http://www.w3.org/pub/WWW/Style/css/">CSS</a> scheme would be
needed.  I propose that small-caps be added to the set of allowed values
for the CSS attribute text-transform, and a new attribute emph-style be
added, as given below.  The old attribute, font-style, would then be
expressed through combinations of range, text-transform, and emph-style.


<dt>emph-style: italic | oblique
  <dd>controls whether high-range text is presented in italic or oblique
  type.
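
To check that the old font-style values can indeed be recovered from range, text-transform, and emph-style, the combination can be sketched in Python.  The threshold of 5 for high-range text follows the tentative mapping given in the definition of range; the function name and its default arguments are hypothetical.

```python
def legacy_font_style(range_level, text_transform="none", emph_style="italic"):
    """Rebuild a CSS font-style value from the three proposed attributes."""
    if not 1 <= range_level <= 7:
        raise ValueError("range must be between 1 and 7")
    if emph_style not in ("italic", "oblique"):
        raise ValueError("emph-style must be italic or oblique")
    # Assumed threshold: levels 5-7 count as high-range text.
    slant = emph_style if range_level >= 5 else "normal"
    if text_transform == "small-caps":
        # "normal small-caps" is written simply as "small-caps".
        return "small-caps" if slant == "normal" else slant + " small-caps"
    return slant
```

Every value in the font-style row of the visual attribute table can be produced this way, for example legacy_font_style(7, "small-caps", "oblique") for "oblique small-caps".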


<h2><a name="pre">Precise specification of multimodal attributes</a></h2>

One difficulty with the scheme proposed here is that many authors will 
want to specify physical quantities, such as the font-size, very
precisely.  Solving this problem simply requires a standard mapping
from the interval [1-7] to the range of reasonable physical values for
each attribute.  Then, if the author specifies a physical value (with
units) for a multimodal attribute, the units will enable the presentation
software to determine to which mode the value applies, and the mappings
can be used to calculate the appropriate values for the other modes
connected with the attribute.

Alternatively, if such a mapping exists, the value could be more 
precisely specified by allowing any number in the range [1-7], and not
just integers.  This is probably preferable, as it decreases the
device-dependence of the style, and so should probably be allowed.
However, physical values (with units) should be allowed in any case, as
a large body of authors are accustomed to using those values.

I have not attempted to produce the required mappings here.  That is
left for the present as a research problem, as it will be best solved
experimentally by having a wide range of subjects evaluate a large
number of displays with a variety of mappings.
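
Although the real mappings remain to be determined, the mechanism itself can be illustrated with placeholders.  In the Python sketch below, the unit of a value identifies its mode, the value is converted to a level in [1-7], and that level (which may be fractional) is converted to a physical value for every mode.  The linear form of the mapping, the endpoint values (6-24 pt and 40-80 dB), and all names are assumptions chosen only to make the example run; they are not proposed values.

```python
import re

# Placeholder endpoints for levels 1 and 7; NOT proposed values.
PHYSICAL = {
    "size": {"pt": ("visual", 6.0, 24.0), "dB": ("audio", 40.0, 80.0)},
}


def level_to_physical(attribute, unit, level):
    """Assumed linear map from a level in [1, 7] to a physical quantity."""
    _, low, high = PHYSICAL[attribute][unit]
    return low + (level - 1) * (high - low) / 6


def physical_to_level(attribute, unit, quantity):
    """Inverse of level_to_physical."""
    _, low, high = PHYSICAL[attribute][unit]
    return 1 + 6 * (quantity - low) / (high - low)


def interpret(attribute, value):
    """Turn '15pt' or '4.5' into a physical value for every mode."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    # A unit marks a physical value; a bare number is already a level.
    level = physical_to_level(attribute, unit, number) if unit else number
    if not 1 <= level <= 7:
        raise ValueError(f"{value!r} is outside the range of levels 1 to 7")
    return {mode: (level_to_physical(attribute, u, level), u)
            for u, (mode, _, _) in PHYSICAL[attribute].items()}
```

With these placeholder endpoints, interpret("size", "15pt") corresponds to level 4 and therefore also yields a mid-range volume, while interpret("size", "4.5") shows a non-integer level being applied to both modes.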

<h2><a name="sum">Summary</a></h2>

Although I have used HTML style sheet proposals as examples here, the
design goals and the methods used to achieve them would apply to
multimodal styling languages for use with any DTD.  The third and
fourth goals, standardization of allowed attribute values and minimal language
dependence, are also useful goals for unimodal styling language
designers (as are the first and seventh, but they are so intuitive that
they are almost universally followed).

This is not a complete proposal for a multimodal HTML styling
language.  I have, however, proposed the creation of five multimodal
style attributes, and the elimination or modification of five related
visual or audio attributes in order to encourage authors to use the
multimodal attributes.  As proposed, the creation of one new visual
attribute and the slight modification of another would also be
necessary to recover the full proposed functionality of 
<a href="http://www.w3.org/pub/WWW/Style/css/">CSS</a>.

The changes proposed to current HTML styling languages are
small but important.  Without these changes, it will be significantly
harder for web authors to produce rich documents that can be presented
well in all modes, and therefore most documents will be designed for
unimodal (probably visual) display.  These changes could be implemented
in the future, but this would result in a significant diminution of their
full power, as once visual authors become accustomed to visually oriented
styling languages there will be more resistance to the multimodal forms.
In addition, there may be some problems with backward compatibility of 
attributes, as in the case of font-style; these may be easily avoided
if designers plan now for eventual conversion to multimodal styling
languages.  Thus, I suggest that the proposed changes be made early,
before visual styling technology has a chance to become widespread and
the current technology is locked in.


Last modified 1 March 1996 by 
<a href="http://www.physics.mcgill.ca/~seibert/">David Seibert</a>.
<address><a href="mailto:seibert@hep.physics.mcgill.ca">