
csswg/css3-speech Overview.html,1.100,1.101 Overview.src.html,1.102,1.103

From: Daniel Weck via cvs-syncmail <cvsmail@w3.org>
Date: Mon, 20 Feb 2012 23:48:11 +0000
To: public-css-commits@w3.org
Message-Id: <E1Rzcxn-00068t-EQ@lionel-hutz.w3.org>
Update of /sources/public/csswg/css3-speech
In directory hutz:/tmp/cvs-serv23555

Modified Files:
	Overview.html Overview.src.html 
Log Message:
reworded the audio cues volume level prose, now in its own section


Index: Overview.html
===================================================================
RCS file: /sources/public/csswg/css3-speech/Overview.html,v
retrieving revision 1.100
retrieving revision 1.101
diff -u -d -r1.100 -r1.101
--- Overview.html	14 Feb 2012 01:17:10 -0000	1.100
+++ Overview.html	20 Feb 2012 23:48:09 -0000	1.101
@@ -88,14 +88,14 @@
 
    <h1 id=top>CSS Speech Module</h1>
 
-   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 14 February
+   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 20 February
     2012</h2>
 
    <dl id=versions>
     <dt>This version:
 
     <dd>
-     <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120214/">http://www.w3.org/TR/2012/ED-css3-speech-20120214/</a>-->
+     <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120220/">http://www.w3.org/TR/2012/ED-css3-speech-20120220/</a>-->
      <a
      href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>
      
@@ -305,7 +305,10 @@
       </span>The &lsquo;<code class=property>cue-before</code>&rsquo; and
       &lsquo;<code class=property>cue-after</code>&rsquo; properties</a>
 
-     <li><a href="#cue-props-cue"><span class=secno>11.2. </span>The
+     <li><a href="#cue-props-volume"><span class=secno>11.2. </span>Relation
+      between audio cues and speech synthesis volume levels</a>
+
+     <li><a href="#cue-props-cue"><span class=secno>11.3. </span>The
       &lsquo;<code class=property>cue</code>&rsquo; shorthand property</a>
     </ul>
 
@@ -388,11 +391,11 @@
    "screen readers" allow users to interact with visual interfaces that would
    otherwise be inaccessible to them. There are also circumstances in which
    <em>listening</em> to content (as opposed to <em>reading</em>) is
-   preferred, or sometimes even required, regardless of a person's intrinsic
-   physical ability to access information. For instance: playing an e-book
-   whilst driving a vehicle, learning how to manipulate industrial and
-   medical devices, interacting with home entertainment systems, teaching
-   young children how to read.
+   preferred, or sometimes even required, irrespective of a person's physical
+   ability to access information. For instance: playing an e-book whilst
+   driving a vehicle, learning how to manipulate industrial and medical
+   devices, interacting with home entertainment systems, teaching young
+   children how to read.
 
   <p> The CSS properties defined in the Speech module enable authors to
    declaratively control the presentation of a document in the aural
@@ -419,11 +422,11 @@
    specifically for the aural dimension.
 
   <p> Content creators can conditionally include CSS properties dedicated to
-   user-agents with text to speech synthesis capabilities, by specifying the
+   user agents with text to speech synthesis capabilities, by specifying the
    "speech" media type via the <code>media</code> attribute of the
    <code>link</code> element, or with the <code>@media</code> at-rule, or
    within an <code>@import</code> statement. When styles are authored within
-   the scope of such conditional statements, they are ignored by user-agents
+   the scope of such conditional statements, they are ignored by user agents
    that do not support the Speech module.
 
   <h2 id=ssml-rel><span class=secno>3. </span>Relationship with SSML</h2>
@@ -590,7 +593,7 @@
    control the amplitude of the audio waveform generated by the speech
    synthesiser, and is also used to adjust the relative volume level of <a
    href="#cue-props">audio cues</a> within the <a href="#aural-model">audio
-   "box" model</a>.
+   box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -635,12 +638,13 @@
     <strong>medium</strong>, <strong>loud</strong>, <strong>x-loud</strong>
 
    <dd>
-    <p> This sequence of keywords corresponds to monotonically non-decreasing
-     volume levels, mapped to implementation-dependent values (i.e. inferred
-     by the user-agent) that meet the user's requirements in terms of
-     perceived sound loudness . The keyword &lsquo;<code
-     class=property>x-soft</code>&rsquo; maps to the user's <em>minimum
-     audible</em> volume level, &lsquo;<code
+    <p>This sequence of keywords corresponds to monotonically non-decreasing
+     volume levels, mapped to implementation-dependent values that meet the
+     listener's requirements with regard to perceived sound loudness. These
+     audio levels are typically provided via a preference mechanism that
+     allows users to set options according to their auditory environment. The
+     keyword &lsquo;<code class=property>x-soft</code>&rsquo; maps to the
+     user's <em>minimum audible</em> volume level, &lsquo;<code
      class=property>x-loud</code>&rsquo; maps to the user's <em>maximum
      tolerable</em> volume level, &lsquo;<code
      class=property>medium</code>&rsquo; maps to the user's
@@ -796,22 +800,22 @@
      the resulting number to &lsquo;<code class=css>100</code>&rsquo;.</p>
   </dl>
 
-  <p> User agents may be connected to different kinds of sound systems,
+  <p> User agents may be connected to different kinds of sound systems,
    featuring varying audio mixing capabilities. The expected behavior for
    mono, stereo, and surround sound systems is defined as follows:
 
   <ul>
-   <li> When user-agents produce audio via a mono-aural sound system (i.e.
+   <li> When user agents produce audio via a mono-aural sound system (i.e.
     single-speaker setup), the &lsquo;<a href="#voice-balance"><code
     class=property>voice-balance</code></a>&rsquo; property has no effect.
 
-   <li> When user-agents produce audio through a stereo sound system (e.g.
+   <li> When user agents produce audio through a stereo sound system (e.g.
     two speakers, a pair of headphones), the left-right distribution of audio
     signals can precisely match the authored values for the &lsquo;<a
     href="#voice-balance"><code
     class=property>voice-balance</code></a>&rsquo; property.
 
-   <li> When user-agents are capable of mixing audio signals through more
+   <li> When user agents are capable of mixing audio signals through more
     than 2 channels (e.g. 5-speakers surround sound system, including a
     dedicated center channel), the physical distribution of audio signals
     resulting from the application of the &lsquo;<a
@@ -826,7 +830,7 @@
   <p> Future revisions of the CSS Speech module may include support for
    three-dimensional audio, which would effectively enable authors to specify
    "azimuth" and "elevation" values. In the future, content authored using
-   the current specification may therefore be consumed by user-agents which
+   the current specification may therefore be consumed by user agents which
    are compliant with the version of CSS Speech that supports
    three-dimensional audio. In order to prepare for this possibility, the
    values enabled by the current &lsquo;<a href="#voice-balance"><code
@@ -877,7 +881,7 @@
    and therefore do not intrinsically support the &lsquo;<a
    href="#voice-balance"><code class=property>voice-balance</code></a>&rsquo;
    property. The sound distribution along the left-right axis consequently
-   occurs at post-synthesis stage (when the speech-enabled user-agent mixes
+   occurs at post-synthesis stage (when the speech-enabled user agent mixes
    the various audio sources authored within the document)
 
   <h2 id=speaking-props><span class=secno>8. </span>Speaking properties</h2>
@@ -1078,7 +1082,7 @@
     <p class=note>Speech synthesizers are knowledgeable about what a
      <em>number</em> is. The &lsquo;<a href="#speak-as"><code
      class=property>speak-as</code></a>&rsquo; property enables some level of
-     control on how user-agents render numbers, and may be implemented as a
+     control on how user agents render numbers, and may be implemented as a
      preprocessing step before passing the text to the actual speech
      synthesizer.</p>
 
@@ -1200,7 +1204,7 @@
    class=property>cue-before</code></a>&rsquo; (or &lsquo;<a
    href="#cue-after"><code class=property>cue-after</code></a>&rsquo;) is
    specified, before (or after) the cue within the <a
-   href="#aural-model">aural "box" model</a>.
+   href="#aural-model">aural box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -1208,7 +1212,7 @@
    element</a> from the SSML markup language <a href="#SSML"
    rel=biblioentry>[SSML]<!--{{!SSML}}--></a>, the application of &lsquo;<a
    href="#pause"><code class=property>pause</code></a>&rsquo; prosodic
-   boundaries within the <a href="#aural-model">aural "box" model</a> of CSS
+   boundaries within the <a href="#aural-model">aural box model</a> of CSS
    Speech requires special considerations (e.g. <a
    href="#collapsed-pauses">"collapsed" pauses</a>).
 
@@ -1244,7 +1248,7 @@
 
   <div class=example>
    <p> This example illustrates how the default strengths of prosodic breaks
-    for specific elements (which are defined by the user-agent stylesheet)
+    for specific elements (which are defined by the user agent stylesheet)
     can be overridden by authored styles.</p>
 
    <pre>
@@ -1481,7 +1485,7 @@
    href="#rest-after"><code class=property>rest-after</code></a>&rsquo;
    properties specify a prosodic boundary (silence with a specific duration)
    that occurs before (or after) the speech synthesis rendition of an element
-   within the <a href="#aural-model">audio "box" model</a>.
+   within the <a href="#aural-model">audio box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -1489,7 +1493,7 @@
    element</a> from the SSML markup language <a href="#SSML"
    rel=biblioentry>[SSML]<!--{{!SSML}}--></a>, the application of &lsquo;<a
    href="#rest"><code class=property>rest</code></a>&rsquo; prosodic
-   boundaries within the <a href="#aural-model">aural "box" model</a> of CSS
+   boundaries within the <a href="#aural-model">aural box model</a> of CSS
    Speech requires special considerations (e.g. interspersed audio cues,
    additive adjacent rests).
 
@@ -1686,17 +1690,17 @@
    href="#cue-after"><code class=property>cue-after</code></a>&rsquo;
    properties specify auditory icons (i.e. pre-recorded / pre-generated sound
    clips) to be played before (or after) the selected element within the <a
-   href="#aural-model">audio "box" model</a>.
+   href="#aural-model">audio box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property may appear related to the <a
    href="http://www.w3.org/TR/speech-synthesis11/#edef_audio"><code>audio</code>
    element</a> from the SSML markup language <a href="#SSML"
    rel=biblioentry>[SSML]<!--{{!SSML}}--></a>, there are in fact major
-   discrepancies. For example, the <a href="#aural-model">aural "box"
-   model</a> means that audio cues are associated to the selected element's
-   volume level, and CSS Speech's auditory icons provide limited
-   functionality compared to SSML's <code>audio</code> element.
+   discrepancies. For example, the <a href="#aural-model">aural box model</a>
+   means that audio cues are associated to the selected element's volume
+   level, and CSS Speech's auditory icons provide limited functionality
+   compared to SSML's <code>audio</code> element.
 
   <dl>
    <dt> <strong>&lt;uri&gt;</strong>
@@ -1719,10 +1723,10 @@
      (decibel unit). This represents a change (positive or negative) relative
      to the computed value of the &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; property within the <a
-     href="#aural-model">aural "box" model</a> of the selected element.
-     Decibels express the ratio of the squares of the new signal amplitude
-     (a1) and the current amplitude (a0), as per the following logarithmic
-     equation: volume(dB) = 20 log10 (a1 / a0)</p>
+     href="#aural-model">aural box model</a> of the selected element (as a
+     result, the volume level of audio cues changes when the &lsquo;<a
+     href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
+     property changes). When omitted, the implied value computes to 0dB.</p>
 
     <p> When the &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; property is set to
@@ -1731,51 +1735,18 @@
      this specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
      class=property>silent</code>&rsquo;), &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
-     values are always specified relatively to the volume level keywords,
-     which map to a user-configured scale of "preferred" loudness settings
-     (see the definition of &lsquo;<a href="#voice-volume"><code
-     class=property>voice-volume</code></a>&rsquo;). If the inherited
+     values are always specified relatively to the volume level keywords (see
+     the definition of &lsquo;<a href="#voice-volume"><code
+     class=property>voice-volume</code></a>&rsquo;), which map to a
+     user-configured scale of "preferred" loudness settings. If the inherited
      &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; value already contains a
      decibel offset, the dB offset specific to the audio cue is combined
-     additively.
+     additively.</p>
 
-    <p> The desired effect of an audio cue set at +0dB is that the volume
-     level during playback of the pre-recorded / pre-generated audio signal
-     is effectively the same as the loudness of live (i.e. real-time) speech
-     synthesis rendition. In order to achieve this effect, speech processors
-     are capable of directly controlling the waveform amplitude of generated
-     text-to-speech audio, user agents must be able to adjust the volume
-     output of audio cues (i.e. amplify or attenuate audio signals based on
-     the intrinsic waveform amplitude of digitized sound clips), and last but
-     not least, authors must ensure that the "normal" volume level of
-     pre-recorded audio cues (on average, as there may be discrete loudness
-     variations due to changes in the audio stream, such as intonation,
-     stress, etc.) matches that of a "typical" TTS voice output (based on the
-     &lsquo;<a href="#voice-family"><code
-     class=property>voice-family</code></a>&rsquo; intended for use), given
-     standard listening conditions (i.e. default system volume levels,
-     centered equalization across the frequency spectrum). This latter
-     prerequisite sets a baseline that enables a user agent to align the
-     volume outputs of both TTS and cue audio streams within the same aural
-     "box" model. Due to the complex relationship between perceived audio
-     characteristics and the processing applied to the digitized audio
-     signal, we will simplify the definition of "normal" volume levels by
-     referring to a canonical recording scenario, whereby the attenuation is
-     typically indicated in decibels, ranging from 0dB (maximum audio input,
-     near clipping threshold) to -60dB (total silence). In this common
-     context, a "standard" audio clip would oscillate between these values,
-     the loudest peak levels would be close to -3dB (to avoid distortion),
-     and the relevant audible passages would have average (RMS) volume levels
-     as high as possible (i.e. not too quiet, to avoid background noise
-     during amplification). This would roughly provide an audio experience
-     that could be seamlessly combined with text-to-speech output (i.e. there
-     would be no discernible difference in volume levels when switching from
-     pre-recorded audio to speech synthesis). Although there exists no
-     industry-wide standard to support such convention, TTS engines usually
-     generate comparably-loud audio signals when no gain or attenuation is
-     specified. For voice and soft music, -15dB RMS seems to be pretty
-     standard.</p>
+    <p> Decibels express the ratio of the squares of the new signal amplitude
+     (a1) and the current amplitude (a0), as per the following logarithmic
+     equation: volume(dB) = 20 log10 (a1 / a0)</p>
 
     <p class=note> Note that -6.0dB is approximately half the amplitude of
      the audio signal, and +6.0dB is approximately twice the amplitude.</p>
@@ -1808,7 +1779,60 @@
 div.caution { cue-before: url(./audio/caution.wav) +8dB; }</pre>
   </div>
 
-  <h3 id=cue-props-cue><span class=secno>11.2. </span>The &lsquo;<a
+  <h3 id=cue-props-volume><span class=secno>11.2. </span>Relation between
+   audio cues and speech synthesis volume levels</h3>
+
+  <p class=note>Note that this section is informative.
+
+  <p> The volume levels of audio cues and of speech synthesis within the <a
+   href="#aural-model">aural box model</a> of a selected element are related.
+   For example, the desired effect of an audio cue whose volume level is set
+   at +0dB (as specified by the &lt;decibel&gt; value) is that its perceived
+   loudness during playback is close to that of the speech synthesis
+   rendition of the selected element, as dictated by the computed value of the
+   &lsquo;<a href="#voice-volume"><code
+   class=property>voice-volume</code></a>&rsquo; property (which is itself
+   based on a user-configured volume level keyword). Similarly, a
+   &lsquo;<code class=property>silent</code>&rsquo; value for the &lsquo;<a
+   href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
+   property results in any audio cues being "silenced" as well.
+
+  <p> In order to achieve this effect, authors should ensure that the volume
+   level of audio cues (on average, as there may be discrete loudness
+   variations due to changes in the audio stream, such as intonation, stress,
+   etc.) matches that of a "typical" TTS voice output (based on the &lsquo;<a
+   href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
+   intended for use), given "standard" listening conditions (i.e. default
+   system volume levels, centered equalization across the frequency
+   spectrum). As speech processors are capable of directly controlling the
+   waveform amplitude of generated text-to-speech audio, and because user
+   agents are able to adjust the volume output of audio cues (i.e. amplify or
+   attenuate audio signals based on the intrinsic waveform amplitude of
+   digitized sound clips), this sets a baseline that enables implementations
+   to "align" the loudness of both TTS and cue audio streams within the aural
+   box model, relative to user-configured volume levels (see the keywords
+   defined in the &lsquo;<a href="#voice-volume"><code
+   class=property>voice-volume</code></a>&rsquo; property).
+
+  <p> Due to the complex relationship between perceived audio characteristics
+   (e.g. loudness) and the processing applied to the digitized audio signal
+   (e.g. "compression"), we refer to a simple scenario whereby the
+   attenuation is indicated in decibels, typically ranging from 0dB (maximum
+   audio input, near clipping threshold) to -60dB (total silence). Given this
+   context, a "standard" audio clip would oscillate between these values, the
+   loudest peak levels would be close to -3dB (to avoid distortion), and the
+   relevant audible passages would have average (RMS) volume levels as high
+   as possible (i.e. not too quiet, to avoid background noise during
+   amplification). This would roughly provide an audio experience that could
+   be seamlessly combined with text-to-speech output (i.e. there would be no
+   discernible difference in volume levels when switching from pre-recorded
+   audio to speech synthesis). Although there exists no industry-wide
+   standard to support such a convention, different TTS engines tend to
+   generate comparably-loud audio signals when no gain or attenuation is
+   specified. For voice and soft music, -15dB RMS is fairly
+   standard.
+
+  <h3 id=cue-props-cue><span class=secno>11.3. </span>The &lsquo;<a
    href="#cue"><code class=property>cue</code></a>&rsquo; shorthand property</h3>
 
   <table class=propdef summary="name: syntax">
@@ -2103,7 +2127,7 @@
     selected content) must be used.
 
    <li> If no voice is available for the language of the selected content, it
-    is recommended that user-agents let the user know about the lack of
+    is recommended that user agents let the user know about the lack of
     appropriate TTS voice.
   </ol>
 
@@ -2117,7 +2141,7 @@
    example below).
 
   <p class=note>Note that dynamically computing a voice may lead to
-   unexpected lag, so user-agents should try to resolve concrete voice
+   unexpected lag, so user agents should try to resolve concrete voice
    instances in the document tree before the playback starts.
 
   <div class=example>
@@ -2888,7 +2912,7 @@
    <dt> <strong>disc, circle, square</strong>
 
    <dd>
-    <p> For these list item styles, the user-agent defines (possibly based on
+    <p> For these list item styles, the user agent defines (possibly based on
      user preferences) what equivalent phrase is spoken or what audio cue is
      played. List items with graphical bullets are therefore announced
      appropriately in an implementation-dependent manner.</p>
@@ -2919,13 +2943,13 @@
      /a/, /be/, /se/, etc. (phonetic notation)</p>
   </dl>
 
-  <p class=note>Note that it is common for user-agents such as screen readers
+  <p class=note>Note that it is common for user agents such as screen readers
    to announce the nesting depth of list items, or more generally, to
    indicate additional structural information pertaining to complex
    hierarchical content. The verbosity of these additional audio cues and/or
    speech output can usually be controlled by users, and contribute to
    increasing usability. These navigation aids are implementation-dependent,
-   but it is recommended that user-agents supporting the CSS Speech module
+   but it is recommended that user agents supporting the CSS Speech module
    ensure that these additional audio cues and speech output don't generate
    redundancies or create inconsistencies (for example: duplicated or
    different list item numbering scheme).
@@ -3476,7 +3500,7 @@
 
    <li>content, <a href="#content-def" title=content><strong>#</strong></a>
 
-   <li>cue, <a href="#cue" title=cue><strong>11.2.</strong></a>
+   <li>cue, <a href="#cue" title=cue><strong>11.3.</strong></a>
 
    <li>cue-after, <a href="#cue-after"
     title=cue-after><strong>11.1.</strong></a>

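Aside (not part of the spec or of the patch above): the reworded cue prose keeps the logarithmic definition volume(dB) = 20 log10 (a1 / a0) and notes that dB offsets combine additively within the aural box model. A minimal Python sketch of those relations, with illustrative helper names not defined anywhere in the spec:

```python
import math

def db_from_amplitude_ratio(a1, a0=1.0):
    """Decibel change for an amplitude ratio, per volume(dB) = 20*log10(a1/a0)."""
    return 20.0 * math.log10(a1 / a0)

def amplitude_ratio_from_db(db):
    """Inverse mapping: the amplitude scale factor for a given dB change."""
    return 10.0 ** (db / 20.0)

# Matches the spec note: halving the amplitude is about -6.0dB,
# doubling it is about +6.0dB.
print(round(db_from_amplitude_ratio(0.5), 2))   # -6.02
print(round(db_from_amplitude_ratio(2.0), 2))   # 6.02

# The "-15dB RMS" level mentioned for voice and soft music corresponds
# to roughly 18% of full-scale amplitude.
print(round(amplitude_ratio_from_db(-15.0), 3))  # 0.178

# Per the reworded prose, an inherited dB offset on 'voice-volume' and a
# cue's own <decibel> value combine additively, e.g. 'voice-volume: soft -3dB'
# with 'cue-before: url(c.wav) +8dB' yields a net +5dB against the keyword level.
print(-3.0 + 8.0)  # 5.0
```

This is only a back-of-the-envelope check of the arithmetic in the prose; actual loudness alignment between TTS output and audio cues remains implementation-dependent, as the diff itself states.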
Index: Overview.src.html
===================================================================
RCS file: /sources/public/csswg/css3-speech/Overview.src.html,v
retrieving revision 1.102
retrieving revision 1.103
diff -u -d -r1.102 -r1.103
--- Overview.src.html	14 Feb 2012 01:17:10 -0000	1.102
+++ Overview.src.html	20 Feb 2012 23:48:09 -0000	1.103
@@ -155,8 +155,8 @@
       visually-impaired or otherwise print-disabled. For instance, "screen readers" allow users to
       interact with visual interfaces that would otherwise be inaccessible to them. There are also
       circumstances in which <em>listening</em> to content (as opposed to <em>reading</em>) is
-      preferred, or sometimes even required, regardless of a person's intrinsic physical ability to
-      access information. For instance: playing an e-book whilst driving a vehicle, learning how to
+      preferred, or sometimes even required, irrespective of a person's physical ability to access
+      information. For instance: playing an e-book whilst driving a vehicle, learning how to
       manipulate industrial and medical devices, interacting with home entertainment systems,
       teaching young children how to read.</p>
 
@@ -178,11 +178,11 @@
       that apply to the "speech" media type, and defines a new "box" model specifically for the
       aural dimension. </p>
 
-    <p> Content creators can conditionally include CSS properties dedicated to user-agents with text
+    <p> Content creators can conditionally include CSS properties dedicated to user agents with text
       to speech synthesis capabilities, by specifying the "speech" media type via the
         <code>media</code> attribute of the <code>link</code> element, or with the
         <code>@media</code> at-rule, or within an <code>@import</code> statement. When styles are
-      authored within the scope of such conditional statements, they are ignored by user-agents that
+      authored within the scope of such conditional statements, they are ignored by user agents that
       do not support the Speech module. </p>
 
     <h2 id="ssml-rel">Relationship with SSML</h2>
@@ -325,7 +325,7 @@
     </table>
     <p>The 'voice-volume' property allows authors to control the amplitude of the audio waveform
       generated by the speech synthesiser, and is also used to adjust the relative volume level of
-        <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p>
+        <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio box model</a>. </p>
     <p class="note"> Note that although the functionality provided by this property is similar to
       the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code>
         attribute of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]],
@@ -358,12 +358,14 @@
       <dt><strong>x-soft</strong>, <strong>soft</strong>, <strong>medium</strong>,
           <strong>loud</strong>, <strong>x-loud</strong></dt>
       <dd>
-        <p> This sequence of keywords corresponds to monotonically non-decreasing volume levels,
-          mapped to implementation-dependent values (i.e. inferred by the user-agent) that meet the
-          user's requirements in terms of perceived sound loudness . The keyword 'x-soft' maps to
-          the user's <em>minimum audible</em> volume level, 'x-loud' maps to the user's <em>maximum
-            tolerable</em> volume level, 'medium' maps to the user's <em>preferred</em> volume
-          level, 'soft' and 'loud' map to intermediary values. </p>
+        <p>This sequence of keywords corresponds to monotonically non-decreasing volume levels,
+          mapped to implementation-dependent values that meet the listener's requirements with
+          regard to perceived sound loudness. These audio levels are typically provided via a
+          preference mechanism that allows users to set options according to their auditory
+          environment. The keyword 'x-soft' maps to the user's <em>minimum audible</em> volume
+          level, 'x-loud' maps to the user's <em>maximum tolerable</em> volume level, 'medium' maps
+          to the user's <em>preferred</em> volume level, 'soft' and 'loud' map to intermediary
+          values.</p>
       </dd>
       <dt>
         <strong>&lt;decibel&gt;</strong>
@@ -495,16 +497,16 @@
           clamping the resulting number to '100'.</p>
       </dd>
     </dl>
-    <p> User agents may be connected to different kinds of sound systems, featuring varying audio
+    <p> User agents may be connected to different kinds of sound systems, featuring varying audio
       mixing capabilities. The expected behavior for mono, stereo, and surround sound systems is
       defined as follows: </p>
     <ul>
-      <li> When user-agents produce audio via a mono-aural sound system (i.e. single-speaker setup),
+      <li> When user agents produce audio via a mono-aural sound system (i.e. single-speaker setup),
         the 'voice-balance' property has no effect. </li>
-      <li> When user-agents produce audio through a stereo sound system (e.g. two speakers, a pair
+      <li> When user agents produce audio through a stereo sound system (e.g. two speakers, a pair
         of headphones), the left-right distribution of audio signals can precisely match the
         authored values for the 'voice-balance' property. </li>
-      <li> When user-agents are capable of mixing audio signals through more than 2 channels (e.g.
+      <li> When user agents are capable of mixing audio signals through more than 2 channels (e.g.
         5-speakers surround sound system, including a dedicated center channel), the physical
         distribution of audio signals resulting from the application of the 'voice-balance' property
         should be performed so that the listener perceives sound as if it was coming from a basic
@@ -513,8 +515,8 @@
     </ul>
     <p> Future revisions of the CSS Speech module may include support for three-dimensional audio,
       which would effectively enable authors to specify "azimuth" and "elevation" values. In the
-      future, content authored using the current specification may therefore be consumed by
-      user-agents which are compliant with the version of CSS Speech that supports three-dimensional
+      future, content authored using the current specification may therefore be consumed by user
+      agents which are compliant with the version of CSS Speech that supports three-dimensional
       audio. In order to prepare for this possibility, the values enabled by the current
       'voice-balance' property are designed to remain compatible with "azimuth" angles. More
       precisely, the mapping between the current left-right audio axis (lateral sound stage) and the
@@ -542,8 +544,8 @@
       customizations, and the 'voice-balance' property merely specifies the desired end-result. </p>
     <p class="note"> Note that many speech synthesizers only generate mono sound, and therefore do
       not intrinsically support the 'voice-balance' property. The sound distribution along the
-      left-right axis consequently occurs at post-synthesis stage (when the speech-enabled
-      user-agent mixes the various audio sources authored within the document) </p>
+      left-right axis consequently occurs at post-synthesis stage (when the speech-enabled user
+      agent mixes the various audio sources authored within the document) </p>
     <h2 id="speaking-props">Speaking properties</h2>
     <h3 id="speaking-props-speak">The 'speak' property</h3>
     <table class="propdef" summary="name: syntax">
@@ -723,7 +725,7 @@
         <p>Speak numbers one digit at a time, for instance, "twelve" would be spoken as "one two",
           and "31" as "three one".</p>
         <p class="note">Speech synthesizers are knowledgeable about what a <em>number</em> is. The
-          'speak-as' property enables some level of control on how user-agents render numbers, and
+          'speak-as' property enables some level of control on how user agents render numbers, and
           may be implemented as a preprocessing step before passing the text to the actual speech
           synthesizer.</p>
       </dd>
@@ -851,11 +853,11 @@
     <p>The 'pause-before' and 'pause-after' properties specify a prosodic boundary (silence with a
       specific duration) that occurs before (or after) the speech synthesis rendition of the
       selected element, or if any 'cue-before' (or 'cue-after') is specified, before (or after) the
-      cue within the <a href="#aural-model">aural "box" model</a>.</p>
+      cue within the <a href="#aural-model">aural box model</a>.</p>
     <p class="note"> Note that although the functionality provided by this property is similar to
       the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_break"><code>break</code>
         element</a> from the SSML markup language [[!SSML]], the application of 'pause' prosodic
-      boundaries within the <a href="#aural-model">aural "box" model</a> of CSS Speech requires
+      boundaries within the <a href="#aural-model">aural box model</a> of CSS Speech requires
       special considerations (e.g. <a href="#collapsed-pauses">"collapsed" pauses</a>). </p>
     <dl>
       <dt>
@@ -886,7 +888,7 @@
       between words within a sentence. </p>
     <div class="example">
       <p> This example illustrates how the default strengths of prosodic breaks for specific
-        elements (which are defined by the user-agent stylesheet) can be overridden by authored
+        elements (which are defined by the user agent stylesheet) can be overridden by authored
         styles. </p>
       <pre>
 p { pause: none } /* pause-before: none; pause-after: none */</pre>
@@ -1085,11 +1087,11 @@
     </table>
     <p>The 'rest-before' and 'rest-after' properties specify a prosodic boundary (silence with a
       specific duration) that occurs before (or after) the speech synthesis rendition of an element
-      within the <a href="#aural-model">audio "box" model</a>. </p>
+      within the <a href="#aural-model">aural box model</a>. </p>
     <p class="note"> Note that although the functionality provided by this property is similar to
       the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_break"><code>break</code>
         element</a> from the SSML markup language [[!SSML]], the application of 'rest' prosodic
-      boundaries within the <a href="#aural-model">aural "box" model</a> of CSS Speech requires
+      boundaries within the <a href="#aural-model">aural box model</a> of CSS Speech requires
       special considerations (e.g. interspersed audio cues, additive adjacent rests). </p>
     <dl>
       <dt>
@@ -1283,11 +1285,11 @@
     </table>
     <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. pre-recorded /
       pre-generated sound clips) to be played before (or after) the selected element within the <a
-        href="#aural-model">audio "box" model</a>.</p>
+        href="#aural-model">audio box model</a>.</p>
     <p class="note"> Note that although the functionality provided by this property may appear
       related to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_audio"
           ><code>audio</code> element</a> from the SSML markup language [[!SSML]], there are in fact
-      major discrepancies. For example, the <a href="#aural-model">aural "box" model</a> means that
+      major discrepancies. For example, the <a href="#aural-model">aural box model</a> means that
       audio cues are associated to the selected element's volume level, and CSS Speech's auditory
       icons provide limited functionality compared to SSML's <code>audio</code> element. </p>
     <dl>
@@ -1311,43 +1313,18 @@
       <dd>
         <p>A <a href="#number-def">number</a> immediately followed by "dB" (decibel unit). This
           represents a change (positive or negative) relative to the computed value of the
-          'voice-volume' property within the <a href="#aural-model">aural "box" model</a> of the
-          selected element. Decibels express the ratio of the squares of the new signal amplitude
-          (a1) and the current amplitude (a0), as per the following logarithmic equation: volume(dB)
-          = 20 log10 (a1 / a0) </p>
+          'voice-volume' property within the <a href="#aural-model">aural box model</a> of the
+          selected element (as a result, the volume level of audio cues changes when the
+          'voice-volume' property changes). When omitted, the implied value computes to 0dB. </p>
         <p> When the 'voice-volume' property is set to 'silent', the audio cue is also set to
           'silent' (regardless of this specified &lt;decibel&gt; value). Otherwise (when not
           'silent'), 'voice-volume' values are always specified relatively to the volume level
-          keywords, which map to a user-configured scale of "preferred" loudness settings (see the
-          definition of 'voice-volume'). If the inherited 'voice-volume' value already contains a
-          decibel offset, the dB offset specific to the audio cue is combined additively. </p><p>
-          The desired effect of an audio cue set at +0dB is that the volume level during playback of
-          the pre-recorded / pre-generated audio signal is effectively the same as the loudness of
-          live (i.e. real-time) speech synthesis rendition. In order to achieve this effect, speech
-          processors are capable of directly controlling the waveform amplitude of generated
-          text-to-speech audio, user agents must be able to adjust the volume output of audio cues
-          (i.e. amplify or attenuate audio signals based on the intrinsic waveform amplitude of
-          digitized sound clips), and last but not least, authors must ensure that the "normal"
-          volume level of pre-recorded audio cues (on average, as there may be discrete loudness
-          variations due to changes in the audio stream, such as intonation, stress, etc.) matches
-          that of a "typical" TTS voice output (based on the 'voice-family' intended for use), given
-          standard listening conditions (i.e. default system volume levels, centered equalization
-          across the frequency spectrum). This latter prerequisite sets a baseline that enables a
-          user agent to align the volume outputs of both TTS and cue audio streams within the same
-          aural "box" model. Due to the complex relationship between perceived audio characteristics
-          and the processing applied to the digitized audio signal, we will simplify the definition
-          of "normal" volume levels by referring to a canonical recording scenario, whereby the
-          attenuation is typically indicated in decibels, ranging from 0dB (maximum audio input,
-          near clipping threshold) to -60dB (total silence). In this common context, a "standard"
-          audio clip would oscillate between these values, the loudest peak levels would be close to
-          -3dB (to avoid distortion), and the relevant audible passages would have average (RMS)
-          volume levels as high as possible (i.e. not too quiet, to avoid background noise during
-          amplification). This would roughly provide an audio experience that could be seamlessly
-          combined with text-to-speech output (i.e. there would be no discernible difference in
-          volume levels when switching from pre-recorded audio to speech synthesis). Although there
-          exists no industry-wide standard to support such convention, TTS engines usually generate
-          comparably-loud audio signals when no gain or attenuation is specified. For voice and soft
-          music, -15dB RMS seems to be pretty standard. </p>
+          keywords (see the definition of 'voice-volume'), which map to a user-configured scale of
+          "preferred" loudness settings. If the inherited 'voice-volume' value already contains a
+          decibel offset, the dB offset specific to the audio cue is combined additively. </p>
+        <p> Decibels express the ratio of the squares of the new signal amplitude (a1) and the
+          current amplitude (a0), as per the following logarithmic equation: volume(dB) = 20 log10
+          (a1 / a0) </p>
         <p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal,
           and +6.0dB is approximately twice the amplitude.</p>
         <p class="note"> Note that there is a difference between an audio cue whose volume is set to
@@ -1374,6 +1351,46 @@
 
 div.caution { cue-before: url(./audio/caution.wav) +8dB; }</pre>
     </div>
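[Editorial note, informative: the decibel relation given above, volume(dB) = 20 log10 (a1 / a0), and the additive combination of dB offsets can be sanity-checked with a short script. This is an illustrative sketch only; the function names are ours, not part of the specification.]

```python
import math

def db_to_amplitude_ratio(db):
    """Invert volume(dB) = 20 * log10(a1 / a0): return the ratio a1 / a0."""
    return 10 ** (db / 20.0)

def amplitude_ratio_to_db(ratio):
    """The spec's equation: volume(dB) = 20 * log10(a1 / a0)."""
    return 20.0 * math.log10(ratio)

# As the note says: -6.0dB is approximately half the amplitude,
# and +6.0dB approximately twice the amplitude.
print(round(db_to_amplitude_ratio(-6.0), 3))  # ~0.501
print(round(db_to_amplitude_ratio(+6.0), 3))  # ~1.995

# Offsets combine additively in dB, i.e. multiplicatively in amplitude
# (as with an inherited 'voice-volume' dB offset plus a cue's own offset):
print(round(db_to_amplitude_ratio(-6.0 + -6.0), 3))  # ~0.251
```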
+
+    <h3 id="cue-props-volume">Relation between audio cues and speech synthesis volume levels</h3>
+
+    <p class="note">Note that this section is informative.</p>
+
+    <p> The volume levels of audio cues and of speech synthesis within the <a href="#aural-model"
+        >aural box model</a> of a selected element are related. For example, the desired effect of
+      an audio cue whose volume level is set at +0dB (as specified by the &lt;decibel&gt; value) is
+      that its perceived loudness during playback is close to that of the speech synthesis rendition
+      of the selected element, as dictated by the computed value of the 'voice-volume' property
+      (which is itself based on a user-configured volume level keyword). Similarly, a 'silent'
+      value for the 'voice-volume' property results in any audio cues being "silenced" as well.</p>
+
+    <p> In order to achieve this effect, authors should ensure that the volume level of audio cues
+      (on average, as there may be discrete loudness variations due to changes in the audio stream,
+      such as intonation, stress, etc.) matches that of a "typical" TTS voice output (based on the
+      'voice-family' intended for use), given "standard" listening conditions (i.e. default system
+      volume levels, centered equalization across the frequency spectrum). As speech processors are
+      capable of directly controlling the waveform amplitude of generated text-to-speech audio, and
+      because user agents are able to adjust the volume output of audio cues (i.e. amplify or
+      attenuate audio signals based on the intrinsic waveform amplitude of digitized sound clips),
+      this sets a baseline that enables implementations to "align" the loudness of both TTS and cue
+      audio streams within the aural box model, relative to user-configured volume levels (see the
+      keywords defined in the 'voice-volume' property). </p>
+
+    <p> Due to the complex relationship between perceived audio characteristics (e.g. loudness) and
+      the processing applied to the digitized audio signal (e.g. "compression"), we refer to a
+      simple scenario whereby the attenuation is indicated in decibels, typically ranging from 0dB
+      (maximum audio input, near clipping threshold) to -60dB (total silence). Given this context, a
+      "standard" audio clip would oscillate between these values, the loudest peak levels would be
+      close to -3dB (to avoid distortion), and the relevant audible passages would have average
+      (RMS) volume levels as high as possible (i.e. not too quiet, to avoid background noise during
+      amplification). This would roughly provide an audio experience that could be seamlessly
+      combined with text-to-speech output (i.e. there would be no discernible difference in volume
+      levels when switching from pre-recorded audio to speech synthesis). Although no industry-wide
+      standard supports such a convention, different TTS engines tend to generate comparably loud
+      audio signals when no gain or attenuation is specified. For voice and soft music, -15dB RMS
+      is a common reference level. </p>
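[Editorial note, informative: the average (RMS) levels discussed above can be sketched numerically. This example (ours, not part of the specification) measures the RMS level of a normalized signal in dB relative to full scale, then attenuates it to the -15dB RMS reference mentioned for voice and soft music.]

```python
import math

def rms_dbfs(samples):
    """Average (RMS) level of normalized samples in [-1.0, 1.0], in dB
    relative to full scale (0dB = maximum input, near clipping threshold)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms)

# A full-scale sine wave has RMS amplitude 1/sqrt(2), i.e. about -3dB:
sine = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]
print(round(rms_dbfs(sine), 1))  # ~ -3.0

# Attenuate the same wave so its average level matches a -15dB RMS target:
gain = 10 ** ((-15.0 - rms_dbfs(sine)) / 20.0)
quiet = [s * gain for s in sine]
print(round(rms_dbfs(quiet), 1))  # -15.0
```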
+
     <h3 id="cue-props-cue">The 'cue' shorthand property</h3>
     <table class="propdef" summary="name: syntax">
       <tbody>
@@ -1616,7 +1633,7 @@
         voice instance (amongst those suitable for the language of the selected content) must be
         used. </li>
       <li> If no voice is available for the language of the selected content, it is recommended that
-        user-agents let the user know about the lack of appropriate TTS voice. </li>
+        user agents let the user know about the lack of appropriate TTS voice. </li>
     </ol>
     <p>The speech synthesizer voice must be re-evaluated (i.e. the selection process must take place
       once again) whenever any of the CSS voice characteristics change within the content flow. The
@@ -1624,9 +1641,9 @@
       keyword is used (this may be useful in cases where embedded foreign language text can be
       spoken using a voice not designed for this language, as demonstrated by the example
       below).</p>
-    <p class="note">Note that dynamically computing a voice may lead to unexpected lag, so
-      user-agents should try to resolve concrete voice instances in the document tree before the
-      playback starts. </p>
+    <p class="note">Note that dynamically computing a voice may lead to unexpected lag, so user
+      agents should try to resolve concrete voice instances in the document tree before the playback
+      starts. </p>
     <div class="example">
       <p>Examples of property values:</p>
       <pre>
@@ -2338,7 +2355,7 @@
         <strong>disc, circle, square</strong>
       </dt>
       <dd>
-        <p> For these list item styles, the user-agent defines (possibly based on user preferences)
+        <p> For these list item styles, the user agent defines (possibly based on user preferences)
           what equivalent phrase is spoken or what audio cue is played. List items with graphical
           bullets are therefore announced appropriately in an implementation-dependent manner. </p>
       </dd>
@@ -2364,14 +2381,14 @@
           notation) </p>
       </dd>
     </dl>
-    <p class="note">Note that it is common for user-agents such as screen readers to announce the
+    <p class="note">Note that it is common for user agents such as screen readers to announce the
       nesting depth of list items, or more generally, to indicate additional structural information
       pertaining to complex hierarchical content. The verbosity of these additional audio cues
       and/or speech output can usually be controlled by users, and contribute to increasing
-      usability. These navigation aids are implementation-dependent, but it is recommended that
-      user-agents supporting the CSS Speech module ensure that these additional audio cues and
-      speech output don't generate redundancies or create inconsistencies (for example: duplicated
-      or different list item numbering scheme). </p>
+      usability. These navigation aids are implementation-dependent, but it is recommended that user
+      agents supporting the CSS Speech module ensure that these additional audio cues and speech
+      output don't generate redundancies or create inconsistencies (for example: duplicated or
+      different list item numbering scheme). </p>
     <h2 id="content">Inserted and replaced content</h2>
     <p class="note">Note that this entire section is informative.</p>
     <p>Sometimes, authors will want to specify a mapping from the source text into another string
Received on Monday, 20 February 2012 23:48:15 UTC
