csswg/css3-speech Overview.html,1.101,1.102 Overview.src.html,1.103,1.104

From: Daniel Weck via cvs-syncmail <cvsmail@w3.org>
Date: Tue, 21 Feb 2012 23:11:36 +0000
To: public-css-commits@w3.org
Message-Id: <E1Rzyrw-00034L-94@lionel-hutz.w3.org>
Update of /sources/public/csswg/css3-speech
In directory hutz:/tmp/cvs-serv11638

Modified Files:
	Overview.html Overview.src.html 
Log Message:
last update to address the LCWD disposition of comments.


Index: Overview.html
===================================================================
RCS file: /sources/public/csswg/css3-speech/Overview.html,v
retrieving revision 1.101
retrieving revision 1.102
diff -u -d -r1.101 -r1.102
--- Overview.html	20 Feb 2012 23:48:09 -0000	1.101
+++ Overview.html	21 Feb 2012 23:11:33 -0000	1.102
@@ -88,14 +88,14 @@
 
    <h1 id=top>CSS Speech Module</h1>
 
-   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 20 February
+   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 21 February
     2012</h2>
 
    <dl id=versions>
     <dt>This version:
 
     <dd>
-     <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120220/">http://www.w3.org/TR/2012/ED-css3-speech-20120220/</a>-->
+     <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120221/">http://www.w3.org/TR/2012/ED-css3-speech-20120221/</a>-->
      <a
      href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>
      
@@ -535,8 +535,11 @@
   <p> The following diagram illustrates the equivalence between properties of
    the visual and aural box models, applied to the selected &lt;element&gt;:
 
-  <p> <img alt="A graph depicting the aural 'box' model." id=aural-box
-   src=aural-box.png>
+  <p> <img
+   alt="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
+   id=aural-box src=aural-box.png
+   title="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin.">
+   
 
   <h2 id=mixing-props><span class=secno>7. </span>Mixing properties</h2>
 
@@ -585,15 +588,16 @@
     <tr>
      <td> <em>Computed value:</em>
 
-     <td>a keyword value, and optionally also a decibel offset (if not zero)
+     <td>&lsquo;<code class=property>silent</code>&rsquo;, or a keyword value
+      and optionally also a decibel offset (if not zero)
   </table>
 
   <p>The &lsquo;<a href="#voice-volume"><code
    class=property>voice-volume</code></a>&rsquo; property allows authors to
    control the amplitude of the audio waveform generated by the speech
    synthesiser, and is also used to adjust the relative volume level of <a
-   href="#cue-props">audio cues</a> within the <a href="#aural-model">audio
-   box model</a>.
+   href="#cue-props">audio cues</a> within the <a href="#aural-model">aural
+   box model</a> of the selected element.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -615,24 +619,27 @@
    <dt> <strong>silent</strong>
 
    <dd>
-    <p> Specifies that no sound is generated (the text is read "silently").
-     Corresponds to negative infinity in dB units.</p>
+    <p> Specifies that no sound is generated (the text is read "silently").</p>
 
-    <p class=note> Note that there is a difference between an element whose
-     &lsquo;<a href="#voice-volume"><code
+    <p class=note> Note that this has the same effect as using negative
+     infinity decibels. Also note that there is a difference between an
+     element whose &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; property has a value of
      &lsquo;<code class=property>silent</code>&rsquo;, and an element whose
      &lsquo;<a href="#speak"><code class=property>speak</code></a>&rsquo;
      property has the value &lsquo;<code class=property>none</code>&rsquo;.
      With the former, the selected element takes up the same time as if it
      was spoken, including any pause before and after the element, but no
-     sound is generated (descendants can override the &lsquo;<a
+     sound is generated (descendants within the <a href="#aural-model">aural
+     box model</a> of the selected element can override the &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
-     value and may therefore generate audio output). With the latter, the
+     value, and may therefore generate audio output). With the latter, the
      selected element is not rendered in the aural dimension and no time is
-     allocated for playback (descendants can override the &lsquo;<a
-     href="#speak"><code class=property>speak</code></a>&rsquo; value and may
-     therefore generate audio output).</p>
+     allocated for playback (descendants within the <a
+     href="#aural-model">aural box model</a> of the selected element can
+     override the &lsquo;<a href="#speak"><code
+     class=property>speak</code></a>&rsquo; value, and may therefore generate
+     audio output).</p>
 
    <dt><strong>x-soft</strong>, <strong>soft</strong>,
     <strong>medium</strong>, <strong>loud</strong>, <strong>x-loud</strong>
@@ -640,11 +647,12 @@
    <dd>
     <p>This sequence of keywords corresponds to monotonically non-decreasing
      volume levels, mapped to implementation-dependent values that meet the
-     listener's requirements with regards to perceived sound loudness. These
-     audio levels are typically provided via a preference mechanism that
-     allow users to set options according to their auditory environment. The
-     keyword &lsquo;<code class=property>x-soft</code>&rsquo; maps to the
-     user's <em>minimum audible</em> volume level, &lsquo;<code
+     listener's requirements with regards to perceived loudness. These audio
+     levels are typically provided via a preference mechanism that allow
+     users to calibrate sound options according to their auditory
+     environment. The keyword &lsquo;<code
+     class=property>x-soft</code>&rsquo; maps to the user's <em>minimum
+     audible</em> volume level, &lsquo;<code
      class=property>x-loud</code>&rsquo; maps to the user's <em>maximum
      tolerable</em> volume level, &lsquo;<code
      class=property>medium</code>&rsquo; maps to the user's
@@ -674,12 +682,12 @@
      the audio signal, and +6.0dB is approximately twice the amplitude.</p>
   </dl>
 
-  <p class=note>Note that the actual perceived volume levels depend on
-   various factors, such as the listening environment and personal user
-   preferences. The effective volume variation between &lsquo;<code
+  <p class=note>Note that perceived loudness depends on various factors, such
+   as the listening environment, user preferences or physical abilities. The
+   effective volume variation between &lsquo;<code
    class=property>x-soft</code>&rsquo; and &lsquo;<code
    class=property>x-loud</code>&rsquo; represents the dynamic range (in terms
-   of loudness) of the speech output. Typically, this range would be
+   of loudness) of the audio output. Typically, this range would be
    compressed in a noisy context, i.e. the perceived loudness corresponding
    to &lsquo;<code class=property>x-soft</code>&rsquo; would effectively be
    closer to &lsquo;<code class=property>x-loud</code>&rsquo; than it would
@@ -1485,7 +1493,7 @@
    href="#rest-after"><code class=property>rest-after</code></a>&rsquo;
    properties specify a prosodic boundary (silence with a specific duration)
    that occurs before (or after) the speech synthesis rendition of an element
-   within the <a href="#aural-model">audio box model</a>.
+   within the <a href="#aural-model">aural box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -1690,7 +1698,7 @@
    href="#cue-after"><code class=property>cue-after</code></a>&rsquo;
    properties specify auditory icons (i.e. pre-recorded / pre-generated sound
    clips) to be played before (or after) the selected element within the <a
-   href="#aural-model">audio box model</a>.
+   href="#aural-model">aural box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property may appear related to the <a
@@ -1724,21 +1732,21 @@
      to the computed value of the &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; property within the <a
      href="#aural-model">aural box model</a> of the selected element (as a
-     result, the volume level of audio cues changes when the &lsquo;<a
+     result, the volume level of an audio cue changes when the &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
      property changes). When omitted, the implied value computes to 0dB.</p>
 
-    <p> When the &lsquo;<a href="#voice-volume"><code
-     class=property>voice-volume</code></a>&rsquo; property is set to
-     &lsquo;<code class=property>silent</code>&rsquo;, the audio cue is also
-     set to &lsquo;<code class=property>silent</code>&rsquo; (regardless of
-     this specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
+    <p> When the computed value of the &lsquo;<a href="#voice-volume"><code
+     class=property>voice-volume</code></a>&rsquo; property is &lsquo;<code
+     class=property>silent</code>&rsquo;, the audio cue is also set to
+     &lsquo;<code class=property>silent</code>&rsquo; (regardless of this
+     specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
      class=property>silent</code>&rsquo;), &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
      values are always specified relatively to the volume level keywords (see
      the definition of &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo;), which map to a
-     user-configured scale of "preferred" loudness settings. If the inherited
+     user-calibrated scale of "preferred" loudness settings. If the inherited
      &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; value already contains a
      decibel offset, the dB offset specific to the audio cue is combined
@@ -1789,48 +1797,54 @@
    For example, the desired effect of an audio cue whose volume level is set
    at +0dB (as specified by the &lt;decibel&gt; value) is that its perceived
    loudness during playback is close to that of the speech synthesis
-   rendition of the selected element, as dictated by computed value of the
+   rendition of the selected element, as dictated by the computed value of
+   the &lsquo;<a href="#voice-volume"><code
+   class=property>voice-volume</code></a>&rsquo; property. Note that a
+   &lsquo;<code class=property>silent</code>&rsquo; computed value for the
    &lsquo;<a href="#voice-volume"><code
-   class=property>voice-volume</code></a>&rsquo; property (which is itself
-   based on a user-configured volume level keyword). Similarly, a
-   &lsquo;<code class=property>silent</code>&rsquo; value for the &lsquo;<a
-   href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
-   property results on any audio cues being "silenced" as well.
+   class=property>voice-volume</code></a>&rsquo; property results in audio
+   cues being "forcefully" silenced as well (i.e. regardless of the specified
+   audio cue &lsquo;<code class=property>decibel</code>&rsquo; value)
 
-  <p> In order to achieve this effect, authors should ensure that the volume
-   level of audio cues (on average, as there may be discrete loudness
-   variations due to changes in the audio stream, such as intonation, stress,
-   etc.) matches that of a "typical" TTS voice output (based on the &lsquo;<a
+  <p> The volume keywords of the &lsquo;<a href="#voice-volume"><code
+   class=property>voice-volume</code></a>&rsquo; property are user-calibrated
+   to match requirements not known at authoring time (e.g. auditory
+   environment, personal preferences). Therefore, in order to achieve this
+   approximate loudness alignment of audio cues and speech synthesis, authors
+   should ensure that the volume level of audio cues (on average, as there
+   may be discrete variations of perceived loudness due to changes in the
+   audio stream, such as intonation, stress, etc.) matches the output of a
+   speech synthesis rendition based on the &lsquo;<a
    href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
-   intended for use), given "standard" listening conditions (i.e. default
+   intended for use, given "typical" listening conditions (i.e. default
    system volume levels, centered equalization across the frequency
    spectrum). As speech processors are capable of directly controlling the
    waveform amplitude of generated text-to-speech audio, and because user
    agents are able to adjust the volume output of audio cues (i.e. amplify or
    attenuate audio signals based on the intrinsic waveform amplitude of
    digitized sound clips), this sets a baseline that enables implementations
-   to "align" the loudness of both TTS and cue audio streams within the aural
-   box model, relative to user-configured volume levels (see the keywords
+   to manage the loudness of both TTS and cue audio streams within the aural
+   box model, relative to user-calibrated volume levels (see the keywords
    defined in the &lsquo;<a href="#voice-volume"><code
    class=property>voice-volume</code></a>&rsquo; property).
 
   <p> Due to the complex relationship between perceived audio characteristics
    (e.g. loudness) and the processing applied to the digitized audio signal
-   (e.g. "compression"), we refer to a simple scenario whereby the
-   attenuation is indicated in decibels, typically ranging from 0dB (maximum
-   audio input, near clipping threshold) to -60dB (total silence). Given this
-   context, a "standard" audio clip would oscillate between these values, the
-   loudest peak levels would be close to -3dB (to avoid distortion), and the
-   relevant audible passages would have average (RMS) volume levels as high
-   as possible (i.e. not too quiet, to avoid background noise during
-   amplification). This would roughly provide an audio experience that could
-   be seamlessly combined with text-to-speech output (i.e. there would be no
-   discernible difference in volume levels when switching from pre-recorded
-   audio to speech synthesis). Although there exists no industry-wide
-   standard to support such convention, different TTS engines tend to
-   generate comparably-loud audio signals when no gain or attenuation is
-   specified. For voice and soft music, -15dB RMS seems to be pretty
-   standard.
+   (e.g. signal compression), we refer to a simple scenario whereby the
+   attenuation is indicated in decibels, typically ranging from 0dB (i.e.
+   maximum audio input, near clipping threshold) to -60dB (i.e. total
+   silence). Given this context, a "standard" audio clip would oscillate
+   between these values, the loudest peak levels would be close to -3dB (to
+   avoid distortion), and the relevant audible passages would have average
+   (RMS) volume levels as high as possible (i.e. not too quiet, to avoid
+   background noise during amplification). This would roughly provide an
+   audio experience that could be seamlessly combined with text-to-speech
+   output (i.e. there would be no discernible difference in volume levels
+   when switching from pre-recorded audio to speech synthesis). Although
+   there exists no industry-wide standard to support such convention,
+   different TTS engines tend to generate comparably-loud audio signals when
+   no gain or attenuation is specified. For voice and soft music, -15dB RMS
+   seems to be pretty standard.
 
   <h3 id=cue-props-cue><span class=secno>11.3. </span>The &lsquo;<a
    href="#cue"><code class=property>cue</code></a>&rsquo; shorthand property</h3>
@@ -2257,7 +2271,10 @@
      or otherwise to the inherited speaking rate (which may itself be a
      combination of a keyword value and of a percentage, in which case
      percentages are combined multiplicatively). For example, 50% means that
-     the speaking rate gets multiplied by 0.5 (half the value).</p>
+     the speaking rate gets multiplied by 0.5 (half the value). Percentages
+     above 100% result in faster speaking rates (relative to the base
+     keyword), whereas percentages below 100% result in slower speaking
+     rates.</p>
   </dl>
 
   <div class=example>
@@ -2289,7 +2306,7 @@
 e2 { voice-rate: fast 120%; } /* the computed value is
                           ['fast' and 120%], which will resolve
                           to the rate corresponding to 'fast'
-                          multiplied by 1.2 (one and a half times the speaking rate) */
+                          multiplied by 1.2 */
                           
 e3 { voice-rate: normal; /* "resets" the speaking rate to the intrinsic voice value,
                             the computed value is 'normal' (see comment below for actual value) */
@@ -2772,25 +2789,25 @@
    <p>Examples of property values, with HTML sample:</p>
 
    <pre>
-span.default-emphasis { voice-stress: normal; }
-span.lowered-emphasis { voice-stress: reduced; }
-span.removed-emphasis { voice-stress: none; }
-span.normal-emphasis { voice-stress: moderate; }
-span.huge-emphasis { voice-stress: strong; }
+.default-emphasis { voice-stress: normal; }
+.lowered-emphasis { voice-stress: reduced; }
+.removed-emphasis { voice-stress: none; }
+.normal-emphasis { voice-stress: moderate; }
+.huge-emphasis { voice-stress: strong; }
                 
 ...
 
 &lt;p&gt;This is a big car.&lt;/p&gt;
 &lt;!-- The speech output from the line above is identical to the line below: --&gt;
-&lt;p&gt;This is a &lt;span class="default-emphasis"&gt;big&lt;/span&gt; car.&lt;/p&gt;
+&lt;p&gt;This is a &lt;em class="default-emphasis"&gt;big&lt;/em&gt; car.&lt;/p&gt;
 
-&lt;p&gt;This car is &lt;span class="lowered-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
-&lt;!-- The "span" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
-&lt;p&gt;This car is &lt;span class="removed-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
+&lt;p&gt;This car is &lt;em class="lowered-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
+&lt;!-- The "em" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
+&lt;p&gt;This car is &lt;em class="removed-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
 
 &lt;!-- The lines below demonstrate increasing levels of emphasis: --&gt;
-&lt;p&gt;This is a &lt;span class="normal-emphasis"&gt;big&lt;/span&gt; car!&lt;/p&gt;
-&lt;p&gt;This is a &lt;span class="huge-emphasis"&gt;big&lt;/span&gt; car!!!&lt;/p&gt;</pre>
+&lt;p&gt;This is a &lt;em class="normal-emphasis"&gt;big&lt;/em&gt; car!&lt;/p&gt;
+&lt;p&gt;This is a &lt;em class="huge-emphasis"&gt;big&lt;/em&gt; car!!!&lt;/p&gt;</pre>
   </div>
 
   <h2 id=duration-props><span class=secno>13. </span>Voice duration property</h2>
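
The revised wording in the Overview.html diff above rests on three small pieces of arithmetic: a 'silent' volume behaves like negative infinity decibels, an audio cue's dB offset is combined additively with an inherited 'voice-volume' offset, and 'voice-rate' percentages are combined multiplicatively. The Python sketch below is only an illustration of those rules. It assumes the usual amplitude convention dB = 20 * log10(a1 / a0), which agrees with the spec's remark that +6.0dB is approximately twice the amplitude. The function names and the numeric rate passed to resolve_rate are hypothetical; the spec leaves the keyword-to-value mapping implementation-dependent.

```python
def amplitude_ratio(decibels: float) -> float:
    """Amplitude ratio a1/a0 for a dB offset, assuming dB = 20 * log10(a1 / a0)."""
    return 10 ** (decibels / 20)


def combine_cue_offset(inherited_db: float, cue_db: float) -> float:
    """Add a cue's dB offset to an inherited voice-volume dB offset."""
    return inherited_db + cue_db


def resolve_rate(keyword_rate: float, *percentages: float) -> float:
    """Apply percentages multiplicatively to the rate mapped from a keyword.

    keyword_rate is a hypothetical numeric value standing in for whatever
    rate an implementation assigns to a keyword such as 'fast'.
    """
    rate = keyword_rate
    for pct in percentages:
        rate *= pct / 100
    return rate
```

For instance, amplitude_ratio(float("-inf")) evaluates to zero amplitude, matching the 'silent' note, and resolve_rate(base, 120) scales the base rate by 1.2, as in the corrected 'e2' comment of the voice-rate example.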

Index: Overview.src.html
===================================================================
RCS file: /sources/public/csswg/css3-speech/Overview.src.html,v
retrieving revision 1.103
retrieving revision 1.104
diff -u -d -r1.103 -r1.104
--- Overview.src.html	20 Feb 2012 23:48:09 -0000	1.103
+++ Overview.src.html	21 Feb 2012 23:11:34 -0000	1.104
@@ -147,10 +147,8 @@
         href="#editors-list">the editors</a>.</p -->
     <h2 class="no-num no-toc" id="contents">Table of contents</h2>
     <!--toc-->
-
     <h2 id="intro">Introduction, design goals</h2>
     <p class="note">Note that this section is informative.</p>
-
     <p>The aural presentation of information is commonly used by people who are blind,
       visually-impaired or otherwise print-disabled. For instance, "screen readers" allow users to
       interact with visual interfaces that would otherwise be inaccessible to them. There are also
@@ -159,7 +157,6 @@
       information. For instance: playing an e-book whilst driving a vehicle, learning how to
       manipulate industrial and medical devices, interacting with home entertainment systems,
       teaching young children how to read.</p>
-
     <p> The CSS properties defined in the Speech module enable authors to declaratively control the
       presentation of a document in the aural dimension. The aural rendering of a document combines
       speech synthesis (also known as "TTS", the acronym for "Text to Speech") and auditory icons
@@ -167,37 +164,29 @@
       provide the ability to control speech pitch and rate, sound levels, TTS voices, etc. These
       stylesheet properties can be used together with visual properties (mixed media), or as a
       complete aural alternative to a visual presentation. </p>
-
     <h2 id="background">Background information, CSS 2.1</h2>
     <p class="note">Note that this section is informative.</p>
-
     <p> The CSS Speech module is a re-work of the informative CSS2.1 Aural appendix, within which
       the "aural" media type was described, but also deprecated (in favor of the "speech" media
       type). Although the [[!CSS21]] specification reserves the "speech" media type, it doesn't
       actually define the corresponding properties. The Speech module describes the CSS properties
       that apply to the "speech" media type, and defines a new "box" model specifically for the
       aural dimension. </p>
-
     <p> Content creators can conditionally include CSS properties dedicated to user agents with text
       to speech synthesis capabilities, by specifying the "speech" media type via the
         <code>media</code> attribute of the <code>link</code> element, or with the
         <code>@media</code> at-rule, or within an <code>@import</code> statement. When styles are
       authored within the scope of such conditional statements, they are ignored by user agents that
       do not support the Speech module. </p>
-
     <h2 id="ssml-rel">Relationship with SSML</h2>
     <p class="note">Note that this section is informative.</p>
-
     <p>Some of the features in this specification are conceptually similar to functionality
       described in the Speech Synthesis Markup Language (SSML) Version 1.1 [[!SSML]]. However, the
       specificities of the CSS model mean that compatibility with SSML in terms of syntax and/or
       semantics is only partially achievable. The definition of each property in the Speech module
       includes informative statements, wherever necessary, to clarify their relationship with
       similar functionality from SSML.</p>
-
-
     <h2 id="css-values">CSS values</h2>
-
     <p>This specification follows the <a href="http://www.w3.org/TR/CSS21/about.html#property-defs"
         >CSS property definition conventions</a> from [[!CSS21]]. Value types not defined in this
       specification are defined in CSS Value and Units Level 3 [[!CSS3VAL]]. </p>
@@ -205,8 +194,6 @@
       defined in this specification also accept the <a
         href="http://www.w3.org/TR/CSS21/cascade.html#value-def-inherit">inherit</a> keyword as
       their property value. For readability it has not been repeated explicitly. </p>
-
-
     <h2 id="example">Example</h2>
     <div class="example">
       <p>This example shows how authors can tell the speech synthesizer to speak HTML headings with
@@ -267,7 +254,10 @@
     <p> The following diagram illustrates the equivalence between properties of the visual and aural
       box models, applied to the selected &lt;element&gt;:</p>
     <p>
-      <img alt="A graph depicting the aural 'box' model." id="aural-box" src="aural-box.png" />
+      <img
+        title="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
+        alt="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
+        id="aural-box" src="aural-box.png" />
     </p>
     <h2 id="mixing-props">Mixing properties</h2>
     <h3 id="mixing-props-voice-volume">The 'voice-volume' property</h3>
@@ -319,13 +309,14 @@
           <td>
             <em>Computed value:</em>
           </td>
-          <td>a keyword value, and optionally also a decibel offset (if not zero)</td>
+          <td>'silent', or a keyword value and optionally also a decibel offset (if not zero)</td>
         </tr>
       </tbody>
     </table>
     <p>The 'voice-volume' property allows authors to control the amplitude of the audio waveform
       generated by the speech synthesiser, and is also used to adjust the relative volume level of
-        <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio box model</a>. </p>
+        <a href="#cue-props">audio cues</a> within the <a href="#aural-model">aural box model</a> of
+      the selected element. </p>
     <p class="note"> Note that although the functionality provided by this property is similar to
       the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code>
         attribute of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]],
@@ -344,24 +335,26 @@
         <strong>silent</strong>
       </dt>
       <dd>
-        <p> Specifies that no sound is generated (the text is read "silently"). Corresponds to
-          negative infinity in dB units.</p>
-        <p class="note"> Note that there is a difference between an element whose 'voice-volume'
-          property has a value of 'silent', and an element whose 'speak' property has the value
-          'none'. With the former, the selected element takes up the same time as if it was spoken,
-          including any pause before and after the element, but no sound is generated (descendants
-          can override the 'voice-volume' value and may therefore generate audio output). With the
-          latter, the selected element is not rendered in the aural dimension and no time is
-          allocated for playback (descendants can override the 'speak' value and may therefore
-          generate audio output). </p>
+        <p> Specifies that no sound is generated (the text is read "silently").</p>
+        <p class="note"> Note that this has the same effect as using negative infinity decibels.
+          Also note that there is a difference between an element whose 'voice-volume' property has
+          a value of 'silent', and an element whose 'speak' property has the value 'none'. With the
+          former, the selected element takes up the same time as if it was spoken, including any
+          pause before and after the element, but no sound is generated (descendants within the <a
+            href="#aural-model">aural box model</a> of the selected element can override the
+          'voice-volume' value, and may therefore generate audio output). With the latter, the
+          selected element is not rendered in the aural dimension and no time is allocated for
+          playback (descendants within the <a href="#aural-model">aural box model</a> of the
+          selected element can override the 'speak' value, and may therefore generate audio output).
+        </p>
       </dd>
       <dt><strong>x-soft</strong>, <strong>soft</strong>, <strong>medium</strong>,
           <strong>loud</strong>, <strong>x-loud</strong></dt>
       <dd>
         <p>This sequence of keywords corresponds to monotonically non-decreasing volume levels,
           mapped to implementation-dependent values that meet the listener's requirements with
-          regards to perceived sound loudness. These audio levels are typically provided via a
-          preference mechanism that allow users to set options according to their auditory
+          regards to perceived loudness. These audio levels are typically provided via a preference
+          mechanism that allow users to calibrate sound options according to their auditory
           environment. The keyword 'x-soft' maps to the user's <em>minimum audible</em> volume
           level, 'x-loud' maps to the user's <em>maximum tolerable</em> volume level, 'medium' maps
           to the user's <em>preferred</em> volume level, 'soft' and 'loud' map to intermediary
@@ -384,14 +377,14 @@
           and +6.0dB is approximately twice the amplitude.</p>
       </dd>
     </dl>
-    <p class="note">Note that the actual perceived volume levels depend on various factors, such as
-      the listening environment and personal user preferences. The effective volume variation
-      between 'x-soft' and 'x-loud' represents the dynamic range (in terms of loudness) of the
-      speech output. Typically, this range would be compressed in a noisy context, i.e. the
-      perceived loudness corresponding to 'x-soft' would effectively be closer to 'x-loud' than it
-      would be in a quiet environment. There may also be situations where both 'x-soft' and 'x-loud'
-      would map to low volume levels, such as in listening environments requiring discretion (e.g.
-      library, night-reading). </p>
+    <p class="note">Note that perceived loudness depends on various factors, such as the listening
+      environment, user preferences or physical abilities. The effective volume variation between
+      'x-soft' and 'x-loud' represents the dynamic range (in terms of loudness) of the audio output.
+      Typically, this range would be compressed in a noisy context, i.e. the perceived loudness
+      corresponding to 'x-soft' would effectively be closer to 'x-loud' than it would be in a quiet
+      environment. There may also be situations where both 'x-soft' and 'x-loud' would map to low
+      volume levels, such as in listening environments requiring discretion (e.g. library,
+      night-reading). </p>
     <h3 id="mixing-props-voice-balance">The 'voice-balance' property</h3>
     <table class="propdef" summary="name: syntax">
       <tbody>
@@ -1087,7 +1080,7 @@
     </table>
     <p>The 'rest-before' and 'rest-after' properties specify a prosodic boundary (silence with a
       specific duration) that occurs before (or after) the speech synthesis rendition of an element
-      within the <a href="#aural-model">audio box model</a>. </p>
+      within the <a href="#aural-model">aural box model</a>. </p>
     <p class="note"> Note that although the functionality provided by this property is similar to
       the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_break"><code>break</code>
         element</a> from the SSML markup language [[!SSML]], the application of 'rest' prosodic
@@ -1285,7 +1278,7 @@
     </table>
     <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. pre-recorded /
       pre-generated sound clips) to be played before (or after) the selected element within the <a
-        href="#aural-model">audio box model</a>.</p>
+        href="#aural-model">aural box model</a>.</p>
     <p class="note"> Note that although the functionality provided by this property may appear
       related to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_audio"
           ><code>audio</code> element</a> from the SSML markup language [[!SSML]], there are in fact
@@ -1314,12 +1307,12 @@
         <p>A <a href="#number-def">number</a> immediately followed by "dB" (decibel unit). This
           represents a change (positive or negative) relative to the computed value of the
           'voice-volume' property within the <a href="#aural-model">aural box model</a> of the
-          selected element (as a result, the volume level of audio cues changes when the
+          selected element (as a result, the volume level of an audio cue changes when the
           'voice-volume' property changes). When omitted, the implied value computes to 0dB. </p>
-        <p> When the 'voice-volume' property is set to 'silent', the audio cue is also set to
-          'silent' (regardless of this specified &lt;decibel&gt; value). Otherwise (when not
-          'silent'), 'voice-volume' values are always specified relatively to the volume level
-          keywords (see the definition of 'voice-volume'), which map to a user-configured scale of
+        <p> When the computed value of the 'voice-volume' property is 'silent', the audio cue is
+          also set to 'silent' (regardless of this specified &lt;decibel&gt; value). Otherwise (when
+          not 'silent'), 'voice-volume' values are always specified relatively to the volume level
+          keywords (see the definition of 'voice-volume'), which map to a user-calibrated scale of
           "preferred" loudness settings. If the inherited 'voice-volume' value already contains a
           decibel offset, the dB offset specific to the audio cue is combined additively. </p>
         <p> Decibels express the ratio of the squares of the new signal amplitude (a1) and the
@@ -1351,46 +1344,46 @@
 
 div.caution { cue-before: url(./audio/caution.wav) +8dB; }</pre>
     </div>
-
     <h3 id="cue-props-volume">Relation between audio cues and speech synthesis volume levels</h3>
-
     <p class="note">Note that this section is informative.</p>
 
     <p> The volume levels of audio cues and of speech synthesis within the <a href="#aural-model"
         >aural box model</a> of a selected element are related. For example, the desired effect of
       an audio cue whose volume level is set at +0dB (as specified by the &lt;decibel&gt; value) is
       that its perceived loudness during playback is close to that of the speech synthesis rendition
-      of the selected element, as dictated by computed value of the 'voice-volume' property (which
-      is itself based on a user-configured volume level keyword). Similarly, a 'silent' value for
-      the 'voice-volume' property results on any audio cues being "silenced" as well.</p>
+      of the selected element, as dictated by the computed value of the 'voice-volume' property.
+      Note that a 'silent' computed value for the 'voice-volume' property results in audio cues
+      being "forcefully" silenced as well (i.e. regardless of the specified audio cue 'decibel'
+      value). </p>
 
-    <p> In order to achieve this effect, authors should ensure that the volume level of audio cues
-      (on average, as there may be discrete loudness variations due to changes in the audio stream,
-      such as intonation, stress, etc.) matches that of a "typical" TTS voice output (based on the
-      'voice-family' intended for use), given "standard" listening conditions (i.e. default system
+    <p> The volume keywords of the 'voice-volume' property are user-calibrated to match requirements
+      not known at authoring time (e.g. auditory environment, personal preferences). Therefore, in
+      order to achieve this approximate loudness alignment of audio cues and speech synthesis,
+      authors should ensure that the volume level of audio cues (on average, as there may be
+      discrete variations of perceived loudness due to changes in the audio stream, such as
+      intonation, stress, etc.) matches the output of a speech synthesis rendition based on the
+      'voice-family' intended for use, given "typical" listening conditions (i.e. default system
       volume levels, centered equalization across the frequency spectrum). As speech processors are
       capable of directly controlling the waveform amplitude of generated text-to-speech audio, and
       because user agents are able to adjust the volume output of audio cues (i.e. amplify or
       attenuate audio signals based on the intrinsic waveform amplitude of digitized sound clips),
-      this sets a baseline that enables implementations to "align" the loudness of both TTS and cue
-      audio streams within the aural box model, relative to user-configured volume levels (see the
+      this sets a baseline that enables implementations to manage the loudness of both TTS and cue
+      audio streams within the aural box model, relative to user-calibrated volume levels (see the
       keywords defined in the 'voice-volume' property). </p>
 
     <p> Due to the complex relationship between perceived audio characteristics (e.g. loudness) and
-      the processing applied to the digitized audio signal (e.g. "compression"), we refer to a
+      the processing applied to the digitized audio signal (e.g. signal compression), we refer to a
       simple scenario whereby the attenuation is indicated in decibels, typically ranging from 0dB
-      (maximum audio input, near clipping threshold) to -60dB (total silence). Given this context, a
-      "standard" audio clip would oscillate between these values, the loudest peak levels would be
-      close to -3dB (to avoid distortion), and the relevant audible passages would have average
-      (RMS) volume levels as high as possible (i.e. not too quiet, to avoid background noise during
-      amplification). This would roughly provide an audio experience that could be seamlessly
+      (i.e. maximum audio input, near clipping threshold) to -60dB (i.e. total silence). Given this
+      context, a "standard" audio clip would oscillate between these values, the loudest peak levels
+      would be close to -3dB (to avoid distortion), and the relevant audible passages would have
+      average (RMS) volume levels as high as possible (i.e. not too quiet, to avoid background noise
+      during amplification). This would roughly provide an audio experience that could be seamlessly
       combined with text-to-speech output (i.e. there would be no discernible difference in volume
       levels when switching from pre-recorded audio to speech synthesis). Although there exists no
       industry-wide standard to support such convention, different TTS engines tend to generate
       comparably-loud audio signals when no gain or attenuation is specified. For voice and soft
       music, -15dB RMS seems to be pretty standard. </p>
-
-
     <h3 id="cue-props-cue">The 'cue' shorthand property</h3>
     <table class="propdef" summary="name: syntax">
       <tbody>
@@ -1599,7 +1592,6 @@
           name, gender, age). </p>
       </dd>
     </dl>
-
     <div class="example">
       <p> Examples of invalid declarations: </p>
       <pre>
@@ -1753,7 +1745,9 @@
           default value for the root element, or otherwise to the inherited speaking rate (which may
           itself be a combination of a keyword value and of a percentage, in which case percentages
           are combined multiplicatively). For example, 50% means that the speaking rate gets
-          multiplied by 0.5 (half the value).</p>
+          multiplied by 0.5 (half the value). Percentages above 100% result in faster speaking rates
+          (relative to the base keyword), whereas percentages below 100% result in slower speaking
+          rates.</p>
       </dd>
     </dl>
     <div class="example">
@@ -1784,7 +1778,7 @@
 e2 { voice-rate: fast 120%; } /* the computed value is
                           ['fast' and 120%], which will resolve
                           to the rate corresponding to 'fast'
-                          multiplied by 1.2 (one and a half times the speaking rate) */
+                          multiplied by 1.2 */
                           
 e3 { voice-rate: normal; /* "resets" the speaking rate to the intrinsic voice value,
                             the computed value is 'normal' (see comment below for actual value) */
@@ -2006,7 +2000,6 @@
       example when variations in inflection are used to convey meaning and emphasis in speech.
       Typically, a low range produces a flat, monotonic voice, whereas a high range produces an
       animated voice. </p>
-
     <p class="note"> Note that although the functionality provided by this property is similar to
       the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>range</code>
         attribute of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]],
@@ -2234,25 +2227,25 @@
     <div class="example">
       <p>Examples of property values, with HTML sample:</p>
       <pre>
-span.default-emphasis { voice-stress: normal; }
-span.lowered-emphasis { voice-stress: reduced; }
-span.removed-emphasis { voice-stress: none; }
-span.normal-emphasis { voice-stress: moderate; }
-span.huge-emphasis { voice-stress: strong; }
+.default-emphasis { voice-stress: normal; }
+.lowered-emphasis { voice-stress: reduced; }
+.removed-emphasis { voice-stress: none; }
+.normal-emphasis { voice-stress: moderate; }
+.huge-emphasis { voice-stress: strong; }
                 
 ...
 
 &lt;p&gt;This is a big car.&lt;/p&gt;
 &lt;!-- The speech output from the line above is identical to the line below: --&gt;
-&lt;p&gt;This is a &lt;span class="default-emphasis"&gt;big&lt;/span&gt; car.&lt;/p&gt;
+&lt;p&gt;This is a &lt;em class="default-emphasis"&gt;big&lt;/em&gt; car.&lt;/p&gt;
 
-&lt;p&gt;This car is &lt;span class="lowered-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
-&lt;!-- The "span" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
-&lt;p&gt;This car is &lt;span class="removed-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
+&lt;p&gt;This car is &lt;em class="lowered-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
+&lt;!-- The "em" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
+&lt;p&gt;This car is &lt;em class="removed-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
 
 &lt;!-- The lines below demonstrate increasing levels of emphasis: --&gt;
-&lt;p&gt;This is a &lt;span class="normal-emphasis"&gt;big&lt;/span&gt; car!&lt;/p&gt;
-&lt;p&gt;This is a &lt;span class="huge-emphasis"&gt;big&lt;/span&gt; car!!!&lt;/p&gt;</pre>
+&lt;p&gt;This is a &lt;em class="normal-emphasis"&gt;big&lt;/em&gt; car!&lt;/p&gt;
+&lt;p&gt;This is a &lt;em class="huge-emphasis"&gt;big&lt;/em&gt; car!!!&lt;/p&gt;</pre>
     </div>
     <h2 id="duration-props">Voice duration property</h2>
     <h3 id="mixing-props-voice-duration">The 'voice-duration' property</h3>
Received on Tuesday, 21 February 2012 23:11:40 UTC
