- From: Daniel Weck via cvs-syncmail <cvsmail@w3.org>
- Date: Mon, 01 Aug 2011 21:01:50 +0000
- To: public-css-commits@w3.org
Update of /sources/public/csswg/css3-speech In directory hutz:/tmp/cvs-serv23726 Modified Files: Overview.html Overview.src.html Log Message: attempt to normatively clarify the audio cue / TTS volume level normalization. Index: Overview.html =================================================================== RCS file: /sources/public/csswg/css3-speech/Overview.html,v retrieving revision 1.86 retrieving revision 1.87 diff -u -d -r1.86 -r1.87 --- Overview.html 1 Aug 2011 17:45:43 -0000 1.86 +++ Overview.html 1 Aug 2011 21:01:48 -0000 1.87 @@ -427,7 +427,7 @@ <div class=example> <p>This example shows how authors can tell the speech synthesizer to speak HTML headings with a voice called "paul", using "moderate" emphasis - (which is more than normal) and how to insert an audio cue (prerecorded + (which is more than normal) and how to insert an audio cue (pre-recorded audio clip located at the given URL) before the start of TTS rendering for each heading. In a stereo-capable sound system, paragraphs marked with the CSS class "heidi" are rendered on the left audio channel (and @@ -554,9 +554,9 @@ </table> <p>The ‘<a href="#voice-volume"><code - class=property>voice-volume</code></a>’ property manipulates the - amplitude of the audio waveform generated by the speech synthesiser, and - is also used when calculating the relative volume level of <a + class=property>voice-volume</code></a>’ property allows authors to + control the amplitude of the audio waveform generated by the speech + synthesiser, and is also used to adjust the relative volume level of <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. @@ -1225,9 +1225,9 @@ <td> <em>Value:</em> <td><‘<a href="#pause-before"><code - class=property>pause-before</code></a>’> || <‘<a + class=property>pause-before</code></a>’> <‘<a href="#pause-after"><code - class=property>pause-after</code></a>’> + class=property>pause-after</code></a>’>? 
<tr> <td> <em>Initial:</em> @@ -1495,9 +1495,9 @@ <td> <em>Value:</em> <td><‘<a href="#rest-before"><code - class=property>rest-before</code></a>’> || <‘<a + class=property>rest-before</code></a>’> <‘<a href="#rest-after"><code - class=property>rest-after</code></a>’> + class=property>rest-after</code></a>’>? <tr> <td> <em>Initial:</em> @@ -1639,8 +1639,8 @@ <p>The ‘<a href="#cue-before"><code class=property>cue-before</code></a>’ and ‘<a href="#cue-after"><code class=property>cue-after</code></a>’ - properties specify auditory icons (i.e. prerecorded audio clips) to be - played before (or after) the selected element within the <a + properties specify auditory icons (i.e. pre-recorded / pre-generated sound + clips) to be played before (or after) the selected element within the <a href="#aural-model">audio "box" model</a>. <p class=note> Note that the functionality provided by this property is @@ -1670,15 +1670,61 @@ (decibel unit). This represents a change (positive or negative) relative to the computed value of the ‘<a href="#voice-volume"><code class=property>voice-volume</code></a>’ property within the <a - href="#aural-model">aural "box" model</a> of the selected element. When - the ‘<a href="#voice-volume"><code + href="#aural-model">aural "box" model</a> of the selected element. + Decibels express the ratio of the squares of the new signal amplitude + (a1) and the current amplitude (a0), as per the following logarithmic + equation: volume(dB) = 20 log10 (a1 / a0)</p> + + <p> When the ‘<a href="#voice-volume"><code class=property>voice-volume</code></a>’ property is set to ‘<code class=property>silent</code>’, the audio cue is also set to ‘<code class=property>silent</code>’ (regardless of - the value specified for this <decibel>). Decibels express the - ratio of the squares of the new signal amplitude (a1) and the current - amplitude (a0), as per the following logarithmic equation: volume(dB) = - 20 log10 (a1 / a0)</p> + this specified <decibel> value). 
Otherwise (when not ‘<code + class=property>silent</code>’), ‘<a + href="#voice-volume"><code class=property>voice-volume</code></a>’ + values are always specified relative to the volume level keywords, + which map to a user-configured scale of "preferred" loudness settings + (see the definition of ‘<a href="#voice-volume"><code + class=property>voice-volume</code></a>’). If the inherited + ‘<a href="#voice-volume"><code + class=property>voice-volume</code></a>’ value already contains a + decibel offset, the dB offset specific to the audio cue is combined + additively. + + <p> The desired effect of an audio cue set at +0dB is that the volume + level during playback of the pre-recorded / pre-generated audio signal + is effectively the same as the volume level of live (i.e. real-time) + speech synthesis rendition. In order to achieve this effect, speech + processors must be capable of directly controlling the waveform amplitude of + generated text-to-speech audio, user agents must be able to adjust the + volume output of audio cues (i.e. amplify or attenuate audio signals + based on the intrinsic waveform amplitude of sound clips), and last but + not least, authors must ensure that the "normal" volume level of + pre-recorded audio cues (on average, as there may be discrete variations + due to changes in the audio stream, such as intonation, stress, etc.) + matches that of a "typical" TTS voice output (based on the ‘<a + href="#voice-family"><code class=property>voice-family</code></a>’ + intended for use), given standard listening conditions (i.e. default + system volume levels, centered equalization across the frequency + spectrum). This latter prerequisite sets a baseline that enables a user + agent to align the volume outputs of both TTS and cue audio streams + within the same "aural box model".
Due to the complex relationship + between perceived audio characteristics and the processing applied to + the digitized audio signal, we will simplify the definition of "normal" + volume levels by referring to a canonical recording scenario, whereby + the attenuation is typically indicated in decibels, ranging from 0dB + (maximum audio input, near clipping threshold) to -60dB (total silence). + In this common context, a "standard" audio clip would oscillate between + these values, the loudest peak levels would be close to -3dB (to avoid + distortion), and the audible passages would have average volume levels + as high as possible (i.e. not too quiet, to avoid background noise + during amplification). This would roughly provide an audio experience + that could be seamlessly combined with text-to-speech output (i.e. there + would be no discernible difference in volume levels when switching from + pre-recorded audio to speech synthesis). Although there exists no + industry-wide standard to back up such a convention, TTS engines usually + generate comparably loud audio signals when no amplification (or + attenuation) is specified.</p> <p class=note> Note that -6.0dB is approximately half the amplitude of the audio signal, and +6.0dB is approximately twice the amplitude.</p> @@ -1906,15 +1952,16 @@ ranges may be used by the processor-dependent voice-matching algorithm). </p> - <p class=note> The interpretation of the relationship between a person's - age and a recognizable type of voice cannot realistically be defined in - a universal manner, as it effectively depends on numerous cultural and - linguistic variations. The values provided by this specification - therefore represent a simplified model that can be reasonably applied to - a great variety of speech locales, albeit at the cost of a certain - degree of approximation.
Future versions of this specification may - refine the level of precision of the voice-matching algorithm, as speech - processor implementations become more standardized.</p> + <p class=note> Note that the interpretation of the relationship between a + person's age and a recognizable type of voice cannot realistically be + defined in a universal manner, as it effectively depends on numerous + criteria (cultural, linguistic, biological, etc.). The values provided + by this specification therefore represent a simplified model that can be + reasonably applied to a broad variety of speech contexts, albeit at the + cost of a certain degree of approximation. Future versions of this + specification may refine the level of precision of the voice-matching + algorithm, as speech processor implementations become more standardized. + </p> <dt> <strong><gender></strong> @@ -2218,10 +2265,11 @@ <tr> <td> <em>Computed value:</em> - <td> one of the predefined keywords if only the keyword is specified by - itself, otherwise a fixed frequency calculated by converting the - keyword value (if any) to an absolute value based on the current - voice-family and by applying the specified relative offset (if any) + <td> one of the predefined pitch keywords if only the keyword is + specified by itself, otherwise an absolute frequency calculated by + converting the keyword value (if any) to a fixed frequency based on the + current voice-family and by applying the specified relative offset (if + any) </table> <p>The ‘<a href="#voice-pitch"><code @@ -2306,14 +2354,14 @@ the conversion from a keyword to a concrete, voice-dependent frequency.</p> </dl> - <p> Computed absolute frequency values that are negative are clamped to - zero Hertz. Speech-capable user agents are likely to support a specific - range of values rather than the full range of possible calculated - numerical values for frequencies. 
The actual values in user agents may - therefore be clamped to implementation-dependent minimum and maximum - boundaries. For example: although the 0Hz frequency can be legitimately - calculated, it may be clamped to a more meaningful value in the context of - the speech synthesizer. + <p> Computed absolute frequencies that are negative are clamped to zero + Hertz. Speech-capable user agents are likely to support a specific range + of values rather than the full range of possible calculated numerical + values for frequencies. The actual values in user agents may therefore be + clamped to implementation-dependent minimum and maximum boundaries. For + example: although the 0Hz frequency can be legitimately calculated, it may + be clamped to a more meaningful value in the context of the speech + synthesizer. <div class=example> <p>Examples of property values:</p> @@ -2377,10 +2425,11 @@ <tr> <td> <em>Computed value:</em> - <td> one of the predefined keywords if only the keyword is specified by - itself, otherwise a fixed frequency calculated by converting the - keyword value (if any) to an absolute value based on the current - voice-family and by applying the specified relative offset (if any) + <td> one of the predefined pitch keywords if only the keyword is + specified by itself, otherwise an absolute frequency calculated by + converting the keyword value (if any) to a fixed frequency based on the + current voice-family and by applying the specified relative offset (if + any) </table> <p> The ‘<a href="#voice-range"><code @@ -2465,14 +2514,14 @@ the conversion from a keyword to a concrete, voice-dependent frequency.</p> </dl> - <p> Computed absolute frequency values that are negative are clamped to - zero Hertz. Speech-capable user agents are likely to support a specific - range of values rather than the full range of possible calculated - numerical values for frequencies. 
The actual values in user agents may - therefore be clamped to implementation-dependent minimum and maximum - boundaries. For example: although the 0Hz frequency can be legitimately - calculated, it may be clamped to a more meaningful value in the context of - the speech synthesizer. + <p> Computed absolute frequencies that are negative are clamped to zero + Hertz. Speech-capable user agents are likely to support a specific range + of values rather than the full range of possible calculated numerical + values for frequencies. The actual values in user agents may therefore be + clamped to implementation-dependent minimum and maximum boundaries. For + example: although the 0Hz frequency can be legitimately calculated, it may + be clamped to a more meaningful value in the context of the speech + synthesizer. <div class=example> <p>Examples of inherited values:</p> @@ -3000,8 +3049,8 @@ <tr> <th><a class=property href="#pause">pause</a> - <td><‘pause-before’> || - <‘pause-after’> + <td><‘pause-before’> + <‘pause-after’>? <td>N/A (see individual properties) @@ -3046,8 +3095,7 @@ <tr> <th><a class=property href="#rest">rest</a> - <td><‘rest-before’> || - <‘rest-after’> + <td><‘rest-before’> <‘rest-after’>? 
<td>N/A (see individual properties) Index: Overview.src.html =================================================================== RCS file: /sources/public/csswg/css3-speech/Overview.src.html,v retrieving revision 1.87 retrieving revision 1.88 diff -u -d -r1.87 -r1.88 --- Overview.src.html 1 Aug 2011 17:45:43 -0000 1.87 +++ Overview.src.html 1 Aug 2011 21:01:48 -0000 1.88 @@ -184,7 +184,7 @@ <div class="example"> <p>This example shows how authors can tell the speech synthesizer to speak HTML headings with a voice called "paul", using "moderate" emphasis (which is more than normal) and how to - insert an audio cue (prerecorded audio clip located at the given URL) before the start of + insert an audio cue (pre-recorded audio clip located at the given URL) before the start of TTS rendering for each heading. In a stereo-capable sound system, paragraphs marked with the CSS class "heidi" are rendered on the left audio channel (and with a female voice, etc.), whilst the class "peter" corresponds to the right channel (and to a male voice, etc.). The @@ -296,9 +296,9 @@ </tr> </tbody> </table> - <p>The 'voice-volume' property manipulates the amplitude of the audio waveform generated by the - speech synthesiser, and is also used when calculating the relative volume level of <a - href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p> + <p>The 'voice-volume' property allows authors to control the amplitude of the audio waveform + generated by the speech synthesiser, and is also used to adjust the relative volume level of + <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p> <p class="note"> Note that the functionality provided by this property is related to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code> attribute of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]]. 
</p> @@ -871,7 +871,7 @@ <td> <em>Value:</em> </td> - <td><'pause-before'> || <'pause-after'></td> + <td><'pause-before'> <'pause-after'>?</td> </tr> <tr> <td> @@ -1096,7 +1096,7 @@ <td> <em>Value:</em> </td> - <td><'rest-before'> || <'rest-after'></td> + <td><'rest-before'> <'rest-after'>?</td> </tr> <tr> <td> @@ -1246,9 +1246,9 @@ </tr> </tbody> </table> - <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. prerecorded audio - clips) to be played before (or after) the selected element within the <a href="#aural-model" - >audio "box" model</a>.</p> + <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. pre-recorded / + pre-generated sound clips) to be played before (or after) the selected element within the <a + href="#aural-model">audio "box" model</a>.</p> <p class="note"> Note that the functionality provided by this property is related to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_audio"><code>audio</code> element</a> from the SSML markup language [[!SSML]]. </p> @@ -1274,11 +1274,41 @@ <p>A <a href="#number-def">number</a> immediately followed by "dB" (decibel unit). This represents a change (positive or negative) relative to the computed value of the 'voice-volume' property within the <a href="#aural-model">aural "box" model</a> of the - selected element. When the 'voice-volume' property is set to 'silent', the audio cue is - also set to 'silent' (regardless of the value specified for this <decibel>). - Decibels express the ratio of the squares of the new signal amplitude (a1) and the current - amplitude (a0), as per the following logarithmic equation: volume(dB) = 20 log10 (a1 / - a0)</p> + selected element. 
Decibels express the ratio of the squares of the new signal amplitude + (a1) and the current amplitude (a0), as per the following logarithmic equation: volume(dB) + = 20 log10 (a1 / a0) </p> + <p> When the 'voice-volume' property is set to 'silent', the audio cue is also set to + 'silent' (regardless of this specified <decibel> value). Otherwise (when not + 'silent'), 'voice-volume' values are always specified relative to the volume level + keywords, which map to a user-configured scale of "preferred" loudness settings (see the + definition of 'voice-volume'). If the inherited 'voice-volume' value already contains a + decibel offset, the dB offset specific to the audio cue is combined additively. </p><p> + The desired effect of an audio cue set at +0dB is that the volume level during playback of + the pre-recorded / pre-generated audio signal is effectively the same as the volume level + of live (i.e. real-time) speech synthesis rendition. In order to achieve this effect, + speech processors must be capable of directly controlling the waveform amplitude of generated + text-to-speech audio, user agents must be able to adjust the volume output of audio cues + (i.e. amplify or attenuate audio signals based on the intrinsic waveform amplitude of + sound clips), and last but not least, authors must ensure that the "normal" volume level + of pre-recorded audio cues (on average, as there may be discrete variations due to changes + in the audio stream, such as intonation, stress, etc.) matches that of a "typical" TTS + voice output (based on the 'voice-family' intended for use), given standard listening + conditions (i.e. default system volume levels, centered equalization across the frequency + spectrum). This latter prerequisite sets a baseline that enables a user agent to align the + volume outputs of both TTS and cue audio streams within the same "aural box model".
Due to + the complex relationship between perceived audio characteristics and the processing + applied to the digitized audio signal, we will simplify the definition of "normal" volume + levels by referring to a canonical recording scenario, whereby the attenuation is + typically indicated in decibels, ranging from 0dB (maximum audio input, near clipping + threshold) to -60dB (total silence). In this common context, a "standard" audio clip would + oscillate between these values, the loudest peak levels would be close to -3dB (to avoid + distortion), and the audible passages would have average volume levels as high as possible + (i.e. not too quiet, to avoid background noise during amplification). This would roughly + provide an audio experience that could be seamlessly combined with text-to-speech output + (i.e. there would be no discernible difference in volume levels when switching from + pre-recorded audio to speech synthesis). Although there exists no industry-wide standard + to back up such a convention, TTS engines usually generate comparably loud audio signals when + no amplification (or attenuation) is specified.</p> <p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal, and +6.0dB is approximately twice the amplitude.</p> <p class="note"> Note that there is a difference between an audio cue whose volume is set to @@ -1473,13 +1503,13 @@ match during voice selection. The mapping with [[!SSML]] ages is defined as follows: 'child' = 6 y/o, 'young' = 24 y/o, 'old' = 75 y/o (note that more flexible age ranges may be used by the processor-dependent voice-matching algorithm). </p> - <p class="note"> The interpretation of the relationship between a person's age and a - recognizable type of voice cannot realistically be defined in a universal manner, as it - effectively depends on numerous cultural and linguistic variations.
The values provided by - this specification therefore represent a simplified model that can be reasonably applied - to a great variety of speech locales, albeit at the cost of a certain degree of - approximation. Future versions of this specification may refine the level of precision of - the voice-matching algorithm, as speech processor implementations become more + <p class="note"> Note that the interpretation of the relationship between a person's age and + a recognizable type of voice cannot realistically be defined in a universal manner, as it + effectively depends on numerous criteria (cultural, linguistic, biological, etc.). The + values provided by this specification therefore represent a simplified model that can be + reasonably applied to a broad variety of speech contexts, albeit at the cost of a certain + degree of approximation. Future versions of this specification may refine the level of + precision of the voice-matching algorithm, as speech processor implementations become more standardized. </p> </dd> <dt> @@ -1752,10 +1782,10 @@ <td> <em>Computed value:</em> </td> - <td> one of the predefined keywords if only the keyword is specified by itself, otherwise - a fixed frequency calculated by converting the keyword value (if any) to an absolute - value based on the current voice-family and by applying the specified relative offset - (if any)</td> + <td> one of the predefined pitch keywords if only the keyword is specified by itself, + otherwise an absolute frequency calculated by converting the keyword value (if any) to a + fixed frequency based on the current voice-family and by applying the specified relative + offset (if any)</td> </tr> </tbody> </table> @@ -1827,12 +1857,12 @@ conversion from a keyword to a concrete, voice-dependent frequency.</p> </dd> </dl> - <p> Computed absolute frequency values that are negative are clamped to zero Hertz. 
- Speech-capable user agents are likely to support a specific range of values rather than the - full range of possible calculated numerical values for frequencies. The actual values in user - agents may therefore be clamped to implementation-dependent minimum and maximum boundaries. - For example: although the 0Hz frequency can be legitimately calculated, it may be clamped to a - more meaningful value in the context of the speech synthesizer. </p> + <p> Computed absolute frequencies that are negative are clamped to zero Hertz. Speech-capable + user agents are likely to support a specific range of values rather than the full range of + possible calculated numerical values for frequencies. The actual values in user agents may + therefore be clamped to implementation-dependent minimum and maximum boundaries. For example: + although the 0Hz frequency can be legitimately calculated, it may be clamped to a more + meaningful value in the context of the speech synthesizer. </p> <div class="example"> <p>Examples of property values:</p> <pre> @@ -1897,10 +1927,10 @@ <td> <em>Computed value:</em> </td> - <td> one of the predefined keywords if only the keyword is specified by itself, otherwise - a fixed frequency calculated by converting the keyword value (if any) to an absolute - value based on the current voice-family and by applying the specified relative offset - (if any)</td> + <td> one of the predefined pitch keywords if only the keyword is specified by itself, + otherwise an absolute frequency calculated by converting the keyword value (if any) to a + fixed frequency based on the current voice-family and by applying the specified relative + offset (if any)</td> </tr> </tbody> </table> @@ -1973,12 +2003,12 @@ conversion from a keyword to a concrete, voice-dependent frequency.</p> </dd> </dl> - <p> Computed absolute frequency values that are negative are clamped to zero Hertz. 
- Speech-capable user agents are likely to support a specific range of values rather than the - full range of possible calculated numerical values for frequencies. The actual values in user - agents may therefore be clamped to implementation-dependent minimum and maximum boundaries. - For example: although the 0Hz frequency can be legitimately calculated, it may be clamped to a - more meaningful value in the context of the speech synthesizer. </p> + <p> Computed absolute frequencies that are negative are clamped to zero Hertz. Speech-capable + user agents are likely to support a specific range of values rather than the full range of + possible calculated numerical values for frequencies. The actual values in user agents may + therefore be clamped to implementation-dependent minimum and maximum boundaries. For example: + although the 0Hz frequency can be legitimately calculated, it may be clamped to a more + meaningful value in the context of the speech synthesizer. </p> <div class="example"> <p>Examples of inherited values:</p> <pre>
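As a side note for readers of this change: the decibel relation the commit moves into its own paragraph, volume(dB) = 20 log10 (a1 / a0), together with the rule that a cue-specific dB offset is combined additively with an inherited 'voice-volume' offset, can be sketched in Python. The helper names below are illustrative only, not part of the spec.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Invert volume(dB) = 20 * log10(a1 / a0) to recover the ratio a1/a0."""
    return 10.0 ** (db / 20.0)

def combine_offsets_db(inherited_db: float, cue_db: float) -> float:
    """Per the clarified text, the dB offset specific to an audio cue is
    combined additively with any inherited 'voice-volume' dB offset."""
    return inherited_db + cue_db

# The spec's note: -6.0dB is approximately half the amplitude,
# and +6.0dB approximately twice the amplitude.
print(round(db_to_amplitude_ratio(-6.0), 3))  # 0.501
print(round(db_to_amplitude_ratio(+6.0), 3))  # 1.995
```

Adding offsets in decibels corresponds to multiplying amplitude ratios, which is why the additive combination rule composes cleanly with the logarithmic definition.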
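The clamping behaviour reworded in the 'voice-pitch' / 'voice-range' hunks (negative computed frequencies are clamped to zero Hertz, and user agents may further clamp to implementation-dependent minimum and maximum boundaries) could be sketched as follows. The 60-400 Hz bounds here are hypothetical placeholders, since the spec deliberately leaves the range implementation-dependent.

```python
def clamp_frequency(computed_hz: float,
                    min_hz: float = 60.0,
                    max_hz: float = 400.0) -> float:
    """Clamp a computed 'voice-pitch'/'voice-range' frequency.

    Negative computed values are clamped to zero Hertz first; a speech
    synthesizer may then clamp the result to its own supported range
    (the default bounds here are made up for illustration).
    """
    hz = max(0.0, computed_hz)
    return min(max(hz, min_hz), max_hz)
```

This mirrors the spec's example: 0Hz can be legitimately calculated, but a user agent would clamp it to a value that is meaningful for its synthesizer.
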
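The grammar change from "<'pause-before'> || <'pause-after'>" to "<'pause-before'> <'pause-after'>?" (and likewise for 'rest') makes the first value mandatory and the second optional. A minimal sketch of how a UA might expand the shorthand, assuming the usual CSS convention that a single value applies to both longhands (the function name and dict representation are illustrative, not from the spec):

```python
def expand_pause_shorthand(value: str) -> dict:
    """Expand the revised 'pause' shorthand grammar:
    <'pause-before'> <'pause-after'>?

    If the optional second value is omitted, the first value is assumed
    to apply to both longhand properties (hypothetical expansion rule,
    following common CSS shorthand behaviour).
    """
    parts = value.split()
    if not 1 <= len(parts) <= 2:
        raise ValueError("expected one or two component values")
    before = parts[0]
    after = parts[1] if len(parts) == 2 else before
    return {"pause-before": before, "pause-after": after}
```

Note what the change forbids: under the old "||" combinator the two components could appear in either order, whereas juxtaposition with "?" fixes the order and drops the standalone-'pause-after' form.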
Received on Monday, 1 August 2011 21:01:52 UTC