- From: Daniel Weck via cvs-syncmail <cvsmail@w3.org>
- Date: Mon, 01 Aug 2011 21:01:50 +0000
- To: public-css-commits@w3.org
Update of /sources/public/csswg/css3-speech
In directory hutz:/tmp/cvs-serv23726
Modified Files:
Overview.html Overview.src.html
Log Message:
attempt to normatively clarify the audio cue / TTS volume level normalization.
Index: Overview.html
===================================================================
RCS file: /sources/public/csswg/css3-speech/Overview.html,v
retrieving revision 1.86
retrieving revision 1.87
diff -u -d -r1.86 -r1.87
--- Overview.html 1 Aug 2011 17:45:43 -0000 1.86
+++ Overview.html 1 Aug 2011 21:01:48 -0000 1.87
@@ -427,7 +427,7 @@
<div class=example>
<p>This example shows how authors can tell the speech synthesizer to speak
HTML headings with a voice called "paul", using "moderate" emphasis
- (which is more than normal) and how to insert an audio cue (prerecorded
+ (which is more than normal) and how to insert an audio cue (pre-recorded
audio clip located at the given URL) before the start of TTS rendering
for each heading. In a stereo-capable sound system, paragraphs marked
with the CSS class "heidi" are rendered on the left audio channel (and
@@ -554,9 +554,9 @@
</table>
<p>The ‘<a href="#voice-volume"><code
- class=property>voice-volume</code></a>’ property manipulates the
- amplitude of the audio waveform generated by the speech synthesiser, and
- is also used when calculating the relative volume level of <a
+ class=property>voice-volume</code></a>’ property allows authors to
+ control the amplitude of the audio waveform generated by the speech
+ synthesiser, and is also used to adjust the relative volume level of <a
href="#cue-props">audio cues</a> within the <a href="#aural-model">audio
"box" model</a>.
@@ -1225,9 +1225,9 @@
<td> <em>Value:</em>
<td><‘<a href="#pause-before"><code
- class=property>pause-before</code></a>’> || <‘<a
+ class=property>pause-before</code></a>’> <‘<a
href="#pause-after"><code
- class=property>pause-after</code></a>’>
+ class=property>pause-after</code></a>’>?
<tr>
<td> <em>Initial:</em>
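[Editor's illustration, non-normative:] The revised value grammar above makes the two components positional rather than reorderable. A minimal sketch of the implied shorthand expansion, assuming (this diff does not state it) that an omitted second value copies the first, as is common for CSS shorthands:

```python
def expand_pause(shorthand):
    """Expand 'pause' per the revised grammar:
    <'pause-before'> <'pause-after'>?
    Assumes an omitted second value copies the first (an assumption,
    not stated in this change)."""
    parts = shorthand.split()
    before = parts[0]
    after = parts[1] if len(parts) > 1 else parts[0]
    return {"pause-before": before, "pause-after": after}
```

The same positional grammar is applied to the 'rest' shorthand later in this change.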
@@ -1495,9 +1495,9 @@
<td> <em>Value:</em>
<td><‘<a href="#rest-before"><code
- class=property>rest-before</code></a>’> || <‘<a
+ class=property>rest-before</code></a>’> <‘<a
href="#rest-after"><code
- class=property>rest-after</code></a>’>
+ class=property>rest-after</code></a>’>?
<tr>
<td> <em>Initial:</em>
@@ -1639,8 +1639,8 @@
<p>The ‘<a href="#cue-before"><code
class=property>cue-before</code></a>’ and ‘<a
href="#cue-after"><code class=property>cue-after</code></a>’
- properties specify auditory icons (i.e. prerecorded audio clips) to be
- played before (or after) the selected element within the <a
+ properties specify auditory icons (i.e. pre-recorded / pre-generated sound
+ clips) to be played before (or after) the selected element within the <a
href="#aural-model">audio "box" model</a>.
<p class=note> Note that the functionality provided by this property is
@@ -1670,15 +1670,61 @@
(decibel unit). This represents a change (positive or negative) relative
to the computed value of the ‘<a href="#voice-volume"><code
class=property>voice-volume</code></a>’ property within the <a
- href="#aural-model">aural "box" model</a> of the selected element. When
- the ‘<a href="#voice-volume"><code
+ href="#aural-model">aural "box" model</a> of the selected element.
+ Decibels express the ratio of the squares of the new signal amplitude
+ (a1) and the current amplitude (a0), as per the following logarithmic
+ equation: volume(dB) = 20 log10 (a1 / a0)</p>
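[Editor's illustration, non-normative:] The equation above can be checked numerically; inverting it gives the amplitude scale factor implied by a dB value, which also confirms the ±6.0dB note further down:

```python
import math

def db_from_amplitude_ratio(a1, a0):
    # volume(dB) = 20 * log10(a1 / a0), as in the spec's equation
    return 20 * math.log10(a1 / a0)

def amplitude_ratio_from_db(db):
    # inverse: the factor applied to the waveform amplitude
    return 10 ** (db / 20)

# -6.0dB is approximately half the amplitude, +6.0dB approximately twice:
# amplitude_ratio_from_db(-6.0) ~ 0.501, amplitude_ratio_from_db(6.0) ~ 1.995
```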
+
+ <p> When the ‘<a href="#voice-volume"><code
class=property>voice-volume</code></a>’ property is set to
‘<code class=property>silent</code>’, the audio cue is also
set to ‘<code class=property>silent</code>’ (regardless of
- the value specified for this <decibel>). Decibels express the
- ratio of the squares of the new signal amplitude (a1) and the current
- amplitude (a0), as per the following logarithmic equation: volume(dB) =
- 20 log10 (a1 / a0)</p>
+ this specified <decibel> value). Otherwise (when not ‘<code
+ class=property>silent</code>’), ‘<a
+ href="#voice-volume"><code class=property>voice-volume</code></a>’
+ values are always specified relative to the volume level keywords,
+ which map to a user-configured scale of "preferred" loudness settings
+ (see the definition of ‘<a href="#voice-volume"><code
+ class=property>voice-volume</code></a>’). If the inherited
+ ‘<a href="#voice-volume"><code
+ class=property>voice-volume</code></a>’ value already contains a
+ decibel offset, the dB offset specific to the audio cue is combined
+ additively.
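[Editor's illustration, non-normative:] The combination rules described above (the 'silent' override, and additive dB offsets) can be sketched as follows; the value representation is hypothetical, not a spec data model:

```python
def effective_cue_volume(voice_volume, cue_db):
    """Combine a computed 'voice-volume' with an audio cue's <decibel>
    offset, per the behaviour described above.
    voice_volume: "silent", or a (keyword, db_offset) pair -- an
    illustrative representation, not defined by the spec."""
    if voice_volume == "silent":
        # the cue is silenced regardless of its specified <decibel> value
        return "silent"
    keyword, db_offset = voice_volume
    # dB offsets combine additively on top of the volume level keyword
    return (keyword, db_offset + cue_db)
```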
+
+ <p> The desired effect of an audio cue set at +0dB is that the volume
+ level during playback of the pre-recorded / pre-generated audio signal
+ is effectively the same as the volume level of live (i.e. real-time)
+ speech synthesis rendition. In order to achieve this effect, speech
+ processors must be capable of directly controlling the waveform amplitude
+ of
+ generated text-to-speech audio, user agents must be able to adjust the
+ volume output of audio cues (i.e. amplify or attenuate audio signals
+ based on the intrinsic waveform amplitude of sound clips), and last but
+ not least, authors must ensure that the "normal" volume level of
+ pre-recorded audio cues (on average, as there may be discrete variations
+ due to changes in the audio stream, such as intonation, stress, etc.)
+ matches that of a "typical" TTS voice output (based on the ‘<a
+ href="#voice-family"><code class=property>voice-family</code></a>’
+ intended for use), given standard listening conditions (i.e. default
+ system volume levels, centered equalization across the frequency
+ spectrum). This latter prerequisite sets a baseline that enables a user
+ agent to align the volume outputs of both TTS and cue audio streams
+ within the same "aural box model". Due to the complex relationship
+ between perceived audio characteristics and the processing applied to
+ the digitized audio signal, we will simplify the definition of "normal"
+ volume levels by referring to a canonical recording scenario, whereby
+ the attenuation is typically indicated in decibels, ranging from 0dB
+ (maximum audio input, near clipping threshold) to -60dB (total silence).
+ In this common context, a "standard" audio clip would oscillate between
+ these values, the loudest peak levels would be close to -3dB (to avoid
+ distortion), and the audible passages would have average volume levels
+ as high as possible (i.e. not too quiet, to avoid background noise
+ during amplification). This would roughly provide an audio experience
+ that could be seamlessly combined with text-to-speech output (i.e. there
+ would be no discernible difference in volume levels when switching from
+ pre-recorded audio to speech synthesis). Although there exists no
+ industry-wide standard to back up such a convention, TTS engines usually
+ generate comparably loud audio signals when no amplification (or
+ attenuation) is specified.</p>
<p class=note> Note that -6.0dB is approximately half the amplitude of
the audio signal, and +6.0dB is approximately twice the amplitude.</p>
@@ -1906,15 +1952,16 @@
ranges may be used by the processor-dependent voice-matching algorithm).
</p>
- <p class=note> The interpretation of the relationship between a person's
- age and a recognizable type of voice cannot realistically be defined in
- a universal manner, as it effectively depends on numerous cultural and
- linguistic variations. The values provided by this specification
- therefore represent a simplified model that can be reasonably applied to
- a great variety of speech locales, albeit at the cost of a certain
- degree of approximation. Future versions of this specification may
- refine the level of precision of the voice-matching algorithm, as speech
- processor implementations become more standardized.</p>
+ <p class=note> Note that the interpretation of the relationship between a
+ person's age and a recognizable type of voice cannot realistically be
+ defined in a universal manner, as it effectively depends on numerous
+ criteria (cultural, linguistic, biological, etc.). The values provided
+ by this specification therefore represent a simplified model that can be
+ reasonably applied to a broad variety of speech contexts, albeit at the
+ cost of a certain degree of approximation. Future versions of this
+ specification may refine the level of precision of the voice-matching
+ algorithm, as speech processor implementations become more standardized.
+ </p>
<dt> <strong><gender></strong>
@@ -2218,10 +2265,11 @@
<tr>
<td> <em>Computed value:</em>
- <td> one of the predefined keywords if only the keyword is specified by
- itself, otherwise a fixed frequency calculated by converting the
- keyword value (if any) to an absolute value based on the current
- voice-family and by applying the specified relative offset (if any)
+ <td> one of the predefined pitch keywords if only the keyword is
+ specified by itself, otherwise an absolute frequency calculated by
+ converting the keyword value (if any) to a fixed frequency based on the
+ current voice-family and by applying the specified relative offset (if
+ any)
</table>
<p>The ‘<a href="#voice-pitch"><code
@@ -2306,14 +2354,14 @@
the conversion from a keyword to a concrete, voice-dependent frequency.</p>
</dl>
- <p> Computed absolute frequency values that are negative are clamped to
- zero Hertz. Speech-capable user agents are likely to support a specific
- range of values rather than the full range of possible calculated
- numerical values for frequencies. The actual values in user agents may
- therefore be clamped to implementation-dependent minimum and maximum
- boundaries. For example: although the 0Hz frequency can be legitimately
- calculated, it may be clamped to a more meaningful value in the context of
- the speech synthesizer.
+ <p> Computed absolute frequencies that are negative are clamped to zero
+ Hertz. Speech-capable user agents are likely to support a specific range
+ of values rather than the full range of possible calculated numerical
+ values for frequencies. The actual values in user agents may therefore be
+ clamped to implementation-dependent minimum and maximum boundaries. For
+ example: although the 0Hz frequency can be legitimately calculated, it may
+ be clamped to a more meaningful value in the context of the speech
+ synthesizer.
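[Editor's illustration, non-normative:] The two-stage clamping described above can be sketched as follows; the implementation bounds are illustrative values, not taken from the spec:

```python
def computed_frequency(hz):
    # computed absolute frequencies that are negative are clamped to 0Hz
    return max(hz, 0.0)

def actual_frequency(hz, impl_min=60.0, impl_max=500.0):
    """Actual value used by a speech-capable user agent: the computed
    value further clamped to implementation-dependent boundaries
    (impl_min/impl_max are hypothetical example bounds)."""
    return min(max(computed_frequency(hz), impl_min), impl_max)
```

For example, a legitimately calculated 0Hz computed value would be raised to the synthesizer's minimum meaningful frequency.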
<div class=example>
<p>Examples of property values:</p>
@@ -2377,10 +2425,11 @@
<tr>
<td> <em>Computed value:</em>
- <td> one of the predefined keywords if only the keyword is specified by
- itself, otherwise a fixed frequency calculated by converting the
- keyword value (if any) to an absolute value based on the current
- voice-family and by applying the specified relative offset (if any)
+ <td> one of the predefined pitch keywords if only the keyword is
+ specified by itself, otherwise an absolute frequency calculated by
+ converting the keyword value (if any) to a fixed frequency based on the
+ current voice-family and by applying the specified relative offset (if
+ any)
</table>
<p> The ‘<a href="#voice-range"><code
@@ -2465,14 +2514,14 @@
the conversion from a keyword to a concrete, voice-dependent frequency.</p>
</dl>
- <p> Computed absolute frequency values that are negative are clamped to
- zero Hertz. Speech-capable user agents are likely to support a specific
- range of values rather than the full range of possible calculated
- numerical values for frequencies. The actual values in user agents may
- therefore be clamped to implementation-dependent minimum and maximum
- boundaries. For example: although the 0Hz frequency can be legitimately
- calculated, it may be clamped to a more meaningful value in the context of
- the speech synthesizer.
+ <p> Computed absolute frequencies that are negative are clamped to zero
+ Hertz. Speech-capable user agents are likely to support a specific range
+ of values rather than the full range of possible calculated numerical
+ values for frequencies. The actual values in user agents may therefore be
+ clamped to implementation-dependent minimum and maximum boundaries. For
+ example: although the 0Hz frequency can be legitimately calculated, it may
+ be clamped to a more meaningful value in the context of the speech
+ synthesizer.
<div class=example>
<p>Examples of inherited values:</p>
@@ -3000,8 +3049,8 @@
<tr>
<th><a class=property href="#pause">pause</a>
- <td><‘pause-before’> ||
- <‘pause-after’>
+ <td><‘pause-before’>
+ <‘pause-after’>?
<td>N/A (see individual properties)
@@ -3046,8 +3095,7 @@
<tr>
<th><a class=property href="#rest">rest</a>
- <td><‘rest-before’> ||
- <‘rest-after’>
+ <td><‘rest-before’> <‘rest-after’>?
<td>N/A (see individual properties)
Index: Overview.src.html
===================================================================
RCS file: /sources/public/csswg/css3-speech/Overview.src.html,v
retrieving revision 1.87
retrieving revision 1.88
diff -u -d -r1.87 -r1.88
--- Overview.src.html 1 Aug 2011 17:45:43 -0000 1.87
+++ Overview.src.html 1 Aug 2011 21:01:48 -0000 1.88
@@ -184,7 +184,7 @@
<div class="example">
<p>This example shows how authors can tell the speech synthesizer to speak HTML headings with
a voice called "paul", using "moderate" emphasis (which is more than normal) and how to
- insert an audio cue (prerecorded audio clip located at the given URL) before the start of
+ insert an audio cue (pre-recorded audio clip located at the given URL) before the start of
TTS rendering for each heading. In a stereo-capable sound system, paragraphs marked with the
CSS class "heidi" are rendered on the left audio channel (and with a female voice, etc.),
whilst the class "peter" corresponds to the right channel (and to a male voice, etc.). The
@@ -296,9 +296,9 @@
</tr>
</tbody>
</table>
- <p>The 'voice-volume' property manipulates the amplitude of the audio waveform generated by the
- speech synthesiser, and is also used when calculating the relative volume level of <a
- href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p>
+ <p>The 'voice-volume' property allows authors to control the amplitude of the audio waveform
+ generated by the speech synthesiser, and is also used to adjust the relative volume level of
+ <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p>
<p class="note"> Note that the functionality provided by this property is related to the <a
href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code> attribute
of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]]. </p>
@@ -871,7 +871,7 @@
<td>
<em>Value:</em>
</td>
- <td><'pause-before'> || <'pause-after'></td>
+ <td><'pause-before'> <'pause-after'>?</td>
</tr>
<tr>
<td>
@@ -1096,7 +1096,7 @@
<td>
<em>Value:</em>
</td>
- <td><'rest-before'> || <'rest-after'></td>
+ <td><'rest-before'> <'rest-after'>?</td>
</tr>
<tr>
<td>
@@ -1246,9 +1246,9 @@
</tr>
</tbody>
</table>
- <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. prerecorded audio
- clips) to be played before (or after) the selected element within the <a href="#aural-model"
- >audio "box" model</a>.</p>
+ <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. pre-recorded /
+ pre-generated sound clips) to be played before (or after) the selected element within the <a
+ href="#aural-model">audio "box" model</a>.</p>
<p class="note"> Note that the functionality provided by this property is related to the <a
href="http://www.w3.org/TR/speech-synthesis11/#edef_audio"><code>audio</code> element</a>
from the SSML markup language [[!SSML]]. </p>
@@ -1274,11 +1274,41 @@
<p>A <a href="#number-def">number</a> immediately followed by "dB" (decibel unit). This
represents a change (positive or negative) relative to the computed value of the
'voice-volume' property within the <a href="#aural-model">aural "box" model</a> of the
- selected element. When the 'voice-volume' property is set to 'silent', the audio cue is
- also set to 'silent' (regardless of the value specified for this <decibel>).
- Decibels express the ratio of the squares of the new signal amplitude (a1) and the current
- amplitude (a0), as per the following logarithmic equation: volume(dB) = 20 log10 (a1 /
- a0)</p>
+ selected element. Decibels express the ratio of the squares of the new signal amplitude
+ (a1) and the current amplitude (a0), as per the following logarithmic equation: volume(dB)
+ = 20 log10 (a1 / a0) </p>
+ <p> When the 'voice-volume' property is set to 'silent', the audio cue is also set to
+ 'silent' (regardless of this specified <decibel> value). Otherwise (when not
+ 'silent'), 'voice-volume' values are always specified relative to the volume level
+ keywords, which map to a user-configured scale of "preferred" loudness settings (see the
+ definition of 'voice-volume'). If the inherited 'voice-volume' value already contains a
+ decibel offset, the dB offset specific to the audio cue is combined additively. </p><p>
+ The desired effect of an audio cue set at +0dB is that the volume level during playback of
+ the pre-recorded / pre-generated audio signal is effectively the same as the volume level
+ of live (i.e. real-time) speech synthesis rendition. In order to achieve this effect,
+ speech processors must be capable of directly controlling the waveform amplitude of generated
+ text-to-speech audio, user agents must be able to adjust the volume output of audio cues
+ (i.e. amplify or attenuate audio signals based on the intrinsic waveform amplitude of
+ sound clips), and last but not least, authors must ensure that the "normal" volume level
+ of pre-recorded audio cues (on average, as there may be discrete variations due to changes
+ in the audio stream, such as intonation, stress, etc.) matches that of a "typical" TTS
+ voice output (based on the 'voice-family' intended for use), given standard listening
+ conditions (i.e. default system volume levels, centered equalization across the frequency
+ spectrum). This latter prerequisite sets a baseline that enables a user agent to align the
+ volume outputs of both TTS and cue audio streams within the same "aural box model". Due to
+ the complex relationship between perceived audio characteristics and the processing
+ applied to the digitized audio signal, we will simplify the definition of "normal" volume
+ levels by referring to a canonical recording scenario, whereby the attenuation is
+ typically indicated in decibels, ranging from 0dB (maximum audio input, near clipping
+ threshold) to -60dB (total silence). In this common context, a "standard" audio clip would
+ oscillate between these values, the loudest peak levels would be close to -3dB (to avoid
+ distortion), and the audible passages would have average volume levels as high as possible
+ (i.e. not too quiet, to avoid background noise during amplification). This would roughly
+ provide an audio experience that could be seamlessly combined with text-to-speech output
+ (i.e. there would be no discernible difference in volume levels when switching from
+ pre-recorded audio to speech synthesis). Although there exists no industry-wide standard
+ to back up such a convention, TTS engines usually generate comparably loud audio signals when
+ no amplification (or attenuation) is specified.</p>
<p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal,
and +6.0dB is approximately twice the amplitude.</p>
<p class="note"> Note that there is a difference between an audio cue whose volume is set to
@@ -1473,13 +1503,13 @@
match during voice selection. The mapping with [[!SSML]] ages is defined as follows:
'child' = 6 y/o, 'young' = 24 y/o, 'old' = 75 y/o (note that more flexible age ranges may
be used by the processor-dependent voice-matching algorithm). </p>
- <p class="note"> The interpretation of the relationship between a person's age and a
- recognizable type of voice cannot realistically be defined in a universal manner, as it
- effectively depends on numerous cultural and linguistic variations. The values provided by
- this specification therefore represent a simplified model that can be reasonably applied
- to a great variety of speech locales, albeit at the cost of a certain degree of
- approximation. Future versions of this specification may refine the level of precision of
- the voice-matching algorithm, as speech processor implementations become more
+ <p class="note"> Note that the interpretation of the relationship between a person's age and
+ a recognizable type of voice cannot realistically be defined in a universal manner, as it
+ effectively depends on numerous criteria (cultural, linguistic, biological, etc.). The
+ values provided by this specification therefore represent a simplified model that can be
+ reasonably applied to a broad variety of speech contexts, albeit at the cost of a certain
+ degree of approximation. Future versions of this specification may refine the level of
+ precision of the voice-matching algorithm, as speech processor implementations become more
standardized. </p>
</dd>
<dt>
@@ -1752,10 +1782,10 @@
<td>
<em>Computed value:</em>
</td>
- <td> one of the predefined keywords if only the keyword is specified by itself, otherwise
- a fixed frequency calculated by converting the keyword value (if any) to an absolute
- value based on the current voice-family and by applying the specified relative offset
- (if any)</td>
+ <td> one of the predefined pitch keywords if only the keyword is specified by itself,
+ otherwise an absolute frequency calculated by converting the keyword value (if any) to a
+ fixed frequency based on the current voice-family and by applying the specified relative
+ offset (if any)</td>
</tr>
</tbody>
</table>
@@ -1827,12 +1857,12 @@
conversion from a keyword to a concrete, voice-dependent frequency.</p>
</dd>
</dl>
- <p> Computed absolute frequency values that are negative are clamped to zero Hertz.
- Speech-capable user agents are likely to support a specific range of values rather than the
- full range of possible calculated numerical values for frequencies. The actual values in user
- agents may therefore be clamped to implementation-dependent minimum and maximum boundaries.
- For example: although the 0Hz frequency can be legitimately calculated, it may be clamped to a
- more meaningful value in the context of the speech synthesizer. </p>
+ <p> Computed absolute frequencies that are negative are clamped to zero Hertz. Speech-capable
+ user agents are likely to support a specific range of values rather than the full range of
+ possible calculated numerical values for frequencies. The actual values in user agents may
+ therefore be clamped to implementation-dependent minimum and maximum boundaries. For example:
+ although the 0Hz frequency can be legitimately calculated, it may be clamped to a more
+ meaningful value in the context of the speech synthesizer. </p>
<div class="example">
<p>Examples of property values:</p>
<pre>
@@ -1897,10 +1927,10 @@
<td>
<em>Computed value:</em>
</td>
- <td> one of the predefined keywords if only the keyword is specified by itself, otherwise
- a fixed frequency calculated by converting the keyword value (if any) to an absolute
- value based on the current voice-family and by applying the specified relative offset
- (if any)</td>
+ <td> one of the predefined pitch keywords if only the keyword is specified by itself,
+ otherwise an absolute frequency calculated by converting the keyword value (if any) to a
+ fixed frequency based on the current voice-family and by applying the specified relative
+ offset (if any)</td>
</tr>
</tbody>
</table>
@@ -1973,12 +2003,12 @@
conversion from a keyword to a concrete, voice-dependent frequency.</p>
</dd>
</dl>
- <p> Computed absolute frequency values that are negative are clamped to zero Hertz.
- Speech-capable user agents are likely to support a specific range of values rather than the
- full range of possible calculated numerical values for frequencies. The actual values in user
- agents may therefore be clamped to implementation-dependent minimum and maximum boundaries.
- For example: although the 0Hz frequency can be legitimately calculated, it may be clamped to a
- more meaningful value in the context of the speech synthesizer. </p>
+ <p> Computed absolute frequencies that are negative are clamped to zero Hertz. Speech-capable
+ user agents are likely to support a specific range of values rather than the full range of
+ possible calculated numerical values for frequencies. The actual values in user agents may
+ therefore be clamped to implementation-dependent minimum and maximum boundaries. For example:
+ although the 0Hz frequency can be legitimately calculated, it may be clamped to a more
+ meaningful value in the context of the speech synthesizer. </p>
<div class="example">
<p>Examples of inherited values:</p>
<pre>
Received on Monday, 1 August 2011 21:01:52 UTC