- From: Daniel Weck via cvs-syncmail <cvsmail@w3.org>
- Date: Mon, 01 Aug 2011 21:01:50 +0000
- To: public-css-commits@w3.org
Update of /sources/public/csswg/css3-speech In directory hutz:/tmp/cvs-serv23726 Modified Files: Overview.html Overview.src.html Log Message: attempt to normatively clarify the audio cue / TTS volume level normalization. Index: Overview.html =================================================================== RCS file: /sources/public/csswg/css3-speech/Overview.html,v retrieving revision 1.86 retrieving revision 1.87 diff -u -d -r1.86 -r1.87 --- Overview.html 1 Aug 2011 17:45:43 -0000 1.86 +++ Overview.html 1 Aug 2011 21:01:48 -0000 1.87 @@ -427,7 +427,7 @@ <div class=example> <p>This example shows how authors can tell the speech synthesizer to speak HTML headings with a voice called "paul", using "moderate" emphasis - (which is more than normal) and how to insert an audio cue (prerecorded + (which is more than normal) and how to insert an audio cue (pre-recorded audio clip located at the given URL) before the start of TTS rendering for each heading. In a stereo-capable sound system, paragraphs marked with the CSS class "heidi" are rendered on the left audio channel (and @@ -554,9 +554,9 @@ </table> <p>The ‘<a href="#voice-volume"><code - class=property>voice-volume</code></a>’ property manipulates the - amplitude of the audio waveform generated by the speech synthesiser, and - is also used when calculating the relative volume level of <a + class=property>voice-volume</code></a>’ property allows authors to + control the amplitude of the audio waveform generated by the speech + synthesiser, and is also used to adjust the relative volume level of <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. @@ -1225,9 +1225,9 @@ <td> <em>Value:</em> <td><‘<a href="#pause-before"><code - class=property>pause-before</code></a>’> || <‘<a + class=property>pause-before</code></a>’> <‘<a href="#pause-after"><code - class=property>pause-after</code></a>’> + class=property>pause-after</code></a>’>? 
<tr> <td> <em>Initial:</em> @@ -1495,9 +1495,9 @@ <td> <em>Value:</em> <td><‘<a href="#rest-before"><code - class=property>rest-before</code></a>’> || <‘<a + class=property>rest-before</code></a>’> <‘<a href="#rest-after"><code - class=property>rest-after</code></a>’> + class=property>rest-after</code></a>’>? <tr> <td> <em>Initial:</em> @@ -1639,8 +1639,8 @@ <p>The ‘<a href="#cue-before"><code class=property>cue-before</code></a>’ and ‘<a href="#cue-after"><code class=property>cue-after</code></a>’ - properties specify auditory icons (i.e. prerecorded audio clips) to be - played before (or after) the selected element within the <a + properties specify auditory icons (i.e. pre-recorded / pre-generated sound + clips) to be played before (or after) the selected element within the <a href="#aural-model">audio "box" model</a>. <p class=note> Note that the functionality provided by this property is @@ -1670,15 +1670,61 @@ (decibel unit). This represents a change (positive or negative) relative to the computed value of the ‘<a href="#voice-volume"><code class=property>voice-volume</code></a>’ property within the <a - href="#aural-model">aural "box" model</a> of the selected element. When - the ‘<a href="#voice-volume"><code + href="#aural-model">aural "box" model</a> of the selected element. + Decibels express the ratio of the squares of the new signal amplitude + (a1) and the current amplitude (a0), as per the following logarithmic + equation: volume(dB) = 20 log10 (a1 / a0)</p> + + <p> When the ‘<a href="#voice-volume"><code class=property>voice-volume</code></a>’ property is set to ‘<code class=property>silent</code>’, the audio cue is also set to ‘<code class=property>silent</code>’ (regardless of - the value specified for this <decibel>). Decibels express the - ratio of the squares of the new signal amplitude (a1) and the current - amplitude (a0), as per the following logarithmic equation: volume(dB) = - 20 log10 (a1 / a0)</p> + this specified <decibel> value). 
Otherwise (when not ‘<code + class=property>silent</code>’), ‘<a + href="#voice-volume"><code class=property>voice-volume</code></a>’ + values are always specified relative to the volume level keywords, + which map to a user-configured scale of "preferred" loudness settings + (see the definition of ‘<a href="#voice-volume"><code + class=property>voice-volume</code></a>’). If the inherited + ‘<a href="#voice-volume"><code + class=property>voice-volume</code></a>’ value already contains a + decibel offset, the dB offset specific to the audio cue is combined + additively. + + <p> The desired effect of an audio cue set at +0dB is that the volume + level during playback of the pre-recorded / pre-generated audio signal + is effectively the same as the volume level of live (i.e. real-time) + speech synthesis rendition. In order to achieve this effect, speech + processors must be capable of directly controlling the waveform amplitude of + generated text-to-speech audio, user agents must be able to adjust the + volume output of audio cues (i.e. amplify or attenuate audio signals + based on the intrinsic waveform amplitude of sound clips), and last but + not least, authors must ensure that the "normal" volume level of + pre-recorded audio cues (on average, as there may be discrete variations + due to changes in the audio stream, such as intonation, stress, etc.) + matches that of a "typical" TTS voice output (based on the ‘<a + href="#voice-family"><code class=property>voice-family</code></a>’ + intended for use), given standard listening conditions (i.e. default + system volume levels, centered equalization across the frequency + spectrum). This latter prerequisite sets a baseline that enables a user + agent to align the volume outputs of both TTS and cue audio streams + within the same "aural box model".
Due to the complex relationship + between perceived audio characteristics and the processing applied to + the digitized audio signal, we will simplify the definition of "normal" + volume levels by referring to a canonical recording scenario, whereby + the attenuation is typically indicated in decibels, ranging from 0dB + (maximum audio input, near clipping threshold) to -60dB (total silence). + In this common context, a "standard" audio clip would oscillate between + these values, the loudest peak levels would be close to -3dB (to avoid + distortion), and the audible passages would have average volume levels + as high as possible (i.e. not too quiet, to avoid background noise + during amplification). This would roughly provide an audio experience + that could be seamlessly combined with text-to-speech output (i.e. there + would be no discernible difference in volume levels when switching from + pre-recorded audio to speech synthesis). Although there exists no + industry-wide standard to back up such a convention, TTS engines usually + generate comparably loud audio signals when no amplification (or + attenuation) is specified.</p> <p class=note> Note that -6.0dB is approximately half the amplitude of the audio signal, and +6.0dB is approximately twice the amplitude.</p> @@ -1906,15 +1952,16 @@ ranges may be used by the processor-dependent voice-matching algorithm). </p> - <p class=note> The interpretation of the relationship between a person's - age and a recognizable type of voice cannot realistically be defined in - a universal manner, as it effectively depends on numerous cultural and - linguistic variations. The values provided by this specification - therefore represent a simplified model that can be reasonably applied to - a great variety of speech locales, albeit at the cost of a certain - degree of approximation.
Future versions of this specification may - refine the level of precision of the voice-matching algorithm, as speech - processor implementations become more standardized.</p> + <p class=note> Note that the interpretation of the relationship between a + person's age and a recognizable type of voice cannot realistically be + defined in a universal manner, as it effectively depends on numerous + criteria (cultural, linguistic, biological, etc.). The values provided + by this specification therefore represent a simplified model that can be + reasonably applied to a broad variety of speech contexts, albeit at the + cost of a certain degree of approximation. Future versions of this + specification may refine the level of precision of the voice-matching + algorithm, as speech processor implementations become more standardized. + </p> <dt> <strong><gender></strong> @@ -2218,10 +2265,11 @@ <tr> <td> <em>Computed value:</em> - <td> one of the predefined keywords if only the keyword is specified by - itself, otherwise a fixed frequency calculated by converting the - keyword value (if any) to an absolute value based on the current - voice-family and by applying the specified relative offset (if any) + <td> one of the predefined pitch keywords if only the keyword is + specified by itself, otherwise an absolute frequency calculated by + converting the keyword value (if any) to a fixed frequency based on the + current voice-family and by applying the specified relative offset (if + any) </table> <p>The ‘<a href="#voice-pitch"><code @@ -2306,14 +2354,14 @@ the conversion from a keyword to a concrete, voice-dependent frequency.</p> </dl> - <p> Computed absolute frequency values that are negative are clamped to - zero Hertz. Speech-capable user agents are likely to support a specific - range of values rather than the full range of possible calculated - numerical values for frequencies. 
The actual values in user agents may - therefore be clamped to implementation-dependent minimum and maximum - boundaries. For example: although the 0Hz frequency can be legitimately - calculated, it may be clamped to a more meaningful value in the context of - the speech synthesizer. + <p> Computed absolute frequencies that are negative are clamped to zero + Hertz. Speech-capable user agents are likely to support a specific range + of values rather than the full range of possible calculated numerical + values for frequencies. The actual values in user agents may therefore be + clamped to implementation-dependent minimum and maximum boundaries. For + example: although the 0Hz frequency can be legitimately calculated, it may + be clamped to a more meaningful value in the context of the speech + synthesizer. <div class=example> <p>Examples of property values:</p> @@ -2377,10 +2425,11 @@ <tr> <td> <em>Computed value:</em> - <td> one of the predefined keywords if only the keyword is specified by - itself, otherwise a fixed frequency calculated by converting the - keyword value (if any) to an absolute value based on the current - voice-family and by applying the specified relative offset (if any) + <td> one of the predefined pitch keywords if only the keyword is + specified by itself, otherwise an absolute frequency calculated by + converting the keyword value (if any) to a fixed frequency based on the + current voice-family and by applying the specified relative offset (if + any) </table> <p> The ‘<a href="#voice-range"><code @@ -2465,14 +2514,14 @@ the conversion from a keyword to a concrete, voice-dependent frequency.</p> </dl> - <p> Computed absolute frequency values that are negative are clamped to - zero Hertz. Speech-capable user agents are likely to support a specific - range of values rather than the full range of possible calculated - numerical values for frequencies. 
The actual values in user agents may - therefore be clamped to implementation-dependent minimum and maximum - boundaries. For example: although the 0Hz frequency can be legitimately - calculated, it may be clamped to a more meaningful value in the context of - the speech synthesizer. + <p> Computed absolute frequencies that are negative are clamped to zero + Hertz. Speech-capable user agents are likely to support a specific range + of values rather than the full range of possible calculated numerical + values for frequencies. The actual values in user agents may therefore be + clamped to implementation-dependent minimum and maximum boundaries. For + example: although the 0Hz frequency can be legitimately calculated, it may + be clamped to a more meaningful value in the context of the speech + synthesizer. <div class=example> <p>Examples of inherited values:</p> @@ -3000,8 +3049,8 @@ <tr> <th><a class=property href="#pause">pause</a> - <td><‘pause-before’> || - <‘pause-after’> + <td><‘pause-before’> + <‘pause-after’>? <td>N/A (see individual properties) @@ -3046,8 +3095,7 @@ <tr> <th><a class=property href="#rest">rest</a> - <td><‘rest-before’> || - <‘rest-after’> + <td><‘rest-before’> <‘rest-after’>? 
<td>N/A (see individual properties) Index: Overview.src.html =================================================================== RCS file: /sources/public/csswg/css3-speech/Overview.src.html,v retrieving revision 1.87 retrieving revision 1.88 diff -u -d -r1.87 -r1.88 --- Overview.src.html 1 Aug 2011 17:45:43 -0000 1.87 +++ Overview.src.html 1 Aug 2011 21:01:48 -0000 1.88 @@ -184,7 +184,7 @@ <div class="example"> <p>This example shows how authors can tell the speech synthesizer to speak HTML headings with a voice called "paul", using "moderate" emphasis (which is more than normal) and how to - insert an audio cue (prerecorded audio clip located at the given URL) before the start of + insert an audio cue (pre-recorded audio clip located at the given URL) before the start of TTS rendering for each heading. In a stereo-capable sound system, paragraphs marked with the CSS class "heidi" are rendered on the left audio channel (and with a female voice, etc.), whilst the class "peter" corresponds to the right channel (and to a male voice, etc.). The @@ -296,9 +296,9 @@ </tr> </tbody> </table> - <p>The 'voice-volume' property manipulates the amplitude of the audio waveform generated by the - speech synthesiser, and is also used when calculating the relative volume level of <a - href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p> + <p>The 'voice-volume' property allows authors to control the amplitude of the audio waveform + generated by the speech synthesiser, and is also used to adjust the relative volume level of + <a href="#cue-props">audio cues</a> within the <a href="#aural-model">audio "box" model</a>. </p> <p class="note"> Note that the functionality provided by this property is related to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code> attribute of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]]. 
</p> @@ -871,7 +871,7 @@ <td> <em>Value:</em> </td> - <td><'pause-before'> || <'pause-after'></td> + <td><'pause-before'> <'pause-after'>?</td> </tr> <tr> <td> @@ -1096,7 +1096,7 @@ <td> <em>Value:</em> </td> - <td><'rest-before'> || <'rest-after'></td> + <td><'rest-before'> <'rest-after'>?</td> </tr> <tr> <td> @@ -1246,9 +1246,9 @@ </tr> </tbody> </table> - <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. prerecorded audio - clips) to be played before (or after) the selected element within the <a href="#aural-model" - >audio "box" model</a>.</p> + <p>The 'cue-before' and 'cue-after' properties specify auditory icons (i.e. pre-recorded / + pre-generated sound clips) to be played before (or after) the selected element within the <a + href="#aural-model">audio "box" model</a>.</p> <p class="note"> Note that the functionality provided by this property is related to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_audio"><code>audio</code> element</a> from the SSML markup language [[!SSML]]. </p> @@ -1274,11 +1274,41 @@ <p>A <a href="#number-def">number</a> immediately followed by "dB" (decibel unit). This represents a change (positive or negative) relative to the computed value of the 'voice-volume' property within the <a href="#aural-model">aural "box" model</a> of the - selected element. When the 'voice-volume' property is set to 'silent', the audio cue is - also set to 'silent' (regardless of the value specified for this <decibel>). - Decibels express the ratio of the squares of the new signal amplitude (a1) and the current - amplitude (a0), as per the following logarithmic equation: volume(dB) = 20 log10 (a1 / - a0)</p> + selected element. 
Decibels express the ratio of the squares of the new signal amplitude + (a1) and the current amplitude (a0), as per the following logarithmic equation: volume(dB) + = 20 log10 (a1 / a0) </p> + <p> When the 'voice-volume' property is set to 'silent', the audio cue is also set to + 'silent' (regardless of this specified <decibel> value). Otherwise (when not + 'silent'), 'voice-volume' values are always specified relative to the volume level + keywords, which map to a user-configured scale of "preferred" loudness settings (see the + definition of 'voice-volume'). If the inherited 'voice-volume' value already contains a + decibel offset, the dB offset specific to the audio cue is combined additively. </p><p> + The desired effect of an audio cue set at +0dB is that the volume level during playback of + the pre-recorded / pre-generated audio signal is effectively the same as the volume level + of live (i.e. real-time) speech synthesis rendition. In order to achieve this effect, + speech processors must be capable of directly controlling the waveform amplitude of generated + text-to-speech audio, user agents must be able to adjust the volume output of audio cues + (i.e. amplify or attenuate audio signals based on the intrinsic waveform amplitude of + sound clips), and last but not least, authors must ensure that the "normal" volume level + of pre-recorded audio cues (on average, as there may be discrete variations due to changes + in the audio stream, such as intonation, stress, etc.) matches that of a "typical" TTS + voice output (based on the 'voice-family' intended for use), given standard listening + conditions (i.e. default system volume levels, centered equalization across the frequency + spectrum). This latter prerequisite sets a baseline that enables a user agent to align the + volume outputs of both TTS and cue audio streams within the same "aural box model".
Due to + the complex relationship between perceived audio characteristics and the processing + applied to the digitized audio signal, we will simplify the definition of "normal" volume + levels by referring to a canonical recording scenario, whereby the attenuation is + typically indicated in decibels, ranging from 0dB (maximum audio input, near clipping + threshold) to -60dB (total silence). In this common context, a "standard" audio clip would + oscillate between these values, the loudest peak levels would be close to -3dB (to avoid + distortion), and the audible passages would have average volume levels as high as possible + (i.e. not too quiet, to avoid background noise during amplification). This would roughly + provide an audio experience that could be seamlessly combined with text-to-speech output + (i.e. there would be no discernible difference in volume levels when switching from + pre-recorded audio to speech synthesis). Although there exists no industry-wide standard + to back up such a convention, TTS engines usually generate comparably loud audio signals when + no amplification (or attenuation) is specified.</p> <p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal, and +6.0dB is approximately twice the amplitude.</p> <p class="note"> Note that there is a difference between an audio cue whose volume is set to @@ -1473,13 +1503,13 @@ match during voice selection. The mapping with [[!SSML]] ages is defined as follows: 'child' = 6 y/o, 'young' = 24 y/o, 'old' = 75 y/o (note that more flexible age ranges may be used by the processor-dependent voice-matching algorithm). </p> - <p class="note"> The interpretation of the relationship between a person's age and a - recognizable type of voice cannot realistically be defined in a universal manner, as it - effectively depends on numerous cultural and linguistic variations.
The values provided by - this specification therefore represent a simplified model that can be reasonably applied - to a great variety of speech locales, albeit at the cost of a certain degree of - approximation. Future versions of this specification may refine the level of precision of - the voice-matching algorithm, as speech processor implementations become more + <p class="note"> Note that the interpretation of the relationship between a person's age and + a recognizable type of voice cannot realistically be defined in a universal manner, as it + effectively depends on numerous criteria (cultural, linguistic, biological, etc.). The + values provided by this specification therefore represent a simplified model that can be + reasonably applied to a broad variety of speech contexts, albeit at the cost of a certain + degree of approximation. Future versions of this specification may refine the level of + precision of the voice-matching algorithm, as speech processor implementations become more standardized. </p> </dd> <dt> @@ -1752,10 +1782,10 @@ <td> <em>Computed value:</em> </td> - <td> one of the predefined keywords if only the keyword is specified by itself, otherwise - a fixed frequency calculated by converting the keyword value (if any) to an absolute - value based on the current voice-family and by applying the specified relative offset - (if any)</td> + <td> one of the predefined pitch keywords if only the keyword is specified by itself, + otherwise an absolute frequency calculated by converting the keyword value (if any) to a + fixed frequency based on the current voice-family and by applying the specified relative + offset (if any)</td> </tr> </tbody> </table> @@ -1827,12 +1857,12 @@ conversion from a keyword to a concrete, voice-dependent frequency.</p> </dd> </dl> - <p> Computed absolute frequency values that are negative are clamped to zero Hertz. 
- Speech-capable user agents are likely to support a specific range of values rather than the - full range of possible calculated numerical values for frequencies. The actual values in user - agents may therefore be clamped to implementation-dependent minimum and maximum boundaries. - For example: although the 0Hz frequency can be legitimately calculated, it may be clamped to a - more meaningful value in the context of the speech synthesizer. </p> + <p> Computed absolute frequencies that are negative are clamped to zero Hertz. Speech-capable + user agents are likely to support a specific range of values rather than the full range of + possible calculated numerical values for frequencies. The actual values in user agents may + therefore be clamped to implementation-dependent minimum and maximum boundaries. For example: + although the 0Hz frequency can be legitimately calculated, it may be clamped to a more + meaningful value in the context of the speech synthesizer. </p> <div class="example"> <p>Examples of property values:</p> <pre> @@ -1897,10 +1927,10 @@ <td> <em>Computed value:</em> </td> - <td> one of the predefined keywords if only the keyword is specified by itself, otherwise - a fixed frequency calculated by converting the keyword value (if any) to an absolute - value based on the current voice-family and by applying the specified relative offset - (if any)</td> + <td> one of the predefined pitch keywords if only the keyword is specified by itself, + otherwise an absolute frequency calculated by converting the keyword value (if any) to a + fixed frequency based on the current voice-family and by applying the specified relative + offset (if any)</td> </tr> </tbody> </table> @@ -1973,12 +2003,12 @@ conversion from a keyword to a concrete, voice-dependent frequency.</p> </dd> </dl> - <p> Computed absolute frequency values that are negative are clamped to zero Hertz. 
- Speech-capable user agents are likely to support a specific range of values rather than the - full range of possible calculated numerical values for frequencies. The actual values in user - agents may therefore be clamped to implementation-dependent minimum and maximum boundaries. - For example: although the 0Hz frequency can be legitimately calculated, it may be clamped to a - more meaningful value in the context of the speech synthesizer. </p> + <p> Computed absolute frequencies that are negative are clamped to zero Hertz. Speech-capable + user agents are likely to support a specific range of values rather than the full range of + possible calculated numerical values for frequencies. The actual values in user agents may + therefore be clamped to implementation-dependent minimum and maximum boundaries. For example: + although the 0Hz frequency can be legitimately calculated, it may be clamped to a more + meaningful value in the context of the speech synthesizer. </p> <div class="example"> <p>Examples of inherited values:</p> <pre>
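As a side note for readers of this change: the decibel relation the commit moves into its own paragraph, volume(dB) = 20 log10 (a1 / a0), together with the rule that a cue-specific dB offset is combined additively with an inherited 'voice-volume' offset, can be sketched in Python. The helper names below are illustrative only, not part of the spec.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Invert volume(dB) = 20 * log10(a1 / a0) to recover the ratio a1/a0."""
    return 10.0 ** (db / 20.0)

def combine_offsets_db(inherited_db: float, cue_db: float) -> float:
    """Per the clarified text, the dB offset specific to an audio cue is
    combined additively with any inherited 'voice-volume' dB offset."""
    return inherited_db + cue_db

# The spec's note: -6.0dB is approximately half the amplitude,
# and +6.0dB approximately twice the amplitude.
print(round(db_to_amplitude_ratio(-6.0), 3))  # 0.501
print(round(db_to_amplitude_ratio(+6.0), 3))  # 1.995
```

Adding offsets in decibels corresponds to multiplying amplitude ratios, which is why the additive combination rule composes cleanly with the logarithmic definition.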
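The clamping behaviour reworded in the 'voice-pitch' / 'voice-range' hunks (negative computed frequencies are clamped to zero Hertz, and user agents may further clamp to implementation-dependent minimum and maximum boundaries) could be sketched as follows. The 60-400 Hz bounds here are hypothetical placeholders, since the spec deliberately leaves the range implementation-dependent.

```python
def clamp_frequency(computed_hz: float,
                    min_hz: float = 60.0,
                    max_hz: float = 400.0) -> float:
    """Clamp a computed 'voice-pitch'/'voice-range' frequency.

    Negative computed values are clamped to zero Hertz first; a speech
    synthesizer may then clamp the result to its own supported range
    (the default bounds here are made up for illustration).
    """
    hz = max(0.0, computed_hz)
    return min(max(hz, min_hz), max_hz)
```

This mirrors the spec's example: 0Hz can be legitimately calculated, but a user agent would clamp it to a value that is meaningful for its synthesizer.
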
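The grammar change from "<'pause-before'> || <'pause-after'>" to "<'pause-before'> <'pause-after'>?" (and likewise for 'rest') makes the first value mandatory and the second optional. A minimal sketch of how a UA might expand the shorthand, assuming the usual CSS convention that a single value applies to both longhands (the function name and dict representation are illustrative, not from the spec):

```python
def expand_pause_shorthand(value: str) -> dict:
    """Expand the revised 'pause' shorthand grammar:
    <'pause-before'> <'pause-after'>?

    If the optional second value is omitted, the first value is assumed
    to apply to both longhand properties (hypothetical expansion rule,
    following common CSS shorthand behaviour).
    """
    parts = value.split()
    if not 1 <= len(parts) <= 2:
        raise ValueError("expected one or two component values")
    before = parts[0]
    after = parts[1] if len(parts) == 2 else before
    return {"pause-before": before, "pause-after": after}
```

Note what the change forbids: under the old "||" combinator the two components could appear in either order, whereas juxtaposition with "?" fixes the order and drops the standalone-'pause-after' form.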
Received on Monday, 1 August 2011 21:01:52 UTC