Re: [css3-speech] cue volume

On 2/08/2011 6:31 PM, Daniel Weck wrote:
> On 2 Aug 2011, at 09:44, Alan Gresley wrote:
>> When recording, you must adjust the input level so that the sound with
>> the largest amplitude does not get distorted. To set the best input
>> level, you must sample the range of amplitudes of different sounds
>> (e.g. a double bass or trumpet compared to a triangle). This is what
>> happens at a concert where they do a sound check.
>> Playing back something is OK, but an author cannot know for sure
>> how the sound will be replayed. One user could have their computer
>> sound powered by a 500 watt external amplifier (stereo / surround
>> system) where other users may be using a PC amplifier of 20 watts
>> or headphones.
>> Another variable that is more dangerous is the setting of the
>> volume. A user may go from listening to a YT video to listening to
>> some music on a CD or DVD and adjust the volume to a desirable
>> level. The spec would want to have UAs not deafen someone (or cause
>> hearing damage) due to this unknown variable.
>> What is needed is something that plays sound at ever increasing
>> levels until a level is reached that is desirable. This would have to
>> be done over different octaves.
> I agree with everything you say, but I am unsure about how this
> translates into normative requirements in CSS Speech (particularly,
> user-agent conformance requirements). Any suggestion? Thanks! Daniel

I do believe it is already in the spec.

   | The desired effect of an audio cue set at +0dB is that
   | the volume level during playback of the pre-recorded /
   | pre-generated audio signal is effectively the same as
   | the loudness of live (i.e. real-time) speech synthesis
   | rendition. In order to achieve this effect, speech
   | processors are capable of directly controlling the waveform
   | amplitude of generated text-to-speech audio, user agents
   | must be able to adjust the volume output of audio cues
   | (i.e. amplify or attenuate audio signals based on the
   | intrinsic waveform amplitude of digitized sound clips),

I believe that these digitized sound clips cover the full frequency spectrum.

   | and last but not least, authors must ensure that the
   | "normal" volume level of pre-recorded audio cues (on
   | average, as there may be discrete loudness variations
   | due to changes in the audio stream, such as intonation,
   | stress, etc.) matches that of a "typical" TTS voice
   | output (based on the ‘voice-family’ intended for use),
   | given standard listening conditions (i.e. default
   | system volume levels, centered equalization across
   | the frequency spectrum).

The part above with "equalization across the frequency spectrum" is what 
I have mentioned in the other message in the 'voice-family' thread where 
I say "equalization can be done for various voice pitches in a dynamic 

   | This latter prerequisite sets a baseline that enables
   | a user agent to align the volume outputs of both
   | TTS and cue audio streams within the same "aural box
   | model".

CSS can only set this baseline (after possible equalization across the
frequency spectrum). How a speech synthesis device uses such a baseline
is, I believe, outside the scope of CSS3 Speech.
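
As a rough illustration of what aligning a cue to such a baseline could
mean in practice, here is a minimal Python sketch. It is not taken from the
spec: the function names (db_to_gain, rms, align_cue), the use of RMS as
the loudness measure, and the sample range of -1.0 to 1.0 are all
assumptions made for the example. It scales a cue so that its average level
matches a given TTS level, then applies the author's dB offset (+0dB leaves
the matched level unchanged).

```python
import math

def db_to_gain(db):
    """Convert a decibel offset to a linear amplitude factor (assumed 20*log10 convention)."""
    return 10.0 ** (db / 20.0)

def rms(samples):
    """Root-mean-square level of a list of samples; used here as a crude loudness proxy."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def align_cue(cue, tts_rms, offset_db=0.0):
    """Scale cue samples so their RMS matches tts_rms, then apply offset_db.

    Hypothetical helper: samples are assumed to be floats in [-1.0, 1.0],
    and the result is hard-clipped to that range.
    """
    cue_rms = rms(cue)
    if cue_rms == 0.0:
        # A silent cue cannot be matched to any level; return it unchanged.
        return list(cue)
    gain = (tts_rms / cue_rms) * db_to_gain(offset_db)
    return [max(-1.0, min(1.0, s * gain)) for s in cue]

# A quiet cue brought up to an assumed TTS level of 0.2 RMS at +0dB.
cue = [0.1, -0.1, 0.1, -0.1]
aligned = align_cue(cue, tts_rms=0.2)
```

A real user agent would presumably use a perceptual loudness measure rather
than plain RMS, and would still have no control over the listener's
amplifier or system volume, which is the unknown discussed above.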

Alan Gresley

Received on Tuesday, 2 August 2011 12:05:16 UTC