Re: [css3-speech] cue volume from Alan Gresley on 2011-08-02 (www-style@w3.org from August 2011)

From: Alan Gresley <alan@css-class.com>
Date: Tue, 02 Aug 2011 22:04:41 +1000
To: Daniel Weck <daniel.weck@gmail.com>
CC: fantasai <fantasai.lists@inkedblade.net>, www-style@w3.org
Message-ID: <4E37E7D9.2090503@css-class.com>
On 2/08/2011 6:31 PM, Daniel Weck wrote:
>
> On 2 Aug 2011, at 09:44, Alan Gresley wrote:
>> When recording, you must adjust the input level so that the sound
>> with the largest amplitude does not get distorted. To set the best
>> input level, you must sample the range of amplitude of different
>> sounds (e.g. a double bass or trumpet compared to a triangle). This
>> is what happens at a concert where they do a sound check.
>>
>> Playing back something is OK, but an author cannot know for sure
>> how the sound will be replayed. One user could have their computer
>> sound powered by a 500 watt external amplifier (stereo / surround
>> system) where other users may be using a PC amplifier of 20 watts
>> or headphones.
>>
>> Another variable that is more dangerous is the setting of the
>> volume. A user may go from listening to a YT video to listening to
>> some music on a CD or DVD and adjust the volume to a desirable
>> level. The spec would want to have UAs not deafen someone (or cause
>> hearing damage) due to this unknown variable.
>>
>> What is needed is something that plays sound at ever increasing
>> levels until a level is reached that is desirable. This would have
>> to be done over different octaves.
>
> I agree with everything you say, but I am unsure about how this
> translates into normative requirements in CSS Speech (particularly,
> user-agent conformance requirements). Any suggestion? Thanks! Daniel


I do believe it is already in the spec.


   | The desired effect of an audio cue set at +0dB is that
   | the volume level during playback of the pre-recorded /
   | pre-generated audio signal is effectively the same as
   | the loudness of live (i.e. real-time) speech synthesis
   | rendition. In order to achieve this effect, speech
   | processors are capable of directly controlling the waveform
   | amplitude of generated text-to-speech audio, user agents
   | must be able to adjust the volume output of audio cues
   | (i.e. amplify or attenuate audio signals based on the
   | intrinsic waveform amplitude of digitized sound clips),


I believe that these digitized sound clips cover the full spectrum.


   | and last but not least, authors must ensure that the
   | "normal" volume level of pre-recorded audio cues (on
   | average, as there may be discrete loudness variations
   | due to changes in the audio stream, such as intonation,
   | stress, etc.) matches that of a "typical" TTS voice
   | output (based on the ‘voice-family’ intended for use),
   | given standard listening conditions (i.e. default
   | system volume levels, centered equalization across
   | the frequency spectrum).


The part above with "equalization across the frequency spectrum" is what
I mentioned in the other message in the 'voice-family' thread, where
I say "equalization can be done for various voice pitches in a dynamic
range".


   | This latter prerequisite sets a baseline that enables
   | a user agent to align the volume outputs of both
   | TTS and cue audio streams within the same "aural box
   | model".


CSS can only set this baseline (after possible equalization across the
frequency spectrum). How a speech synthesis device uses such a baseline
is, I believe, outside the scope of CSS3 Speech.
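
As a rough illustration of what aligning a cue to that baseline could
look like, here is a minimal Python sketch. It is not taken from the
spec: the function names are hypothetical, and using RMS amplitude as a
stand-in for loudness is an assumption (perceived loudness also depends
on frequency content, which is where equalization comes in). Samples are
assumed to be floats in the range -1.0 to 1.0.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def db_to_gain(db):
    """Convert a decibel offset into a linear amplitude multiplier."""
    return 10 ** (db / 20)


def apply_cue_volume(cue, tts_reference_rms, offset_db=0.0):
    """Scale a cue so its RMS matches the TTS reference level, then
    apply a dB offset. With offset_db = 0 the cue ends up at the same
    RMS level as the reference (assumption: RMS approximates loudness).
    """
    cue_rms = rms(cue)
    if cue_rms == 0.0:
        # A silent clip has no intrinsic amplitude to scale.
        return list(cue)
    gain = (tts_reference_rms / cue_rms) * db_to_gain(offset_db)
    # Clamp to the valid sample range to avoid overflow on playback.
    return [max(-1.0, min(1.0, s * gain)) for s in cue]
```

For example, a cue with an RMS of 0.1 played against a TTS reference
RMS of 0.2 would be amplified by a factor of 2 at +0dB, and by roughly
a factor of 1 at -6dB, since -6dB is close to half the amplitude.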



-- 
Alan Gresley
http://css-3d.org/
http://css-class.com/
Received on Tuesday, 2 August 2011 12:05:16 UTC