- From: Daniel Weck <daniel.weck@gmail.com>
- Date: Tue, 2 Aug 2011 19:06:34 +0200
- To: Alan Gresley <alan@css-class.com>
- Cc: fantasai <fantasai.lists@inkedblade.net>, www-style@w3.org
Many thanks for your input Alan.
Regards, Daniel

On 2 Aug 2011, at 14:04, Alan Gresley wrote:

> On 2/08/2011 6:31 PM, Daniel Weck wrote:
>>
>> On 2 Aug 2011, at 09:44, Alan Gresley wrote:
>>> When recording, you must adjust the input level so that the sound
>>> with the largest amplitude does not get distorted. To set the best
>>> input level, you must sample the range of amplitudes of different
>>> sounds (e.g. a double bass or trumpet compared to a triangle). This
>>> is what happens at a concert when they do a sound check.
>>>
>>> Playing back something is OK, but an author cannot know for sure
>>> how the sound will be replayed. One user could have their computer
>>> sound powered by a 500-watt external amplifier (stereo / surround
>>> system) while other users may be using a 20-watt PC amplifier or
>>> headphones.
>>>
>>> Another variable that is more dangerous is the setting of the
>>> volume. A user may go from listening to a YT video to listening to
>>> some music on a CD or DVD and adjust the volume to a desirable
>>> level. The spec would want UAs not to deafen someone (or cause
>>> hearing damage) due to this unknown variable.
>>>
>>> What is needed is something that plays sound at ever-increasing
>>> levels until a desirable level is reached. This would have to be
>>> done over different octaves.
>>
>> I agree with everything you say, but I am unsure about how this
>> translates into normative requirements in CSS Speech (particularly,
>> user-agent conformance requirements). Any suggestion? Thanks! Daniel
>
>
> I do believe it is already in the spec.
>
>
> | The desired effect of an audio cue set at +0dB is that
> | the volume level during playback of the pre-recorded /
> | pre-generated audio signal is effectively the same as
> | the loudness of live (i.e. real-time) speech synthesis
> | rendition. In order to achieve this effect, speech
> | processors are capable of directly controlling the waveform
> | amplitude of generated text-to-speech audio, user agents
> | must be able to adjust the volume output of audio cues
> | (i.e. amplify or attenuate audio signals based on the
> | intrinsic waveform amplitude of digitized sound clips),
>
>
> I would believe that these digitized sound clips cover the full
> spectrum.
>
>
> | and last but not least, authors must ensure that the
> | "normal" volume level of pre-recorded audio cues (on
> | average, as there may be discrete loudness variations
> | due to changes in the audio stream, such as intonation,
> | stress, etc.) matches that of a "typical" TTS voice
> | output (based on the ‘voice-family’ intended for use),
> | given standard listening conditions (i.e. default
> | system volume levels, centered equalization across
> | the frequency spectrum).
>
>
> The part above about "equalization across the frequency spectrum" is
> what I mentioned in the other message, in the 'voice-family' thread,
> where I said "equalization can be done for various voice pitches in a
> dynamic range".
>
>
> | This latter prerequisite sets a baseline that enables
> | a user agent to align the volume outputs of both
> | TTS and cue audio streams within the same "aural box
> | model".
>
>
> CSS can only set this baseline (after possible equalization across
> the frequency spectrum). How a speech synthesis device uses such a
> baseline is, I believe, out of the scope of CSS3 Speech.
>
>
>
> --
> Alan Gresley
> http://css-3d.org/
> http://css-class.com/
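To make the +0dB baseline discussed above concrete, here is a minimal
sketch using the cue syntax from the CSS Speech draft. The selector,
file name, and voice name are illustrative placeholders, not taken from
the thread:

    /* A cue played at +0dB should sound as loud as the synthesized
       speech itself; the decibel offset is relative to that baseline. */
    h1 {
      voice-family: female;               /* placeholder voice */
      voice-volume: medium;               /* baseline TTS loudness */
      cue-before: url("bell.wav") -3dB;   /* attenuate clip below the TTS baseline */
      cue-after: url("bell.wav");         /* no offset: implied +0dB, matches TTS loudness */
    }

The offset only works as intended if the author has pre-normalized the
clip's intrinsic loudness to that of the typical TTS output, which is
the prerequisite the quoted spec text describes.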
Received on Tuesday, 2 August 2011 17:07:11 UTC