- From: Alan Gresley <alan@css-class.com>
- Date: Tue, 02 Aug 2011 22:04:41 +1000
- To: Daniel Weck <daniel.weck@gmail.com>
- CC: fantasai <fantasai.lists@inkedblade.net>, www-style@w3.org
On 2/08/2011 6:31 PM, Daniel Weck wrote:
>
> On 2 Aug 2011, at 09:44, Alan Gresley wrote:
>> When recording, you must adjust the input level so that the sound
>> with the largest amplitude does not get distorted. To set the best
>> input level, you must sample the range of amplitude of different
>> sounds (e.g. a double bass or trumpet compared to a triangle). This
>> is what happens at a concert where they do a sound check.
>>
>> Playing back something is OK, but an author cannot know for sure
>> how the sound will be replayed. One user could have their computer
>> sound powered by a 500 watt external amplifier (stereo / surround
>> system) where other users may be using a PC amplifier of 20 watts
>> or headphones.
>>
>> Another variable that is more dangerous is the setting of the
>> volume. A user may go from listening to a YT video to listening to
>> some music on a CD or DVD and adjust the volume to a desirable
>> level. The spec would want to have UAs not deafen someone (or cause
>> hearing damage) due to this unknown variable.
>>
>> What is needed is something that plays sound at ever increasing
>> levels until a level is reached that is desirable. This would have
>> to be done over different octaves.
>
> I agree with everything you say, but I am unsure about how this
> translates into normative requirements in CSS Speech (particularly,
> user-agent conformance requirements). Any suggestion? Thanks! Daniel

I do believe it is already in the spec.

| The desired effect of an audio cue set at +0dB is that
| the volume level during playback of the pre-recorded /
| pre-generated audio signal is effectively the same as
| the loudness of live (i.e. real-time) speech synthesis
| rendition. In order to achieve this effect, speech
| processors are capable of directly controlling the waveform
| amplitude of generated text-to-speech audio, user agents
| must be able to adjust the volume output of audio cues
| (i.e. amplify or attenuate audio signals based on the
| intrinsic waveform amplitude of digitized sound clips),

I believe that these digitized sound clips cover the full spectrum.

| and last but not least, authors must ensure that the
| "normal" volume level of pre-recorded audio cues (on
| average, as there may be discrete loudness variations
| due to changes in the audio stream, such as intonation,
| stress, etc.) matches that of a "typical" TTS voice
| output (based on the ‘voice-family’ intended for use),
| given standard listening conditions (i.e. default
| system volume levels, centered equalization across
| the frequency spectrum).

The part above with "equalization across the frequency spectrum" is
what I mentioned in the other message in the 'voice-family' thread,
where I say "equalization can be done for various voice pitches in a
dynamic range".

| This latter prerequisite sets a baseline that enables
| a user agent to align the volume outputs of both
| TTS and cue audio streams within the same "aural box
| model".

CSS can only set this baseline (after possible equalization across the
frequency spectrum). How a speech synthesis device uses such a baseline
is, I believe, out of the scope of CSS3 Speech.

-- 
Alan Gresley
http://css-3d.org/
http://css-class.com/
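
As a rough illustration of the "+0dB" baseline quoted above, the sketch below scales a cue so its average level matches a reference TTS level, then applies a decibel offset. It makes several assumptions that are not in the spec text: RMS is used as a stand-in for "loudness", samples are plain floats, and the names `db_to_gain`, `rms` and `align_cue_to_voice` are hypothetical. The only fixed fact used is the standard amplitude relation gain = 10^(dB/20).

```python
import math


def db_to_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude multiplier."""
    return 10 ** (db / 20)


def rms(samples):
    """Root-mean-square level, used here as a crude proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def align_cue_to_voice(cue, voice_rms, offset_db=0.0):
    """Scale `cue` so its RMS equals `voice_rms`, then apply `offset_db`.

    With offset_db == 0.0 the cue ends up at the same average level as
    the reference voice, which is the baseline the quoted text describes.
    """
    cue_rms = rms(cue)
    if cue_rms == 0:
        # A silent clip cannot be amplified to a target level.
        return list(cue)
    gain = (voice_rms / cue_rms) * db_to_gain(offset_db)
    return [s * gain for s in cue]
```

For example, `align_cue_to_voice(cue, voice_rms, -6.0)` would leave the cue at roughly half the amplitude of the reference voice. What the output device then does with that signal (amplifier power, system volume) stays outside anything this calculation can control.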
Received on Tuesday, 2 August 2011 12:05:16 UTC