- From: Daniel Weck <daniel.weck@gmail.com>
- Date: Tue, 2 Aug 2011 19:06:34 +0200
- To: Alan Gresley <alan@css-class.com>
- Cc: fantasai <fantasai.lists@inkedblade.net>, www-style@w3.org
Many thanks for your input Alan.
Regards, Daniel

On 2 Aug 2011, at 14:04, Alan Gresley wrote:

> On 2/08/2011 6:31 PM, Daniel Weck wrote:
>>
>> On 2 Aug 2011, at 09:44, Alan Gresley wrote:
>>> When recording, you must adjust the input level so that the sound
>>> with the largest amplitude does not get distorted. To set the best
>>> input level, you must sample the range of amplitudes of different
>>> sounds (e.g. a double bass or trumpet compared to a triangle). This
>>> is what happens at a concert when they do a sound check.
>>>
>>> Playing back something is OK, but an author cannot know for sure
>>> how the sound will be replayed. One user could have their computer
>>> sound powered by a 500-watt external amplifier (stereo / surround
>>> system) while other users may be using a 20-watt PC amplifier or
>>> headphones.
>>>
>>> Another variable that is more dangerous is the setting of the
>>> volume. A user may go from listening to a YT video to listening to
>>> some music on a CD or DVD and adjust the volume to a desirable
>>> level. The spec would want UAs not to deafen someone (or cause
>>> hearing damage) due to this unknown variable.
>>>
>>> What is needed is something that plays sound at ever-increasing
>>> levels until a desirable level is reached. This would have to be
>>> done over different octaves.
>>
>> I agree with everything you say, but I am unsure about how this
>> translates into normative requirements in CSS Speech (particularly,
>> user-agent conformance requirements). Any suggestion? Thanks! Daniel
>
>
> I do believe it is already in the spec.
>
>
> | The desired effect of an audio cue set at +0dB is that
> | the volume level during playback of the pre-recorded /
> | pre-generated audio signal is effectively the same as
> | the loudness of live (i.e. real-time) speech synthesis
> | rendition. In order to achieve this effect, speech
> | processors are capable of directly controlling the waveform
> | amplitude of generated text-to-speech audio, user agents
> | must be able to adjust the volume output of audio cues
> | (i.e. amplify or attenuate audio signals based on the
> | intrinsic waveform amplitude of digitized sound clips),
>
>
> I would believe that these digitized sound clips cover the full
> spectrum.
>
>
> | and last but not least, authors must ensure that the
> | "normal" volume level of pre-recorded audio cues (on
> | average, as there may be discrete loudness variations
> | due to changes in the audio stream, such as intonation,
> | stress, etc.) matches that of a "typical" TTS voice
> | output (based on the ‘voice-family’ intended for use),
> | given standard listening conditions (i.e. default
> | system volume levels, centered equalization across
> | the frequency spectrum).
>
>
> The part above about "equalization across the frequency spectrum" is
> what I mentioned in the other message, in the 'voice-family' thread,
> where I said "equalization can be done for various voice pitches in a
> dynamic range".
>
>
> | This latter prerequisite sets a baseline that enables
> | a user agent to align the volume outputs of both
> | TTS and cue audio streams within the same "aural box
> | model".
>
>
> CSS can only set this baseline (after possible equalization across
> the frequency spectrum). How a speech synthesis device uses such a
> baseline is, I believe, out of the scope of CSS3 Speech.
>
>
>
> --
> Alan Gresley
> http://css-3d.org/
> http://css-class.com/
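To make the +0dB baseline discussed above concrete, here is a minimal
sketch using the cue syntax from the CSS Speech draft. The selector,
file name, and voice name are illustrative placeholders, not taken from
the thread:

    /* A cue played at +0dB should sound as loud as the synthesized
       speech itself; the decibel offset is relative to that baseline. */
    h1 {
      voice-family: female;               /* placeholder voice */
      voice-volume: medium;               /* baseline TTS loudness */
      cue-before: url("bell.wav") -3dB;   /* attenuate clip below the TTS baseline */
      cue-after: url("bell.wav");         /* no offset: implied +0dB, matches TTS loudness */
    }

The offset only works as intended if the author has pre-normalized the
clip's intrinsic loudness to that of the typical TTS output, which is
the prerequisite the quoted spec text describes.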
Received on Tuesday, 2 August 2011 17:07:11 UTC