- From: Daniel Weck <daniel.weck@gmail.com>
- Date: Mon, 12 Sep 2011 23:14:55 +0100
- To: W3C style mailing list <www-style@w3.org>, paul.bagshaw@orange-ftgroup.com
- Cc: w3c-voice-wg@w3.org
Dear Paul,

the 'cue' properties [1] of CSS3 Speech share with SSML's 'audio' element [2] the ability to play external (pre-recorded) audio clips, but the comparison ends there. The SSML feature set is richer (e.g. data prefetch, clipping, repeat, rate), and is perhaps conceptually closer to HTML5's 'audio' element than to CSS Speech's 'cue' functionality. The informative note in the CSS3 Speech specification should therefore perhaps be improved, to avoid misleading the reader.

CSS3 Speech provides a simple mechanism for short auditory cues that merely complement the speech-focused information stream. So for example, when an H1 heading gets its CSS voice-volume set to 'silent', the associated pre-recorded sounds (leading and/or trailing) should quite naturally become silent as well. Technically, this behaviour is dictated by the "aural box model" [3], which is designed by analogy with the visual box model (padding, border, margin). Within this conceptual "space" surrounding each selected element (in the CSS sense), sound/volume level is akin to opacity/visibility, in that a change affects the "box" as a whole.

In order to deal with the possible (and likely) discrepancies between the sound levels generated by TTS engines and the waveform amplitude of encoded audio clips, the CSS3 Speech specification relies on the user agent's ability to set some values based on user preferences (a principle which allows keywords such as 'soft' to be mapped to concrete, useful values in terms of the listening context [4]). Furthermore, the <decibel> field of auditory 'cues' (see [1]) describes a canonical (if somewhat empirical) method to author TTS/cue combinations that play predictably when volume variations are applied. As Alan Gresley pointed out in his reply (thank you, by the way), standardisation in the field of TTS engines has yet to happen, so the lack of harmonisation prevents us from using stricter conformance requirements.
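
To make the relationship between 'voice-volume' and the <decibel> field of a cue more concrete, here is a minimal sketch in Python. It is not taken from the specification: the helper names (db_to_amplitude_ratio, cue_gain) are hypothetical, and it assumes that the computed 'voice-volume' can be expressed as a decibel value relative to some reference level chosen by the user agent, with None standing in for 'silent'. The only conversion used is the standard one from decibels to a linear amplitude ratio.

```python
from typing import Optional


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel change into a linear amplitude ratio.

    Uses the usual amplitude convention: dB = 20 * log10(a1 / a0).
    """
    return 10 ** (db / 20.0)


def cue_gain(voice_volume_db: Optional[float], cue_offset_db: float = 0.0) -> float:
    """Hypothetical helper: linear gain to apply to a cue's audio clip.

    voice_volume_db -- assumed representation of the computed 'voice-volume',
                       in dB relative to a user-agent reference level;
                       None models the 'silent' keyword.
    cue_offset_db   -- the cue's <decibel> value, a change relative to the
                       computed 'voice-volume'.
    """
    if voice_volume_db is None:
        # A silent element silences its whole aural "box", cues included.
        return 0.0
    return db_to_amplitude_ratio(voice_volume_db + cue_offset_db)


if __name__ == "__main__":
    print(cue_gain(None, 6.0))   # silent heading: the cue is muted too
    print(cue_gain(0.0, 0.0))    # cue played at the reference level
    print(cue_gain(-6.0, 6.0))   # a +6dB cue offsets a -6dB voice-volume
```

In this sketch a cue never escapes the volume of the element it belongs to: lowering 'voice-volume' lowers the cue by the same amount, and the <decibel> offset only adjusts the balance between the two.
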
Implementations of the CSS Speech specification will therefore expose control mechanisms for users to "equalise" the volume output of TTS and non-TTS audio streams, in the same way that sound level keywords are mapped to real-world values that meet the listener's needs.

I hope this clarifies the matter. Let me know if this addresses the issue you raised.

Kind regards, Daniel

[1] http://www.w3.org/TR/css3-speech/#cue-props
[2] http://www.w3.org/TR/speech-synthesis11/#edef_audio
[3] http://www.w3.org/TR/css3-speech/#aural-model
[4] http://www.w3.org/TR/css3-speech/#voice-volume

On 18 Aug 2011, at 10:44, <paul.bagshaw@orange-ftgroup.com> wrote:

> Bert,
>
> In response to your recent call for comments on the CSS Speech Module, I have made a personal review of the spec. Please note that my comments have not been seen or discussed by the Voice Browser WG, and as such may not represent the opinion of the group.
>
> 1. Interaction between the 'voice-volume' and 'cue' properties.
>
> Please note that in SSML 1.1 the attributes of the <ssml:prosody> element affect the rendering "of the contained text"; they do not have an effect on child <audio> elements. Note therefore that the 'volume' attribute of the <ssml:prosody> element and the 'soundLevel' attribute of the <ssml:audio> element are intentionally independent. This enables the perceived loudness of speech synthesised from text to be balanced with that of speech in pre-recorded audio cues.
>
> The CSS-Speech module states that 'voice-volume' is related to <ssml:prosody>'s 'volume' attribute, and that the 'cue' properties are related to <ssml:audio> (inferring its 'soundLevel' attribute). It also states that the <decibel> value of the 'cue' properties "represents a change (positive or negative) relative to the computed value of the ‘voice-volume’ property".
>
> Authors often have no control over the volume level of the source (initial waveform) of pre-recorded audio cues, and never have control over the source of speech synthesis waveforms whose loudness differs between speech engines and voices. However, the CSS-Speech module makes the impractical suggestion that authors control the volume level of audio cue waveforms in order to balance them with speech rendered from text.
>
> I suggest that the CSS-Speech module follows the SSML 1.1 paradigm and that the 'voice-volume' and 'cue' properties should not interact.
>
> With regards,
> Paul Bagshaw
> Co-author of SSML 1.1 and PLS 1.0.
>
> -----Original Message-----
> From: w3c-voice-wg-request@w3.org [mailto:w3c-voice-wg-request@w3.org] On Behalf Of Bert Bos
> Sent: Sunday, August 14, 2011 12:32 AM
> To: w3c-wai-pf@w3.org; w3c-voice-wg@w3.org; member-xg-htmlspeech@w3.org; wai-xtech@w3.org
> Cc: chairs@w3.org
> Subject: Heads-up: CSS WG plans last call for css3-speech
>
> Hello chairs,
>
> The CSS WG decided to issue a last call for the CSS Speech Module. We're planning to publish next week, with a deadline for comments of 30 September, i.e., about 6 weeks.
>
> Please, let us know if that deadline is too soon.
>
> We'd especially like to hear from
>
> - WAI PF and/or HTML Accessibility TF
> - Voice Browser WG
> - HTML Speech XG
>
> The latest editor's draft is here:
>
> http://dev.w3.org/csswg/css3-speech/
>
> (The content is what will be published, after reformatting for Working Draft.)
>
> The CSS Speech module contains properties to style the rendering of documents via a speech synthesizer: voice, volume, speed, pitch, pauses, etc. It is designed to be compatible with SSML, i.e., the rendering of the document could be in the form of an SSML stream.
>
> For the CSS WG,
>
> Bert
> --
> Bert Bos                           ( W 3 C ) http://www.w3.org/
> http://www.w3.org/people/bos       W3C/ERCIM
> bert@w3.org                        2004 Rt des Lucioles / BP 93
> +33 (0)4 92 38 76 92               06902 Sophia Antipolis Cedex, France
Received on Monday, 12 September 2011 22:15:32 UTC