Re: [css3-speech] Heads-up: CSS WG plans last call for css3-speech

From: Daniel Weck <daniel.weck@gmail.com>
Date: Wed, 23 Nov 2011 16:53:19 +0000
Cc: W3C style mailing list <www-style@w3.org>, w3c-voice-wg@w3.org
Message-Id: <C0526AFF-FF31-450B-967B-B9403D5A5ED6@gmail.com>
To: <paul.bagshaw@orange.com>
Hello,
could you please let us know your position regarding this issue? It seems that we agree in principle on clarifying the informative notes within the specification, but without your feedback I am unable to determine whether you are actually objecting to the normative prose.
Many thanks!
Kind regards, Daniel

On 16 Oct 2011, at 22:09, Daniel Weck wrote:

> Hi Paul (et al), would you please be able to take a look at my response (quoted below), as I am hoping to discuss pending issues with the CSS Working Group on this week's conference call. Many thanks! Kind regards, Dan
> 
> On 29 Sep 2011, at 10:40, Daniel Weck wrote:
> 
>> Hi Paul,
>> in your reply (quoted below), point #2 is indeed related to the issue you originally raised, entitled: "Interaction between the 'voice-volume' and 'cue' properties." Thank you for the suggestion. I take it that this particular statement is deemed inappropriate (am I right?):
>> 
>> "Although there exists no industry-wide standard to support such convention, TTS engines usually generate comparably-loud audio signals when no gain or attenuation is specified. For voice and soft music, -15dB RMS seems to be pretty standard."
>> 
>> See:
>> http://www.w3.org/TR/css3-speech/#cue-props
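A minimal sketch of what a figure such as "-15dB RMS" measures, assuming it is expressed relative to digital full scale (samples normalised to the range -1.0 to 1.0). The function names are hypothetical and are not taken from the specification.

```python
import math
from typing import Sequence


def rms_level_db(samples: Sequence[float]) -> float:
    """Return the RMS level of normalised samples in dB relative to full scale."""
    if not samples:
        raise ValueError("at least one sample is required")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence has no finite level
    # 10 * log10(mean square) is the same as 20 * log10(RMS amplitude).
    return 10.0 * math.log10(mean_square)


def gain_to_target_db(samples: Sequence[float], target_db: float = -15.0) -> float:
    """Return the gain in dB that would bring the clip to the target RMS level."""
    return target_db - rms_level_db(samples)
```

Under these assumptions, a clip that measures about -6 dB RMS would need roughly 9 dB of attenuation to sit at -15 dB RMS.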
>> 
>> 
>> 
>> However, point #1 seems to address a different issue, namely the fact that in CSS-Speech, "voice sound level" keywords can be combined with relative decibel offsets (which is a side-effect of how keyword values get inherited and effectively resolved/computed in the CSS property model). As this is a different issue, I would prefer to file it separately. And yes, we could explicitly specify a function mapping with SSML, by using nested 'prosody' elements. Could you please raise a separate issue?
>> 
>> See:
>> http://www.w3.org/TR/css3-speech/#voice-volume
>> http://www.w3.org/TR/speech-synthesis11/#edef_prosody
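One possible shape for such a mapping, sketched under assumptions: the resolved 'voice-volume' is treated as a keyword plus an optional decibel offset, the keyword is emitted on an outer 'prosody' element, and the offset is emitted on a nested one as a relative change. The function name and the choice of nesting order are hypothetical; this is not a mapping defined by either specification.

```python
from xml.sax.saxutils import escape


def voice_volume_to_ssml(text: str, keyword: str, offset_db: float = 0.0) -> str:
    """Wrap text in SSML prosody elements for a keyword plus a dB offset."""
    inner = escape(text)
    if keyword == "silent":
        # A silent voice has no meaningful offset to carry over.
        return f'<prosody volume="silent">{inner}</prosody>'
    if offset_db != 0.0:
        # Inner element: relative change, assumed to apply on top of the
        # level set by the enclosing keyword.
        inner = f'<prosody volume="{offset_db:+g}dB">{inner}</prosody>'
    # Outer element: the keyword level.
    return f'<prosody volume="{keyword}">{inner}</prosody>'
```

For example, voice_volume_to_ssml("Hello", "soft", 6.0) yields a 'soft' prosody element containing a "+6dB" prosody element around the text.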
>> 
>> Thanks!
>> Regards, Daniel
>> 
>> 
>> On 29 Sep 2011, at 08:45, <paul.bagshaw@orange.com> wrote:
>> 
>>> Hi,
>>> 
>>> Yes, you do need to improve the related informative note. To resolve this issue:
>>> 
>>> 1. you at least need to demonstrate to the reader that the value of the ssml:prosody volume attribute is equivalent to a function (please define it) of CSS-Speech properties. Pay particular attention to keyword values, since they will lead you to a messy solution. And if you really wish to claim "The feature set exposed by this specification is designed to match the model described by the Speech Synthesis Markup Language (SSML) Version 1.1", then you can go that one step further and make the function a one-to-one mapping.
>>> 
>>> 2. you should remove any pretentious illusion that speech synthesis vendors will one day conform to some futuristic sound-level standard and consequently modify all their existing voices. It's not going to happen and it is inappropriate to propose a specification based on such a hypothesis.
>>> 
>>> With regards,
>>> 
>>> -- Paul
>>> 
>>> -----Original Message-----
>>> From: Daniel Weck [mailto:daniel.weck@gmail.com] 
>>> Sent: Tuesday, September 13, 2011 12:15 AM
>>> To: W3C style mailing list; BAGSHAW Paul RD-TECH-REN
>>> Cc: w3c-voice-wg@w3.org
>>> Subject: Re: [css3-speech] Heads-up: CSS WG plans last call for css3-speech
>>> 
>>> Dear Paul,
>>> the 'cue' properties [1] of CSS3 Speech have in common with SSML's 'audio' element [2] the ability to play external (pre-recorded) audio clips, but the comparison ends here. The SSML feature-set is richer (e.g. data prefetch, clipping, repeat, rate), perhaps conceptually closer to HTML5's 'audio' element than to CSS Speech's 'cue' functionality. The informative note in the CSS3 Speech specification should perhaps consequently be improved, to prevent misleading the reader.
>>> 
>>> CSS3 Speech provides a simple mechanism for short auditory cues that merely complement the speech-focused information stream. So for example, when an H1 heading gets its CSS voice-volume set to 'silent', the associated pre-recorded sounds (leading and/or trailing) should quite naturally become silent as well. Technically, this behaviour is dictated by the "aural box model" [3], which is designed by analogy with the visual box model (padding, border, margin). Within this conceptual "space" surrounding each selected element (in the CSS sense), sound/volume level is akin to opacity/visibility, in that a change affects the "box" as a whole.
>>> 
>>> In order to deal with the possible (and likely) discrepancies between the sound levels generated by TTS engines and the waveform amplitude of encoded audio clips, the CSS3 Speech specification relies on the user agent's ability to set some values based on user preferences (a principle which allows keywords such as 'soft' to be mapped to concrete, useful values in terms of the listening context [4]). Furthermore, the <decibel> field of auditory 'cues' (see [1]) describes a canonical (if somewhat empirical) method to author TTS/cues combinations that play predictably when volume variations are applied. As Alan Gresley pointed out in his reply (thank you, by the way), standardisation in the field of TTS engines has yet to happen, so the lack of harmonisation prevents us from using stricter conformance requirements. Implementations of the CSS Speech specification will therefore expose control mechanisms for users to "equalise" the volume output of TTS and non-TTS audio streams, in the same way that sound level keywords are mapped to real-world values that meet the listener's needs.
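A minimal sketch of the arithmetic described above, assuming the cue's <decibel> value is simply added to the level resolved for 'voice-volume', and that a 'silent' voice silences the cue as well. The keyword-to-decibel table and the function names are hypothetical; in practice the mapping would come from user preferences in the user agent.

```python
from typing import Optional

# Hypothetical user-calibrated levels, in dB relative to the 'medium' level.
# Real values are chosen by the user agent or the listener.
KEYWORD_LEVELS_DB = {
    "x-soft": -24.0,
    "soft": -12.0,
    "medium": 0.0,
    "loud": 6.0,
    "x-loud": 12.0,
}


def voice_level_db(keyword: str, offset_db: float = 0.0) -> Optional[float]:
    """Return the voice level in dB, or None when the voice is silent."""
    if keyword == "silent":
        return None
    return KEYWORD_LEVELS_DB[keyword] + offset_db


def cue_level_db(keyword: str, voice_offset_db: float = 0.0,
                 cue_offset_db: float = 0.0) -> Optional[float]:
    """Return the cue level in dB, expressed relative to the voice level."""
    voice = voice_level_db(keyword, voice_offset_db)
    if voice is None:
        return None  # the whole aural box is silent, cues included
    return voice + cue_offset_db
```

With this table, a cue authored at +6dB inside a 'soft' element resolves to -6 dB, and any cue inside a 'silent' element resolves to no output at all.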
>>> 
>>> I hope this clarifies the matter.
>>> Let me know if this addresses the issue you raised.
>>> Kind regards, Daniel
>>> 
>>> [1]
>>> http://www.w3.org/TR/css3-speech/#cue-props
>>> 
>>> [2]
>>> http://www.w3.org/TR/speech-synthesis11/#edef_audio
>>> 
>>> [3]
>>> http://www.w3.org/TR/css3-speech/#aural-model
>>> 
>>> [4]
>>> http://www.w3.org/TR/css3-speech/#voice-volume
>>> 
>>> On 18 Aug 2011, at 10:44, <paul.bagshaw@orange-ftgroup.com> wrote:
>>> 
>>>> Bert,
>>>> 
>>>> In response to your recent call for comments on the CSS Speech Module, I have made a personal review of the spec. Please note that my comments have not been seen or discussed by the Voice Browser WG, and as such may not represent the opinion of the group.
>>>> 
>>>> 1. Interaction between the 'voice-volume' and 'cue' properties.
>>>> 
>>>> Please note that in SSML 1.1 the attributes of the <ssml:prosody> element affect the rendering "of the contained text"; they do not have an effect on child <audio> elements. Note therefore that the 'volume' attribute of the <ssml:prosody> element and the 'soundLevel' attribute of the <ssml:audio> element are intentionally independent. This enables the perceived loudness of speech synthesised from text to be balanced with that of speech in pre-recorded audio cues.
>>>> 
>>>> The CSS-Speech module states that 'voice-volume' is related to <ssml:prosody>'s 'volume' attribute, and that the 'cue' properties are related to <ssml:audio> (implying its 'soundLevel' attribute). It also states that the <decibel> value of the 'cue' properties "represents a change (positive or negative) relative to the computed value of the ‘voice-volume’ property".
>>>> 
>>>> Authors often have no control over the volume level of the source (initial waveform) of pre-recorded audio cues, and never have control over the source of speech synthesis waveforms whose loudness differs between speech engines and voices. However, the CSS-Speech module makes the impractical suggestion that authors control the volume level of audio cue waveforms in order to balance them with speech rendered from text.
>>>> 
>>>> I suggest that the CSS-Speech module follows the SSML 1.1 paradigm and that the 'voice-volume' and 'cue' properties should not interact.
>>>> 
>>>> With regards,
>>>> Paul Bagshaw
>>>> Co-author of SSML 1.1 and PLS 1.0.
>>>> 
>>>> -----Original Message-----
>>>> From: w3c-voice-wg-request@w3.org [mailto:w3c-voice-wg-request@w3.org] On Behalf Of Bert Bos
>>>> Sent: Sunday, August 14, 2011 12:32 AM
>>>> To: w3c-wai-pf@w3.org; w3c-voice-wg@w3.org; member-xg-htmlspeech@w3.org; wai-xtech@w3.org
>>>> Cc: chairs@w3.org
>>>> Subject: Heads-up: CSS WG plans last call for css3-speech
>>>> 
>>>> Hello chairs,
>>>> 
>>>> The CSS WG decided to issue a last call for the CSS Speech Module. We're planning to publish next week, with a deadline for comments of 30 September, i.e., about 6 weeks.
>>>> 
>>>> Please, let us know if that deadline is too soon.
>>>> 
>>>> We'd especially like to hear from
>>>> 
>>>> - WAI PF and/or HTML Accessibility TF
>>>> - Voice Browser WG
>>>> - HTML Speech XG
>>>> 
>>>> The latest editor's draft is here:
>>>> 
>>>>  http://dev.w3.org/csswg/css3-speech/
>>>> 
>>>> (The content is what will be published, after reformatting for Working Draft.)
>>>> 
>>>> The CSS Speech module contains properties to style the rendering of documents via a speech synthesizer: voice, volume, speed, pitch, pauses, etc. It is designed to be compatible with SSML, i.e., the rendering of the document could be in the form of an SSML stream.
>>>> 
>>>> 
>>>> 
>>>> For the CSS WG,
>>>> 
>>>> Bert
>>>> --
>>>> Bert Bos                                ( W 3 C )http://www.w3.org/
>>>> http://www.w3.org/people/bos                               W3C/ERCIM
>>>> bert@w3.org                             2004 Rt des Lucioles / BP 93
>>>> +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
>>>> 
>>>> 
>>> 
>> 
> 
Received on Wednesday, 23 November 2011 16:54:22 GMT
