RE: [css3-speech] audio cue sound level, tts voice-volume (was => Heads-up: CSS WG plans last call for css3-speech)

Dan,

Thank you for your apparently in-depth analysis of the problem. Your proposed actions look perfectly good to resolve the issues.

I wish you and the CSS group all the best in advancing the status of the document.

With regards,
Paul

-----Original Message-----
From: Daniel Weck [mailto:daniel.weck@gmail.com] 
Sent: Wednesday, November 30, 2011 9:16 PM
To: BAGSHAW Paul RD-TECH-REN
Cc: W3C style mailing list; w3c-voice-wg@w3.org
Subject: Re: [css3-speech] audio cue sound level, tts voice-volume (was => Heads-up: CSS WG plans last call for css3-speech)

Hi Paul (et al),
I have not received your feedback since my last contribution to the discussion thread, two months ago. I have changed the email title, just in case your radar missed the previous "heads-up" emails.

I would like to synthesise a proposed solution, which I believe adequately addresses your concerns (as I understand them). Please let us know if this is satisfactory.

This is the offending part of the draft specification:

http://www.w3.org/TR/css3-speech/#cue-props


Proposed actions:

(1) The informative note that states a relationship between the 'cue' functionality and SSML's 'audio' element must be rephrased. The functional overlap is indeed very thin (as described in my initial email reply).

(2) Regarding the third paragraph in the <decibel> section of the 'cue' property: an editorial oversight on my part means that all of it appears to be normative, when in fact it was supposed to be marked as informative. Furthermore, I agree that some prose clarification is required. I suggest a less convoluted description of the expected behaviour, along the lines of the following explanation:

As you correctly pointed out, the SSML 'soundLevel' attribute fulfils a different need than the <decibel> part of the 'cue' properties. Indeed, the latter expresses a sound level within the 'aural box model', relative to the value "inherited" from 'voice-volume', which is based on keywords and a decibel offset (treated additively). The keyword mechanism allows user-agents to expose volume calibration controls to the user, so that "preferred" loudness <-> keyword mappings can be defined based on a user's specific listening environment. The speech synthesis calibration is based on the TTS engine in use. The calibration for external audio clips is based on a reference audio asset, which is either user-selected, or more likely suggested by the user-agent, such as one picked from the document to be presented, or one that is considered "standard" in terms of average volume output (user-agents may "intelligently" pick different references depending on which speech synthesiser / voice provider is currently in use). This allows content authors to expect that an unmodified audio cue (+0dB) is roughly as loud as the TTS output it accompanies (speech synthesis loudness authored through the 'voice-volume' property). A value of 'silent' is effectively applied to both speech synthesis and audio cues (overriding the specified decibel offset, if any), so that the entire element contained within the 'aural box' is silenced.

(3) The informative note that states a relationship between the 'voice-volume' property and SSML's 'volume' attribute must be rephrased. As described in my initial email reply, the way CSS computes values based on inherited values along the element hierarchy means that keywords and decibels are not mutually-exclusive.

http://www.w3.org/TR/css3-speech/#voice-volume


(4) I suggest that similar non-normative (and potentially misleading) statements be rephrased where necessary: "Note that the functionality provided by this property is related to the [SSML feature xxx]". Indeed, although many CSS-Speech features are directly inspired by SSML, it is clear that there are discrepancies between the respective CSS / SSML document models, which limit any 1-to-1 mapping on a per-feature basis.
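
To make points (2) and (3) more concrete, here is a minimal Python sketch of the model as described above. It is an illustration under stated assumptions, not text from the draft: the keyword-to-decibel table and the names VoiceVolume, compute_voice_volume, voice_level_db and cue_level_db are all hypothetical, and the treatment of a decibel-only value under a 'silent' parent is a guess on which the explanation above is silent.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical user calibration: keyword -> loudness in dB relative to an
# arbitrary reference. Real values would come from user-agent preferences.
KEYWORD_DB = {
    "x-soft": -12.0,
    "soft": -6.0,
    "medium": 0.0,
    "loud": 6.0,
    "x-loud": 12.0,
}


@dataclass(frozen=True)
class VoiceVolume:
    """Computed 'voice-volume': either silent, or a keyword plus a dB offset."""
    silent: bool = False
    keyword: str = "medium"
    offset_db: float = 0.0


def compute_voice_volume(parent: VoiceVolume, *, silent: bool = False,
                         keyword: Optional[str] = None,
                         offset_db: float = 0.0) -> VoiceVolume:
    """Resolve a specified value against the inherited (parent) value."""
    if silent:
        return VoiceVolume(silent=True)
    if keyword is not None:
        # A keyword restarts from its calibrated level; the offset applies to it.
        return VoiceVolume(keyword=keyword, offset_db=offset_db)
    if parent.silent:
        # Assumption: a bare decibel offset cannot "un-silence" a silent parent.
        return parent
    # Decibel-only value: keep the inherited keyword, add the offsets.
    return VoiceVolume(keyword=parent.keyword,
                       offset_db=parent.offset_db + offset_db)


def voice_level_db(volume: VoiceVolume) -> Optional[float]:
    """Loudness of the synthesised speech, or None when silent."""
    if volume.silent:
        return None
    return KEYWORD_DB[volume.keyword] + volume.offset_db


def cue_level_db(volume: VoiceVolume, cue_offset_db: float = 0.0) -> Optional[float]:
    """Loudness of an audio cue: the voice level plus the cue's <decibel>.

    A 'silent' voice-volume overrides the cue offset, silencing the whole box.
    """
    level = voice_level_db(volume)
    return None if level is None else level + cue_offset_db
```

With the hypothetical calibration above, a 'soft' heading whose child specifies only +6dB computes to 'soft' plus 6dB (keyword and decibels coexist), and a cue authored at -3dB on that child plays 3dB below the accompanying voice. Setting 'silent' on the child makes cue_level_db return None regardless of the cue offset.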

Let me know if this works for you.
Regards, Dan

On 29 Sep 2011, at 10:40, Daniel Weck wrote:

> Hi Paul,
> in your reply (quoted below), point #2 is indeed related to the issue you originally raised, entitled: "Interaction between the 'voice-volume' and 'cue' properties." Thank you for the suggestion. I take it that this particular statement is deemed inappropriate (am I right?):
> 
> "Although there exists no industry-wide standard to support such convention, TTS engines usually generate comparably-loud audio signals when no gain or attenuation is specified. For voice and soft music, -15dB RMS seems to be pretty standard."
> 
> See:
> http://www.w3.org/TR/css3-speech/#cue-props
> 
> 
> 
> However, point #1 seems to address a different issue, namely the fact that in CSS-Speech, "voice sound level" keywords can be combined with relative decibel offsets (which is a side-effect of how keyword values get inherited and effectively resolved/computed in the CSS property model). As this is a different issue, I would prefer to file it separately. And yes, we could explicitly specify a function mapping with SSML, by using nested 'prosody' elements. Could you please raise a separate issue?
> 
> See:
> http://www.w3.org/TR/css3-speech/#voice-volume
> http://www.w3.org/TR/speech-synthesis11/#edef_prosody
> 
> Thanks!
> Regards, Daniel
> 
> 
> On 29 Sep 2011, at 08:45, <paul.bagshaw@orange.com> <paul.bagshaw@orange.com> wrote:
> 
>> Hi,
>> 
>> Yes, you do need to improve the related informative note. To resolve this issue:
>> 
>> 1. you at least need to demonstrate to the reader that the value of the ssml:prosody volume attribute is equivalent to a function (please define it) of CSS-Speech properties. Pay particular attention to key-word values, since they will lead you to a messy solution. And if you really wish to claim "The feature set exposed by this specification is designed to match the model described by the Speech Synthesis Markup Language (SSML) Version 1.1", then you can go that one step further and make the function a one-to-one mapping.
>> 
>> 2. you should remove any pretentious illusion that speech synthesis vendors will one day conform to some futuristic sound-level standard and consequently modify all their existing voices. It's not going to happen and it is inappropriate to propose a specification based on such a hypothesis.
>> 
>> With regards,
>> 
>> -- Paul
>> 
>> -----Original Message-----
>> From: Daniel Weck [mailto:daniel.weck@gmail.com] 
>> Sent: Tuesday, September 13, 2011 12:15 AM
>> To: W3C style mailing list; BAGSHAW Paul RD-TECH-REN
>> Cc: w3c-voice-wg@w3.org
>> Subject: Re: [css3-speech] Heads-up: CSS WG plans last call for css3-speech
>> 
>> Dear Paul,
>> the 'cue' properties [1] of CSS3 Speech have in common with SSML's 'audio' element [2] the ability to play external (pre-recorded) audio clips, but the comparison ends here. The SSML feature-set is richer (e.g. data prefetch, clipping, repeat, rate), perhaps conceptually closer to HTML5's 'audio' element than to CSS Speech's 'cue' functionality. The informative note in the CSS3 Speech specification should perhaps consequently be improved, to prevent misleading the reader.
>> 
>> CSS3 Speech provides a simple mechanism for short auditory cues that merely complement the speech-focused information stream. So for example, when an H1 heading gets its CSS voice-volume set to 'silent', the associated pre-recorded sounds (leading and/or trailing) should quite naturally become silent as well. Technically, this behaviour is dictated by the "aural box model" [3], which is designed by analogy with the visual box model (padding, border, margin). Within this conceptual "space" surrounding each selected element (in the CSS sense), sound/volume level is akin to opacity/visibility, in that a change affects the "box" as a whole.
>> 
>> In order to deal with the possible (and likely) discrepancies between the sound levels generated by TTS engines and the waveform amplitude of encoded audio clips, the CSS3 Speech specification relies on the user-agent's ability to set some values based on user preferences (a principle which allows keywords such as 'soft' to be mapped to concrete, useful values in terms of the listening context [4]). Furthermore, the <decibel> field of auditory 'cues' (see [1]) describes a canonical (if somewhat empirical) method to author TTS/cues combinations that play predictably when volume variations are applied. As Alan Gresley pointed out in his reply (thank you, by the way), standardisation in the field of TTS engines has yet to happen, so the lack of harmonisation prevents us from using stricter conformance requirements. Implementations of the CSS Speech specification will therefore expose control mechanisms for users to "equalise" the volume output of TTS and non-TTS audio streams, in the same way that sound level keywords are mapped to real-world values that meet the listener's needs.
>> 
>> I hope this clarifies the matter.
>> Let me know if this addresses the issue you raised.
>> Kind regards, Daniel
>> 
>> [1]
>> http://www.w3.org/TR/css3-speech/#cue-props
>> 
>> [2]
>> http://www.w3.org/TR/speech-synthesis11/#edef_audio
>> 
>> [3]
>> http://www.w3.org/TR/css3-speech/#aural-model
>> 
>> [4]
>> http://www.w3.org/TR/css3-speech/#voice-volume
>> 
>> On 18 Aug 2011, at 10:44, <paul.bagshaw@orange-ftgroup.com> <paul.bagshaw@orange-ftgroup.com> wrote:
>> 
>>> Bert,
>>> 
>>> In response to your recent call for comments on the CSS Speech Module, I have made a personal review of the spec. Please note that my comments have not been seen or discussed by the Voice Browser WG, and as such may not represent the opinion of the group.
>>> 
>>> 1. Interaction between the 'voice-volume' and 'cue' properties.
>>> 
>>> Please note that in SSML 1.1 the attributes of the <ssml:prosody> element affect the rendering "of the contained text"; they do not have an effect on child <audio> elements. Note therefore that the 'volume' attribute of the <ssml:prosody> element and the 'soundLevel' attribute of the <ssml:audio> element are intentionally independent. This enables the perceived loudness of speech synthesised from text to be balanced with that of speech in pre-recorded audio cues.
>>> 
>>> The CSS-Speech module states that 'voice-volume' is related to <ssml:prosody>'s 'volume' attribute, and that the 'cue' properties are related to <ssml:audio> (implying its 'soundLevel' attribute). It also states that the <decibel> value of the 'cue' properties "represents a change (positive or negative) relative to the computed value of the ‘voice-volume’ property".
>>> 
>>> Authors often have no control over the volume level of the source (initial waveform) of pre-recorded audio cues, and never have control over the source of speech synthesis waveforms whose loudness differs between speech engines and voices. However, the CSS-Speech module makes the impractical suggestion that authors control the volume level of audio cue waveforms in order to balance them with speech rendered from text.
>>> 
>>> I suggest that the CSS-Speech module follows the SSML 1.1 paradigm and that the 'voice-volume' and 'cue' properties should not interact.
>>> 
>>> With regards,
>>> Paul Bagshaw
>>> Co-author of SSML 1.1 and PLS 1.0.
>>> 
>>> -----Original Message-----
>>> From: w3c-voice-wg-request@w3.org [mailto:w3c-voice-wg-request@w3.org] On Behalf Of Bert Bos
>>> Sent: Sunday, August 14, 2011 12:32 AM
>>> To: w3c-wai-pf@w3.org; w3c-voice-wg@w3.org; member-xg-htmlspeech@w3.org; wai-xtech@w3.org
>>> Cc: chairs@w3.org
>>> Subject: Heads-up: CSS WG plans last call for css3-speech
>>> 
>>> Hello chairs,
>>> 
>>> The CSS WG decided to issue a last call for the CSS Speech Module. We're planning to publish next week, with a deadline for comments of 30 September, i.e., about 6 weeks.
>>> 
>>> Please, let us know if that deadline is too soon.
>>> 
>>> We'd especially like to hear from
>>> 
>>> - WAI PF and/or HTML Accessibility TF
>>> - Voice Browser WG
>>> - HTML Speech XG
>>> 
>>> The latest editor's draft is here:
>>> 
>>>   http://dev.w3.org/csswg/css3-speech/
>>> 
>>> (The content is what will be published, after reformatting for Working Draft.)
>>> 
>>> The CSS Speech module contains properties to style the rendering of documents via a speech synthesizer: voice, volume, speed, pitch, pauses, etc. It is designed to be compatible with SSML, i.e., the rendering of the document could be in the form of an SSML stream.
>>> 
>>> 
>>> 
>>> For the CSS WG,
>>> 
>>> Bert
>>> --
>>> Bert Bos                                ( W 3 C ) http://www.w3.org/
>>> http://www.w3.org/people/bos                               W3C/ERCIM
>>> bert@w3.org                             2004 Rt des Lucioles / BP 93
>>> +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
>>> 
>>> 
>> 
> 

Received on Friday, 2 December 2011 12:51:07 UTC