
Re: [css3-speech] Heads-up: CSS WG plans last call for css3-speech

From: Alan Gresley <alan@css-class.com>
Date: Fri, 30 Sep 2011 00:56:09 +1000
Message-ID: <4E848709.8070108@css-class.com>
To: paul.bagshaw@orange.com
CC: daniel.weck@gmail.com, www-style@w3.org, w3c-voice-wg@w3.org
On 29/09/2011 5:45 PM, paul.bagshaw@orange.com wrote:
> Hi,

Hello Paul.

> Yes, you do need to improve the related informative note. To resolve
> this issue:
>
> 1. you at least need to demonstrate to the reader that the value of
> the ssml:prosody volume attribute is equivalent to a function (please
> define it) of CSS-Speech properties.

Can you clarify what you mean by the word function? From CSS3-speech 
[1], I see this.

   | The effective volume variation between ‘x-soft’
   | and ‘x-loud’ represents the dynamic range (in terms
   | of loudness) of the speech output.

> Pay particular attention to
> key-word values, since they will lead you to a messy solution.

What we are aiming for is styling relative to a baseline (the baseline 
could be the full equalization of the volume output of TTS and non-TTS 
audio streams across the full dynamic range).

> And if
> you really wish to claim "The feature set exposed by this
> specification is designed to match the model described by the Speech
> Synthesis Markup Language (SSML) Version 1.1", then you can go that
> one step further and make the function a one-to-one mapping.

How is it possible to demonstrate that a value of the ssml:prosody 
volume attribute is equivalent to a function of CSS-Speech properties 
and then, in turn, request a one-to-one mapping?
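To make the question concrete, here is a minimal sketch of what such a "function" and one-to-one mapping might look like. The keyword names come from CSS3-Speech 'voice-volume' and ssml:prosody volume; the even dB spacing over a user-calibrated floor and ceiling is purely my own assumption, not something either spec defines.

```python
# Ordered keywords shared by CSS3-Speech 'voice-volume' and ssml:prosody volume.
KEYWORDS = ["silent", "x-soft", "soft", "medium", "loud", "x-loud"]

def keyword_to_db(keyword, floor_db, ceiling_db):
    """Map a keyword to an absolute level inside the user's dynamic range.

    'silent' is special-cased to no output; the remaining keywords are
    spaced evenly between the user's floor ('x-soft') and ceiling
    ('x-loud'). The even spacing is an illustrative assumption.
    """
    if keyword == "silent":
        return None  # no audible output
    audible = KEYWORDS[1:]  # x-soft .. x-loud
    step = (ceiling_db - floor_db) / (len(audible) - 1)
    return floor_db + audible.index(keyword) * step

def db_to_keyword(level_db, floor_db, ceiling_db):
    """Invert the mapping: each keyword has a distinct level, so the
    audible keywords map one-to-one and the nearest one can be recovered."""
    audible = KEYWORDS[1:]
    step = (ceiling_db - floor_db) / (len(audible) - 1)
    index = round((level_db - floor_db) / step)
    return audible[max(0, min(index, len(audible) - 1))]
```

With a user floor of 40 dB and ceiling of 80 dB, 'medium' lands at 60 dB and round-trips back to 'medium', which is the one-to-one property Paul is asking to see demonstrated.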

> 2. you should remove any pretentious illusion that speech synthesis
> vendors will one day conform to some futuristic sound-level standard
> and consequently modify all their existing voices.

TTS vendors would have to modify all their existing voices (or rather, 
their synthesized sound) if they want to claim support for some future 
standard (see also the part concerning the WAI below).

Consider one example involving amplification. As amplifiers with greater 
gain come onto the market, and as users of voice-synthesis technology 
update their computers, software and amplifiers, the difference in 
absolute decibel values between old technology and newer technology will 
grow. We don't want a situation where a user cannot hear a voice because 
its pitch is too high (which could result from a CSS parsing method 
where declarations are dropped for backwards compatibility) and then 
selects something to change the voice to a lower pitch with no 
consideration of their own dynamic range of hearing. An amplifier with a 
large gain could produce a decibel level that damages the user's hearing.
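The concern above can be sketched as a simple clamp: apply the amplifier's gain to a baseline level, but never let the result exceed a safe ceiling, so higher-gain future hardware cannot push output to damaging levels. All numbers here are illustrative assumptions; neither CSS3-Speech nor SSML defines them.

```python
SAFE_CEILING_DB = 85.0  # illustrative safe-listening threshold, not from any spec

def output_level(baseline_db, gain_db, ceiling_db=SAFE_CEILING_DB):
    """Return baseline plus gain, never exceeding the safe ceiling."""
    return min(baseline_db + gain_db, ceiling_db)

# An older amplifier (modest gain) and a newer one (large gain) applied
# to the same 60 dB baseline: without the clamp, the newer amplifier
# would exceed the 85 dB ceiling.
old = output_level(60.0, 10.0)  # 70.0
new = output_level(60.0, 40.0)  # clamped to 85.0
```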

> It's not going to
> happen and it is inappropriate to propose a specification based on
> such a hypothesis.
>
> With regards,
>
> -- Paul

Let me point you to both the 'UAAG 2.0' and 'Implementing UAAG 2.0' 
Working Drafts by the WAI.

http://www.w3.org/TR/UAAG20/#abstract

   | A user agent that conforms to these guidelines will
   | promote accessibility through its own user interface
   | and through other internal facilities, including
   | its ability to communicate with other technologies
   | (especially assistive technologies). Furthermore,
   | all users, not just users with disabilities,
   | should find conforming user agents to be more
   | usable. In addition to helping developers of
   | browsers and media players, this document will
   | also benefit developers of assistive technologies
   | because it explains what types of information
   | and control an assistive technology may expect
   | from a conforming user agent.

Within it is Guideline 1.6 - Provide synthesized speech configuration:

http://www.w3.org/TR/UAAG20/#gl-speech-config

   | If synthesized speech is produced, the user can
   | specify the following: (Level A)
   | - speech volume (independently of other sources of
   | audio).


Note the part about "independently of other sources of audio." This is 
not something a wav file can do [2] [3]. A TTS engine should communicate 
with the computer, which in turn communicates with a sound device, to 
arrive at a desired baseline (I do note that SSML 1.1 only has a 
baseline in reference to pitch, not volume).

Furthermore, the implementation notes for Guideline 1.6 state:

http://www.w3.org/TR/IMPLEMENTING-UAAG20/#gl-speech-config

   | The objective of these success criteria is to allow
   | the user to customize the specified speech
   | characteristics to settings that allow the user to
   | perceive and understand the audio information.

   | Users may need to increase the volume to a level
   | within their range of perception for example.

Increasing the volume gradually is the safest way to discover a suitable 
baseline.
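A minimal sketch of that idea, assuming a ramp-up procedure: start from a quiet level and step upward until the user confirms they can perceive the speech. The starting level, step size, ceiling, and the `can_hear` callback (standing in for user confirmation) are all hypothetical.

```python
def find_baseline(can_hear, start_db=30.0, step_db=5.0, ceiling_db=85.0):
    """Ramp the level up from a quiet start until can_hear(level) is true.

    Returns the first audible level, or None if nothing at or below the
    ceiling is audible. Starting low and increasing avoids ever playing
    a level above what the user has confirmed as tolerable.
    """
    level = start_db
    while level <= ceiling_db:
        if can_hear(level):
            return level
        level += step_db
    return None

# Example: a user who first perceives speech at 55 dB.
baseline = find_baseline(lambda db: db >= 55.0)  # 55.0
```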


1. http://www.w3.org/TR/css3-speech/#voice-volume
2. http://www.w3.org/TR/speech-synthesis11/#AppA
3. http://www.w3.org/TR/IMPLEMENTING-UAAG20/#def-audio


-- 
Alan Gresley
http://css-3d.org/
http://css-class.com/
Received on Thursday, 29 September 2011 14:56:44 GMT
