CSS3 Speech Module : Working Draft 27 July 2004 Comments

Hi,

I've reviewed the 2004 Draft of the CSS3 Speech Module. I previously 
submitted comments on the 2003 draft here:
http://lists.w3.org/Archives/Public/www-style/2003Jun/0137.html

These comments never received any formal response from the working 
group, but I see the 2004 draft has addressed around 50% of my issues 
with the previous draft, so I'm pleased with the direction being taken.

Here are my comments on the 2004 Draft, split into comments on style or 
grammar and comments on the substance of the spec.

Grammatical & Style Comments
----------------------------

1.
Section: Abstract
Problem: typo

The sentence "CSS define aural properties that ..." should be
The sentence "CSS defines aural properties that ..."

2.
Section: Definition of property 'speak'
Problem: English usage

In the definitions of 'literal-punctuation' and 'no-punctuation' the 
sentence

"Similar as 'normal' value but..." should be
"Similar to 'normal' value but..."

3.
Section: Definition of property 'speak'
Problem: English usage

The sentence:
"Speech synthesizers are knowledgeable about what is a number and what 
isn't."
should be
"Speech synthesizers are knowledgeable about what is and is not a 
number."

Contractions such as 'isn't' should not be used in formal written 
English.

4.
Section: Definition of the property 'voice-duration'

This sentence is poor:
  "This allows authors to specify how long they want a given element to 
be rendered."

("how long they want" seems like it is plural purely to avoid the 
he/she problem, and the phrasing is basically slang)

Perhaps something like
"Allows authors to specify how long it should take to render the given 
element."

Substantive Comments
--------------------

1.
Section: Definition of the property 'speak'

This draft of the spec - 
http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/ - defined two 
additional properties, 'date' and 'words'. The latter is probably only 
marginally useful (in theory it was supposed to force 'ASCII' to be 
rendered as "as-key" rather than "a s c i i"), but I'm surprised at the 
removal of 'date', which would seem to be really useful.

2.
Section: Definition of the properties 'cue-before' and 'cue-after'

None of the current examples make it clear that this is legal:

cue-before: url('bell.aiff') loud;

While the grammar shows this is possible, an example would help the 
less technical reader understand how this property works.

(I was going to make a comment about "cue-during" and mixing, but the 
recent discussion of a CSS audio module on www-style indicates this 
possibility is already being considered.)

3.
Section: Definitions of the properties, 'mark-before' and 'mark-after'

In both cases the definition reads:

Value: <string>

but it should be

Value: <string> | attr(attribute-name)

to match the description below it.

4.
Section: Definition of the property 'voice-family'

4.1. CSS3 is still using 'child', 'young' and 'old' but SSML has 
shifted to requiring age to be expressed in years.
(see http://www.w3.org/TR/speech-synthesis/#S3.2.1)

One suspects the reason SSML did this was to avoid the political 
correctness issue of having to define an age which is "old". Still, 
'child', 'young' and 'old' are more useful than absolute numbers. Should 
CSS harmonize with SSML and only use numbers, or at least allow age to 
be defined in numbers in addition to child/young/old for compatibility?

4.2. In addition to 'male' and 'female' the <generic-voice> families 
should include 'natural' and 'artificial' as many synthesizers have a 
robot-like voice that is neither male nor female. Note that SSML 
defines 'neutral', so at a minimum this should be added for 
compatibility.

4.3. As per my 2003 comments, although I like the fact there is a 
facility for selecting variations, using <number> for specifying them 
is not a satisfactory solution.

* Firstly, using absolute numbers is not very portable. If I write

body { voice-family: male 1 }
.foo { voice-family: male 2 }
.bar { voice-family: male 3 }

Then what happens if the synthesizer only has two male voices? When 
something of class 'bar' is rendered, does the synthesizer round-robin 
back to "male 1" or does it stay with the current voice because it 
doesn't have enough male voices? At the very least the specification 
should specify what "best effort" strategy the synthesizer should 
apply. This allows document authors to at least predict whether the 
voice will change or not (assuming the synthesizer has at least 2 
voices).
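
To make the two candidate behaviours concrete, here is a minimal 
sketch in Python. It is only an illustration of the question, not 
anything the draft defines: the function name pick_voice and the 
strategy names 'wrap' and 'keep' are my own inventions, and the voice 
names are placeholders.

```python
def pick_voice(voices, index, inherited, strategy):
    """Resolve a 1-based voice index against the installed voices.

    'wrap' cycles back to the start of the list (round-robin);
    'keep' stays with the inherited voice. Both names are hypothetical.
    """
    if not voices:
        return inherited
    if 1 <= index <= len(voices):
        return voices[index - 1]          # the requested voice exists
    if strategy == "wrap":
        return voices[(index - 1) % len(voices)]
    if strategy == "keep":
        return inherited
    raise ValueError(f"unknown strategy: {strategy}")


installed = ["Fred", "Bruce"]  # a synthesizer with only two male voices
# '.bar' asks for "male 3" while the inherited voice is "male 2" (Bruce)
for strategy in ("wrap", "keep"):
    print(strategy, pick_voice(installed, 3, "Bruce", strategy))
```

Under 'wrap' the voice changes (back to the first one); under 'keep' 
it does not. Either would be fine, as long as the specification picks 
one so authors can predict the result.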

* The definition for <number> says: "e.g. the second or next male 
voice", but no way to indicate "next" and "previous" is given. Clearly 
'1', '2', '3' work for specifying variants absolutely, but how does one 
ask for the next voice? Perhaps something like this:

/* select the next male voice, relative to the inherited voice */
.foo { voice-family: male +1 }

However, this would be easier:

Value: [[<specific-voice> | [<relative-voice-specifier>] [<age>] <generic-voice>],]*
       [<specific-voice> | [<relative-voice-specifier>] [<age>] <generic-voice>] | inherit

<relative-voice-specifier>
	Possible values are 'previous' and 'next'

.foo { voice-family: next old male }

This would require vendors to order their voices, but it would allow 
document authors to reliably control whether the voice changes.

E.g., assume a synthesizer has 3 male voices: "Fred", "Bruce" and "Ralph"

<ul>
   <li>one</li>
   <li><ul><li>foo</li>
           <li>bar</li>
       </ul>
   </li>
</ul>

ul { voice-family: male; }    --> Fred
ul ul { voice-family: next male; } --> Bruce
ul ul ul { voice-family: previous male; } --> Fred
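
The resolution I have in mind can be sketched in a few lines of 
Python. This is a rough model under my own assumptions: the function 
name resolve is made up, a plain 'male' is taken to mean the first 
voice in the vendor's order, and stepping past either end of the list 
wraps around.

```python
MALE_VOICES = ["Fred", "Bruce", "Ralph"]  # vendor-defined order


def resolve(specifier, inherited, voices):
    """Resolve a hypothetical <relative-voice-specifier>.

    With no specifier (or no usable inherited voice) the first voice
    in the vendor's order is chosen; 'next' and 'previous' step through
    the list relative to the inherited voice, wrapping at the ends.
    """
    if specifier is None or inherited not in voices:
        return voices[0]
    step = {"next": 1, "previous": -1}[specifier]
    return voices[(voices.index(inherited) + step) % len(voices)]


voice = None
chain = []
# ul, then ul ul, then ul ul ul
for specifier in (None, "next", "previous"):
    voice = resolve(specifier, voice, MALE_VOICES)
    chain.append(voice)
print(chain)  # ['Fred', 'Bruce', 'Fred']
```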

* Along similar lines, another value would be useful:

<relative-voice-specifier>
	Possible values are 'previous', 'next' and 'different'

ul { voice-family: young female; }
/* slightly silly example, you probably wouldn't change the voice for 'em' */
em { voice-family: different female; }

'different' is similar to 'previous' and 'next' but rather than cycling 
through the voices in a set order it asks the synthesizer to change the 
voice. The new voice should be as close to the inherited value as 
possible, within the constraints of the available voices. Thus the 
above 'em' declaration should first try to use a different 'young 
female' voice, then a different 'female' voice, then a 'neutral' voice and 
finally a 'male' voice if the synthesizer only has one female voice. 
Naturally all of these voices must speak the same language first and 
foremost!
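
As a sketch of that fallback order (again Python, and again only my 
own illustration): the Voice record, the function pick_different and 
the sample voices are all hypothetical, and I use SSML's 'neutral' as 
the label for a voice that is neither male nor female.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    name: str
    lang: str
    gender: str
    age: str


def pick_different(inherited, voices):
    """Return the closest voice that is not the inherited one.

    Falls back to the inherited voice if no other voice speaks the
    same language.
    """
    # Language comes first: only same-language voices are candidates.
    candidates = [v for v in voices
                  if v != inherited and v.lang == inherited.lang]
    # Preference tiers, most similar first.
    tiers = [
        lambda v: v.gender == inherited.gender and v.age == inherited.age,
        lambda v: v.gender == inherited.gender,
        lambda v: v.gender == "neutral",
        lambda v: True,  # anything left, e.g. a male voice
    ]
    for matches in tiers:
        for v in candidates:
            if matches(v):
                return v
    return inherited


voices = [
    Voice("Amy", "en", "female", "young"),
    Voice("Beth", "en", "female", "old"),
    Voice("Carl", "en", "male", "young"),
    Voice("Dora", "fr", "female", "young"),
]
print(pick_different(voices[0], voices).name)  # Beth
```

Dora is the closest match on gender and age but is skipped because she 
speaks a different language.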

Overall I believe something like 'previous', 'next' and 'different' 
would be more useful, more intuitive and more portable than absolute 
integer indices.

5.
Section: Definition of 'voice-pitch'

Regarding semitone changes: I think CSS should be harmonized with SSML, 
unless adding the new unit to CSS is undesirable for some reason.
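
For readers less familiar with the unit: in equal temperament a 
semitone is a frequency ratio of 2^(1/12), so twelve semitones make an 
octave. A tiny Python sketch (the function name shift_pitch and the 
120 Hz base are just placeholders of mine):

```python
def shift_pitch(base_hz, semitones):
    """Shift a frequency by a number of equal-tempered semitones."""
    return base_hz * 2 ** (semitones / 12)


print(shift_pitch(120.0, 12))   # one octave up: 240.0
print(shift_pitch(120.0, -12))  # one octave down: 60.0
```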

Thanks for your time. I'd be interested in hearing any feedback.

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

         (see you later space cowboy ...)

Received on Tuesday, 10 August 2004 04:17:18 UTC