CSS3 Speech Module Comments (14th May Draft)

Hi

What follows are some personal comments on the first published draft of 
the CSS 3 speech module (May 14th Draft).
Disclaimer: these comments are my own personal opinion. This is worth 
stating as I'm a member of the JSR-113 Expert Group for the development 
of version 2 of the Java Speech API. I'm not speaking for that group in 
any way in this message.

I'll go through the document in order, by section, introducing my own 
numbering for convenience:

1. Introduction

The example in the introduction looks like it has a couple of typos 
(what is voice-: ?):

p.heidi { voice-: left; voice-family: female }
P.peter { voice-: right; voice-family: male }

2. voice-balance

The prose for this section mentions two values, 'leftwards' and 
'rightwards', that are not included in the list of valid values given at 
the beginning of the voice-balance definition. These should be consistent.

In general I approve of this simplification. It's probably actually 
implementable, whereas the previous azimuth model wasn't really 
practical.

3. speak

Obviously perfect correspondence with the SSML 'say-as' element is 
difficult. It does seem there are some gaps though:

CSS has 'spell-out' where SSML uses 'letters'
CSS lacks 'date' as an option
CSS lacks the ability to mark a number as a telephone number (ah, I see 
now this is covered by 14. interpret-as below)
CSS lacks the 'words' option... used to force an acronym like "ASCII" 
to be pronounced as a word rather than as letters.

In the literal-punctuation and no-punctuation sections, the prose 
description reads:

"Similar as 'normal' value but ...".

Change to "Similar to 'normal' value but..."

Similarly below change:

"Speech synthesisers are knowledgeable about what is a number and what 
isn't"

to "... what is and is not a number".

Also: was the British spelling of synthesizer intentional? (not that I 
care, being British myself, but I thought W3C standard was American 
English).

First editor's note: agree on not trying to deal with cardinal and 
ordinal for now. I recall considerable feedback on this part of SSML in 
the last review cycle. Wait to see what the SSML team decide to do 
before trying to merge into CSS.

Second editor's note reads in part: "The value 'code' has been replaced 
by 'all' ..."

However, the speak property does not define a value called 'all'. Did 
you mean to say 'literal-punctuation' or something else?

4. pause-before, pause-after, pause

I'm surprised to see CSS doesn't include the textual values allowed by 
the 'break' element in SSML (x-small, small, medium, large, x-large). 
Be wary though: since the last SSML draft the working group agreed to 
redefine some of the 'break' values based on feedback. If you 
incorporate these into CSS, be sure to use the revised list as 
mentioned on www-voice.

It couldn't hurt to include examples of valid seconds and milliseconds 
values in the <time> section. Sure, people can look up CSS time units, 
but it'd be easier if there were an example.
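
For instance, something like the following would do. This is only my own 
sketch: the selectors are arbitrary, and I'm assuming the usual CSS time 
units 's' and 'ms' apply here as they do elsewhere:

h1 { pause-before: 1.5s; pause-after: 500ms }
p  { pause: 20ms 40ms }  /* shorthand: pause-before, then pause-after */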

5. cue-before, cue-after, cue

There would seem to be a need for 'cue-during' indicating that a sample 
should be played in the background whilst an element is rendered. Does 
this introduce too much additional complexity?

eg,

p.noisyBar { cue-during: url("club-music.mp3"); }

I imagine 2 further facilities would be required to make this fully 
useful:

cue-during:

Value	[<uri> looped <number>]

(apologies if that definition's not right)

The <number> would be defined as for voice-volume, and would be the 
sample volume.
If 'looped' is present then the sample loops if it is too short to 
cover the element being rendered, otherwise it is played only once.
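
Putting those two pieces together, rules might look like the sketch 
below. The syntax is purely hypothetical (neither 'cue-during' nor 
'looped' exists in the draft), the second file name is made up, and the 
volumes are arbitrary numbers on the voice-volume scale:

p.noisyBar { cue-during: url("club-music.mp3") looped 30 }  /* repeat quietly until the element ends */
p.doorbell { cue-during: url("ding.wav") 80 }  /* no 'looped': play once only */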

6. voice-family


<age> has been redefined in SSML to take a pure integer value 
(essentially all of the string values were for reasons of political 
correctness: no one wanted to define a numeric equivalent for 'old'). 
You could choose to be consistent with this, though obviously it's then 
not entirely clear how one specifies 'child'. I guess one uses the 
number '5' and hopes the speech synthesiser is smart enough to figure 
out what you want. If you do keep the textual values, 'adult' was in the 
older drafts of SSML, I think.

Taking a look at the first example:

h1 { voice-family: announcer old male };

The definition of this property is quite clever, but confusing. When I 
first saw 'announcer' I thought it was a generic voice that you'd 
forgotten to define. Looking at the syntax closely, I see one must 
define an age in order to use a generic voice (eg, voice-family: child 
male), so I can conclude 'announcer' is intended as a specific voice 
name. This doesn't seem very intuitive.

I do like the way you set voice-family up to work just like 
font-family. A simple left to right priority list seems easier, though 
I think you'd then need a separate property for the age:

voice-family: david, announcer, male;
voice-age: old;

The above seems more self explanatory.

I think you're going in the right direction with 'generic-family' but 
you can get a lot more mileage out of it than just 'male' and 'female'. I 
would love to see 'robotic' (or artificial), 'natural', 'authoritative' 
etc.

This would work nicely with the simpler form of voice-family proposed 
above:

voice-family: Victoria, Agnes, female, natural; /* fall back to more and more general voices */

I like the idea of the <number> in this definition, because I think I 
know the problem you're trying to solve.
Even so, I don't think the domain of 'positive integers' is the 
best one. The definition says:

"Indicates a preferred variant of the other voice characteristics. 
(e.g. the second or next male voice). Possible values are positive 
integers."

This doesn't help me much. I can see I could use '2' or '3' to ask for 
the 2nd or 3rd matching variant, but:

* how would I specify a relative value like 'next', as you imply I can? 
Use 1, 2? +1 ?

* this seems awfully brittle. Having the stylesheet know the order a 
synthesizer will try variants in seems unwise. Any configuration 
changes in the speech engine (or from one machine to another) and the 
stylesheet could mean something else entirely to what the author 
thought it did.

* you should say what happens if the variant is out of range. eg, if an 
engine has 2 old male voices, and I write voice-family: old male 3; 
does it wrap back around to the first?

I'm not sure how to specify variants better in the general case, but 
there is one common case I'd like to see implemented: a facility to 
force a change in the voice variant. This would be useful in many cases 
when parsing markup, as one can use voices to differentiate where nested 
content begins and ends.

eg,

<ul>
	<li>One</li>
	<li>
		<ul>
			<li>Foo</li>
			<li>Bar</li>
		</ul>
	</li>
	<li>three</li>
</ul>

ul { voice-family: young female }
ul ul { voice-family: change }

Here the value 'change' indicates the value should change from the 
inherited value. I'd say it should first look for the closest variant 
(young female 2 in the notation in the working draft), then, if there 
are no suitable matching voices, relax the constraints until it finds 
another voice. The point is best effort must be made to change the 
current voice. The only time the voice would remain the same would be 
if a synthesizer only has one voice installed for the language being 
spoken.

I'd also like to be able to write a selector which says the voice 
should change for however many nested levels of <ul> I might encounter, 
so I don't need to write anything like this:

ul {}
ul ul {}
ul ul ul {}

and so on ... but that's a suggestion for the Selectors module, I think.

7. voice-rate

Are percentages allowed in this property or not? The definition says 
"refer to inherited value" but <percentage> is not listed in the value 
section or the prose. Please remove or define <percentage>.

The editor's note claims: "The values 'faster' and 'slower' were 
removed to be consistent with SSML." but in fact you did not remove 
them: they're still there.

Will you support relative numbers +5, -10 as per SSML or defer to just 
percentages?
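
To make the question concrete, the two styles would look something like 
this. The selectors are made up, the first rule assumes <percentage> is 
meant to be allowed, and the signed number is only my guess at how the 
SSML notation would carry over; it is not in the draft:

em     { voice-rate: 120% }  /* relative to the inherited rate */
strong { voice-rate: +5 }    /* SSML-style relative number, hypothetical in CSS */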

8. voice-pitch

As for voice-rate with respect to <percentage>'s definition being 
missing, and the question of whether to support relative numbers or not.

"SSML allows for relative values in semitones. This would necessitate a 
new CSS unit "st". How valuable is this? What about the alternative of 
providing 'higher' and 'lower' for consistency with other related voice 
properties?"

I've no real opinion on semitones, but the synthesiser I've worked with 
most doesn't use them.
On the question of 'higher' and 'lower' - well it seems like you've 
removed most of the relative values from the other properties, so 
adding them here only makes sense if you reverse that decision.

9. voice-pitch-range

Once again <percentage> is implied by "refers to inherited value", but 
not defined.

In SSML valid values for 'range' include x-high and x-low - these are 
missing from CSS.
'higher' and 'lower' seem a little silly on this property ('wider' and 
'narrower' might be better anyway).
If you add semitones to voice-pitch, be sure to add them here too.

10. Pitch Contour

You've not included the SSML pitch contour concept in CSS?

11. voice-stress

Looks fine, but no support at all for relative values. Not sure how 
valuable they would be.

12. voice-duration

"This allows authors to specify how long they want a given element to 
be rendered. "

This seems like awkward phrasing. How about:

"This allows authors to specify over what time period they want a given 
element to be rendered. "

Similarly:

"Specifies a value in seconds or milliseconds for the desired time to 
take to speak the element contents"

reads better as

"Specifies the time in seconds or milliseconds that should be taken to 
speak the element contents"

13. phonemes

The example is going to be a problem. I've got a pretty wide set of 
fonts but IPA glyphs seem hard to come by: the example doesn't render 
right for me.

14. interpret-as

I see this property also maps to 'say-as' in SSML. The definition 
should be moved so that it follows that of the 'speak' property to 
avoid confusion (eg, I asked above if 'telephone' should be a valid 
value for 'speak' whereas it's obviously defined here instead).

CSS has a richer set of values than SSML. Will you try to influence the 
SSML group to include some of the values from CSS?
The 'word' option from say-as still seems to be missing from CSS though.

I agree with the last comment about the SSML say-as element: clearly 
it's in flux, which is going to make it hard to track. It would be very 
good if the two specifications can agree though.

Thanks for your time. I hope these comments are useful.

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

         (see you later space cowboy ...)

Received on Saturday, 21 June 2003 11:41:56 UTC