Re: CSS3 Speech Module : Working Draft 27 July 2004 Comments

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 10 Aug 2004, Andrew Thompson wrote:

> Hi,
> 
> I've reviewed the 2004 Draft of the CSS3 Speech Module. I previously 
> submitted comments on the 2003 draft here:
> http://lists.w3.org/Archives/Public/www-style/2003Jun/0137.html
> 
> These comments never received any formal response from the working 
> group, but I see the 2004 draft has addressed around 50% of my issues 
> with the previous draft, so I'm pleased with the direction being taken.

Thanks for your persistence, and sorry for having missed your 
earlier email.

> Here are my comments on the 2004 Draft, split into comments on
> style or grammar and comments on the substance of the spec.
> 
> Grammatical & Style Comments
> ----------------------------
> 
> 1.
> Section: Abstract
> Problem: typo
> 
> The sentence "CSS define aural properties that ..." should be
> The sentence "CSS defines aural properties that ..."
> 
> 2.
> Section: Definition of property 'speak'
> Problem: English usage
> 
> In the definitions of 'literal-punctuation' and 'no-punctuation' the 
> sentence
> 
> "Similar as 'normal' value but..." should be
> "Similar to 'normal' value but..."
> 
> 3.
> Section: Definition of property 'speak'
> Problem: English usage
> 
> The sentence:
> "Speech synthesizers are knowledgeable about what is a number and what 
> isn't." should be
> "Speech synthesizers are knowledgeable about what is and is not a 
> number."
> 
> Should not use 'isn't' in formal written English.
> 
> 4.
> Section: Definition of the property 'voice-duration'
> 
> This sentence is poor:
>   "This allows authors to specify how long they want a given element to 
> be rendered."
> 
> ("how long they want" seems like it is plural purely to avoid the 
> he/she problem, and the phrasing is basically slang)
> 
> Perhaps something like
> "Allows authors to specify how long it should take to render the given 
> element."

Thanks for these corrections, we will try to incorporate them in the
next revision to the draft.


> Substantive Comments
> --------------------
> 
> 1.
> Section: Definition of the property 'speak'
> 
> This draft of the spec - 
> http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/ - defined two 
> additional properties, 'date' and 'words'. The latter is probably only 
> marginally useful (in theory it was supposed to force 'ASCII' to be 
> rendered as "as-key" rather than "a s c i i") but I'm really surprised 
> at the removal of "date" which would seem to be really useful.

This will be addressed in terms of the SSML say-as mechanism, the
details of which are still being refined in the W3C Voice Browser
working group. Appendix A notes: "The interpret-as property has been 
temporarily dropped until the Voice Browser working group has 
further progressed work on the SSML <say-as> element."

The as-key example can be handled explicitly using the 'phonemes'
property. It may also be worth considering a CSS equivalent to the
SSML <sub> element, e.g.

      <sub alias="as-key">ASCII</sub>

which in principle could be represented in CSS by something like:

      say-as: "as-key"

Unfortunately, SSML uses <say-as> for the different idea of
indicating the meaning of the enclosed text, not how it should be
said, which is an accident of history. We should therefore pick
another name for the CSS property. My preference is for something
that is self-evident, which "substitute" isn't. Other ideas include
"say-instead", "speak-as" and "say-with".

A further idea is as an elaboration of the "speak" property, e.g.

    speak: as(as-key)

But that would prevent the application of the existing speak
properties. Note that such a property could only be applied to a
specific instance of an element. In the longer term, the use of
pronunciation lexicons would provide a better solution.

Your ideas on this are welcomed.


> 2.
> Section: Definition of the properties 'cue-before' and 'cue-after'
> 
> None of the current examples make it clear that this is legal:
> 
> cue-before: url('bell.aiff') loud;
> 
> While grammar shows this is possible, an example would help the less 
> technical reader understand how this property works.

Good idea.

> (I was going to make a comment about "cue-during" and mixing, but the 
> recent discussion of a CSS audio module on www-style indicates this 
> possibility is already being considered.)

A design goal for the CSS3 speech module was to align with SSML so
that implementers could take advantage of the availability of SSML
processors. This reflects the commercial reality that good quality
speech synthesis systems are complex and hard to develop, and that
the market for CSS3 speech is projected to be much smaller than for
SSML, which is dominated by phone-based applications of interactive
voice response services. The fact that SSML doesn't support
playing sound files at the same time as synthetic speech argues
against introducing such a feature into the CSS3 speech module.
After all, we need the support of implementers in order to get
through the W3C Candidate Recommendation phase.

> 3.
> Section: Definitions of the properties, 'mark-before' and 'mark-after'
> 
> in both cases the definition reads:
> 
> Value: <string>
> 
> but it should be
> 
> Value: <string> | attr(attribute-name)
> 
> To match the description below it.

Thanks for spotting this.

 
> 4.
> Section: Definition of the property 'voice-family'
> 
> 4.1. CSS3 is still using 'child', 'young' and 'old' but SSML has 
> shifted to requiring age to be expressed in years.
> (see http://www.w3.org/TR/speech-synthesis/#S3.2.1)
> 
> One suspects the reason SSML did this was to avoid the political 
> correctness issue of having to define an age which is "old". 'child', 
> 'young' and 'old' are more useful than absolute numbers. Should CSS 
> harmonize with SSML and only use numbers, or at least allow age to be 
> defined in numbers in addition to child/young/old for compatibility?

The problem is that this property already uses a number for the 
voice variant. Having two numbers would lead to confusion and 
authoring mistakes. It therefore seemed wise to use a simple
enumeration.

> 4.2. In addition to 'male' and 'female' the <generic-voice> families 
> should include 'natural' and 'artificial' as many synthesizers have a 
> robot-like voice that is neither male nor female. Note that SSML 
> defines 'neutral' so as a minimum this should be added for 
> compatibility.

Agreed and thanks for spotting this.

> 4.3. As per my 2003 comments, although I like the fact there is a 
> facility for selecting variations, using <number> for specifying them 
> is not a satisfactory solution.
> 
> * firstly using absolute numbers is not very portable. If I write
> 
> body { voice-family: male 1 }
> .foo { voice-family: male 2 }
> .bar { voice-family: male 3 }
> 
> Then what happens if the synthesizer only has two male voices? When 
> something of class 'bar' is rendered, does the synthesizer round-robin 
> back to "male 1" or does it stay with the current voice because it 
> doesn't have enough male voices? At the very least the specification 
> should specify what "best effort" strategy the synthesizer should 
> apply. This allows document authors to at least predict whether the 
> voice will change or not (assuming the synthesizer has at least 2 
> voices).

This is tricky given that SSML doesn't provide a specific algorithm
other than with respect to the value for xml:lang. The current 
wording is the best I have been able to come up with.
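
To make the question concrete, here is a rough Python sketch of the
two "best effort" strategies mentioned above: wrapping round to the
first voice, or staying with the inherited voice. Neither is specified
by the draft or by SSML; the function names and the voice list are
hypothetical and only illustrate the choice.

```python
def pick_wrap(voices, variant):
    """Round-robin: variant numbers past the end wrap back to the first voice."""
    if not voices or variant < 1:
        return None
    return voices[(variant - 1) % len(voices)]


def pick_keep(voices, variant, inherited):
    """Keep the inherited voice when the requested variant does not exist."""
    if 1 <= variant <= len(voices):
        return voices[variant - 1]
    return inherited


# Hypothetical engine with only two male voices.
male_voices = ["Fred", "Bruce"]

# An element asking for "male 3" while inheriting the second voice.
print(pick_wrap(male_voices, 3))
print(pick_keep(male_voices, 3, inherited="Bruce"))
```

Under the first strategy the voice changes back to the first male
voice; under the second it does not change at all. Without a stated
rule, an author cannot predict which of these will happen.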


> 
> * The definition for <number> says: "e.g. the second or next male 
> voice", but no way to indicate "next" and "previous" is given. Clearly 
> '1', '2', '3' work for specifying variants absolutely, but how do you ask 
> for the next voice? Perhaps something like this
> 
> .foo { voice-family: male +1  //select the next male voice, relative to 
> the inherited voice}

In CSS +1 is the same as 1 so that wouldn't work.
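
As a rough illustration in Python, assuming a much simplified number
pattern rather than the real CSS tokenizer (the pattern and the
parse_number helper are hypothetical), a leading plus sign leaves the
parsed value unchanged, so it cannot carry the meaning "next":

```python
import re

# Simplified pattern for a CSS-like number: optional sign, then either
# an integer or a decimal fraction. Not the full CSS grammar.
NUMBER = re.compile(r"[+-]?(\d+|\d*\.\d+)")


def parse_number(text):
    """Return the numeric value of a CSS-like number, or None if it does not match."""
    if NUMBER.fullmatch(text) is None:
        return None
    return float(text)


# The sign is consumed during parsing, so both spellings give the same value.
print(parse_number("+1") == parse_number("1"))
```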

> However this would be easier:
> 
> Value: 	[[<specific-voice> | [<relative-voice-specifier>] [<age>] 
> <generic-voice>],]*
> 		[<specific-voice> | [<relative-voice-specifier>] [<age>] 
> <generic-voice>] | inherit
> 
> <relative-voice-specifier>
> 	Possible values are 'previous' and 'next'
> 
> .foo { voice-family: next old male }
> 
> This would require vendors order their voices, but it would allow 
> document authors to reliably control whether the voice changes.
> 
> eg, Assume a synthesizer has 3 male voices "Fred", "Bruce" and "Ralph"
> 
> <ul>
>    <li>one</li>
>    <li><ul><li>foo</li>
> 		 <li>bar</li>
>        </ul>
>    </li>
> </ul>
> 
> ul { voice-family: male; }    --> Fred
> ul ul { voice-family: next male; } --> Bruce
> ul ul ul { voice-family: previous male; } --> Fred
> 
> * Along similar lines, another value would be useful:
> 
> <relative-voice-specifier>
> 	Possible values are 'previous', 'next' and 'different'
> 
> ul { voice-family: young female; }
> //slightly silly example, you probably wouldn't change the voice for 
> 'em'
> em { voice-family: different female; }
> 
> 'different' is similar to 'previous' and 'next' but rather than cycling 
> through the voices in a set order it asks the synthesizer to change the 
> voice. The new voice should be as close to the inherited value as 
> possible, within the constraints of the available voices. Thus the 
> above 'em' declaration should first try to use a different 'young 
> female' voice, then a different 'female' voice, then a 'neuter' and 
> finally a 'male' voice if the synthesizer only has one female voice. 
> Naturally all of these voices must speak the same language first and 
> foremost!
> 
> Overall I believe something like 'previous', 'next' and 'different' 
> would be more useful, more intuitive and more portable than absolute 
> integer indices.

Unfortunately the need to align with SSML precludes this. The speech
engine vendors are currently focusing on the VoiceXML market and are
much less interested in CSS, so for now, we need to align with SSML.

> 5.
> Section: Definition of 'voice-pitch'
> 
> Regarding semitone changes: I think CSS should be harmonized with SSML 
> unless adding the new unit to CSS is undesirable for some reason?

Adding the "st" suffix for semitones would be easy enough, but CSS
doesn't discriminate between +1 and 1, so 1st would be plus one
semitone while -1st would be minus one semitone.
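
For reference, in equal temperament a shift of n semitones multiplies
the frequency by 2^(n/12). A small Python sketch of what a relative
semitone value would mean; the 120 Hz base pitch and the function name
are assumptions for illustration and do not come from the draft:

```python
def shift_semitones(base_hz, semitones):
    """Return base_hz shifted by a number of semitones (equal temperament)."""
    return base_hz * 2 ** (semitones / 12)


base = 120.0  # assumed base pitch in Hz, for illustration only

print(round(shift_semitones(base, 1), 2))   # one semitone up
print(round(shift_semitones(base, -1), 2))  # one semitone down
print(shift_semitones(base, 12))            # twelve semitones is one octave up
```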

Would this feature be useful in a style sheet as opposed to an SSML
file where it can be applied at a fine grained level in a pitch
contour?

The primary use case was for when you wanted to get the TTS engine
to "sing" by tweaking the pitch contours.

This remains an open issue.


> 
> Thanks for your time. Be interested in hearing any feedback.

Thanks for your feedback. Would you be willing to help with
work on test suites and implementations?

- -- 
 Dave Raggett <dsr@w3.org>  W3C lead for voice and multimodal.
 http://www.w3.org/People/Raggett +44 1225 866240 (or 867351)
 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBGKRib3AdEmxAsUsRAmXMAJ9Gw0+Q/mcTLfeLIgg76rx5mZ8fJwCfVtoo
XhkD9DuF/3JITGdyqQocSXk=
=lGPG
-----END PGP SIGNATURE-----

Received on Tuesday, 10 August 2004 10:33:08 UTC