- From: Dave Raggett <dsr@w3.org>
- Date: Tue, 10 Aug 2004 11:33:01 +0100 (BST)
- To: Andrew Thompson <lordpixel@mac.com>
- Cc: www style <www-style@w3.org>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 10 Aug 2004, Andrew Thompson wrote:

> Hi,
>
> I've reviewed the 2004 Draft of the CSS3 Speech Module. I previously
> submitted comments on the 2003 draft here:
> http://lists.w3.org/Archives/Public/www-style/2003Jun/0137.html
>
> These comments never received any formal response from the working
> group, but I see the 2004 draft has addressed around 50% of my issues
> with the previous draft, so I'm pleased with the direction being taken.

Thanks for your persistence, and sorry for having missed your earlier
email.

> Here are my comments on the 2004 Draft, split into comments on
> style or grammar and comments on the substance of the spec.
>
> Grammatical & Style Comments
> ----------------------------
>
> 1.
> Section: Abstract
> Problem: typo
>
> The sentence "CSS define aural properties that ..." should be
> The sentence "CSS defines aural properties that ..."
>
> 2.
> Section: Definition of property 'speak'
> Problem: English usage
>
> In the definitions of 'literal-punctuation' and 'no-punctuation' the
> sentence
>
> "Similar as 'normal' value but..." should be
> "Similar to 'normal' value but..."
>
> 3.
> Section: Definition of property 'speak'
> Problem: English usage
>
> The sentence:
> "Speech synthesizers are knowledgeable about what is a number and what
> isn't."
> should be:
> "Speech synthesizers are knowledgeable about what is and is not a
> number."
>
> Should not use 'isn't' in formal written English.
>
> 4.
> Section: Definition of the property 'voice-duration'
>
> This sentence is poor:
> "This allows authors to specify how long they want a given element to
> be rendered."
>
> ("how long they want" seems like it is plural purely to avoid the
> he/she problem, and the phrasing is basically slang)
>
> Perhaps something like
> "Allows authors to specify how long it should take to render the given
> element."

Thanks for these corrections; we will try to incorporate them in the
next revision of the draft.

> Substantive Comments
> --------------------
>
> 1.
> Section: Definition of the property 'speak'
>
> This draft of the spec -
> http://www.w3.org/TR/2002/WD-speech-synthesis-20021202/ - defined two
> additional properties, 'date' and 'words'. The latter is probably only
> marginally useful (in theory it was supposed to force 'ASCII' to be
> rendered as "as-key" rather than "a s c i i") but I'm really surprised
> at the removal of "date" which would seem to be really useful.

This will be addressed in terms of the SSML say-as mechanism, the
details of which are still being refined in the W3C Voice Browser
working group. Appendix A notes:

  "The interpret-as property has been temporarily dropped until
   the Voice Browser working group has further progressed work
   on the SSML <say-as> element."

The as-key example can be handled explicitly using the 'phonemes'
property. It may also be worth considering a CSS equivalent to the
SSML <sub> element, e.g.

   <sub alias="as-key">ASCII</sub>

which in principle could be represented in CSS by something like:

   say-as: "as-key"

But it is unfortunate that SSML uses <say-as> for the different
idea of indicating the meaning of the enclosed text, and not for
how it should be said. That was an unfortunate accident of history.
We should therefore pick another name for the CSS property. My
preference is for something that is self-evident, which "substitute"
isn't. Other ideas include "say-instead", "speak-as" and "say-with".
A further idea is an elaboration of the "speak" property, e.g.

   speak: as(as-key)

But that would prevent the application of the existing speak
properties.

Note that such a property could only be applied to a specific
instance of an element. In the longer term, the use of pronunciation
lexicons would provide a better solution. Your ideas on this are
welcome.

> 2.
> Section: Definition of the properties 'cue-before' and 'cue-after'
>
> None of the current examples make it clear that this is legal:
>
> cue-before: url('bell.aiff') loud;
>
> While grammar shows this is possible, an example would help the less
> technical reader understand how this property works.

Good idea.

> (I was going to make a comment about "cue-during" and mixing, but the
> recent discussion of a CSS audio module on www-style indicates this
> possibility is already being considered.)

A design goal for the CSS3 speech module was to align with SSML so
that implementers could take advantage of the availability of SSML
processors. This reflects the commercial reality that good quality
speech synthesis systems are complex and hard to develop, and that
the market for CSS3 speech is projected to be much smaller than for
SSML, which is dominated by phone-based applications of interactive
voice response services.

The fact that SSML doesn't support playing sound files at the same
time as synthetic speech argues against introducing such a feature
into the CSS3 speech module. After all, we need the support of
implementers in order to get through the W3C Candidate
Recommendation phase.

> 3.
> Section: Definitions of the properties, 'mark-before' and 'mark-after'
>
> in both cases the definition reads:
>
> Value: <string>
>
> but it should be
>
> Value: <string> | attr(attribute-name)
>
> To match the description below it.

Thanks for spotting this.

> 4.
> Section: Definition of the property 'voice-family'
>
> 4.1. CSS3 is still using 'child', 'young' and 'old' but SSML has
> shifted to requiring age to be expressed in years.
> (see http://www.w3.org/TR/speech-synthesis/#S3.2.1)
>
> One suspects the reason SSML did this was to avoid the political
> correctness issue of having to define an age which is "old". 'child',
> 'young' and 'old' are more useful than absolute numbers.
> Should CSS harmonize with SSML and only use numbers, or at least
> allow age to be defined in numbers in addition to child/young/old
> for compatibility?

The problem is that this property already uses a number for the
voice variant. Having two numbers would lead to confusion and
authoring mistakes. It therefore seemed wise to use a simple
enumeration.

> 4.2. In addition to 'male' and 'female' the <generic-voice> families
> should include 'natural' and 'artificial' as many synthesizers have a
> robot-like voice that is neither male nor female. Note that SSML
> defines 'neutral' so as a minimum this should be added for
> compatibility.

Agreed and thanks for spotting this.

> 4.3. As per my 2003 comments, although I like the fact there is a
> facility for selecting variations, using <number> for specifying them
> is not a satisfactory solution.
>
> * firstly using absolute numbers is not very portable. If I write
>
> body { voice-family: male 1 }
> .foo { voice-family: male 2 }
> .bar { voice-family: male 3 }
>
> Then what happens if the synthesizer only has two male voices? When
> something of class 'bar' is rendered, does the synthesizer round-robin
> back to "male 1" or does it stay with the current voice because it
> doesn't have enough male voices? At the very least the specification
> should specify what "best effort" strategy the synthesizer should
> apply. This allows document authors to at least predict whether the
> voice will change or not (assuming the synthesizer has at least 2
> voices).

This is tricky given that SSML doesn't provide a specific algorithm
other than with respect to the value for xml:lang. The current
wording is the best I have been able to come up with.

>
> * The definition for <number> says: "e.g. the second or next male
> voice", but no way to indicate "next" and "previous" is given. Clearly
> '1', '2', '3' work for specifying variants absolutely, but how do I ask
> for the next voice?
> Perhaps something like this
>
> .foo { voice-family: male +1 } /* select the next male voice,
>                                   relative to the inherited voice */

In CSS +1 is the same as 1 so that wouldn't work.

> However this would be easier:
>
> Value: [[<specific-voice> | [<relative-voice-specifier>] [<age>]
> <generic-voice>],]*
> [<specific-voice> | [<relative-voice-specifier>] [<age>]
> <generic-voice>] | inherit
>
> <relative-voice-specifier>
> Possible values are 'previous' and 'next'
>
> .foo { voice-family: next old male }
>
> This would require vendors to order their voices, but it would allow
> document authors to reliably control whether the voice changes.
>
> e.g., assume a synthesizer has 3 male voices "Fred", "Bruce" and "Ralph"
>
> <ul>
>   <li>one</li>
>   <li><ul><li>foo</li>
>           <li>bar</li>
>       </ul>
>   </li>
> </ul>
>
> ul { voice-family: male; } --> Fred
> ul ul { voice-family: next male; } --> Bruce
> ul ul ul { voice-family: previous male; } --> Fred
>
> * Along similar lines, another value would be useful:
>
> <relative-voice-specifier>
> Possible values are 'previous', 'next' and 'different'
>
> ul { voice-family: young female; }
> /* slightly silly example, you probably wouldn't change the voice for
>    'em' */
> em { voice-family: different female; }
>
> 'different' is similar to 'previous' and 'next' but rather than cycling
> through the voices in a set order it asks the synthesizer to change the
> voice. The new voice should be as close to the inherited value as
> possible, within the constraints of the available voices. Thus the
> above 'em' declaration should first try to use a different 'young
> female' voice, then a different 'female' voice, then a 'neuter' and
> finally a 'male' voice if the synthesizer only has one female voice.
> Naturally all of these voices must speak the same language first and
> foremost!
>
> Overall I believe something like 'previous', 'next' and 'different'
> would be more useful, more intuitive and more portable than absolute
> integer indices.

Unfortunately the need to align with SSML precludes this. The speech
engine vendors are currently focusing on the VoiceXML market and are
much less interested in CSS, so for now, we need to align with SSML.

> 5.
> Section: Definition of 'voice-pitch'
>
> Regarding semitone changes: I think CSS should be harmonized with SSML
> unless adding the new unit to CSS is undesirable for some reason?

Adding the "st" suffix for semitones would be easy enough, but CSS
doesn't discriminate between +1 and 1, so 1st would be plus one
semitone while -1st would be minus one semitone. Would this feature
be useful in a style sheet, as opposed to an SSML file where it can
be applied at a fine-grained level in a pitch contour? The primary
use case was for when you wanted to get the TTS engine to "sing" by
tweaking the pitch contours. This remains an open issue.

>
> Thanks for your time. Be interested in hearing any feedback.

Thanks for your feedback. Would you be willing to help with work on
test suites and implementations?

- --
 Dave Raggett <dsr@w3.org>  W3C lead for voice and multimodal.
 http://www.w3.org/People/Raggett +44 1225 866240 (or 867351)

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBGKRib3AdEmxAsUsRAmXMAJ9Gw0+Q/mcTLfeLIgg76rx5mZ8fJwCfVtoo
XhkD9DuF/3JITGdyqQocSXk=
=lGPG
-----END PGP SIGNATURE-----
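
As a rough illustration of the declarations discussed in this thread,
the fragment below collects the 'cue-before', 'mark-before' and
'voice-family' forms in one place. It is only a sketch: it assumes the
property grammar of the 2004 CSS3 Speech draft as it is described in
the message above, and the selectors (h1, a, .foo, .bar) and the file
name bell.aiff are illustrative choices, not taken from the draft.

```css
/* Sketch only: assumes the 2004 CSS3 Speech draft grammar as described
   in this thread. Selectors and the file name are hypothetical. */

/* Substantive comment 2: an audio cue followed by a volume keyword. */
h1 { cue-before: url('bell.aiff') loud; }

/* Substantive comment 3: a mark named from an attribute, following the
   corrected definition  Value: <string> | attr(attribute-name). */
a { mark-before: attr(id); }

/* Substantive comment 4.3: absolute variant numbers. What a synthesizer
   with only two male voices does for "male 3" is left open here. */
body { voice-family: male 1; }
.foo { voice-family: male 2; }
.bar { voice-family: male 3; }
```

The relative forms proposed in the thread ('next', 'previous',
'different') are left out on purpose, since the reply explains that
alignment with SSML rules them out for now.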
Received on Tuesday, 10 August 2004 10:33:08 UTC