- From: Andrew Thompson <lordpixel@mac.com>
- Date: Sat, 21 Jun 2003 11:41:54 -0400
- To: www style <www-style@w3.org>
Hi

What follows are some personal comments on the first published draft of the CSS 3 speech module (May 14th Draft).

Disclaimer: these comments are my own personal opinion. This is worth stating as I'm a member of the JSR-113 Expert Group for the development of version 2 of the Java Speech API. I'm not speaking for that group in any way in this message.

I'll go through the document in order, by section, introducing my own numbering for convenience:

1. Introduction

The example in the introduction looks like it has a couple of typos (what is voice-: ?):

    p.heidi { voice-: left; voice-family: female }
    P.peter { voice-: right; voice-family: male }

2. voice-balance

The prose for this section mentions two values, 'leftwards' and 'rightwards', that are not included in the list of valid values given at the beginning of the voice-balance definition. The two should be consistent.

In general I approve of this simplification. It's probably actually implementable, whereas the previous azimuth model wasn't really practical.

3. speak

Obviously perfect correspondence with the SSML 'say-as' element is difficult. It does seem there are some gaps though:

* CSS has 'spell-out' where SSML uses 'letters'
* CSS lacks 'date' as an option
* CSS lacks the ability to mark a number as a telephone number (ah, I see now this is covered by 14. interpret-as below)
* CSS lacks the 'words' option, used to force an acronym like "ASCII" to be pronounced as a word rather than as letters.

In the literal-punctuation and no-punctuation sections, the prose description reads: "Similar as 'normal' value but ...". Change to "Similar to 'normal' value but...".

Similarly, below, change: "Speech synthesisers are knowledgeable about what is a number and what isn't" to "... what is and is not a number".

Also: was the British spelling of synthesizer intentional? (Not that I care, being British myself, but I thought the W3C standard was American English.)

First editor's note: agree on not trying to deal with cardinal and ordinal for now. I recall considerable feedback on this part of SSML in the last review cycle. Wait to see what the SSML team decide to do before trying to merge into CSS.

Second editor's note reads in part: "The value 'code' has been replaced by 'all' ...". However, the speak property does not define a value called 'all'. Did you mean to say 'literal-punctuation' or something else?

4. pause-before, pause-after, pause

I'm surprised to see CSS doesn't include the textual values allowed by the 'break' element in SSML (x-small, small, medium, large, x-large). Be wary though: since the last SSML draft the working group agreed to redefine some of the 'break' values based on feedback. If you incorporate these into CSS, be sure to use the revised list as mentioned on www-voice.

It couldn't hurt to include examples of valid seconds and milliseconds values in the <time> section. Sure, people can look up CSS time units, but it'd be easier if there were an example.

5. cue-before, cue-after, cue

There would seem to be a need for 'cue-during', indicating that a sample should be played in the background whilst an element is rendered. Does this introduce too much additional complexity? eg,

    p.noisyBar { cue-during: url("club-music.mp3"); }

I imagine 2 further facilities would be required to make this fully useful:

    cue-during: Value [<uri> looped <number>]

(apologies if that definition's not right)

The <number> would be defined as for voice-volume, and would be the sample volume. If 'looped' is present then the sample loops if it is too short to cover the element being rendered, otherwise it is played only once.

6. voice-family

<age> has been redefined in SSML to take a pure integer value (essentially, the string values were all dropped for reasons of political correctness: no one wanted to define a numeric equivalent for 'old').
You could choose to be consistent with this, though obviously it's then not entirely clear how one specifies 'child'. I guess one uses the number '5' and hopes the speech synthesiser is smart enough to figure out what you want. If you do keep the textual values, 'adult' was in the older drafts of SSML, I think.

Taking a look at the first example:

    h1 { voice-family: announcer old male };

The definition of this property is quite clever, but confusing. When I first saw 'announcer' I thought it was a generic voice that you'd forgotten to define. Looking at the syntax closely, I see one must define an age in order to use a generic voice (eg, voice-family: child male), so I can conclude 'announcer' is intended as a specific voice name. This doesn't seem very intuitive.

I do like the way you set it up as voice-family working just like font-family. A simple left to right priority list seems easier though (though I think you'd then need a separate property for the age):

    voice-family: david, announcer, male;
    voice-age: old;

The above seems more self explanatory.

I think you're going in the right direction with 'generic-family' but can get a lot more mileage out of it than just 'male' and 'female'. I would love to see 'robotic' (or artificial), 'natural', 'authoritative' etc. This would work nicely with the simpler form of voice-family proposed above:

    voice-family: Victoria, Agnes, female, natural; /* fall back to more and more general voices */

I like the idea of the <number> in this definition, because I think I know the problem you're trying to solve. In any case, I don't think the domain of 'positive integers' is the best one. The definition says:

"Indicates a preferred variant of the other voice characteristics. (e.g. the second or next male voice). Possible values are positive integers."

This doesn't help me much. I can see I could use '2' or '3' to ask for the 2nd or 3rd matching variant, but:

* how would I specify a relative value like 'next', as you imply I can? Use 1, 2? +1?
* this seems awfully brittle. Having the stylesheet know the order a synthesizer will try variants in seems unwise. Any configuration changes in the speech engine (or from one machine to another) and the stylesheet could mean something else entirely to what the author thought it did.
* you should say what happens if the variant is out of range. eg, if an engine has 2 old male voices, and I write

      voice-family: old male 3;

  does it wrap back around to the first?

I'm not sure how to specify variants better in the general case, but there is one common case I'd like to see implemented, which is a facility to force a change in the voice variant. This would be useful in many cases when parsing markup, as one can use voices to differentiate where nested content begins and ends. eg,

    <ul>
      <li>One</li>
      <li><ul>
        <li>Foo</li>
        <li>Bar</li>
      </ul>
      </li>
      <li>three</li>
    </ul>

    ul { voice-family: young female }
    ul ul { voice-family: change }

Here the value 'change' indicates the value should change from the inherited value. I'd say it should first look for the closest variant (young female 2 in the notation in the working draft), then, if there are no suitable matching voices, relax the constraints until it finds another voice. The point is that a best effort must be made to change the current voice. The only time the voice would remain the same would be if a synthesizer only has one voice installed for the language being spoken.

I'd also like to be able to write a selector which says the voice should change for however many nested levels of <ul> I might encounter, so I don't need to write anything like this:

    ul {}
    ul ul {}
    ul ul ul {}
    etc ...

but that's a suggestion for the Selectors module, I think.

7. voice-rate

Are percentages allowed in this property or not? The definition says "refer to inherited value" but <percentage> is not listed in the value section or the prose.
Please remove or define <percentage>.

The editor's note claims: "The values 'faster' and 'slower' were removed to be consistent with SSML." but in fact you did not remove them; they're still there.

Will you support relative numbers (+5, -10) as per SSML or defer to just percentages?

8. voice-pitch

The same comments apply as for voice-rate: <percentage>'s definition is missing, and there is the question of whether to support relative numbers or not.

"SSML allows for relative values in semitones. This would necessitate a new CSS unit "st". How valuable is this? What about the alternative of providing 'higher' and 'lower' for consistency with other related voice properties?"

I've no real opinion on semitones, but the synthesiser I've worked with most doesn't use them.

On the question of 'higher' and 'lower': well, it seems like you've removed most of the relative values from the other properties, so adding them here only makes sense if you reverse that decision.

9. voice-pitch-range

Once again <percentage> is implied by "refers to inherited value", but not defined.

In SSML valid values for 'range' include x-high and x-low; these are missing from CSS.

higher and lower seem a little silly on this property (and wider and narrower might be better anyway).

If you add semitones to voice-pitch, be sure to add them here too.

10. Pitch Contour

You've not included the SSML pitch contour concept in CSS?

11. voice-stress

Looks fine, but no support at all for relative values. Not sure how valuable they would be.

12. voice-duration

"This allows authors to specify how long they want a given element to be rendered. "

This seems like awkward phrasing. How about:

"This allows authors to specify over what time period they want a given element to be rendered. "

Similarly:

"Specifies a value in seconds or milliseconds for the desired time to take to speak the element contents"

reads better as

"Specifies the time in seconds or milliseconds that should be taken to speak the element contents"

13. phonemes

The example is going to be a problem. I've got a pretty wide set of fonts but IPA glyphs seem hard to come by: the example doesn't render right for me.

14. interpret-as

I see this property also maps to 'say-as' in SSML. The definition should be moved so that it follows that of the 'speak' property to avoid confusion (eg, I asked above if 'telephone' should be a valid value for 'speak' whereas it's obviously defined here instead).

CSS has a richer set of values than SSML. Will you try to influence the SSML group to include some of the values from CSS? The 'word' option from say-as still seems to be missing from CSS though.

I agree with the last comment about the SSML say-as element: clearly it's in flux, which is going to make it hard to track. It would be very good if the two specifications can agree though.

Thanks for your time. I hope these comments are useful.

AndyT (lordpixel - the cat who walks through walls)
A little bigger on the inside

(see you later space cowboy ...)
Received on Saturday, 21 June 2003 11:41:56 UTC