"phonemes" property in the CSS3 Speech module from Stephen Zilles on 2011-02-03 (www-style@w3.org from February 2011)

From: Stephen Zilles <szilles@adobe.com>
Date: Thu, 3 Feb 2011 15:31:30 -0800
To: "www-style@w3.org list" <www-style@w3.org>
Message-ID: <CE2F61DA5FA23945A4EA99A212B1579537CD82899B@nambx03.corp.adobe.com>

There was an interesting and informative discussion of the "phonemes" property
http://dev.w3.org/csswg/css3-speech/#issue-phonemes
in the CSS3 Speech Module during the WG meeting last Wednesday
http://lists.w3.org/Archives/Public/www-style/2011Feb/0029.html

[The comments below are not based on any expertise in speech synthesis; I have none. They are from the perspective of "What is styling, and what makes sense in CSS?" Some apparent styling options may not make sense in CSS because of the structure of the Web authoring environment.]

The "phonemes" property is part of a two-part solution for giving alternate pronunciations to text that is rendered as synthesized speech. Typically, speech would be synthesized using a particular "accent", say American, to determine the rules for pronouncing the text. But suppose the author wanted to show the distinction between two ways of pronouncing "tomato"; namely, "toe-may-toe" and "toe-mahh-toe". The default pronunciation of the text "tomato" would give only one of these pronunciations, say the first one. To get the other pronunciation, "toe-mahh-toe", it would be necessary to use different pronunciation rules.

The "phonemes" property (together with the @phonetic-alphabet rule) is designed to allow an author to define and use a different pronunciation. The @phonetic-alphabet rule allows a document to specify, in the stylesheet, a single phonetic alphabet, such as the "International Phonetic Alphabet", that will be used to express the non-default pronunciations.

To allow a sequence of text, say a <span>, to be given a different pronunciation, the span must be given some identification, typically an ID attribute, that can be referenced in the selector of a style ruleset. A declaration of this ruleset would then specify the "phonemes" property with a string value that expresses the desired pronunciation of the content of the span using characters in the specified phonetic alphabet.

It is reasonable to argue that changing the pronunciation of the word, "tomato" is a stylistic change. Certainly the underlying text remains the same in both cases. For that reason, some people argued that this is a reasonable use of CSS and styling.

There is another viewpoint, however. This viewpoint notes that the pronunciation change replaces whatever the content of the span is, whether or not that content spells "tomato". That is, if someone decides that the example would be better with the word "vitamin" and edits the content, then unless the style ruleset is also edited to supply the matching pronunciation, the text "vitamin" would be pronounced "toe-mahh-toe", clearly an unintended effect. [The word "vitamin" is pronounced as "veye-tah-min" in American English and as "vih-tah-min" in British English, so it too would likely need the different pronunciation mechanism.]
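
To make the editing hazard concrete, here is a minimal Python sketch of the situation. It is only a model, not how any user agent implements the draft: the dictionaries, the "#example" ID, and the function names are assumptions made for illustration, and the pronunciation string is the informal respelling used above rather than real phonetic-alphabet notation.

```python
# Hypothetical model: the "stylesheet" and the "document" are separate objects,
# just as they are usually separate files.
stylesheet = {
    # ID selector -> declarations (informal respelling, not real phonetic notation)
    "#example": {"phonemes": "toe-mahh-toe"},
}

document = [
    {"id": None, "text": "I say"},
    {"id": "example", "text": "tomato"},
]


def spoken_form(element, rules):
    """Return what a synthesizer would be asked to say for one element."""
    declarations = rules.get(f"#{element['id']}", {}) if element["id"] else {}
    # "phonemes" replaces the element's content outright; it never inspects it.
    return declarations.get("phonemes", element["text"])


def speak(doc, rules):
    """Return the spoken form of every element in document order."""
    return [spoken_form(element, rules) for element in doc]


before = speak(document, stylesheet)

# An editor changes only the document and forgets the stylesheet.
document[1]["text"] = "vitamin"
after = speak(document, stylesheet)

print(before)  # ['I say', 'toe-mahh-toe']
print(after)   # ['I say', 'toe-mahh-toe'], although the text now reads "vitamin"
```

Nothing in the second file signals that the first one has changed, which is the inconsistency this viewpoint is worried about.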

This second viewpoint suggests several ways to resolve the issue. The first of these ways is, I believe, fanciful, but it is instructive about the nature of a solution. Instead of having a "phonemes" property, we might solve the problem with a "dialect" property. The "dialect" property (which assumes there is a standard notation for specifying a dialect in which to speak, something that I doubt exists) would, like the "phonemes" property, attach a label to the span in question; here the label names a dialect, say "American" or "British". The speech synthesis system would then have to be able to speak in a number of dialects, each called out by one of the dialect labels. But, in every case, the text of the span would be what is input to speech synthesis, and the dialect label would "style" the speech. If the content were changed from "tomato" to "vitamin", then, depending on which dialect label was used, the spoken styling would change consistently without any change to the stylesheet.
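
A rough Python sketch of this "dialect" idea follows. Everything in it is hypothetical: the tiny lexicons stand in for the large per-dialect dictionaries a real synthesizer would need, and the names are invented for illustration. The point is only that the stylesheet carries a label while the element's own text remains the input.

```python
# Hypothetical per-dialect lexicons; real ones would be far larger.
LEXICONS = {
    "American": {"tomato": "toe-may-toe", "vitamin": "veye-tah-min"},
    "British": {"tomato": "toe-mahh-toe", "vitamin": "vih-tah-min"},
}


def pronounce(text, dialect):
    """Style the element's own text with a dialect; fall back to the text itself."""
    return LEXICONS[dialect].get(text, text)


# The stylesheet only labels the element; it is never edited below.
stylesheet = {"#example": {"dialect": "British"}}
element = {"id": "example", "text": "tomato"}

dialect = stylesheet["#" + element["id"]]["dialect"]
first = pronounce(element["text"], dialect)

# Only the content changes; the styling follows automatically.
element["text"] = "vitamin"
second = pronounce(element["text"], dialect)

print(first)   # toe-mahh-toe
print(second)  # vih-tah-min
```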

But, I (a rank amateur in speech synthesis) am unaware of any standard encoding of dialects which would, in any case, likely require a large dictionary for each dialect. So, what does the above fanciful solution tell us? It tells us that either the styling ought to be acting on the content of the styled element or that any alternative styling that replaces the content ought to be part of the element itself and not part of the stylesheet used for the document. This is a consequence of the separation of styling and content. These have become separate files and requiring simultaneous edits of both to make changes has been shown to often lead to inconsistencies.

When the content is what is styled as in the "dialect" case, it suffices to edit the content of the element and the styling will follow. If the content is replaced rather than being styled, it is necessary to edit both the content of the element and the alternate pronunciation. This is facilitated if both the content and the pronunciation are part of the same element so the need for editing both is more apparent.

So, given this viewpoint, one solution would be to have an attribute, for example "pronounceAs", on the element (the above "span") that is to have a different pronunciation. The value of this attribute would be the same as the value of the "phonemes" property. The only distinction between "pronounceAs" and "phonemes" is where the pronunciation data is stored: in the former, it is with the content it replaces; in the latter, it is with the stylesheet.

OK, this would be a better solution to the editing problem, but it seems to require introducing another special attribute to both XML and HTML. This would be a pain.

To avoid having to identify a specific XML attribute name, such as "pronounceAs", one could have a CSS property, for example "UseToPronouce", that controls whether the pronunciation data is used. This property, when used in a ruleset with an "attribute selector" that matches the attribute whose value holds the pronunciation string, would trigger the pronunciation replacement when its value was "always" and would ignore the pronunciation data when its value was "never". Of course, not having a ruleset that selects for that attribute would also ignore the pronunciation data, so there is not much use for the "never" value except when using the CSSOM to turn off alternative pronunciations.
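
The following Python sketch models that behaviour under simplifying assumptions: each entry in "rules" stands for a ruleset whose attribute selector names one attribute and whose only declaration sets the proposed "UseToPronouce" property (spelled here as in this message). The data structures and function name are hypothetical, and the cascade is reduced to a simple lookup.

```python
# The pronunciation data lives on the element, next to the content it replaces.
element = {"text": "tomato", "attributes": {"pronounceAs": "toe-mahh-toe"}}

# Attribute named by the selector -> value of the proposed "UseToPronouce" property.
rules_always = {"pronounceAs": "always"}
rules_never = {"pronounceAs": "never"}
no_rules = {}


def spoken_form(element, rules):
    """Use an attribute's value as the pronunciation only if a rule enables it."""
    for name, data in element["attributes"].items():
        if rules.get(name) == "always":
            return data
    # "never", or no matching ruleset at all: speak the content itself.
    return element["text"]


print(spoken_form(element, rules_always))  # toe-mahh-toe
print(spoken_form(element, rules_never))   # tomato
print(spoken_form(element, no_rules))      # tomato
```

Note that the stylesheet side now names only an attribute and a switch; the pronunciation string itself never appears there.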

The above approach would solve most of what "phonemes" was intended to do. The part that is still missing is the mechanism for specifying which phonetic alphabet is being used in the pronunciation data. This is done with an "@phonetic-alphabet" rule in the existing CSS3 Speech WD. This again has the problem that the information is in the stylesheet rather than in the document being styled. This is probably a less serious problem for editing than the "phonemes" problem, but it does have the same risk of requiring two files to be edited to make a change; in this case, a change of the phonetic alphabet being used.

The same scheme used above to specify that an XML attribute is carrying pronunciation data and to enable its usage (a property analogous to "UseToPronouce") can be used to identify and enable an attribute that has as its value the identifier of a phonetic alphabet.

Since the names of the attributes that identify the phonetic-alphabet and the pronunciation data are not codified in the solution immediately above, it is possible to have multiple sets of pronunciation data with a different attribute name for each kind of data; that is, each set would use a different phonetic-alphabet. Then a media query could be used to choose the ruleset that selected on the phonetic alphabet understood by the User Agent on which the document was being spoken.
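
As a sketch of how that selection might work, the Python below models each media-query-guarded ruleset as a pair: the alphabet the User Agent must understand, and the attribute that ruleset selects. The attribute names, the alphabet identifiers, and the idea of querying a User Agent for its supported alphabets are all assumptions made for illustration; no such media feature is defined anywhere that I know of. The transcriptions are illustrative only.

```python
# One element carrying the same pronunciation in two phonetic alphabets.
element = {
    "text": "tomato",
    "attributes": {
        "pron-ipa": "təˈmɑːtəʊ",       # illustrative IPA transcription
        "pron-xsampa": 't@"mA:t@U',    # the same sounds in X-SAMPA
    },
}

# (alphabet the UA must understand, attribute the guarded ruleset selects)
guarded_rules = [
    ("ipa", "pron-ipa"),
    ("x-sampa", "pron-xsampa"),
]


def spoken_form(element, guarded_rules, ua_alphabets):
    """Pick the first pronunciation whose alphabet the User Agent understands."""
    for alphabet, attribute_name in guarded_rules:
        if alphabet in ua_alphabets and attribute_name in element["attributes"]:
            return alphabet, element["attributes"][attribute_name]
    # No usable pronunciation data: fall back to the element's own text.
    return None, element["text"]


print(spoken_form(element, guarded_rules, {"x-sampa"}))
print(spoken_form(element, guarded_rules, set()))
```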

Note, the above contribution is not entirely original. It is modeled on the way that AltGlyphs
http://www.w3.org/TR/SVG11/text.html#AlternateGlyphs
are specified in SVG. SVG has two (standardized) attributes that can be used on an AltGlyph element. (An AltGlyph element is really a specialized Tspan element that allows these two extra attributes.) The two attributes specify the font file "format" from which the replacement glyphs (versus replacement pronunciation) are drawn and the "glyphRef" which identifies (in a scheme appropriate to the chosen font file format) the glyph (or glyphs) that is to replace the spanned content. This is a solution that has worked for SVG for years.

So, in summary, because documents are edited and because requiring edits to affect two separate files is generally a bad idea, it makes sense to attach replacement data (whether pronunciations or different presentation glyphs) to the content that is being replaced. This can be done by putting the replacement data in an attribute of the element whose content is being replaced and using the selection mechanisms of CSS to enable the use of that data to replace the content. Then, as is appropriate, CSS controls the styling (that is, whether the replacement data is used) but does not carry that data.

Steve Zilles

Received on Thursday, 3 February 2011 23:32:07 UTC