Re: "phonemes" property in the CSS3 Speech module from Daniel Weck on 2011-02-04 (www-style@w3.org from February 2011)

From: Daniel Weck <daniel.weck@gmail.com>
Date: Fri, 4 Feb 2011 09:33:26 +0000
To: "www-style@w3.org list" <www-style@w3.org>
Message-Id: <1A7785B7-397C-45F3-87C8-DA84A38B889E@gmail.com>
Thanks for your thorough analysis, Stephen.

I agree that the inherent connexion between text (in the HTML  
document) and associated pronunciation in the CSS is a valid argument  
in favor of moving the "phonemes" property (and the declaration of its  
associated phonetic dictionary) from CSS3-Speech into the markup  
itself. This would clearly make it easier for authors to maintain  
content. By the way, this is the path chosen by the current draft of  
EPUB3 ( http://epub-revision.googlecode.com/svn/trunk/build/spec/epub30-overview.html#sec-tts ).

Please note that although the "tomato" pronunciation example indeed
ties in well with the concept of "dialect" (different accents), there
are other use cases where disambiguation is required within the
*same* "dialect". For example, the text token "read" in British
English may be spoken as 'reed' or 'red'. Text-To-Speech engines
usually process such tokens based on the surrounding context, but there
are cases where the lack of context requires explicit authoring of a
pronunciation rule (e.g. the line of text "I read it.").

Also note that content replacement in CSS is analogous to the text  
normalization phase that precedes the text-to-phoneme conversion in  
speech systems. TTS engines carry out their own text normalization  
based on pre-defined rules (to deal with dates, currencies,  
abbreviations, etc.), but authors must be able to enforce specific  
rules (which may override the default behavior). CSS-based content  
generation and replacement can be misused of course, but in our  
current CSS3-Speech draft we give an example with "abbr", which  
showcases the clear separation of data and styling (no additional data  
is provided by the CSS rule): http://dev.w3.org/csswg/css3-speech/#content
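
A minimal sketch along the lines of that "abbr" example (not a verbatim
copy of the draft, and the sample sentence is made up). The rule only
says where the spoken text comes from; the expansion itself stays in
the markup:

```html
<style>
  /* Speak the expansion held in the title attribute in place of the
     abbreviation. The stylesheet carries no text of its own. */
  abbr { content: attr(title); }
</style>

<p>This draft is published by the
   <abbr title="World Wide Web Consortium">W3C</abbr>.</p>
```

Editing the title attribute changes what is spoken without touching
the stylesheet.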

Regards, Daniel

On 3 Feb 2011, at 23:31, Stephen Zilles wrote:

> There was an interesting and informative discussion of the “phonemes”
> property
>   http://dev.w3.org/csswg/css3-speech/#issue-phonemes
> in the CSS3 Speech Module during the WG meeting last Wednesday
>   http://lists.w3.org/Archives/Public/www-style/2011Feb/0029.html
>
> [The comments below are not based on any expertise in speech
> synthesis; I have none. They are from the perspective of “What is
> styling, and what makes sense in CSS?” Sometimes, some apparent
> styling options may not make sense in CSS due to the structure of
> the Web authoring environment.]
>
> The “phonemes” property is part of a two-part solution to giving
> alternate pronunciations to text that is receiving synthesized
> speech. Typically, speech would be synthesized using a particular
> “accent”, say American, to determine the rules for pronouncing the
> text. But, say the author wanted to show the distinction between two
> ways of pronouncing “tomato”; namely, “toe-may-toe” and
> “toe-mahh-toe”. The default pronunciation of the text “tomato” would
> give only one of these pronunciations, say the first one. To get the
> other pronunciation, “toe-mahh-toe”, it would be necessary to use
> different pronunciation rules.
>
> The “phonemes” property (together with the @phonetic-alphabet rule)
> is designed to allow an author to define and use a different
> pronunciation. The @phonetic-alphabet rule allows a document to
> specify, in the stylesheet, a single phonetic alphabet, such as the
> “International Phonetic Alphabet”, that will be used to express the
> non-default pronunciations.
>
> To allow a sequence of text, say a <span>, to be given a different  
> pronunciation, the span must be given some identification, typically  
> an ID attribute, that can be referenced in the selector of a style  
> ruleset. A declaration of this ruleset would then specify the  
> “phonemes” property with a string value that expresses the desired  
> pronunciation of the content of the span using characters in the  
> specified phonetic alphabet.
>
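A minimal sketch of the mechanism described in the paragraph above,
assuming the "@phonetic-alphabet" and "phonemes" syntax of the current
draft. The id "alt-tomato", the sample sentence and the IPA string are
illustrative only:

```html
<style>
  /* One phonetic alphabet for the whole stylesheet. */
  @phonetic-alphabet "ipa";

  /* Illustrative IPA for the "toe-mahh-toe" pronunciation, attached
     to the span through its id. */
  #alt-tomato { phonemes: "təˈmɑːtəʊ"; }
</style>

<p>Some say tomato, others say
   <span id="alt-tomato">tomato</span>.</p>
```

The first "tomato" is left to the engine's default rules; only the
span with the id picks up the alternate pronunciation.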
> It is reasonable to argue that changing the pronunciation of the
> word “tomato” is a stylistic change. Certainly the underlying text
> remains the same in both cases. For that reason, some people argued  
> that this is a reasonable use of CSS and styling.
>
> There is another viewpoint, however. This viewpoint notes that the
> pronunciation change replaces whatever the content of the span is,
> whether or not that content spells “tomato”. That is, if someone
> thinks that the example would be better with the word “vitamin”,
> then unless the style ruleset is also edited to change the
> alternate pronunciation, the text “vitamin” would be pronounced
> “toe-mahh-toe”, clearly an unintended effect. [The word “vitamin” is
> pronounced as “veye-tah-min” in American and as “vih-tah-min” in
> English, so it too would likely need the different pronunciation
> mechanism.]
>
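The failure mode described above can be sketched as follows, again
assuming the draft's "phonemes" syntax. The id "alt-word" and the IPA
string are hypothetical:

```html
<style>
  @phonetic-alphabet "ipa";

  /* Stale rule: written when the span still contained "tomato",
     and never updated. */
  #alt-word { phonemes: "təˈmɑːtəʊ"; }
</style>

<!-- The content was edited from "tomato" to "vitamin", but the
     stylesheet was not. An engine honoring the rule would still
     speak the tomato pronunciation for this span. -->
<p>Some say vitamin, others say
   <span id="alt-word">vitamin</span>.</p>
```

Nothing in the markup hints that a pronunciation is attached to the
span, which is why the mismatch is easy to miss.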
> This second viewpoint suggests several ways to resolve the
> issue. The first of these ways is, I believe, fanciful, but it is
> instructive of the nature of a solution. Instead of having a
> “phonemes” property, we might solve the problem with a “dialect”
> property. The “dialect” property (which assumes there is a standard
> notation for specifying a dialect in which to speak – something that
> I doubt exists) would, like the “phonemes” property, attach a dialect
> label to the span in question, say “American” or “British”. Then the
> speech synthesis system would have to be able to speak in a number
> of dialects, each called out by one of the dialect labels. But, in
> every case, the text of the span would be what is input to speech
> synthesis, and the dialect label would “style” the speech. If the
> content were changed from “tomato” to “vitamin”, then depending on
> which dialect label were used the spoken styling would change
> consistently without any change to the stylesheet.
>
> But, I (a rank amateur in speech synthesis) am unaware of any  
> standard encoding of dialects which would, in any case, likely  
> require a large dictionary for each dialect. So, what does the above  
> fanciful solution tell us? It tells us that either the styling ought  
> to be acting on the content of the styled element or that any  
> alternative styling that replaces the content ought to be part of  
> the element itself and not part of the stylesheet used for the  
> document. This is a consequence of the separation of styling and  
> content. These have become separate files and requiring simultaneous  
> edits of both to make changes has been shown to often lead to  
> inconsistencies.
>
> When the content is what is styled as in the “dialect” case, it  
> suffices to edit the content of the element and the styling will  
> follow. If the content is replaced rather than being styled, it is  
> necessary to edit both the content of the element and the alternate  
> pronunciation.  This is facilitated if both the content and the  
> pronunciation are part of the same element so the need for editing  
> both is more apparent.
>
> So, given this viewpoint, one solution would be to have an  
> attribute, for example, “pronounceAs” on the element (the above  
> “span”) that is to have a different pronunciation. The value of this  
> attribute would be the same as the value of the “phonemes” property.  
> The only distinction between “pronounceAs” and “phonemes” being  
> where the pronunciation data is stored. In the former, it is with  
> the content it replaces and in the latter, the data is with the  
> stylesheet.
>
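A minimal sketch of the attribute approach described above.
"pronounceAs" is the hypothetical attribute named in that paragraph
and is not part of HTML; the IPA value is illustrative:

```html
<!-- The pronunciation data travels with the content it replaces, so
     anyone changing the word sees the attribute next to it. -->
<p>Some say tomato, others say
   <span pronounceAs="təˈmɑːtəʊ">tomato</span>.</p>
```

Changing the word to "vitamin" now means editing a single element in
a single file.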
> OK, this would be a better solution to the editing problem, but it
> seems to require introducing another special attribute to both XML  
> and HTML. This would be a pain.
>
> To avoid having to identify a specific XML attribute name, such as
> “pronounceAs”, one could have a CSS property, for example,
> “UseToPronouce”, that controls whether the pronunciation data is
> used. This property, when used in a ruleset with an “attribute  
> selector” that matches the attribute whose value has the  
> pronunciation string, would either trigger the pronunciation  
> replacement when the value was “always” or would ignore the  
> pronunciation replacement if the value was “never”. Of course, not  
> having a ruleset that selects for that attribute would also ignore  
> the pronunciation data so there is not much use for the “never”  
> value except when using the CSSOM to turn off alternative  
> pronunciations.
>
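A sketch of how such a property might look in a stylesheet. Both
"UseToPronouce" and its values "always" and "never" are the
hypothetical names from the paragraph above, "pronounceAs" is just one
possible attribute name, and the class "plain-reading" is made up. How
the property would learn which attribute holds the data is not spelled
out above; the sketch assumes it is the attribute matched by the
selector:

```html
<style>
  /* Use the attribute value as the pronunciation wherever the
     attribute is present. */
  [pronounceAs] { UseToPronouce: always; }

  /* Switch the replacement off inside one part of the document. */
  .plain-reading [pronounceAs] { UseToPronouce: never; }
</style>

<p>Others say <span pronounceAs="təˈmɑːtəʊ">tomato</span>.</p>

<div class="plain-reading">
  <p>Here the <span pronounceAs="təˈmɑːtəʊ">tomato</span> is read
     with the engine's default rules.</p>
</div>
```

The stylesheet controls whether the data is used but carries none of
it, so a content edit stays confined to the markup.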
> The above approach would solve most of what “phonemes” was intended  
> to do. The part that is still missing is the mechanism for  
> specifying which phonetic-alphabet is being used in the  
> pronunciation data. This is done with an “@phonetic-alphabet” rule  
> in the existing CSS3 Speech WD. This again has the problem that the  
> information is in the stylesheet rather than the document being  
> styled. This is probably a less serious problem for editing than the
> “phonemes” problem, but it does have the same risk of requiring two
> files to be edited to make a change; in this case, a change of the  
> phonetic-alphabet being used.
>
> The same scheme used above to specify that an XML attribute is carrying
> pronunciation data and to enable its usage (a property analogous to
> “UseToPronouce”) can be used to identify and enable an attribute  
> that had as its value the identifier of a phonetic-alphabet.
>
> Since the names of the attributes that identify the
> phonetic-alphabet and the pronunciation data are not codified in the solution
> immediately above, it is possible to have multiple sets of  
> pronunciation data with a different attribute name for each kind of  
> data; that is, each set would use a different phonetic-alphabet.  
> Then a media query could be used to choose the ruleset that selected  
> on the phonetic alphabet understood by the User Agent on which the  
> document was being spoken.
>
> Note, the above contribution is not entirely original. It is modeled  
> on the way that AltGlyphs
>   http://www.w3.org/TR/SVG11/text.html#AlternateGlyphs
> are specified in SVG. SVG has two (standardized) attributes that can
> be used on an AltGlyph element. (An AltGlyph element is really a
> specialized Tspan element that allows these two extra attributes.)
> The two attributes specify the font file “format” from which the
> replacement glyphs (versus replacement pronunciation) are drawn and
> the “glyphRef”, which identifies (in a scheme appropriate to the
> chosen font file format) the glyph (or glyphs) that is to replace
> the spanned content. This is a solution that has worked for SVG for
> years.
>
> So, in summary, because documents are edited and because requiring  
> edits to affect two separate files is generally a bad idea, it makes  
> sense to attach replacement data (whether pronunciations or  
> different presentation glyphs) to the content that is being  
> replaced. This can be done by putting the replacement data in an
> attribute of the element whose content is being replaced and using
> the selection mechanisms of CSS to enable the use of that data to
> replace the content. Then, as appropriate, CSS is controlling the
> styling and the use of the replacement data, but is not carrying
> that data.
>
> Steve Zilles
>
Received on Friday, 4 February 2011 10:40:53 UTC