- From: Daniel Weck <daniel.weck@gmail.com>
- Date: Fri, 4 Feb 2011 09:33:26 +0000
- To: "www-style@w3.org list" <www-style@w3.org>
Thanks for your thorough analysis, Stephen.

I agree that the inherent connexion between text (in the HTML document) and associated pronunciation in the CSS is a valid argument in favor of moving the "phonemes" property (and the declaration of its associated phonetic dictionary) from CSS3-Speech into the markup itself. This would clearly make it easier for authors to maintain content. By the way, this is the path chosen by the current draft of EPUB3 ( http://epub-revision.googlecode.com/svn/trunk/build/spec/epub30-overview.html#sec-tts ).

Please note that although the "tomato" pronunciation example indeed ties in well with the concept of "dialect" (different accents), there are other use cases where disambiguation is required within the *same* "dialect". For example, the text token "read" in British English may be spoken as 'reed' or 'red'. Text-To-Speech engines usually process such tokens based on the surrounding context, but there are cases where the lack of context requires explicit authoring of a pronunciation rule (e.g. the line of text "I read it.").

Also note that content replacement in CSS is analogous to the text normalization phase that precedes the text-to-phoneme conversion in speech systems. TTS engines carry out their own text normalization based on pre-defined rules (to deal with dates, currencies, abbreviations, etc.), but authors must be able to enforce specific rules (which may override the default behavior).
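
To make the override idea concrete, here is a minimal Python sketch of a normalization step in which author rules take precedence over built-in ones. It is not taken from any TTS engine or from the CSS3-Speech draft: the names DEFAULT_RULES and normalize, and the sample expansions, are hypothetical and only stand in for whatever rule set a real engine ships with.

```python
import re

# Hypothetical built-in rules, standing in for a TTS engine's default
# text normalization (abbreviations only, to keep the sketch short).
DEFAULT_RULES = {
    "Dr.": "Doctor",
    "St.": "Street",
}


def normalize(text, author_rules=None):
    """Expand known tokens; author rules override the defaults."""
    # Later entries win, so an author rule replaces a default with the same key.
    rules = {**DEFAULT_RULES, **(author_rules or {})}
    # Try longer tokens first so overlapping keys behave predictably.
    keys = sorted(rules, key=len, reverse=True)
    pattern = "|".join(re.escape(key) for key in keys)
    return re.sub(pattern, lambda match: rules[match.group(0)], text)
```

With the defaults alone, normalize("Dr. Smith lives on Main St.") gives "Doctor Smith lives on Main Street". An author who knows better can pass {"St.": "Saint"}, and normalize("St. Paul", {"St.": "Saint"}) then gives "Saint Paul".
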
CSS-based content generation and replacement can be misused, of course, but in our current CSS3-Speech draft we give an example with "abbr", which showcases the clear separation of data and styling (no additional data is provided by the CSS rule):

http://dev.w3.org/csswg/css3-speech/#content

Regards, Daniel

On 3 Feb 2011, at 23:31, Stephen Zilles wrote:

> There was an interesting and informative discussion of the “phonemes” property
> http://dev.w3.org/csswg/css3-speech/#issue-phonemes
> in the CSS3 Speech Module during the WG meeting last Wednesday
> http://lists.w3.org/Archives/Public/www-style/2011Feb/0029.html
>
> [The comments below are not based on any expertise in speech synthesis; I have none. They are from the perspective of “What is styling and what makes sense in CSS”. Sometimes, some apparent styling options may not make sense in CSS due to the structure of the Web authoring environment.]
>
> The “phonemes” property is part of a two-part solution to giving alternate pronunciations to text that is receiving synthesized speech. Typically, speech would be synthesized using a particular “accent”, say American, to determine the rules for pronouncing the text. But, say the author wanted to show the distinction between two ways of pronouncing “tomato”; namely, “toe-may-toe” and “toe-mahh-toe”. The default pronunciation of the text “tomato” would give only one of these pronunciations, say the first one. To get the other pronunciation, “toe-mahh-toe”, it would be necessary to use different pronunciation rules.
>
> The “phonemes” property (together with the @phonetic-alphabet rule) is designed to allow an author to define and use a different pronunciation. The @phonetic-alphabet rule allows a document to specify, in the stylesheet, a single phonetic alphabet, such as the “International Phonetic Alphabet”, that will be used to express the non-default pronunciations.
>
> To allow a sequence of text, say a <span>, to be given a different pronunciation, the span must be given some identification, typically an ID attribute, that can be referenced in the selector of a style ruleset. A declaration of this ruleset would then specify the “phonemes” property with a string value that expresses the desired pronunciation of the content of the span using characters in the specified phonetic alphabet.
>
> It is reasonable to argue that changing the pronunciation of the word “tomato” is a stylistic change. Certainly the underlying text remains the same in both cases. For that reason, some people argued that this is a reasonable use of CSS and styling.
>
> There is another viewpoint, however. This viewpoint notes that the pronunciation change replaces whatever the content of the span is, whether or not that content spells “tomato”. That is, if someone thinks that the example would be better with the word “vitamin”, then unless the style ruleset is also edited to change the pronunciation, the text “vitamin” would be pronounced “toe-mahh-toe”, clearly an unintended effect. [The word “vitamin” is pronounced as “veye-tah-min” in American and as “vih-tah-min” in English, so it too would likely need the different pronunciation mechanism.]
>
> This second viewpoint suggests several ways to resolve the issue. The first of these ways is, I believe, fanciful, but is instructive of the nature of a solution. Instead of having a “phonemes” property, we might solve the problem with a “dialect” property. The “dialect” property (which assumes there is a standard notation for specifying a dialect in which to speak – something that I doubt exists) would, like the “phonemes” property, attach a dialect label to the span in question, say “American” or “British”.
> Then the speech synthesis system would have to be able to speak in a number of dialects, each called out by one of the dialect labels. But, in every case, the text of the span would be what is input to speech synthesis. And the dialect label would “style” the speech. If the content were changed from “tomato” to “vitamin”, then depending on which dialect label were used the spoken styling would change consistently without any change to the stylesheet.
>
> But, I (a rank amateur in speech synthesis) am unaware of any standard encoding of dialects which would, in any case, likely require a large dictionary for each dialect. So, what does the above fanciful solution tell us? It tells us that either the styling ought to be acting on the content of the styled element or that any alternative styling that replaces the content ought to be part of the element itself and not part of the stylesheet used for the document. This is a consequence of the separation of styling and content. These have become separate files, and requiring simultaneous edits of both to make changes has been shown to often lead to inconsistencies.
>
> When the content is what is styled, as in the “dialect” case, it suffices to edit the content of the element and the styling will follow. If the content is replaced rather than being styled, it is necessary to edit both the content of the element and the alternate pronunciation. This is facilitated if both the content and the pronunciation are part of the same element so the need for editing both is more apparent.
>
> So, given this viewpoint, one solution would be to have an attribute, for example, “pronounceAs” on the element (the above “span”) that is to have a different pronunciation. The value of this attribute would be the same as the value of the “phonemes” property. The only distinction between “pronounceAs” and “phonemes” is where the pronunciation data is stored.
> In the former, it is with the content it replaces and in the latter, the data is with the stylesheet.
>
> OK, this would be a better solution to the editing problem, but it seems to require introducing another special attribute to both XML and HTML. This would be a pain.
>
> To avoid having to identify a specific XML attribute name, such as “pronounceAs”, one could have a CSS property, for example, “UseToPronouce” that controls whether the pronunciation data is used. This property, when used in a ruleset with an “attribute selector” that matches the attribute whose value has the pronunciation string, would either trigger the pronunciation replacement when the value was “always” or would ignore the pronunciation replacement if the value was “never”. Of course, not having a ruleset that selects for that attribute would also ignore the pronunciation data, so there is not much use for the “never” value except when using the CSSOM to turn off alternative pronunciations.
>
> The above approach would solve most of what “phonemes” was intended to do. The part that is still missing is the mechanism for specifying which phonetic-alphabet is being used in the pronunciation data. This is done with an “@phonetic-alphabet” rule in the existing CSS3 Speech WD. This again has the problem that the information is in the stylesheet rather than the document being styled. This is probably a less serious problem for editing than is the “phonemes” problem, but it does have the same risk of requiring two files to be edited to make a change; in this case, a change of the phonetic-alphabet being used.
>
> The same scheme used above to specify that an XML attribute is carrying pronunciation data and to enable its usage (a property analogous to “UseToPronouce”) can be used to identify and enable an attribute that has as its value the identifier of a phonetic-alphabet.
>
> Since the names of the attributes that identify the phonetic-alphabet and the pronunciation data are not codified in the solution immediately above, it is possible to have multiple sets of pronunciation data with a different attribute name for each kind of data; that is, each set would use a different phonetic-alphabet. Then a media query could be used to choose the ruleset that selected on the phonetic alphabet understood by the User Agent on which the document was being spoken.
>
> Note, the above contribution is not entirely original. It is modeled on the way that AltGlyphs
> http://www.w3.org/TR/SVG11/text.html#AlternateGlyphs
> are specified in SVG. SVG has two (standardized) attributes that can be used on an AltGlyph element. (An AltGlyph element is really a specialized Tspan element that allows these two extra attributes.) The two attributes specify the font file “format” from which the replacement glyphs (versus replacement pronunciation) are drawn and the “glyphRef” which identifies (in a scheme appropriate to the chosen font file format) the glyph (or glyphs) that is to replace the spanned content. This is a solution that has worked for SVG for years.
>
> So, in summary, because documents are edited and because requiring edits to affect two separate files is generally a bad idea, it makes sense to attach replacement data (whether pronunciations or different presentation glyphs) to the content that is being replaced. This can be done by putting the replacement data in an attribute of the element whose content is being replaced and using the selection mechanisms of CSS to enable the use of that data to replace the content. Then, as appropriate, CSS is controlling the styling and the use of the replacement data, but is not carrying that data.
>
> Steve Zilles
Received on Friday, 4 February 2011 10:40:53 UTC