- From: A. Vine <andrea.vine@Sun.COM>
- Date: Wed, 15 Dec 2004 15:17:52 -0800
- To: www-international@w3.org
Elizabeth J. Pyatt wrote:

> I do see your point, but I'm not sure what language tag will help in
> your scenarios.

It's much safer to assume "Simplified Chinese, Mandarin in the PRC" when the tag is "zh-cn" than when it's "zh". I'm talking about existing tags, not those that might be interpreted in the future. Obviously if you get "zh-mandarin" or whatever the Mandarin subtag is, it doesn't really help (over "zh") for rendering or charset guessing, but it does for voice synthesis.

> Currently, I believe that the encoding tag (e.g. "gb5") plus the actual
> characters tells the browsers what to display with or without a language
> tag.

The GBn, CNSn, and Big5 encodings give pretty strong hints (though in some cases no more definitive than UTF-8). But UTF-8 doesn't. You'd then have to look at the codepoints used to determine the script, and thus the font.

> If you have a page tagged correctly for language BUT forget the
> encoding tag, you are in trouble generally speaking.

Yes, although by using the language tag along with charset heuristics you can make a fairly accurate guess.

> But if you forget
> the language tag but include the encoding tag, usually you will get good
> results visually.

Guessing the language from the charset can be much less accurate if you're using UTF-8, or ISO-8859-1 or -15, or several others. Language heuristics take more processing and a larger sample to get reasonable accuracy.

>> I don't know what voice synthesizer to load. I have to guess or make
>> assumptions or run some additional heuristics.
>
> And this is where I think the language tag is most valid. Because of the
> way the Chinese script (Simple/Traditional) is designed, it may be a bit
> "language blind" in some cases.
>
> A speaker in Hong Kong seeing the characters may read it in Cantonese
> (zh-, but a speaker in Beijing may read it in Mandarin Chinese. You
> could design a speech synthesizer which reads aloud either depending on
> user preferences.

True, so long as there are some user prefs available to the synthesizer, it could always default to those when in doubt. User prefs are always a good idea.

> This is why I claim that you may be stuck without a
> specific language code.

It's true, though the more specific the tag, the smaller the chance of matching. It might be nice to know that the German, if spoken through a synthesizer, is that of the Lauterbach dialect, but if the tag is "de-DE-lauterbach" then even though the text is perfectly understandable in its written form to many other German readers, it won't match someone's "de-DE" language preference and so will not be shown.

Andrea
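
The matching concern in the last paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `exact_match` and `prefix_match` are hypothetical, and which of the two behaviours a given browser, server, or synthesizer actually uses is not something this message establishes. The exact comparison reproduces the mismatch described above. The prefix comparison follows the rule given for Accept-Language ranges in HTTP/1.1 (a range matches a tag if it equals the tag or is a prefix of it followed by "-"), and shows that a matcher built that way would still serve the more specific content.

```python
def exact_match(preference: str, content_tag: str) -> bool:
    """Hypothetical matcher: the tags must be identical (case-insensitive)."""
    return preference.lower() == content_tag.lower()


def prefix_match(preference: str, content_tag: str) -> bool:
    """Hypothetical matcher: the preference matches if it equals the content
    tag or is a prefix of it that ends at a '-' subtag boundary."""
    pref = preference.lower()
    tag = content_tag.lower()
    return tag == pref or tag.startswith(pref + "-")


if __name__ == "__main__":
    user_pref = "de-DE"
    page_tag = "de-DE-lauterbach"

    # Exact comparison: the more specific tag fails to match the preference.
    print(exact_match(user_pref, page_tag))

    # Prefix comparison: the preference is a prefix of the page's tag.
    print(prefix_match(user_pref, page_tag))

    # The reverse direction does not match: a very specific preference
    # is not satisfied by a more general tag under this rule.
    print(prefix_match(page_tag, user_pref))
```

Either way, the trade-off stands: the more subtags a tag carries, the more the outcome depends on how the receiving software compares tags.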
Received on Wednesday, 15 December 2004 23:13:03 UTC