Re: Language Identifier List up for comments from A. Vine on 2004-12-15 (www-international@w3.org from October to December 2004)

From: A. Vine <andrea.vine@Sun.COM>
Date: Wed, 15 Dec 2004 15:17:52 -0800
To: www-international@w3.org
Message-id: <41C0C620.1000900@sun.com>

Elizabeth J. Pyatt wrote:

> I do see your point, but I'm not sure what language tag will help in 
> your scenarios.

It's much safer to assume "Simplified Chinese, Mandarin in the PRC" when 
the tag is "zh-cn" than when it's "zh".  I'm talking about existing 
tags, not those that might be interpreted in the future.  Obviously if 
you get "zh-mandarin" or whatever the Mandarin subtag is, it doesn't 
really help (over "zh") for rendering or charset guessing, but it does 
for voice synthesis.

> 
> Currently, I believe that the encoding tag (e.g. "gb5") plus the actual 
> characters tells the browsers what to display with or without a language 
> tag.

The GBn, CNSn, and Big5 encodings give pretty strong hints (though in 
some cases no more definitive than UTF-8).  But UTF-8 doesn't.  You'd 
have to then look at the codepoints used to determine the script, and 
thus the font.
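
As a rough illustration of "look at the codepoints used to determine the script", here is a minimal Python sketch. The range table and the names script_of and dominant_script are assumptions made up for this example, and the table is deliberately incomplete; a real implementation would use the full Unicode script data.

```python
from collections import Counter

# Hypothetical, deliberately incomplete table of codepoint ranges.
# Each entry is (first codepoint, last codepoint, script name).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),     # A-Z
    (0x0061, 0x007A, "Latin"),     # a-z
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),       # CJK Unified Ideographs
    (0xAC00, 0xD7A3, "Hangul"),    # Hangul syllables
]


def script_of(ch):
    """Return the script name for one character, or None if not in the table."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def dominant_script(text):
    """Return the most frequent known script in text, or None if none is found."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Note that a result of "Han" narrows the font choice only so far: the same codepoints are shared by Chinese and Japanese text, which is where a language tag still helps.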

> If you have a page tagged correctly for language BUT forget the 
> encoding tag, you are in trouble generally speaking. 

Yes, although by using the language tag along with charset heuristics 
you can make a fairly accurate guess.

> But if you forget 
> the language tag but include the encoding tag, usually you will get good 
> results visually.

Guessing the language from the charset can be much less accurate if 
you're using UTF-8, or ISO-8859-1 or -15, or several others.  Language 
heuristics take more processing and a larger sample to get reasonable 
accuracy.

> 
>>  I don't know what voice synthesizer to load.  I have to guess or make 
>> assumptions or run some additional heuristics.
> 
> 
> And this is where I think the language tag is most valid. Because of the 
> way the Chinese script (Simple/Traditional) is designed, it may be a bit 
> "language blind" in some cases.
> 
> A speaker in Hong Kong seeing the characters may read it in Cantonese 
> (zh-, but a speaker in Beijing may read it in Mandarin Chinese. You 
> could design a speech synthesizer which reads aloud either depending on 
> user preferences.

True, so long as there are some user prefs available to the synthesizer, 
it could always default to those when in doubt.  User prefs are always a 
good idea.

> This is why I claim that you may be stuck without a 
> specific language code.

It's true, though the more specific the tag, the lower the chance of 
matching.  It might be nice to know that the German, if spoken through a 
synthesizer, is that of the Lauterbach dialect, but if the tag is 
"de-DE-lauterbach" then even though the text is perfectly understandable 
in its written form to many other German readers, it won't match 
someone's "de-DE" language preference and so will not be shown.
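
To make the matching concern concrete, here is a small Python sketch comparing two ways a "de-DE" preference might be tested against a "de-DE-lauterbach" tag. The function names are hypothetical. The first is a plain exact comparison, which is the case described above; the second follows the prefix rule that RFC 3066 gives for language ranges, and which software may or may not implement.

```python
def exact_match(preference, tag):
    """Match only if the preference and the tag are identical (case-insensitive)."""
    return preference.lower() == tag.lower()


def prefix_match(preference, tag):
    """Match if the preference equals the tag, or is a prefix of the tag
    that ends exactly at a '-' subtag boundary (case-insensitive)."""
    pref = preference.lower()
    t = tag.lower()
    return t == pref or t.startswith(pref + "-")


# The overly specific tag fails an exact comparison against "de-DE" ...
print(exact_match("de-DE", "de-DE-lauterbach"))
# ... but is accepted when the preference is treated as a prefix.
print(prefix_match("de-DE", "de-DE-lauterbach"))
```

Either way, the reverse case does not match: a preference for "de-DE-lauterbach" will not select content tagged only "de-DE" under these two rules.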

Andrea

Received on Wednesday, 15 December 2004 23:13:03 UTC