Re: Language Identifier List up for comments

Elizabeth J. Pyatt wrote:

> But now you are talking about differences in a script, not differences 
> in a language.

Um, when you're talking about the written word, they are somewhat 
inseparable.

> 
> You can use either Simplified or Traditional Characters to write 
> Mandarin Chinese (and Traditional can be used for Cantonese - I don't 
> know about Simplified, per se, for Cantonese).

Yes, I suspect many of us on this list are keenly aware of that.

> 
> Previously, the language codes have been used to encode both script and  
> language. I was assuming the characters embedded would convey which 
> script is being used.

Nope.  Take the EUC encodings - unless you know which EUC you're dealing 
with, the same bytes could be either script.  (Of course, a charset tag 
would solve that problem, but pretend there isn't one, because often 
there isn't.)  And that has little to do with fonts.  Fonts 
contain the glyphs they contain, no more, no less.  Even if the 
codepoints are distinctive (as in Unicode), you still need to know if 
you have to load a font to support those particular codepoints.  So 
you'd have to add heuristics to determine what sort of font you're 
looking for.  It's possible, but not done in most places.  The advantage 
of having the tag is to speed things up so that heuristics aren't necessary.
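
To give a feel for the kind of heuristic I mean, here is a rough sketch 
(not anything a real renderer is known to ship). The function name 
guess_script is made up for illustration, and the ranges cover only the 
main Unicode blocks for kana, Hangul syllables and unified Han ideographs. 
Note what it cannot do: Han codepoints alone don't tell it whether the 
text is Simplified or Traditional.

```python
def guess_script(text):
    """Guess which script family a font must cover, from codepoints alone.

    Hypothetical helper for illustration; only the main Unicode blocks
    are checked, so this is deliberately incomplete.
    """
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana blocks
            return "kana"
        if 0xAC00 <= cp <= 0xD7A3:   # Hangul syllables
            return "hangul"
        if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs
            has_han = True
    # Han ideographs are shared by Simplified and Traditional Chinese,
    # so this is as specific as a codepoint scan can get.
    return "han" if has_han else "other"
```

Both guess_script("漢字") and guess_script("汉字") come back as "han", 
so the scan still needs a tag (or a much heavier character table) to 
choose between a Simplified and a Traditional font.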

My point is that 'zh' is so nebulous in terms of language, dialect, 
and writing system as to be unhelpful for pretty much any processing. 
So assumptions will likely be made about the details (Simplified 
Chinese, Mandarin as used in the PRC).
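
Roughly, the assumption ends up looking like the sketch below. Everything 
in it is hypothetical: the font names are invented, the tag values are 
only examples of more specific identifiers, and ASSUMED_DEFAULT stands in 
for whatever guess a given piece of software happens to make.

```python
# Hypothetical font table keyed by lowercased language tag.
# Font names are placeholders, not real products.
FONT_BY_TAG = {
    "zh-cn": "ExampleSimplifiedFont",
    "zh-tw": "ExampleTraditionalFont",
    "zh-hans": "ExampleSimplifiedFont",
    "zh-hant": "ExampleTraditionalFont",
}

# The guess software tends to make when all it has is bare "zh".
ASSUMED_DEFAULT = "zh-cn"


def pick_font(tag):
    """Return (font_name, guessed) for a language tag.

    guessed is True when the tag was too generic and the default
    assumption had to be used instead of an exact match.
    """
    key = tag.lower()
    if key in FONT_BY_TAG:
        return FONT_BY_TAG[key], False
    if key == "zh" or key.startswith("zh-"):
        # Bare or unrecognized Chinese tag: fall back on the assumption.
        return FONT_BY_TAG[ASSUMED_DEFAULT], True
    return None, False
```

With a specific tag such as "zh-TW" the lookup is exact; with plain "zh" 
the function can only return its built-in guess and flag it as one.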

> 
> In some ways, a Chinese text could represent several languages depending 
> on how it is formed. Are there script changes that happen to write the 
> different Chinese dialects on an everyday basis?

A question for the Chinese scholars on the list.

> 
> Elizabeth
> 
>>
>>
>> At a minimum it's really helpful to know whether it's Simplified or 
>> Traditional, because it may affect the font chosen for rendering (take 
>> for example a situation where the machine config has a 
>> Traditional-only font as a default and the text is in Simplified.) But 
>> beyond rendering, if software is trying to pick text from a language 
>> preference list, "zh" really messes us up.  It's much more generic 
>> than "en".  From a matching perspective, we tend to assume that "zh" 
>> really means "Simplified Chinese rendering of Mandarin as used in the 
>> PRC", but that is not the intention of the "zh" identifier.
>>
>> Andrea

Received on Tuesday, 14 December 2004 22:44:56 UTC