Re: Language Identifier List up for comments from A. Vine on 2004-12-15 (www-international@w3.org from October to December 2004)

From: A. Vine <andrea.vine@sun.com>
Date: Wed, 15 Dec 2004 11:55:54 -0800
To: www-international@w3.org
Message-id: <41C096CA.9030107@sun.com>

Elizabeth J. Pyatt wrote:

> A. Vine wrote
> 
>>
>>> But now you are talking about differences in a script, not 
>>> differences in a language.
>>
>>
>> Um, when you're talking about the written word, they are somewhat 
>> inseparable.
> 
> 
> I disagree on this point. There are Central Asian languages (e.g. Uzbek) 
> which can be written in three scripts (Roman, Cyrillic, Arabic), yet 
> they are not called different languages. 

You are misinterpreting my point.  When a language is written, it has a 
script (or writing system, some might prefer).  What that script _is_ is 
another matter.  _That_ it is, is my point.  Which is one reason why 
"zh" alone is unhelpful for actual, practical application.  If I am a 
browser and I get a page that says it's "zh", I don't know what to do. 
I don't know what to match it to, I don't know what font to load, I 
don't know what voice synthesizer to load.  I have to guess or make 
assumptions or run some additional heuristics.

What most software does right now is make assumptions, because of the 
legacy use of "zh" to mean "Simplified Chinese, Mandarin in the PRC".  
That will remain true whatever we do from now on, as long as that legacy 
tag is out there (and it is).
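
To make the guessing concrete, here is a minimal sketch of what such a 
fallback might look like.  Everything in it is hypothetical: the names 
resolve_tag and pick_font, the LEGACY_DEFAULTS and FONT_HINTS tables, 
and the font names are illustrative assumptions, not taken from any 
actual browser.

```python
# Hypothetical sketch only: table contents and names are illustrative
# assumptions, not the behavior of any real browser or library.

# Legacy assumption: a bare "zh" is treated as Simplified Chinese, PRC.
LEGACY_DEFAULTS = {"zh": "zh-CN"}

# Made-up font hints keyed by language-region tag.
FONT_HINTS = {
    "zh-CN": "simplified-chinese-font",
    "zh-TW": "traditional-chinese-font",
}


def resolve_tag(tag):
    """Return (resolved_tag, guessed) for a language tag.

    guessed is True when the tag had no region subtag and a legacy
    default was substituted for it.
    """
    parts = tag.strip().split("-")
    primary = parts[0].lower()
    if len(parts) == 1:
        if primary in LEGACY_DEFAULTS:
            return LEGACY_DEFAULTS[primary], True
        return primary, False
    # Only the first subtag after the primary language is considered here.
    region = parts[1].upper()
    return primary + "-" + region, False


def pick_font(tag):
    """Return (font_hint_or_None, guessed) for a language tag."""
    resolved, guessed = resolve_tag(tag)
    return FONT_HINTS.get(resolved), guessed
```

With these tables, pick_font("zh") yields the Simplified Chinese hint 
with the guessed flag set, whereas pick_font("zh-TW") needs no guess.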

> I realize that there are cases 
> of similar spoken forms being labelled as different languages because 
> they are written in different scripts, but that is more a matter of 
> politics than of linguistics.
> 
> I concede that the encoding tag is not enough to specify the script, but 
> I would consider script to be a third meta tag (i.e. ISO 15924, 
> http://www.unicode.org/iso15924/iso15924-codes.html).

It's fine if it's there, but software interpretation of script subtags 
is a future concept, not a current one.

> 
> I see that using Chinese-TW is NOT recommended, and I am glad to see 
> that. I also see why "zh" would not be helpful in and of itself as it is 
> currently defined. I was assuming a definition of "zh" as the written 
> form used in Chinese dialect communities, but that does not appear to be 
> the correct definition. It would not be Mandarin Chinese because it can 
> be read all over the country by speakers of the different dialects.

I have heard this, but I have also heard from some of our Chinese l10n 
folks that there are some differences in the way things would be written 
in some dialects.  In other words, it may be understood but it's not 
"native".  But I leave this to the Chinese scholars.

> 
> It's almost like a data set of numeric text which could be read in 
> almost any language.
> 
> 1 2 3
> =uno,dos,tres?
> =one,two,three?
> 
> What kind of language tag would a set of numbers be? "Math"? Would it 
> have no tag and assume a user agent will use the default language 
> (whatever it is)? I assume that a speech synthesizer agent would treat 
> Chinese characters as if they were Mandarin Chinese and pronounce them 
> accordingly, but you could build several agents that could read them in 
> the other forms (Hakka, Cantonese).
> 
> I would argue that if you're speaking of Pinyin Romanization, it might 
> be important to specify that it is the Mandarin form because now a 
> phonetic form is represented. The Romanized form of Cantonese would be 
> different.

Again, I leave it to the Chinese scholars.

Andrea

> 
> Elizabeth Pyatt
> 
>
Received on Wednesday, 15 December 2004 19:51:02 UTC