Re: Language Identifier List up for comments from Elizabeth J. Pyatt on 2004-12-15 (www-international@w3.org from October to December 2004)

From: Elizabeth J. Pyatt <ejp10@psu.edu>
Date: Wed, 15 Dec 2004 16:53:16 -0500
To: "A. Vine" <andrea.vine@sun.com>
Cc: www-international@w3.org
Message-Id: <p06100500bde65cd2d0ee@[128.118.8.31]>

I do see your point, but I'm not sure what language tag will help in 
your scenarios.


>>I disagree on this point. There are Central Asian languages (e.g. 
>>Uzbek) which can be written in three scripts (Roman, Cyrillic, 
>>Arabic), yet they are not called different languages.
>
>You are misinterpreting my point.  When a language is written, it 
>has a script (or writing system, some might prefer).  What that 
>script _is_ is another matter.  _That_ it is, is my point.  Which is 
>one reason why "zh" alone is unhelpful for actual, practical 
>application.  If I am a browser and I get a page that says it's 
>"zh", I don't know what to do. I don't know what to match it to, I 
>don't know what font to load,

Currently, I believe that the encoding tag (e.g. "big5") plus the 
actual characters tell the browser what to display, with or without 
a language tag. If you have a page tagged correctly for language BUT 
forget the encoding tag, you are generally in trouble. But if you 
forget the language tag and include the encoding tag, you will 
usually get good results visually.
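
To make the encoding point concrete, here is a rough sketch in Python. 
The sample string and the two codec names are just my choices for 
illustration; nothing here depends on a language tag at all. The same 
bytes only come back as the intended characters when the declared 
encoding is the right one.

```python
# Illustrative only: the sample text and the codec names are assumptions.
original = "中文"

# Bytes as a server might send them for a Traditional Chinese page.
raw = original.encode("big5")

# Correct encoding declared: the text round-trips.
decoded_ok = raw.decode("big5")

# Wrong encoding declared: same bytes, different characters on screen.
# errors="replace" keeps the sketch from raising on invalid sequences.
decoded_wrong = raw.decode("gb2312", errors="replace")

print(decoded_ok == original)     # the right encoding recovers the text
print(decoded_wrong == original)  # the wrong encoding does not
```

A language tag of "zh" would not change either result; only the 
encoding declaration does.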

>  I don't know what voice synthesizer to load.  I have to guess or 
>make assumptions or run some additional heuristics.

And this is where I think the language tag is most valid. Because of 
the way the Chinese script (Simplified/Traditional) is designed, it 
may be a bit "language blind" in some cases.

A speaker in Hong Kong seeing the characters may read it in Cantonese 
(zh-, but a speaker in Beijing may read it in Mandarin Chinese. You 
could design a speech synthesizer which reads aloud either depending 
on user preferences. This is why I claim that you may be stuck 
without a specific language code.
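
As a minimal sketch of what I mean by "depending on user preferences" 
(everything here is hypothetical: the function name, the voice names, 
and the tags "zh-yue" and "zh-guoyu" are just what I picked for the 
illustration, not any real browser or synthesizer API):

```python
def pick_voice(page_tag, user_preference, voices):
    """Choose a synthesizer voice for a page.

    page_tag:        language tag declared by the page, e.g. "zh"
    user_preference: tag the user prefers to hear, or None
    voices:          dict mapping language tags to voice names
    Returns a voice name, or None if no sensible choice exists.
    """
    tag = page_tag.lower()

    # A specific page tag with an installed voice wins outright.
    if tag in voices:
        return voices[tag]

    # Generic tag such as "zh": fall back to the user's preference,
    # but only if it belongs to the same primary language.
    if user_preference:
        pref = user_preference.lower()
        if pref.split("-")[0] == tag.split("-")[0] and pref in voices:
            return voices[pref]

    # Nothing to go on: the caller has to guess or run heuristics.
    return None


# Hypothetical installed voices.
voices = {"zh-yue": "cantonese_voice", "zh-guoyu": "mandarin_voice"}

print(pick_voice("zh", "zh-yue", voices))
print(pick_voice("zh", None, voices))
```

With only "zh" on the page and no stated preference, the function has 
nothing to choose from, which is the "stuck" case.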

It's theoretically many languages in one script - which IS a major 
advantage of the script. However, it's hard to say what language it 
is phonetically. The world tends to assume it's zh-han because of 
political matters, but apparently there's a bit of fudge factor 
involved.

>I have heard this, but I have also heard from some of our Chinese 
>l10n folks that there are some differences in the way things would 
>be written in some dialects.  In other words, it may be understood 
>but it's not "native".  But I leave this to the Chinese scholars.

If there are genuine script differences, then I agree you would have 
to specify zh-han vs. the other zh's. Maybe I am misunderstanding 
the situation. It would be nice if the Chinese scholars could tell 
the group.

I definitely agree you would have to do it for the Roman forms.


>What most software does right now is make assumptions due to legacy 
>use of "zh" meaning "Simplified Chinese, Mandarin in the PRC".  It 
>doesn't matter what we do from now on, as long as that legacy tag is 
>out there (and it is).

That sounds about right. That's pretty much what the world assumes 
when you say "Chinese".

Cheers

Elizabeth

>
>
>Andrea
>
>>
>>Elizabeth Pyatt


-- 
=-=-=-=-=-=-=-=-=-=-=-=-=
Elizabeth J. Pyatt, Ph.D.
Instructional Designer
Education Technology Services, TLT/ITS
Penn State University
ejp10@psu.edu, (814) 865-0805 or (814) 865-2030 (Main Office)

210 Rider Building II
227 W. Beaver Avenue
State College, PA   16801-4819
http://www.personal.psu.edu/ejp10/psu
http://tlt.psu.edu

Received on Wednesday, 15 December 2004 21:57:50 UTC