Re: XML Blueberry (non-ASCII name characters in Japan) from John Cowan on 2001-07-09 (www-xml-blueberry-comments@w3.org from July 2001)

From: John Cowan <jcowan@reutershealth.com>
Date: Mon, 09 Jul 2001 16:01:38 -0400
To: Elliotte Rusty Harold <elharo@metalab.unc.edu>
CC: xml-dev@lists.xml.org, www-xml-blueberry-comments <www-xml-blueberry-comments@w3.org>
Message-ID: <3B4A0DA2.1090808@reutershealth.com>

Elliotte Rusty Harold wrote:

 > No, I was specifically thinking about the additional Han ideographs
 > for Japanese and Chinese.

I admit that the case for including these specifically is much
weaker than for the languages with previously unencoded scripts.
However, in for a penny, in for a pound (see below); it will
not cost much more to handle them all.

 > For the scripts you mention it would be
 > enough to list the languages along with some documentation of the
 > number of people who speak them, and prefer to write code in their
 > native tongues.

Until now, many of these people have been unable to write code
in their native tongues.  Further, I don't think that writing markup
is to be identified with writing code, though I admit that it is
different from writing plain text.  Plenty of people can and do mark up
documents structurally who are utterly innocent of programming.

In addition, many can read marked-up text who don't write any
(and indeed, many can read plain text who never write any).
Such reading is discouraged if the markup is in Latin script whereas
the text is in the local script.

 > The distinction [between content and markup] keeps getting glossed
 > over,

Not by me.

 > Dhivehi I've never heard of, and
 > it doesn't seem to be in Unicode 3.0. I can't find it in any of my
 > references, at least under that name. Is it new in 3.1? Wait, I
 > just found it on the Internet. It's called Thaana in Unicode, and
 > is spoken in the Maldives by about 250,000 people.

Thaana is the script, Dhivehi is the language.

 > It might or
 > might not have an established Roman transliteration. The web sites
 > I looked at were unclear on this point.
 >
 > Yi is definitely different. There is an established Roman based
 > alphabet for it.  It may not be the preferred script for all Yi
 > speakers, but it's adequate for markup.

That is no argument.  There are several established Latin-script
transliterations for Greek, and every educated Greek-speaker knows and
uses the Latin script, but Greeks want to write Greek in the Greek
script.  Wherefore it is encoded in Unicode and other encodings,
and allowed in XML names.

 > In any case, Burmese and Khmer are genuinely different scripts that
 > don't seem to have accepted mappings into any other scripts.
 > However, they are both relatively small scripts that can fit into
 > the upper half of a one-byte font even if the purported character
 > set is something else completely like 8859-1. In fact, I suspect
 > that's how they're used today.

I don't understand the relevance of this.  Adding so much as one
Unicode character breaks compatibility just as much as adding
40,000 plus.  Why are relatively small scripts to be privileged
in this process? Or are these users to be stuck with font-kludge
encodings forever?  (More accurately, are they to use Unicode
for plain-text documents, but not for marked-up ones?)
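
To illustrate the compatibility point, here is a minimal sketch in Python.
It assumes a toy "old" parser whose set of legal name characters is frozen;
the FROZEN table and the old_parser_accepts function are hypothetical
stand-ins, not the actual XML 1.0 name productions.  The old parser rejects
a name containing one newly allowed character exactly as it rejects a name
built entirely from new characters.

```python
# Hypothetical stand-in for a name-character table frozen at an older
# Unicode version; the real XML 1.0 table is far larger.
FROZEN = frozenset("abcdefghijklmnopqrstuvwxyz")


def old_parser_accepts(name):
    """Return True if every character of a non-empty name is in FROZEN."""
    return len(name) > 0 and all(ch in FROZEN for ch in name)


# A name that uses a single character outside the frozen table
# (U+0780, a Thaana letter).
one_new = "na\u0780me"

# A name built only from characters outside the frozen table
# (Myanmar, Khmer, Ethiopic, and Yi code points).
many_new = "\u1000\u1780\u1200\ua000"

results = [old_parser_accepts(n) for n in ("name", one_new, many_new)]
print(results)
```

Under these assumptions, the old parser accepts only the first name; the
size of the extension makes no difference to whether it breaks.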

 > Let's try and put some numbers on this. For Burmese, Dhivehi, Khmer,
 > and the Ethiopic languages we're probably talking in the ballpark
 > of 100 million people. (source: Kenneth Katzner, Languages of the
 > World). Of these 100 million people how many of them are likely to
 > write markup?

Who knows?  If you build it, they will come.

 > I suggest that a reasonable means of answering at least the latter
 > of these questions would be to investigate the computer science
 > programs in the countries and regions where these languages are
 > spoken. If they're taught primarily in Burmese, Dhivehi, Khmer,
 > etc. then I think it's plausible to assume that markup writers in
 > these languages would use their native tongue.

Again, I think this connection between markup and programming
is unwarranted.  I can throw a stone from where I am sitting
now and hit several persons who can do markup perfectly well
but cannot program at all.

Dhivehi in particular, AFAIK, has never had an encoding before,
except for font-kludge encodings, and it is written right-to-left
(RTL) to boot, so teaching programming using it would be a
difficult matter.

 > On the other hand,
 > if it turns out that some other language is the accepted language
 > for technical communication within these countries, then I propose
 > that it's not necessary in XML markup.

See Makoto's posting about Japanese vs. English as used by
Japanese engineers.

-- 
There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein
Received on Monday, 9 July 2001 16:17:30 UTC