- From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
- Date: Thu, 21 Jun 2001 09:37:59 -0400
- To: unicode@unicode.org, unicore@unicode.org, www-international@w3.org
This is going out to three mailing lists. I'd like to add a fourth and suggest that future discussion take place on xml-dev, which probably has the broadest reach of interested parties.

Starting in Unicode 3.0, a number of new characters have been added, both for scripts that were previously unencoded, such as Amharic and Cherokee, and for scripts that were incomplete, such as Chinese. The concern in the XML Blueberry Requirements is that, since XML 1.0 is based on Unicode 2.0, "fully native-language XML markup is not possible in at least the following languages: Amharic, Burmese, Canadian aboriginal languages, Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian (traditional script), Oromo, Syriac, Tigre, Yi. In addition, Chinese, Japanese, Korean (Hangul script), and Vietnamese can make use of only a limited subset of their complete character repertoires." If this were true, it would be a very serious criticism of XML 1.0.

Fortunately, however, the situation is not nearly as dire as the proposal makes out. Indeed, the proposal substantially overstates the need for any changes.

The XML 1.0 BNF productions do not allow these newly defined characters to be used in element, attribute, and entity names. However, they can be used in the text of element content and attribute values. This means that XML is fully adequate for literature and data in Amharic, Burmese, Canadian aboriginal languages, Cantonese, Cherokee, Dhivehi, Khmer, Mongolian, Oromo, Syriac, Tigre, Yi, Mandarin, Japanese, Korean, and Vietnamese. Only the markup, that is, the tags, would have to be written in another script. Given that there aren't even localized operating systems in most of these languages, and that today's software effectively requires users to have a solid knowledge of at least the ASCII characters, I don't think the need to write markup (as opposed to text) in Cherokee justifies breaking backwards compatibility.

But wait! It's not even that bad. Several of the languages listed are total red herrings. You most certainly can write markup in Cantonese, Japanese, Korean, Mandarin, and Vietnamese today. The new characters Unicode has added to these scripts are very obscure. In fact, experts often disagree over whether some of them exist at all, or are merely typographical variations of existing characters.

Since the 1700s Vietnamese has been written in a Latin-based alphabet that is fully available in XML and that can write any Vietnamese word. Vietnamese only uses the Han ideographs for classical documents and occasional signage or decoration, and it seems very unlikely that a Vietnamese speaker would write their markup using Han ideographs. Japanese has not one but two phonetic scripts (hiragana and katakana) that can write any Japanese word if the right Han ideograph is not encoded. Chinese speakers can use either Latin characters or the native Bopomofo phonetic system for the very rare cases where a character they need is not encoded.

The fact is that most native speakers of Chinese, Japanese, Korean, and Vietnamese do not recognize the vast majority of these new characters, and the need for them in markup (again, as opposed to text) is non-existent.

There are a few good points in this proposal. I'm sure there's an occasional need for writing markup in Amharic, Burmese, Khmer, Mongolian, Yi, and a few of the other languages the proposal lists. But I don't believe there's enough of a need to justify breaking compatibility with existing XML parsers, software, and systems.

The XML Blueberry Requirements vastly overstate the case by ignoring the difference between markup and text in XML documents. I'd be willing to break backwards compatibility to allow text in these languages if we had to, but we don't. Text is already adequately handled by XML 1.0. All we're arguing about now are the tags, and that's just not a strong enough reason to break backwards compatibility.
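
For anyone who wants to see the markup-versus-text distinction concretely, here is a minimal sketch in Python. It is only an illustration, not part of the proposal or of any specification: the helper name is_well_formed and the two sample documents are made up for this example, and it assumes the standard library's xml.etree.ElementTree parser. The first document puts Cherokee letters in element content; the second uses them as the element name. A parser that follows the original XML 1.0 name productions should accept the first and reject the second, though a parser built to later, relaxed name rules may accept both.

```python
import xml.etree.ElementTree as ET

# Three letters from the Cherokee block, which Unicode 3.0 added.
# (Chosen only for illustration.)
CHEROKEE = "\u13E3\u13B3\u13A9"


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise.

    Hypothetical helper for this example; it simply wraps ET.fromstring.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Cherokee in the text of element content: the case the post says
# XML 1.0 already handles.
text_doc = f"<language>{CHEROKEE}</language>"

# Cherokee in the element name itself: the case the proposal is about.
markup_doc = f"<{CHEROKEE}>Cherokee</{CHEROKEE}>"

print("text in Cherokee accepted:  ", is_well_formed(text_doc))
print("markup in Cherokee accepted:", is_well_formed(markup_doc))
```

The result printed for the second document depends on which name rules the underlying parser implements, so treat it as something to observe rather than a guarantee.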
-- 
+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
| Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/   |
| Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/      |
+----------------------------------+---------------------------------+
Received on Thursday, 21 June 2001 09:48:33 UTC