- From: John Cowan <jcowan@reutershealth.com>
- Date: Wed, 16 Oct 2002 08:52:40 -0400 (EDT)
- To: www-xml-blueberry-comments@w3.org
Scripsit:

This is a proposal for a new, backwards-incompatible version of XML. The specific goal is to address some shortcomings of the XML 1.0 character model relative to Unicode 3.1, as well as to throw a sop to IBM.

The concern with respect to IBM is that one of the world's largest corporations, with thousands of patents, legions of programmers, billions of dollars in revenue, and resources pouring out of every orifice, is somehow unable to handle documents where lines end with carriage returns, line feeds, or both, as documents do on every non-IBM system on the planet. The only reason there's a problem here at all is that IBM tried to go it alone as a monopoly and set standards by fiat for years rather than working with the rest of the industry. Consequently, their mainframe character sets don't really interoperate well with everybody else's character sets. In XML this surfaces as a problem with line endings when someone edits an XML document with an IBM mainframe text editor. IBM has mostly grown out of its anti-competitive, monopolistic tendencies over the last thirty years (with a large dose of assistance from the U.S. government), but there are still some legacy issues from its attempt to dictate standards to the rest of the industry, and this is one of them. Now, rather than fixing their own broken mainframe text-editing software, they want everyone else on the planet to change their software so IBM doesn't have to. (If this reminds anybody of the current mess with Oracle and UTF-8, you're not alone.)

This proposal was laughed out of the W3C a few months ago when IBM made it, or at least it seemed to be. However, it has now risen from the dead as part of XML Blueberry; it doesn't make any more sense now than it did then, and it still deserves to be laughed off the table with whooping cries of derision.

The second proposal for breaking backwards compatibility with existing parsers is much more serious and requires a more thoughtful response. Starting in Unicode 3.0, a number of new characters have been added, both for scripts that were previously unencoded, such as Amharic and Cherokee, and for old scripts that were incomplete, such as Chinese. The concern is that since XML 1.0 is based on Unicode 2.0, "fully native-language XML markup is not possible in at least the following languages: Amharic, Burmese, Canadian aboriginal languages, Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian (traditional script), Oromo, Syriac, Tigre, Yi. In addition, Chinese, Japanese, Korean (Hangul script), and Vietnamese can make use of only a limited subset of their complete character repertoires."

If this were true, it would be a very serious criticism of XML 1.0. Fortunately, however, the claim is not nearly as dire as the proposal makes out; indeed, the proposal substantially overstates the need for any changes. The XML 1.0 BNF productions do not allow these newly defined characters to be used in element, attribute, and entity names, but they can be used in the text of element content and attribute values. This means that XML is fully adequate for literature and data in Amharic, Burmese, Canadian aboriginal languages, Cantonese, Cherokee, Dhivehi, Khmer, Mongolian, Oromo, Syriac, Tigre, Yi, Mandarin, Japanese, Korean, and Vietnamese. Only the markup, that is, the tags, would have to be written in another script.
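To see concretely what is and isn't at stake, here is a rough sketch (Python, deliberately simplified, not a conformant parser) of the two rules in question: XML 1.0's end-of-line handling, which recognizes only CR and CR LF (the mainframe NEL character, U+0085, is what Blueberry proposes adding), and the gap between the Char production, which already admits the newly encoded scripts in character data, and the Unicode 2.0-based name productions, which do not. The name-start ranges shown are only a small, representative subset of the full Appendix B tables.

```python
import re

def xml10_normalize_eol(text):
    # XML 1.0, section 2.11: CR LF and any lone CR become a single LF.
    # NEL (U+0085) is left alone -- under XML 1.0 it is ordinary data.
    return re.sub("\r\n?", "\n", text)

def is_xml10_char(cp):
    # The Char production: legal wherever character data may appear.
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x0020 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

# A few of the Letter/Ideographic ranges from XML 1.0 Appendix B (Unicode
# 2.0 based).  Scripts encoded after Unicode 2.0 -- Cherokee, Yi, Ethiopic,
# Mongolian, and so on -- simply do not appear in the appendix at all.
SOME_NAME_START_RANGES = [
    (0x0041, 0x005A), (0x0061, 0x007A),  # ASCII letters
    (0x3041, 0x3094), (0x30A1, 0x30FA),  # Hiragana, Katakana
    (0x4E00, 0x9FA5),                    # CJK Unified Ideographs
    (0xAC00, 0xD7A3),                    # Hangul syllables
]

def is_xml10_name_start(cp):
    # Rough stand-in for "Letter | '_' | ':'" at the start of a Name.
    return cp in (0x5F, 0x3A) or any(lo <= cp <= hi
                                     for lo, hi in SOME_NAME_START_RANGES)

if __name__ == "__main__":
    print(repr(xml10_normalize_eol("a\r\nb\rc\u0085d")))  # 'a\nb\nc\x85d'
    for label, cp in [("LATIN SMALL LETTER A (U+0061)", 0x0061),
                      ("CJK IDEOGRAPH (U+4E2D)", 0x4E2D),
                      ("CHEROKEE LETTER A (U+13A0)", 0x13A0)]:
        print(label,
              "| ok in text:", is_xml10_char(cp),
              "| ok starting a name:", is_xml10_name_start(cp))
```

Running it shows the point exactly: the Cherokee letter is perfectly legal in text and attribute values but cannot start a tag name under the XML 1.0 productions.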
Given that there aren't even localized operating systems in most of these languages, and that today's software effectively requires users to have a solid knowledge of at least the ASCII characters, I don't think the need to write markup (as opposed to text) in Cherokee justifies breaking backwards compatibility.

But wait! It's not even that bad. Several of the languages listed are total red herrings. You most certainly can write markup in Cantonese, Japanese, Korean, Mandarin, and Vietnamese today. The new characters Unicode has added to these scripts are very obscure; in fact, experts often disagree over whether some of them exist at all or are merely typographical variants of existing characters. Since the 1700s, Vietnamese has been written in a Latin-based alphabet that is fully available in XML and can write any Vietnamese word; Vietnamese uses the Han ideographs only for classical documents and occasional signage or decoration, and it seems very unlikely that a Vietnamese speaker would write markup in them. Japanese has not one but two phonetic alphabets that can write any Japanese word if the right Han ideograph is not encoded. Chinese speakers can use either Latin characters or the native Bopomofo phonetic system for the very rare cases where a character they need is not encoded. The fact is that most native speakers of Chinese, Japanese, Korean, and Vietnamese do not recognize the vast majority of these new characters, and the need for them in markup (again, as opposed to text) is non-existent.

There are a few good points in this proposal. I'm sure there's an occasional need to write markup in Amharic, Burmese, Khmer, Mongolian, Yi, and a few of the other languages the proposal lists. But I don't believe there's enough of a need to justify breaking compatibility with existing XML parsers, software, and systems. The XML Blueberry Requirements vastly overstate the case by ignoring the difference between markup and text in XML documents. I'd be willing to break backwards compatibility to allow text in these languages if we had to, but we don't: text is already adequately handled by XML 1.0. All we're arguing about now are the tags, and that's just not a strong enough reason to break backwards compatibility.

-- 
John Cowan <jcowan@reutershealth.com>          http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,         http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.   --Galadriel, _LOTR:FOTR_
Received on Wednesday, 16 October 2002 08:54:14 UTC