XML 1.1 comments by Elliotte Rusty Harold (from cafeconleche.org)

Elliotte Rusty Harold wrote:

This is a proposal for a new, backwards-incompatible version of XML. Its
specific goals are to address some shortcomings of the XML 1.0 character
model relative to Unicode 3.1 and to throw a sop to IBM.

The concern with respect to IBM is that one of the world's largest
corporations, with thousands of patents, legions of programmers, billions
of dollars in revenue, and resources pouring out of every orifice is
somehow unable to handle documents where lines end with carriage returns
and line feeds, as documents do on every non-IBM system on the planet. The
only reason there's a problem here at all is that IBM tried to go
it alone as a monopoly and set standards by fiat for years rather than
working with the rest of the industry. Consequently, their mainframe
character sets don't really interoperate well with everybody else's
character sets. In XML this arises as a problem with line endings when
someone edits an XML document with an IBM mainframe text editor. IBM
mostly grew out of their anti-competitive monopolistic tendencies
over the last thirty years (with a large dose of assistance from the
U.S. government). However, there are still some legacy issues relating to
their attempt to dictate standards to the rest of the industry, and this
is one of them. Now rather than fixing their own broken mainframe text
editing software, they want everyone else on the planet to change their
software so IBM doesn't have to. (If this reminds you of the current
mess with Oracle and UTF-8, you're not alone.) This proposal was laughed
out of the W3C a few months ago when IBM made it, or at least it seemed
to be. However, it has now risen from the dead as part of XML Blueberry.
It doesn't make any more sense now than it did then, and it still
deserves to be laughed off the table with whooping cries of derision.
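
For reference, the line-ending rule at issue is small. XML 1.0 (section
2.11) has the parser turn every CR LF pair and every lone CR into a single
LF before the application sees the text. The sketch below restates that
rule in Python; the function name is made up for illustration, and the use
of NEL (U+0085) as the mainframe-style line end is an assumption, since the
text above does not name the character involved.

```python
def normalize_line_ends(text: str) -> str:
    """Apply the XML 1.0 section 2.11 rule: CR LF and lone CR become LF."""
    # Replace the two-character sequence first so it yields one LF, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# CR LF, lone CR and LF all end up as LF.
# U+0085 (NEL, assumed here to be the mainframe line end) is left alone,
# so an XML 1.0 parser treats it as ordinary character data.
sample = "<a>one\r\ntwo\rthree\nfour\u0085five</a>"
normalized = normalize_line_ends(sample)
print(repr(normalized))
```

Under this rule the fix can live in whatever writes the file: emit LF or
CR LF and every existing parser is satisfied.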

The second proposal for breaking backwards compatibility with
existing parsers is much more serious, and requires a more thoughtful
response. Starting in Unicode 3.0, a number of new characters have
been added, both for previously unencoded scripts such as
Amharic and Cherokee and for old scripts that were incomplete
such as Chinese. The concern is that since XML 1.0 is based on Unicode
2.0, "fully native-language XML markup is not possible in at least the
following languages: Amharic, Burmese, Canadian aboriginal languages,
Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian
(traditional script), Oromo, Syriac, Tigre, Yi. In addition, Chinese,
Japanese, Korean (Hangul script), and Vietnamese can make use of only
a limited subset of their complete character repertoires."

If this were true, it would be a very serious criticism of XML 1.0.
Fortunately, however, the claim is not nearly as dire as the proposal
makes out. Indeed the proposal substantially overstates the need for any
changes. The XML 1.0 BNF productions do not allow these newly defined
characters to be used in element, attribute, and entity names. However,
they can be used in the text of element content and attribute values. This
means that XML is fully adequate for literature and data in Amharic,
Burmese, Canadian aboriginal languages, Cantonese, Cherokee, Dhivehi,
Khmer, Mongolian, Oromo, Syriac, Tigre, Yi, Mandarin, Japanese, Korean,
and Vietnamese. Only the markup, that is, the tags, would have to
be written in another script. Given that there aren't even localized
operating systems in most of these languages, and that today's software
effectively requires users to have a solid knowledge of at least the ASCII
characters, I don't think the need to write markup (as opposed to text)
in Cherokee justifies breaking backwards compatibility.
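
The distinction between text and markup can be checked against the grammar
directly. The sketch below encodes the XML 1.0 Char production in full, and
only a hand-picked subset of the name-start ranges from Appendix B of the
spec (the real table is much longer). The function names and the sample
characters are chosen for illustration only.

```python
def is_xml10_char(ch: str) -> bool:
    """XML 1.0 production [2] Char: what may appear in document text."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# A deliberately partial subset of the XML 1.0 name-start characters
# (Letter | '_' | ':'). The full Appendix B table has many more ranges.
NAME_START_SUBSET = [
    (0x3A, 0x3A),      # ':'
    (0x41, 0x5A),      # A-Z
    (0x5F, 0x5F),      # '_'
    (0x61, 0x7A),      # a-z
    (0x3041, 0x3094),  # Hiragana
    (0x30A1, 0x30FA),  # Katakana
    (0x3105, 0x312C),  # Bopomofo
    (0x4E00, 0x9FA5),  # Han ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
]


def is_name_start_in_subset(ch: str) -> bool:
    """True if ch falls in one of the subset ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_SUBSET)


# Cherokee and Ethiopic were encoded in Unicode 3.0, after the Unicode 2.0
# baseline of XML 1.0, so they are missing from the full name table as
# well as from this subset.
samples = {
    "Cherokee letter A": "\u13a0",
    "Ethiopic syllable HA": "\u1200",
    "Hiragana letter A": "\u3042",
}
for label, ch in samples.items():
    print(label, is_xml10_char(ch), is_name_start_in_subset(ch))
```

All three samples pass the Char test, so all three are legal in element
content and attribute values. Only the Hiragana sample passes the
name-start test.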

But wait! It's not even that bad. Several of the languages listed are
total red herrings. You most certainly can write markup in Cantonese,
Japanese, Korean, Mandarin, and Vietnamese today. The new characters
Unicode has added to these scripts are very obscure. In fact, experts
often disagree over whether some of them exist at all, or are merely
typographical variations of existing characters. Since the 1700s
Vietnamese has been written in a Latin-based alphabet that is fully
available in XML and that can write any Vietnamese word. Vietnamese
only uses the Han ideographs for classical documents and occasional
signage or decoration, and it seems very unlikely that a Vietnamese
speaker would write their markup using Han ideographs. Japanese has not
one but two phonetic syllabaries (hiragana and katakana) that can write
any Japanese word if the right Han ideograph is not encoded. Chinese
speakers can use either Latin characters or the native Bopomofo phonetic
system for the very rare cases where a character they need is not
encoded. The fact is that
most native speakers of Chinese, Japanese, Korean and Vietnamese do not
recognize the vast majority of these new characters, and the need for
them in markup (again, as opposed to text) is non-existent.
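
As a quick check of the claim that markup in these scripts works today, the
sketch below feeds a few one-element documents to Python's standard
ElementTree parser, which is built on expat. The element names are
illustrative words chosen for this example, written with escapes so the
exact code points are visible. Other parsers may behave differently.

```python
import xml.etree.ElementTree as ET

# Element names in scripts that the XML 1.0 name productions already cover.
names = {
    "Japanese (hiragana)": "\u307b\u3093",
    "Korean (Hangul)": "\ucc45",
    "Chinese (Bopomofo)": "\u3115\u3128",
    "Chinese (Han ideograph)": "\u66f8",
    "Vietnamese (Latin)": "t\u00ean",
}

parsed = {}
for label, name in names.items():
    # Build a minimal document whose only element uses the native name.
    root = ET.fromstring(f"<{name}>text</{name}>")
    parsed[label] = root.tag
    print(label, root.tag == name)
```

Each document parses without error and the tag comes back unchanged.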

There are a few good points in this proposal. I'm sure there's an
occasional need for writing markup in Amharic, Burmese, Khmer, Mongolian,
Yi, and a few of the other languages the proposal lists. But I don't
believe there's enough of a need to justify breaking compatibility
with existing XML parsers, software, and systems. The XML Blueberry
Requirements vastly overstate the case by ignoring the difference between
markup and text in XML documents. I'd be willing to break backwards
compatibility to allow text in these languages if we had to, but we
don't. Text is already adequately handled by XML 1.0. All we're arguing
about now are the tags, and that's just not a strong enough reason to
break backwards compatibility.


-- 
John Cowan <jcowan@reutershealth.com>     http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_

Received on Wednesday, 16 October 2002 08:54:14 UTC