Blueberry/Unicode/XML (fwd)

----- Forwarded message from Tim Bray -----

Date: Mon, 09 Jul 2001 21:33:12 -0700
From: Tim Bray <tbray@textuality.com>
Subject: Blueberry/Unicode/XML
In-reply-to: <3B49E743.5042FDCD@mitre.org>
X-Sender: tbray@pop.intergate.ca
To: xml-dev@lists.xml.org
Message-id: <5.1.0.14.2.20010709211010.02557b20@pop.intergate.ca>

Boy, this one's tough.  I buy neither Elliotte's assertion that
changing XML is unthinkable, nor John Cowan's assertion that the
depth of the cultural affront to users of pre-Unicode-3.1 
languages is so high as to outweigh consideration of cost.

I just went and reviewed the Blueberry requirements at
http://www.w3.org/TR/xml-blueberry-req and I'm not very comfy
with them.  There is repeated and specific reference to the
problem being that posed by Unicode 3.1.  The problem isn't
3.1, it's that Unicode is an unfinished standard that
continues to grow actively, whereas it would be nice if
we could declare XML syntax finished and go back to our
plows.

XML 1.0 took a design decision in favor of enumeration of 
name characters, simply because the alternative - outsourcing 
the problem to the Unicode/ISO10646 process - had two 
problems:

(a) We didn't know them well enough to trust them, and
(b) writing a satisfying set of rules for XML name chars
    based solely on Unicode metadata is pretty hard.
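
To make point (b) concrete, here is a minimal sketch of what a name rule
written purely in terms of Unicode general categories might look like. The
category sets and function names are assumptions chosen for illustration;
this is not the XML 1.0 production and not any Blueberry proposal.

```python
import unicodedata

# Assumed category sets for this sketch only.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}   # letters, letter-numbers
EXTRA_CATEGORIES = {"Mn", "Mc", "Me", "Lm", "Nd"}   # marks, modifiers, digits


def is_name_start(ch):
    """True if ch may begin a name under this sketch's rule."""
    return ch in "_:" or unicodedata.category(ch) in START_CATEGORIES


def is_name_char(ch):
    """True if ch may appear after the first character of a name."""
    return (
        is_name_start(ch)
        or ch in "-."
        or unicodedata.category(ch) in EXTRA_CATEGORIES
    )


def is_name(s):
    """True if s is a non-empty name under this sketch's rule."""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])


print(is_name("title"))      # True
print(is_name("1x"))         # False: a digit cannot start a name
print(is_name("a\u00b7b"))   # False here, although XML 1.0 lists U+00B7 as an Extender
```

The last line shows why a pure metadata rule is hard to get right: the
middle dot has general category Po, so a category-only rule rejects a
character that the enumerated XML 1.0 productions accept, and a satisfying
rule would need exception lists on top of the metadata.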

The force of argument (b) is unabated.  (a) seems less of
a worry now simply because the Unicode and XML gangs have 
gotten pretty comfy with each other.  But I do have a worry
at the back of my mind whether the W3C *institutionally* 
ought to trust the consortium *institutionally* with 
something of this magnitude.  And if ISO and Unicode stop
getting along one of these centuries, whose side is XML on?

A few weeks ago, I was in favor of leaving it the way it
is, but only by about 55-45.  I found the most convincing
argument on the other side was the person who postulated
a Khmer user typing away in emacs and having a disconnect
because there are lots of characters they can use for 
people's names but not as attribute names.  On the other
hand, this problem is not unique to Khmer - just ask 
Mr. O'Hara.
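
The disconnect between what is allowed in content and what is allowed in
markup names can be tried out with a small sketch. It uses Python's standard
xml.etree.ElementTree parser; the helper name is_well_formed and the sample
documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """Return True if the parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


khmer = "\u1780\u1798"  # two Khmer letters, picked arbitrarily

# Fine as data: attribute values and text may hold these characters.
print(is_well_formed('<person name="O\'Hara"/>'))
print(is_well_formed(f"<person>{khmer}</person>"))

# As markup names the story differs: the apostrophe is never a name character.
print(is_well_formed("<O'Hara/>"))

# Whether the Khmer name is accepted depends on the name tables the
# underlying parser implements, so no expected output is shown here.
print(is_well_formed(f"<{khmer}/>"))
```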

And the notion of having a single monolithic XML whose
interoperability, while not perfect, is pretty $#!%* good,
partially based on those unwieldy character-class 
productions, is something that it will hurt to lose.  And
it is a reasonable position to say "The markup name character 
class snapshot was based on Unicode 2.0, sorry 'bout that."

Realistically, there are 3 options:

1. Leave it the way it is.
2. Do Blueberry and then repeat the process for Unicode 3.2
   and 4.0 and so on every couple of years forever.
3. Bite the bullet, write the rules in terms of Unicode
   metadata and go to a pure use-by-reference architecture,
   probably adding a syntactic signal to reference the
   Unicode version number.
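
Option 3 can be sketched roughly as follows: the name rule is stated in
terms of Unicode categories, and a version label selects which Unicode
database the rule is evaluated against. The version-keyed table and the
function name are assumptions for illustration; Python happens to ship both
its current database and a frozen 3.2.0 one (unicodedata.ucd_3_2_0), which
is why those two versions are used.

```python
import unicodedata

# Databases this sketch can consult, keyed by a Unicode version label.
DATABASES = {
    "3.2.0": unicodedata.ucd_3_2_0,
    unicodedata.unidata_version: unicodedata,
}

# Assumed set of categories that may start a name (illustrative only).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}


def is_name_start(ch, unicode_version):
    """Apply the same rule against the database for the declared version."""
    db = DATABASES[unicode_version]
    return ch in "_:" or db.category(ch) in LETTER_CATEGORIES


capital_sharp_s = "\u1e9e"  # added to Unicode after version 3.2
print(is_name_start(capital_sharp_s, "3.2.0"))
print(is_name_start(capital_sharp_s, unicodedata.unidata_version))
```

The same character is rejected under the older database and accepted under
the newer one, so two processors agree on a document only if they honor the
declared version, and every processor has to carry every database it may be
asked for.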

I think (3.) will prove to be really hard to do well - and 
then the Unicode metadata fields might get changed and screw
it all up.  I think (2.) is not unreasonable, but has the 
institutional disadvantage that the XML standardization effort 
has to become an ongoing process ad infinitum.  

I still go for (1.).  My opposition to NEL has hardened,
because of a strong fear that this one will cause real 
wreckage on a widespread basis, not just in linguistic
corner cases.
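
As an illustration of how treating NEL (U+0085) as a line end could change
existing data, here is a minimal sketch. normalize_xml10 follows the XML 1.0
rule that CRLF and a lone CR become LF; normalize_with_nel is a hypothetical
variant that also maps NEL to LF. It is one possible failure mode, not
necessarily the wreckage meant above.

```python
import re


def normalize_xml10(text):
    """XML 1.0 line-end handling: CRLF and lone CR become LF."""
    return re.sub(r"\r\n?", "\n", text)


def normalize_with_nel(text):
    """Hypothetical variant that additionally treats U+0085 (NEL) as a line end."""
    return re.sub(r"\r\n?|\x85", "\n", text)


# Byte 0x85 is an ellipsis in Windows-1252 but decodes to U+0085 when the
# document is labeled ISO-8859-1.
raw = b"wait\x85 done"
text = raw.decode("iso-8859-1")

print(repr(normalize_xml10(text)))     # 'wait\x85 done'
print(repr(normalize_with_nel(text)))  # 'wait\n done'
```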

But I really can't see how anyone can get behind any of 
these positions and feel entirely comfortable with where
they find themselves standing.  I sure don't. -Tim



----- End of forwarded message from Tim Bray -----

-- 
John Cowan                                   cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
	--Douglas Hofstadter

Received on Tuesday, 10 July 2001 09:11:28 UTC