- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Sun, 3 Feb 2002 21:05:25 +1100
- To: <www-xml-blueberry-comments@w3.org>
- Message-ID: <003d01c1ac9a$4c83ed60$4bc8a8c0@AlletteSystems.com>
Since there has been no response to my direct email to the WG several months ago, I assume it has fallen through the cracks, and I hope the WG will forgive me for requesting that the issues raised in that email find their way onto the issues list.

There seem to be two rationales for removing the name restrictions in XML: first, to decouple XML from a particular version of Unicode (thereby, supposedly, bringing in new scripts), and second, to simplify XML. The cost is, of course, that XML documents with mislabelled encodings are less likely to be caught. I have not seen any discussion from the WG on what they propose to replace this functionality of XML 1.0 with. Certainly, I expect that respect for potential and actual non-Western XML users, which so clearly motivates the desire to allow new characters, must also impel the WG to state what alternative should be used to catch such encoding errors. Is there another alternative which does not throw the baby out with the bathwater?

I urge the WG to reconsider this issue. In particular, I suggest the WG consider or reconsider the following two-part solution:

1) "A name error MUST be reported as a validity error. A name error MAY be reported as a WF error." This allows lightweight processors to implement smaller (or no) naming-rule checkers. The rules in the XML 1.1 draft are an example of such a very lightweight version. I attach a small and efficient Java library which would compile to just over 1K; it is another example of code which WF systems could adopt, as a coarse-grained way to catch errors (a sketch along those lines appears below).

Note that UTF-8 is also, AFAIK, code-compatible with Big5; UTF-8 data erroneously labelled as Big5 will not, of itself, cause complaints from an XML parser. But if native-language markup has been used, the name rules give a chance of catching the mistake: the larger the vocabulary used, the greater the likelihood that the error will be detected. Big5 is unusual in that the second byte of multi-byte characters may be in the ASCII range. Other encodings may not have this problem as much, unless they are used with transcoders that fail without error.

As many existing and older transcoder libraries do not raise exceptions when an encoding error is found, the naming rules may be the only way of detecting encoding errors before the data has been inserted into a database, possibly corrupting the whole database.

The WG may be interested in a practical experience here: I worked on a commercial Java/XML three-tier web project for more than six months in Taiwan, only to find that the data in Unicode "char" and String was coming in from the middleware as Big5 bytes, one byte per Java char. The programmers, trained in the US and Britain, though outstanding in other areas, had followed the customary practices used to get round-tripping working. Because there was no stage which alerted anyone that the wrong encodings were being used, it was not until late in the project, when trying to use the data with standard Java libraries rather than shovelling the bytes through, that the mistake was found. I do not believe that the programmers were unusual in this: they were working in the way appropriate to non-WWW, non-multiple-encoding systems.

The lesson I hope the WG will draw from this is that non-ASCII, non-UTF-n workers need all the help they can get in detecting encoding errors. Getting rid of one of the few pieces of infrastructure that can help works against internationalization. The WG should find a way to support native-language markup with Yi without making things less robust in Taipei.
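The attached XMLChars.java is not reproduced here. A minimal sketch in the same spirit, using Java's Unicode identifier properties (Character.isUnicodeIdentifierStart / Character.isUnicodeIdentifierPart) as a coarse stand-in for the full XML 1.0 name-character tables, might look like the following; the class name, the extra punctuation characters admitted, and the demonstration values are illustrative assumptions, not the attachment's actual code:

```java
// Illustrative sketch only -- not the actual XMLChars.java attachment.
// A coarse-grained XML name check built on Java's Unicode identifier
// properties instead of the full XML 1.0 name-character tables.
public final class XMLNameSketch {

    private XMLNameSketch() {}

    // Coarse test for a name-start character: letters per the Unicode
    // identifier-start property, plus '_' and ':' as XML 1.0 allows.
    public static boolean isNameStart(char c) {
        return c == '_' || c == ':' || Character.isUnicodeIdentifierStart(c);
    }

    // Coarse test for a subsequent name character: identifier-part
    // characters, plus '.', '-', '_' and ':' as XML 1.0 allows.
    public static boolean isNameChar(char c) {
        return c == '.' || c == '-' || c == '_' || c == ':'
                || Character.isUnicodeIdentifierPart(c);
    }

    // True if the whole string passes the coarse name rules.
    public static boolean isName(String s) {
        if (s == null || s.length() == 0) return false;
        if (!isNameStart(s.charAt(0))) return false;
        for (int i = 1; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) return false;
        }
        return true;
    }

    // Demonstration of the encoding-error argument above: UTF-8 bytes
    // misread as Big5 decode without any exception being raised, but
    // the resulting characters (replacement characters or implausible
    // garbage) tend to fail the name check.
    public static void main(String[] args) throws Exception {
        String element = "\u5143\u7D20";            // a Chinese element name
        byte[] utf8 = element.getBytes("UTF-8");
        String misread = new String(utf8, "Big5");  // decodes without complaint
        System.out.println(isName(element));        // true
        System.out.println(isName(misread));        // false on typical decoders
    }
}
```

The point is not this exact rule set but the cost: a check at roughly this grain compiles very small, yet can still catch mislabelled-encoding garbage in element and attribute names before the data reaches a database.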
2) "The naming rules should make use of the Unicode identifier properties. with whatever changes are needed, rather than being enumerated. John Cowan's excellent work a year ago on this should be followed. The WG should follow the Unicode properties: it is ironic to discard them in the name of increased Unicode support. Furthermore, this would give the property that documents using naming characters in a new version of Unicode will be rejected by validating systems whose Unicode property tables do not include those characters. This adds a measure of robustness, that a system that was not built to cope with surrogates (for example) or a particular script will reject the document. I ask the WG to consider this, and to provide thorough answers in a timely-enough fashion for debate before XML 1.1 is adopted. Cheers Rick Jelliffe Chief Technical Officer, Topologi Pty. Ltd. http://www.topologi.com/ Invited Expert, W3C I18n IG Formerly Invited Expert, W3C XML IG Formerly Member, W3C XML Schemas WG, for Academia Sinica Taiwan Member, 1995-1999, China/Korea/Japan Document Processing Group Project Leader, 1993-1997, Extended Reference Concrete Syntax project, moved into CJK DOCP Standardization Project Regarding East Asian Documents (SPREAD) Project Leader, "Chinese XML Now!" project, Academia Sinica Computing Centre, 1999. Australian Delegate, 1995-1998, 2001-, ISO JTC1 SC34 Document Description and Processing Languages Editor, ISO/IEC CD 19757 Document Schema Definition Language (DSDL) Part 4 - Path-based integrity constraints
Attachments
- text/java attachment: XMLChars.java
Received on Sunday, 3 February 2002 04:56:06 UTC