- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Sun, 3 Feb 2002 21:05:25 +1100
- To: <www-xml-blueberry-comments@w3.org>
- Message-ID: <003d01c1ac9a$4c83ed60$4bc8a8c0@AlletteSystems.com>
Since there has been no response to my direct email to the WG several months ago, I assume it has fallen through the cracks, and I hope the WG will forgive me for requesting that the issues raised in that email find their way onto the issues list.

There seem to be two rationales for removing the name restrictions in XML: first, to decouple XML from a particular version of Unicode (thereby, supposedly, bringing in new scripts), and second, to simplify XML. The cost is, of course, that XML documents with mislabelled encodings are less likely to be caught. I have not seen any discussion from the WG on what they propose to replace this functionality of XML 1.0 with. Certainly, I expect that respect for potential and actual non-Western XML users, which so clearly motivates the desire to allow new characters, must also impel the WG to state what alternative should be used to catch such encoding errors. Is there another alternative which does not throw the baby out with the bathwater?

I urge the WG to reconsider this issue. In particular, I suggest the WG consider or reconsider the following two-part solution:

1) "A name error MUST be reported as a validity error. A name error MAY be reported as a WF error." This allows lightweight processors to implement smaller (or no) naming-rule checkers. The rules in the XML 1.1 draft are an example of such a very lightweight version. I attach a small and efficient Java library which would compile to just over 1K; it is another example of code which WF systems could adopt, as a coarse-grained way to catch errors (a sketch along those lines appears below).

Note that UTF-8 is also, AFAIK, code-compatible with Big5; UTF-8 data erroneously labelled as Big5 will not, of itself, cause complaints from an XML parser. But if native-language markup has been used, the name rules give a chance of catching the mistake: the larger the vocabulary used, the greater the likelihood that the error will be detected. Big5 is unusual in that the second byte of multi-byte characters may be in the ASCII range. Other encodings may not have this problem as much, unless they are used with transcoders that fail without error.

As many existing and older transcoder libraries do not raise exceptions when an encoding error is found, the naming rules may be the only way of detecting encoding errors before the data has been inserted into a database, possibly corrupting the whole database.

The WG may be interested in a practical experience here: I worked on a commercial Java/XML three-tier web project for more than six months in Taiwan, only to find that the data in Unicode "char" and String was coming in from the middleware as Big5 bytes, one byte per Java char. The programmers, trained in the US and Britain, though outstanding in other areas, had followed the customary practices used to get round-tripping working. Because there was no stage which alerted anyone that the wrong encodings were being used, it was not until late in the project, when trying to use the data with standard Java libraries rather than shovelling the bytes through, that the mistake was found. I do not believe that the programmers were unusual in this: they were working in the way appropriate to non-WWW, non-multiple-encoding systems.

The lesson I hope the WG will draw from this is that non-ASCII, non-UTF-n workers need all the help they can get in detecting encoding errors. Getting rid of one of the few pieces of infrastructure that can help works against internationalization. The WG should find a way to support native-language markup with Yi without making things less robust in Taipei.
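The attached XMLChars.java is not reproduced here. A minimal sketch in the same spirit, using Java's Unicode identifier properties (Character.isUnicodeIdentifierStart / Character.isUnicodeIdentifierPart) as a coarse stand-in for the full XML 1.0 name-character tables, might look like the following; the class name, the extra punctuation characters admitted, and the demonstration values are illustrative assumptions, not the attachment's actual code:

```java
// Illustrative sketch only -- not the actual XMLChars.java attachment.
// A coarse-grained XML name check built on Java's Unicode identifier
// properties instead of the full XML 1.0 name-character tables.
public final class XMLNameSketch {

    private XMLNameSketch() {}

    // Coarse test for a name-start character: letters per the Unicode
    // identifier-start property, plus '_' and ':' as XML 1.0 allows.
    public static boolean isNameStart(char c) {
        return c == '_' || c == ':' || Character.isUnicodeIdentifierStart(c);
    }

    // Coarse test for a subsequent name character: identifier-part
    // characters, plus '.', '-', '_' and ':' as XML 1.0 allows.
    public static boolean isNameChar(char c) {
        return c == '.' || c == '-' || c == '_' || c == ':'
                || Character.isUnicodeIdentifierPart(c);
    }

    // True if the whole string passes the coarse name rules.
    public static boolean isName(String s) {
        if (s == null || s.length() == 0) return false;
        if (!isNameStart(s.charAt(0))) return false;
        for (int i = 1; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) return false;
        }
        return true;
    }

    // Demonstration of the encoding-error argument above: UTF-8 bytes
    // misread as Big5 decode without any exception being raised, but
    // the resulting characters (replacement characters or implausible
    // garbage) tend to fail the name check.
    public static void main(String[] args) throws Exception {
        String element = "\u5143\u7D20";            // a Chinese element name
        byte[] utf8 = element.getBytes("UTF-8");
        String misread = new String(utf8, "Big5");  // decodes without complaint
        System.out.println(isName(element));        // true
        System.out.println(isName(misread));        // false on typical decoders
    }
}
```

The point is not this exact rule set but the cost: a check at roughly this grain compiles very small, yet can still catch mislabelled-encoding garbage in element and attribute names before the data reaches a database.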
2) "The naming rules should make use of the Unicode identifier properties. with whatever changes are needed, rather than being enumerated. John Cowan's excellent work a year ago on this should be followed. The WG should follow the Unicode properties: it is ironic to discard them in the name of increased Unicode support. Furthermore, this would give the property that documents using naming characters in a new version of Unicode will be rejected by validating systems whose Unicode property tables do not include those characters. This adds a measure of robustness, that a system that was not built to cope with surrogates (for example) or a particular script will reject the document. I ask the WG to consider this, and to provide thorough answers in a timely-enough fashion for debate before XML 1.1 is adopted. Cheers Rick Jelliffe Chief Technical Officer, Topologi Pty. Ltd. http://www.topologi.com/ Invited Expert, W3C I18n IG Formerly Invited Expert, W3C XML IG Formerly Member, W3C XML Schemas WG, for Academia Sinica Taiwan Member, 1995-1999, China/Korea/Japan Document Processing Group Project Leader, 1993-1997, Extended Reference Concrete Syntax project, moved into CJK DOCP Standardization Project Regarding East Asian Documents (SPREAD) Project Leader, "Chinese XML Now!" project, Academia Sinica Computing Centre, 1999. Australian Delegate, 1995-1998, 2001-, ISO JTC1 SC34 Document Description and Processing Languages Editor, ISO/IEC CD 19757 Document Schema Definition Language (DSDL) Part 4 - Path-based integrity constraints
Attachments
- text/java attachment: XMLChars.java
Received on Sunday, 3 February 2002 04:56:06 UTC