Re: IRIs everywhere (including XML namespaces) from Elliotte Rusty Harold on 2002-10-11 (www-tag@w3.org from October 2002)

From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Date: Fri, 11 Oct 2002 15:40:26 -0400
To: <www-tag@w3.org>
Cc: xml-names-editor@w3.org
Message-Id: <p04330104b9ccd4354800@[192.168.254.4]>

A further thought on IRIs based on my experience today trying to add 
support for them to XOM:

These things are complex. The process of taking a UTF-16 encoded Java 
(or C++, or C#) string, encoding it in UTF-8, and then hex escaping 
some of it is non-trivial. It's absolutely doable, but it requires 
way more knowledge of Unicode and the intricacies of its various 
encodings than most developers possess. Converting characters from 
plane 1 and above, which UTF-16 represents as surrogate pairs, into 
UTF-8 is especially tricky. Most programmers will not know there's 
anything special here they have to watch out for. This is very much 
an experts-only job.
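
To make the steps concrete, here is a minimal sketch of the 
conversion, not a definitive implementation. The class name 
IriEscaper, the method toUri, and the example.org addresses are 
hypothetical. It leans on String.codePointAt and StandardCharsets, 
which postdate Java 1.4, and it only escapes non-ASCII characters; it 
makes no attempt to handle internationalized host names or ASCII 
characters that are themselves illegal in URIs.

```java
import java.nio.charset.StandardCharsets;

public class IriEscaper {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /**
     * Hypothetical helper: percent-escapes each non-ASCII code point
     * of the input as its UTF-8 bytes and copies ASCII through as-is.
     */
    public static String toUri(String iri) {
        StringBuilder out = new StringBuilder(iri.length());
        int i = 0;
        while (i < iri.length()) {
            // codePointAt joins a valid surrogate pair into one code point
            int cp = iri.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                // An unpaired surrogate comes back unchanged; reject it
                if (cp >= 0xD800 && cp <= 0xDFFF) {
                    throw new IllegalArgumentException(
                        "Unpaired surrogate at index " + i);
                }
                byte[] utf8 = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%')
                       .append(HEX[(b >> 4) & 0xF])
                       .append(HEX[b & 0xF]);
                }
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is the two UTF-8 bytes C3 A9
        System.out.println(toUri("http://example.org/caf\u00E9"));
        // U+10348 is the surrogate pair D800 DF48 in UTF-16
        // and the four UTF-8 bytes F0 90 8D 88
        System.out.println(toUri("http://example.org/\uD800\uDF48"));
    }
}
```

The trap is the second case: escaping each half of the surrogate pair 
separately yields six bytes that are not valid UTF-8.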

Unfortunately, there is no support for this in the standard 
libraries, at least in Java. Worse yet, many of the functions that 
claim to do part of this have subtle bugs that cause them to generate 
incorrect output. For instance, in Java 1.3 and earlier the 
URLEncoder class uses the platform default character set instead of 
UTF-8. In Java 1.4, there's finally an option to specify UTF-8; but 
if you don't, you still get the platform default encoding. Even 
then, a programmer still has to break up an IRI into parts and encode 
only some of them. For instance, 
URLEncoder.encode("http://www.yahoo.com:80/") will encode the colons 
and the slashes, even though they should not be encoded.
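
A small sketch of the difference, using the two-argument 
URLEncoder.encode from Java 1.4 so the character set is explicit. The 
class UrlEncoderDemo and its two methods are hypothetical names for 
illustration. Note also that URLEncoder does form encoding (a space 
becomes "+"), so it is only an approximation of URI escaping.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncoderDemo {

    /** Encodes the whole string at once, delimiters included. */
    public static String encodeWhole(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform
            throw new IllegalStateException(e);
        }
    }

    /**
     * Encodes each path segment separately and leaves the prefix
     * (scheme, host, port) and the "/" delimiters untouched.
     */
    public static String buildUrl(String prefix, String... segments) {
        StringBuilder out = new StringBuilder(prefix);
        for (String segment : segments) {
            out.append('/').append(encodeWhole(segment));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints http%3A%2F%2Fwww.yahoo.com%3A80%2F
        System.out.println(encodeWhole("http://www.yahoo.com:80/"));
        // Prints http://www.yahoo.com:80/caf%C3%A9/menu
        System.out.println(
            buildUrl("http://www.yahoo.com:80", "caf\u00E9", "menu"));
    }
}
```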

I suspect that, over time, if IRIs are adopted, the libraries will 
catch up and the bugs will eventually be worked out. However, we 
should be prepared for a lot of buggy, non-conforming code in the 
meantime. Worst-case scenario: this will be like early HTML, where 
implementation bugs became standard features out of necessity. Some 
older methods in Java to this day generate incorrect UTF-8 in the 
name of backwards compatibility with errors made in Java 1.0 in 1995.
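
The paragraph above does not name the methods. One likely candidate, 
and this is my assumption rather than something stated here, is 
DataOutputStream.writeUTF, which writes Java's "modified UTF-8". The 
sketch below compares its output with standard UTF-8; the class 
ModifiedUtf8Demo and its helpers are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    /** Returns the bytes writeUTF emits, minus its 2-byte length prefix. */
    public static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Formats bytes as upper-case hex with no separators. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String gothic = "\uD800\uDF48"; // U+10348 as a surrogate pair
        // Standard UTF-8: four bytes, F0908D88
        System.out.println(hex(gothic.getBytes(StandardCharsets.UTF_8)));
        // Modified UTF-8: each surrogate becomes three bytes, EDA080EDBD88
        System.out.println(hex(viaWriteUTF(gothic)));
        // Modified UTF-8 also writes U+0000 as C080 instead of 00
        System.out.println(hex(viaWriteUTF("\0")));
    }
}
```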

One way to alleviate the problems: specs that specify IRIs (or 
reinvent them, as older, pre-IRI specs like XLink do) should include 
detailed pseudo-code and perhaps actual code for making the 
conversion to URIs. They should not rely on handwaving about 
converting strings to UTF-8 and hex encoding certain bytes. The 
conversion to UTF-8 will be screwed up, repeatedly. We've seen this 
in many other APIs in the past, not least the Java class library 
itself. It is important to warn implementers where the mines are in 
the field they are about to cross.
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
|              http://www.cafeconleche.org/books/xian2/              |
|  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+

Received on Friday, 11 October 2002 15:43:22 UTC