Re: IRIs everywhere (including XML namespaces) from Martin Duerst on 2002-10-15 (xml-names-editor@w3.org from October 2002)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 15 Oct 2002 09:03:20 +0900
To: Elliotte Rusty Harold <elharo@metalab.unc.edu>, <www-tag@w3.org>
Cc: xml-names-editor@w3.org, www-international@w3.org
Message-Id: <4.2.0.58.J.20021015084249.041ef650@localhost>

Hello Elliotte,

At 15:40 02/10/11 -0400, Elliotte Rusty Harold wrote:

>A further thought on IRIs based on my experience today trying to add 
>support for them to XOM:

Many thanks for your work and your comments.

It would be good if you could copy www-international@w3.org on these
comments, as this is now (after the I18N WG rechartering a month ago)
the list for comments on the IRI draft (draft-duerst-iri-xx.txt).
You can also follow editing of this draft at
http://www.w3.org/International/iri-edit/. [Please ignore the Bidi
section for a few more days; it is being overhauled.]
[I have copied www-international@w3.org.]


>These things are complex. The process of taking a UTF-16 encoded Java (or 
>C++, or C#) string, encoding it in UTF-8, and then hex escaping some of 
>it, is non-trivial. It's absolutely doable, but it requires way more 
>knowledge of Unicode and the intricacies of various encodings of the 
>Unicode character set than most developers possess. Converting plane-1 
>characters encoded with surrogate pairs into UTF-8 is especially tricky. 
>Most programmers will not know there's anything special here they have to 
>watch out for. This is very much an experts only job.

I have just added a note with an example to the next version
of draft-duerst-iri-xx.txt. Please see
http://www.w3.org/International/iri-edit/draft-duerst-iri.txt
and search for 'BMP'.
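
To make the surrogate-pair pitfall concrete, here is a minimal sketch in
Java. It is only an illustration, not text from the draft: the class and
method names (SurrogateDemo, combine, escapeCodePoint) are made up for
this example, and U+10300 is simply a convenient character outside the
BMP. The point is that the two UTF-16 code units have to be joined into
one code point before the UTF-8 bytes are computed and hex-escaped.

```java
final class SurrogateDemo {
    // Joins a UTF-16 surrogate pair into the single code point it represents.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Writes the UTF-8 bytes of one value as %HH escapes.
    // Hypothetical helper; handles values from U+0080 to U+10FFFF only.
    static String escapeCodePoint(int cp) {
        int[] bytes;
        if (cp < 0x800) {
            bytes = new int[] { 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F) };
        } else if (cp < 0x10000) {
            bytes = new int[] { 0xE0 | (cp >> 12),
                                0x80 | ((cp >> 6) & 0x3F),
                                0x80 | (cp & 0x3F) };
        } else {
            bytes = new int[] { 0xF0 | (cp >> 18),
                                0x80 | ((cp >> 12) & 0x3F),
                                0x80 | ((cp >> 6) & 0x3F),
                                0x80 | (cp & 0x3F) };
        }
        StringBuilder out = new StringBuilder();
        for (int b : bytes) {
            out.append(String.format("%%%02X", b));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Right: join the pair D800 DF00 into U+10300, then encode once.
        int cp = combine('\uD800', '\uDF00');
        System.out.println(escapeCodePoint(cp));   // %F0%90%8C%80

        // Wrong: encoding each surrogate as if it were a character.
        System.out.println(escapeCodePoint(0xD800) + escapeCodePoint(0xDF00));
        // %ED%A0%80%ED%BC%80 (six bytes, and not valid UTF-8)
    }
}
```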


>Unfortunately, there is no support for this in the standard libraries, at 
>least in Java. Worse yet many of the functions that allege to do part of 
>this actually have various subtle bugs that cause them to generate 
>incorrect output. For instance, in Java 1.3 and earlier the URLEncoder 
>class uses the platform default character set instead of UTF-8. In Java 
>1.4, there's finally an option to specify UTF-8; but if you don't, you 
>still get the platform default encoding. Even then, a programmer still has 
>to break up an IRI into parts and encode only some of them. For instance 
>URLEncoder.encode("http://www.yahoo.com:80/") will encode the colons and 
>the slashes, even though they should not be encoded.

I think trying to reuse the existing methods in URLEncoder is not worth
the effort. Writing a new class or method is probably much easier.
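
As a rough idea of what such a new method could look like, here is a
short sketch in present-day Java (it needs Java 7 or later for
StandardCharsets). The class name IriToUri and the method toUri are
hypothetical, and the rule it applies is deliberately simplified: every
ASCII character is copied through untouched, and every character above
U+007F is converted to UTF-8 and hex-escaped. It does not deal with
special treatment of the host name or with any of the finer points of
the draft, so please take it as a starting point only.

```java
import java.nio.charset.StandardCharsets;

final class IriToUri {
    // Simplified IRI-to-URI conversion: escape only non-ASCII characters.
    static String toUri(String iri) {
        StringBuilder out = new StringBuilder(iri.length());
        for (int i = 0; i < iri.length(); ) {
            int cp = iri.codePointAt(i);          // joins a valid surrogate pair
            int len = Character.charCount(cp);    // 1 or 2 UTF-16 code units
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                // codePointAt returns a lone surrogate unchanged; reject it.
                throw new IllegalArgumentException("unpaired surrogate at index " + i);
            }
            if (cp < 0x80) {
                // ASCII, including ':' and '/', is copied as is.
                out.append((char) cp);
            } else {
                byte[] utf8 = iri.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUri("http://www.yahoo.com:80/"));
        System.out.println(toUri("http://example.org/r\u00E9sum\u00E9"));
    }
}
```

With this, "http://www.yahoo.com:80/" comes back unchanged, whereas
URLEncoder.encode with "UTF-8" turns the same string into
"http%3A%2F%2Fwww.yahoo.com%3A80%2F". Because the whole IRI is handled
in one pass, there is no need to split it into parts first.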


>I suspect, over time, if IRIs are adopted, the libraries will catch up; 
>and eventually the bugs will be worked out. However, we should be prepared 
>for a lot of buggy, non-conforming code in the meantime. Worst case 
>scenario: this will be like early HTML where implementation bugs become 
>standard features out of necessity. Some older methods in Java to this day 
>generate incorrect UTF-8 in the name of backwards compatibility with 
>errors made in Java 1.0 in 1995.
>
>One way to alleviate the problems: specs that specify IRIs (or reinvent 
>them as older, pre-IRI specs like XLink do) should include detailed 
>pseudo-code and perhaps actual code for making the conversion to URIs. 
>They should not rely on handwaving about converting strings to UTF-8 and 
>hex encoding certain bytes. The conversion to UTF-8 will be screwed up, 
>repeatedly. We've seen this in many other APIs in the past, not the least 
>of which is the Java class library itself. It is important to warn 
>implementers of the location of the mines in the field they are about to cross.

I'm very glad to add more examples, and pointers to the latest specs,
to draft-duerst-iri-xx.txt. As we are in the endgame with this one,
the more specific the contribution, the easier for me to integrate it.
The main problem with examples is that internet-drafts and RFCs have
to be all-ASCII. The notation for examples is therefore a major pain.

I'm a bit wary of pseudo-code, because the conversion from UTF-16 to
UTF-8 is not really something that should be defined in an IRI document.

On the other hand, I would also be very glad to have actual code, in
different programming languages, on the W3C web site, or in CVS.
If you or anybody else has something that they can contribute as
a starting point, that would be great.


Regards,    Martin.
Received on Tuesday, 15 October 2002 01:00:39 UTC