- From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
- Date: Fri, 11 Oct 2002 15:40:26 -0400
- To: <www-tag@w3.org>
- Cc: xml-names-editor@w3.org
A further thought on IRIs based on my experience today trying to add support for them to XOM: These things are complex. The process of taking a UTF-16 encoded Java (or C++, or C#) string, encoding it in UTF-8, and then hex escaping some of it, is non-trivial. It's absolutely doable, but it requires far more knowledge of Unicode and the intricacies of the various encodings of the Unicode character set than most developers possess. Converting plane-1 characters encoded with surrogate pairs into UTF-8 is especially tricky. Most programmers will not know there's anything special here they have to watch out for. This is very much an experts-only job.

Unfortunately, there is no support for this in the standard libraries, at least in Java. Worse yet, many of the functions that claim to do part of this have subtle bugs that cause them to generate incorrect output. For instance, in Java 1.3 and earlier the URLEncoder class uses the platform default character set instead of UTF-8. In Java 1.4, there's finally an option to specify UTF-8; but if you don't, you still get the platform default encoding. Even then, a programmer still has to break up an IRI into parts and encode only some of them. For instance, URLEncoder.encode("http://www.yahoo.com:80/") will encode the colons and the slashes, even though they should not be encoded.

I suspect, over time, if IRIs are adopted, the libraries will catch up; and eventually the bugs will be worked out. However, we should be prepared for a lot of buggy, non-conforming code in the meantime. Worst case scenario: this will be like early HTML, where implementation bugs became standard features out of necessity. Some older methods in Java to this day generate incorrect UTF-8 in the name of backwards compatibility with errors made in Java 1.0 in 1995.
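To make the steps above concrete, here is a minimal sketch of the conversion in present-day Java. It is an illustration only, not XOM's code and not taken from any spec. The class name IriToUri and both method names are hypothetical. It assumes a JDK that has String.codePointAt and StandardCharsets, neither of which existed in Java 1.4, and it assumes that only non-ASCII characters need escaping. It does not touch ASCII characters that are illegal in URIs, and it gives host names no special treatment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of XOM or the JDK.
final class IriToUri {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Percent-escapes the UTF-8 bytes of every non-ASCII character and
    // copies ASCII characters (including ':' and '/') through unchanged.
    static String toUri(String iri) {
        StringBuilder uri = new StringBuilder(iri.length());
        int i = 0;
        while (i < iri.length()) {
            // codePointAt joins a valid surrogate pair into one code point,
            // so a plane-1 character is encoded as a single 4-byte sequence.
            int codePoint = iri.codePointAt(i);
            int width = Character.charCount(codePoint);
            if (codePoint < 0x80) {
                uri.append((char) codePoint);
            } else if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
                // An unpaired surrogate has no UTF-8 form; fail loudly
                // instead of silently emitting a replacement byte.
                throw new IllegalArgumentException(
                    "Unpaired surrogate at index " + i);
            } else {
                byte[] utf8 = iri.substring(i, i + width)
                                 .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    uri.append('%')
                       .append(HEX[(b >> 4) & 0xF])
                       .append(HEX[b & 0xF]);
                }
            }
            i += width;
        }
        return uri.toString();
    }

    // The pitfall described above: URLEncoder treats its whole argument
    // as form data, so the URI delimiters get escaped too.
    static String encodeWholeString(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

With this sketch, IriToUri.toUri("http://www.yahoo.com:80/") returns the string unchanged, while IriToUri.encodeWholeString("http://www.yahoo.com:80/") returns "http%3A%2F%2Fwww.yahoo.com%3A80%2F". The plane-1 character U+1D11E, stored in a Java string as the surrogate pair D834 DD1E, comes out of toUri as %F0%9D%84%9E.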
One way to alleviate the problems: specs that specify IRIs (or reinvent them as older, pre-IRI specs like XLink do) should include detailed pseudo-code and perhaps actual code for making the conversion to URIs. They should not rely on handwaving about converting strings to UTF-8 and hex encoding certain bytes. The conversion to UTF-8 will be screwed up, repeatedly. We've seen this in many other APIs in the past, not the least of which is the Java class library itself. It is important to warn implementers of the location of the mines in the field they are about to cross.

--
+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          XML in a Nutshell, 2nd Edition (O'Reilly, 2002)           |
|              http://www.cafeconleche.org/books/xian2/              |
|  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News: http://www.cafeaulait.org/       |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+
Received on Friday, 11 October 2002 15:43:22 UTC