Re: IRIs everywhere (including XML namespaces) from Misha.Wolf@reuters.com on 2002-10-13 (www-tag@w3.org from October 2002)

From: <Misha.Wolf@reuters.com>
Date: Sun, 13 Oct 2002 09:08:15 +0100
To: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Cc: www-tag@w3.org, xml-names-editor@w3.org, w3c-i18n-ig@w3.org
Message-ID: <T5de96eaeb3c407b70799c@reuters.com>

Hello Elliotte,

You write about developers who will be made to jump through various
hoops as a consequence of XML namespaces following other W3C
specifications and using IRIs.  What software components are these folks
likely to be developing?  I would estimate that each OS would need one
component which translates IRIs to URIs.  You seem to be suggesting many
such components.

In an earlier mail [1], you wrote:
> >- XML 1.0 system identifiers [2]
>
> This references an erratum, not the actual spec. The original 1st
> edition spec is clear though that what we now call an IRI is allowed;
> that the escaping is performed by the processor as necessary, not the
> author.

I was puzzled then, as you seemed to be suggesting that, in contrast
with XML's usage of IRIs, in the case of XML namespaces, the author
would need to perform the IRI-to-URI conversion.  At the time I didn't
respond, but your latest mail suggests that we have very different
mental models of the way in which XML namespaces would use IRIs.  In my
model, the IRIs are not converted to URIs unless and until they are to
be dereferenced.  Please explain your model.
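To make that model concrete, here is a minimal Java sketch. The class name NamespaceName and its methods are hypothetical and come from no specification or library. It assumes namespace names are compared as plain character strings and that escaping (UTF-8 encoding followed by %HH escaping of the non-ASCII bytes) happens only when a dereferenceable URI is actually requested.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: a namespace name kept as an IRI until dereferenced.
final class NamespaceName {
    // Stored exactly as it appeared in the document; never escaped in place.
    private final String iri;

    NamespaceName(String iri) {
        this.iri = iri;
    }

    // Namespace identity: character-for-character comparison, no conversion.
    boolean sameNamespace(NamespaceName other) {
        return iri.equals(other.iri);
    }

    // The only place where IRI-to-URI conversion happens.
    // Assumption: every byte >= 0x80 of the UTF-8 form is %HH-escaped and
    // ASCII is passed through untouched. Unpaired surrogates are not handled
    // here (getBytes silently replaces them with '?').
    String toUriForDereference() {
        StringBuilder out = new StringBuilder();
        for (byte b : iri.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }
}
```

Under this sketch, a namespace name written with a literal non-ASCII character and the same name written in already-escaped form are different namespace names, because comparison never triggers the conversion.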

[1] http://lists.w3.org/Archives/Public/www-tag/2002Oct/0206

Thanks,
Misha


On 11/10/2002 20:40:26 Elliotte Rusty Harold wrote:
> A further thought on IRIs based on my experience today trying to add
> support for them to XOM:
>
> These things are complex. The process of taking a UTF-16 encoded Java
> (or C++, or C#) string, encoding it in UTF-8, and then hex escaping
> some of it, is non-trivial. It's absolutely doable, but it requires
> way more knowledge of Unicode and the intricacies of various
> encodings of the Unicode character set than most developers possess.
> Converting plane-1 characters encoded with surrogate pairs into UTF-8
> is especially tricky. Most programmers will not know there's anything
> special here they have to watch out for. This is very much an
> experts-only job.
>
> Unfortunately, there is no support for this in the standard
> libraries, at least in Java. Worse yet, many of the functions that
> claim to do part of this have subtle bugs that cause them to
> generate incorrect output. For instance, in Java 1.3
> and earlier the URLEncoder class uses the platform default character
> set instead of UTF-8. In Java 1.4, there's finally an option to
> specify UTF-8; but if you don't, you still get the platform default
> encoding. Even then, a programmer still has to break up an IRI into
> parts and encode only some of them. For instance,
> URLEncoder.encode("http://www.yahoo.com:80/") will encode the colons
> and the slashes, even though they should not be encoded.
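The behaviour described in the quoted paragraph can be reproduced with the two-argument URLEncoder.encode(String, String) overload that Java 1.4 introduced. The wrapper class UrlEncoderDemo and its method names are hypothetical; this is only a sketch of the pitfall, not a recommended conversion routine.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper showing why URLEncoder cannot be applied to a whole IRI.
final class UrlEncoderDemo {
    // Always names UTF-8 explicitly instead of relying on the platform default.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Encoding the whole string also escapes the delimiters.
    static String wholeUrl() {
        return encode("http://www.yahoo.com:80/");
        // returns "http%3A%2F%2Fwww.yahoo.com%3A80%2F"
    }

    // Encoding one component at a time keeps the delimiters out of it, but
    // URLEncoder implements form encoding, so a space becomes '+', not "%20".
    static String oneSegment() {
        return encode("caf\u00E9 au lait");
        // returns "caf%C3%A9+au+lait"
    }
}
```

Even with the encoding specified, the caller still has to split the IRI into its parts and decide which ones to pass through the encoder.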
>
> I suspect, over time, if IRIs are adopted, the libraries will catch
> up; and eventually the bugs will be worked out. However, we should be
> prepared for a lot of buggy, non-conforming code in the meantime.
> Worst case scenario: this will be like early HTML where
> implementation bugs become standard features out of necessity. Some
> older methods in Java to this day generate incorrect UTF-8 in the
> name of backwards compatibility with errors made in Java 1.0 in 1995.
>
> One way to alleviate the problems: specs that specify IRIs (or
> reinvent them as older, pre-IRI specs like XLink do) should include
> detailed pseudo-code and perhaps actual code for making the
> conversion to URIs. They should not rely on handwaving about
> converting strings to UTF-8 and hex encoding certain bytes. The
> conversion to UTF-8 will be screwed up, repeatedly. We've seen this
> in many other APIs in the past, not the least of which is the Java
> class library itself. It is important to warn implementers of the
> location of the mines in the field they are about to cross.
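As an illustration of the kind of code the quoted paragraph asks specifications to include, the following Java sketch converts a UTF-16 string to its escaped form, with the surrogate-pair step spelled out. The class name IriEscaper is hypothetical. The sketch assumes that only non-ASCII characters need escaping and that ASCII, including any existing %HH escapes, passes through unchanged. It relies on String.codePointAt, which was added in Java 5 and so was not available when this thread was written.

```java
// Hypothetical sketch: escape the non-ASCII characters of an IRI as UTF-8 %HH.
final class IriEscaper {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String toUri(String iri) {
        StringBuilder out = new StringBuilder(iri.length() + 16);
        for (int i = 0; i < iri.length(); ) {
            // Joins a valid high/low surrogate pair into one code point.
            int cp = iri.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);            // ASCII is copied as-is
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                // A lone surrogate has no UTF-8 form; refuse rather than guess.
                throw new IllegalArgumentException("unpaired surrogate at index " + i);
            } else {
                appendUtf8Escaped(out, cp);
            }
            i += Character.charCount(cp);         // 2 for a surrogate pair, else 1
        }
        return out.toString();
    }

    // Encodes one code point >= 0x80 as 2, 3, or 4 UTF-8 bytes, each escaped.
    private static void appendUtf8Escaped(StringBuilder out, int cp) {
        if (cp < 0x800) {
            escape(out, 0xC0 | (cp >> 6));
            escape(out, 0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            escape(out, 0xE0 | (cp >> 12));
            escape(out, 0x80 | ((cp >> 6) & 0x3F));
            escape(out, 0x80 | (cp & 0x3F));
        } else {
            // Supplementary planes: one 4-byte sequence, never two 3-byte ones.
            escape(out, 0xF0 | (cp >> 18));
            escape(out, 0x80 | ((cp >> 12) & 0x3F));
            escape(out, 0x80 | ((cp >> 6) & 0x3F));
            escape(out, 0x80 | (cp & 0x3F));
        }
    }

    private static void escape(StringBuilder out, int b) {
        out.append('%').append(HEX[b >> 4]).append(HEX[b & 0xF]);
    }
}
```

For example, the plane-1 character U+10400, held in a Java string as the surrogate pair D801 DC00, comes out as %F0%90%90%80. Encoding each surrogate separately would instead produce two three-byte sequences, which is not valid UTF-8.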



Received on Sunday, 13 October 2002 04:11:45 UTC