IRI support

From: Philippe Le Hegaret <plh@w3.org>
Date: Thu, 19 Jul 2007 16:14:19 -0400
To: public-sml <public-sml@w3.org>
Message-Id: <1184876060.4293.33.camel@localhost>

During the call, Marv suggested that we mustn't support IRI since it is
not explicitly mentioned in Java 5. While he is correct that IRI doesn't
appear explicitly, Java 5 automatically converts IRI to URI.

In fact, Java 5 does not support URI as such, but a deviation from it:
Character categories
A [non-US-ASCII] character is encoded by replacing it with the sequence
of escaped octets that represent that character in the UTF-8 character
set. The Euro currency symbol ('\u20AC'), for example, is encoded as
"%E2%82%AC". (Deviation from RFC 2396, which does not specify any
particular character set.)

This deviation matches the algorithm for converting an IRI to a URI
(this shouldn't be a surprise):
Step 2. For each character in 'ucschar' or 'iprivate', apply steps 2.1
through 2.3 below.
2.1 Convert the character to a sequence of one or more octets
        using UTF-8 [RFC3629]. 
2.2 Convert each octet to %HH, where HH is the hexadecimal
        notation of the octet value. Note that this is identical to the
        percent-encoding mechanism in section 2.1 of [RFC3986]. To
        reduce variability, the hexadecimal notation SHOULD use
        uppercase letters. 
2.3 Replace the original character with the resulting character
        sequence (i.e., a sequence of %HH triplets).
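
The steps quoted above can be sketched in a few lines of Java. This is
only an illustration, not what java.net.URI does internally: the class
and method names are made up for the example, StandardCharsets needs a
newer JDK than Java 5, and the 'ucschar'/'iprivate' test is simplified
to "any code point at or above U+00A0", which is slightly wider than
the real grammar.

```java
import java.nio.charset.StandardCharsets;

public class IriToUri {
    // Simplification (assumption): treat every code point >= U+00A0 as
    // 'ucschar' or 'iprivate'. The real grammar excludes a few ranges.
    static boolean needsEscaping(int codePoint) {
        return codePoint >= 0xA0;
    }

    // Hypothetical helper that applies steps 2.1 through 2.3 to a string.
    static String toUri(String iri) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < iri.length(); ) {
            int cp = iri.codePointAt(i);
            i += Character.charCount(cp);
            if (!needsEscaping(cp)) {
                out.appendCodePoint(cp); // leave other characters untouched
                continue;
            }
            // Step 2.1: convert the character to its UTF-8 octets.
            byte[] octets = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            // Steps 2.2 and 2.3: emit each octet as an uppercase %HH
            // triplet in place of the original character.
            for (byte b : octets) {
                out.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUri("http://example.org/\u20AC"));
    }
}
```

For the Euro sign this produces the same "%E2%82%AC" sequence that the
Java documentation gives as its example.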

So, in other words, not supporting IRI in Java 5 would require extra
processing, since one would be forced to check that the string does not
contain non-US-ASCII characters before passing it to the URI constructor.
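
To make this concrete, here is a small sketch of both behaviours. The
class name and the isUsAscii/parseUriOnly/toAscii helpers are
hypothetical; only java.net.URI and its create() and toASCIIString()
methods come from the JDK. The first part shows that the parser accepts
a non-US-ASCII character as-is and escapes it on request; the second is
the kind of extra check a URI-only implementation would have to add.

```java
import java.net.URI;

public class IriCheck {
    // Parses the string with java.net.URI and returns its US-ASCII form.
    static String toAscii(String s) {
        return URI.create(s).toASCIIString();
    }

    // Hypothetical extra check: true only if every char is US-ASCII.
    static boolean isUsAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical URI-only entry point that refuses IRIs up front.
    static URI parseUriOnly(String s) {
        if (!isUsAscii(s)) {
            throw new IllegalArgumentException(
                    "non-US-ASCII character in URI: " + s);
        }
        return URI.create(s);
    }

    public static void main(String[] args) {
        String iri = "http://example.org/\u20AC";
        // Accepted by the parser; non-US-ASCII characters get escaped.
        System.out.println(toAscii(iri));
        // Rejected only because of the additional check.
        try {
            parseUriOnly(iri);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```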

Received on Thursday, 19 July 2007 20:14:36 UTC