RE: IRI support from Waschke, Marvin G on 2007-07-19 (public-sml@w3.org from July 2007)

From: Waschke, Marvin G <Marvin.Waschke@ca.com>
Date: Thu, 19 Jul 2007 16:39:01 -0400
To: "Philippe Le Hegaret" <plh@w3.org>, "public-sml" <public-sml@w3.org>
Message-ID: <3EC3F0F6BFDDAA4DA4414B1EEE903252032494BB@USILMS12.ca.com>

Philippe--
At this point, I agree -- You have proved to me that Java 5 does support
IRIs. 

However, I would like to explain my hesitancy on this point. I find it
somewhat vexing that the support is not explicit. When I see support for
some feature proved by executing examples, I become a pedantic skeptic.
You can disprove support with a counter example, but you can't prove
support with examples. Hence, my doubts.

But the argument you present below is convincing. Thanks,
Marvin Waschke
BSO - Senior Technology Strategist
Tel:      +360 383 9022 +425 201 3502 x13502
Mobile:    +425 269 5592
Marvin.Waschke@ca.com
Blog: Iterating on IT Service


-----Original Message-----
From: public-sml-request@w3.org [mailto:public-sml-request@w3.org] On
Behalf Of Philippe Le Hegaret
Sent: Thursday, July 19, 2007 1:14 PM
To: public-sml
Subject: IRI support


During the call, Marv suggested that we mustn't support IRIs since they
are not explicitly mentioned in Java 5. While he is correct that IRI
doesn't appear explicitly, Java 5 automatically converts an IRI to a URI.

In fact, Java 5 does not implement URI exactly as specified, but a
deviation from it:
[[
Character categories
[...]
A [non-US-ASCII] character is encoded by replacing it with the sequence
of escaped octets that represent that character in the UTF-8 character
set. The Euro currency symbol ('\u20AC'), for example, is encoded as
"%E2%82%AC". (Deviation from RFC 2396, which does not specify any
particular character set.)
]]
http://java.sun.com/j2se/1.5.0/docs/api/java/net/URI.html#encode
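To make the quoted behaviour concrete, here is a minimal sketch using
java.net.URI. It relies on URI.create and URI.toASCIIString, both
documented on the page above; the class name, the helper name and the
example.org address are hypothetical and only serve as illustration.

```java
import java.net.URI;

class IriToUriDemo {
    // Parses the string with java.net.URI and returns its US-ASCII form.
    // Non-US-ASCII characters are replaced by the escaped UTF-8 octets.
    static String toAsciiUri(String iri) {
        return URI.create(iri).toASCIIString();
    }

    public static void main(String[] args) {
        // \u20AC is the Euro sign used in the Javadoc example.
        System.out.println(toAsciiUri("http://example.org/price/\u20AC"));
        // prints http://example.org/price/%E2%82%AC
    }
}
```

Note that the escaping is visible in toASCIIString(); toString() returns
the string as it was given.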

This deviation matches the algorithm for converting an IRI to a URI
(this shouldn't be a surprise):
[[
Step 2. For each character in 'ucschar' or 'iprivate', apply steps 2.1
through 2.3 below.
        
2.1 Convert the character to a sequence of one or more octets
        using UTF-8 [RFC3629]. 
        
2.2 Convert each octet to %HH, where HH is the hexadecimal
        notation of the octet value. Note that this is identical to the
        percent-encoding mechanism in section 2.1 of [RFC3986]. To
        reduce variability, the hexadecimal notation SHOULD use
        uppercase letters. 
        
2.3 Replace the original character with the resulting character
        sequence (i.e., a sequence of %HH triplets).
]]
http://www.apps.ietf.org/rfc/rfc3987.html#sec-3.1
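Steps 2.1 through 2.3 can be sketched by hand as follows. This is only
an illustration under a loud simplification: it treats every code point
above U+007F as 'ucschar' or 'iprivate', whereas RFC 3987 defines
narrower ranges. The class and method names are hypothetical, and
StandardCharsets requires Java 7 or later (on Java 5 one would call
getBytes("UTF-8") instead).

```java
import java.nio.charset.StandardCharsets;

class Rfc3987Step2Sketch {
    // Simplified version of step 2: percent-encodes every code point
    // above U+007F (assumption: stands in for 'ucschar' / 'iprivate').
    static String percentEncodeNonAscii(String iri) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < iri.length(); ) {
            int cp = iri.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.appendCodePoint(cp); // US-ASCII is copied unchanged
                continue;
            }
            // 2.1: convert the character to octets using UTF-8
            byte[] octets = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            // 2.2 and 2.3: emit each octet as %HH with uppercase hex
            for (byte b : octets) {
                out.append(String.format("%%%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeNonAscii("http://example.org/\u20AC"));
        // prints http://example.org/%E2%82%AC
    }
}
```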

So, in other words, not supporting IRI in Java 5 would require extra
processing since one would be forced to check that the URI does not
contain non-US-ASCII characters before passing it to the URI constructor.
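For illustration, the extra check might look something like the sketch
below. The class and method names are hypothetical, and rejecting with
IllegalArgumentException is just one possible choice, not something
java.net.URI does on its own.

```java
import java.net.URI;

class AsciiOnlyUriSketch {
    // True if every char of s is in the US-ASCII range (U+0000 to U+007F).
    static boolean isUsAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical strict factory: refuses anything with non-US-ASCII
    // characters before handing the string to java.net.URI.
    static URI createAsciiOnlyUri(String s) {
        if (!isUsAscii(s)) {
            throw new IllegalArgumentException("non-US-ASCII character in URI");
        }
        return URI.create(s);
    }

    public static void main(String[] args) {
        System.out.println(createAsciiOnlyUri("http://example.org/price"));
        try {
            createAsciiOnlyUri("http://example.org/price/\u20AC");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```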


Philippe

Received on Thursday, 19 July 2007 20:39:26 UTC