RE: IURIs, URIs, CHARMOD from Jeremy Carroll on 2001-09-24 (uri@w3.org from September 2001)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Mon, 24 Sep 2001 16:20:23 +0100
To: "Graham Klyne" <GK@ninebynine.org>
Cc: <uri@w3.org>
Message-ID: <JAEBJCLMIFLKLOJGMELDMEDBCCAA.jjc@hplb.hpl.hp.com>

>
> >All refs to [RFC 2396] include the extension given by [RFC 2732]
>
> (a) I don't see what significance RFC 2732 has regarding this I18N
> debate.  Or am I missing something?

Isn't that the one that adds '[' and ']' to the allowed characters?


>
> >IRI decoding algorithm
> [...]
> >The output MAY be a UTF-8 encoded string.
>
> (b) If specifications are going to be changed to allow IRIs, then I think
> the opportunity should also be grasped to insist that the octet-sequence
> encoding of an IRI MUST BE UTF-8.
>

The point about this MAY is simply that the algorithm doesn't always work
:-(.

For better or worse, there are legacy URIs that are not UTF-8 encoded, e.g.
in my example at http://lists.w3.org/Archives/Public/uri/2001Sep/0010.html

the URI http://iso-8859-1.example.org/D%FCrst/braces%7B%7D

cannot be decoded to include the u umlaut without knowledge of the
iso-8859-1 encoding scheme.
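
A minimal sketch of why this fails, in Python with the standard library only.
The helper name try_utf8 is hypothetical and used only for illustration; the
sketch simply unescapes the % sequences to octets and tries two charsets:

```python
from urllib.parse import unquote_to_bytes

uri = "http://iso-8859-1.example.org/D%FCrst/braces%7B%7D"

# Undo the % escapes, giving raw octets with no charset attached.
raw = unquote_to_bytes(uri)

def try_utf8(octets):
    """Hypothetical helper: return the UTF-8 reading, or None if invalid."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None

as_utf8 = try_utf8(raw)               # 0xFC is not valid UTF-8
as_latin1 = raw.decode("iso-8859-1")  # needs out-of-band charset knowledge

print(as_utf8)
print(as_latin1)
```

Only the ISO-8859-1 reading recovers the u umlaut, and nothing in the URI
itself says which charset to use.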

IRIs, because of the idempotency, can include any URI. Some of these URIs
are the encoding of an IRI without any % escapes other than %25; some of
them, such as ones with D%FCrst in them, are not. Even ones which are the
UTF-8 encoding of some IRI might not have that as the intended decoded
form; only the end server knows how to correctly decode a URI.
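
To illustrate the ambiguity, here is a small Python sketch. The path segment
D%C3%BCrst is an assumed example, not taken from any real server; its octets
are valid under both UTF-8 and ISO-8859-1, so a client cannot tell which
reading was intended:

```python
from urllib.parse import unquote_to_bytes

# Assumed example segment: the octets 0xC3 0xBC are valid in both charsets.
octets = unquote_to_bytes("D%C3%BCrst")

utf8_reading = octets.decode("utf-8")         # one character: u umlaut
latin1_reading = octets.decode("iso-8859-1")  # two characters: U+00C3, U+00BC

print(len(utf8_reading), len(latin1_reading))
```

Both decodes succeed, which is why a successful UTF-8 decode does not prove
that UTF-8 was what the server meant.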

The decode algorithm works correctly for any URI that is UTF-8 encoded.

The IRI itself certainly need not be UTF-8 encoded; the rule is that, as part
of the encoding to URI syntax, the IRI is transcoded to UTF-8. A stronger
requirement that IRIs be UTF-8 would unnecessarily prevent the inclusion of
IRIs in non-UTF-8 documents.
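
A rough Python sketch of that rule. The function iri_to_uri is a hypothetical
helper, not the exact algorithm from the proposal: it assumes that only
non-ASCII octets are %-escaped, and the example.org IRIs are made up. The
same IRI is read from an ISO-8859-1 document and from a UTF-8 document:

```python
def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: transcode to UTF-8, then %-escape non-ASCII octets.

    ASCII octets, including '%', are passed through unchanged (an assumption).
    """
    return "".join(
        chr(octet) if octet < 0x80 else "%{:02X}".format(octet)
        for octet in iri.encode("utf-8")
    )

# The same IRI as it would appear, octet for octet, in two documents.
latin1_doc = b"http://example.org/D\xfcrst"
utf8_doc = b"http://example.org/D\xc3\xbcrst"

# Each document is first decoded using its own charset ...
iri_a = latin1_doc.decode("iso-8859-1")
iri_b = utf8_doc.decode("utf-8")

# ... and only then converted to URI syntax via UTF-8.
uri_a = iri_to_uri(iri_a)
uri_b = iri_to_uri(iri_b)
print(uri_a, uri_b)
```

Both documents yield the same URI, so the document's own encoding does not
need to be UTF-8.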

One of the points about my proposal is that current practice (which is all
the proposed algorithm amounts to) is, in fact, very robust, and can work
despite multiple character encodings used by servers and multiple character
encodings used for the IRIs themselves. This robustness depends on URIs not
containing '%', which hence has completely special treatment in my proposal.
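
A small Python sketch of the idempotency this relies on. It assumes a
hypothetical encoder, iri_to_uri, that escapes only non-ASCII octets and
never touches '%'; urllib.parse.quote is used purely as a contrast, since it
does escape '%':

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Hypothetical encoder: %-escape non-ASCII UTF-8 octets, leave '%' alone."""
    return "".join(
        chr(octet) if octet < 0x80 else "%{:02X}".format(octet)
        for octet in iri.encode("utf-8")
    )

legacy = "http://iso-8859-1.example.org/D%FCrst/braces%7B%7D"

# A legacy URI passes through unchanged, whatever charset its server uses.
unchanged = iri_to_uri(legacy)

# Encoding twice gives the same result as encoding once.
once = iri_to_uri("http://example.org/D\u00fcrst")
twice = iri_to_uri(once)

# Contrast: an encoder that escapes '%' rewrites existing escapes.
double_escaped = quote("D%FCrst", safe="")
print(unchanged == legacy, once == twice, double_escaped)
```

The contrast case shows what would go wrong if '%' were escaped like any
other character: the existing %FC would become %25FC.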

Jeremy

Received on Monday, 24 September 2001 11:21:01 UTC