Re: Interpretation if %-escapes in IRIs from Bjoern Hoehrmann on 2003-04-30 (public-iri@w3.org from April 2003)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 30 Apr 2003 20:31:03 +0200
To: Martin Duerst <duerst@w3.org>
Cc: public-iri@w3.org
Message-ID: <3ef50548.532587350@smtp.bjoern.hoehrmann.de>

* Martin Duerst wrote:
>>   Is there a section in the current IRI draft that specifies how
>>%-escapes in IRIs are to be interpreted?

>%-escapes in IRIs are handled mostly the same way as in URIs.
>There is no special text about this. Do you think there should
>be? If yes, where should it go? What should it say?

The %-escaping mechanism in RFC 2396 is an irreversible encoding: RFC
2396 says that you can escape "&" as %26, but it does not say that %26
can be unescaped to "&". RFC 2396 also does not specify how characters
outside the US-ASCII range have to be %-escaped, and neither does the
IRI draft (except when IRIs are converted to URIs).

IMO, the IRI draft should say that if %-escaping is used in an IRI, the
escape sequences must be generated from UTF-8 octets and %-escapes must
be interpreted as octets in a UTF-8 sequence.
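
To illustrate what I mean, here is a minimal Python sketch. The function
names escape_utf8 and unescape_utf8 are made up for this example and do
not come from any draft. Note that the sketch unescapes every %-escape,
including reserved characters such as %26, which a real implementation
would probably want to leave alone.

```python
from urllib.parse import unquote_to_bytes


def escape_utf8(text: str) -> str:
    """Percent-escape each non-ASCII character via its UTF-8 octets.

    Hypothetical helper: ASCII characters are passed through unchanged.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def unescape_utf8(text: str) -> str:
    """Interpret all %-escapes as octets of a UTF-8 sequence.

    Hypothetical helper: raises UnicodeDecodeError if the resulting
    octets are not well-formed UTF-8.
    """
    return unquote_to_bytes(text).decode("utf-8")
```

With these definitions, escape_utf8("http://www.example.org/ö") gives
"http://www.example.org/%C3%B6", and unescape_utf8 reverses that, while
an escape such as %F6 on its own is rejected.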

This approach would be problematic if the IRI originates from a URI
that uses %-escapes that cannot be interpreted as a UTF-8 sequence, or
if people want to encode arbitrary binary data in the IRI. The latter is
IMO not a valid use case for IRIs: if a specific scheme wants binary
data, it should first convert the bytes to characters (using e.g.
Base64) and then apply %-escaping to these characters if necessary. The
former could be resolved either by making such URIs unconvertible or by
adding an additional escaping scheme for either non-UTF-8 octets or
UTF-8 octets (like http://www.example.org/%U0000F6 for
http://www.example.org/ö). I prefer to make them unconvertible.
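
As a rough sketch of both points in Python (the names is_convertible and
binary_to_segment are hypothetical, and this is only one way the rule
could be implemented): a URI counts as convertible only if its
%-escapes decode as UTF-8, and binary data is turned into Base64
characters before any %-escaping is applied.

```python
import base64
from urllib.parse import quote, unquote_to_bytes


def is_convertible(uri: str) -> bool:
    """Return True if all %-escapes in the URI form well-formed UTF-8.

    Hypothetical check: URIs that fail it would be "unconvertible".
    """
    try:
        unquote_to_bytes(uri).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def binary_to_segment(data: bytes) -> str:
    """Convert bytes to Base64 characters, then %-escape those characters.

    Hypothetical helper: safe="" makes quote() escape "+", "/" and "="
    as well, so the result can sit in a single path segment.
    """
    characters = base64.b64encode(data).decode("ascii")
    return quote(characters, safe="")
```

Here is_convertible("http://www.example.org/%C3%B6") is true, while
is_convertible("http://www.example.org/%F6") is false, because a lone
F6 octet is not a UTF-8 sequence.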

IRIs are sequences of characters. I think this definition should not
change to a sequence of characters intermixed with arbitrary octets
after unescaping %-escapes.

Received on Wednesday, 30 April 2003 14:31:17 UTC