Re: Interpretation if %-escapes in IRIs [escapeInterpret-14]

Hello Bjoern,

I decided to create an issue for this, even though it may
not require any changes.
http://www.w3.org/International/iri-edit/#escapeInterpret-14

At 20:31 03/04/30 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >>   Is there a section in the current IRI draft that specifies how
> >>%-escapes in IRIs are to be interpreted?
>
> >%-escapes in IRIs are handled mostly the same way as in URIs.
> >There is no special text about this. Do you think there should
> >be? If yes, where should it go? What should it say?
>
>The %-escaping mechanism in RFC 2396 is an irreversible encoding, RFC
>2396 says, you can escape "&" as %26 but it does not say, that %26 can
>be unescaped to "&". RFC 2396 also does not specify how characters
>outside the US-ASCII range have to be %-escaped, neither does the IRI
>draft (except when IRIs are converted to URIs).

mostly agreed.


>IMO, the IRI draft should say, that if %-escaping is used in an IRI, the
>escape sequence must be generated from UTF-8 octets and %-escapes must
>be interpreted as octets in an UTF-8 sequence.

why should it say so? In that case, you should not really use
%-escaping in an IRI, you should use real characters.


>This approach would be problematical if the IRI originates from an URI
>that used %-escapes that could not be interpreted as UTF-8 sequence or
>if people like to encode abitrary binary data in the IRI. The latter is
>IMO not a valid use case for IRIs, if a specific scheme wants binary
>data, it should first convert the bytes to characters (using e.g.
>Base64) and then apply %-escpaping to these characters if necessary. The
>former could be resolved by either making such URIs unconvertable or by
>adding an additional escaping scheme for either non-UTF-8 octets or
>UTF-8 octets (like http://www.example.org/%U0000F6 for
>http://www.example.org/$B‹
(B, I prefer to make them unconvertable.

Exactly. There are definitely URIs that have escape sequences that
don't match UTF-8. There are examples in the latest version of the
draft. It can be pretty frequent that e.g. some part is UTF-8 based,
but another is not (e.g. the path is not, but the query part is, or
the other way round) In order to be able to use IRIs as a replacement
for URIs, we cannot have URIs that cannot be expressed as IRIs.

As for raw data, any actual examples I know support your oppinion
(e.g. the data: URI), but still people are bringing this case up
too often to just ignore it.


>IRIs are a sequence of characters, I think this definition should not
>change to a sequence of characters, intermixed with abitrary octets
>after unescaping %-escapes.

IRIs are a sequence of characters, correct. The same way that URIs
are a sequence of characters. A %-escape sequence is three characters,
both for IRIs and for URIs. This was discussed (for the URI side)
at length on www-tag. I don't know of any part of the draft where
IRIs are defined as "sequence of characters, intermixed with abitrary
octets after unescaping %-escapes". There is only one place where
such a construct is used, in the mapping from URIs to IRIs, in
an intermediate stage.

Do you think there is a need for any clarifications to the draft?
If yes, where?


Regards,    Martin.

Received on Thursday, 1 May 2003 16:40:32 UTC