- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 01 May 2003 16:18:07 -0400
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: public-iri@w3.org
Hello Bjoern, I decided to create an issue for this, even though it may not require any changes. http://www.w3.org/International/iri-edit/#escapeInterpret-14 At 20:31 03/04/30 +0200, Bjoern Hoehrmann wrote: >* Martin Duerst wrote: > >> Is there a section in the current IRI draft that specifies how > >>%-escapes in IRIs are to be interpreted? > > >%-escapes in IRIs are handled mostly the same way as in URIs. > >There is no special text about this. Do you think there should > >be? If yes, where should it go? What should it say? > >The %-escaping mechanism in RFC 2396 is an irreversible encoding, RFC >2396 says, you can escape "&" as %26 but it does not say, that %26 can >be unescaped to "&". RFC 2396 also does not specify how characters >outside the US-ASCII range have to be %-escaped, neither does the IRI >draft (except when IRIs are converted to URIs). mostly agreed. >IMO, the IRI draft should say, that if %-escaping is used in an IRI, the >escape sequence must be generated from UTF-8 octets and %-escapes must >be interpreted as octets in an UTF-8 sequence. why should it say so? In that case, you should not really use %-escaping in an IRI, you should use real characters. >This approach would be problematical if the IRI originates from an URI >that used %-escapes that could not be interpreted as UTF-8 sequence or >if people like to encode abitrary binary data in the IRI. The latter is >IMO not a valid use case for IRIs, if a specific scheme wants binary >data, it should first convert the bytes to characters (using e.g. >Base64) and then apply %-escpaping to these characters if necessary. The >former could be resolved by either making such URIs unconvertable or by >adding an additional escaping scheme for either non-UTF-8 octets or >UTF-8 octets (like http://www.example.org/%U0000F6 for >http://www.example.org/$B‹ (B, I prefer to make them unconvertable. Exactly. There are definitely URIs that have escape sequences that don't match UTF-8. There are examples in the latest version of the draft. It can be pretty frequent that e.g. some part is UTF-8 based, but another is not (e.g. the path is not, but the query part is, or the other way round) In order to be able to use IRIs as a replacement for URIs, we cannot have URIs that cannot be expressed as IRIs. As for raw data, any actual examples I know support your oppinion (e.g. the data: URI), but still people are bringing this case up too often to just ignore it. >IRIs are a sequence of characters, I think this definition should not >change to a sequence of characters, intermixed with abitrary octets >after unescaping %-escapes. IRIs are a sequence of characters, correct. The same way that URIs are a sequence of characters. A %-escape sequence is three characters, both for IRIs and for URIs. This was discussed (for the URI side) at length on www-tag. I don't know of any part of the draft where IRIs are defined as "sequence of characters, intermixed with abitrary octets after unescaping %-escapes". There is only one place where such a construct is used, in the mapping from URIs to IRIs, in an intermediate stage. Do you think there is a need for any clarifications to the draft? If yes, where? Regards, Martin.
Received on Thursday, 1 May 2003 16:40:32 UTC