Re: Interpretation of %-escapes in IRIs [escapeInterpret-14] from Martin Duerst on 2004-04-27 (public-iri@w3.org from April 2004)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 27 Apr 2004 17:45:40 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-iri@w3.org
Message-Id: <4.2.0.58.J.20040427174507.07450fb0@localhost>

Hello Bjoern,

I haven't heard anything on this, and I'm therefore closing this issue.

Regards,    Martin.

At 06:29 03/06/27 -0400, Martin Duerst wrote:

>Hello Bjoern,
>
>Many thanks for all your questions.
>
>Most of these questions, if not all of them, are answered
>in the actual draft. Please check it and tell me where you
>think something is missing or not clear enough.
>
>At 05:37 03/05/02 +0200, Bjoern Hoehrmann wrote:
>>* Martin Duerst wrote:
>> >>IMO, the IRI draft should say, that if %-escaping is used in an IRI, the
>> >>escape sequence must be generated from UTF-8 octets and %-escapes must
>> >>be interpreted as octets in a UTF-8 sequence.
>> >
>> >why should it say so? In that case, you should not really use
>> >%-escaping in an IRI, you should use real characters.
>>
>>What if it is impossible to use "real" characters due to limitations of the
>>transport media, the transport encoding,
>
>Then preferably use a transport-specific escaping or encoding
>(e.g. the various MIME mechanisms for email, numeric character
>references for HTML and XML,...).
>
>
>>if I need to escape a reserved character to avoid its special meaning,
>
>Then use escaping. That's very clear in the draft.
>
>
>>if the character is disallowed
>
>Then use escaping. Again, the draft says so.
>
>
>>or if I want to encode binary data that does not represent any
>>character?
>
>Then use escaping. Same thing again.
>
>
>>What if my IRI-aware application receives an IRI containing %-escape
>>sequences but needs characters in order to work, like some kind of
>>server for file transfer expecting a file name or a database frontend
>>expecting a search string?
>
>Then the server will do the conversion from %-escapes to octets
>the same way it currently does, and some servers (e.g. Apache and IIS
>on WinNT/2000/XP), or server configurations, will convert further,
>where possible, to whatever character encoding is used internally
>in the server.
>
>
>>Let's say there is an 'uri' URI scheme and an 'iri' IRI scheme
>
>There is really no such difference. All URI schemes can be used
>with IRIs. For some, the benefit of using IRIs is greater than
>for others. I think what you wanted to say is that there are
>two protocol slots, let's say
>iri="http://www.example.org/search?Bj+APY-rn" and
>uri="http://www.example.org/search?Bj+APY-rn". I'll assume
>this for the following examples, but I'll not change your syntax.
>
>
>>(the + in
>>the query part has no special meaning and may thus stay unescaped):
>>
>>   uri://www.example.org/search?Bj+APY-rn
>>   iri://www.example.org/search?Bj+APY-rn
>>
>>Decoding the query part of the URI I would get the octets
>>
>>   <42><6A><2B><41><50><59><2D><72><6E>
>
>Yes.
>
>
>>The database frontend would then search for "Bjo"rn",
>
>Sorry to have to use "Bjo"rn" for your example due to my
>Japanese mailer.
>
>
>>since it decodes
>>the octets represented by characters in the URL as UTF-7 octets.
>
>If the database frontend is programmed that way, then that's correct.
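
As a rough illustration of the decoding step discussed above (a sketch, not
something taken from the draft): the nine octets from the query part can be
read either literally or as UTF-7, depending on how the frontend is written.
The variable names are made up for this example.

```python
# Octets obtained from the query part "Bj+APY-rn" (all ASCII, no %-escapes involved).
octets = bytes.fromhex("426A2B4150592D726E")

# The octets are simply the ASCII characters of the query string.
assert octets == b"Bj+APY-rn"

# A frontend that chooses to interpret them as UTF-7 gets the name with o-umlaut.
as_utf7 = octets.decode("utf-7")

# A frontend that takes them as plain ASCII/UTF-8 gets the literal string.
as_literal = octets.decode("utf-8")

print(ascii(as_utf7))   # ascii() keeps the output readable on non-Unicode terminals
print(as_literal)
```

Which of the two results is "right" is decided by the frontend, not by the
URI or IRI syntax.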
>
>
>>What
>>about the IRI? Is the frontend supposed to search for "Bj+APY-rn" or
>>for "Bjo"rn"?
>
>If the same frontend is used, the same thing will happen.
>The frontend has no way to distinguish whether it receives a URI
>or an IRI.
>
>
>>Is a data character in an IRI a character or is it a
>>representation of an octet or even something else?
>
>It is a character. That does not prohibit that these characters
>are (mis)used to represent other characters, as in the case of
>UTF-7.
>
>
>>If an IRI data character is a "real" character, do %-escape sequences
>>also refer to real characters? Are these IRIs equivalent:
>>
>>   iri://www.example.org/search?Bj%F6rn
>>   iri://www.example.org/search?Bjo"rn
>
>These are definitely not equivalent, because the %F6 is based
>on Latin-1, not UTF-8.
>
>
>>just like these URIs are:
>>
>>   uri://www.example.org/search?a
>>   uri://www.example.org/search?%61
>
>If you read section 6 of
>http://www.ietf.org/internet-drafts/draft-fielding-uri-rfc2396bis-03.txt
>carefully, you'll see that these are equivalent
>under certain definitions of equivalence, and for
>those protocols/applications that use this definition
>of equivalence.
>
>
>>Are these equivalent:
>>
>>   iri://www.example.org/search?Bj%C3%B6rn
>>   iri://www.example.org/search?Bjo"rn
>
>These are equivalent under certain definitions of equivalence.
>
>
>>and are these IRIs:
>>
>>   iri://www.example.org/search?a
>>   iri://www.example.org/search?%61
>
>They are as equivalent as the same URIs (see above).
>
>
>>equivalent? If the latter two IRIs are equivalent, how would one then
>>encode binary data in an IRI? What octets are represented in the query
>>part of e.g.
>>
>>   iri://www.example.org/search?<U+20AC>
>>   iri://www.example.org/search?<U+1D7F6>
>
>The octets, when octets are needed, are based on UTF-8, i.e.
>E2 82 AC in the first case, and F0 9D 9F B6 in the second case.
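
The two octet sequences above can be checked with a few lines of Python. This
is only a sketch; the helper name utf8_octets is invented for the example, and
urllib.parse.quote is used because it %-escapes via UTF-8 by default.

```python
from urllib.parse import quote

def utf8_octets(ch: str) -> str:
    """Return the UTF-8 octets of ch as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

for ch in ("\u20ac", "\U0001d7f6"):
    # Print the code point, its UTF-8 octets, and the %-escaped form.
    print(f"U+{ord(ch):04X}", utf8_octets(ch), quote(ch))
```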
>
>
>>Suppose I want to send an IRI in a text/plain e-mail using us-ascii,
>>but the IRI has non-ASCII characters, like
>>
>>   iri://www.example.org/bjo"rn
>
>In the first place, you should not use us-ascii for sending this IRI.
>There are many encodings, starting with iso-8859-1 and utf-8 that
>can easily transfer the IRI.
>
>
>>can I use %-escaping to encode the 'o"' and if yes, what would the IRI
>>then look like? Would it be
>>
>>   iri://www.example.org/bj%F6rn
>>   iri://www.example.org/bj%ECrn
>>   iri://www.example.org/bj%C3%B6rn
>
>If anything, it would be this one, with "bj%C3%B6rn", using UTF-8.
>While this would not work for namespaces (i.e. XML parsers and
>XSLT processors would treat the namespaces
>iri://www.example.org/bjo"rn and iri://www.example.org/bj%C3%B6rn
>differently), it would at least resolve to the same thing, e.g.
>over http (exactly the same applies to http://www.example.org/search?a
>and http://www.example.org/search?%61).
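
To make the difference between the three candidates concrete, here is a small
sketch (the function name decodes_as_utf8 is an assumption of this example).
It unescapes each candidate to octets and checks whether those octets form
valid UTF-8.

```python
from urllib.parse import unquote_to_bytes

candidates = ["bj%F6rn", "bj%ECrn", "bj%C3%B6rn"]

def decodes_as_utf8(escaped: str):
    """Return the decoded text if the %-escaped octets are valid UTF-8, else None."""
    try:
        return unquote_to_bytes(escaped).decode("utf-8")
    except UnicodeDecodeError:
        return None

for c in candidates:
    print(c, "->", ascii(decodes_as_utf8(c)))

# A character-by-character comparison, as used for namespace names,
# still treats the escaped and unescaped forms as different strings.
assert "bj\u00f6rn" != "bj%C3%B6rn"
```

Only the last candidate round-trips through UTF-8; the first two yield octets
that are not valid UTF-8.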
>
>
>>   iri://www.example.org/bj%00%F6rn
>>   ...
>>
>>Currently neither RFC 2396 nor the IRI draft gives any advice here. Is
>>this a scenario not supported by IRIs?
>
>Which scenario? The scenario of sending IRIs over US-ASCII?
>Or another one?
>
>
>>If yes, why do you think it is
>>not necessary or not possible to support it,
>
>If you mean sending IRIs over US-ASCII, then it's not possible in
>the same way it's not really possible to send German or Japanese
>email over US-ASCII.
>
>
>>and why does the IRI draft
>>not mention that %-escaping cannot be used for non-ASCII characters, but
>>rather says it SHOULD NOT be used?
>
>Because it depends on exactly what you are doing.
>
>
>>If it is possible to use %-escaping
>>for non-ASCII characters, the IRI draft must say how the non-ASCII
>>characters have to be encoded (actually, how any character is to be
>>encoded) and should say, how one gets the characters back.
>
>There are two very detailed sections in the draft discussing this.
>For escaping, see section 3.1, "Mapping of IRIs to URIs".
>For unescaping, see section 3.2, "Converting URIs to IRIs".
>If you find anything that is unclear, please tell us, so that I can
>fix it.
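
For readers who want a feel for the two directions, the following is a heavily
simplified sketch and not the procedure from the draft itself. The function
names iri_to_uri and uri_to_iri are invented here. The sketch assumes that
only non-ASCII characters are escaped, always via UTF-8, and that unescaping
is attempted only for runs of %-escapes with the high bit set. It leaves a
whole run escaped if any part of it is not valid UTF-8, and it ignores any
further restrictions the draft may place on which characters may be unescaped.

```python
import re

def iri_to_uri(iri: str) -> str:
    """%-escape every non-ASCII character as its UTF-8 octets; leave ASCII untouched."""
    return "".join(
        ch if ord(ch) < 0x80 else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )

# One or more consecutive %-escapes whose value is 0x80 or above.
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def uri_to_iri(uri: str) -> str:
    """Turn runs of high-bit %-escapes into characters when they are valid UTF-8."""
    def unescape(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            # Not UTF-8 (for example Latin-1 %F6): keep the escapes as they are.
            return match.group(0)
    return _HIGH_ESCAPES.sub(unescape, uri)

print(iri_to_uri("http://www.example.org/bj\u00f6rn"))
print(ascii(uri_to_iri("http://www.example.org/bj%C3%B6rn")))
print(uri_to_iri("http://www.example.org/bj%F6rn"))
```

Escapes of ASCII characters such as %61 are deliberately left alone in this
sketch, since unescaping them is a separate question of URI equivalence.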
>
>
>Regards,   Martin.
Received on Tuesday, 27 April 2004 04:50:47 UTC