Re: Interpretation of %-escapes in IRIs [escapeInterpret-14] from Bjoern Hoehrmann on 2003-05-02 (public-iri@w3.org from May 2003)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Fri, 02 May 2003 05:37:51 +0200
To: Martin Duerst <duerst@w3.org>
Cc: public-iri@w3.org
Message-ID: <3f2ce7ed.656144636@smtp.bjoern.hoehrmann.de>

* Martin Duerst wrote:
>>IMO, the IRI draft should say, that if %-escaping is used in an IRI, the
>>escape sequence must be generated from UTF-8 octets and %-escapes must
>>be interpreted as octets in an UTF-8 sequence.
>
>why should it say so? In that case, you should not really use
>%-escaping in an IRI, you should use real characters.

What if it is impossible to use "real" characters due to limitations of
the transport medium or the transport encoding, if I need to escape a
reserved character to avoid its special meaning, if the character is
disallowed, or if I want to encode binary data that does not represent
any character?

What if my IRI-aware application receives an IRI containing %-escape
sequences but needs characters in order to work, like some kind of
server for file transfer expecting a file name or a database frontend
expecting a search string?

Let's say there is an 'uri' URI scheme and an 'iri' IRI scheme (the + in
the query part has no special meaning and may thus stay unescaped):

  uri://www.example.org/search?Bj+APY-rn
  iri://www.example.org/search?Bj+APY-rn

Decoding the query part of the URI I would get the octets

  <42><6A><2B><41><50><59><2D><72><6E>

The database frontend would then search for "Björn", since it interprets
the octets represented by the characters in the URI as UTF-7. What
about the IRI? Is the frontend supposed to search for "Bj+APY-rn" or
for "Björn"? Is a data character in an IRI a character, or is it a
representation of an octet, or even something else?
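
To make the two readings concrete, here is a minimal sketch in Python
(the variable names are illustrative only, and nothing here is mandated
by RFC 2396 or the IRI draft). It takes the octets listed above and
interprets them once as UTF-7 and once as plain US-ASCII:

```python
# Octets obtained from the query part "Bj+APY-rn".
octets = bytes.fromhex("426A2B4150592D726E")

# Reading 1: the octets are UTF-7, so "+APY-" is a shifted
# sequence that stands for U+00F6.
as_utf7 = octets.decode("utf-7")

# Reading 2: each octet is simply the US-ASCII character it denotes.
as_ascii = octets.decode("ascii")

print(as_utf7)
print(as_ascii)
```

Both readings are valid decodings of the same nine octets; only
knowledge outside the identifier tells the frontend which one to use.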

If an IRI data character is a "real" character, do %-escape sequences
also refer to real characters? Are these IRIs equivalent:

  iri://www.example.org/search?Bj%F6rn
  iri://www.example.org/search?Björn

just like these URIs are:

  uri://www.example.org/search?a
  uri://www.example.org/search?%61

Are these equivalent:

  iri://www.example.org/search?Bj%C3%B6rn
  iri://www.example.org/search?Björn

and are these IRIs:

  iri://www.example.org/search?a
  iri://www.example.org/search?%61

equivalent? If the latter two IRIs are equivalent, how would one then
encode binary data in an IRI? What octets are represented in the query
part of e.g.

  iri://www.example.org/search?<U+20AC>
  iri://www.example.org/search?<U+1D7F6>
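
As an illustration of why the question has no single answer without a
stated encoding, the following sketch (Python; the choice of the three
encodings is an assumption made for the example, not something the IRI
draft prescribes) prints the octets each of the two characters would
turn into:

```python
# The two characters from the examples above.
chars = {"U+20AC": "\u20ac", "U+1D7F6": "\U0001d7f6"}

# Candidate encodings; which one an IRI implies is the open question.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# Map (character name, encoding) to the resulting octets in hex.
octets = {
    (name, enc): ch.encode(enc).hex().upper()
    for name, ch in chars.items()
    for enc in encodings
}

for (name, enc), hexed in octets.items():
    print(f"{name} as {enc}: {hexed}")
```

Each encoding yields a different octet sequence for the same character,
so "the octets represented" is undefined until one encoding is fixed.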

Suppose I want to send an IRI in a text/plain e-mail using us-ascii,
but the IRI has non-ASCII characters, like

  iri://www.example.org/björn

can I use %-escaping to encode the "ö" and, if so, what would the IRI
then look like? Would it be

  iri://www.example.org/bj%F6rn
  iri://www.example.org/bj%ECrn
  iri://www.example.org/bj%C3%B6rn
  iri://www.example.org/bj%00%F6rn
  ...
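
Some of these candidates can be reproduced mechanically. The sketch
below (Python; escape_non_ascii is a hypothetical helper written for
this example, not a function from any library or specification) escapes
only the non-ASCII characters, using whichever encoding is passed in:

```python
def escape_non_ascii(text, encoding):
    """%-escape every non-ASCII character using the given encoding."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # ASCII characters are copied through unchanged.
            out.append(ch)
        else:
            # Each octet of the encoded character becomes one %XX triplet.
            out.append("".join(f"%{b:02X}" for b in ch.encode(encoding)))
    return "".join(out)


iri = "iri://www.example.org/bj\u00f6rn"
for enc in ("iso-8859-1", "utf-8", "utf-16-be"):
    print(enc, escape_non_ascii(iri, enc))
```

Every output is a syntactically acceptable escaped form; nothing in the
identifier itself records which encoding produced it.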

Currently neither RFC 2396 nor the IRI draft gives any advice here. Is
this a scenario not supported by IRIs? If so, why do you think it is
not necessary or not possible to support it, and why does the IRI draft
not mention that %-escaping cannot be used for non-ASCII characters, but
rather says it SHOULD NOT be used? If it is possible to use %-escaping
for non-ASCII characters, the IRI draft must say how the non-ASCII
characters have to be encoded (actually, how any character is to be
encoded) and should say how one gets the characters back.
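
The "getting the characters back" half of the problem can be sketched
the same way. This is only an illustration in Python: decode_escapes is
a hypothetical helper, while unquote_to_bytes is the standard
urllib.parse function that turns %XX triplets into octets.

```python
from urllib.parse import unquote_to_bytes


def decode_escapes(escaped, encoding):
    """Turn %-escapes into octets, then octets into characters.

    Returns None when the octets are not valid in the given encoding.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


for escaped in ("bj%F6rn", "bj%C3%B6rn"):
    for enc in ("iso-8859-1", "utf-8"):
        print(escaped, enc, repr(decode_escapes(escaped, enc)))
```

Only a matching encoder/decoder pair returns the original name; a
mismatch either fails outright or silently yields different characters,
which is why the draft needs to fix one encoding.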

regards.

Received on Thursday, 1 May 2003 23:38:13 UTC