Special characters in URIs from Martin J. Duerst on 1999-05-26 (uri@w3.org from May 1999)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 26 May 1999 14:38:33 +0900
To: ietf-url@imc.org, uri@Bunyip.Com
Message-Id: <199905260654.PAA08644@sh.w3.mag.keio.ac.jp>

This is a question for background information from the URI/URL
community.

The fact that URIs (RFC 2396) don't define the character semantics
of the byte values they encode has been discussed on various occasions.

To alleviate the problem, various URL schemes have started to base
themselves on UTF-8, and some formats that carry URIs have defined
error behaviour based on UTF-8.

The second case works by specifying that if, in one of these formats
(e.g. HTML), a URI contains a non-ASCII character, that character
is converted to a byte sequence using UTF-8 and then %-encoded to
produce a legal URI.
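
As a rough illustration of that conversion, here is a minimal Python
sketch. The function name and the example URI are made up for this
note and are not taken from any specification; the sketch assumes the
URI reference is already available as a string of characters.

```python
def escape_non_ascii(uri_ref: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 bytes.

    Hypothetical helper for illustration only. ASCII characters,
    including existing %-escapes and '#', are passed through unchanged.
    """
    out = []
    for ch in uri_ref:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One character may yield two or more UTF-8 bytes,
            # each of which becomes its own %XX triple.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(escape_non_ascii("http://example.org/D\u00fcrst"))
```

Existing escapes are deliberately not re-encoded, so applying the
function twice gives the same result as applying it once.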

The question has now come up whether this behaviour can be extended
to characters in the ASCII range, i.e. any of:

 control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
 space       = <US-ASCII coded character 20 hexadecimal>
 delims      = "<" | ">" | "#" | "%" | <">
 unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"


'#' and '%' of course have to stay excluded. For the formats in
question (mostly XML), control characters are not allowed anyway.
"<", <">, ... would appear only as &lt;, &quot;, ... anyway.
Space would have to be used with caution because whitespace
collapsing rules might apply.

So the question is mainly about the rest:
"{", "}", "|", "\", "^", "[", "]", "`"

The reasons given in RFC 2396 for excluding them don't apply
in the relevant context, and before leaving that context,
these would be escaped anyway.
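
To make the proposal concrete, the escaping step applied when leaving
the format might look like the following Python sketch. This is only
one possible reading of the idea: the names UNWISE and escape_extended
and the escape_space switch are invented for this note, and the sketch
covers just the "unwise" set listed above (plus, optionally, space).

```python
# The "unwise" characters of RFC 2396, as listed above.
UNWISE = '{}|\\^[]`'


def escape_extended(uri_ref: str, escape_space: bool = False) -> str:
    """Percent-encode non-ASCII and 'unwise' ASCII characters.

    Hypothetical helper for illustration only. '#' and '%' are never
    touched, so fragment separators and existing escapes survive.
    """
    extra = set(UNWISE)
    if escape_space:
        extra.add(" ")
    out = []
    for ch in uri_ref:
        if ord(ch) >= 0x80 or ch in extra:
            # For ASCII this yields a single %XX triple.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_extended("http://example.org/a{b}|c#frag"))
```

Space is left as an opt-in because of the collapsing caveat mentioned
earlier; control characters are not handled, since the formats in
question do not allow them.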

This is as far as my argumentation goes. I would very much like
to know if there is any problem with this.


Regards,   Martin.

#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org

Received on Wednesday, 26 May 1999 02:58:24 UTC