- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 26 May 1999 14:38:33 +0900
- To: ietf-url@imc.org, uri@Bunyip.Com
This is a question for background information from the URI/URL community.

The fact that URIs (RFC 2396) don't define the character semantics of the byte values they encode has been discussed on various occasions. To alleviate the problem, various URL schemes have started to base themselves on UTF-8, and some formats that carry URIs have defined error behaviour based on UTF-8.

The second case basically works by saying that if, in one of these formats (e.g. HTML), a URI contains a non-ASCII character, this character is converted to a byte sequence using UTF-8 and then %-encoded to produce a legal URI.

The question has now come up whether this behaviour can be extended to the excluded characters in the ASCII range, i.e. any of:

   control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
   space   = <US-ASCII coded character 20 hexadecimal>
   delims  = "<" | ">" | "#" | "%" | <">
   unwise  = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

'#' and '%' of course have to stay excluded. For the formats in question (mostly XML), control characters are not allowed anyway. "<", <">, ... would only appear as &lt;, &amp;, ... anyway. Space would have to be used with caution because collapsing rules might apply.

So the question is mainly about the rest:

   "{", "}", "|", "\", "^", "[", "]", "`"

The reasons given in RFC 2396 for excluding them don't apply in the relevant context, and these characters would be escaped anyway before leaving that context.

This is as far as my argumentation goes. I would very much like to know if there is any problem with this.

Regards,   Martin.

#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org
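
For illustration, the conversion described above (UTF-8 first, then %-encoding), extended to the "unwise" characters, might look roughly like the following. This is only a sketch under stated assumptions: the names escape_for_uri and UNWISE are hypothetical and come from no specification, and space, control characters and the remaining delims are deliberately left untouched, as are '#' and '%'.

```python
# Hypothetical set of the "unwise" characters from RFC 2396 that the
# message proposes to escape in the same way as non-ASCII characters.
UNWISE = set('{}|\\^[]`')


def escape_for_uri(text, extra=UNWISE):
    """Percent-encode non-ASCII characters (via UTF-8) and the characters
    in `extra`; every other character is passed through unchanged.

    '#' and '%' are never touched, since they must stay excluded.
    """
    out = []
    for ch in text:
        if ord(ch) > 0x7F or ch in extra:
            # Convert the character to its UTF-8 bytes, then %-encode each byte.
            out.append(''.join('%{:02X}'.format(b) for b in ch.encode('utf-8')))
        else:
            out.append(ch)
    return ''.join(out)


if __name__ == '__main__':
    print(escape_for_uri('http://example.org/d\u00fcrst/{a|b}#frag'))
```

Whether the extra characters can safely be handled this way is exactly the open question; the sketch only shows the mechanics.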
Received on Wednesday, 26 May 1999 02:58:24 UTC