- From: <bugzilla@wiggum.w3.org>
- Date: Fri, 04 Nov 2005 16:39:20 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2457

Summary: Rules for URI encoding don't match RFC 3986/3987
Product: XPath / XQuery / XSLT
Version: Candidate Recommendation
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: Functions and Operators
AssignedTo: ashok.malhotra@oracle.com
ReportedBy: mike@saxonica.com
QAContact: public-qt-comments@w3.org

I hate bringing up this old chestnut again, but I have a nasty feeling we've got it wrong.

Currently encode-for-uri() does NOT escape a "#" sign. This seems contrary to the purpose of the function, and inconsistent with the treatment of other characters.

In RFC 3986 (2.2 reserved characters), we read:

    reserved    = gen-delims / sub-delims

    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

The spec goes on to say:

    URI producing applications should percent-encode data octets that
    correspond to characters in the reserved set unless these characters
    are specifically allowed by the URI scheme to represent data in that
    component.

[This basically means that sub-delims are delimiters in some URI schemes/contexts, and not in others.]

encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and

    "#" "-" "_" "." "!" "~" "*" "'" "(" ")"

This seems to come largely from RFC 2396, which has (in section 2.2)

    unreserved  = alphanum | mark

    mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

the only difference being the "#". The concept of "mark" seems to have disappeared in 3986.

RFC 2396 then says (2.4):

    Data must be escaped if it does not have a representation using an
    unreserved character

So both RFCs agree that "#", if it is not used with its special purpose as a delimiter, must be escaped. So why don't we escape it?

The history of this is so tortuous that I really don't want to research it. I think a lot of it has to do with the fact that RFC 2396 handled it badly.
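To make the discrepancy concrete, here is a minimal Python sketch of the rule as quoted above. The names CURRENT_SAFE and encode_current are hypothetical, and the sketch assumes that escaping means percent-encoding the UTF-8 octets of each character; it is an illustration of the quoted character lists, not the normative definition of encode-for-uri().

```python
import string

# Characters the current encode-for-uri() text leaves unescaped (as quoted above).
CURRENT_SAFE = set(string.ascii_letters + string.digits + "#-_.!~*'()")

# RFC 3986 section 2.2 reserved set (as quoted above).
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS


def encode_current(text: str) -> str:
    """Percent-encode the UTF-8 octets of every character outside CURRENT_SAFE.

    Hypothetical helper: assumes UTF-8 percent-encoding with upper-case hex.
    """
    out = []
    for ch in text:
        if ch in CURRENT_SAFE:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# Reserved characters that the current rule passes through unescaped.
unescaped_reserved = sorted(RESERVED & CURRENT_SAFE)
print(unescaped_reserved)

# "/" is escaped but "#" is not, which is the inconsistency described above.
print(encode_current("a#b/c"))
```

Under these assumptions the "/" in the sample string is escaped while the "#" passes through untouched, and "#" is the only gen-delim among the reserved characters left unescaped.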
3986 seems much clearer, and my recommendation would be that we not only add "#" to the list of characters that are escaped, but that we do exactly what 3986 says, which is to escape all characters in the "reserved" list (both gen-delims and sub-delims) above.

Procedurally, as RFC 3986 is dated January 2005, I think we can reasonably argue that it was an oversight not to bring our specs into line with it for the last call, and that it's reasonable to rectify the situation during CR. Other WGs have been fairly interested in this question so we'll obviously need to consult.

Note: I was alerted to the oddity of the current spec by the test results for fn-encode-for-uri1args-1 and related tests. The Saxon implementation currently does escape "#".

Having looked at this, we should then look at the iri-to-uri() list as well. It's hard to see any relationship between that list of characters and RFC 3986 either. In fact, the statement:

    All characters are escaped other than the lower case letters a-z,
    the upper case letters A-Z, the digits 0-9, the NUMBER SIGN "#" and
    HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP ".", EXCLAMATION MARK
    "!", TILDE "~", ASTERISK "*", APOSTROPHE "'", LEFT PARENTHESIS "(",
    and RIGHT PARENTHESIS ")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK
    "?", COLON ":", COMMERCIAL AT "@", AMPERSAND "&", EQUALS SIGN "=",
    PLUS SIGN "+", DOLLAR SIGN "$", COMMA ",", LEFT SQUARE BRACKET "[",
    RIGHT SQUARE BRACKET "]", and the PERCENT SIGN "%".

seems equivalent to saying "escape all non-ASCII characters plus (", <, >, `, \, ^, and |)" - which is a pretty bizarre list.

We would expect to find the spec for iri-to-uri() in RFC 3987, and sure enough, it's there. What it says is that every character in "ucschar" or "iprivate" must be %-encoded.
That's defined like this:

    ucschar  = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
             / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
             / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
             / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
             / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
             / %xD0000-DFFFD / %xE1000-EFFFD

    iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

which is pretty much the same as saying "non-ASCII characters" (and thus overlaps rather with escape-html-uri()).

Since we now have a function called iri-to-uri(), it would seem that it ought to do what the IRI spec says.

Previously raised internally at
http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0044.html

See also subsequent thread.
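For illustration, the RFC 3987 rule quoted above (percent-encode every character in "ucschar" or "iprivate", leave everything else alone) can be sketched in Python roughly as follows. The function name iri_to_uri_rfc3987 and the example.com IRI are hypothetical, and the sketch assumes the octets being escaped are the UTF-8 encoding of each character; it is not the text of any spec.

```python
# Code point ranges transcribed from the ucschar and iprivate productions above.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through D: %x10000-1FFFD ... %xD0000-DFFFD
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
ESCAPED_RANGES = UCSCHAR + IPRIVATE


def iri_to_uri_rfc3987(iri: str) -> str:
    """Percent-encode only characters in ucschar or iprivate.

    Hypothetical helper: assumes UTF-8 percent-encoding with upper-case hex.
    """
    out = []
    for ch in iri:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ESCAPED_RANGES):
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical sample IRI: only the non-ASCII "é" (U+00E9) is escaped.
print(iri_to_uri_rfc3987("http://example.com/caf\u00e9#frag"))
```

Note that under this reading ASCII characters such as space, "<" and ">" are left untouched, which is where it differs from the character list in the current iri-to-uri() text.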
Received on Friday, 4 November 2005 16:39:25 UTC