- From: Ashok Malhotra <ashok.malhotra@oracle.com>
- Date: Wed, 7 Dec 2005 11:00:52 -0800
- To: www-tag@w3.org
At Dan's request forwarding to the public list.

All the best, Ashok

> -----Original Message-----
> From: Ashok Malhotra [mailto:ashok.malhotra@oracle.com]
> Sent: Tuesday, December 06, 2005 9:28 AM
> To: tag@w3.org
> Cc: w3c-xsl-query@w3c.org
> Subject: Escaping the # mark
>
> Michael Kay has raised a question about escaping the # sign
> in the two F&O functions that escape URIs. Since the tag has
> been involved in earlier discussions on these functions we
> thought we would ask for your opinion before we proceeded.
> Many thanks for looking at this. Mike's comments are copied below.
>
> The current definitions are in sections 7.4.10 and 7.4.11 in
> http://www.w3.org/TR/xpath-functions/
>
> All the best, Ashok
>
> ===========================================================
>
> I hate bringing up this old chestnut again, but I have a
> nasty feeling we've got it wrong.
>
> Currently encode-for-uri() does NOT escape a "#" sign.
>
> This seems contrary to the purpose of the function, and
> inconsistent with the treatment of other characters.
>
> In RFC 3986 (2.2 reserved characters), we read:
>
>    reserved    = gen-delims / sub-delims
>
>    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>
>    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>                / "*" / "+" / "," / ";" / "="
>
> The spec goes on to say:
>
>    URI producing applications should percent-encode data octets that
>    correspond to characters in the reserved set unless these characters
>    are specifically allowed by the URI scheme to represent data in that
>    component.
>
> [This basically means that sub-delims are delimiters in some
> URI schemes/contexts, and not in others.]
>
> encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and
>
>    "#" "-" "_" "." "!" "~" "*" "'" "(" ")"
>
> This seems to come largely from RFC2396, which has (in section 2.2)
>
>    unreserved  = alphanum | mark
>
>    mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
>
> the only difference being the "#".
>
> The concept of "mark" seems to have disappeared in 3986.
>
> RFC 2396 then says (2.4):
>
>    Data must be escaped if it does not have a representation using an
>    unreserved character
>
> So both RFCs agree that "#", if it is not used with its
> special purpose as a delimiter, must be escaped.
>
> So why don't we escape it?
>
> The history of this is so tortuous that I really don't want
> to research it. I think a lot of it has to do with the fact
> that RFC 2396 handled it badly. 3986 seems much clearer, and
> my recommendation would be that we not only add "#" to the
> list of characters that are escaped, but that we do exactly
> what 3986 says, which is to escape all characters in the
> "reserved" list (both gen-delims and sub-delims) above.
>
> Procedurally, as RFC 3986 is dated January 2005, I think we
> can reasonably argue that it was an oversight not to bring
> our specs into line with it for the last call, and that it's
> reasonable to rectify the situation during CR.
> Other WGs have been fairly interested in this question so
> we'll obviously need to consult.
>
> Note: I was alerted to the oddity of the current spec by the
> test results for fn-encode-for-uri1args-1 and related tests.
> The Saxon implementation currently does escape "#".
>
> Having looked at this, we should then look at the
> iri-to-uri() list as well. It's hard to see any relationship
> between that list of characters and RFC3986 either. In fact,
> the statement:
>
>    All characters are escaped other than the lower case letters
>    a-z, the upper case letters A-Z, the digits 0-9, the NUMBER
>    SIGN "#" and HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP
>    ".", EXCLAMATION MARK "!", TILDE "~", ASTERISK "*",
>    APOSTROPHE "'", LEFT PARENTHESIS "(", and RIGHT PARENTHESIS
>    ")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK "?", COLON
>    ":", COMMERCIAL AT "@", AMPERSAND "&", EQUALS SIGN "=", PLUS
>    SIGN "+", DOLLAR SIGN "$", COMMA ",", LEFT SQUARE BRACKET
>    "[", RIGHT SQUARE BRACKET "]", and the PERCENT SIGN "%".
>
> seems equivalent to saying "escape all non-ASCII characters
> plus (", <, >, `, \, ^, and |) - which is a pretty bizarre list.
>
> We would expect to find the spec for iri-to-uri() in RFC3987,
> and sure enough, it's there. What it says is that every
> character in "ucschar" or "iprivate" must be %-encoded.
> That's defined like this:
>
>    ucschar  = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>             / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>             / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>             / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>             / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>             / %xD0000-DFFFD / %xE1000-EFFFD
>
>    iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
>
> which is pretty much the same as saying "non-ASCII
> characters" (and thus overlaps rather with escape-html-uri()).
>
> Since we now have a function called iri-to-uri(), it would
> seem that it ought to do what the IRI spec says.
>
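
For anyone following the thread, the rules discussed in the message can be compared with a small sketch. This is only an illustration in Python, not the F&O definition or any implementation's code. The function names (percent_encode, encode_current_draft, encode_rfc3986, iri_to_uri_rfc3987) and the sample inputs are made up for this note. It assumes UTF-8 percent-encoding, takes the "current" safe list for encode-for-uri() from the message above, treats the RFC 3986 proposal as "leave only unreserved characters alone", and takes the ucschar and iprivate ranges from the RFC 3987 grammar quoted above.

```python
# Illustrative sketch only; every name here is hypothetical.

ALNUM = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
         "abcdefghijklmnopqrstuvwxyz"
         "0123456789")

# RFC 3986 "unreserved": ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(ALNUM + "-._~")

# Characters the current encode-for-uri() text leaves alone,
# as listed in the message: alphanumerics plus # - _ . ! ~ * ' ( )
CURRENT_DRAFT_SAFE = frozenset(ALNUM + "#-_.!~*'()")

# RFC 3987 ucschar and iprivate ranges, as quoted in the message.
UCSCHAR = ([(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
           # planes 1 through D: %x10000-1FFFD ... %xD0000-DFFFD
           + [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
           + [(0xE1000, 0xEFFFD)])
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def _escape(ch):
    """Percent-encode one character as its UTF-8 octets."""
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def percent_encode(text, safe):
    """Escape every character of text that is not in the safe set."""
    return "".join(ch if ch in safe else _escape(ch) for ch in text)


def encode_current_draft(text):
    """Behaviour described for the current draft: '#' is left alone."""
    return percent_encode(text, CURRENT_DRAFT_SAFE)


def encode_rfc3986(text):
    """Proposed behaviour: only RFC 3986 unreserved characters survive."""
    return percent_encode(text, UNRESERVED)


def iri_to_uri_rfc3987(text):
    """Escape only characters in ucschar or iprivate (RFC 3987)."""
    def needs_escape(ch):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)
    return "".join(_escape(ch) if needs_escape(ch) else ch for ch in text)
```

With these definitions, encode_current_draft("a#b c") gives "a#b%20c" while encode_rfc3986("a#b c") gives "a%23b%20c", which is the difference the message is about. Under the second rule the old "mark" characters are escaped too, so "x!(y)" becomes "x%21%28y%29". iri_to_uri_rfc3987 leaves every ASCII character alone, including "#".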
Received on Wednesday, 7 December 2005 19:01:20 UTC