- From: Ashok Malhotra <ashok.malhotra@oracle.com>
- Date: Wed, 7 Dec 2005 11:00:52 -0800
- To: www-tag@w3.org
At Dan's request forwarding to the public list.
All the best, Ashok
> -----Original Message-----
> From: Ashok Malhotra [mailto:ashok.malhotra@oracle.com]
> Sent: Tuesday, December 06, 2005 9:28 AM
> To: tag@w3.org
> Cc: w3c-xsl-query@w3c.org
> Subject: Escaping the # mark
>
> Michael Kay has raised a question about escaping the # sign
> in the two F&O functions that escape URIs. Since the TAG has
> been involved in earlier discussions on these functions, we
> thought we would ask for your opinion before we proceed.
> Many thanks for looking at this. Mike's comments are copied below.
>
> The current definitions are in sections 7.4.10 and 7.4.11 in
> http://www.w3.org/TR/xpath-functions/
>
> All the best, Ashok
>
> ===========================================================
>
> I hate bringing up this old chestnut again, but I have a
> nasty feeling we've got it wrong.
>
> Currently encode-for-uri() does NOT escape a "#" sign.
>
> This seems contrary to the purpose of the function, and
> inconsistent with the treatment of other characters.
>
> In RFC 3986 (2.2 reserved characters), we read:
>
> reserved = gen-delims / sub-delims
>
> gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>
> sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
> / "*" / "+" / "," / ";" / "="
>
> The spec goes on to say:
>
> URI producing applications should percent-encode data octets that
> correspond to characters in the reserved set unless these characters
> are specifically allowed by the URI scheme to represent data in that
> component.
>
> [This basically means that sub-delims are delimiters in some
> URI schemes/contexts, and not in others.]
>
> encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and
>
> "#" "-" "_" "." "!" "~" "*" "'" "(" ")"
>
> This seems to come largely from RFC 2396, which has (in section 2.2)
>
> unreserved = alphanum | mark
>
> mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
>
> the only difference being the "#".
>
> The concept of "mark" seems to have disappeared in 3986.
>
> RFC 2396 then says (2.4):
>
> Data must be escaped if it does not have a representation using an
> unreserved character
>
> So both RFCs agree that "#", if it is not used with its
> special purpose as a delimiter, must be escaped.
>
> So why don't we escape it?
>
> The history of this is so tortuous that I really don't want
> to research it. I think a lot of it has to do with the fact
> that RFC 2396 handled it badly. RFC 3986 seems much clearer, and
> my recommendation would be that we not only add "#" to the list
> of characters that are escaped, but that we do exactly what
> RFC 3986 says, which is to escape all characters in the
> "reserved" list (both gen-delims and sub-delims) above.
>
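To make the difference concrete, here is a minimal Python sketch (not XPath, and not taken from any implementation) that compares the two rules. The helper name percent_encode and the two set names are hypothetical; the character sets are copied from the lists quoted above, and UTF-8 is assumed for the octets. Removing every "reserved" character from the current exception list leaves exactly the RFC 3986 "unreserved" set, which is what the second set encodes.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Characters the current draft leaves unescaped (list quoted in the message).
CURRENT_SAFE = set(ALNUM + "#-_.!~*'()")

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
RFC3986_UNRESERVED = set(ALNUM + "-._~")


def percent_encode(text: str, safe: set) -> str:
    """Percent-encode the UTF-8 octets of every character not in `safe`."""
    out = []
    for ch in text:
        if ch in safe:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    sample = "a#b"
    print(percent_encode(sample, CURRENT_SAFE))        # "#" passes through
    print(percent_encode(sample, RFC3986_UNRESERVED))  # "#" becomes %23
```

Under the proposed rule a "#" inside a path segment or query value can no longer be mistaken for the fragment delimiter.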
> Procedurally, as RFC 3986 is dated January 2005, I think we
> can reasonably argue that it was an oversight not to bring
> our specs into line with it for the last call, and that it's
> reasonable to rectify the situation during CR. Other WGs have
> been fairly interested in this question so we'll obviously
> need to consult.
>
> Note: I was alerted to the oddity of the current spec by the
> test results for fn-encode-for-uri1args-1 and related tests.
> The Saxon implementation currently does escape "#".
>
> Having looked at this, we should then look at the
> iri-to-uri() list as well. It's hard to see any relationship
> between that list of characters and RFC 3986 either. In fact,
> the statement:
>
> All characters are escaped other than the lower case letters
> a-z, the upper case letters A-Z, the digits 0-9, the NUMBER
> SIGN "#" and HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP
> ".", EXCLAMATION MARK "!", TILDE "~", ASTERISK "*",
> APOSTROPHE "'", LEFT PARENTHESIS "(", and RIGHT PARENTHESIS
> ")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK "?", COLON
> ":", COMMERCIAL AT "@", AMPERSAND "&", EQUALS SIGN "=", PLUS
> SIGN "+", DOLLAR SIGN "$", COMMA ",", LEFT SQUARE BRACKET
> "[", RIGHT SQUARE BRACKET "]", and the PERCENT SIGN "%".
>
> seems equivalent to saying "escape all non-ASCII characters
> plus (", <, >, `, \, ^, and |) - which is a pretty bizarre list.
>
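The equivalence can be checked mechanically. The following Python sketch assumes the quoted list of exempt characters is complete and enumerates the printable ASCII characters (U+0020 to U+007E) that fall outside it; the names IRI_TO_URI_SAFE and escaped_ascii are hypothetical. Run over that range, it reports the seven characters named above and also the space and the braces "{" and "}", none of which the quoted list exempts.

```python
import string

# Characters the quoted iri-to-uri() statement leaves unescaped.
IRI_TO_URI_SAFE = set(
    string.ascii_letters + string.digits + "#-_.!~*'();/?:@&=+$,[]%"
)

# Printable ASCII characters that the quoted rule would escape.
escaped_ascii = [
    chr(cp) for cp in range(0x20, 0x7F) if chr(cp) not in IRI_TO_URI_SAFE
]

if __name__ == "__main__":
    print(escaped_ascii)
```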
> We would expect to find the spec for iri-to-uri() in RFC 3987,
> and sure enough, it's there. What it says is that every
> character in "ucschar" or "iprivate" must be %-encoded.
> Those are defined like this:
>
> ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
> / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
> / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
> / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
> / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
> / %xD0000-DFFFD / %xE1000-EFFFD
>
> iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
>
> which is pretty much the same as saying "non-ASCII
> characters" (and thus rather overlaps with escape-html-uri()).
>
> Since we now have a function called iri-to-uri(), it would
> seem that it ought to do what the IRI spec says.
>
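A minimal Python sketch of that behaviour, assuming the ranges quoted above and UTF-8 as the encoding of the escaped octets. The function name iri_to_uri here is a Python stand-in for illustration, not the F&O function itself, and the range tables are transcribed from the message rather than from a library.

```python
# ucschar ranges as quoted from RFC 3987; planes 1..13 share one pattern.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)

# iprivate ranges as quoted from RFC 3987.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

ESCAPED_RANGES = UCSCHAR + IPRIVATE


def iri_to_uri(iri: str) -> str:
    """Percent-encode only characters in ucschar or iprivate."""
    out = []
    for ch in iri:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ESCAPED_RANGES):
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)  # all ASCII, including "#" and "%", is untouched
    return "".join(out)


if __name__ == "__main__":
    print(iri_to_uri("http://example.org/r\u00e9sum\u00e9#top"))
```

Note that this leaves every ASCII character alone, including the quote, angle brackets and backslash that the current list escapes.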
>
Received on Wednesday, 7 December 2005 19:01:20 UTC