FW: Escaping the # mark from Ashok Malhotra on 2005-12-07 (www-tag@w3.org from December 2005)

From: Ashok Malhotra <ashok.malhotra@oracle.com>
Date: Wed, 7 Dec 2005 11:00:52 -0800
To: www-tag@w3.org
Message-ID: <20051207110052452.00000003472@amalhotr-pc>
At Dan's request, I am forwarding this to the public list.

All the best, Ashok
 

> -----Original Message-----
> From: Ashok Malhotra [mailto:ashok.malhotra@oracle.com] 
> Sent: Tuesday, December 06, 2005 9:28 AM
> To: tag@w3.org
> Cc: w3c-xsl-query@w3c.org
> Subject: Escaping the # mark
> 
> Michael Kay has raised a question about escaping the # sign 
> in the two F&O functions that escape URIs.  Since the TAG has 
> been involved in earlier discussions on these functions, we 
> thought we would ask for your opinion before we proceed.  
> Many thanks for looking at this.  Mike's comments are copied below.
> 
> The current definitions are in sections 7.4.10 and 7.4.11 in 
> http://www.w3.org/TR/xpath-functions/
> 
> All the best, Ashok
> 
> ===========================================================
> 
> I hate bringing up this old chestnut again, but I have a 
> nasty feeling we've got it wrong.
> 
> Currently encode-for-uri() does NOT escape a "#" sign.
> 
> This seems contrary to the purpose of the function, and 
> inconsistent with the treatment of other characters.
> 
> In RFC 3986 (2.2 reserved characters), we read:
> 
>       reserved    = gen-delims / sub-delims
> 
>       gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
> 
>       sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>                   / "*" / "+" / "," / ";" / "="
> 
> The spec goes on to say:
> 
> URI producing applications should percent-encode data octets that
>    correspond to characters in the reserved set unless these characters
>    are specifically allowed by the URI scheme to represent data in that
>    component. [This basically means that sub-delims are delimiters in
>    some URI schemes/contexts, and not in others.]
> 
> encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and 
>    
>       "#" "-" "_" "." "!" "~" "*" "'" "(" ")"
> 
> This seems to come largely from RFC 2396, which has (in section 2.2):
> 
> unreserved  = alphanum | mark
> 
> mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
> 
> the only difference being the "#".
> 
> The concept of "mark" seems to have disappeared in 3986.
> 
> RFC 2396 then says (2.4):
> 
> Data must be escaped if it does not have a representation using an
>    unreserved character
> 
> So both RFCs agree that "#", if it is not used with its 
> special purpose as a delimiter, must be escaped.
> 
> So why don't we escape it?
> 
> The history of this is so tortuous that I really don't want 
> to research it. I think a lot of it has to do with the fact 
> that RFC 2396 handled it badly. RFC 3986 seems much clearer, 
> and my recommendation would be that we not only add "#" to 
> the list of characters that are escaped, but that we do 
> exactly what RFC 3986 says, which is to escape all characters 
> in the "reserved" list (both gen-delims and sub-delims) above.
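
A minimal sketch of the two behaviours under discussion, written in Python rather than XPath. The helper name percent_encode and the two set constants are hypothetical and are not taken from F&O or from Saxon. The draft set is copied from the list quoted above; the stricter set is the RFC 3986 "unreserved" set (ALPHA, DIGIT, "-", ".", "_", "~"), which is what is left unescaped once every reserved character is escaped.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Characters left unescaped by encode-for-uri() as quoted in this message.
DRAFT_SAFE = set(ALNUM + "#-_.!~*'()")

# RFC 3986 section 2.3 "unreserved" set: everything else gets escaped,
# including all gen-delims and sub-delims.
UNRESERVED_3986 = set(ALNUM + "-._~")


def percent_encode(text, safe):
    """Percent-encode the UTF-8 octets of every character not in `safe`."""
    out = []
    for ch in text:
        if ch in safe:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    sample = "Q&A #1 (draft)"
    print(percent_encode(sample, DRAFT_SAFE))       # "#" and "()" survive
    print(percent_encode(sample, UNRESERVED_3986))  # "#" becomes %23
```

Under these assumptions the only difference for "#" is whether it appears in the safe set, so the proposed change amounts to shrinking that set, not changing the encoding procedure.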
> 
> Procedurally, as RFC 3986 is dated January 2005, I think we 
> can reasonably argue that it was an oversight not to bring 
> our specs into line with it for the last call, and that it's 
> reasonable to rectify the situation during CR.
> Other WGs have been fairly interested in this question so 
> we'll obviously need to consult.
> 
> Note: I was alerted to the oddity of the current spec by the 
> test results for fn-encode-for-uri1args-1 and related tests. 
> The Saxon implementation currently does escape "#".
> 
> Having looked at this, we should then look at the 
> iri-to-uri() list as well. It's hard to see any relationship 
> between that list of characters and RFC 3986 either. In fact, 
> the statement:
> 
> All characters are escaped other than the lower case letters 
> a-z, the upper case letters A-Z, the digits 0-9, the NUMBER 
> SIGN "#" and HYPHEN-MINUS ("-"), LOW LINE ("_"), FULL STOP 
> ".", EXCLAMATION MARK "!", TILDE "~", ASTERISK "*", 
> APOSTROPHE "'", LEFT PARENTHESIS "(", and RIGHT PARENTHESIS 
> ")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK "?", COLON 
> ":", COMMERCIAL AT "@", AMPERSAND "&", EQUALS SIGN "=", PLUS 
> SIGN "+", DOLLAR SIGN "$", COMMA ",", LEFT SQUARE BRACKET 
> "[", RIGHT SQUARE BRACKET "]", and the PERCENT SIGN "%".
> 
> seems equivalent to saying "escape all non-ASCII characters, 
> plus the ASCII characters ", <, >, `, \, ^, and |", which is 
> a pretty bizarre list.
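
One way to check that reading is to take the exempt list quoted above literally and compute which printable ASCII characters fall outside it. This is only a sketch: the constant and variable names are hypothetical, and the range 0x20 to 0x7E is an assumption about what counts as "printable ASCII" here.

```python
import string

# Characters the quoted iri-to-uri() text says are NOT escaped.
EXEMPT = set(string.ascii_letters + string.digits + "#-_.!~*'();/?:@&=+$,[]%")

# Printable ASCII (space through tilde) that the rule would therefore escape.
escaped_ascii = [chr(cp) for cp in range(0x20, 0x7F) if chr(cp) not in EXEMPT]

if __name__ == "__main__":
    print(escaped_ascii)
```

Run as written, the result also contains the space character and the curly brackets "{" and "}", which suggests the informal list in the message is not exhaustive.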
> 
> We would expect to find the spec for iri-to-uri() in RFC 3987, 
> and sure enough, it's there. What it says is that every 
> character in "ucschar" or "iprivate" must be %-encoded. 
> Those are defined like this:
> 
>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD
> 
>    iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
> 
> which is pretty much the same as saying "non-ASCII 
> characters" (and thus rather overlaps with escape-html-uri()).
> 
> Since we now have a function called iri-to-uri(), it would 
> seem that it ought to do what the IRI spec says.
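
A rough sketch of what "do what the IRI spec says" could look like, using the ucschar and iprivate ranges quoted above. It is written in Python, the function and constant names are hypothetical, and it covers only the percent-encoding step; any other processing RFC 3987 describes for the IRI-to-URI mapping is left out.

```python
# ucschar ranges as quoted from RFC 3987.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
    + [(plane * 0x10000, plane * 0x10000 + 0xFFFD) for plane in range(1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# iprivate ranges as quoted from RFC 3987.
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

MUST_ENCODE = UCSCHAR + IPRIVATE


def iri_to_uri(iri):
    """Percent-encode the UTF-8 octets of ucschar/iprivate characters only."""
    out = []
    for ch in iri:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in MUST_ENCODE):
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
        else:
            out.append(ch)  # all ASCII, including "#" and "%", passes through
    return "".join(out)


if __name__ == "__main__":
    print(iri_to_uri("http://example.org/r\u00e9sum\u00e9#frag"))
```

In this sketch no ASCII character is ever touched, which is the contrast with the character list in the current iri-to-uri() definition.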
> 
>
Received on Wednesday, 7 December 2005 19:01:20 UTC