[Bug 2457] Rules for URI encoding don't match RFC 3986/3987 from bugzilla@wiggum.w3.org on 2005-11-04 (public-qt-comments@w3.org from November 2005)

From: <bugzilla@wiggum.w3.org>
Date: Fri, 04 Nov 2005 16:39:20 +0000
To: public-qt-comments@w3.org
Cc:
Message-Id: <E1EY4b2-00088d-Ik@wiggum.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2457

           Summary: Rules for URI encoding don't match RFC 3986/3987
           Product: XPath / XQuery / XSLT
           Version: Candidate Recommendation
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Functions and Operators
        AssignedTo: ashok.malhotra@oracle.com
        ReportedBy: mike@saxonica.com
         QAContact: public-qt-comments@w3.org


I hate bringing up this old chestnut again, but I have a nasty feeling we've
got it wrong.

Currently encode-for-uri() does NOT escape a "#" sign.

This seems contrary to the purpose of the function, and inconsistent with
the treatment of other characters.

In RFC 3986 (2.2 reserved characters), we read:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

The spec goes on to say:

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component. [This basically means that sub-delims are delimiters in some
   URI schemes/contexts, and not in others.]

encode-for-uri() escapes all characters except A-Z, a-z, 0-9, and 
   
      "#" "-" "_" "." "!" "~" "*" "'" "(" ")"

This seems to come largely from RFC 2396, which has (in section 2.3)

unreserved  = alphanum | mark

mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

the only difference being the "#".

The concept of "mark" seems to have disappeared in 3986.

RFC 2396 then says (2.4):

   Data must be escaped if it does not have a representation using an
   unreserved character

So both RFCs agree that "#", if it is not being used for its special purpose
as a delimiter, must be escaped.

So why don't we escape it?

The history of this is so tortuous that I really don't want to research it.
I think a lot of it has to do with the fact that RFC 2396 handled it badly.
3986 seems much clearer, and my recommendation would be that we not only add
"#" to the list of characters that are escaped, but that we do exactly what
3986 says, which is to escape all characters in the "reserved" list (both
gen-delims and sub-delims) above.
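
To make the difference concrete, here is a minimal Python sketch of the two rules. It is not the F&O definition: the names CURRENT_SAFE, PROPOSED_SAFE and percent_encode are made up for illustration, and I am assuming that escaping means percent-encoding the UTF-8 octets of each character. The proposed rule is modelled simply as "stop exempting anything RFC 3986 reserves".

```python
import string

# Characters the current encode-for-uri() text leaves unescaped (per this report).
CURRENT_SAFE = set(string.ascii_letters + string.digits + "#-_.!~*'()")

# RFC 3986 section 2.2 "reserved" set, as quoted earlier in this message.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS

# Hypothetical proposed rule: no reserved character stays exempt.
PROPOSED_SAFE = CURRENT_SAFE - RESERVED


def percent_encode(text, safe):
    """Percent-encode the UTF-8 octets of every character not in `safe`."""
    out = []
    for ch in text:
        if ch in safe:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

Under these assumptions, percent_encode("a#b", CURRENT_SAFE) leaves the "#" alone, while percent_encode("a#b", PROPOSED_SAFE) gives "a%23b". The only non-alphanumeric characters left in PROPOSED_SAFE are "-", "_", "." and "~".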

Procedurally, as RFC 3986 is dated January 2005, I think we can reasonably
argue that it was an oversight not to bring our specs into line with it for
the last call, and that it's reasonable to rectify the situation during CR.
Other WGs have been fairly interested in this question so we'll obviously
need to consult.

Note: I was alerted to the oddity of the current spec by the test results
for fn-encode-for-uri1args-1 and related tests. The Saxon implementation
currently does escape "#".

Having looked at this, we should then look at the iri-to-uri() list as well.
It's hard to see any relationship between that list of characters and
RFC 3986 either. In fact, the statement:

All characters are escaped other than the lower case letters a-z, the upper
case letters A-Z, the digits 0-9, the NUMBER SIGN "#" and HYPHEN-MINUS
("-"), LOW LINE ("_"), FULL STOP ".", EXCLAMATION MARK "!", TILDE "~",
ASTERISK "*", APOSTROPHE "'", LEFT PARENTHESIS "(", and RIGHT PARENTHESIS
")", SEMICOLON ";", SOLIDUS "/", QUESTION MARK "?", COLON ":", COMMERCIAL AT
"@", AMPERSAND "&", EQUALS SIGN "=", PLUS SIGN "+", DOLLAR SIGN "$", COMMA
",", LEFT SQUARE BRACKET "[", RIGHT SQUARE BRACKET "]", and the PERCENT SIGN
"%".

seems equivalent to saying: escape all non-ASCII characters, plus the ASCII
characters ", <, >, `, \, ^, and |. That is a pretty bizarre list.
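
The set of printable ASCII characters that the quoted text escapes can be checked mechanically. The following Python sketch assumes my transcription of the exempt list into EXEMPT is accurate; the variable names are illustrative only.

```python
import string

# Exempt characters, transcribed from the iri-to-uri() text quoted above.
EXEMPT = set(string.ascii_letters + string.digits + "#-_.!~*'();/?:@&=+$,[]%")

# Printable ASCII runs from SPACE (0x20) to TILDE (0x7E).
printable_ascii = [chr(cp) for cp in range(0x20, 0x7F)]

# Everything printable that is not exempt gets escaped.
escaped = [ch for ch in printable_ascii if ch not in EXEMPT]
print(escaped)
```

On that transcription, the result also includes the space character and the braces "{" and "}", in addition to the characters listed above.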

We would expect to find the spec for iri-to-uri() in RFC 3987, and sure
enough, it's there. What it says is that every character in "ucschar" or
"iprivate" must be %-encoded. That's defined like this:

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

which is pretty much the same as saying "non-ASCII characters" (and thus
overlaps rather with escape-html-uri()).

Since we now have a function called iri-to-uri(), it would seem that it
ought to do what the IRI spec says.
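
As a rough illustration of what following the IRI spec might look like, here is a Python sketch. The ranges are copied from the ABNF quoted above; the function names needs_escape and iri_to_uri_sketch are hypothetical, and I am assuming each affected character is converted to UTF-8 and every resulting octet is %-encoded. It ignores everything else a full implementation would have to deal with.

```python
# Code point ranges for ucschar and iprivate, copied from the ABNF quoted above.
UCSCHAR = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def needs_escape(ch):
    """True if the character falls in ucschar or iprivate."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)


def iri_to_uri_sketch(iri):
    """Percent-encode the UTF-8 octets of ucschar/iprivate characters only."""
    out = []
    for ch in iri:
        if needs_escape(ch):
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, "http://example.org/r\u00e9sum\u00e9#frag" becomes "http://example.org/r%C3%A9sum%C3%A9#frag". ASCII characters, including "#" and space, pass through untouched, which is where it differs from the list in the current spec.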

Previously raised internally at 
http://lists.w3.org/Archives/Member/w3c-xsl-query/2005Oct/0044.html

See also subsequent thread.
Received on Friday, 4 November 2005 16:39:25 UTC