RE: why the special case for % in fn:escape-uri?

> At 11:22 14/03/2003 -0600, Dan Connolly wrote:
> >I see:
> >
> >6.4.19.1 Examples
> >       * fn:escape-uri
> >
> >("gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles#ocean",
> >true()) returns
> >"gopher%3A%2F%2Fspinaltap.micro.umn.edu%2F00%2FWeather%2FCalifornia%2FLos%20Angeles%23ocean"
> >
> >
> >http://www.w3.org/TR/xquery-operators/#func-escape-uri
> >
> >but the % after Los needs to be escaped, no?

> >
> >Hmm... the spec seems to special-case this:
> >
> >   The "%" character itself is escaped only if it is not followed
> >   by two hexadecimal digits (that is, 0-9, a-f, and A-F)
> >
> >I don't understand why.

RFC 2396 states (in section 2.4.2):
"Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.  Implementers should be careful not to
   escape or unescape the same string more than once, since unescaping
   an already unescaped string might lead to misinterpreting a percent
   data character as another escaped character, or vice versa in the
   case of escaping an already escaped string."

The reason we have specified escape-uri() as we have is that if the input
string contains a "%" sign followed by two hex digits, this probably means
that escaping has already been carried out. We can't be sure, of course, but
it's a weakness of the escaping scheme that we have no way of telling. We
are following the advice "Implementers should be careful not to escape or
unescape the same string more than once".
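
To make the rule concrete, here is a minimal Python sketch of just the "%"
handling. It is not taken from the spec, and the name escape_bare_percent is
made up for illustration. It escapes "%" only when it is not followed by two
hex digits:

```python
import re

# A "%" that is NOT followed by two hexadecimal digits (0-9, a-f, A-F).
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_bare_percent(text: str) -> str:
    """Escape "%" as "%25" unless it already looks like the start of an escape.

    Hypothetical helper for illustration only; it handles nothing but "%".
    """
    return _BARE_PERCENT.sub("%25", text)


if __name__ == "__main__":
    print(escape_bare_percent("Los%20Angeles"))  # already escaped, left alone
    print(escape_bare_percent("100%"))           # bare "%", becomes "%25"
    print(escape_bare_percent("10%20off"))       # ambiguous input, left alone
```

Because an existing "%20" is left alone, applying the function twice gives
the same result as applying it once. The price is the ambiguity mentioned
above: a literal "%" that happens to be followed by two hex digits, as in
"10%20off", is assumed to be an escape already.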

We followed precedent here from some other spec, but I forget which it was.

> >
> >Also... what does 'when escaping an entire URI or URI reference' refer
> >to?
>
> I assume it means escaping a URI, e.g. to be embedded inside another, or
> something like that.

The escape-uri() function has two modes, controlled by a parameter. In one
mode characters such as "/" and "?" are escaped, in the other mode they are
not. The first mode is suitable for escaping parts of a URI, for example an
individual parameter in the query string. The second mode is suitable when a
string representing an entire URI is to be escaped in a single operation.
This isn't recommended practice but is sometimes unavoidable.
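
A rough Python sketch of the two modes follows. It is an illustration, not
the spec's definition: the name escape_uri_sketch is made up, the character
sets are assumed to be RFC 2396's unreserved set and its reserved set plus
"#", and escaping other characters via their UTF-8 bytes is also an
assumption.

```python
import string

# Assumed from RFC 2396: unreserved = alphanumerics plus these marks.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")
# Assumed: RFC 2396 reserved characters, plus "#".
RESERVED_PLUS_HASH = set(";/?:@&=+$,#")
HEX_DIGITS = set(string.hexdigits)


def escape_uri_sketch(text: str, escape_reserved: bool) -> str:
    """Hypothetical two-mode escaper.

    escape_reserved=True  : for a URI part (e.g. one query parameter).
    escape_reserved=False : for a string holding an entire URI.
    """
    keep = UNRESERVED if escape_reserved else UNRESERVED | RESERVED_PLUS_HASH
    out = []
    for i, ch in enumerate(text):
        following = text[i + 1:i + 3]
        looks_escaped = (
            ch == "%"
            and len(following) == 2
            and all(c in HEX_DIGITS for c in following)
        )
        if ch in keep or looks_escaped:
            out.append(ch)  # left as-is
        else:
            # Escape every UTF-8 byte of the character as %HH.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    uri = "gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles#ocean"
    print(escape_uri_sketch(uri, True))   # ":", "/" and "#" are escaped
    print(escape_uri_sketch(uri, False))  # unchanged in this sketch
```

With escape_reserved set to True, the sketch reproduces the gopher example
quoted at the top of this message. With False, the same input comes back
unchanged, because ":", "/" and "#" are kept and "%20" already looks escaped.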

I would like to make this sentence clearer if we can, but I don't understand
why you had difficulty understanding it!

I have to say that I find the various RFCs on URI syntax incredibly
difficult to follow, and in many places ambiguous or contradictory. Since
there seems to be a belief that URIs are the foundation on which the web is
built, I would be much more comfortable if the specs were rock-solid rather
than shifting sand. With the escape-uri() function (and the rules for URI
escaping in XSLT serialization) we've done the best we can, but it's pretty
flakey stuff.

Michael Kay

Received on Friday, 14 March 2003 16:51:44 UTC