RE: escape-uri() issues from Kay, Michael on 2002-10-21 (public-qt-comments@w3.org from October 2002)

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Mon, 21 Oct 2002 15:42:11 +0200
To: Mike Brown <mike@skew.org>, public-qt-comments@w3.org
Cc: Ashok Malhotra <ashokma@microsoft.com>
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060453DCAC@daemsg02.software-ag.de>
> 
> > I would suggest quite strongly that if you do an escape-uri function 
> > in EXSLT, you base it on the proposed XPath 2.0 spec. If you think 
> > there's something badly wrong with the XPath 2.0 spec, then please say 
> > so on the public comments list so we can fix it.
> 
> OK then, regarding 
> http://www.w3.org/TR/xquery-operators/#func-escape-uri ,
> 
> 1. 
> RFC 2396 is mentioned, but RFC 2732 is not. 
> RFC 2732 adds "[" and "]" to the set of reserved characters 
> and changes the "host" part of the URI 
> grammar, to allow IPv6 addresses.

This has been pointed out before, but it seems to have slipped through the
net. It's obviously an error, and we need to correct it. [Ashok, please note!]

> 
> 2. I make a case for supporting encodings other than UTF-8 at 
> http://lists.fourthought.com/pipermail/exslt/2002-October/000658.html
> 
> In the EXSLT proposal, I suggest using an encoding name (in 
> an optional final argument) to indicate how 
> non-ASCII characters are to be converted to octets. Note that 
> ASCII characters are escaped based on US-ASCII, so if you 
> needed to use UTF-16BE, for example, " " would still be "%20" 
> while U+1234 would be "%12%34".

I'm not sure what your message refers to when it says that UTF-8 is defined
only for "new" schemes. I agree that, as with everything to do with URIs, the
lack of clarity in the specs is quite extraordinary, and I also accept there
are legacy situations where other encodings are used. But it seems to me
that UTF-8 is clearly the intended direction, and I think this is what we
should support in the W3C recommendations. There's no harm in EXSLT
supporting other encodings if you think it useful.
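
To make the encoding question concrete, here is a minimal sketch in Python of
the behaviour described in the quoted proposal. The function name
escape_with_encoding and its signature are hypothetical, invented for
illustration; they are neither the EXSLT proposal's nor the XPath 2.0 draft's
definition. It assumes that only the RFC 2396 unreserved characters are left
alone, that other ASCII characters are escaped from their US-ASCII octet, and
that non-ASCII characters are converted to octets with the named encoding.

```python
# Hypothetical helper, not part of any spec: percent-escape a string,
# converting non-ASCII characters to octets with a caller-chosen encoding.

# RFC 2396 "unreserved" characters: alphanumerics plus the "mark" set.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_.!~*'()"
)


def escape_with_encoding(text: str, encoding: str = "utf-8") -> str:
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Unreserved characters pass through unchanged.
            out.append(ch)
        elif ord(ch) < 0x80:
            # Other ASCII characters are escaped from their US-ASCII octet,
            # regardless of the encoding chosen for non-ASCII characters.
            out.append("%{:02X}".format(ord(ch)))
        else:
            # Non-ASCII characters: encode to octets, then escape every octet.
            out.extend("%{:02X}".format(b) for b in ch.encode(encoding))
    return "".join(out)


print(escape_with_encoding(" \u1234", "utf-16-be"))  # %20%12%34
print(escape_with_encoding(" \u1234"))               # %20%E1%88%B4
```

With "utf-16-be" this reproduces the example in the quoted text: the space
stays "%20" and U+1234 becomes "%12%34". With the default UTF-8, the same
character becomes "%E1%88%B4".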

> 
> 3. IRIs are not supported. In an IRI, the ASCII characters 
> that would be escaped in a URI are still escaped, but 
> non-ASCII characters are not escaped (forgive me if that's an 
> overstatement). In the EXSLT proposal, I suggest using an 
> empty string for an encoding name (in an optional final 
> argument) to indicate that non-ASCII characters are not to be escaped.

I'm not sure what you're referring to as your definition of "IRI". The
Namespaces 1.1 Rec seems to have difficulty identifying a definitive
reference, so it includes its own definition:

<quote>
Some characters are disallowed in URI references, even if they are allowed
in XML; the disallowed characters, according to [RFC2396] and [RFC2732], are
the control characters #x0 to #x1F and #x7F, space #x20, the delimiters '<'
#x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|'
#x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F.

[Definition: An IRI reference is a string that can be converted to a URI
reference by escaping all disallowed characters as follows: ]

   1. Each disallowed character is converted to UTF-8 [Unicode 3.2] as one
or more bytes.
   2. The resulting bytes are escaped with the URI escaping mechanism (that
is, converted to %HH, where HH is the hexadecimal notation of the byte
value).
   3. The original character is replaced by the resulting character
sequence. 
</quote>
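
The three steps in the quoted definition can be sketched in Python as follows.
This is an illustration only: the function name iri_to_uri_reference is
hypothetical, and the set of disallowed characters is transcribed from the
quote above rather than from any implementation.

```python
# Hypothetical helper: convert an IRI reference to a URI reference by
# escaping only the "disallowed" characters listed in the quoted text.

# Delimiters '<', '>', '"' and the unwise characters '{', '}', '|', '\', '^', '`'.
DISALLOWED_ASCII = set('<>"{}|\\^`')


def iri_to_uri_reference(iri: str) -> str:
    out = []
    for ch in iri:
        cp = ord(ch)
        # Disallowed: controls #x0-#x1F, space #x20, #x7F, everything above
        # #x7F, and the delimiter/unwise characters above.
        if cp <= 0x20 or cp >= 0x7F or ch in DISALLOWED_ASCII:
            # Step 1: convert the character to UTF-8 bytes.
            # Step 2: escape each byte as %HH.
            # Step 3: the escapes replace the original character.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(iri_to_uri_reference("http://example.org/a b/\u00e9#x"))
# http://example.org/a%20b/%C3%A9#x
```

Note that "#", "/" and "%" are not in the disallowed set, so they pass through
untouched.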

This seems to suggest that an IRI reference doesn't require any characters
to be escaped. But I suppose that's misleading: you still need to escape the
characters that have a special meaning (such as "#"), if you are using them
without that special meaning. So this seems to suggest that there should be
a third option, effectively escape-reserved="only", which only escapes the
reserved characters and nothing else? On the other hand, escaping other
characters seems to do no harm. I'm having trouble seeing a real use case
for additional options in a function that's already complex enough as it is.

We decided at the last meeting to add a function string-to-codepoints()
which gives you a sequence of integers representing the Unicode values of
the characters in the string. This rather takes the pressure off escape-uri()
since it will become possible for users to write implementations of encoding
variants that we've chosen not to support.
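
As a rough illustration of that point, here is a sketch in Python rather than
XPath. The Python function string_to_codepoints only mimics the idea of the
proposed XPath function; utf8_octets and escape_non_ascii are hypothetical
user-written helpers. Once the code points are available as integers, the
octets for any encoding can be computed by hand, here UTF-8, and a variant that
escapes only non-ASCII characters follows directly.

```python
from typing import List


def string_to_codepoints(s: str) -> List[int]:
    # Stand-in for the proposed function: one integer per character.
    return [ord(ch) for ch in s]


def utf8_octets(cp: int) -> List[int]:
    # Hand-written UTF-8 encoding of a single code point.
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


def escape_non_ascii(s: str) -> str:
    # One possible user-defined variant: leave ASCII alone and
    # escape only characters above #x7F.
    out = []
    for cp in string_to_codepoints(s):
        if cp < 0x80:
            out.append(chr(cp))
        else:
            out.extend("%{:02X}".format(b) for b in utf8_octets(cp))
    return "".join(out)


print(string_to_codepoints("a\u1234"))  # [97, 4660]
print(escape_non_ascii("a\u1234"))      # a%E1%88%B4
```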

> 
> 4. I prefer the function name "uri-escape" over "escape-uri", 
> as the former implies a type of action (URI-style escaping of 
> something), and the latter implies an action on a subject 
> (some kind of escaping of a URI), which would be less accurate.

Our general style of function naming, insofar as we have one, is verb-noun.
Of course there's a certain imprecision here, in that we aren't applying
escaping to the URI, but to a string that is destined to form part of a URI,
but I've always argued that we can't pack the entire specification of the
function into its name.

Regards,

Michael Kay
Received on Monday, 21 October 2002 09:42:27 UTC