RE: escape-uri() issues

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Mon, 21 Oct 2002 15:42:11 +0200
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060453DCAC@daemsg02.software-ag.de>
To: Mike Brown <mike@skew.org>, public-qt-comments@w3.org
Cc: Ashok Malhotra <ashokma@microsoft.com>

> > I would suggest quite strongly that if you do an escape-uri function
> > in EXSLT, you base it on the proposed XPath 2.0 spec. If you think
> > there's something badly wrong with the XPath 2.0 spec, then please say
> > so on the public comments list so we can fix it.
> OK then, regarding
> http://www.w3.org/TR/xquery-operators/#func-escape-uri ,
> 1. RFC 2396 is mentioned, but RFC 2732 is not.
> RFC 2732 adds "[" and "]" to the set of reserved characters
> and changes the "host" part of the URI
> grammar, to allow IPv6 addresses.

This has been pointed out before, but it seems to have slipped through the
net. It's obviously an error, and we need to correct it. [Ashok, please note!]

> 2. I make a case for supporting encodings other than UTF-8 at
> http://lists.fourthought.com/pipermail/exslt/2002-October/000658.html
> In the EXSLT proposal, I suggest using an empty string for an 
> encoding name (in an optional final argument) to indicate how 
> non-ASCII characters are to be converted to octets. Note that 
> ASCII characters are escaped based on US-ASCII, so if you 
> needed to use UTF-16BE, for example, " " would still be "%20" 
> while U+1234 would be "%12%34".

I'm not sure what your message refers to when it says that UTF-8 is defined
only for "new" schemes. I agree that like everything to do with URIs, the
lack of clarity in the specs is quite extraordinary, and I also accept there
are legacy situations where other encodings are used. But it seems to me
that UTF-8 is clearly the intended direction, and I think this is what we
should support in the W3C recommendations. There's no harm in EXSLT
supporting other encodings if you think it useful.

> 3. IRIs are not supported. In an IRI, the ASCII characters 
> that would be escaped in a URI are still escaped, but 
> non-ASCII characters are not escaped (forgive me if that's an 
> overstatement). In the EXSLT proposal, I suggest using an 
> empty string for an encoding name (in an optional final 
> argument) to indicate that non-ASCII characters are not to be escaped.

I'm not sure what you're referring to as your definition of "IRI". The
Namespaces 1.1 Rec seems to have difficulty identifying a definitive
reference, so it includes its own definition:

Some characters are disallowed in URI references, even if they are allowed
in XML; the disallowed characters, according to [RFC2396] and [RFC2732], are
the control characters #x0 to #x1F and #x7F, space #x20, the delimiters '<'
#x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|'
#x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F.

[Definition: An IRI reference is a string that can be converted to a URI
reference by escaping all disallowed characters as follows: ]

   1. Each disallowed character is converted to UTF-8 [Unicode 3.2] as one
      or more bytes.
   2. The resulting bytes are escaped with the URI escaping mechanism (that
      is, converted to %HH, where HH is the hexadecimal notation of the byte
      value).
   3. The original character is replaced by the resulting character
      sequence.
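
As a rough illustration of the three steps in that definition, here is a
minimal sketch. Python is used purely for illustration, and the names
iri_to_uri and is_disallowed are hypothetical; neither comes from any of the
specs. The disallowed set is taken directly from the list quoted above.

```python
# Disallowed characters per the quoted definition: control characters
# #x0-#x1F and #x7F, space #x20, the delimiters < > ", the "unwise"
# characters { } | \ ^ `, and all characters above #x7F.
DISALLOWED_ASCII = (
    set(range(0x00, 0x21)) | {0x7F} | {ord(c) for c in '<>"{}|\\^`'}
)


def is_disallowed(ch: str) -> bool:
    """Return True if ch may not appear literally in a URI reference."""
    cp = ord(ch)
    return cp > 0x7F or cp in DISALLOWED_ASCII


def iri_to_uri(iri_ref: str) -> str:
    """Convert an IRI reference to a URI reference (hypothetical helper)."""
    out = []
    for ch in iri_ref:
        if is_disallowed(ch):
            # Step 1: convert the character to UTF-8 bytes.
            # Step 2: write each byte as %HH.
            # Step 3: substitute the result for the original character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Under this reading, iri_to_uri("a b") gives "a%20b" and U+1234 becomes
"%E1%88%B4", while "#" and "%" pass through untouched because they are not in
the disallowed set.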

This seems to suggest that an IRI reference doesn't require any characters
to be escaped. But I suppose that's misleading: you still need to escape the
characters that have a special meaning (such as "#"), if you are using them
without that special meaning. So this seems to suggest that there should be
a third option, effectively escape-reserved="only", which only escapes the
reserved characters and nothing else? On the other hand, escaping other
characters seems to do no harm. I'm having trouble seeing a real use case
for additional options in a function that's already complex enough as it is.

We decided at the last meeting to add a function string-to-codepoints()
which gives you a sequence of integers representing the Unicode values of
the characters in the string. This rather takes the pressure off escape-uri()
since it will become possible for users to write implementations of encoding
variants that we've chosen not to support.
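
To give a feel for what such a user-written variant might look like, here is a
minimal sketch of the UTF-16BE case from point 2, built only from code points
and integer arithmetic. It is written in Python for illustration only;
string_to_codepoints below is a stand-in for the proposed function, and
escape_utf16be and pct are hypothetical names. The choice to leave the RFC 2396
unreserved characters unescaped is an assumption, not something the draft
prescribes for this variant.

```python
import string

# RFC 2396 "unreserved" characters: alphanumerics plus the "mark" set.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")


def string_to_codepoints(s: str) -> list[int]:
    """Stand-in for the proposed string-to-codepoints() function."""
    return [ord(ch) for ch in s]


def pct(octet: int) -> str:
    """Format one octet as %HH."""
    return f"%{octet:02X}"


def escape_utf16be(s: str) -> str:
    """Escape ASCII per US-ASCII and everything else as UTF-16BE octets."""
    out = []
    for cp in string_to_codepoints(s):
        if cp < 0x80:
            # ASCII: keep unreserved characters, escape the rest as one octet.
            ch = chr(cp)
            out.append(ch if ch in UNRESERVED else pct(cp))
        elif cp < 0x10000:
            # Basic Multilingual Plane: one 16-bit unit, high octet first.
            out.append(pct(cp >> 8) + pct(cp & 0xFF))
        else:
            # Supplementary planes: split into a surrogate pair.
            v = cp - 0x10000
            for unit in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
                out.append(pct(unit >> 8) + pct(unit & 0xFF))
    return "".join(out)
```

With this sketch, " " still comes out as "%20" and U+1234 as "%12%34", matching
the behaviour described in point 2.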

> 4. I prefer the function name "uri-escape" over "escape-uri", 
> as the former implies a type of action (URI-style escaping of 
> something), and the latter implies an action on a subject 
> (some kind of escaping of a URI), which would be more accurate.

Our general style of function naming, insofar as we have one, is verb-noun.
Of course there's a certain imprecision here, in that we aren't applying
escaping to the URI, but to a string that is destined to form part of a URI,
but I've always argued that we can't pack the entire specification of the
function into its name.


Michael Kay
Received on Monday, 21 October 2002 09:42:27 UTC