- From: Kay, Michael <Michael.Kay@softwareag.com>
- Date: Mon, 21 Oct 2002 15:42:11 +0200
- To: Mike Brown <mike@skew.org>, public-qt-comments@w3.org
- Cc: Ashok Malhotra <ashokma@microsoft.com>
> > I would suggest quite strongly that if you do an escape-uri function
> > in EXSLT, you base it on the proposed XPath 2.0 spec. If you think
> > there's something badly wrong with the XPath 2.0 spec, then please say
> > so on the public comments list so we can fix it.
>
> OK then, regarding
> http://www.w3.org/TR/xquery-operators/#func-escape-uri ,
>
> 1. RFC 2396 is mentioned, but RFC 2732 is not. RFC 2732 adds "[" and "]"
> to the set of reserved characters and changes the "host" part of the URI
> grammar, to allow IPv6 addresses.

This has been pointed out before, but it seems to have slipped through
the net. It's obviously an error; we need to correct this. [Ashok,
please note!]

> 2. I make a case for supporting encodings other than UTF-8 at
> http://lists.fourthought.com/pipermail/exslt/2002-October/000658.html
>
> In the EXSLT proposal, I suggest using an empty string for an
> encoding name (in an optional final argument) to indicate how
> non-ASCII characters are to be converted to octets. Note that
> ASCII characters are escaped based on US-ASCII, so if you
> needed to use UTF-16BE, for example, " " would still be "%20"
> while U+1234 would be "%12%34".

I'm not sure what your message refers to when it says that UTF-8 is
defined only for "new" schemes. I agree that, like everything to do with
URIs, the lack of clarity in the specs is quite extraordinary, and I
also accept there are legacy situations where other encodings are used.
But it seems to me that UTF-8 is clearly the intended direction, and I
think this is what we should support in the W3C recommendations. There's
no harm in EXSLT supporting other encodings if you think it useful.

> 3. IRIs are not supported. In an IRI, the ASCII characters
> that would be escaped in a URI are still escaped, but
> non-ASCII characters are not escaped (forgive me if that's an
> overstatement). In the EXSLT proposal, I suggest using an
> empty string for an encoding name (in an optional final
> argument) to indicate that non-ASCII characters are not to be escaped.

I'm not sure what you're referring to as your definition of "IRI". The
Namespaces 1.1 Rec seems to have difficulty identifying a definitive
reference, so it includes its own definition:

<quote>
Some characters are disallowed in URI references, even if they are
allowed in XML; the disallowed characters, according to [RFC2396] and
[RFC2732], are the control characters #x0 to #x1F and #x7F, space #x20,
the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters
'{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well
as all characters above #x7F.

[Definition: An IRI reference is a string that can be converted to a URI
reference by escaping all disallowed characters as follows: ]

1. Each disallowed character is converted to UTF-8 [Unicode 3.2] as one
   or more bytes.
2. The resulting bytes are escaped with the URI escaping mechanism (that
   is, converted to %HH, where HH is the hexadecimal notation of the
   byte value).
3. The original character is replaced by the resulting character
   sequence.
</quote>

This seems to suggest that an IRI reference doesn't require any
characters to be escaped. But I suppose that's misleading: you still
need to escape the characters that have a special meaning (such as "#"),
if you are using them without that special meaning.

So this seems to suggest that there should be a third option,
effectively escape-reserved="only", which only escapes the reserved
characters and nothing else? On the other hand, escaping other
characters seems to do no harm. I'm having trouble seeing a real use
case for additional options in a function that's already complex enough
as it is.

We decided at the last meeting to add a function string-to-codepoints()
which gives you a sequence of integers representing the Unicode values
of the characters in the string. This rather takes the pressure off
escape-uri(), since it will become possible for users to write
implementations of encoding variants that we've chosen not to support.

> 4. I prefer the function name "uri-escape" over "escape-uri",
> as the former implies a type of action (URI-style escaping of
> something), and the latter implies an action on a subject
> (some kind of escaping of a URI), which would be more accurate.

Our general style of function naming, insofar as we have one, is
verb-noun. Of course there's a certain imprecision here, in that we
aren't applying escaping to the URI, but to a string that is destined to
form part of a URI, but I've always argued that we can't pack the entire
specification of the function into its name.

Regards,
Michael Kay
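
For illustration only, the three escaping steps in the Namespaces 1.1 definition quoted above can be sketched in Python roughly as follows. The function name escape_disallowed and its optional encoding argument are hypothetical; they are not taken from the XPath 2.0 draft or from the EXSLT proposal. The encoding argument is included only to show the UTF-16BE case from point 2, on the assumption that ASCII characters are always escaped as single US-ASCII octets.

```python
# Hypothetical sketch, not the escape-uri() of any spec.
# ASCII code points that the quoted definition lists as disallowed:
# controls #x0-#x1F, space #x20, #x7F, delimiters and "unwise" characters.
DISALLOWED_ASCII = (
    set(range(0x00, 0x21)) | {0x7F} | {ord(c) for c in '<>"{}|\\^`'}
)


def escape_disallowed(text, encoding="utf-8"):
    """Replace each disallowed character with its %HH escape sequence."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            # Step 1: convert the character to octets (UTF-8 by default).
            octets = ch.encode(encoding)
        elif cp in DISALLOWED_ASCII:
            # Assumption: ASCII is always escaped as one US-ASCII octet.
            octets = bytes([cp])
        else:
            # Allowed characters, including "#" and "%", pass through.
            out.append(ch)
            continue
        # Steps 2 and 3: write each octet as %HH in place of the character.
        out.append("".join("%{:02X}".format(b) for b in octets))
    return "".join(out)


if __name__ == "__main__":
    print(escape_disallowed("a b"))                  # a%20b
    print(escape_disallowed("\u1234", "utf-16-be"))  # %12%34
```

Note that this sketch leaves reserved characters such as "#" alone, which is exactly the gap discussed under point 3: a caller who wants "#" treated as data would still have to escape it separately.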
Received on Monday, 21 October 2002 09:42:27 UTC