- From: Michael Kay <mhk@mhk.me.uk>
- Date: Thu, 10 Jun 2004 09:12:12 +0100
- To: "'Jim Melton'" <jim.melton@acm.org>, "'Igor Hersht'" <igorh@ca.ibm.com>
- Cc: <ashokmalhotra@alum.mit.edu>, <public-qt-comments@w3.org>, <Stephen.Buxton@oracle.com>
- Message-Id: <20040610081252.3A0CBA09DA@frink.w3.org>
I don't think my attempt at a definition is necessarily the last word on the matter. Let's try to put this into context. We've made a lot of attempts in the language design to achieve things like making "eq" transitive, so that optimizers can do their stuff. If we want to do optimizations like using an index to support a "starts-with" query, then we need to make some assumptions about what a collation can and cannot do: a collation that returned true for the expression startswith("A", "Z", "collation-uri") would make any such optimizations fraught, to say the least. In my product I allow people to create user-defined collations as implementations of the class java.text.Collator. That class definition in itself imposes almost no constraints. I need to tell people "Saxon is making these assumptions about collations, it's your responsibility if you implement your own collation to make sure that it satisfies these constraints". For example, an antistrophic collation (one that sorts strings in alphabetical order of the reversed string - very useful for crosswords) would satisfy the ordering rules, but what should "starts-with" mean under such a collation? My attempt at a definition gives one possible answer to that question. Perhaps this is an implementation problem: I can make any rules I like for what kind of collation my implementation will handle. But I feel it's more than that. It's not enough for us to specify that starts-with(A, B, C) means whatever the collation C says it means - which is essentially what your definition is doing. If that were the case, we might as well not bother having a standard starts-with function. I think for example that users are entitled to assume that if starts-with(A, B, C) is true then contains(A, B, C) is also true, for any collation. I'm searching for a way of expressing that expectation, and others like it, in a more rigorous definition of "collation". I had assumed that Oracle's comment asking for a definition of "collation" was complaining that our current definition is too woolly. Michael Kay _____ From: Jim Melton [mailto:jim.melton@acm.org] Sent: 10 June 2004 00:28 To: Igor Hersht Cc: Michael Kay; ashokmalhotra@alum.mit.edu; public-qt-comments@w3.org; Stephen.Buxton@oracle.com Subject: RE: PLease define 'collation' Gentlepeople, My apologies for entering this discussion rather late, even though it is one of my favorite topics; I've been out of the country on business and am just now catching up on email. The most pithy definition of "collation" that I can devise would read something like this: collation: A specification of the manner in which character strings are compared and, by extension, ordered. That definition says absolutely nothing about the technology used to perform the comparisons/orderings, nor about how to specify a collation in any particular context. I think those omissions are a strength of the definition. Such a definition does not preclude collations based on the Unicode Collation Algorithm (UCA), proprietary mechanisms, or even so-called "phone book" collations. I, in agreement with the XML Query WG (and, presumably, the XSL WG), would oppose any definition that might preclude some collation that our implementations might use or that our customers might demand. Igor, you have raised some interesting points, but I don't think that we are in any disagreement about the goals. Nonetheless, I think that I do disagree with your statement that the UCA "cannot be implemented correctly when you compare or match just parts of the strings (represented by the collation units)". Perhaps it's a matter of interpretation, because I believe that such comparisons can be done, but (as I think you said) the collation units for the entire set of strings must be computed in order for them to be done. You argue that this cannot be implemented in a reasonable period of time, but others may well disagree (indeed, some may already have implemented such facilities), so this is not a useful argument against such a requirement. As Mike Kay said, "Whether a real collation actually operates in this way is irrelevant, it only needs to produce the same results as if it did so". I especially disagree with your statement that "Just anoter example from the Unicode specs which theoretically cannot be implemented (for contains or any other collation function) using just collation elements". I am convinced, after inspecting UTR #10 and spending a bit of time thinking about this, that matching such as that required by fn:contains() can readily be implemented using just collation elements. It might (or might not) be claimed that doing so would be time-consuming or perhaps inefficient, but "theoretically cannot be implemented" is very hard to swallow. That seems tantamount to a claim that the UCA cannot be implemented ("for...any other collation function"), even in theory, which is patently absurd. Surely fn:compare($arg1, $arg2, $collation) is "any...collation function". Do you really mean to imply that it is theoretically impossible to implement that function? Mike also asked about what assumptions can a system make about a collation, such as transitivity, symmetry, etc. This is an area fraught with peril, except when the nature of the collation is generally known. For example, I share your belief that collations based on the UCA are transitive, symmetrical, etc., but I can easily imagine other collations that do not share all of, perhaps any of, those properties. That makes it dangerous for a system to make universal assumptions. Of course, a partial solution (which I think I could support) to this problem is to say that the results of collations that do not support those properties is implementation-defined. (Mike, with respect, I am troubled by your lengthy, almost algorithmic, definition of a collation, in part because it seems to presume something very like the UCA, but also in part because I see no need for such detailed semantics to be included in the definition. I strongly prefer my much more terse and general definition. For similar reasons, I am uncomfortable with Ashok's proposed definition; again, it goes into too much detail and I don't think we need to provide a tutorial on the possible behaviors that collations can be built to provide.) Hope this helps, Jim ======================================================================== Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144 Oracle Corporation Oracle Email: jim dot melton at oracle dot com 1930 Viscounti Drive Standards email: jim dot melton at acm dot org Sandy, UT 84093-1063 Personal email: jim at melton dot name USA Fax : +1.801.942.3345 ======================================================================== = Facts are facts. However, any opinions expressed are the opinions = = only of myself and may or may not reflect the opinions of anybody = = else with whom I may or may not have discussed the issues at hand. = ========================================================================
Received on Thursday, 10 June 2004 04:12:52 UTC