- From: Sandro Hawke <sandro@w3.org>
- Date: Fri, 01 May 2009 12:28:05 -0400
- To: public-rif-wg@w3.org
In writing the spec for the List builtins, I've come across a difficult design choice concerning how literals are compared. (Some of this might be considered already decided, but it seems to me there's a fair amount of new information here, relevantstuff I didn't know at the F2F.) Background: - In RIF, you have two ways you can compare most literals: (1) You can use rif:equals, which is true iff the elements in the value space for the two literals are the same. Literals with types with disjoint value spaces will never compare as equal true: "01":xs^int = "1":xs^int false: "1":xs^int = "1"^xs^float false: "1":xs^double = "1"^xs^float false: "2002-04-02T12:00:00-01:00"^^xs:dateTime = "2002-04-02T17:00:00+04:00"^^xs:dateTime) false: "Strasse" = "Straße" (2) You can use a builtin comparator like numeric-equal, dateTime-equal, date-equal, time-equal, duration-equal, XMLLiteral-equal, compare, and text-compare. These builtins allow more values to be considered equal, for example: true: "1":xs^int = "1"^xs^float true: "1":xs^double = "1"^xs^float true: op:dateTime-equal( "2002-04-02T12:00:00-01:00"^^xs:dateTime, "2002-04-02T17:00:00+04:00"^^xs:dateTime) In addition, for the comparison of strings and text, an optional 'collation' parameter is available: false: 0 == compare("Strasse", "Straße") true: 0 == compare("Strasse", "Straße", "deutsch") - As I understand it, the 'collation' is an extensibility point. XPath-Functions uses the examples 'deutsch', "http://www.example.com/collations/French1", and "http://www.example.com/collations/French2", but only defines (and requires) one collation, the default: "http://www.w3.org/2005/xpath-functions/collation/codepoint" which does unicode normalization (sort of). See http://www.w3.org/TR/xpath-functions/#collations and http://www.unicode.org/unicode/reports/tr10/ - Some list builtins need to compare values. * member; this is not in XPath-Functions * index-of and distinct-values; these take a collation parameter in XPath-Function. The collation is defined to apply whenever the values are strings. * union, intersect, except; these conceptually compare elements, but in XPath-Functions they only operate on lists of nodes, so string comparison doesn't come into play, and no collation is passed; in added them as parameters in drafting the text for DTB. Also, some non-list functions in DTB uses collations: compare, substring-before, substring-after, contains, starts-with, ends-with, and text-compare. - Although formally a "collation" is a total preorder (a "compare" function, returning -1, 0, 1 for each pair of values, like a total order but with equalities), our primary use for it is merely do determine equal/not-equal, not to sort things into their order. The Question: Which type of literal comparison should the list builtins use? If, like XPath-Functions (and DTB, right now), we let users choose how strings are being compared, and there's an obvious choice to make in comparing numbers and dates, isn't it odd to not give them the same flexibility there? Specifically: Question 1: What ways, if any, do we provide for users to specify how literals are compared? Question 2: If/when users do not specify how literals are compared? Specifically do we default to rif:equal or the builtin comparators? Note also that if the rule author doesn't specify a collation, then the rule system *user* might. That is, the ruleset might not pay attention to language, but the user might say they want the French collation etc. Options for Question 1: 1. No rule-author control. Get rid of collations in the API. (This might be considered unacceptable by i18n folks; I don't know.) 2. As written in DTB now: users can offer a URI indicating how strings are to be compared; only one such URI is defined and required, so this feature can only be used within environments that implement some extension here. No way to control which kind of comparison is done for other literals. 3. Extend the notion of collations to cover all our literals. Instead of passing 'French', you could pass a collation that indicated which kind of numerical comparison should be used. (The problem is that if you do that, then what happens to the user's local use-French-collation setting?) 4. Keep 'collations' for strings, and add a similar but different comparison parameter. For example, member would be: member(item, list) member(item, list, comparator) member(item, list, comparator, collation) Here, I'm imagining comparator to be a term which for now could only be certain pre-defined rif:iris (like collations) -- one for each of the two types of RIF equality. But in dialects with higher-order functions, they could be functions defined in the ruleset. Technical notes: - we have to put the comparator argument before the collation argument, because we have no way of omitting any argument but the last one(s), and sometimes you need to supply a comparator but no collation (eg, when you want to let the user supply the collation); you never need to supply the collation and not the comparator, since we'll define a fixed value for the default comparator. (The difference is that while end-users might control collations, they're not going to be controlling comparators.) - when we let users define their own comparators, the obvious thing to ask them to define is an "equal" predicate with two parameters. This approach seriously impacts the complexity class of the builtins; for example, member has to be done as a linear search, instead of using a binary search or a hash table (in the common cases where the list is known to have some structure/ordering.) Instead, users should either define a "compare" function (returning -1, 0, or 1) so sorting can be done, or a "fold" function (returning a string which is the same for all "equal" values) so hashing can be done. This lack of higher-order function syntax is a pain, but I think we can live with it here by defining two comparator IRI's, maybe func:literal-compare and func:value-compare, which in our existing dialects can only be used as a comparator argument. If you could use them as functions, literal compare would be like the ordered version of rif:equal for literals -- I'd suggest the ordering between disjoint value spaces just be alphabetical order of the datatype IRI. Similarly, value-compare is just the big expression using every guard and then the builtin comparators for that type. It is pretty goofy to define these two functions and say you can only use them as a parameter -- you can't really call them -- but that's still the best option I see right now. Thoughts? - Sandro
Received on Friday, 1 May 2009 16:28:13 UTC