# collations, etc (a list-builtins issue)

From: Sandro Hawke <sandro@w3.org>
Date: Fri, 01 May 2009 12:28:05 -0400

Message-ID: <31304.1241195285@ubehebe>
```
In writing the spec for the List builtins, I've come across a difficult
design choice concerning how literals are compared.  (Some of this might
be considered already decided, but it seems to me there's a fair amount
of new information here, relevantstuff I didn't know at the F2F.)

Background:

- In RIF, you have two ways you can compare most literals:

(1) You can use rif:equals, which is true iff the elements in
the value space for the two literals are the same.
Literals with types with disjoint value spaces will never
compare as equal

true:    "01":xs^int = "1":xs^int
false:   "1":xs^int = "1"^xs^float
false:   "1":xs^double = "1"^xs^float
false:   "2002-04-02T12:00:00-01:00"^^xs:dateTime
= "2002-04-02T17:00:00+04:00"^^xs:dateTime)
false:   "Strasse" = "Straße"

(2) You can use a builtin comparator like numeric-equal,
dateTime-equal, date-equal, time-equal, duration-equal,
XMLLiteral-equal, compare, and text-compare.  These
builtins allow more values to be considered equal, for
example:

true:   "1":xs^int = "1"^xs^float
true:   "1":xs^double = "1"^xs^float
true:   op:dateTime-equal(
"2002-04-02T12:00:00-01:00"^^xs:dateTime,
"2002-04-02T17:00:00+04:00"^^xs:dateTime)

In addition, for the comparison of strings and text, an
optional 'collation' parameter is available:

false:   0 == compare("Strasse", "Straße")
true:    0 == compare("Strasse", "Straße", "deutsch")

- As I understand it, the 'collation' is an extensibility
point. XPath-Functions uses the examples 'deutsch',
"http://www.example.com/collations/French1", and
"http://www.example.com/collations/French2", but only defines
(and requires) one collation, the default:
"http://www.w3.org/2005/xpath-functions/collation/codepoint" which
does unicode normalization (sort of).   See
http://www.w3.org/TR/xpath-functions/#collations and
http://www.unicode.org/unicode/reports/tr10/

- Some list builtins need to compare values.

* member; this is not in XPath-Functions

* index-of and distinct-values; these take a collation
parameter in XPath-Function.  The collation is defined to
apply whenever the values are strings.

* union, intersect, except; these conceptually compare
elements, but in XPath-Functions they only operate on
lists of nodes, so string comparison doesn't come into
play, and no collation is passed; in added them as
parameters in drafting the text for DTB.

Also, some non-list functions in DTB uses collations: compare,
substring-before, substring-after, contains, starts-with,
ends-with, and text-compare.

- Although formally a "collation" is a total preorder (a "compare"
function, returning -1, 0, 1 for each pair of values, like a
total order but with equalities), our primary use for it is
merely do determine equal/not-equal, not to sort things into
their order.

The Question:

Which type of literal comparison should the list builtins use?  If,
like XPath-Functions (and DTB, right now), we let users choose how
strings are being compared, and there's an obvious choice to make
in comparing numbers and dates, isn't it odd to not give them the
same flexibility there?

Specifically:

Question 1: What ways, if any, do we provide for users
to specify how literals are compared?

Question 2: If/when users do not specify how literals are
compared?  Specifically do we default to rif:equal
or the builtin comparators?

Note also that if the rule author doesn't specify a collation, then
the rule system *user* might.  That is, the ruleset might not pay
attention to language, but the user might say they want the French
collation etc.

Options for Question 1:

1.  No rule-author control.  Get rid of collations in the API.
(This might be considered unacceptable by i18n folks; I don't
know.)

2.  As written in DTB now: users can offer a URI indicating how
strings are to be compared; only one such URI is defined and
required, so this feature can only be used within environments
that implement some extension here.  No way to control which
kind of comparison is done for other literals.

3.  Extend the notion of collations to cover all our literals.
Instead of passing 'French', you could pass a collation that
indicated which kind of numerical comparison should be used.
(The problem is that if you do that, then what happens to the
user's local use-French-collation setting?)

4.  Keep 'collations' for strings, and add a similar but different
comparison parameter.   For example, member would be:

member(item, list)
member(item, list, comparator)
member(item, list, comparator, collation)

Here, I'm imagining comparator to be a term which for now could
only be certain pre-defined rif:iris (like collations) -- one
for each of the two types of RIF equality.  But in dialects
with higher-order functions, they could be functions defined in
the ruleset.

Technical notes:

- we have to put the comparator argument before the
collation argument, because we have no way of omitting
any argument but the last one(s), and sometimes you need
to supply a comparator but no collation (eg, when you
want to let the user supply the collation); you never
need to supply the collation and not the comparator,
since we'll define a fixed value for the default
comparator.  (The difference is that while end-users
might control collations, they're not going to be
controlling comparators.)

- when we let users define their own comparators, the
obvious thing to ask them to define is an "equal"
predicate with two parameters.  This approach seriously
impacts the complexity class of the builtins; for
example, member has to be done as a linear search,
instead of using a binary search or a hash table (in the
common cases where the list is known to have some
define a "compare" function (returning -1, 0, or 1) so
sorting can be done, or a "fold" function (returning a
string which is the same for all "equal" values) so
hashing can be done.

This lack of higher-order function syntax is a pain, but I think
we can live with it here by defining two comparator IRI's, maybe
func:literal-compare and func:value-compare, which in our
existing dialects can only be used as a comparator argument.  If
you could use them as functions, literal compare would be like
the ordered version of rif:equal for literals -- I'd suggest the
ordering between disjoint value spaces just be alphabetical order
of the datatype IRI. Similarly, value-compare is just the big
expression using every guard and then the builtin comparators for
that type.

It is pretty goofy to define these two functions and say you can
only use them as a parameter -- you can't really call them -- but
that's still the best option I see right now.

Thoughts?

- Sandro
```
Received on Friday, 1 May 2009 16:28:13 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 21:47:55 UTC