collations, etc (a list-builtins issue)

In writing the spec for the List builtins, I've come across a difficult
design choice concerning how literals are compared.  (Some of this might
be considered already decided, but it seems to me there's a fair amount
of new information here, relevantstuff I didn't know at the F2F.)

Background:

  - In RIF, you have two ways you can compare most literals:

        (1) You can use rif:equals, which is true iff the elements in
            the value space for the two literals are the same.
            Literals with types with disjoint value spaces will never
            compare as equal

                  true:    "01":xs^int = "1":xs^int     
                  false:   "1":xs^int = "1"^xs^float
                  false:   "1":xs^double = "1"^xs^float
    false:   "2002-04-02T12:00:00-01:00"^^xs:dateTime 
                           = "2002-04-02T17:00:00+04:00"^^xs:dateTime)
                  false:   "Strasse" = "Straße"

        (2) You can use a builtin comparator like numeric-equal,
     dateTime-equal, date-equal, time-equal, duration-equal,
     XMLLiteral-equal, compare, and text-compare.  These
     builtins allow more values to be considered equal, for
     example:

                  true:   "1":xs^int = "1"^xs^float
                  true:   "1":xs^double = "1"^xs^float
    true:   op:dateTime-equal(
                             "2002-04-02T12:00:00-01:00"^^xs:dateTime, 
                             "2002-04-02T17:00:00+04:00"^^xs:dateTime)

            In addition, for the comparison of strings and text, an
            optional 'collation' parameter is available:

           false:   0 == compare("Strasse", "Straße")
           true:    0 == compare("Strasse", "Straße", "deutsch")

   - As I understand it, the 'collation' is an extensibility
     point. XPath-Functions uses the examples 'deutsch',
     "http://www.example.com/collations/French1", and
     "http://www.example.com/collations/French2", but only defines
     (and requires) one collation, the default:
     "http://www.w3.org/2005/xpath-functions/collation/codepoint" which
     does unicode normalization (sort of).   See
     http://www.w3.org/TR/xpath-functions/#collations and
     http://www.unicode.org/unicode/reports/tr10/

   - Some list builtins need to compare values. 

          * member; this is not in XPath-Functions

          * index-of and distinct-values; these take a collation
            parameter in XPath-Function.  The collation is defined to
            apply whenever the values are strings.

          * union, intersect, except; these conceptually compare
            elements, but in XPath-Functions they only operate on
            lists of nodes, so string comparison doesn't come into
            play, and no collation is passed; in added them as
            parameters in drafting the text for DTB.

      Also, some non-list functions in DTB uses collations: compare,
      substring-before, substring-after, contains, starts-with,
      ends-with, and text-compare.

   - Although formally a "collation" is a total preorder (a "compare"
     function, returning -1, 0, 1 for each pair of values, like a
     total order but with equalities), our primary use for it is
     merely do determine equal/not-equal, not to sort things into
     their order.

The Question:

   Which type of literal comparison should the list builtins use?  If,
   like XPath-Functions (and DTB, right now), we let users choose how
   strings are being compared, and there's an obvious choice to make
   in comparing numbers and dates, isn't it odd to not give them the
   same flexibility there?

   Specifically:
        
        Question 1: What ways, if any, do we provide for users
                    to specify how literals are compared?

        Question 2: If/when users do not specify how literals are
                    compared?  Specifically do we default to rif:equal
                    or the builtin comparators? 

   Note also that if the rule author doesn't specify a collation, then
   the rule system *user* might.  That is, the ruleset might not pay
   attention to language, but the user might say they want the French
   collation etc.
 
Options for Question 1:

   1.  No rule-author control.  Get rid of collations in the API.
       (This might be considered unacceptable by i18n folks; I don't
       know.)

   2.  As written in DTB now: users can offer a URI indicating how
       strings are to be compared; only one such URI is defined and
       required, so this feature can only be used within environments
       that implement some extension here.  No way to control which
       kind of comparison is done for other literals.

   3.  Extend the notion of collations to cover all our literals.
       Instead of passing 'French', you could pass a collation that
       indicated which kind of numerical comparison should be used.
       (The problem is that if you do that, then what happens to the
       user's local use-French-collation setting?)

   4.  Keep 'collations' for strings, and add a similar but different
       comparison parameter.   For example, member would be:

             member(item, list)
             member(item, list, comparator)
             member(item, list, comparator, collation)

       Here, I'm imagining comparator to be a term which for now could
       only be certain pre-defined rif:iris (like collations) -- one
       for each of the two types of RIF equality.  But in dialects
       with higher-order functions, they could be functions defined in
       the ruleset.

       Technical notes:

            - we have to put the comparator argument before the
              collation argument, because we have no way of omitting
              any argument but the last one(s), and sometimes you need
              to supply a comparator but no collation (eg, when you
              want to let the user supply the collation); you never
              need to supply the collation and not the comparator,
              since we'll define a fixed value for the default
              comparator.  (The difference is that while end-users
              might control collations, they're not going to be
              controlling comparators.)

            - when we let users define their own comparators, the
              obvious thing to ask them to define is an "equal"
              predicate with two parameters.  This approach seriously
              impacts the complexity class of the builtins; for
              example, member has to be done as a linear search,
              instead of using a binary search or a hash table (in the
              common cases where the list is known to have some
              structure/ordering.)  Instead, users should either
              define a "compare" function (returning -1, 0, or 1) so
              sorting can be done, or a "fold" function (returning a
              string which is the same for all "equal" values) so
              hashing can be done.


       This lack of higher-order function syntax is a pain, but I think
       we can live with it here by defining two comparator IRI's, maybe
       func:literal-compare and func:value-compare, which in our
       existing dialects can only be used as a comparator argument.  If
       you could use them as functions, literal compare would be like
       the ordered version of rif:equal for literals -- I'd suggest the
       ordering between disjoint value spaces just be alphabetical order
       of the datatype IRI. Similarly, value-compare is just the big
       expression using every guard and then the builtin comparators for
       that type.     

       It is pretty goofy to define these two functions and say you can
       only use them as a parameter -- you can't really call them -- but
       that's still the best option I see right now.

Thoughts?

     - Sandro

Received on Friday, 1 May 2009 16:28:13 UTC