Re: collations, etc (a list-builtins issue) from Jos de Bruijn on 2009-05-04 (public-rif-wg@w3.org from May 2009)

From: Jos de Bruijn <debruijn@inf.unibz.it>
Date: Mon, 04 May 2009 14:06:17 +0200
To: Sandro Hawke <sandro@w3.org>
CC: public-rif-wg@w3.org
Message-ID: <49FEDA39.9080803@inf.unibz.it>
I am in favor of option 1: the list operators simply work on the values
in the lists, rather than performing all kinds of conversions.  If users
want something more, they have to define the functions themselves.


Jos

Sandro Hawke wrote:
> In writing the spec for the List builtins, I've come across a difficult
> design choice concerning how literals are compared.  (Some of this might
> be considered already decided, but it seems to me there's a fair amount
> of new information here, relevantstuff I didn't know at the F2F.)
> 
> Background:
> 
>   - In RIF, you have two ways you can compare most literals:
> 
>         (1) You can use rif:equals, which is true iff the elements in
>             the value space for the two literals are the same.
>             Literals with types with disjoint value spaces will never
>             compare as equal
> 
>                   true:    "01":xs^int = "1":xs^int     
>                   false:   "1":xs^int = "1"^xs^float
>                   false:   "1":xs^double = "1"^xs^float
>     false:   "2002-04-02T12:00:00-01:00"^^xs:dateTime 
>                            = "2002-04-02T17:00:00+04:00"^^xs:dateTime)
>                   false:   "Strasse" = "Straße"
> 
>         (2) You can use a builtin comparator like numeric-equal,
>      dateTime-equal, date-equal, time-equal, duration-equal,
>      XMLLiteral-equal, compare, and text-compare.  These
>      builtins allow more values to be considered equal, for
>      example:
> 
>                   true:   "1":xs^int = "1"^xs^float
>                   true:   "1":xs^double = "1"^xs^float
>     true:   op:dateTime-equal(
>                              "2002-04-02T12:00:00-01:00"^^xs:dateTime, 
>                              "2002-04-02T17:00:00+04:00"^^xs:dateTime)
> 
>             In addition, for the comparison of strings and text, an
>             optional 'collation' parameter is available:
> 
>            false:   0 == compare("Strasse", "Straße")
>            true:    0 == compare("Strasse", "Straße", "deutsch")
> 
>    - As I understand it, the 'collation' is an extensibility
>      point. XPath-Functions uses the examples 'deutsch',
>      "http://www.example.com/collations/French1", and
>      "http://www.example.com/collations/French2", but only defines
>      (and requires) one collation, the default:
>      "http://www.w3.org/2005/xpath-functions/collation/codepoint" which
>      does unicode normalization (sort of).   See
>      http://www.w3.org/TR/xpath-functions/#collations and
>      http://www.unicode.org/unicode/reports/tr10/
> 
>    - Some list builtins need to compare values. 
> 
>           * member; this is not in XPath-Functions
> 
>           * index-of and distinct-values; these take a collation
>             parameter in XPath-Function.  The collation is defined to
>             apply whenever the values are strings.
> 
>           * union, intersect, except; these conceptually compare
>             elements, but in XPath-Functions they only operate on
>             lists of nodes, so string comparison doesn't come into
>             play, and no collation is passed; in added them as
>             parameters in drafting the text for DTB.
> 
>       Also, some non-list functions in DTB uses collations: compare,
>       substring-before, substring-after, contains, starts-with,
>       ends-with, and text-compare.
> 
>    - Although formally a "collation" is a total preorder (a "compare"
>      function, returning -1, 0, 1 for each pair of values, like a
>      total order but with equalities), our primary use for it is
>      merely do determine equal/not-equal, not to sort things into
>      their order.
> 
> The Question:
> 
>    Which type of literal comparison should the list builtins use?  If,
>    like XPath-Functions (and DTB, right now), we let users choose how
>    strings are being compared, and there's an obvious choice to make
>    in comparing numbers and dates, isn't it odd to not give them the
>    same flexibility there?
> 
>    Specifically:
>         
>         Question 1: What ways, if any, do we provide for users
>                     to specify how literals are compared?
> 
>         Question 2: If/when users do not specify how literals are
>                     compared?  Specifically do we default to rif:equal
>                     or the builtin comparators? 
> 
>    Note also that if the rule author doesn't specify a collation, then
>    the rule system *user* might.  That is, the ruleset might not pay
>    attention to language, but the user might say they want the French
>    collation etc.
>  
> Options for Question 1:
> 
>    1.  No rule-author control.  Get rid of collations in the API.
>        (This might be considered unacceptable by i18n folks; I don't
>        know.)
> 
>    2.  As written in DTB now: users can offer a URI indicating how
>        strings are to be compared; only one such URI is defined and
>        required, so this feature can only be used within environments
>        that implement some extension here.  No way to control which
>        kind of comparison is done for other literals.
> 
>    3.  Extend the notion of collations to cover all our literals.
>        Instead of passing 'French', you could pass a collation that
>        indicated which kind of numerical comparison should be used.
>        (The problem is that if you do that, then what happens to the
>        user's local use-French-collation setting?)
> 
>    4.  Keep 'collations' for strings, and add a similar but different
>        comparison parameter.   For example, member would be:
> 
>              member(item, list)
>              member(item, list, comparator)
>              member(item, list, comparator, collation)
> 
>        Here, I'm imagining comparator to be a term which for now could
>        only be certain pre-defined rif:iris (like collations) -- one
>        for each of the two types of RIF equality.  But in dialects
>        with higher-order functions, they could be functions defined in
>        the ruleset.
> 
>        Technical notes:
> 
>             - we have to put the comparator argument before the
>               collation argument, because we have no way of omitting
>               any argument but the last one(s), and sometimes you need
>               to supply a comparator but no collation (eg, when you
>               want to let the user supply the collation); you never
>               need to supply the collation and not the comparator,
>               since we'll define a fixed value for the default
>               comparator.  (The difference is that while end-users
>               might control collations, they're not going to be
>               controlling comparators.)
> 
>             - when we let users define their own comparators, the
>               obvious thing to ask them to define is an "equal"
>               predicate with two parameters.  This approach seriously
>               impacts the complexity class of the builtins; for
>               example, member has to be done as a linear search,
>               instead of using a binary search or a hash table (in the
>               common cases where the list is known to have some
>               structure/ordering.)  Instead, users should either
>               define a "compare" function (returning -1, 0, or 1) so
>               sorting can be done, or a "fold" function (returning a
>               string which is the same for all "equal" values) so
>               hashing can be done.
> 
> 
>        This lack of higher-order function syntax is a pain, but I think
>        we can live with it here by defining two comparator IRI's, maybe
>        func:literal-compare and func:value-compare, which in our
>        existing dialects can only be used as a comparator argument.  If
>        you could use them as functions, literal compare would be like
>        the ordered version of rif:equal for literals -- I'd suggest the
>        ordering between disjoint value spaces just be alphabetical order
>        of the datatype IRI. Similarly, value-compare is just the big
>        expression using every guard and then the builtin comparators for
>        that type.     
> 
>        It is pretty goofy to define these two functions and say you can
>        only use them as a parameter -- you can't really call them -- but
>        that's still the best option I see right now.
> 
> Thoughts?
> 
>      - Sandro
> 

-- 
+43 1 58801 18470        debruijn@inf.unibz.it

Jos de Bruijn,        http://www.debruijn.net/
----------------------------------------------
Many would be cowards if they had courage
enough.
  - Thomas Fuller
Received on Monday, 4 May 2009 12:07:19 UTC