Re: Language tags and value testing (was: Open world value tests) from Eric Prud'hommeaux on 2006-10-23 (public-rdf-dawg@w3.org from October to December 2006)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 23 Oct 2006 14:58:14 +0200
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <20061023125814.GA4155@w3.org>
On Sat, Oct 21, 2006 at 05:53:29PM +0100, Seaborne, Andy wrote:
> 
> 
> 
> Eric Prud'hommeaux wrote:
> >On Thu, Aug 24, 2006 at 09:45:33PM +0100, Seaborne, Andy wrote:
> >>"""
> >>ACTION AndyS:
> >>Write some tests for value testing (unknown types and extensibility) to 
> >>add to
> >>2006/JulSep0086
> >>"""
> >>
> >>http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0086
> >>http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0104
> >>
> . . .
> 
> 
> >>Tests open-eq-07 to open-eq-10 work by taking a list of all possible term
> >>forms, forming the cross product and seeing which are value-equal and
> >>value-not-equal.  This is done for data which contains the same compared
> >>values and different by comparable values.  These tests are exhaustive and
> >>include literals with lang tags - because lang tags are not case 
> >>sensitive (nor is there a canonical form according to RFC3066) it seemed 
> >>reasonable to be able equate "xyz"@EN with "xyz"@en. In effect, each lang 
> >>tag defines a separate value space - can't compare or test for equality 
> >>across them, but you can with the same language.
> >>
> >>"abc"@en = "abc"@EN
> >>"xyz"@en > "abc"@en
> >>"xyz"@en > "abc"@EN

This creates the interesting conundrum that something is
simultaneously equivalent and greaterThan:
     "abc"@en = "abc"@EN ⇒ TRUE
     "abc"@en > "abc"@EN ⇒ TRUE
(and "abc"@EN < "abc"@en ⇒ TRUE)

I would favor < over =, but I guess that depends on your use cases.
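
A minimal Python sketch of the conundrum, assuming equality folds the case of language tags while ordering compares them by code point. The `PlainLiteral` type and both function names are hypothetical, introduced only for illustration:

```python
from typing import NamedTuple


class PlainLiteral(NamedTuple):
    lexical: str
    lang: str  # "" stands for "no language tag"


def value_equal(a: PlainLiteral, b: PlainLiteral) -> bool:
    # Assumption: tags are compared case-insensitively for equality.
    return a.lexical == b.lexical and a.lang.lower() == b.lang.lower()


def greater_than(a: PlainLiteral, b: PlainLiteral) -> bool:
    # Assumption: ordering looks at the lexical form first, then at the
    # tag, both by code point (so case matters).
    if a.lexical != b.lexical:
        return a.lexical > b.lexical
    return a.lang > b.lang


lower = PlainLiteral("abc", "en")
upper = PlainLiteral("abc", "EN")

print(value_equal(lower, upper))   # True
print(greater_than(lower, upper))  # True, since "en" > "EN" by code point
```

Both tests hold for the same pair, which is the inconsistency described above; picking one of the two readings removes it.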

> >There is no current language for case-insensitive language tags in
> >SPARQL presently. My implementation failed these both because of
> >case-sensitive language matching, and because they employed extra
> >operators not currently in SPARQL.
> 
> Is it just a matter of expanding the table to include RDF plain literals 
> with language tags? ORDER BY defers to "<" if it can.

I think "abc"@en > "abc"@EN is fully expressible with our current
functions:

  (STR(?a) != STR(?b) && STR(?a) > STR(?b))
    || 
  (STR(?a) = STR(?b) && LANG(?a) > LANG(?b))  # isn't "a" > "A" weird?

If the above analysis is correct, one could add a shortcut for it
in the operator mapping table (note: simple literal > simple literal
is currently in the table):

[[
  ┃A > B│simple literal│simple literal│op:numeric-equal(fn:compare(A, B), 1)                 │xsd:boolean┃
+ ┃A > B│plain literal │plain literal │logical-or(
                                         logical-and(fn:not(op:numeric-equal(fn:compare(str(A), str(B)), 0)),
                                            op:numeric-equal(fn:compare(str(A), str(B)), 1)),
                                         logical-and(op:numeric-equal(fn:compare(str(A), str(B)), 0),
                                            op:numeric-equal(fn:compare(lang(A), lang(B)), 1)))│xsd:boolean┃
]]
or one could add functions for each of < > <= >= à la:
[[
+ ┃A > B│plain literal │plain literal │RDFplainLiteral-greaterThan(A, B)│xsd:boolean┃

RDFplainLiteral-greaterThan
  xsd:boolean   RDFplainLiteral-greaterThan (plain literal lit1, plain literal lit2)

If the lexical values of lit1 and lit2 are identical,
RDFplainLiteral-greaterThan returns TRUE or FALSE depending on whether
LANG(lit1) > LANG(lit2). If the lexical values are not identical,
RDFplainLiteral-greaterThan returns TRUE or FALSE depending on whether
STR(lit1) > STR(lit2).
]]
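
As a rough illustration of the "one function per operator" option, the four comparisons could all be derived from a single three-way compare. This is a sketch only: literals are modelled as (lexical form, language tag) pairs, comparison is by code point, and every function name below is hypothetical rather than part of any specification:

```python
def compare_plain(lex1: str, lang1: str, lex2: str, lang2: str) -> int:
    """Return -1, 0 or 1: lexical forms decide first, language tags second."""
    key1, key2 = (lex1, lang1), (lex2, lang2)
    # Tuple comparison is lexicographic, so the tag is only consulted
    # when the lexical forms are identical.
    return (key1 > key2) - (key1 < key2)


def plain_greater_than(lex1, lang1, lex2, lang2) -> bool:
    return compare_plain(lex1, lang1, lex2, lang2) == 1


def plain_less_than(lex1, lang1, lex2, lang2) -> bool:
    return compare_plain(lex1, lang1, lex2, lang2) == -1


def plain_greater_or_equal(lex1, lang1, lex2, lang2) -> bool:
    return compare_plain(lex1, lang1, lex2, lang2) >= 0


def plain_less_or_equal(lex1, lang1, lex2, lang2) -> bool:
    return compare_plain(lex1, lang1, lex2, lang2) <= 0


print(plain_greater_than("abc", "en", "abc", "EN"))   # True: same text, "en" > "EN"
print(plain_greater_than("abd", "", "abc", "en-fr"))  # True: "abd" > "abc" decides
```

Deriving all four from one compare keeps them mutually consistent, so `<` and `>=` can never both hold for the same pair.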

These specifications assume that you want this sort order:
     "abb"
     "abc"
     "abc"@EN
     "abc"@En
     "abc"@eN
     "abc"@en
     "abc"@en-fr # zis iss how we speak here
     "abd"
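
A sort order of this kind can be reproduced with a small Python sketch that sorts (lexical form, language tag) pairs. It assumes plain code-point comparison of both parts; a different collation behind fn:compare could place the mixed-case tags differently. The sample data and the `render` helper are illustrative only:

```python
# Literals as (lexical form, language tag); "" means no tag.
literals = [
    ("abd", ""),
    ("abc", "en"),
    ("abc", "En"),
    ("abc", ""),
    ("abc", "en-fr"),
    ("abb", ""),
    ("abc", "eN"),
    ("abc", "EN"),
]

# Tuples sort lexicographically: lexical form first, then tag,
# each compared by code point.
ordered = sorted(literals)


def render(lexical: str, lang: str) -> str:
    return f'"{lexical}"@{lang}' if lang else f'"{lexical}"'


for lexical, lang in ordered:
    print(render(lexical, lang))
```

Under code-point comparison an uppercase letter precedes its lowercase form, so the untagged literal comes first, then "EN", "En", "eN", "en" and finally "en-fr".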

> I tried writing things out from the current operations alone:
> 
> Some things can be written:
>   ( lang(?x) = lang(?y) ) && str(?x) > str(?y)
> but that only works cleanly for the same language tag - different would 
> cause
> false, not error which seems more natural and it's verbose.
> 
> langMatches isn't symmetric but I think:
> 
>   langMatches(lang(?x),lang(?y)) &&
>   langMatches(lang(?y),lang(?x)) &&
>   str(?x) > str(?y)
> 
> attempts to handle the case-sensitivity issue because a language tag is a 
> special case of a language range.  It becomes more verbose though - ugh.    
> Or a regex.

    REGEX(LANG(?x), LANG(?y), 'i')
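
To compare the two suggestions, here is a Python sketch of a mutual langMatches test next to an unanchored case-insensitive regex. `lang_matches` is a simplified stand-in for language-range prefix matching in the style of RFC 3066, not a full implementation, and all three function names are hypothetical:

```python
import re


def lang_matches(tag: str, lang_range: str) -> bool:
    """Simplified language-range matching: exact or prefix-up-to-hyphen."""
    if lang_range == "*":
        return tag != ""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def same_language(tag_x: str, tag_y: str) -> bool:
    # Matching in both directions rules out the prefix case, leaving
    # case-insensitive equality of the two tags.
    return lang_matches(tag_x, tag_y) and lang_matches(tag_y, tag_x)


def regex_match(tag_x: str, tag_y: str) -> bool:
    # Unanchored and case-insensitive; re.escape is a precaution added
    # here and changes nothing for well-formed tags.
    return re.search(re.escape(tag_y), tag_x, re.IGNORECASE) is not None


print(same_language("en", "EN"))     # True
print(same_language("en-fr", "en"))  # False: "en" does not match range "en-fr"
print(regex_match("en-fr", "en"))    # True: substring match
print(regex_match("en-fr", "FR"))    # True: substring match
```

In this sketch the unanchored regex accepts any tag that contains the other as a substring, so it is looser than the mutual langMatches test unless the pattern is anchored at both ends.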

> "11.3.1 Operator Extensibility" could explicitly cover this - I can accept 
> that language tag handling is an extension if there is text that states 
> that. So far we have really been thinking of extension by datatypes.

[[
Extended SPARQL implementations may support additional associations
between operators and operator functions; this amounts to adding rows
to the table above. No additional operator support may yield a result
that replaces any result other than a type error in an unextended
implementation.
]]
I think I've convinced myself that it's extendable this way. You
are adding rows that replace the type errors you would get in an
unextended implementation.
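
One way to picture this rule is an operator table keyed by operator and operand kinds, where a missing row means a type error and an extension may only add rows. The Python sketch below is illustrative only: `SparqlTypeError`, `BASE_TABLE`, `extend` and `evaluate` are hypothetical names, and refusing any key clash is a conservative check that is stricter than the quoted text requires:

```python
class SparqlTypeError(Exception):
    """Stands in for a SPARQL type error."""


# Unextended table: only simple literal > simple literal is defined.
BASE_TABLE = {
    (">", "simple", "simple"): lambda a, b: a > b,
}


def extend(table: dict, extra_rows: dict) -> dict:
    # An extension may only fill gaps (rows that used to be type errors).
    clashes = table.keys() & extra_rows.keys()
    if clashes:
        raise ValueError(f"rows would replace existing results: {sorted(clashes)}")
    return {**table, **extra_rows}


def evaluate(table: dict, op: str, kind_a: str, a, kind_b: str, b):
    try:
        fn = table[(op, kind_a, kind_b)]
    except KeyError:
        raise SparqlTypeError(f"no row for {op} on {kind_a}/{kind_b}") from None
    return fn(a, b)


# Plain literals are modelled as (lexical form, language tag) tuples.
extended = extend(BASE_TABLE, {(">", "plain", "plain"): lambda a, b: a > b})

print(evaluate(extended, ">", "plain", ("abc", "en"), "plain", ("abc", "EN")))  # True
```

With `BASE_TABLE` the same call raises `SparqlTypeError`, so the extension replaces a type error and nothing else.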

These rules just make sure that you don't lose dawg:monotonicity over
DAWG-specified parts of the language. Ideally, people won't step on
each other's truth values too much, but I don't think we can say much
about that.
-- 
-eric

home-office: +1.617.395.1213 (usually 900-2300 CET)
     +33.1.45.35.62.14
cell:       +33.6.73.84.87.26

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
Received on Monday, 23 October 2006 12:57:21 UTC