Re: Language tags and value testing

Eric Prud'hommeaux wrote:
> On Sat, Oct 21, 2006 at 05:53:29PM +0100, Seaborne, Andy wrote:
>>
>>
>> Eric Prud'hommeaux wrote:
>>> On Thu, Aug 24, 2006 at 09:45:33PM +0100, Seaborne, Andy wrote:
>>>> """
>>>> ACTION AndyS:
>>>> Write some tests for value testing (unknown types and extensibility) to 
>>>> add to
>>>> 2006/JulSep0086
>>>> """
>>>>
>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0086
>>>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0104
>>>>
>> . . .
>>
>>
>>>> Tests open-eq-07 to open-eq-10 work by taking a list of all possible term
>>>> forms, forming the cross product and seeing which are value-equal and
>>>> value-not-equal.  This is done for data which contains the same compared
>>>> values and different but comparable values.  These tests are exhaustive and
>>>> include literals with lang tags - because lang tags are not case 
>>>> sensitive (nor is there a canonical form according to RFC3066) it seemed 
>>>> reasonable to be able to equate "xyz"@EN with "xyz"@en. In effect, each lang 
>>>> tag defines a separate value space - can't compare or test for equality 
>>>> across them, but you can with the same language.
>>>>
>>>> "abc"@en = "abc"@EN
>>>> "xyz"@en > "abc"@en
>>>> "xyz"@en > "abc"@EN
> 
> This creates the interesting conundrum that something is
> simultaneously equivalent and greaterThan:
>      "abc"@en = "abc"@EN ⇒ TRUE
>      "abc"@en > "abc"@EN ⇒ TRUE
> (and "abc"@EN < "abc"@en ⇒ TRUE)

Don't understand.  How can "abc"@en > "abc"@EN be true?

> 
> I would favor < over =, but I guess that depends on your use cases.
> 
>>> There is currently no language for case-insensitive language tags in
>>> SPARQL. My implementation failed these both because of
>>> case-sensitive language matching, and because they employed extra
>>> operators not currently in SPARQL.
>> Is it just a matter of expanding the table to include RDF plain literals 
>> with language tags? ORDER BY defers to "<" if it can.
> 
> I think "abc"@en > "abc"@EN is fully expressible with our current
> functions:
> 
>   (STR(?a) != STR(?b) && STR(?a) > STR(?b))
>     || 
>   (STR(?a) = STR(?b) && LANG(?a) > LANG(?b))  # isn't "a" > "A" weird?


I'm not proposing any ordering across language tags.

I am proposing that "xyz"@en < "abc"@fr is an error: you can't compare across 
language tags.
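
To make the proposal concrete, here is a minimal Python sketch (not SPARQL, and not taken from any implementation). The names PlainLiteral, lit_equal and lit_greater are hypothetical, and treating a simple literal as having an empty language tag is an assumption of the sketch. Literals with the same tag, ignoring case, compare by lexical form; anything else raises an error.

```python
from typing import NamedTuple, Optional


class PlainLiteral(NamedTuple):
    """Hypothetical stand-in for an RDF plain literal."""
    lexical: str
    lang: Optional[str] = None  # None models a simple literal (assumption)


def _same_lang(a: PlainLiteral, b: PlainLiteral) -> bool:
    # Language tags are compared case-insensitively.
    return (a.lang or "").lower() == (b.lang or "").lower()


def lit_equal(a: PlainLiteral, b: PlainLiteral) -> bool:
    # Each language tag is its own value space: no equality test across tags.
    if not _same_lang(a, b):
        raise TypeError("cannot compare across language tags")
    return a.lexical == b.lexical


def lit_greater(a: PlainLiteral, b: PlainLiteral) -> bool:
    # Ordering is only defined within one language tag.
    if not _same_lang(a, b):
        raise TypeError("cannot compare across language tags")
    return a.lexical > b.lexical
```

With these definitions, "abc"@en = "abc"@EN is true, "xyz"@en > "abc"@EN is true, "abc"@en > "abc"@EN is false, and "xyz"@en > "abc"@fr raises an error instead of returning false.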


> 
> If the above analysis is correct, one could add a shortcut syntax for
> it in the operator mapping table (note: simple literal > simple literal
> is currently in the table):
> 
> [[
>   ┃A > B│simple literal│simple literal│op:numeric-equal(fn:compare(A, B), 1)                 │xsd:boolean┃
> + ┃A > B│plain literal │plain literal │logical-or(
>                                          logical-and(fn:not(op:numeric-equal(fn:compare(str(A), str(B)), 0)), 
>                                             op:numeric-equal(fn:compare(str(A), str(B)), 1)), 
>                                          logical-and(op:numeric-equal(fn:compare(str(A), str(B)), 0), 
>                                             op:numeric-equal(fn:compare(lang(A), lang(B)), 1)))│xsd:boolean┃
> ]]
Something like that, but the lang(A) = lang(B) test needs to be case-insensitive.

> or one could add functions for each of < > <= >= ala:
> [[
> + ┃A > B│plain literal │plain literal │RDFplainLiteral-greaterThan(A, B)│xsd:boolean┃
> 
> RDFplainLiteral-greaterThan
>   xsd:boolean   RDFplainLiteral-greaterThan (plain literal lit1, plain literal lit2)
> 
> If the lexical values of lit1 and lit2 are identical,
> RDFplainLiteral-greaterThan returns TRUE or FALSE depending on whether
> LANG(lit1) > LANG(lit2). If the lexical values are not identical,
> RDFplainLiteral-greaterThan returns TRUE or FALSE depending on whether
> STR(lit1) > STR(lit2).
> ]]
> 
> These specifications were assuming that you wanted this sort order:
>      "abb"
>      "abc"
>      "abc"@EN
>      "abc"@eN
>      "abc"@En
>      "abc"@en
>      "abc"@en-fr # zis iss how we speak here
>      "abd"

Personally, I would not worry about ordering of lang tags - a system may have 
lost the original form.  But codepoint order is the most natural.

> 
>> I tried writing things out from the current operations alone:
>>
>> Some things can be written:
>>   ( lang(?x) = lang(?y) ) && str(?x) > str(?y)
>> but that only works cleanly for the same language tag - a different tag would 
>> cause false, not an error (which seems more natural), and it's verbose.
>>
>> langMatches isn't symmetric but I think:
>>
>>   langMatches(lang(?x),lang(?y)) &&
>>   langMatches(lang(?y),lang(?x)) &&
>>   str(?x) > str(?y)
>>
>> attempts to handle the case-sensitivity issue because a language tag is a 
>> special case of a language range.  It becomes more verbose though - ugh.    
>> Or a regex.
> 
>     REGEX(LANG(?x), LANG(?y), 'i')
> 
>> "11.3.1 Operator Extensibility" could explicitly cover this - I can accept 
>> that language tag handling is an extension if there is text that states 
>> that. So far we have really been thinking of extension by datatypes.
> 
> [[
> Extended SPARQL implementations may support additional associations
> between operators and operator functions; this amounts to adding rows
> to the table above. No additional operator support may yield a result
> that replaces any result other than a type error in an unextended
> implementation.
> ]]
> I think I've convinced myself that it's extendable this way. You
> are adding rows that replace the type errors you would get in an
> unextended implementation.
> 
> These rules just make sure that you don't lose dawg:monotonicity over
> DAWG-specified parts of the language. Ideally, people won't step on
> each other's truth values too much, but I don't think we can say much
> about that.

Specifically mentioning lang tags would be useful because they aren't datatypes.

[[
The consequence of this rule is that extended SPARQL implementations will 
produce at least the same solutions as an unextended implementation, and may, 
for some queries, produce more solutions.
]]
isn't true by the way - filters can be negated so more or fewer solutions are 
going to be possible with any kind of extensibility.
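
As a rough illustration of the mechanism only, here is a small Python sketch of how a negated filter interacts with an extension that replaces a type error with a value. All names are hypothetical, and the "extended" equality deciding by lexical form is a simplifying assumption. It shows one direction (a solution appearing under negation) and does not try to enumerate every case.

```python
class SparqlTypeError(Exception):
    """Hypothetical marker for a SPARQL type error during filter evaluation."""


def filter_keeps(expr) -> bool:
    # A FILTER keeps a solution only if the expression evaluates to true;
    # an error eliminates the solution, just as false does.
    try:
        return expr() is True
    except SparqlTypeError:
        return False


def eq_unextended(a, b) -> bool:
    # a and b are (lexical form, datatype) pairs of an unknown datatype:
    # identical terms are equal, anything else cannot be decided.
    if a == b:
        return True
    raise SparqlTypeError("unknown datatype")


def eq_extended(a, b) -> bool:
    # Hypothetical extension that understands the datatype; for simplicity
    # it decides equality by comparing the pairs directly (an assumption).
    return a == b


a = ("abc", "http://example.org/type")
b = ("xyz", "http://example.org/type")

# Negation does not rescue an error: the unextended filter drops the solution.
unextended_negated = filter_keeps(lambda: not eq_unextended(a, b))
# The extension turns the error into false, so the negated filter keeps it.
extended_negated = filter_keeps(lambda: not eq_extended(a, b))
```

Here unextended_negated is False and extended_negated is True: the same negated filter gives different solutions once the extension is present.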

That's why "!=" should mean "not known to be unequal" and not "not(known to be 
equal)"

 Andy

Received on Monday, 23 October 2006 14:25:48 UTC