Re: normalization issues with Turtle spec tests

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sun, 7 Apr 2013 09:24:58 -0400
To: Ruben Verborgh <ruben.verborgh@ugent.be>
Cc: public-rdf-comments@w3.org, Gregg Kellogg <gregg@greggkellogg.net>, gavin@carothers.name
Message-ID: <20130407132455.GA4206@w3.org>

* Ruben Verborgh <ruben.verborgh@ugent.be> [2013-04-07 08:46+0200]
> Dear Eric,
> > The tests check that the term generated from e.g. '"+1"^^xsd:integer' is distinct from '1' (which is the same as '"1"^^xsd:integer').
> >  https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#dfn-literal-equality
> > This might be an opportunity to comment out some code.
> 1) Do you perhaps know the reason for this choice—and has this changed somewhere along the way?

I believe it's been this way going back to the first model and syntax REC in 1999.

> If I take
> @prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
>     <a> <b> "1"^^xsd:integer.
>     <a> <b> "+1"^^xsd:integer.
>     <a> <b> "0001"^^xsd:integer.
> and put it through cwm, I get
>     <a>     <b> 1,
>                 1,
>                 1 .

Hmm, I bet that those three "1"s are the three variants you provided as input, but normalized by the serializer.
If my presumption is correct, rules like
  { <a> <b> "1"^^xsd:integer } => { <a> <b> <c> }
  { <a> <b> "+1"^^xsd:integer } => { <a> <b> <c> }
  { <a> <b> "0001"^^xsd:integer } => { <a> <b> <c> }
would all fire on that input data, while 
  { <a> <b> "990001"^^xsd:integer } => { <a> <b> <c> }
would not.
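
To make that presumption concrete, here is a minimal Python sketch of term-level matching. The Literal class and the term_equal and rule_fires functions are hypothetical names for illustration, not cwm's internals; the only assumption is that a rule's object matches a stored term when lexical form and datatype IRI are identical.

```python
from dataclasses import dataclass

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class Literal:
    """Hypothetical typed literal: a lexical form plus a datatype IRI."""
    lexical: str
    datatype: str


def term_equal(a: Literal, b: Literal) -> bool:
    # Term equality compares the lexical forms as strings; nothing is parsed.
    return a.lexical == b.lexical and a.datatype == b.datatype


# The three objects from the input data, kept as distinct terms.
data = [Literal(s, XSD_INTEGER) for s in ("1", "+1", "0001")]


def rule_fires(rule_object: Literal) -> bool:
    # A rule fires if some stored object is the same term as the rule's object.
    return any(term_equal(rule_object, stored) for stored in data)


for lexical in ("1", "+1", "0001", "990001"):
    print(lexical, rule_fires(Literal(lexical, XSD_INTEGER)))
```

Under that assumption the first three lines print True and the last prints False, even though a normalizing serializer would show all three stored objects as 1.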

It might be worth submitting a bug report asking that normalization by the serializer be disabled, or at least controlled by a command-line argument.

> and if I put that again through cwm, I get
>     <a>     <b> 1 .
> However, does the section you point to mean this changed, so that '"1"^^xsd:integer' and the others are no longer equivalent to '1'?
> 2) Directly above the "literal equality" section, https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#dfn-language-tag says:
> "The language tag must be well-formed according to section 2.2.9 of [BCP47], and must be normalized to lowercase."
> So test langtagged_LONG_with_subtag seems wrong to answer with '@en-UK'.
> Furthermore, the definition of “literal equality” might be slightly off then. Are the following equal?
> - "test"@en-uk
> - "test"@en-UK

They are. What we eventually decided is that language tags are compared case-insensitively.
Another comment has produced some proposed text to be incorporated into RDF Concepts.
If this text is adopted, would that satisfy your comment?
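
As a rough Python illustration of that comparison (same_language_tag and lang_literal_equal are hypothetical helper names, and this assumes the proposed text is adopted as described), only the tag is folded; the lexical form stays case-sensitive:

```python
def same_language_tag(a: str, b: str) -> bool:
    # Well-formed language tags are ASCII, so lower() gives a
    # case-insensitive comparison without locale surprises.
    return a.lower() == b.lower()


def lang_literal_equal(a: tuple[str, str], b: tuple[str, str]) -> bool:
    # Each literal is a (lexical form, language tag) pair.
    # The lexical form stays case-sensitive; only the tag is folded.
    return a[0] == b[0] and same_language_tag(a[1], b[1])


print(lang_literal_equal(("test", "en-uk"), ("test", "en-UK")))  # True
print(lang_literal_equal(("Test", "en-uk"), ("test", "en-uk")))  # False
```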

> 3) The whole equality thing makes it quite tricky to find triples. For instance, if I search for:
>    triples.find(any, any, 1), what should be returned?
> - triples with an object of '1'?
> - triples with an object of '"1"^^xsd:integer'?
> - triples with an object of '"+1"^^xsd:integer'?
> - triples with an object of '"01"^^xsd:integer'?
> - triples with an object of '"000001"^^xsd:integer'?
> How do existing implementations deal with this?

I'm guessing that's like the SPARQL pattern { ?a ?b 1 }, which would match the first two ('1' and '"1"^^xsd:integer'), presuming the single quotes are not in the actual Turtle document.

The SPARQL test suite has a bunch of tests in this area. Don't be thrown off by the difference between the lexically-sensitive term equivalence in graph patterns:
  { ?x :p 1 . } -- Equality 1-1 -- graph <http://www.w3.org/2001/sw/DataAccess/tests/r2#eq-graph-1>
vs. the value-sensitive tests in the SPARQL Operator Mapping <http://www.w3.org/TR/sparql11-query/#OperatorMapping>:
  { ?x :p ?v . FILTER ( ?v = 1 ) . } -- Equality 1-1 <http://www.w3.org/2001/sw/DataAccess/tests/r2#eq-1>
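
For a library's find(), the two behaviours could be offered side by side. The following is a sketch in Python with made-up names (find_by_term, find_by_value) and a toy list of tuples as the store; it is not how any particular implementation does it.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Objects are (lexical form, datatype IRI) pairs; the Turtle shorthand 1
# is the same term as "1"^^xsd:integer.
triples = [
    ("a", "b", ("1", XSD_INTEGER)),
    ("a", "b", ("+1", XSD_INTEGER)),
    ("a", "b", ("01", XSD_INTEGER)),
    ("a", "b", ("000001", XSD_INTEGER)),
]


def find_by_term(store, obj):
    # Like the graph pattern { ?x :p 1 }: the object must be the same term.
    return [t for t in store if t[2] == obj]


def find_by_value(store, value):
    # Like FILTER ( ?v = 1 ): compare the integers the lexical forms denote.
    # int() is more permissive than xsd:integer (it accepts surrounding
    # whitespace and underscores), which is fine for this well-formed sample.
    return [t for t in store if t[2][1] == XSD_INTEGER and int(t[2][0]) == value]


print(len(find_by_term(triples, ("1", XSD_INTEGER))))  # 1
print(len(find_by_value(triples, 1)))                  # 4
```

The term lookup can be served by an ordinary index on the stored terms; the value lookup is the one that needs either a scan or a second index keyed on parsed values.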

> So yes, I might comment out some code.
> But then the result will either be more difficult to work with for the library user (because of inequalities),
> or far less performant (as I’d have to index a normalized version and still store and return the original).
> I really wonder why the choice against normalization was made.

It's hard to set an upper bound on normalization.
For instance, if we included integers in V1 and added floats in V2, all the V1 tests with "01.0"^^xsd:float would break in V2.
I'm pretty comfortable with this state of affairs, noting that it applies pressure on publishers to use canonical representations and provides tools to look for e.g. ints with leading '0's.
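
Looking for such ints is cheap precisely because the lexical forms are preserved. A minimal Python sketch, assuming the canonical form of an xsd:integer has no '+' sign and no leading zeros (so "-0" is treated as non-canonical too); canonical_integer and non_canonical are hypothetical names:

```python
import re

# Lexical space of xsd:integer: optional sign followed by ASCII digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def canonical_integer(lexical: str) -> str:
    # Reject anything outside the lexical space before parsing, because
    # int() alone would also accept whitespace and underscores.
    if not INTEGER_LEXICAL.fullmatch(lexical):
        raise ValueError(f"not an xsd:integer lexical form: {lexical!r}")
    # str(int(...)) drops a '+' sign and leading zeros.
    return str(int(lexical))


def non_canonical(lexicals):
    # Keep only the lexical forms that differ from their canonical form.
    return [s for s in lexicals if canonical_integer(s) != s]


print(non_canonical(["1", "+1", "0001", "-7", "0"]))  # ['+1', '0001']
```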

> > I use SWObjects with a command line "-d test.nt --compare ref.nt"
> Wonderful, I will try that. Seems much easier.

Let me know if you run into trouble with that. I can't remember if --compare was in the last version I uploaded. Linux will be the easiest for me to release if not.

If this message addresses your comments, please reply with "[RESOLVED]" at the beginning of the subject.
Received on Sunday, 7 April 2013 13:25:33 UTC
