Re: [Update] [LLD] Dataset Description from Andy Seaborne on 2014-03-03 (public-semweb-lifesci@w3.org from March 2014)

From: Andy Seaborne <andy@apache.org>
Date: Mon, 03 Mar 2014 22:43:31 +0000
To: David Booth <david@dbooth.org>, "w3.hcls@gmail.com" <w3.hcls@gmail.com>, "public-semweb-lifesci@w3.org" <public-semweb-lifesci@w3.org>
Message-ID: <53150593.1060408@apache.org>

On 03/03/14 22:00, David Booth wrote:
> Hi Andy,
>
> On 03/03/2014 03:01 PM, Andy Seaborne wrote:
>> (please forward if the mailing list does not allow non-subscribers to
>> send to it)
>>
>> On 03/03/14 16:32, David Booth wrote:
>>> On 02/09/2014 05:45 PM, w3.hcls@gmail.com wrote:
>>>> Relevant docs:
>>>> - Working draft of W3C Note:
>>>> https://docs.google.com/document/d/1zGQJ9bO_dSc8taINTNHdnjYEzUyYkbjglrcuUPuoITw/edit#heading=h.wyc73yp7c8jz
>>>>
>>>>
>>>>
>>>
>>> I notice that section 6.6.1 Core statistics shows this SPARQL query for
>>> counting the number of triples:
>>>
>>>    SELECT (COUNT(*) AS ?no) { ?s ?p ?o  }
>>>
>>> However, I believe the SPARQL 1.1 standard allows duplicate triples and
>>> duplicate query solutions by default.  If so, to get an accurate count
>>> of the number of triples, the DISTINCT keyword must be used:
>>>
>>>    SELECT (COUNT(DISTINCT *) AS ?no) { ?s ?p ?o  }
>>>
>>> I'm copying Andy Seaborne to see if this is correct, since I could not
>>> easily find this information in the SPARQL 1.1 spec when I did a quick
>>> scan.   Andy, am I correct about this?
>>>
>>> Thanks,
>>> David
>>
>> Hi,
>>
>> In the case of { ?s ?p ?o }, the match is against the default graph and
>> an RDF graph is a set of triples - so there are no duplicates over the
>> ?s, ?p, ?o elements of a row.
>>
>> Because of the nature of the pattern, COUNT(*) and COUNT(DISTINCT *)
>> should be the same.

I think section 6.6.1 Core statistics is correct as is.

What does the spec say?  That's the definitive place to look.

>
> I'm particularly thinking of AllegroGraph, which (by default I believe)

I don't know what AllegroGraph does.  Sounds like a question for the 
developers.

> does not remove duplicate triples if the same triple happens to be
> loaded more than once.

bNodes?  With all the RDF syntaxes, reading a file twice creates
separate bNodes.
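
A minimal sketch of the bNode point, in Python rather than any real
store's API.  The load() helper, the "_:bN" labels and the ex: names
are all hypothetical; the only assumption taken from RDF is that blank
node labels are scoped to a single read of a document.

```python
import itertools

# Hypothetical source of fresh blank node labels shared across loads.
_fresh = itertools.count()

def load(document, graph):
    """Add a document's triples to graph, giving bNodes fresh labels per load."""
    mapping = {}  # bNode label in the file -> fresh label, for this load only

    def term(t):
        if t.startswith("_:"):
            if t not in mapping:
                mapping[t] = f"_:b{next(_fresh)}"
            return mapping[t]
        return t

    for s, p, o in document:
        graph.add((term(s), p, term(o)))

document = [
    ("_:x", "ex:name", '"Alice"'),   # triple with a blank node subject
    ("ex:a", "ex:p", "ex:b"),        # ground triple, no blank nodes
]

graph = set()          # an RDF graph modelled as a set of triples
load(document, graph)
load(document, graph)  # same file read a second time

triple_count = len(graph)
```

In this toy model the ground triple is stored once, while the bNode
triple appears twice with different bNodes, so the two copies are
different triples rather than duplicates.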

>  If AllegroGraph returns a different count to the
> queries above (with or without DISTINCT), does that mean that
> AllegroGraph is not SPARQL 1.1 compliant?   I.e., is it a bug, or is it
> a permissible implementation variation?
>
> I had the impression that SPARQL 1.1 conformant implementations are
> permitted to have duplicate solutions in the solution set unless the
> word DISTINCT is used,

Do you have a pointer to text that gave you that impression?

> and hence I would have thought that a solution
> set that is not explicitly constrained to be DISTINCT could include
> duplicates, even if that solution set is for only a { ?s ?p ?o } graph
> pattern over the default graph, but maybe I'm wrong.

I don't see how { ?s ?p ?o } can create duplicates - an RDF graph is a 
*set* of triples (that's not a SPARQL definition - it's an RDF 
definition) so subject/predicate/object is a unique combination within a 
graph.
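
As a rough illustration of the set argument (a toy model in Python, not
how any particular store works; the ex: names are made up), adding the
same triple twice to a set leaves one copy, so the two counts agree:

```python
# Toy model: an RDF graph as a Python set of (s, p, o) tuples.
graph = set()
triple = ("ex:a", "ex:p", "ex:b")
graph.add(triple)
graph.add(triple)                    # same triple again: the set is unchanged
graph.add(("ex:a", "ex:p", "ex:c"))

# Matching { ?s ?p ?o } gives one solution per triple in the graph.
solutions = [{"s": s, "p": p, "o": o} for (s, p, o) in graph]

# COUNT(*) counts solutions; COUNT(DISTINCT *) counts distinct solutions.
count_all = len(solutions)
count_distinct = len({(r["s"], r["p"], r["o"]) for r in solutions})
```

Both count_all and count_distinct come out the same here because every
solution binds all three of ?s, ?p and ?o.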

If the graph is composed behind the scenes of other data, that's nothing 
to do with the RDF or SPARQL specs.

> OTOH, if, when
> DISTINCT is not specified, the SPARQL 1.1 standard only *sometimes*
> permits duplicates, then how can I determine which circumstances permit
> them and which don't?

It depends on the query pattern but we're talking about one specific 
pattern - { ?s ?p ?o }

In general, SPARQL results are multisets (duplicates are possible).
Some of the algebra operations, such as projection and union, can cause
duplicates, but their cardinality is defined.
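
A small sketch of how projection and union give duplicates with a
defined cardinality, using Python's Counter as a stand-in for a
multiset.  The data and ex: names are illustrative assumptions; this
models the cardinality rules only, not a SPARQL engine.

```python
from collections import Counter

# Graph as a set of triples: no duplicates at this level.
graph = {
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:p", "ex:c"),
    ("ex:d", "ex:p", "ex:b"),
}

# SELECT ?s { ?s ?p ?o }
# Projection keeps one row per matched triple, so a subject that occurs
# in two triples appears with cardinality 2.
projected = Counter(s for (s, p, o) in graph)

# { ?s ?p ?o } UNION { ?s ?p ?o }
# Union adds the cardinalities of the two sides.
one_side = Counter(graph)
union = one_side + one_side

rows_without_distinct = sum(projected.values())  # SELECT ?s
rows_with_distinct = len(projected)              # SELECT DISTINCT ?s
```

Here rows_without_distinct and rows_with_distinct differ, which is the
case where DISTINCT changes the answer; for the bare { ?s ?p ?o }
pattern with all three variables kept, it does not.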

	Andy


>
> David
>

Received on Monday, 3 March 2014 22:44:01 UTC