RE: Counting, Ordering and DISTINCT from Seaborne, Andy on 2004-11-03 (public-rdf-dawg-comments@w3.org from November 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 3 Nov 2004 16:51:32 -0000
To: "Andrew Newman" <andrew@tucanatech.com>, <public-rdf-dawg-comments@w3.org>
Message-ID: <8D5B24B83C6A2E4B9E7EE5FA82627DC94D2F85@sdcexcea01.emea.cpqcorp.net>
-------- Original Message --------
> From: Andrew Newman <>
> Date: 20 October 2004 00:36
> 
> I wish to offer feedback in relation to SPARQL.

Thanks for the comments.

> 
> It seems to me that there are many frequently occurring use cases not
> covered by the current standard.
> 
> One piece of missing functionality seems to be counting and sorting
> results.  Things like "How many items of a certain type are in a
> graph?" or "Give me the 10 highest priority items".  The lack of these
> operations seems to be an extremely negative aspect of the standard
> and one that I believe will hamper wide user acceptance.
> 
> One of the expectations might be for users to implement their own
> sorting and counting once they receive results back from a query.  The
> most obvious problem with this is the expense of post processing
> results.  The performance of RDF stores in comparison to others is
> generally considered poor, these kinds of operations will make it much
> worse.  I also think it's unlikely the user will implement their own
> post processing more likely they will choose something that does offer
> this functionality.
> 
> So to fill this lack of utility in the query language it will require
> implementors to support their own syntax and semantics of counting and
> sorting.  This seems to then negate one of the benefits of
> standardization. 
> 
> If certain implementations feel that it's infeasible to support
> counting and sorting then maybe it should be an optional feature of
> SPARQL.  So implementors can either offer a correct solution or leave
> it unimplemented.

This is useful input - there is a tradeoff of features against
time-to-completion on the recommendation.

Given that the recommendation is for a web RDF data access language, the
interoperability requirements are strong.  Allowing optional features
makes for the situation where some engines provide a feature and some do
not.  This makes the decision on optional features very important and
things such as counting would ideally be either core features or not
specified.

This would be a good issue to table for any future work if it does not
make this current round.

> 
> The other issue with the SPARQL is the lack of an implicit distinct.
> In my understand of SQL, DISTINCT is optional because if your queries
> work on normalized data and joins are based on distinct keys then the
> returned results cannot be duplicated.  If your query works on rows
> with repeated values on the same column then you apply DISTINCT.
> 
> In RDF's data model there isn't really this problem of duplicated data
> and normalization.  SPARQL has the idea of matching statements in the
> graph.  From my understanding, RDF's data model doesn't support the
> idea of multiple subject, predicates and/or objects with the same
> values.
> 
> In other words, it only seems valid that if a query matches one result
> in the graph it should return that one unique result not repeated
> multiple results.
> 
> While I can see many use cases for distinct vs non-distinct results I
> am not aware of a reason to return non-distinct results over distinct
> results.  Have I missed something?

The working group is currently looking to make result sets be sets,
including SELECT having no duplicates.  While the defined query results
will not contain duplicates, a SELECT implementation may still return
them after projection for efficiency; no tests will define which
duplicates are allowed or expected, and it is up to implementations to
balance the needs here.  The client can force no duplicates with SELECT
DISTINCT; client libraries for SPARQL query engines (local or remote)
can also enforce no duplicates.

Duplicates might arise in a few places: 
+ projection in SELECT 
  example: SELECT ?x ?y WHERE (?x ?y ?z)
+ query over a collection of graphs
+ RDF graphs formed by running a rules or other system to compute
  triples.
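
To illustrate the projection case in the list above, here is a minimal
Python sketch.  It is not SPARQL and not any engine's implementation;
the toy graph, the ex: names and the helper functions select_x_y and
distinct are all made up for illustration.  It matches (?x ?y ?z)
against a graph that contains no repeated triple, projects away ?z, and
shows that the projected rows can still repeat until a DISTINCT-style
step is applied.

```python
# Hypothetical toy graph: an RDF graph is a set of triples, so no
# (subject, predicate, object) triple appears more than once.
GRAPH = {
    ("ex:a", "ex:p", "ex:v1"),
    ("ex:a", "ex:p", "ex:v2"),
    ("ex:b", "ex:q", "ex:v1"),
}


def select_x_y(graph):
    """Match (?x ?y ?z) against every triple, then keep only ?x and ?y."""
    # Sorting only makes the output order deterministic for the example.
    return [(s, p) for (s, p, _o) in sorted(graph)]


def distinct(rows):
    """Drop repeated rows, keeping the first occurrence of each one."""
    return list(dict.fromkeys(rows))


rows = select_x_y(GRAPH)
print(rows)            # ("ex:a", "ex:p") appears twice
print(distinct(rows))  # each (?x, ?y) pair appears once
```

The two triples that differ only in ?z collapse to the same (?x, ?y)
row, which is where the duplicate comes from even though the graph
itself has none.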

Given the test cases will be defined in terms of sets only,
implementations are making a tradeoff if they allow duplicates in
SELECT (without DISTINCT).

> 
> I work with a member of the DAWG and follow the mailing list archives
> from time to time.  I have asked him about why these features are not
> in the standard without getting an answer that I would consider
> appropriate.  I know that a user centric view is applied to the
> development of this standard.  However, with the above functionality
> in mind, it seems to me it has been avoided because it's difficult to
> implement rather than functionality that user's require.
Received on Wednesday, 3 November 2004 16:52:05 UTC