Counting, Ordering and DISTINCT

I wish to offer feedback in relation to SPARQL.

It seems to me that there are many frequently occurring use cases not 
covered by the current standard.

One piece of missing functionality is counting and sorting results: 
queries like "How many items of a certain type are in a graph?" or 
"Give me the 10 highest priority items".  The lack of these operations 
seems to be a serious weakness of the standard and one that I believe 
will hamper wide user acceptance.

One expectation might be that users implement their own sorting and 
counting once they receive results back from a query.  The most obvious 
problem with this is the expense of post-processing results.  The 
performance of RDF stores is generally considered poor in comparison to 
other kinds of stores, and these kinds of operations will make it much 
worse.  I also think it's unlikely that users will implement their own 
post-processing; more likely they will choose something that does offer 
this functionality.
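
To make the cost concrete, here is a rough sketch of the kind of 
post-processing I mean, written in Python.  It assumes the query results 
arrive as a list of variable bindings; the variable names (item, type, 
priority), the ex: identifiers and the two helper functions are made up 
for illustration and are not part of any SPARQL implementation.

```python
import heapq
from collections import Counter

# Hypothetical query results: one dict of variable bindings per solution.
# The variable names and ex: identifiers are illustrative only.
results = [
    {"item": "ex:a", "type": "ex:Task", "priority": 3},
    {"item": "ex:b", "type": "ex:Task", "priority": 9},
    {"item": "ex:c", "type": "ex:Note", "priority": 1},
    {"item": "ex:d", "type": "ex:Task", "priority": 7},
]


def count_by_type(rows):
    """Count solutions per value of the 'type' binding."""
    return Counter(row["type"] for row in rows)


def top_n_by_priority(rows, n=10):
    """Return the n rows with the highest 'priority' binding."""
    return heapq.nlargest(n, rows, key=lambda row: row["priority"])


type_counts = count_by_type(results)
top_two = [row["item"] for row in top_n_by_priority(results, n=2)]
```

Note that every solution has to be transferred to the client before it 
can be counted or ranked, even when the user only wants a single number 
or the first ten rows.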

So, to fill this gap in the query language, implementors will have to 
support their own syntax and semantics for counting and sorting.  This 
seems to negate one of the benefits of standardization.

If certain implementors feel that it's infeasible to support counting 
and sorting then maybe it should be an optional feature of SPARQL, so 
that implementors can either offer a correct solution or leave it 
unimplemented.

The other issue with SPARQL is the lack of an implicit DISTINCT.  In 
my understanding of SQL, DISTINCT is optional because if your queries 
work on normalized data and joins are based on distinct keys then the 
returned results cannot be duplicated.  If your query works on rows with 
repeated values in the same column then you apply DISTINCT.

In RDF's data model there isn't really this problem of duplicated data 
and normalization.  SPARQL has the idea of matching statements in the 
graph.  From my understanding, RDF's data model doesn't support the idea 
of multiple statements with the same subject, predicate and object 
values.

In other words, it only seems valid that if a query matches one result 
in the graph it should return that one unique result, not the same 
result repeated multiple times.
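
As a minimal sketch of the model I have in mind, a graph can be treated 
as a set of (subject, predicate, object) statements.  The Python below 
is only an illustration under that assumption: the match helper and the 
ex: names are hypothetical and do not describe how any particular RDF 
store or SPARQL engine works.

```python
# An RDF graph modelled as a Python set of (subject, predicate, object)
# tuples. The ex: names are made up for illustration.
graph = set()
graph.add(("ex:a", "rdf:type", "ex:Task"))
graph.add(("ex:a", "rdf:type", "ex:Task"))  # same statement asserted again
graph.add(("ex:b", "rdf:type", "ex:Task"))


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Each statement is stored once, so each one is matched once.
tasks = match(graph, p="rdf:type", o="ex:Task")
```

Asserting the same statement twice leaves the graph unchanged, so the 
pattern above matches ex:a exactly once.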

While I can see many use cases for distinct results, I am not aware of 
a reason to return non-distinct results over distinct results.  Have I 
missed something?

I work with a member of the DAWG and follow the mailing list archives 
from time to time.  I have asked him why these features are not in the 
standard, without getting an answer that I would consider appropriate.  
I know that a user-centric view is applied to the development of this 
standard.  However, with the above functionality in mind, it seems to me 
that it has been avoided because it's difficult to implement, rather 
than included because it is functionality that users require.

Received on Tuesday, 19 October 2004 23:37:01 UTC