RE: Comments on SPARQL draft from Seaborne, Andy on 2004-11-03 (public-rdf-dawg-comments@w3.org from November 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 3 Nov 2004 16:10:20 -0000
To: "Geoff Chappell" <geoff@sover.net>
Cc: <public-rdf-dawg-comments@w3.org>
Message-ID: <8D5B24B83C6A2E4B9E7EE5FA82627DC94D2F77@sdcexcea01.emea.cpqcorp.net>
Geoff,

Thanks for the feedback and apologies for the delay in getting back to
you on the list: comments inline.

	Andy

-------- Original Message --------
> From: Geoff Chappell <>
> Date: 17 October 2004 21:51
> 
> Here are a few comments on the SPARQL draft - hope they're helpful.
> 
> - Aggregate Graphs
> I like that the query language allows for multiple sources and that the
> query is effectively a query against the union of the sources rather
> than a union of the results of the query run against each source. I
> assume that the definition of an aggregate graph as "...the RDF-merge
> of a number of subgraphs" doesn't imply anything about rdf lean-ness of
> the resulting merged graph - is that correct?

The working group is currently looking to make result sets be sets,
including SELECT having no duplicates; we currently believe this will
enable efficient implementations of the aggregate graph.

While query results will not contain duplicates, SELECT may still return
them after projection, for the sake of efficient implementation.  No
tests will define what duplicates are allowed or expected; it is up to
implementations to balance the needs here.  The client can force no
duplicates with SELECT DISTINCT; client libraries to SPARQL query
engines (local or remote) can also enforce no duplicates.
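
As an illustration of the distinction Geoff draws above between querying
the union of the sources and taking the union of per-source results,
here is a minimal Python sketch.  It is not from the working draft: the
matcher, the prefixed names and the data are all made up for the
example, and it assumes the sources share no blank nodes, so that the
RDF-merge reduces to a plain set union.

```python
# Hypothetical toy matcher; triples and patterns are 3-tuples of strings,
# and terms starting with "?" are variables.

def match_pattern(graph, pattern, binding):
    """Yield extensions of `binding` for each triple that fits `pattern`."""
    for triple in graph:
        new = dict(binding)
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            yield new

def query(graph, patterns):
    """Conjunctive match of all patterns; the result is a set of solutions."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings
                    for b2 in match_pattern(graph, pattern, b)]
    return {frozenset(b.items()) for b in bindings}

# Two sources, each holding one half of the information about ex:doc1.
source_a = {("ex:doc1", "rss:link", "http://example.org/1")}
source_b = {("ex:doc1", "dc:title", "Hello")}

patterns = [("?x", "rss:link", "?url"), ("?x", "dc:title", "?t")]

# Assumption: no blank nodes, so the merge is just set union.
over_merge = query(source_a | source_b, patterns)
per_source = query(source_a, patterns) | query(source_b, patterns)
```

Under these assumptions the query over the merged graph finds one
solution, while the union of the per-source results is empty, because
neither source alone satisfies both triple patterns.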

> 
> - Graph Patterns, Constraining Values
> It's not clear to me why the triple patterns and the value constraints
> are segregated. They just seem like different flavors of logical
> factors that must be true in order for the query to be true. Is this
> distinction just an historical artifact? It seems to me that this will
> only make things more difficult when negation and disjunction are added
> (whether or not that happens this version, it seems inevitable that
> they will be).

The current syntax allows them to be mixed although this is not so clear
from the simple examples in the working draft.  The working draft does
use the AND keyword to introduce constraints as it makes the parsing of
arithmetic expressions as opposed to triple patterns easier.  This may
yet change if we find it is unnecessary.  We have not finalised the
syntax nor the grammar yet.

> 
> - Missing Value Assignment

Assignment and the switch statement you describe imply a procedural view
of query execution.  Currently, SPARQL is declarative and execution
order does not matter in functional terms.  Hopefully, this allows
implementations freedom in optimization.  SPARQL is primarily a data
access language and does not provide facilities for presenting the data.
This is a tradeoff between time-to-rec and functionality.

As a tradeoff, this is something we may revisit but on the current
requirements it has not emerged as sufficiently important.  There are
always going to be cases where SPARQL could provide facilities to move
processing nearer the data - it's a matter of balance driven by how
important it is as a feature in a general RDF data access language.

If you have use cases, we would be delighted to see some descriptions.

> Why no ability to do value assignment? I use this feature regularly when
> writing RDF queries (in RDF Gateway's query language). When useful
> functions are added to sparql, I think the lack of this feature will be
> even more bothersome.  For example, wouldn't you want to be able to do
> something like this?:
> 	SELECT ?domain
> 	WHERE  ( ?x rss:link  ?url ) and ?domain=regexp(?url, ....)
> 
> - Negation (Unsaid)
> I think it would be a mistake not to include some form of negation
> (especially since you're already paying the complexity price of
> OPTIONAL - arguably a back-door form of negation). I'll make a
> suggestion in this regard - we have a switch/case construct in RDF
> Gateway's query language that serves somewhat the same purpose as
> OPTIONAL, plus under the hood provides a mechanism for negation. It
> works like this:
> 
> Select ?x ?title where {[rdf:type] ?x [rdfs:Class]}
> 	and switch (?x)
> 	(
> 		case {[rdfs:label] ?x ?l}:
> 			?title=?l
> 		case {[rdfs:comment] ?x ?c}:
> 			?title=?c
> 		default:
> 			?title=''
> 	);
> 
> throw in the functions succeed() and fail() and you can do negation -
> a la:
> 
> select ?x where {[rdf:type] ?x [rdfs:Class]}
> 	and switch(?x)(
> 		case {[rdfs:subClassOf] ?x ?a}:
> 			fail()
> 		default:
> 			succeed()
> 	);
> 
> is the same as:
> 	 {[rdf:type] ?c [rdfs:Class]} and not {[rdfs:subClassOf] ?c ?a}
> 
> I suggest looking to see if the OPTIONAL construct could be expanded in
> a similar manner - so you could support exclusive alternatives as well
> as negation.
> 
> - Disjunction
> You can usually work around the absence of disjunction in the query
> language, but it puts more of a burden on the query author/programmer.
> Why pay that price thousands/millions/? of times down the road just so
> a small number of sparql implementers can avoid a little work now. I
> suggest that if you don't manage to include it in the first sparql
> version, you at least give some thought to how it would be included in
> a later version to avoid creating an OR-unfriendly syntax.
> 
> - Distinct
> I think it would be a mistake for the query language to take a position
> on whether or not query result sets could contain duplicate rows (or if
> it did take a position, I'd want it to be that they couldn't!) From a
> selfish perspective, I worry that we'll have to de-tune RDF Gateway's
> query evaluation in order to allow duplicate rows to exist in a
> resultset (after all if a user wants duplicate rows, they can merely
> select out the variable(s) that make those rows distinguishable).
> Perhaps the issue of duplicate rows could be implementation specific?

See above.  We plan to make results a set and make any projection in
SELECT also be a set, with a note that, for SELECT only, implementations
may return duplicates as an efficiency tradeoff unless the client forces
uniqueness with SELECT DISTINCT (e.g. there is no need to retain memory
to the end of the request, after streaming particular results, just for
the purposes of duplicate suppression).

Test cases will all be in terms of sets (no duplicates in SELECT).
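
To illustrate the memory tradeoff, here is a minimal Python sketch (an
illustration only, not how any particular engine works; the function
names and data are made up).  Projection can stream rows one at a time,
whereas duplicate suppression has to remember every row already sent.

```python
def project(solutions, variables):
    """Stream projected rows; needs no memory of earlier rows."""
    for s in solutions:
        yield tuple(s.get(v) for v in variables)

def distinct(rows):
    """Suppress duplicates; must retain every row seen so far."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

# The solutions form a set (no two are equal), but projecting onto
# "title" alone makes two of the rows identical.
solutions = [
    {"x": "ex:a", "title": "Report"},
    {"x": "ex:b", "title": "Report"},
    {"x": "ex:c", "title": "Memo"},
]

streamed = list(project(solutions, ["title"]))
deduped = list(distinct(project(solutions, ["title"])))
```

Here `streamed` contains the "Report" row twice, while `deduped`
contains each row once, at the cost of the `seen` set growing with the
number of distinct rows.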

> 
> - Typing
> I may be jumping the gun here since there's not yet much specified about
> typing and value comparisons, but please keep query performance against
> large triple stores in mind when specifying the behavior of comparison
> operators such as > and <. If those operators are too type lenient
> (e.g. if they're allowed to operate on plain literals), it makes it
> very difficult to do an efficient indexed query.

This is an area that the working draft does not cover.  Thank you for
pointing out the efficiency issue.

> 
> - Query Syntax
> I imagine you're past this decision point, but thought I'd add my two
> cents anyway. Please consider using something other than <> to delimit
> URIs - it's painful having to always encode these chars in html and
> xml. We use [] in RDF Gateway for exactly this reason. On a similar
> note, why use parens around triples? seems like it just confuses things
> when you also use parens for grouping. Again, we use {} for this reason
> in Gateway's query language.
> 
> 
> Please let me know if anything here is unclear or if you'd like me to go
> into more detail on anything.
> 
> Thanks,
> 
> Geoff Chappell
Received on Wednesday, 3 November 2004 16:10:53 UTC