ACTION item(s): definition(s) and an algorithm for postprocessing minimization from Bijan Parsia on 2006-08-19 (public-rdf-dawg@w3.org from July to September 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Sat, 19 Aug 2006 18:52:30 +0100
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <816A1377-D8DD-4857-B443-F93509A02C32@cs.man.ac.uk>

I believe that the qualms Andy had about what happens in intermediate  
steps was do the different interpretations of what the minimized  
final answer set would look like. That is, his understanding was  
close to 1 (like pat) and mine were entirely different. However,  
algorithms for computing 2 and 3 are easy enough, so I'm doing the  
action item anyway.

So, first qualification: These algorithms are only minimizing with  
respect to BNodes. You have to plug in your own account of literals.  
I use "row" for "answer" avoid confusion:

DEFINITION 1: Answer graph template
	Let A be an answer set and Avar be the set of column headings of A.

	The answer graph template of A is the set of triple patterns, such  
that:
		{tp | _:row ('http://var.org/#" ++ var) var. & var \in Avar}
	

DEFINITION 2: Answer set graph.
	An answer set graph ASG for answer set A applying the CONSTRUCT  
operation on A with the answer graph template of A.

DEFINITION 3: REALLYREALLYDISTINCT
	An answer set is REALLYREALLYDISTINCT just in case all its  
corresponding answer set graphs are lean.

DEFINITION4: PAIRWISE-DISTINCT
	An answer set, A,  is PAIRWISE-DISTINCT just in case for every pair  
of rows, Ri, Rj, in A, the answer set graph for Ri does not simply  
entail the answer set graph for Rj. In other words, that their union  
is lean.

I trust algorithms for computing REALLYREALLYDISTINCT and PAIRWISE- 
DISTINCT answer sets is obvious. I don't like PAIRWISE-DISTINCTness,  
but it's cheaper to check, in general.

A REALLYREALLYDISTINCT answer set need not be PAIRWISE-DISTINCT. For  
example:

	?x	?y
	_:a	"1"
	_:a	"2"
	b	"1"

This is not PAIRWISE-DISTINCT ({_:row var:x b. _:row var:y "1"}  
simply entails {_:row var:x _:a. _:row var:y "1".}), but it is   
REALLYREALLYDISTINCT (
	{_:row1 var:x _:a. "_:row1 var:y "1" .
	_:row2 var:x _:a. _:row2 var:y "2" .
	_:row3 var:x b. _:row var:y "1".}
is lean). This shows that coreference of BNodes between answer sets  
is significant to REALLYREALLYDISTINCT but not to PAIRWISE-DISTINCT.

In general, either will be generally cheaper than Pat's proposal,  
assuming you have to lean the data graph first (or maybe even check  
leanness). Answer sets tend to be smaller than graphs and you only  
have to check relevant data (though, as we've seen, there are  
opposite cases; e.g., the graph is lean and the answer set not either  
of the distincts defined above). Also, the data is more structured.

Implementation burden, well, for the raw functionality, it's roughly  
the same since you have to implement leaning. I think leaning answers  
will be easier to implement for the scale and kind of queries people do.

There is also a close correspondence between a REALLYREALLYDISTINCT  
answer set and a CONSTRUCT of that answer set. I conjecture that if  
the answer set is REALLYREALLYDISTINCT, that any CONSTRUCTion of it  
on a lean template will be lean. Oh, hmm. There are additional  
conditions. You have to use all the variables (or it's like a  
projection) and maybe only once?

Cheers,
Bijan.

Received on Saturday, 19 August 2006 17:57:41 UTC