Re: Graphs Design 6.2

On Wed, 2012-04-25 at 22:32 +0100, Richard Cyganiak wrote:
> Sandro,
> 
> I have a few questions about the kinds of problems that you're trying to solve with the named graphs design. My apologies if you have answered them before — I've skipped some of the named graphs threads in the last few weeks.

No worries, glad to have you back.

> Let's say I find a couple of TriG files on the Web. This being the Web, I don't trust them fully.
> 
> I want to load all of them into my SPARQL store so that I can query them with SPARQL. But I want to load them in a way so that I can still change my mind about what to trust or distrust after having loaded them. So I need to keep track of who said what.
> 
> Would you consider this a reasonable thing to do?

Yes, absolutely.  At first, most of what the RDF crawlers find is going
to be just graphs, but I think it's quite possible that if we do our
work well we'll start to see more and more datasets out there.  There's
a real appeal to publishing separated metadata with your data, I think,
and sometimes publishing multiple distinct collections inside one
document.

> (You might answer: No, don't load TriG files that you don't trust! Or you might answer: No, you'd need a more powerful, yet-to-be-developed, query language, which is out of scope for us, if you want to query data across multiple TriG files.)
> 
> Assuming you consider it reasonable, then please consider the following sub-cases:
> 
> 1) I may not fully trust everything that's said in some of the named graphs in these TriG files. (That is, I trust the source of the TriG file and the metadata in the default graphs, but don't trust some of the other sources quoted in the named graphs.)
> 
> 2) I may not fully trust everything that's said in the default graphs of these TriG files. (For example, the metadata in the default graphs might be horribly outdated, or the TriG files use @union and I don't trust some of the NGs.)
> 
> 3) I may not fully trust the association between graph IRIs and graphs in some of the TriG files (That is, I suspect they might be lying or mistaken when ascribing statements to certain source IRIs in their named graphs).
> 
> Which of these sub-cases must a successful design handle to meet your requirements?

All of them.

> Bonus question: Which of them can be handled by the 6.2 design, and how? You may leave this one as an exercise to the reader of your response — what I really want to know is which of the problems above you want to solve.

I think 6.2 can handle them all.  I'm now thinking in terms of 6.3,
which I'm still writing up.  The main difference is that I've come up
with a name for the class of things denoted by graph labels (namely,
"Graph Resources") and I'm letting go of rdf:Graph, rdf:GraphContainer,
and rdf:hasGraph.  I think those kinds of things can be defined later,
as the product of research.  The main thing we'd be adding that's not
in SPARQL is a story about what it means when you use a graph label in
your RDF.  When we start to look at inference, change over time, etc., I
think that will matter.

Since I'm hoping for an A+, here's an answer to the reader exercise:

You said:
> I want to load all of them into my SPARQL store so that I can query
> them with SPARQL. But I want to load them in a way so that I can still
> change my mind about what to trust or distrust after having loaded
> them. So I need to keep track of who said what.

Okay, so maybe we have:

=== D1, from http://example.com/alice ===
@prefix : <http://example.com/> .
{ :a :b 1 }
:g1 { :a :b 2 }
==========

and

=== D2, from http://example.com/bob ===
@prefix : <http://example.com/> .
{ :a :b 3 }
:g1 { :a :b 4 }
:g2 { :a :b 5,6 }
==========

Since we're not sure we want to trust them, we can't just merge them
into our SPARQL store; we have to quote them in some way.  Perhaps we'll
end up with something like this (leaving out some data we'd probably
want for cache management, for now):

=== D3, our store after the harvest ===
@prefix : <http://example.com/> .
@prefix cr: <http://example.com/crawler/> .
# all the graphs we encounter, given new names
cr:g9971 { :a :b 1 }
cr:g9972 { :a :b 2 }
cr:g9973 { :a :b 3 }
cr:g9974 { :a :b 4 }
cr:g9975 { :a :b 5,6 }
# all the stuff we figured out for ourselves, and thus trust
{ [] a :DatasetRead;
     :from :alice;
     :defaultGraph cr:g9971;
     :entry [  :name :g1; :graph cr:g9972 ].    
  [] a :DatasetRead;
     :from :bob;
     :defaultGraph cr:g9973;
     :entry [  :name :g1; :graph cr:g9974 ];
     :entry [  :name :g2; :graph cr:g9975 ].
}
============

Hopefully, this is about what you'd expect.  It's a little ugly, but
offhand I can't think of a simpler way to make it work within the
confines of SPARQL.   (It would be quite different in N3, but that's
probably not relevant.)
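
To make the renaming step concrete, here is a rough Python sketch of how
a crawler might get from D1 and D2 to D3.  Everything in it is an
assumption for illustration: the harvest() function, the dict-based
store layout, and the counter starting at 9971 are made up to match the
example, not part of any design.

```python
from itertools import count

def harvest(store, source, dataset, ids):
    """Quote one fetched dataset into the store under fresh graph names.

    store   -- {"graphs": {new_name: triples}, "reads": [read records]}
    source  -- name of the place the dataset was fetched from
    dataset -- {"default": triples, "named": {label: triples}}
    ids     -- iterator of fresh integers used to mint crawler graph names
    (All of these shapes are hypothetical, chosen to mirror D3.)
    """
    def mint(triples):
        # Every graph we encounter gets a new name that only we control.
        name = f"cr:g{next(ids)}"
        store["graphs"][name] = set(triples)
        return name

    # The record of what we read is our own observation, so we trust it.
    read = {
        "from": source,
        "defaultGraph": mint(dataset["default"]),
        "entries": [
            {"name": label, "graph": mint(triples)}
            for label, triples in dataset["named"].items()
        ],
    }
    store["reads"].append(read)
    return read

d1 = {"default": [(":a", ":b", 1)],
      "named": {":g1": [(":a", ":b", 2)]}}
d2 = {"default": [(":a", ":b", 3)],
      "named": {":g1": [(":a", ":b", 4)],
                ":g2": [(":a", ":b", 5), (":a", ":b", 6)]}}

store = {"graphs": {}, "reads": []}
ids = count(9971)
harvest(store, ":alice", d1, ids)
harvest(store, ":bob", d2, ids)
```

Note that the two :g1 graphs never collide: each source's use of the
label :g1 is only recorded as an :entry, alongside the crawler's own
name for what that source said.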

A demo for this would be to include a lot of FOAF files in the crawl,
and then I would query for something (e.g. people whose names match a
regexp), but ask for only results I "trust".  I'd define trusted
sources as anyone who is within my 3rd degree circle of foaf:knows.  I
think that's doable with this structure, although probably not in a
single SPARQL query.
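
As a sketch of the part that probably falls outside a single SPARQL
query, the trust circle could be computed in ordinary code and then used
to pick which crawler graphs to query.  This is Python with made-up
data; trust_circle(), trusted_graphs(), the people, and the shortcut of
treating a source name as the person who published it are all
assumptions, as is the idea that the foaf:knows edges come from data we
already trust.

```python
from collections import deque

def trust_circle(knows, me, max_degree=3):
    """Everyone reachable from `me` in at most `max_degree` foaf:knows hops."""
    seen = {me: 0}            # person -> number of hops from me
    queue = deque([me])
    while queue:
        person = queue.popleft()
        if seen[person] == max_degree:
            continue          # do not expand beyond the circle's edge
        for friend in knows.get(person, ()):
            if friend not in seen:
                seen[friend] = seen[person] + 1
                queue.append(friend)
    return set(seen)

def trusted_graphs(reads, circle):
    """Crawler graph names whose source is inside the circle."""
    names = set()
    for read in reads:
        if read["from"] in circle:
            names.add(read["defaultGraph"])
            names.update(entry["graph"] for entry in read["entries"])
    return names

# Hypothetical foaf:knows edges; :bob is four hops away from :me.
knows = {":me": [":alice"], ":alice": [":carol"],
         ":carol": [":dave"], ":dave": [":bob"]}

# The DatasetRead records from D3, written as plain dicts.
reads = [
    {"from": ":alice", "defaultGraph": "cr:g9971",
     "entries": [{"name": ":g1", "graph": "cr:g9972"}]},
    {"from": ":bob", "defaultGraph": "cr:g9973",
     "entries": [{"name": ":g1", "graph": "cr:g9974"},
                 {"name": ":g2", "graph": "cr:g9975"}]},
]

circle = trust_circle(knows, ":me")
usable = trusted_graphs(reads, circle)
```

The resulting graph names could then be handed to the store as the set
of graphs to query, so the name-matching query itself stays simple.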

The demo would show what happens when someone inside the circle lies
vs. someone outside the circle.  The advanced demo would have people
including owl:sameAs arcs in their FOAF files, and those arcs being used
properly.  It would also show what happens as different people change
their FOAF files, and old data and no-longer-supported inferences go
away.  We might find the object of :name has to be a string.
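
One possible way to make old data and unsupported inferences go away,
sketched in Python: before re-reading a source, drop everything
previously harvested from it, then drop any inference that relied on a
graph that no longer exists.  The forget_source() function, the store
layout, and the "support" bookkeeping are all hypothetical; a real store
might track this quite differently.

```python
def forget_source(store, source):
    """Drop all graphs read from `source`, plus inferences that relied on them."""
    kept = []
    for read in store["reads"]:
        if read["from"] != source:
            kept.append(read)
            continue
        # Remove the quoted default graph and every quoted named graph.
        store["graphs"].pop(read["defaultGraph"], None)
        for entry in read["entries"]:
            store["graphs"].pop(entry["graph"], None)
    store["reads"] = kept
    # Keep an inference only if every graph it was derived from survives.
    store["inferred"] = [
        inf for inf in store["inferred"]
        if inf["support"] <= store["graphs"].keys()
    ]

store = {
    "graphs": {
        "cr:g9971": {(":a", ":b", 1)},
        "cr:g9972": {(":a", ":b", 2)},
        "cr:g9973": {(":a", ":b", 3)},
    },
    "reads": [
        {"from": ":alice", "defaultGraph": "cr:g9971",
         "entries": [{"name": ":g1", "graph": "cr:g9972"}]},
        {"from": ":bob", "defaultGraph": "cr:g9973", "entries": []},
    ],
    # Made-up inferences, each tagged with the graphs it was derived from.
    "inferred": [
        {"triple": (":a", ":b", 7), "support": {"cr:g9972"}},
        {"triple": (":a", ":b", 8), "support": {"cr:g9973"}},
    ],
}

forget_source(store, ":alice")
```

After this, the source can be harvested again under fresh graph names,
and anything it no longer says simply never comes back.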

   -- Sandro

Received on Thursday, 26 April 2012 17:28:59 UTC