Re: trusting quads from Steve Harris on 2012-05-09 (public-rdf-wg@w3.org from May 2012)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 9 May 2012 13:02:20 -0700
To: Sandro Hawke <sandro@w3.org>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <7E42A7DD-D34D-4024-8986-958B2E752E0D@garlik.com>

On 9 May 2012, at 12:02, Sandro Hawke wrote:

> On Wed, 2012-05-09 at 11:26 -0700, Steve Harris wrote:
>> 
>> Right. The whole reason quads were implemented was to be able to track
>> which *triples* appear in which documents (typically found on the web,
>> but file: is good too).
> 
> Speak for yourself, please, Steve.   I've seen several implementations
> of quads that were used for other purposes and it's quite possible they
> predated yours.  In general, I think the motivation for quads/datasets
> is to work with a bunch of triples at the same time, in one system,
> while still keeping them in distinct groupings, so they can have their
> own metadata, source information, dependency tracking, etc.

I don't think so. When I started work on quad-based systems (around 2000, maybe early 2001), they were relatively uncommon. The first time I saw quads / named graphs being *published* was many years later, and the first time I saw anything other than web-crawled sets of graphs / database dumps being published was many years after that.

<history-lesson>

The triplestore I designed for the AKT project in 2001 had its query language (originally OKBC, later RDQL) extended to allow queries over the 4th slot. It wasn't the first one, but it was probably close to it. The motivation was 100% to allow storage of multiple graphs, keeping them separate. RDF triplestore authors were a pretty small community at that point.

3store was GPL'd in October 2002, by which time it was already quite mature, and had quads from the start. I don't remember when I started work on it, probably late 2001 / early 2002.

TriG dates from 2007, and N-Quads from 2008. There was the idea of quoting in N3 much earlier, but I don't think I've ever seen that used in the wild.

</history-lesson>

"keeping them in distinct groupings, so they can have their own metadata, source information, dependency tracking, etc." is exactly the issue here.

>> If you allow/encourage web documents to circumvent this, then you
>> break that. 
> 
> I don't understand how you handle a triple at arm's length, without
> taking it as gospel, but you can't do that with a quad.

Because of the 4th slot. You say things like "without taking it as gospel" because your perspective is that of some giant logic system. My perspective is that of databases - I don't "believe" the things in my databases, it's all about the context. If you ask a user to enter their name, you don't "believe" the answer they give, you just store it. You can still query things you don't believe, as long as you know how, why, and who says so. That's what the 4th slot was created for.

> If you get a triple you want to store, but not trust, you put it off in
> a separate space (aka a named graph), where you won't accidentally query
> it when you're querying the stuff you trust.

That might be what you do. I tag it with some metadata so that I can exclude it from queries if I want to or, more commonly, so that I can return the "provenance" (in the loosest sense of the word) along with the results.
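
A minimal sketch of that database-style use of the 4th slot, in Python rather than any real store's API. The `Quad` type, the `match` helper, the `exclude_graphs` parameter, and the example URLs are all hypothetical names made up for illustration; this is not how 3store or any other system is implemented.

```python
from typing import Iterable, List, NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # 4th slot: the document the triple was found in


def match(quads: Iterable[Quad],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None,
          exclude_graphs: Iterable[str] = ()) -> List[Quad]:
    """Return quads matching the pattern; None acts as a wildcard.

    Quads whose 4th slot is listed in exclude_graphs are skipped.
    """
    excluded = set(exclude_graphs)
    results = []
    for q in quads:
        if q.graph in excluded:
            continue
        if subject is not None and q.subject != subject:
            continue
        if predicate is not None and q.predicate != predicate:
            continue
        if obj is not None and q.obj != obj:
            continue
        results.append(q)
    return results


# Hypothetical data: two sources disagree about the same name.
store = [
    Quad("ex:alice", "foaf:name", "Alice", "http://example.org/alice.rdf"),
    Quad("ex:alice", "foaf:name", "Bob", "http://example.net/crawled.rdf"),
]

# Query everything; each result carries its source in the 4th slot.
everything = match(store, predicate="foaf:name")

# Or leave out a source you do not want to hear from.
filtered = match(store, predicate="foaf:name",
                 exclude_graphs=["http://example.net/crawled.rdf"])
```

Nothing here is "believed" or "disbelieved": both statements are stored as given, and the caller decides per query whether to filter on the source or just report it.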

> If you get a quad you want to store, but not trust, you do a little
> rewrite of the names, so you won't accidentally query it when you're
> querying the stuff you trust.

I don't see how that helps. Now you have some data that you've mangled, which may or may not have been valid before you mangled it.
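
One possible reading of that objection, sketched in Python under loud assumptions: the rewrite is taken to be a simple prefix on the graph name, and `quarantine`, `QUARANTINE_PREFIX`, and the example data are hypothetical. The proposal quoted above does not say how the rewrite would actually work.

```python
from typing import List, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)

QUARANTINE_PREFIX = "urn:example:quarantine:"  # hypothetical naming scheme


def quarantine(quads: List[Quad]) -> List[Quad]:
    """Rewrite only the 4th slot, leaving the triples untouched."""
    return [(s, p, o, QUARANTINE_PREFIX + g) for (s, p, o, g) in quads]


# Hypothetical incoming data: the second quad describes the graph itself.
incoming = [
    ("ex:alice", "foaf:name", "Alice", "http://example.org/g1"),
    ("http://example.org/g1", "dc:creator", "ex:bob", "http://example.org/g1"),
]

renamed = quarantine(incoming)

# Graph names that exist in the store after the rewrite.
graph_names = {g for (_, _, _, g) in renamed}

# Graph names that the stored metadata still talks about.
described = {s for (s, p, _, _) in renamed if p == "dc:creator"}

# Metadata subjects that no longer name any stored graph.
dangling = described - graph_names
```

In this sketch `dangling` ends up holding the original graph name: the metadata still points at a name the store no longer uses, so the stored data is no longer what was published.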

If I had to handle that case I would use quints, but I'm not going there.

> In any case, people *are* going to be publishing quads.  What are you
> going to do about that?   And why can't you apply that technique to any
> situation where someone is publishing something that might be triples or
> might be quads?

They already are, and have been for some time, but it's very rare - I just ignore them. The vast majority are dataset dumps anyway, e.g. more copies of Wikipedia.

Now, you might suppose that that's going to change, but it's pure speculation.

- Steve

-- 
Steve Harris, CTO
Garlik, a part of Experian 
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 653331 VAT # 887 1335 93
Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
Received on Wednesday, 9 May 2012 20:03:03 UTC