Re: trusting quads from Sandro Hawke on 2012-05-10 (public-rdf-wg@w3.org from May 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Wed, 09 May 2012 21:30:01 -0400
To: Steve Harris <steve.harris@garlik.com>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-ID: <1336613401.2368.102.camel@waldron>
On Wed, 2012-05-09 at 13:02 -0700, Steve Harris wrote:
> On 9 May 2012, at 12:02, Sandro Hawke wrote:
> 
> > On Wed, 2012-05-09 at 11:26 -0700, Steve Harris wrote:
> >> 
> >> Right. The whole reason quads were implemented was to be able to track
> >> which *triples* appear in which documents (typically found on the web,
> >> but file: is good too). 
> > 
> > Speak for yourself, please, Steve.   I've seen several implementations
> > of quads that were used for other purposes and it's quite possible they
> > predated yours.  In general, I think the motivation for quads/datasets
> > is to work with a bunch of triples at the same time, in one system,
> > while still keeping them in distinct groupings, so they can have their
> > own metadata, source information, dependency tracking, etc.
> 
> I don't think so. When I started work on quads-based systems it was relatively uncommon (around 2000, maybe early 2001). The first time I saw quads / named graphs being *published* was many years later, and the first time I saw anything other than web-crawled sets of graphs / database dumps being published was many years after that.
> 
> <history-lesson>
> 
> The triplestore I designed for the AKT project in 2001 had its query language (originally OKBC, later RDQL) extended to allow queries over the 4th slot. It wasn't the first one, but it was probably close to it. The motivation was 100% to allow storage of multiple graphs, keeping them separate. RDF triplestore authors were a pretty small community at that point.
> 
> 3store was GPL'd in October 2002, by which time it was already quite mature, and had quads from the start. I don't remember when I started work on it, probably late 2001 / early 2002.
> 
> TriG dates from 2007, and NQuads from 2008. There was the idea of quoting in N3 much earlier, but I don't think I've ever seen that used in the wild. 
> 
> </history-lesson>
> 
> "keeping them in distinct groupings, so they can have their own metadata, source information, dependency tracking, etc." is exactly the issue here.

Does "exactly the issue here" mean you agree or disagree with my claim
that this is basically what quads are for?

> >> If you allow/encourage web documents to circumvent this, then you
> >> break that. 
> > 
> > I don't understand how you handle a triple at arms length, without
> > taking it as gospel, but you can't do that with a quad.
> 
> Because of the 4th slot. You say things like "without taking it as gospel" because your perspective is of some giant logic system. My perspective is of databases - I don't "believe" the things in my databases, it's all about the context. If you ask a user to enter their name, you don't "believe" the answer they give, you just store it. You can still query things you don't believe as long as you know the how / why / who says so. That's what the 4th slot was created for.

I believe I'm fluent in both the database and logic perspective (and a
few others).  I don't always manage to use the right language in the
right context, though.

> > If you get a triple you want to store, but not trust, you put it off in
> > a separate space (aka a named graph), where you won't accidentally query
> > it when you're querying the stuff you trust.
> 
> That might be what you do. I tag it with some metadata so that I can exclude it from queries if I want to, or, more commonly I can also return the "provenance" (in the loosest sense of the word) with the results.

As far as I know, you can't tag it with metadata unless it's in a
different space.  Triplestores, per se, don't do metadata.  It sounds
like you're doing the same thing as I would, here.
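
To make the "separate space" idea concrete, here is a minimal sketch, not
a description of any real triplestore.  The Dataset class, the graph
names and the "source" metadata key are all hypothetical; the point is
only that metadata hangs off the graph name (the 4th slot) and queries
can be restricted to the graphs you trust.

```python
from collections import defaultdict


class Dataset:
    """Toy quad store: triples grouped by graph name (hypothetical)."""

    def __init__(self):
        self.graphs = defaultdict(set)  # graph name -> set of triples
        self.meta = defaultdict(dict)   # graph name -> metadata about it

    def add(self, graph, triple, **metadata):
        # Metadata attaches to the graph, not to the individual triple.
        self.graphs[graph].add(triple)
        self.meta[graph].update(metadata)

    def match(self, predicate, graphs):
        """Yield (triple, graph) pairs, looking only in the given graphs."""
        for g in graphs:
            for t in self.graphs.get(g, ()):
                if t[1] == predicate:
                    yield t, g


ds = Dataset()
ds.add("urn:example:trusted", ("ex:alice", "foaf:name", "Alice"),
       source="local")
ds.add("urn:example:untrusted", ("ex:alice", "foaf:name", "Mallory"),
       source="http://example.org/crawl")

# Query only what is trusted; the untrusted triple is stored but not seen.
trusted = list(ds.match("foaf:name", ["urn:example:trusted"]))
```

Returning the graph name with each result is what lets you hand back the
"provenance" alongside the answer, or exclude a graph entirely.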

> > If you get a quad you want to store, but not trust, you do a little
> > rewrite of the names, so you won't accidentally query it when you're
> > querying the stuff you trust.
> 
> I don't see how that helps. Now you have some data that you've mangled, which may or may not have been valid before you mangled it.

It helps because the "mangling" -- an operation logically comparable to
character escaping, like one does to prevent injection attacks -- keeps
the data you fetched from interfering with your own data.
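
A rough sketch of the kind of rewrite I mean, under stated assumptions:
quads are plain (s, p, o, g) tuples, and QUARANTINE_PREFIX and
quarantine_quads are names invented for this illustration.  Each fetched
graph name is percent-encoded and placed under a local prefix, together
with the source it came from, so it cannot collide with a graph name you
use yourself.

```python
from urllib.parse import quote

# Hypothetical local namespace for graphs that came from outside.
QUARANTINE_PREFIX = "http://localhost/quarantine/"


def quarantine_quads(quads, source):
    """Rewrite the graph name of each fetched quad into a local namespace.

    The original graph name and the source URL are percent-encoded, much
    like escaping input, so the rewrite is reversible and the fetched
    quads cannot land in one of your own graphs.
    """
    for s, p, o, g in quads:
        safe_name = (QUARANTINE_PREFIX
                     + quote(source, safe="") + "/"
                     + quote(g, safe=""))
        yield (s, p, o, safe_name)


fetched = [("ex:alice", "foaf:name", "Mallory", "http://example.org/g")]
stored = list(quarantine_quads(fetched, "http://evil.example/data"))
```

The subject, predicate and object are left alone; only the 4th slot
changes.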

> If I had to handle that case I would use quints, but I'm not going there.
> 
> > In any case, people *are* going to be publishing quads.  What are you
> > going to do about that?   And why can't you apply that technique to any
> > situation where someone is publishing something that might be triples or
> > might be quads?
> 
> They already are, and have been for some time, but it's very rare - I just ignore them. The vast majority are dataset dumps anyway, e.g. more copies of wikipedia.
> 
> Now, you might suppose that that's going to change, but it's pure speculation.

You said you were *strongly* opposed to finding quads mixed in with
triples, that you needed to know at con-neg time whether there might
possibly be quads coming from this source.

Now you're saying you almost never find quads, and if you do, you ignore
them (and, I think, the whole resource that contains them).  And you're
not really expecting the ratio of sources-of-triples to sources-of-quads
to change in the future.

If you're going to ignore the quads and/or any data source that uses
quads, then why would it be such a problem for you if, hypothetically,
quads were allowed in Turtle?  The difference is that there would be
some noise (quads) in the file you'd have to skip over, or you'd have to
reject a source when you hit the first quad.   So, there's some
additional network bandwidth and parsing cost if people mix in quads,
perhaps comparable to them including comments and having the occasional
syntax error that invalidates the file.   In any case, the bandwidth and
parsing costs would be a small fraction of your total parsing and
bandwidth costs, because of the small number of sources-of-quads.
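
The two options above (skip the quads, or reject the source at the first
quad) could be sketched as follows.  This assumes a hypothetical parser
that has already turned the file into tuples of length 3 (triples) or 4
(quads); triples_only and QuadEncountered are illustrative names, not
part of any real library.

```python
class QuadEncountered(Exception):
    """Raised when a source contains a quad and we chose to reject it."""


def triples_only(statements, reject_source_on_quad=False):
    """Yield only the triples from a mixed stream of triples and quads.

    Anything that is not a 3-tuple is treated as a quad: it is either
    skipped as noise, or it aborts processing of the whole source.
    """
    for st in statements:
        if len(st) == 3:
            yield st
        elif reject_source_on_quad:
            raise QuadEncountered("source mixes quads with triples")
        # otherwise: skip the quad and keep reading


mixed = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:p", "ex:c", "ex:g1"),  # a quad mixed in with the triples
    ("ex:b", "ex:p", "ex:c"),
]
kept = list(triples_only(mixed))
```

Either way the extra cost is one length check per statement plus the
bytes of the quads themselves.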

(I'm not quite sure what to do about the multiple copies of dbpedia.
Maybe there are ways to recognize and skip over things like that.)    

This doesn't seem like a strong argument, so perhaps I'm missing some
important part of it.   (That's why I'm continuing the thread -- I don't
think the schedule is going to allow us to include quads in Turtle
anyway.  It might also affect whether TriG is an extension of Turtle or
disjoint from Turtle.)

   -- Sandro



> - Steve
> 
> -- 
> Steve Harris, CTO
> Garlik, a part of Experian 
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203  http://www.garlik.com/
> Registered in England and Wales 653331 VAT # 887 1335 93
> Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
> 
>
Received on Thursday, 10 May 2012 01:30:13 UTC