Re: trusting quads from Steve Harris on 2012-05-10 (public-rdf-wg@w3.org from May 2012)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 10 May 2012 07:11:53 -0700
To: Sandro Hawke <sandro@w3.org>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <77D51354-30B3-4F70-A312-A10DE4CD59CA@garlik.com>
On 9 May 2012, at 18:30, Sandro Hawke wrote:

> On Wed, 2012-05-09 at 13:02 -0700, Steve Harris wrote:
>> On 9 May 2012, at 12:02, Sandro Hawke wrote:
>> 
>>> On Wed, 2012-05-09 at 11:26 -0700, Steve Harris wrote:
>>>> 
>>>> Right. The whole reason quads were implemented was to be able to track
>>>> what *triples* appear in what documents (typically found on the web,
>>>> but file: is good too). 
>>> 
>>> Speak for yourself, please, Steve.   I've seen several implementations
>>> of quads that were used for other purposes and it's quite possible they
>>> predated yours.  In general, I think the motivation for quads/datasets
>>> is to work with a bunch of triples at the same time, in one system,
>>> while still keeping them in distinct groupings, so they can have their
>>> own metadata, source information, dependency tracking, etc.
>> 
>> I don't think so. When I started work on quads-based systems it was relatively uncommon (around 2000, maybe early 2001). The first time I saw quads / named graphs being *published* was many years later, and the first time I saw anything other than web-crawled sets of graphs / database dumps being published was many years after that.
>> 
>> <history-lesson>
>> 
>> The triplestore I designed for the AKT project in 2001 had its query language (originally OKBC, later RDQL) extended to allow queries over the 4th slot. It wasn't the first one, but it was probably close to it. The motivation was 100% to allow storage of multiple graphs, keeping them separate. RDF triplestore authors were a pretty small community at that point.
>> 
>> 3store was GPL'd in October 2002, by which time it was already quite mature, and had quads from the start. I don't remember when I started work on it, probably late 2001 / early 2002.
>> 
>> TriG dates from 2007, and NQuads from 2008. There was the idea of quoting in N3 much earlier, but I don't think I've ever seen that used in the wild. 
>> 
>> </history-lesson>
>> 
>> "keeping them in distinct groupings, so they can have their own metadata, source information, dependency tracking, etc." is exactly the issue here.
> 
> Does "exactly the issue here" mean you agree or disagree with my claim
> that this is basically what quads are for?

I agree, but it leads me to a different conclusion.

>>>> If you allow/encourage web documents to circumvent this, then you
>>>> break that. 
>>> 
>>> I don't understand how you handle a triple at arms length, without
>>> taking it as gospel, but you can't do that with a quad.
>> 
>> Because of the 4th slot. You say things like "without taking it as gospel" because your perspective is of some giant logic system. My perspective is of databases - I don't "believe" the things in my databases, it's all about the context. If you ask a user to enter their name, you don't "believe" the answer they give, you just store it. You can still query things you don't believe as long as you know the how / why / who says so. That's what the 4th slot was created for.
> 
> I believe I'm fluent in both the database and logic perspective (and a
> few others).  I don't always manage to use the right language in the
> right context, though.

I was just trying to highlight the difference in approach. From my perspective there's no difference between a triple in the "default graph" and one in a named graph, other than the metadata: triples in the default graph can't have any.

That said, I've never been a believer in the explicit "default graph", I think it's a horrible idea.

>>> If you get a triple you want to store, but not trust, you put it off in
>>> a separate space (aka a named graph), where you won't accidentally query
>>> it when you're querying the stuff you trust.
>> 
>> That might be what you do. I tag it with some metadata so that I can exclude it from queries if I want to, or, more commonly I can also return the "provenance" (in the loosest sense of the word) with the results.
> 
> As far as I know, you can't tag it with metadata unless it's in a
> different space.  Triplestores, per se, don't do metadata.  It sounds
> like you're doing the same thing as I would, here.

Right, triplestores can't hold metadata, but quad stores can. You just have a different graph that holds the metadata about X, e.g. if the data you pulled from <X> is in graph <X>, you can put the metadata in graph <X#meta>, with ideally some triple in <X#meta> pointing back to <X> to make querying it easier.
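
A minimal sketch of that pattern, assuming a toy in-memory quad store (a set of tuples) rather than any real store's API. The helper names and the ex:describes, ex:fetchedAt and ex:trusted predicates are made up for illustration:

```python
# Hypothetical in-memory quad store: a set of (graph, subject, predicate, object).
quads = set()

def add(graph, s, p, o):
    quads.add((graph, s, p, o))

def match(graph=None, s=None, p=None, o=None):
    """Return the quads matching the pattern; None acts as a wildcard."""
    pattern = (graph, s, p, o)
    return [q for q in quads
            if all(want is None or want == got for want, got in zip(pattern, q))]

X = "http://example.org/X"
META = X + "#meta"

# Data pulled from <X> is stored in graph <X>.
add(X, "http://example.org/alice", "foaf:name", "Alice")

# Metadata about <X> lives in <X#meta>, with one triple pointing back to <X>.
add(META, META, "ex:describes", X)
add(META, X, "ex:fetchedAt", "2012-05-10T14:11:53Z")
add(META, X, "ex:trusted", "false")

def untrusted_graphs():
    """Graphs that some metadata graph has flagged as untrusted."""
    return {s for (_, s, _, _) in match(p="ex:trusted", o="false")}

def names(include_untrusted=True):
    """Return (name, source graph) pairs, optionally skipping untrusted graphs."""
    skip = set() if include_untrusted else untrusted_graphs()
    return [(o, g) for (g, s, p, o) in match(p="foaf:name") if g not in skip]
```

Here names() returns every name together with the graph it came from (the loose "provenance"), while names(include_untrusted=False) excludes anything the metadata marks as untrusted. The data itself is never rewritten.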

>>> If you get a quad you want to store, but not trust, you do a little
>>> rewrite of the names, so you won't accidentally query it when you're
>>> querying the stuff you trust.
>> 
>> I don't see how that helps. Now you have some data that you've mangled, which may or may not have been valid before you mangled it.
> 
> It helps because the "mangling" -- an operation logically comparable to
> character escaping, like one does to prevent injection attacks -- keeps
> the data you fetched from interfering with your own data.

I think I have an issue with the idea of "your own data": (potentially) everything needs metadata. What if you have a rogue process that was spitting out incorrect data for some period? You want to be able to ignore or delete that data. I agree that hypothetically you might have some data which conforms to some provably ideal state of correctness, but I've never encountered it ;-)

I don't see how you can escape RDF triples (or especially quads) without changing the data, from a SPARQL perspective at least. And what about the impact on data management operations, e.g. deleting stale data?

If I pull some data like

<x> { <x> a myapp:Page }
<y> { <y> a myapp:Page }

and try to store that in a quad store, I don't see how I can escape it without making it useless.
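
A rough sketch of the problem with the two quads above, assuming the "escaping" is a simple prefix rewrite of names. That scheme, the "untrusted:" prefix and the function names are assumptions for illustration, not something specified in this thread:

```python
# Fetched quads as (graph, subject, predicate, object); each graph describes itself.
fetched = [
    ("x", "x", "rdf:type", "myapp:Page"),
    ("y", "y", "rdf:type", "myapp:Page"),
]

PREFIX = "untrusted:"  # hypothetical escaping scheme

def escape_graph_names(quads):
    """Rewrite only the 4th slot (the graph name)."""
    return [(PREFIX + g, s, p, o) for g, s, p, o in quads]

def escape_everywhere(quads):
    """Rewrite the graph names wherever they occur, including inside triples."""
    graphs = {g for g, _, _, _ in quads}

    def fix(term):
        return PREFIX + term if term in graphs else term

    return [tuple(fix(t) for t in q) for q in quads]

def self_described_pages(quads):
    """Pages whose description lives in the graph named after them."""
    return sorted(s for g, s, p, o in quads
                  if g == s and p == "rdf:type" and o == "myapp:Page")

def graphs_about(quads, subject):
    """Graphs containing at least one triple with the given subject."""
    return sorted({g for g, s, _, _ in quads if s == subject})
```

Escaping only the graph names breaks the link between each graph and the page it describes, so self_described_pages finds nothing. Escaping every occurrence keeps that link, but a query for <x> no longer matches anything. Either way the stored data is not the data that was published.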

>> If I had to handle that case I would use quints, but I'm not going there.
>> 
>>> In any case, people *are* going to be publishing quads.  What are you
>>> going to do about that?   And why can't you apply that technique to any
>>> situation where someone is publishing something that might be triples or
>>> might be quads?
>> 
>> They already are, and have been for some time, but it's very rare - I just ignore them. The vast majority are dataset dumps anyway, e.g. more copies of wikipedia.
>> 
>> Now, you might suppose that that's going to change, but it's pure speculation.
> 
> You said you were *strongly* opposed to finding quads mixed in with
> triples, that you needed to know at con-neg time whether there might
> possibly be quads coming from this source.

Correct.

> Now you're saying you almost never find quads, and if you do, you ignore
> them (and, I think, the whole resource that contains them).  And you're
> not really expecting the ratio of sources-of-triples to sources-of-quads
> to change in the future.

I don't think I said I'm not expecting it to change, I just said I don't know that it will.

> If you're going to ignore the quads and/or any data source that uses
> quads, then why would it be such a problem for you if, hypothetically,
> quads were allowed in Turtle?    The difference is that there would be

Because I wouldn't know until I ran across them partway through a parse.

> some noise (quads) in the file you'd have to skip over, or you'd have to
> reject a source when you hit the first quad.   So, there's some
> additional network bandwidth and parsing cost if people mix in quads,
> perhaps comparable to them including comments and having the occasional
> syntax error that invalidates the file.   In any case, the bandwidth and
> parsing costs would be a small fraction of your total parsing and
> bandwidth costs, because of the small number of sources-of-quads.

Yeah, I'd have to roll back the entire import, and that's a very expensive operation.

IF it becomes common to include a few random quads in e.g. FOAF files, then it wouldn't be the end of the world, like you say. But if people started publishing database dumps in Turtle, then that's an issue. I could potentially have parsed gigabytes of triple data before hitting the first quad, and then what? Do I back out the triples, leave them in the named graph and abort, or squash all the triples in the named graphs into the graph for the URI I'm dereferencing? All of those options seem bad.
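
To make the first option concrete, here is a sketch of a streaming import that backs out its own work when it meets a quad. It assumes a parser that yields 3-tuples for triples and 4-tuples for quads, and a store that is a plain set. Both are simplifications for illustration, and import_as_triples and UnexpectedQuad are made-up names:

```python
class UnexpectedQuad(Exception):
    """Raised when a quad turns up in a stream that was expected to be triples."""

def import_as_triples(statements, store, graph):
    """Add each triple to `store` under `graph`; undo everything on the first quad.

    `statements` yields (s, p, o) or (g, s, p, o) tuples; `store` is a set of
    (graph, s, p, o) tuples. Returns the number of quads newly added.
    """
    added = []  # only what this import inserted, so rollback leaves older data alone
    for st in statements:
        if len(st) == 4:
            # Rollback cost grows with everything parsed so far.
            for quad in added:
                store.discard(quad)
            raise UnexpectedQuad(
                f"quad found after {len(added)} triples; import rolled back")
        quad = (graph, *st)
        if quad not in store:
            store.add(quad)
            added.append(quad)
    return len(added)
```

The undo list has to be kept for the whole import, and the rollback touches every statement already written. For a multi-gigabyte dump that is the expensive part.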

[datapoint] We managed to save thousands of dollars a month in bandwidth (not to mention power) just by identifying flash videos that had been incorrectly marked as text/html (or whatever) in the Content-Type: header, and rejecting them early. People screw this stuff up, a lot. The semantic web won't change that.

> (I'm not quite sure what to do about the multiple copies of dbpedia.
> Maybe there are ways to recognize and skip over things like that.)    

Well, the easiest way to recognise them is to use a different mime-type, just like we do now. Of course, sometimes people use the wrong mime-type, so it would be better if the syntaxes were disjoint too. Just requiring the triples in the default graph to be flagged would solve that. DEFAULT { … } or somesuch.

> This doesn't seem like a strong argument, so perhaps I'm missing some
> important part of it.   (That's why I'm continuing the thread -- I don't
> think the schedule is going to allow us to include quads in Turtle
> anyway.  It might also affect whether TriG is an extension of Turtle or
> disjoint from Turtle.)

It should be disjoint, but I think we've covered this already.

The behaviour (in 4store, Jena, Sesame(?) etc) when reading TriG vs. Turtle is fundamentally different.

When reading Turtle, the triples go into a graph named after the resource you're dereferencing (or some other graph indicated by the user, which may be the default graph). When parsing TriG, the triples go into the default graph, and the quads get written in as-is.
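
A simplified sketch of the two loading behaviours, not the actual code of 4store, Jena or Sesame. It assumes already-parsed statements and a set-based store, and the DEFAULT_GRAPH marker and function names are hypothetical:

```python
DEFAULT_GRAPH = "urn:x-sketch:default"  # hypothetical marker for the default graph

def load_turtle(triples, source_uri, store, graph=None):
    """Turtle: every triple goes into one graph, named after the source
    unless the user indicates another graph (possibly the default one)."""
    target = source_uri if graph is None else graph
    for s, p, o in triples:
        store.add((target, s, p, o))

def load_trig(statements, store):
    """TriG: statements are (g, s, p, o), with g=None for triples outside any
    graph block. Those go to the default graph; quads are stored as-is."""
    for g, s, p, o in statements:
        store.add((DEFAULT_GRAPH if g is None else g, s, p, o))

store = set()
load_turtle([("ex:a", "ex:p", "ex:b")], "http://example.org/doc.ttl", store)
load_trig([(None, "ex:c", "ex:p", "ex:d"),
           ("http://example.org/g", "ex:e", "ex:p", "ex:f")], store)
```

The same bare triple ends up in a different graph depending on which syntax the loader believes it is reading, which is why the loader needs to know up front which of the two it has been handed.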

This isn't some accident of history, it's evolved behaviour based on user expectations.

- Steve

-- 
Steve Harris, CTO
Garlik, a part of Experian 
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 653331 VAT # 887 1335 93
Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
Received on Thursday, 10 May 2012 14:12:30 UTC