Re: Blank Node Identifiers and RDF Dataset Normalization from Steve Harris on 2013-02-26 (public-linked-json@w3.org from February 2013)

From: Steve Harris <steve.harris@garlik.com>
Date: Tue, 26 Feb 2013 11:10:55 +0000
To: Pat Hayes <phayes@ihmc.us>
Cc: Markus Lanthaler <markus.lanthaler@gmx.net>, "'William Waites'" <ww@styx.org>, <msporny@digitalbazaar.com>, <public-rdf-wg@w3.org>, <public-linked-json@w3.org>
Message-Id: <480F9AC7-F242-4365-9F03-9B30B7B42C76@garlik.com>
[ TL;DR Pat has some very good points, but I think they're only real issues if you pull in RDF Datasets from unknown and untrusted sources, and combine them, which I think is a terrible idea regardless ] 

On 2013-02-25, at 23:27, Pat Hayes <phayes@ihmc.us> wrote:
> 
> On Feb 25, 2013, at 9:45 AM, Steve Harris wrote:
> 
>> On 2013-02-25, at 13:00, Markus Lanthaler <markus.lanthaler@gmx.net> wrote:
>> 
>>>> For example:
>>>> 
>>>> SELECT * WHERE {
>>>> ?g dc:date ?d .
>>>> GRAPH ?g { ?x a foaf:Person }
>>>> }
>>> 
>>> Given that it has been decided that graph labels do *not* denote the graph,
>> 
>> I believe it would be more correct to say that graph labels do not HAVE to denote the graph; they're allowed to if you want them to.
> 
> True, but we have no way to convey such a "want to" in RDF syntax. So whatever it is that the writer wanted, the reader has no way to know that. If the Web were telepathic, we would not need information transmission standards at all, as you could mind-project your desired meaning of all your byte streams. In the real world, however, we usually have to rely on specifications to provide us a clue as to how to interpret the things we read. According to our current specifications, when you read some RDF in a dataset which uses a URI which is also used as a graph label, you have no way to know whether or not the first use of the IRI is supposed to be related in meaning to the second use. 

Agreed. Someone (Sandro maybe?) had a suggestion for rdf:type-ing the graph labels to indicate their relationship to the graph, which seems like it could be useful. However, the situation in early 2013 is that the vast majority of the datasets published online are dumps of systems with particular semantics - until we've had more experience of dealing with merging multiple datasets we don't know what issues will come up. Privately, I suspect (and hope) this will never become common - I have absolutely no desire to design a quint store.

We have to be a little careful with terminology here - there's no issue with RDF Graphs, only with RDF Datasets, which aren't even a defined thing yet (until we get the new docs to PR).

>> Regardless, the example is valid whatever graph labelling semantics are being used - within some system with a known relationship between graph labels and metadata.
> 
> But the entire point of RDF, why it was invented in the first place, was to allow information to be conveyed across the Web and used at the point of reading, without having to know any conventions in use at its point of creation. If we have an RDF convention that depends on the RDF being used "within some system", then we are mis-using RDF. We have created a design that cannot be used in RDF which is being used for its primary purpose, and in so doing, have destroyed any possibility of having a coherent semantics for the basic SPARQL construct. This is an epic failure, especially when we were chartered to provide a semantics for datasets. 

RDF yes, but I didn't imply that this was a collection of datasets, just a collection of graphs. IMHO SPARQL can't safely deal with merged/combined datasets until we have quint stores, and some extra layer of syntax like

DATASET ?d {
   GRAPH ?g {
      ?s ?p ?o
   }
}

Colour me unenthusiastic.

Will we then need a semantics for collections of Datasets? That's a non-terminating process.

That assumes we want to publish, and combine/merge multiple RDF Datasets - I think that's a terrible idea. If you don't do that, then there's no issue.
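
To make the quint idea above concrete, here is a minimal sketch in Python. It is not a proposal for any real store: the Quad and Quint types, the dataset names "trusted" and "untrusted", and the ex: terms are all made up for illustration. It only shows that a plain union of quads loses track of who said what, while a fifth "dataset" column keeps it.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    graph: str
    s: str
    p: str
    o: str

class Quint(NamedTuple):
    dataset: str  # hypothetical fifth element: which dataset the quad arrived in
    graph: str
    s: str
    p: str
    o: str

G = "http://example.org/g"  # the same graph label, used by two publishers

trusted = {Quad(G, "ex:Paris", "rdf:type", "ex:City")}
untrusted = {Quad(G, "ex:Paris", "rdf:type", "ex:Person")}

# Quad store: merging is a set union, so both claims land in one graph
# and nothing records which dataset each came from.
merged_quads = trusted | untrusted

# Quint store: every quad is tagged with the dataset it arrived in.
merged_quints = {Quint("trusted", *q) for q in trusted} | {
    Quint("untrusted", *q) for q in untrusted
}

def graph_triples(quints, dataset, graph):
    """Rough analogue of DATASET ?d { GRAPH ?g { ?s ?p ?o } } with ?d and ?g bound."""
    return {(q.s, q.p, q.o) for q in quints
            if q.dataset == dataset and q.graph == graph}

print(len([q for q in merged_quads if q.graph == G]))
print(graph_triples(merged_quints, "trusted", G))
```

The cost is the extra column and the extra layer of query syntax, which is the part I'm unenthusiastic about.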

>> If the graph label refers to the document which was parsed, and the metadata refers to the parsing (which is a very common situation), then the example is equally valid.
> 
> I have no problem with that. But what if it refers to a person or a time, and not to the graph/graph-source/g-box/document at all? 

I'm not a fan of that, but people are doing it, and it appears to work for them.

FWIW, we use <uri-of-source>#<utc-datetime>.

I don't know if that's legit, or common, but it's been effective.
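
As a rough sketch of that naming convention (Python, standard library only): the function name graph_label and the exact timestamp format are assumptions, since the pattern above only says "utc-datetime". It also assumes the source URI has no fragment of its own and that the datetime passed in is timezone-aware.

```python
from datetime import datetime, timedelta, timezone

def graph_label(source_uri: str, parsed_at: datetime) -> str:
    """Build a graph label of the form <uri-of-source>#<utc-datetime>."""
    # Normalise to UTC so labels minted on different hosts are comparable.
    # The ISO-8601-style format below is an assumption, not part of the convention.
    stamp = parsed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{source_uri}#{stamp}"

label = graph_label(
    "http://plugin.org.uk/swh.xrdf",
    datetime(2013, 2, 26, 11, 10, 55, tzinfo=timezone.utc),
)
print(label)
```

Each parse of the same source then gets its own graph, and the source URI itself stays free to denote the document.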

>> I think you may be attaching too much importance to the idea of denoting.
> 
> Denoting is simply a synonym for "naming" or "referring to". It's not an exotic idea. If you are using names (IRIs) in RDF, you are using them to denote. 

Exactly. I think Markus was attaching some much deeper meaning to it.

>>> I find such example especially confusing. You use the same variable (?g) in
>>> the subject position and as a graph label knowing that they do not refer to
>>> the same. Semantically, the two have nothing in common at all. ?g could
>>> denote a person, a document, an event, whatever. The graph ?g is a
>>> completely different "thing". Effectively you could say they use the same
>>> IRI by coincidence. I think it is these kinds of examples that lead to the
>>> current situation. Contrast that with a query like the following, assuming
>>> the IRI does denote the graph
>>> 
>>> SELECT * WHERE {
>>> ?someone_thing :stated ?g .
>>> GRAPH ?g { ?x a foaf:Person }
>>> }
>>> 
>>> 
>>> I think at the very least, the effects of the decision that graph labels do
>>> not denote the graph should be made clearer in RDF Concepts. I don't know
>>> how but maybe an example helps to illustrate the problem. That information
>>> also shouldn't be put in a non-normative note IMHO.
>> 
>> Well, first we'd have to find a problem with it…
> 
> Imagine a scenario where information from a number of sources is being integrated into one datastore, all about authorship of RDF graphs. The goal is to have a dataset with a default graph recording authorship information using triples
> 
> :personIRI :authorOf :graphLabel .
> 
> where :graphLabel identifies a graph in the dataset using the graph label convention. But suppose one of the sources being mashed has cleverly taken advantage of the graph-label-denoting-something-else freedom to simply label each graph with an IRI denoting its author. Then we will get triples like
> 
> :personIRI :authorOf :personIRI .
> 
> which can only be interpreted as something (be it a person or graph) authoring itself. Which is nonsense, and probably will cause an inconsistency with some data model or ontology defining :authorOf. 

Yes, but this can only happen if you merge multiple datasets, right? Otherwise no-one gets to write anything into the "default graph" against the will of the dataset maintainer.
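
Pat's scenario can be sketched in a few lines of Python to show where the bad triple comes from. Everything here is hypothetical (the :alice, :bob and :graphA names, and the authorship_triples helper); the point is only that the self-authoring triple appears at the moment an integrator folds a second dataset's labels into its own default graph.

```python
AUTHOR_OF = ":authorOf"

# Per source: graph label -> author of that graph (illustrative data only).
# Source A labels its graph with an IRI for the graph itself.
source_a = {":graphA": ":alice"}
# Source B labels each graph with the IRI of its author.
source_b = {":bob": ":bob"}

def authorship_triples(sources):
    """Yield (author, :authorOf, graph_label) for every named graph in every source."""
    for author_by_label in sources:
        for label, author in author_by_label.items():
            yield (author, AUTHOR_OF, label)

# The integrator's default graph after combining both sources.
default_graph = set(authorship_triples([source_a, source_b]))

# Triples that claim something authored itself.
self_authored = {t for t in default_graph if t[0] == t[2]}
print(self_authored)
```

With source_a alone, self_authored is empty; it only becomes non-empty once source_b is combined in.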

This is related to the reason why I find the idea of having a single format that can express both Graphs and Datasets so scary - you can bring this kind of situation on yourself without any prior warning. Very bad idea.

Also, this is (I think most people are agreed) bad practice. I could equally well publish a dataset which misuses someone else's graph URIs. That would be at least as destructive, and equally just bad practice; there's no technical measure to protect against it (without quint stores, see above).

<http://dbpedia.org> {
   <http://dbpedia.org/resource/Paris> a <http://dbpedia.org/ontology/Person> .
}

[ there's also the issue about whether the real dbpedia triples should be in a graph called <http://dbpedia.org> - isn't that a website? they are according to http://dbpedia.org/sparql … anyway ]

> You can work a similar problem with graph labels referring to just about anything other than the graph (or graph document).

Right, but there are many other ways to do Bad Things™, as soon as you try to combine datasets. If you live in a world of just taking in graphs from the outside world, then you control the "default graph", and can ensure that nothing scary happens with the graph labels.

Merely enforcing (somehow, magically?) that graph labels denote graphs won't save you.

>> I suspect a world where graph labels always denote graphs would be much more confusing and counter-intuitive to the average developer.
> 
> Why would it be confusing and counter-intuitive for something called a "graph label" to be the name of the thing it is labelling? Isn't it normal, even for the average developer, to think of identifiers as identifying something, and to feel a slight frisson of concern when they are obliged to use the same identifier to mean two different things at the same time? 

Well, it would be against the rules to, e.g., store the parsed triples from http://plugin.org.uk/swh.xrdf in a graph called <http://plugin.org.uk/swh.xrdf> - this is common, and somewhat natural.

   <http://plugin.org.uk/swh.xrdf> a foaf:PersonalProfileDocument .

I don't think it can be both a PPD, and an RDF Graph, can it?

- Steve

-- 
Steve Harris
Experian
+44 20 3042 4132
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL
Received on Tuesday, 26 February 2013 11:11:26 UTC