Re: bNodes as graph identifiers from Steve Harris on 2013-06-03 (public-rdf-comments@w3.org from June 2013)

From: Steve Harris <steve.harris@garlik.com>
Date: Mon, 3 Jun 2013 10:52:33 +0100
To: Pat Hayes <phayes@ihmc.us>
Cc: RDF WG <public-rdf-wg@w3.org>, "public-rdf-comments@w3.org Comments" <public-rdf-comments@w3.org>
Message-Id: <BE9F5F9F-EF71-4CE3-B5AD-AE864394A375@garlik.com>
On 2013-06-01, at 03:46, Pat Hayes <phayes@ihmc.us> wrote:
> 
> On May 29, 2013, at 12:47 PM, Steve Harris wrote:
> 
>> [ as a side note I find it bizarre that I'm having to advocate NOT changing a 14
>> year old, industrially deployed spec, at the 11th hour of the standardisation
>> process, to add a feature that's used by a tiny minority of deployed systems -
>> if anything was to strike an outsider as peculiar about this WG's process, it
>> would surely be this feature ]
> 
> This topic - using bnodes as identifiers in datasets - has been under active and energetic discussion within the WG for some time now, and Sandro apparently felt that he has new information which is relevant to the decision. This matter is quite within the (highly constrained) WG charter. So I don't see anything bizarre about this at all. 
> 
>> 
>> TL;DR: don't mess.
>> 
>> We know that bNode graph identifiers are possible (I've designed a system myself
>> that had them) and that there are usecases that are addressed by it, but I've
>> not heard anything yet that can't already be addressed using RDF/SPARQL as it
>> stands.
> 
> But that is not the point. There are people who feel strongly that their use cases are best handled using blank node identifiers. The fact that you might have decided to code their software differently is not an argument that they should be forced to adopt your implementation strategies. 

I agree 100%. More to the point, their decision on how to code their software shouldn't force me to do anything.

There are two cases:

1) One/two organisations have to step outside the spec in order to use their chosen implementation technique. Or, use one that's inside the spec that they don't particularly like.

2) Every organisation that implements RDF has to change their implementation, if they want to stay compliant, whether they intend to use the extension or not.

Clearly it's a matter of opinion which is preferable.

>> It is the opinion of some people that bNodes as graph identifiers
>> address it better, in some way, but that's another matter.
> 
> And that is the matter that is relevant to our work as a standard-writing WG.
> 
>> There are however some costs to extending RDF (datasets) to require that bNodes
>> be usable as graph identifiers:
>> 
>> * We (Experian) have invested millions of dollars in our RDF engine - it's very
>> tightly optimised to the current specs, and opening up the space of graph
>> identifiers from a single class (URIs) to two classes (URIs and bNodes) would
>> have a significant engineering, and storage cost. Put simply, we wouldn't do it,
>> and would just step away from later RDF specs, becoming an RDF/SPARQL flavoured
>> graph database.
> 
> OK, that is a decision Experian would have to make. I don't see this as centrally relevant to the WG decision process.

It's not - I was asked to write a mail explaining why I didn't support using bNodes to identify graphs.

>> * RDF is already too complex for people coming into it to learn easily. Every
>> time we add a new feature to the language we increase the barrier to entry.
> 
> First, this does not change RDF. Second, allowing bnodes as graph labels in datasets is not a "new feature", it is simply removing a restriction. Arguably, it simplifies dataset syntax.

If RDF tutorials or specs don't mention this at all, then we have no need of this discussion.

If they do, then it changes RDF.

Saying it doesn't change RDF because the change in the logic is small/non-existent is missing the point entirely.

>> * There's no practical way to refer to long lived bNodes in SPARQL (without
>> enforced skolemisation), people will import datasets with bNode graphs, and then
>> realise they can't isolate their data (presumably after posting on stack
>> overflow or similar :) ).
>> The following will not retrieve your original data* and this will just promote
>> more confusion:
>> 
>> 	Data:
>> 	_:abc { :s :p :o }
>> 
>> 	Query:
>> 	SELECT * WHERE { GRAPH _:abc { :s ?p ?o } }
> 
> Of course it won't. Bnode ID scopes are limited to the document. But surely people are going to understand the idea of a local name, aren't they? This is not rocket science for anyone with even a slight acquaintance with formal notations. 
> 
>> 
>> 	you could possibly do something like:
>> 
>> 	SELECT * WHERE { GRAPH ?g { :s ?p ?o } FILTER(STR(?g) = "abc") }
>> 
>> 	That's pretty inconvenient, in many ways, and isn't required to work by SPARQL 1.1.
>> 	It is only possible at all in systems that preserve bNode labels, which is not
>> required.
> 
> Right. Expecting to be able to access bnode IDs from outside their scope is something we should strongly discourage. Even if it works, it is Bad Practice. 

Right, but that *severely* limits the utility of this feature.

I'm willing to bet that if you presented this view to the JSON-LD people (who requested anonymous graphs in the first place) they wouldn't be keen on this as a solution to their representational issue.
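
To make the label problem above concrete, here is a minimal Python sketch. It assumes a toy quad store (the Store class and its methods are hypothetical, not any real engine's API) that, as the spec permits, does not preserve bNode labels on import. It models only the label-scoping half of the issue; the wildcard lookup stands in for a query that leaves the graph as a variable.

```python
import itertools


class Store:
    """Toy quad store that does NOT preserve blank node labels (hypothetical)."""

    def __init__(self):
        self._fresh = itertools.count()
        self.quads = set()  # each quad is (graph, subject, predicate, object)

    def load(self, quads):
        # Blank node labels are scoped to the document being loaded, so every
        # load gets its own label -> internal identifier mapping.
        mapping = {}

        def rename(term):
            if term.startswith("_:"):
                if term not in mapping:
                    mapping[term] = "_:b%d" % next(self._fresh)
                return mapping[term]
            return term

        for quad in quads:
            self.quads.add(tuple(rename(t) for t in quad))

    def match(self, graph=None, s=None, p=None, o=None):
        # None acts as a wildcard, roughly like a variable in a query pattern.
        pattern = (graph, s, p, o)
        return [q for q in sorted(self.quads)
                if all(want is None or want == got
                       for want, got in zip(pattern, q))]


store = Store()
store.load([("_:abc", ":s", ":p", ":o")])

by_label = store.match(graph="_:abc", s=":s")  # uses the document's own label
by_wildcard = store.match(s=":s")              # graph left unconstrained
```

In this sketch the lookup by the original label finds nothing, while the unconstrained lookup returns the triple under whatever internal identifier the store assigned, which is exactly why the data can't be isolated by name after import.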

> But consider the case where a dataset uses the default graph to hold metadata about the named graphs, and what you want to see is something extracted from the graph whose metadata says it was written on a certain date. You really don't care what the name of the graph is. It can be a bnode, and nobody will care at all. 

It's a possible solution to that usecase, but I don't feel it's a very good one. 

In that situation (bar we don't use the default graph to hold metadata) we need stable identifiers for our graphs.

>> * Confusion with bNodes-in-graphs, and bNodes-as-graph-identifiers - the
>> discussion seems to assume that they're separate kinds of thing, maybe with
>> identifier bNodes not being existential variables?
> 
> No. They can be exactly the same kind of thing, and their semantics is just the same as it is in RDF, with one extra condition, which is that when a bnode _:X is used as a graph label, that is just like also asserting the equation _:X = <the named graph>. Which is surely exactly what one would expect the graph labelling to mean: it is the URI case that is exceptional (and weird). 
> 
> This would have one consequence, a modification to the notion of 'instance'. Once used as a graph label, a bnode is no longer *simply* an existential assertion *and nothing else*; it has been 'used up' and its referent is fixed to be the graph. So it can't be substituted by any other bnode or IRI. So call this a 'fixed' bnode, and exclude fixed bnodes from the definition of instance. Then everything else works just as at present. (It is fine to swap fixed bnodes for other bnodes as long as you also change the label to match, of course.) 
> 
>> Whichever way it goes the
>> relationship between bNodes-in-graphs, and bNodes-identifying-graphs is going to
>> be complex.
> 
> It is trivially simple. They are the same thing, but when it's identifying a graph, it's no longer available to be instantiated in some other way. 
> 
>> 
>> * Of all the extensions that are implemented by a small number of systems as an
>> extension, this seems like an odd one to pick. IMHO there are far more serious
>> problems with RDF. There is a cost (to this group, and the wider community) of
>> any changes, so let's pick our battles wisely.
> 
> Well, we have chosen defeat in a number of battles already. We don't allow literal subjects or bnode properties, both of them semantically transparent and expressively useful, for example. 

Those were (explicitly in one case) out of scope for this WG.

>> * There's very little implementation experience - compared to the other things
>> we're standardising: URI quads, bNode skolemisation, Turtle, NQuads. It's not
>> clear how far the existential variable-ness should extend - do we sanction graph
>> leaning?
> 
> Leaning is already a problem when graphs share bnodes, which they can do in a dataset. (I need to add a comment in Semantics about that, thanks for reminding me.) Basically, graph leaning only makes sense when you take *all* the triples that might share the bnode into account. 
> 
>> Do URI-identified graphs infer identical graphs identified by bNodes?
> 
> No. 
> 
>> If not, why not?
> 
> Simple answer, because datasets don't have truth conditions, so inference is meaningless. Longer answer, because the equality condition does not hold when URIs are used as graph labels. But note, this is a condition on *datasets*, not graphs.
> 
>> What do bNodes with a given label, in graphs identified by a
>> bNode with a different label refer to, etc.
> 
> ? I don't understand what you are asking here. The label of the graph does not affect the truth of triples in the graph (unless they themselves use the label bnode, of course.)
> 
>> 
>> 	_:abc {
>> 	}
>> 	_:def {
>> 	}
>> 
>> One graph, or two, or undefined?
> 
> Two named graphs, both consisting of the empty graph with a label, would be my answer. But this is exactly similar to what you would get if you used two URIs as labels, so the answer should parallel that. 
> 
>> I don't think we know the right answer yet.
>> So, in summary, I think the cost is high, and the benefit is vanishingly small.
> 
> But the benefit is not vanishingly small. We have a standing request from another WG for this feature, which alone makes it quite a lot more significant than "vanishingly small".
> 
>> Nothing stops people that feel they really need it adding them to RDF systems,
>> as they have in the past. One counter argument is that JSON-LD will do it
>> anyway, but that's fine - if it is widely used, it can be adopted into RDF 1.2,
>> with plenty of implementation experience. In the meantime JSON-LD serialisers
>> can skolemise when transforming JSON-LD into RDF - there's other places where
>> the transform is lossy anyway, as far as I understand it.
>> 
>> - Steve
>> 
>> * this was possibly an error in the SPARQL 1.0 spec, but sadly the bNodes as
>> variables feature is quite widely used
> 
> How can you reconcile the fact that this is widely used, with the claim that to allow it is a "vanishingly small" advantage? I honestly cannot understand how you can rationally believe this combination. 

It's not widely used - did I say that? If so it was an error.

There were/are a small number of systems that implemented it. That's not the same as using it.
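
For reference, the skolemisation step suggested above for JSON-LD serialisers could look roughly like the following Python sketch. It assumes quads are plain tuples of strings, uses the /.well-known/genid/ IRI form, and only rewrites graph labels; the function name and the example.org base are hypothetical. A real converter would also need to apply the same mapping wherever the label bNode appears inside triples.

```python
import uuid


def skolemise_graph_labels(quads, base="http://example.org"):
    """Replace blank node graph labels with freshly minted IRIs.

    Returns the rewritten quads plus the label -> IRI mapping, so the same
    mapping can be reused elsewhere in the same document.
    """
    minted = {}
    out = []
    for graph, s, p, o in quads:
        if graph.startswith("_:"):
            if graph not in minted:
                # One new IRI per distinct blank node label in this document.
                minted[graph] = "%s/.well-known/genid/%s" % (base, uuid.uuid4().hex)
            graph = minted[graph]
        out.append((graph, s, p, o))
    return out, minted


quads = [
    ("_:g1", ":s", ":p", ":o"),
    ("_:g1", ":s", ":p", ":o2"),
    ("_:g2", ":s", ":p", ":o"),
    ("http://example.org/named", ":s", ":p", ":o"),
]
skolemised, minted = skolemise_graph_labels(quads)
```

The result uses only URI graph identifiers, so it fits the dataset model as it stands, at the cost of losing the fact that the labels were originally anonymous.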

- Steve

-- 
Steve Harris
Experian
+44 20 3042 4132
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL
Received on Monday, 3 June 2013 09:53:07 UTC