Re: bNodes as graph identifiers (ISSUE-131) from Steve Harris on 2013-06-03 (public-rdf-comments@w3.org from June 2013)

From: Steve Harris <steve.harris@garlik.com>
Date: Mon, 3 Jun 2013 10:40:17 +0100
To: Sandro Hawke <sandro@w3.org>
Cc: RDF WG <public-rdf-wg@w3.org>, "public-rdf-comments@w3.org Comments" <public-rdf-comments@w3.org>
Message-Id: <B0C53863-ABB3-4919-83C6-79058569D73C@garlik.com>
On 2013-05-31, at 17:00, Sandro Hawke <sandro@w3.org> wrote:

> On 05/29/2013 01:47 PM, Steve Harris wrote:
>> [ as a side note I find it bizarre that I'm having to advocate NOT changing a 14
>> year old, industrially deployed spec, at the 11th hour of the standardisation
>> process, to add a feature that's used by a tiny minority of deployed systems -
>> if anything was to strike an outsider as peculiar about this WG's process, it
>> would surely be this feature ]
> 
> I don't understand this complaint at all.  This Working Group is chartered to provide a standard mechanism for working with and sharing multiple graphs.   In the chartering process in 2010, our various inputs all said this was a very high priority.   A lot of folks said to add Named Graphs or fix reification or something like that.

It's not a complaint, it's an observation.

> See:
> 
> http://www.w3.org/2010/06/rdf-work-items/table (top two items are about "graph identification")
> 
> http://www.w3.org/2009/12/rdf-ws/Report.html ("The work items which had strong support and no stated opposition were: (1) Adding
> support for graph identification (such as with named graphs); (2)..."
> 
> http://www.w3.org/2002/09/wbs/1/rdf-2010/results#xq12

Graph *identification*. Surely by its very definition, using bNodes for graph labelling (anonymous graphs? what's a useful term?) is not identification. I don't recall bNode graph labels coming up in the pre-WG discussions, though it may have done.

> That's what we're supposed to be doing here, as I understand it. (along with Turtle, JSON, and general improvements.)
> 
> I raised ISSUE-131 because as I've started working more with the "Web Platform" (aka html5, aka the main thing W3C is doing these days)  I realized the solution we'd been drafting only solved the problem well for software running on Web servers or coupled in an application-dependent way with Web servers.   At least some tweak was needed, it seemed to me, to allow code running in browsers to properly use Datasets, as we had been specifying them.     I wish I'd noticed this earlier, but I'm glad I noticed it in time, before our design even reached Last Call, let alone became a W3C Recommendation.
> 
> Since I raised 131, another use case has come up, with Google announcing they'll be consuming JSON-LD that's included in email for gmail users.   I think email clients will generally have a hard time coming up with proper graph names.
> 
> I know of three technical solutions to this problem, as I outlined in the issue statement: use random URIs, use relative URIs, and use blank nodes.    My sense so far is that random (UUID) URIs are amazingly awkward, blank nodes are pretty good, and relative URIs might work okay, but have more unknowns than blank nodes.    (What's the base URI of an email message?   I suppose it could be understood to be the mid:...  -- that's got nice uniqueness properties, at least -- but since it's not hierarchical, I don't think one can use them to make graph names.)
> 
> Your email seems to count pretty equally against both relative URIs and blank nodes (as graph labels), so I'll leave the differences between them to another thread.
> 
>> TL;DR: don't mess.
>> 
>> We know that bNode graph identifiers are possible (I've designed a system myself
>> that had them) and that there are usecases that are addressed by it, but I've
>> not heard anything yet that can't already be addressed using RDF/SPARQL as it
>> stands. It is the opinion of some people that bNodes as graph identifiers
>> address it better, in some way, but that's another matter.
> 
> How can software running in a browser or an email client create a proper dataset?
> 
> It wants to say:
> 
> { :s1 :p1 :g }
> :g { :s2 :p2 :o2 }
> 
> where the s, p, and o terms are well-known URIs.    What can it use for :g if it's NOT communicating in some application-specific way with a Web Server?

One of us is under a serious misapprehension about the capabilities of a 2013-era web client.

Why should it be harder to mint a unique identifier when creating a document than it is when parsing it? Clients consume as well as create data.
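
A minimal sketch of client-side minting, assuming a random UUID URN is an acceptable graph name (the "random URIs" option from the issue statement). Python stands in for whatever the client actually runs; the function names and the TriG-like output are illustrative only, not taken from any spec:

```python
import uuid


def mint_graph_name() -> str:
    """Return a fresh IRI for use as a graph name, with no server round-trip."""
    # uuid4() is random; .urn renders it in the form "urn:uuid:<hex-with-dashes>".
    return uuid.uuid4().urn


def make_dataset(s1: str, p1: str, s2: str, p2: str, o2: str) -> str:
    """Serialise the two-graph example in TriG-like syntax with a minted name."""
    g = mint_graph_name()
    default_graph = f"{{ <{s1}> <{p1}> <{g}> }}"
    named_graph = f"<{g}> {{ <{s2}> <{p2}> <{o2}> }}"
    return default_graph + "\n" + named_graph + "\n"
```

The same name appears once as the object in the default graph and once as the graph label, which is all the :g example needs.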

>> There are however some costs to extending RDF (datasets) to require that bNodes
>> be usable as graph identifiers:
>> 
>> * We (Experian) have invested millions of dollars in our RDF engine - it's very
>> tightly optimised to the current specs, and opening up the space of graph
>> identifiers from a single class (URIs) to two classes (URIs and bNodes) would
>> have a significant engineering, and storage cost. Put simply, we wouldn't do it,
>> and would just step away from later RDF specs, becoming an RDF/SPARQL flavoured
>> graph database.
> 
> I don't think I've heard where your code consumes datasets from external sources.  If that never happens, then none of this matters to your code.

We get lots of data, from lots of (external) places, the vast majority isn't RDF, some small proportion is.

> If you do consume datasets from others, then (as Charles observed), it seems like you can just convert the graph names into some suitable internal identifiers during parsing.       If we follow the path you outline in your conclusion, you'll have to do this when you consume JSON-LD.   Why not also do it when you consume TriG and N-Quads?

I don't foresee a future where we have to consume JSON-LD. If we do we'll cross that bridge when we come to it.
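
For reference, the parse-time conversion suggested above could look roughly like the sketch below. It assumes quads arrive as (s, p, o, g) tuples of N-Quads-style term strings, with None for the default graph, and it mints skolem IRIs under a /.well-known/genid/ path on a placeholder authority (example.org). The function name and the tuple representation are hypothetical:

```python
import uuid

# Placeholder authority; a real system would use a domain it controls.
GENID_BASE = "http://example.org/.well-known/genid/"


def skolemise_graph_labels(quads, base=GENID_BASE):
    """Replace blank nodes used as graph labels with freshly minted IRIs.

    The replacement is applied in every position, so a triple that mentions
    the graph (e.g. as its object) still points at the renamed graph.
    """
    quads = list(quads)
    # Pass 1: mint one IRI per blank node that occurs as a graph label.
    mapping = {}
    for _s, _p, _o, g in quads:
        if g is not None and g.startswith("_:") and g not in mapping:
            mapping[g] = base + uuid.uuid4().hex
    # Pass 2: substitute those labels wherever they occur; other terms pass through.
    return [tuple(mapping.get(term, term) for term in quad) for quad in quads]
```

Blank nodes that never label a graph are left alone, so the cost falls only on documents that actually use the feature.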

>> * RDF is already too complex for people coming into it to learn easily. Every
>> time we add a new feature to the language we increase the barrier to entry.
> 
> I think it's arguable whether it's more confusing to allow or disallow blank nodes to be used as graph labels.
> 
> I think late-binding of relative URIs probably is more confusing. We can't really forbid it, but saying it's the main/only way to create datasets in a client means a larger group of people have to deeply understand it.

That's just a matter of opinion, in both directions.

More rope is more rope however.

>> * There's no practical way to refer to long lived bNodes in SPARQL (without
>> enforced skolemisation), people will import datasets with bNode graphs, and then
>> realise they can't isolate their data (presumably after posting on stack
>> overflow or similar :) ).
>> The following will not retrieve your original data* and this will just promote
>> more confusion:
>> 
>> 	Data:
>> 	_:abc { :s :p :o }
>> 
>> 	Query:
>> 	SELECT * WHERE { GRAPH _:abc { :s ?p ?o } }
>> 
>> 	you could possibly do something like:
>> 
>> 	SELECT * WHERE { GRAPH ?g { :s ?p ?o } FILTER(STR(?g) = "abc") }
>> 
>> 	That's pretty inconvenient, in many ways, and isn't required to work by SPARQL 1.1.
>> 	It is only possible at all in systems that preserve bNode labels, which is not
>> required.
> 
> Absolutely.   Blank nodes are a huge pain.   But sometimes they're better than the alternatives, like UUIDs and never-answered HTTP URLs.

Right.

It would be nice if we had a sensible alternative to blank nodes, but we don't, yet.
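
To illustrate the label-preservation point quoted above, here is a toy sketch of a store that renames incoming bNode graph labels on load, as a system is free to do. The ToyStore class and its methods are invented for this illustration and do not describe any real engine:

```python
import itertools


class ToyStore:
    """Hypothetical quad store that does not preserve incoming bNode labels."""

    def __init__(self):
        self._counter = itertools.count()
        self.graphs = {}  # internal graph label -> set of (s, p, o) triples

    def load(self, dataset):
        """Load a {graph_label: triples} mapping; bNode labels are local to this call."""
        for label, triples in dataset.items():
            if label.startswith("_:"):
                # Allocate a fresh internal label; the original one is forgotten.
                label = f"_:b{next(self._counter)}"
            self.graphs.setdefault(label, set()).update(triples)

    def graph(self, label):
        """Return the triples stored under exactly this label (empty set if none)."""
        return self.graphs.get(label, set())
```

After loading a dataset with a graph labelled _:abc, asking this store for "_:abc" returns nothing, even though the triples are there under an internal label. The data can only be found again by matching on its content.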

>> * Confusion with bNodes-in-graphs, and bNodes-as-graph-identifiers - the
>> discussion seems to assume that they're separate kinds of thing, maybe with
>> identifier bNodes not being existential variables? Whichever way it goes the
>> relationship between bNodes-in-graphs, and bNodes-identifying-graphs is going to
>> be complex.
> 
> I don't see any confusion.    The semantics of how blank nodes work as graph labels seems to me exactly the same as how IRIs do.
> 
> (That IS kind of confusing, but I believe it works.   The "graph name" denotes something in the domain of discourse which is paired with a particular RDF Graph by the dataset.   The pairing might be the identity relation, but we don't say what it is in general.  I'm convinced this is workable, because the vocabularies in use can convey what that relation is.)

… I understand there was some long discussion on this point, which I think just illustrates the issues …

>> * Of all the extensions that are implemented by a small number of systems as an
>> extension, this seems like an odd one to pick. IMHO there are far more serious
>> problems with RDF. There is a cost (to this group, and the wider community) of
>> any changes, so let's pick our battles wisely.
> 
> (See top of this email.)
> 
>> * There's very little implementation experience - compared to the other things
>> we're standardising: URI quads, bNode skolemisation, Turtle, NQuads. It's not
>> clear how far the existential variable-ness should extend - do we sanction graph
>> leaning? Do URI-identified graphs infer identical graphs identified by bNodes?
>> If not, why not? What do bNodes with a given label, in graphs identified by a
>> bNode with a different label refer to, etc.
>> 
>> 	_:abc {
>> 	}
>> 	_:def {
>> 	}
>>
>> One graph, or two, or undefined? I don't think we know the right answer yet.
> 
> I think the answers all fall out of our definition of datasets.

If we had one, yes.

> Clearly in your example, there are two different blank nodes, which are each being used in this dataset as "names" for the empty graph.    What those blank nodes might denote is not specified.
> 
> It's exactly the same as with IRIs:
> 
> :abc { }
> :def { }
> 
> we have two different IRIs which are each being used in this dataset as "names" for the empty graph.  What those IRIs might denote is not specified.
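
One way to picture the reading given above, as a sketch only: a plain Python dict stands in for the dataset's pairing of names with graphs. This is a simplification made for illustration, not the WG's definition, and it says nothing about what the names denote:

```python
# A graph is modelled as a frozenset of (s, p, o) triples; the empty graph has none.
EMPTY_GRAPH = frozenset()

# The dataset pairs each name (blank node or IRI) with a graph.
dataset = {
    "_:abc": EMPTY_GRAPH,
    "_:def": EMPTY_GRAPH,
}


def distinct_graphs(ds):
    """Return the set of distinct graphs that the names are paired with."""
    return set(ds.values())
```

On this reading there are two names but only one distinct graph, and swapping the blank node labels for IRIs changes nothing in the structure.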
> 
> 
>> So, in summary, I think the cost is high, and the benefit is vanishingly small.
> 
> Obviously I disagree.

Sure :)

>> Nothing stops people that feel they really need it adding them to RDF systems,
>> as they have in the past. One counter argument is that JSON-LD will do it
>> anyway, but that's fine - if it is widely used, it can be adopted into RDF 1.2,
>> with plenty of implementation experience. In the meantime JSON-LD serialisers
>> can skolemise when transforming JSON-LD into RDF - there's other places where
>> the transform is lossy anyway, as far as I understand it.
>>
>> - Steve
> 
> I think there's a huge cost to having "RDF" languages which extend the model instead of being merely alternative serializations.

I've not been following too closely, but I thought that JSON-LD already covers things that are explicitly out-of-scope for the RDF-WG? I don't really see it as being an issue - if people use those features they can (and will) be included in RDF 1.2.

As may be obvious I'm very much in favour of testing out changes to the language in the real world before committing them to specs. I (obviously) understand that not everyone agrees, but that's an opinion I've formed from having worked with various SW-related specs over the years, and being involved in some of the groups.

The cost of non-standard extensions to the specs is pretty small - look at SPARQL 1.0, practically everyone extended that, but the extensions were brought together quite successfully in SPARQL 1.1.

The counter cost, of including badly thought-out features (arguably like bNodes-as-variables themselves), is pretty high.

There are often features that are no-brainers, and everyone agrees should have been in earlier specs, but clearly this isn't one of those.

- Steve

-- 
Steve Harris
Experian
+44 20 3042 4132
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL
Received on Monday, 3 June 2013 09:40:48 UTC