Re: Blank Node Identifiers and RDF Dataset Normalization from Dave Longley on 2013-02-25 (public-linked-json@w3.org from February 2013)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Mon, 25 Feb 2013 10:20:02 -0500
To: Steve Harris <steve.harris@garlik.com>
CC: public-linked-json@w3.org
Message-ID: <512B8122.1020303@digitalbazaar.com>

On 02/25/2013 07:27 AM, Steve Harris wrote:
> [ TL;DR - stop messing ]
>
> Some systems (specifically 4store and 5store that I'm aware of, but I expect others) use the fact that graph labels have to be URIs as a source of optimisation.
>
> For example:
>
> SELECT * WHERE {
>     ?g dc:date ?d .
>     GRAPH ?g { ?x a foaf:Person }
> }
>
> You can restrict the search for values of ?g to URIs under current RDF semantics. Often you would want to bind dc:date first - e.g. if dc:date predicates with URI subjects in the "default graph" were rarer than graphs containing foaf:Person-s.
>
> Specifically 5store has no index space for quads where the graph label isn't a URI - this again is an optimisation (but 4store doesn't do that). Changing that would involve a significant amount of effort, and is not something we would commit to for a feature that would be of no benefit. SPARQL explicitly states that graph labels must be URIs, so this is legit.
>
> Also, it's highly subjective whether having bNodes as graph identifiers is a "good thing". I have evidence from 3store that it's not: users found it generally weird (this was in the days before graph-spanning bNodes were common, perhaps that's a factor?), and didn't like the fact that there were graphs without stable identifiers. You *can* preserve bNode labels between (de)serialisations, but not many systems do, and you're not required to.
>
> However, neither of those is really the issue - I think people in the community should recognise that RDF is now a deployed system with many implementations. I believe it does serious harm to RDF's image as a "real" technology if we go about making deep changes like this for no particularly good reason.
>
> We should have moved way beyond the time where RDF is an "emerging tech" only suitable for early-stage startups and academics. Let's start to act like we believe that.
>
> </rant>

Would these systems need to change if a new "special" kind of URI is 
used that takes on, in effect, the same attributes and meaning as a 
blank node identifier? From what I can tell, there are two workable 
proposals for generating identifiers for graphs on the table here:

1. Use a blank node identifier.

2. Use a special IRI that is prefixed with "tag:w3.org,2013:dsid:". This 
identifier will look like an IRI but otherwise function exactly like a 
blank node identifier does (a document-local ID that is 
parsed/understood to have special meaning apart from other IRIs... it is 
not a "global" ID).

It seems to me that systems would have to change to accommodate either 
of these proposals. In the first case, those systems that were written 
to reject illegal values would reject any data containing blank node 
identifiers as graph labels until they were updated. This is an 
annoyance, but seems preferable, IMO, to the second case. In the second 
case, existing systems that did not make appropriate changes could be 
susceptible to what I would consider data corruption. It sounds like,
in your case, you have a particular index that could be used
incorrectly: it would associate the same special URI with multiple
documents (merging their graphs) even though that URI is a
document-local identifier. Perhaps I am misunderstanding the current
state of these systems, but I fail to see how the second option is
preferable to the first from an existing-systems standpoint.
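
To make the difference concrete, here is a minimal sketch in Python. It
is not modeled on 4store, 5store, or any other real system: the loader
functions, the in-memory dict "store", and the ex: sample data are all
hypothetical, and only the "tag:w3.org,2013:dsid:" prefix comes from
proposal 2 above. It assumes a legacy loader that insists on IRI graph
labels and treats every label as global.

```python
from itertools import count

DSID_PREFIX = "tag:w3.org,2013:dsid:"  # prefix named in proposal 2


def load_legacy(store, quads):
    """Hypothetical un-updated loader: graph labels must be IRIs and are global."""
    for s, p, o, g in quads:
        if g.startswith("_:"):
            # Proposal 1 on an old system: the data is rejected loudly.
            raise ValueError(f"graph label must be an IRI: {g}")
        # Proposal 2 on an old system: the special IRI is stored as-is.
        store.setdefault(g, set()).add((s, p, o))


def load_updated(store, quads, fresh):
    """Hypothetical updated loader: document-local labels are renamed per document."""
    local = {}  # mapping from this document's local labels to fresh store labels
    for s, p, o, g in quads:
        if g.startswith("_:") or g.startswith(DSID_PREFIX):
            if g not in local:
                local[g] = f"_:g{next(fresh)}"
            g = local[g]
        store.setdefault(g, set()).add((s, p, o))


# Two unrelated documents that happen to reuse the same document-local label.
doc_a = [("ex:alice", "rdf:type", "foaf:Person", DSID_PREFIX + "1")]
doc_b = [("ex:bob", "rdf:type", "foaf:Person", DSID_PREFIX + "1")]

# Legacy loader: both documents land in one graph (silent merge).
legacy = {}
load_legacy(legacy, doc_a)
load_legacy(legacy, doc_b)

# Updated loader: each document gets its own graph.
updated = {}
fresh = count()
load_updated(updated, doc_a, fresh)
load_updated(updated, doc_b, fresh)

# Legacy loader with a blank node graph label: explicit rejection.
try:
    load_legacy({}, [("ex:carol", "rdf:type", "foaf:Person", "_:b0")])
    rejected = False
except ValueError:
    rejected = True
```

Under these assumptions, the legacy loader fails visibly on a blank
node graph label but quietly merges the two documents' graphs when
given the special IRI, which is the kind of corruption I am worried
about.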

-Dave


>
> - Steve
>
> On 2013-02-25, at 11:21, William Waites <ww@styx.org> wrote:
>
>> Some RDF databases use the fact that the number of different
>> predicates will be small compared to the number of different nodes in
>> the subject or object position as a source of optimisation. Allowing
>> blank nodes as predicates, though it would be convenient and in some
>> respects more elegant, would tend to break this assumption to the
>> detriment of the databases that are affected. This is a very real
>> concern.
>>
>> Allowing blank nodes in the graph position would not, as far as I am
>> aware, have a similar impact on existing implementations. My
>> impression from the previous discussion is that it's an easy patch to
>> the standards documents as well.


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Monday, 25 February 2013 15:19:19 UTC