Re: Blank Node Identifiers and RDF Dataset Normalization from Dave Longley on 2013-02-25 (public-linked-json@w3.org from February 2013)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Mon, 25 Feb 2013 10:56:04 -0500
To: Steve Harris <steve.harris@garlik.com>
CC: public-linked-json@w3.org
Message-ID: <512B8994.9080609@digitalbazaar.com>
On 02/25/2013 10:34 AM, Steve Harris wrote:
> On 2013-02-25, at 15:20, Dave Longley <dlongley@digitalbazaar.com> wrote:
>
>> On 02/25/2013 07:27 AM, Steve Harris wrote:
>>> [ TL;DR - stop messing ]
>>>
>>> Some systems (specifically 4store and 5store that I'm aware of, but I expect others) use the fact that graph labels have to be URIs as a source of optimisation.
>>>
>>> For example:
>>>
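>>> PREFIX dc: <http://purl.org/dc/elements/1.1/>
>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>>
>>> SELECT * WHERE {
>>>     ?g dc:date ?d .
>>>     GRAPH ?g { ?x a foaf:Person }
>>> }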
>>>
>>> You can restrict your search for values of ?g to URIs under current RDF semantics. Often you would want to bind dc:date first - e.g. if dc:date predicates with URI subjects in the "default graph" were rarer than graphs containing foaf:Person-s.
>>>
>>> Specifically 5store has no index space for quads where the graph label isn't a URI - this again is an optimisation (but 4store doesn't do that). Changing that would involve a significant amount of effort, and is not something we would commit to for a feature that would be of no benefit. SPARQL explicitly states that graph labels must be URIs, so this is legit.
>>>
>>> Also, it's highly subjective whether having bNodes as graph identifiers is a "good thing". I have evidence from 3store that it's not: users found it generally weird (this was in the days before graph-spanning bNodes were common, perhaps that's a factor?), and didn't like the fact that there were graphs without stable identifiers. You *can* preserve bNode labels between (de)serialisations, but not many systems do, and you're not required to.
>>>
>>> However, neither of those is really the issue - I think people in the community should recognise that RDF is now a deployed system with many implementations. I believe it does serious harm to RDF's image as a "real" technology if we go about making deep changes like this for no particularly good reason.
>>>
>>> We should have moved way beyond the time where RDF is an "emerging tech" only suitable for early-stage startups and academics. Let's start to act like we believe that.
>>>
>>> </rant>
>> Would these systems need to change if a new "special" kind of URI is used that takes on, in effect, the same attributes and meaning as a blank node identifier? From what I can tell, there are two workable proposals for generating identifiers for graphs on the table here:
> No, that would be just a URI as far as I can tell. I was specifically responding to the idea of changing the definition of SPARQL Datasets.

So your current system, if queried for <tag:w3.org,2013:dsid:1>, would
behave in exactly the same way now as it would if it were modified to
work with blank node identifiers in the graph position and then
queried for _:b1 instead? It seems to me that blank node identifiers, or
"blank node-like identifiers", cannot simply be treated as opaque values
and queried as if they were stable: graphs would be improperly merged and
you would have incorrect matches in your indexes (a deoptimization at
best). Is this not the case? If it is, then those special URIs must be
treated in a similar fashion, not just like any old URI. The special
document-local attributes and behavior of blank node identifiers must
bleed into the URI space. Is this all pushed somehow to the query
language, and even so, aren't your indexes still polluted?
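
To make the merging concern concrete, here is a rough sketch in Python. It is not how 4store, 5store, or any real system works; the loader functions, the document names, and the "ex:" terms are all made up for illustration. The only thing taken from the thread is the "tag:w3.org,2013:dsid:" prefix. It handles graph labels only and ignores blank nodes in other positions.

```python
from collections import defaultdict

DSID_PREFIX = "tag:w3.org,2013:dsid:"  # prefix from the proposal under discussion


def is_document_local(label):
    """True for blank node labels and for the proposed blank-node-like IRIs."""
    return label.startswith("_:") or label.startswith(DSID_PREFIX)


def load_naive(documents):
    """Hypothetical loader: every graph label is an opaque, global key."""
    index = defaultdict(set)
    for quads in documents.values():
        for s, p, o, g in quads:
            index[g].add((s, p, o))
    return index


def load_scoped(documents):
    """Hypothetical loader: document-local labels are qualified by their document."""
    index = defaultdict(set)
    for doc_id, quads in documents.items():
        for s, p, o, g in quads:
            key = (doc_id, g) if is_document_local(g) else (None, g)
            index[key].add((s, p, o))
    return index


# Two unrelated documents that happen to use the same document-local label.
documents = {
    "doc-a": [("ex:alice", "rdf:type", "foaf:Person", DSID_PREFIX + "1")],
    "doc-b": [("ex:bob", "rdf:type", "foaf:Person", DSID_PREFIX + "1")],
}

naive = load_naive(documents)
scoped = load_scoped(documents)

print(len(naive[DSID_PREFIX + "1"]))  # 2: two unrelated graphs merged into one
print(len(scoped))                    # 2: the graphs stay separate
```

The scoped loader only avoids the merge because it recognizes the special prefix, which is the sense in which the special URI cannot be handled like any old URI.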


>
> - Steve
>
>> 1. Use a blank node identifier.
>>
>> 2. Use a special IRI that is prefixed with "tag:w3.org,2013:dsid:". This identifier will look like an IRI but otherwise function exactly like a blank node identifier does (a document-local ID that is parsed/understood to have special meaning apart from other IRIs... it is not a "global" ID).
>>
>> It seems to me that systems would have to change to accommodate either of these proposals. In the first case, those systems that were written to reject illegal values would reject any data containing blank node identifiers as graph labels until they were updated. This is an annoyance, but seems preferable, IMO, to the second case. In the second case, existing systems that did not make appropriate changes could be susceptible to what I would consider data corruption. It sounds like, in your case, you have a particular index that might not be utilized properly -- because it would be incorrectly associating the special URI with multiple documents (merging graphs)... when it is a document-local identifier. Perhaps I am misunderstanding the current state of these systems, but I currently fail to see how the second option is preferable over the first from an existing systems standpoint.
>>
>> -Dave
>>
>>
>>> - Steve
>>>
>>> On 2013-02-25, at 11:21, William Waites <ww@styx.org> wrote:
>>>
>>>> Some RDF databases use the fact that the number of different
>>>> predicates will be small compared to the number of different nodes in
>>>> the subject or object position as a source of optimisation. Allowing
>>>> blank nodes as predicates, though it would be convenient and in some
>>>> respects more elegant would tend to break this assumption to the
>>>> detriment of the databases that are affected. This is a very real
>>>> concern.
>>>>
>>>> Allowing blank nodes in the graph position would not, as far as I am
>>>> aware, have a similar impact on existing implementations. My
>>>> impression from the previous discussion is that it's an easy patch to
>>>> the standards documents as well.
>>
>> -- 
>> Dave Longley
>> CTO
>> Digital Bazaar, Inc.
>> http://digitalbazaar.com
>>

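As a rough illustration of how an unmodified system might react to each of the two proposals quoted above, here is a small Python sketch. StrictStore is hypothetical and stands in for any system written to reject non-IRI graph labels; the "ex:" terms are invented, and the check for "_:" is a crude stand-in for real IRI validation.

```python
TAG_PREFIX = "tag:w3.org,2013:dsid:"  # prefix taken from proposal 2


class StrictStore:
    """Hypothetical store that only accepts IRIs as graph labels."""

    def __init__(self):
        self.graphs = {}

    def add(self, s, p, o, graph):
        # Crude check: anything that looks like a blank node label is rejected.
        if graph.startswith("_:"):
            raise ValueError("graph label must be an IRI: " + graph)
        self.graphs.setdefault(graph, set()).add((s, p, o))


store = StrictStore()

# Proposal 1: the unmodified system fails loudly on a blank node graph label.
try:
    store.add("ex:alice", "rdf:type", "foaf:Person", "_:b1")
    rejected = False
except ValueError:
    rejected = True

# Proposal 2: the special IRI is accepted without complaint, so quads from
# two different documents end up under the same graph key.
store.add("ex:alice", "rdf:type", "foaf:Person", TAG_PREFIX + "1")  # document A
store.add("ex:bob", "rdf:type", "foaf:Person", TAG_PREFIX + "1")    # document B

print(rejected)                             # True
print(len(store.graphs[TAG_PREFIX + "1"]))  # 2
```

The first case is the annoyance (an error until the system is updated); the second is the silent merge that could amount to data corruption.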

-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Monday, 25 February 2013 15:55:19 UTC