
Re: Blank Node Identifiers and RDF Dataset Normalization

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Mon, 25 Feb 2013 11:40:13 -0500
Message-ID: <512B93ED.2020201@digitalbazaar.com>
To: Steve Harris <steve.harris@garlik.com>
CC: public-linked-json@w3.org
On 02/25/2013 11:21 AM, Steve Harris wrote:
> On 2013-02-25, at 15:56, Dave Longley <dlongley@digitalbazaar.com> wrote:
>
>> On 02/25/2013 10:34 AM, Steve Harris wrote:
>>> On 2013-02-25, at 15:20, Dave Longley <dlongley@digitalbazaar.com> wrote:
>>>
>>>> On 02/25/2013 07:27 AM, Steve Harris wrote:
>>>>> [ TL;DR - stop messing ]
>>>>>
>>>>> Some systems (specifically 4store and 5store that I'm aware of, but I expect others) use the fact that graph labels have to be URIs as a source of optimisation.
>>>>>
>>>>> For example:
>>>>>
>>>>> SELECT * WHERE {
>>>>>     ?g dc:date ?d .
>>>>>     GRAPH ?g { ?x a foaf:Person }
>>>>> }
>>>>>
>>>>> Under current RDF semantics, you can restrict your search for values of ?g to URIs. Often you would want to bind dc:date first - e.g. if dc:date predicates with URI subjects in the "default graph" were rarer than graphs containing foaf:Person-s.
>>>>>
>>>>> Specifically 5store has no index space for quads where the graph label isn't a URI - this again is an optimisation (but 4store doesn't do that). Changing that would involve a significant amount of effort, and is not something we would commit to for a feature that would be of no benefit. SPARQL explicitly states that graph labels must be URIs, so this is legit.
>>>>>
>>>>> Also, it's highly subjective whether having bNodes as graph identifiers is a "good thing". I have evidence from 3store that it's not: users found it generally weird (this was in the days before graph-spanning bNodes were common, perhaps that's a factor?), and didn't like the fact that there were graphs without stable identifiers. You *can* preserve bNode labels between (de)serialisations, but not many systems do, and you're not required to.
>>>>>
>>>>> However, neither of those is really the issue - I think people in the community should recognise that RDF is now a deployed system with many implementations. I believe it does serious harm to RDF's image as a "real" technology if we go about making deep changes like this for no particularly good reason.
>>>>>
>>>>> We should have moved way beyond the time where RDF is an "emerging tech" only suitable for early-stage startups and academics. Let's start to act like we believe that.
>>>>>
>>>>> </rant>
>>>> Would these systems need to change if a new "special" kind of URI is used that takes on, in effect, the same attributes and meaning as a blank node identifier? From what I can tell, there are two workable proposals for generating identifiers for graphs on the table here:
>>> No, that would be just a URI as far as I can tell. I was specifically responding to the idea of changing the definition of SPARQL Datasets.
>> So your current system, if queried for <tag:w3.org,2013:dsid:1>, would behave in exactly the same way now as it would if it were modified to work with blank node identifiers in the graph position... and then instead queried for _:b1? It seems to
> I don't think I understand the question.
>
> SPARQL has no syntax for querying for _:b1 - the bNode label syntax in SPARQL is used for non-projecting variables.
>
> Blank node identifiers are a syntax issue only.
>
>> me that blank node identifiers or "blank node-like identifiers" cannot be simply treated as opaque values and queried as if stable ... graphs would be improperly merged and you would have incorrect matches in your indexes (a deoptimization at best). This is not the case? If it is, then those special URIs must be treated in a similar fashion ... not just like any old URI. The special document-local attributes and behavior of blank node identifiers must bleed into the URI space. Is this all pushed somehow to the query language, and even so, aren't your indexes still polluted?
> I'm expecting that those possible magic URIs would only be generated by, and special to, JSON-LD parsers, and we don't intend to use JSON-LD.
>
> If they were somehow encountered in Turtle (for example) then, as far as I can see, they can be treated just as URIs, cf. Skolem URIs.

This is what I'm getting at. They are not just "URIs"; they are not 
simply Skolem URIs. They look like Skolem URIs but have the semantics of 
blank node identifiers... they are rewritable and document-local. They 
do not provide a unique name in the global context.

I would expect this to affect existing systems and that simply reusing 
blank node identifiers would be preferable. I understand that your 
strongest preference is for no changes at all, but I think it's 
important to understand what's really being proposed as an alternative 
to the blank node identifier solution. No solution at all doesn't cover 
the JSON-LD use case, blank node identifiers do... and so does 
something that, IMO, is worse: a special URI that must be understood to 
be rewritable, document-local, and specifically *not* semantically the 
same as a run-of-the-mill Skolem URI.
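
To make the concern concrete, here is a minimal sketch in Python. It is 
an illustration only, not any real store's API: the quad tuples, the 
two loader functions, and the "_:g" relabelling scheme are all 
hypothetical. Only the "tag:w3.org,2013:dsid:" prefix comes from this 
thread. It shows what I would expect to happen when two documents both 
use the same document-local label and a system indexes it as an 
ordinary global URI, compared with a system that rewrites the label per 
document the way it would for a blank node identifier.

```python
from collections import defaultdict
from itertools import count

# Prefix from proposal 2 in this thread; everything else is hypothetical.
DSID_PREFIX = "tag:w3.org,2013:dsid:"


def load_naive(store, quads):
    """Index quads by graph label, treating every label as a global URI."""
    for s, p, o, g in quads:
        store[g].add((s, p, o))


def load_document_local(store, quads, fresh):
    """Rewrite dsid labels per document before indexing.

    `renamed` lives only for one call, so the same label appearing in
    two different documents is mapped to two different graph names.
    """
    renamed = {}
    for s, p, o, g in quads:
        if g.startswith(DSID_PREFIX):
            if g not in renamed:
                renamed[g] = f"_:g{next(fresh)}"
            g = renamed[g]
        store[g].add((s, p, o))


# Two separate documents that happen to use the same document-local label.
doc_a = [("ex:alice", "rdf:type", "foaf:Person", DSID_PREFIX + "1")]
doc_b = [("ex:bob", "rdf:type", "foaf:Person", DSID_PREFIX + "1")]

naive = defaultdict(set)
load_naive(naive, doc_a)
load_naive(naive, doc_b)

aware = defaultdict(set)
fresh = count()
load_document_local(aware, doc_a, fresh)
load_document_local(aware, doc_b, fresh)

# The naive store ends up with one merged graph; the aware store keeps two.
print(len(naive), len(aware))
```

In this sketch the naive loader leaves a single graph holding both 
triples, which is the improper merge I am worried about, while the 
second loader only avoids it because it knows the prefix is special.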


>
> Random strange features in JSON-LD are not an issue for us, but if it starts messing up the semantics of RDF, then we have a problem with it.
>
> - Steve
>
>>>> 1. Use a blank node identifier.
>>>>
>>>> 2. Use a special IRI that is prefixed with "tag:w3.org,2013:dsid:". This identifier will look like an IRI but otherwise function exactly like a blank node identifier does (a document-local ID that is parsed/understood to have special meaning apart from other IRIs... it is not a "global" ID).
>>>>
>>>> It seems to me that systems would have to change to accommodate either of these proposals. In the first case, those systems that were written to reject illegal values would reject any data containing blank node identifiers as graph labels until they were updated. This is an annoyance, but seems preferable, IMO, to the second case. In the second case, existing systems that did not make appropriate changes could be susceptible to what I would consider data corruption. It sounds like, in your case, you have a particular index that might not be utilized properly -- because it would be incorrectly associating the special URI with multiple documents (merging graphs)... when it is a document-local identifier. Perhaps I am misunderstanding the current state of these systems, but I currently fail to see how the second option is preferable over the first from an existing systems standpoint.
>>>>
>>>> -Dave
>>>>
>>>>
>>>>> - Steve
>>>>>
>>>>> On 2013-02-25, at 11:21, William Waites <ww@styx.org> wrote:
>>>>>
>>>>>> Some RDF databases use the fact that the number of different
>>>>>> predicates will be small compared to the number of different nodes in
>>>>>> the subject or object position as a source of optimisation. Allowing
>>>>>> blank nodes as predicates, though it would be convenient and in some
>>>>>> respects more elegant would tend to break this assumption to the
>>>>>> detriment of the databases that are affected. This is a very real
>>>>>> concern.
>>>>>>
>>>>>> Allowing blank nodes in the graph position would not, as far as I am
>>>>>> aware, have a similar impact on existing implementations. My
>>>>>> impression from the previous discussion is that it's an easy patch to
>>>>>> the standards documents as well.
>>>> -- 
>>>> Dave Longley
>>>> CTO
>>>> Digital Bazaar, Inc.
>>>> http://digitalbazaar.com
>>>>
>>
>> -- 
>> Dave Longley
>> CTO
>> Digital Bazaar, Inc.
>> http://digitalbazaar.com
>>


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Monday, 25 February 2013 16:39:30 GMT