Re: Blank Node Identifiers and RDF Dataset Normalization from Steve Harris on 2013-02-25 (public-linked-json@w3.org from February 2013)

From: Steve Harris <steve.harris@garlik.com>
Date: Mon, 25 Feb 2013 17:09:56 +0000
To: Dave Longley <dlongley@digitalbazaar.com>
Cc: public-linked-json@w3.org
Message-Id: <67E06B09-4AA7-4679-B716-6AFAA490307C@garlik.com>

On 2013-02-25, at 16:40, Dave Longley <dlongley@digitalbazaar.com> wrote:
> On 02/25/2013 11:21 AM, Steve Harris wrote:
>> On 2013-02-25, at 15:56, Dave Longley <dlongley@digitalbazaar.com> wrote:
>> 
>>> On 02/25/2013 10:34 AM, Steve Harris wrote:
>>>> On 2013-02-25, at 15:20, Dave Longley <dlongley@digitalbazaar.com> wrote:
>>>> 
>>>>> On 02/25/2013 07:27 AM, Steve Harris wrote:
>>>>>> [ TL;DR - stop messing ]
>>>>>> 
>>>>>> Some systems (specifically 4store and 5store that I'm aware of, but I expect others) use the fact that graph labels have to be URIs as a source of optimisation.
>>>>>> 
>>>>>> For example:
>>>>>> 
>>>>>> SELECT * WHERE {
>>>>>>    ?g dc:date ?d .
>>>>>>    GRAPH ?g { ?x a foaf:Person }
>>>>>> }
>>>>>> 
>>>>>> Under current RDF semantics you can restrict the search for values of ?g to URIs. Often you would want to bind dc:date first - e.g. if dc:date predicates with URI subjects in the "default graph" were rarer than graphs containing foaf:Person-s.
>>>>>> 
>>>>>> Specifically 5store has no index space for quads where the graph label isn't a URI - this again is an optimisation (but 4store doesn't do that). Changing that would involve a significant amount of effort, and is not something we would commit to for a feature that would be of no benefit. SPARQL explicitly states that graph labels must be URIs, so this is legit.
>>>>>> 
>>>>>> Also, it's highly subjective whether having bNodes as graph identifiers is a "good thing". I have evidence from 3store that it's not: users found it generally weird (this was in the days before graph-spanning bNodes were common, perhaps that's a factor?), and didn't like the fact that there were graphs without stable identifiers. You *can* preserve bNode labels between (de)serialisations, but not many systems do, and you're not required to.
>>>>>> 
>>>>>> However, neither of those is really the issue - I think people in the community should recognise that RDF is now a deployed system with many implementations. I believe it does serious harm to RDF's image as a "real" technology if we go about making deep changes like this for no particularly good reason.
>>>>>> 
>>>>>> We should have moved way beyond the time where RDF is an "emerging tech" only suitable for early-stage startups and academics. Let's start to act like we believe that.
>>>>>> 
>>>>>> </rant>
>>>>> Would these systems need to change if a new "special" kind of URI is used that takes on, in effect, the same attributes and meaning as a blank node identifier? From what I can tell, there are two workable proposals for generating identifiers for graphs on the table here:
>>>> No, that would be just a URI as far as I can tell. I was specifically responding to the idea of changing the definition of SPARQL Datasets.
>>> So your current system, if queried for <tag:w3.org,2013:dsid:1>, would behave in exactly the same way now as it would if it were modified to work with blank node identifiers in the graph position... and then instead queried for _:b1? It seems to
>> I don't think I understand the question.
>> 
>> SPARQL has no syntax for querying for _:b1 - the bNode label syntax in SPARQL is used for non-projecting variables.
>> 
>> Blank node identifiers are a syntax issue only.
>> 
>>> me that blank node identifiers or "blank node-like identifiers" cannot be simply treated as opaque values and queried as if stable ... graphs would be improperly merged and you would have incorrect matches in your indexes (a deoptimization at best). This is not the case? If it is, then those special URIs must be treated in a similar fashion ... not just like any old URI. The special document-local attributes and behavior of blank node identifiers must bleed into the URI space. Is this all pushed somehow to the query language, and even so, aren't your indexes still polluted?
>> I'm expecting that those possible magic URIs would only be generated by, and special to, JSON-LD parsers, and we don't intend to use JSON-LD.
>> 
>> If they were somehow encountered in Turtle (for example) then as far as I can see, they can be treated just as URIs. c.f. Skolem URIs.
> 
> This is what I'm getting at. They are not just "URIs"; they are not simply Skolem URIs. They look like Skolem URIs but have the semantics of blank node identifiers... they are rewritable and document-local. They do not provide a unique name in the global context.

I don't think that RFC 4151 has any provision for that - your system (or every JSON-LD system) is welcome to do with them as it pleases, but it wouldn't be universal. http://www.ietf.org/rfc/rfc4151.txt - specifically §2.4.

> I would expect this to affect existing systems and that simply reusing blank node identifiers would be preferable. I understand that your strongest preference is for no changes at all, but I think it's important to understand what's really being proposed as an alternative to the blank node identifier solution. No solution at all doesn't cover the JSON-LD use case, blank node identifiers do... and so does something, that, IMO, is worse: a special URI that must be understood to be rewritable, document-local, and specifically *not* semantically the same as a run-of-the-mill Skolem URI.

I don't see how it would be different from a Skolem URI.

The rewriting behaviour (or whatever mechanism you choose) has to be optional, otherwise it will break URI behaviour.
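
To illustrate the point, here is a minimal Python sketch, not a description of 4store, 5store, or any real system. It assumes a toy quad store that keys graphs by their label as a plain URI string. The QuadStore class, the rewrite_dsid flag, and the urn:example:graph: labels are all hypothetical; only the "tag:w3.org,2013:dsid:" prefix comes from the proposal quoted in this thread. With default URI behaviour, two documents that use the same label end up in one graph; the document-local rewriting only happens if a loader opts in.

```python
from collections import defaultdict
from itertools import count

# Prefix taken from the proposal quoted in this thread.
DSID_PREFIX = "tag:w3.org,2013:dsid:"


class QuadStore:
    """Hypothetical toy store: graph label (a URI string) -> set of triples."""

    def __init__(self):
        self.graphs = defaultdict(set)
        self._fresh = count(1)

    def load(self, quads, rewrite_dsid=False):
        # Per-document mapping, used only when the caller opts in to rewriting.
        local = {}
        for s, p, o, g in quads:
            if rewrite_dsid and g.startswith(DSID_PREFIX):
                if g not in local:
                    # Hypothetical fresh label, minted once per document-local ID.
                    local[g] = f"urn:example:graph:{next(self._fresh)}"
                g = local[g]
            self.graphs[g].add((s, p, o))


doc_a = [("ex:alice", "rdf:type", "foaf:Person", DSID_PREFIX + "1")]
doc_b = [("ex:bob", "rdf:type", "foaf:Person", DSID_PREFIX + "1")]

# Default: the label is just a URI, so equal strings name the same graph.
plain = QuadStore()
plain.load(doc_a)
plain.load(doc_b)

# Opt-in: each document's dsid labels are rewritten to fresh graph URIs.
opt_in = QuadStore()
opt_in.load(doc_a, rewrite_dsid=True)
opt_in.load(doc_b, rewrite_dsid=True)
```

In this sketch plain ends up with a single graph holding both triples, while opt_in holds two separate graphs. Which of the two is "right" is exactly what is being argued about; the sketch only shows that the second behaviour cannot be the default without changing how URIs compare.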

I also don't see how this is an issue for RDF in a global sense; my impression is that it's a side effect of some very specific modelling choices that someone in your organisation has made.

There is categorically no valid argument that something along these lines is essential for such-and-such use case. Frankly, that's nonsense: those use cases are already addressed by production systems in much more demanding environments, without those features.

There are plenty of things that *could* be done to make some specific use cases slightly simpler for implementors or deployers, or whoever. That does not mean that they should be written into the specifications.

I understand your desire to have the spec sanction every single quirk of your system (really), but it's just not practical when it makes so much work for everyone else, and complicates the specs to this extent.

- Steve

>> Random strange features in JSON-LD are not an issue for us; if they start messing up the semantics of RDF, then we have a problem with them.
>> 
>> - Steve
>> 
>>>>> 1. Use a blank node identifier.
>>>>> 
>>>>> 2. Use a special IRI that is prefixed with "tag:w3.org,2013:dsid:". This identifier will look like an IRI but otherwise function exactly like a blank node identifier does (a document-local ID that is parsed/understood to have special meaning apart from other IRIs... it is not a "global" ID).
>>>>> 
>>>>> It seems to me that systems would have to change to accommodate either of these proposals. In the first case, those systems that were written to reject illegal values would reject any data containing blank node identifiers as graph labels until they were updated. This is an annoyance, but seems preferable, IMO, to the second case. In the second case, existing systems that did not make appropriate changes could be susceptible to what I would consider data corruption. It sounds like, in your case, you have a particular index that might not be utilized properly -- because it would be incorrectly associating the special URI with multiple documents (merging graphs)... when it is a document-local identifier. Perhaps I am misunderstanding the current state of these systems, but I currently fail to see how the second option is preferable over the first from an existing systems standpoint.
>>>>> 
>>>>> -Dave
>>>>> 
>>>>> 
>>>>>> - Steve
>>>>>> 
>>>>>> On 2013-02-25, at 11:21, William Waites <ww@styx.org> wrote:
>>>>>> 
>>>>>>> Some RDF databases use the fact that the number of different
>>>>>>> predicates will be small compared to the number of different nodes in
>>>>>>> the subject or object position as a source of optimisation. Allowing
>>>>>>> blank nodes as predicates, though it would be convenient and in some
>>>>>>> respects more elegant would tend to break this assumption to the
>>>>>>> detriment of the databases that are affected. This is a very real
>>>>>>> concern.
>>>>>>> 
>>>>>>> Allowing blank nodes in the graph position would not, as far as I am
>>>>>>> aware, have a similar impact on existing implementations. My
>>>>>>> impression from the previous discussion is that it's an easy patch to
>>>>>>> the standards documents as well.
>>>>> -- 
>>>>> Dave Longley
>>>>> CTO
>>>>> Digital Bazaar, Inc.
>>>>> http://digitalbazaar.com
>>>>> 

-- 
Steve Harris
Experian
+44 20 3042 4132
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL
Received on Monday, 25 February 2013 17:10:25 UTC