Re: Blank Nodes Re: Toward easier RDF: a proposal

Interesting discussion!

I personally think the best way forward is to put aside dogma and do 
useful things with the standards we've proposed ... to lead by example, 
as the folks at Wikidata have been doing (for instance), and as people 
have been doing in projects like Linking Open Data.

... easier said than done of course.


In any case, I guess I am best positioned to comment on the blank nodes 
issue. The tl;dr version of my stance regarding blank nodes is *no 
action required*. This stance is the polar opposite of the one I held 
at the start of the journey a few years back.

Comments below ...

On 22-11-2018 0:42, David Booth wrote:
> On 11/21/18 6:59 PM, Thomas Passin wrote:
>> the problem of blank nodes is that there seems to be no general way to 
>> assign them ids in a reproducible way. 
> 
> Currently, for arbitrary RDF with unrestricted blank nodes, that seems 
> to be the case.  However, if we are willing to live with some very 
> modest restrictions on blank node usage, such that there are no blank 
> node cycles, then it does become possible to canonicalize RDF.  And 
> that, in turn, leads to the possibility of assigning IDs in a 
> reproducible way.  For example, a blank node that is used to connect the 
> attribute of an n-ary relation could be replaced by a Skolem IRI that is 
> constructed (recursively) from the entities that it relates.  Also, 
> conventions could require each n-ary relation to specify a (possibly 
> composite) key.
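(Aside: just to make the bottom-up idea concrete, here is a toy Python 
sketch of such a scheme, assuming cycle-free blank nodes. The graph 
representation and hashing choices are mine, purely for illustration; 
the IRI prefix follows the ".well-known/genid" convention suggested in 
RDF 1.1, with a placeholder authority.)

```python
import hashlib

def skolemize_tree(graph, base='http://example.org/.well-known/genid/'):
    """Replace cycle-free blank nodes with reproducible Skolem IRIs.

    graph: a set of (s, p, o) triples; blank nodes are strings
    starting with '_:'. Each blank node's IRI is built (recursively,
    bottom-up) from a hash of its outgoing description, so the same
    tree structure always yields the same IRI. NB: a cycle among
    blank nodes would cause infinite recursion here -- hence the
    proposed restriction.
    """
    def is_bnode(t):
        return isinstance(t, str) and t.startswith('_:')

    iri = {}
    def skolem(b):
        if b not in iri:
            # hash the (already-Skolemised) description below b
            desc = sorted((p, skolem(o) if is_bnode(o) else o)
                          for s, p, o in graph if s == b)
            digest = hashlib.sha256(repr(desc).encode()).hexdigest()
            iri[b] = base + digest[:16]
        return iri[b]

    for b in {t for tr in graph for t in tr if is_bnode(t)}:
        skolem(b)
    return {tuple(iri.get(t, t) for t in tr) for tr in graph}
```

One caveat of pure content hashing: two sibling blank nodes with 
identical subtrees collapse to the same IRI, which is why proposals 
along these lines also talk about keys for n-ary relations.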

It is possible (and I could even argue very "efficient") to assign ids 
to blank nodes in a reproducible way without any further restrictions on 
RDF. After seeing a similar discussion to this on the list about four 
years ago [1], and having had a good bit of experience working with 
blank nodes in various settings, I started working on precisely this 
algorithm. Results over millions of RDF graphs crawled from the Web show 
that "canonical labelling" of blank nodes (assigning IDs in a 
reproducible way) takes milliseconds in almost all cases. The worst 
real-world case took around 30 seconds, but it was "big" rather than 
"hard", with 7.3 million triples and 254 thousand blank nodes [2,3].

The papers also show that one can create more difficult synthetic cases, 
but these are what I would call "adversarial cases", meaning that they 
would likely only occur because an attacker had designed them. While it 
is important to be aware that such adversarial cases exist, I don't 
consider them to be any bigger of a problem in practice than an 
"adversarial SPARQL query" sent to an endpoint, for example. Such cases 
can be rejected, left to time out, etc.

This has not been the only work on the topic, as Ivan mentions; there 
are a number of practical solutions to this problem, some of which were 
developed before or in parallel with what I was doing ...

On 22-11-2018 14:08, Ivan Herman wrote: <snip>
>>> Some recent good progress on canonicalization: JSON-LD
>>> https://json-ld.github.io/normalization/spec/ .  However, the
>>> current JSON-LD canonicalization draft (called "normalization")
>>> is focused only on the digital signatures use case, and
>>> needs improvement to better address the diff use case, in
>>> which small, localized graph changes should result in small,
>>> localized differences in the canonicalized graph.
>>>
>> 
>> There has been some discussions around this lately. If you are 
>> interested, look at:
>> 
>> https://github.com/w3c/strategy/issues/116
>> 
>> In particular (specific comments as well as links from those comments):
>> 
>> https://github.com/w3c/strategy/issues/116#issuecomment-383875628
>> https://github.com/w3c/strategy/issues/116#issuecomment-384160630
>> https://github.com/w3c/strategy/issues/116#issuecomment-395791130
>> https://github.com/w3c/strategy/issues/116#issuecomment-435920927
>> 
>> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>> http://aidanhogan.com/docs/rdf-canonicalisation.pdf
>> http://json-ld.github.io/normalization/spec/index.html
>> https://github.com/iherman/canonical_rdf
>> https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html
>> 
>> It is still not clear how exactly we will move forward, but I have some 
>> hopes that this will happen sometime in 2019. It depends on the 
>> availability of the people involved; the path to get this done is now 
>> relatively clear.
>> 
>> All that being said: David's point is well taken on blank nodes. If 
>> there were no blank nodes around, it would be obvious. Looking at the 
>> details of the two available solutions (see points above) it is also 
>> true that there may be a middle ground: if the usage of blank nodes 
>> were somehow restricted, avoiding circular patterns. I *think* (but I 
>> am not 100% sure) that if all blank nodes could be expressed by [] in 
>> Turtle without any need for explicit bnode identifiers then both 
>> algorithms referred to above would become way simpler.

Yep, things would become a bit simpler without cycles in blank nodes. 
It's worth noting for context that maybe six or seven years back we 
informally/briefly made such an observation/proposal [4; just before 
Section 6.1], and David followed up on this with the Well Behaved RDF 
proposal [5]. At the time I was obviously on board (up until 2015, when 
I worked on [2] and changed my mind).

Now I am no longer convinced of the need for such restrictions, or at 
least it seems like a case of premature optimisation. The strategy I 
favour would be not to restrict RDF until it is clear that this is 
necessary, or until various practical tools/use-cases adopt such 
restrictions independently (and then maybe pave the cow-paths). We have 
tools that work fine in all but some adversarial cases without having to 
introduce restrictions, so I'm not sure what introducing 
(overly-simplified?) restrictions would buy.

Put another way, the algorithms for canonical labelling that I have seen 
or worked on myself would probably be about as efficient (as makes no 
difference) on cases where blank nodes have no cycles as dedicated 
algorithms that assume such restrictions.

Plus there are now implementations of such algorithms for the 
unrestricted case.

On 26-11-2018 6:18, Ivan Herman wrote:
>> 
>>> On 26 Nov 2018, at 07:26, David Booth <david@dbooth.org> wrote:
>>> I like this line of thought.  I would much rather have auto-generated 
>>> URIs, that are predictable and distinguishable as auto-generated, than 
>>> blank nodes.  And even better, those auto-generated URIs could be 
>>> generated using a standard algorithm, so that all tools would generate 
>>> them the same way.
>>>
>>> As Aidan Hogan et al. point out in "Everything You Always Wanted to 
>>> Know About Blank Nodes": "the vast majority of blank nodes form tree 
>>> structures", i.e., they do not contain blank node cycles.
>>> http://www.websemanticsjournal.org/index.php/ps/article/download/365/387
>>>
>>> If blank node cycles were prohibited in RDF, then predictable URIs 
>>> could be automatically generated for those blank nodes, bottom-up 
>>> recursively based on the tree structure.  And prohibiting blank node 
>>> cycles would not be a huge loss, because even the few cases that do 
>>> use blank node cycles could be brought into conformance by replacing a 
>>> few of the blank nodes with URIs, to break the cycles.
>> 
>> It also means that the problem of canonicalization/signature/etc would 
>> become way easier. The algorithms that I referred to in [1] are 
>> getting complicated due to those b-node cycles (I hope that Aidan, if 
>> he reads this, agrees with me). They would be way simpler if this 
>> restriction was in place.

It depends on what we mean by "simpler" I guess.

Here's a three-line description of an algorithm that gives a canonical 
labelling of blank nodes for any (unrestricted) RDF graph G:

1) Count the unique blank nodes in G; call the result k
2) Define a fresh set of blank nodes B := { _:b1, ..., _:bk }
3) Find the (one-to-one) renaming of blank nodes in G to B that gives 
the lowest RDF graph in some order*. This is the canonical graph of G.

* Since we are fixing the blank node labels, we could define such an 
order as: serialise each graph as canonical N-Triples, apply UNIX sort, 
and then compare the graphs byte-by-byte (just to give a concrete idea).
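To make that concrete, here is a toy Python sketch of the three steps 
(the graph-as-set-of-triples representation is mine, and for brevity it 
compares sorted tuple lists rather than actual N-Triples 
serialisations):

```python
from itertools import permutations

def canonicalize(graph):
    """Brute-force canonical labelling of blank nodes.

    graph: a set of (s, p, o) triples; blank nodes are strings
    starting with '_:'. Tries every one-to-one renaming of the blank
    nodes to _:b1.._:bk and returns the lexicographically smallest
    sorted triple list -- the "lowest" graph in the chosen order.
    """
    def is_bnode(t):
        return isinstance(t, str) and t.startswith('_:')

    # 1) count the unique blank nodes in G; call the result k
    bnodes = sorted({t for tr in graph for t in tr if is_bnode(t)})
    # 2) define a fresh set of blank nodes _:b1 ... _:bk
    fresh = ['_:b%d' % (i + 1) for i in range(len(bnodes))]
    # 3) find the renaming that gives the lowest graph in the order
    best = None
    for perm in permutations(fresh):
        mapping = dict(zip(bnodes, perm))
        renamed = sorted(tuple(mapping.get(t, t) for t in tr)
                         for tr in graph)
        if best is None or renamed < best:
            best = renamed
    return best
```

Isomorphic graphs get the same canonical form: for example, 
{(_:x,:p,_:y), (_:y,:p,_:x)} and {(_:a,:p,_:b), (_:b,:p,_:a)} 
canonicalize identically. Of course this is O(k!) over k blank nodes; 
the point here is only simplicity and guaranteed termination, not 
speed.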

For me at least, this algorithm is pretty simple. It is also sound, 
complete and guaranteed to terminate on RDF graphs without restriction. 
In fact, this is probably still the simplest algorithm to define even in 
cases where you have restricted RDF graphs.

And even though it is brute force and inefficient, with some tweaks to 
how the order is defined (e.g., ordering in terms of the result of a 
hash rather than a byte-by-byte comparison), it can be made very 
efficient for most cases (and, arguably, all real-world cases).

So it is not necessarily simplicity that these restrictions would bring, 
nor efficiency as argued in [2,3]. (In terms of worst-case complexity, 
not having to deal with cycles does make life easier, because the worst 
cases have cycles; but many cases with cycles are also trivial.)
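To illustrate the hash tweak, here is a drastically simplified sketch 
(my own toy construction, not the algorithm of [2,3]): refine a hash 
per blank node from its neighbourhood; when the hashes come out 
distinct, which in practice they almost always do, sorting them gives 
the labelling immediately, and only the hash-ties, typically involving 
symmetric cycles, need the expensive permutation search.

```python
import hashlib

def hash_labelling(graph, rounds=3):
    """Assign each blank node a hash refined from its neighbourhood.

    graph: a set of (s, p, o) triples; blank nodes are strings
    starting with '_:'. Each round, a blank node's hash is recomputed
    from the predicates around it and its neighbours' previous
    hashes. Distinct hashes give an immediate canonical order;
    equal hashes (e.g. in a symmetric cycle) mark the ties that a
    full algorithm must still break.
    """
    def is_bnode(t):
        return isinstance(t, str) and t.startswith('_:')

    bnodes = {t for tr in graph for t in tr if is_bnode(t)}
    h = {b: '0' for b in bnodes}          # uniform initial hash
    for _ in range(rounds):
        new = {}
        for b in bnodes:
            parts = []
            for s, p, o in sorted(graph):
                if s == b:                # outgoing edge
                    parts.append(('out', p, h.get(o, o)))
                if o == b:                # incoming edge
                    parts.append(('in', p, h.get(s, s)))
            digest = hashlib.sha256(repr(sorted(parts)).encode())
            new[b] = digest.hexdigest()
        h = new
    return h
```

On two tree-shaped blank nodes with different contexts the hashes 
differ after one round; on the symmetric cycle {(_:x,:p,_:y), 
(_:y,:p,_:x)} they stay equal forever, which is exactly the kind of 
case where the worst-case behaviour hides.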


I think as a first step, we would need to be clearer on what practical 
problem we would be solving by adding such restrictions on RDF/blank nodes.

Cheers,
Aidan

[1] https://lists.w3.org/Archives/Public/semantic-web/2014Sep/0103.html
[2] http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
[3] http://aidanhogan.com/docs/rdf-canonicalisation.pdf
[4] http://aidanhogan.com/docs/bnodes.pdf
[5] http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf

Received on Monday, 26 November 2018 22:45:13 UTC