Re: Blank Nodes Re: Toward easier RDF: a proposal from David Booth on 2018-11-27 (semantic-web@w3.org from November 2018)

From: David Booth <david@dbooth.org>
Date: Mon, 26 Nov 2018 21:49:28 -0500
To: semantic-web@w3.org
Cc: Aiden Hogan <aidhog@gmail.com>
Message-ID: <d65aee6d-82a9-efab-4870-3ad65908952b@dbooth.org>
Hi Aiden,

Thanks for your insights!  Comments below . . .

On 11/26/18 5:44 PM, Aidan Hogan wrote:
> Interesting discussion!
> 
> I personally think the best way forward is to put aside dogma and do 
> useful things with the standards we've proposed ... to lead by example, 
> as the folks at Wikidata have been doing (for instance), and as people 
> have been doing in projects like Linking Open Data.
> 
> ... easier said than done of course.

Exactly.  And the whole point of this discussion is to make the RDF 
ecosystem easier to use.

> In any case, I guess I am best positioned to comment on the blank nodes 
> issue. The tl;dr version of my stance regarding blank nodes is *no 
> action required*. This stance is the polar opposite of what I had at the 
> start of the journey a few years back.
> 
> Comments below ...
> 
> On 22-11-2018 0:42, David Booth wrote:
>> On 11/21/18 6:59 PM, Thomas Passin wrote:
>>> the problem of blank nodes is that there seems to be no general way 
>>> to assign them ids in a reproducible way. 
>>
>> Currently, for arbitrary RDF with unrestricted blank nodes, that seems 
>> to be the case.  However, if we are willing to live with some very 
>> modest restrictions on blank node usage, such that there are no blank 
>> node cycles, then it does become possible to canonicalize RDF.  And 
>> that, in turn, leads to the possibility of assigning IDs in a 
>> reproducible way.  For example, a blank node that is used to connect 
>> the attribute of an n-ary relation could be replaced by a Skolem IRI 
>> that is constructed (recursively) from the entities that it relates.  
>> Also, conventions could require each n-ary relation to specify a 
>> (possibly composite) key.
> 
> It is possible (and I could even argue very "efficient") to assign ids 
> to blank nodes in a reproducible way without any further restrictions on 
> RDF. After seeing a similar discussion to this on the list about four 
> years ago [1], and having had a good bit of experience working with 
> blank nodes in various settings, I started working on precisely this 
> algorithm. Results over millions of RDF graphs crawled from the Web show 
> that "canonical labelling" of blank nodes (assigning IDs in a 
> reproducible way) takes milliseconds in almost all cases. The worst 
> real-world case took around 30 seconds, but it was "big" rather than 
> "hard" with 7.3 million triples/254 thousand blank nodes [2,3].
> 
> The papers also show that one can create more difficult synthetic cases, 
> but these are what I would call "adversarial cases", meaning that they 
> would only likely occur because an attacker has designed them. While it 
> is important to be aware that such adversarial cases might exist, 
> thereafter, I don't consider them to be any bigger of a problem in 
> practice than the idea of an "adversarial SPARQL query" sent to an 
> endpoint, for example. Such cases can be rejected, left to timeout, etc.

That sounds like an excellent technical achievement, but I do not find 
it very reassuring, for a couple of reasons:

1. Developers generally have *much* greater control over their SPARQL 
queries than they do over their data, so I don't think it's a good 
comparison.  *Very* few applications would open up an unrestricted 
SPARQL endpoint to the world.  But *many* applications use externally 
created data.  Externally supplied data normally must be sanitized, for 
safety, but it isn't obvious how an application could sanitize incoming 
RDF such that some blank nodes cycles are permitted but "adversarial" 
cycles are avoided.

2. It is *never* very reassuring to be told that some software that you 
are considering "usually" works.

"Don't worry, this new automated aircraft usually doesn't crash.  It is 
perfectly safe with most common flight path data.  And even if it 
receives unusual 'adversarial' flight path data it could be programmed 
to safely ground itself instead of crashing."

Caveats like that create fear, uncertainty and doubt rather than 
inspiring confidence.

> 
> This work has not been the only work on this topic, as Ivan mentions; 
> there's a number of practical solutions to this problem, some of which 
> were worked on before or in parallel with what I was doing ...
> 
> On 22-11-2018 14:08, Ivan Herman wrote<snip>
>>>> Some recent good progress on canonicalization: JSON-LD
>>>> https://json-ld.github.io/normalization/spec/ .  However, the
>>>> current JSON-LD canonicalization draft (called "normalization")
>>>> is focused only on the digital signatures use case, and
>>>> needs improvement to better address the diff use case, in
>>>> which small, localized graph changes should result in small,
>>>> localized differences in the canonicalized graph.
>>>>
>>>
>>> There has been some discussions around this lately. If you are 
>>> interested, look at:
>>>
>>> https://github.com/w3c/strategy/issues/116
>>>
>>> In particular (specific comments as well as links from those comments):
>>>
>>> https://github.com/w3c/strategy/issues/116#issuecomment-383875628
>>> https://github.com/w3c/strategy/issues/116#issuecomment-384160630
>>> https://github.com/w3c/strategy/issues/116#issuecomment-395791130
>>> https://github.com/w3c/strategy/issues/116#issuecomment-435920927
>>>
>>> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>> http://aidanhogan.com/docs/rdf-canonicalisation.pdf
>>> http://json-ld.github.io/normalization/spec/index.html
>>> https://github.com/iherman/canonical_rdf
>>> https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html
>>>
>>> It is still not clear how exactly we will move forward, but I have 
>>> some hopes that this will happen sometimes in 2019. It depends on the 
>>> availability of the people involved; the path to get this done is now 
>>> relatively clear.
>>>
>>> All that being said: David's point is well taken on blank nodes. If 
>>> there was no blank nodes around, it would be obvious. Looking at the 
>>> details of the two available solutions (see points above) it is also 
>>> true that there may be a middle ground: if the usage of blank nodes 
>>> was somehow restricted avoiding circular patterns. I *think* (but I 
>>> am not 100% sure) that if all blank nodes could be expressed by [] in 
>>> turtle without any need for explicit bnode identifiers then both 
>>> algorithms referred to above would become way simper.
> 
> Yep, things would become a bit simpler without cycles in blank nodes. 
> It's worth noting for context that maybe six or seven years back we 
> informally/briefly made such an observation/proposal [4; just before 
> Section 6.1], and David followed up on this with the Well Behaved RDF 
> proposal [5]. At the time obviously I was on board (up until 2015 when I 
> worked on [2] and changed my mind).
> 
> Now I am no longer convinced by the need for such restrictions, or at 
> least it seems like a case of premature optimisation. The strategy I 
> favour would be to not restrict RDF until we are clear that this would 
> be necessary, or indeed, that various practical tools/use-cases decide 
> to do this independently (and then maybe pave the cow-paths). We have 
> tools that work fine in all but some adversarial cases without having to 
> introduce restrictions, so I'm not sure what introducing 
> (overly-simplified?) restrictions would buy.

I think a significant source of complexity and confusion to *average* 
developers could be eliminated if:

(a) explicit blank nodes were eliminated from visible RDF, i.e., bnodes 
that cannot be expressed using Turtle/N3 square brackets [] would be 
prohibited; and

(b) predictable URIs were auto-generated for implicit blank nodes.

I do not doubt the effectiveness of the algorithms that you and others 
have devised, but I may be focused on different evaluation criteria than 
you.

When I proposed the no-explicit-blank-nodes restriction in my earlier 
Well Behaved RDF proposal[5], I was well aware that it was slightly more 
restrictive than it theoretically needed to be if the only goal were to 
avoid "adversarial" cases.  But the reason I proposed it that way is 
that it makes the rule *very* simple for users to understand -- so 
simple that they would not even have to be aware of the rule, because 
Turtle syntax could automatically prevent them from writing any RDF that 
could violate the prohibition.  *That* is the kind of simplicity that we 
should strive to achieve.

TimBL earlier pointed out that curly braces in JSON are directly 
analogous to implicit blank nodes in RDF.  But we do not have to teach 
developers to be careful not to write JSON that contains a blank node 
cycle, because the syntax of JSON already makes it impossible to create 
cycles.

In constrast, the mere existence of explicit blank nodes opens a can of 
worms in RDF: identifiers that are not really identifiers; duplicate 
triples; follow-up SPARQL queries that don't work; canonicalization 
algorithms that usually work fine, but *might* hang or barf on unusual 
data; etc.

All of this complexity can just disappear if explicit blank nodes are 
eliminated, with *minimal* loss of expressiveness.  I think the gains 
would far outweigh that loss.

> 
> Put another way, the algorithms for canonical labelling that I have seen 
> or worked on myself would probably be about as efficient (as makes no 
> difference) on cases where blank nodes have no cycles as dedicated 
> algorithms that assume such restrictions.
> 
> Plus there are now implementations of such algorithms for the 
> unrestricted case.
> 
>> On 26-11-2018 6:18, Ivan Herman wrote:>
>>>
>>>> On 26 Nov 2018, at 07:26, David Booth <david@dbooth.org I like this 
>>>> line of thought.  I would much rather have auto-generated URIs, that 
>>>> are predictable and distinguishable as auto-generated, than blank 
>>>> nodes.  And even better, those auto-generated URIs could be 
>>>> generated using a standard algorithm, so that all tools would 
>>>> generate them the same way.
>>>>
>>>> As Aiden Hogen et all point out in "Everything You Always Wanted to 
>>>> Know About Blank Nodes": "the vast majority of blank nodes form tree 
>>>> structures", i.e., they do not contain blank node cycles.
>>>> http://www.websemanticsjournal.org/index.php/ps/article/download/365/387 
>>>>
>>>>
>>>> If blank node cycles were prohibited in RDF, then predictable URIs 
>>>> could be automatically generated for those blank nodes, bottom-up 
>>>> recursively based on the tree structure.  And prohibiting blank node 
>>>> cycles would not be a huge loss, because even the few cases that do 
>>>> use blank node cycles could be brought into conformance by replacing 
>>>> a few of the blank nodes with URIs, to break the cycles.
>>>
>>> It also means that the problem of canonicalization/signature/etc 
>>> would become way easier. The algorithms that I referred to in[1] are 
>>> getting complicated due to those b-node cycles (I hope that Aiden, if 
>>> he reads this, agrees with me). They would be way simpler if this 
>>> restrictions was in place.
> 
> It depends on what we mean by "simpler" I guess.
> 
> Here's a three-line description of an algorithm that gives a canonical 
> labelling of blank nodes for any (unrestricted) RDF graph G:
> 
> 1) Count the unique blank nodes in G; call the result k
> 2) Define a fresh set of blank nodes B := { _:b1, ..., _:bk }
> 3) Find the (one-to-one) renaming of blank nodes in G to B that gives 
> the lowest RDF graph in some order*. This is the canonical graph of G.
> 
> * Since we are fixing the blank node labels, we could define such an 
> order as: serialise each graph as canonical N-Triples, apply UNIX sort, 
> and then compare the graphs byte-by-byte (just to give a concrete idea).

To quote the famous S. Harris cartoon,
https://i.pinimg.com/736x/bf/dc/ed/bfdcedf4eebcf6069d61264ea8fcc08c--math-jokes-math-humor.jpg
I think you should be more explicit here in step three.

It is not at all obvious how to efficiently perform the "renaming of 
blank nodes in G to B that gives the lowest RDF graph in some order" 
when blank node cycles are permitted, though I'm glad that you and a few 
others have devised ways that usually work efficiently.

In contrast, as Ivan Herman noted earlier, the task becomes trivial if 
we have a tree, and the algorithm *always* works efficiently.

> 
> For me at least, this algorithm is pretty simple. It is also sound, 
> complete and guaranteed to terminate on RDF graphs without restriction. 
> In fact, this is probably still the simplest algorithm to define even in 
> cases where you have restricted RDF graphs.
> 
> And even though it is brute force and inefficient, with some tweaks in 
> how you define the order (i.e., think about ordering in terms of the 
> result of a hash, rather than byte-by-byte comparison), it can be made 
> very efficient for most cases (and, arguably, all real-world cases).
> 
> So it is not necessarily simplicity that these restrictions would bring, 
> nor efficiency as argued in [2,3]. (In terms of worst-case complexity, 
> not having to deal with cycles does make life easier, because the worst 
> cases have cycles; but many cases with cycles are also trivial.)
> 
> 
> I think as a first step, we would need to be clearer on what practical 
> problem we would be solving by adding such restrictions on RDF/blank nodes.

The practical problem is that RDF is too hard for *average* developers, 
and blank node issues are a significant contributor.  Compare the 
simplicity of JSON.  It's a world of difference.

It is hard for very smart people to see why concepts that are simple to 
them are *not* so simple for others who have significantly less 
intellectual horsepower.  Typically it is not any one concept that makes 
  a subject too hard, but the totality of the interaction between 
multiple concepts, each with its own exceptions and caveats, that pushes 
the user over his/her mental threshold of "too hard".

Somehow we need to reduce the total complexity of the RDF ecosystem. 
Blank nodes certainly are not the only problem, but they are one 
significant contributor.  And, as experience has shown, we really don't 
*need* them -- at least not explicit blank nodes.

Thanks,
David Booth

> 
> Cheers,
> Aidan
> 
> [1] https://lists.w3.org/Archives/Public/semantic-web/2014Sep/0103.html
> [2] http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
> [3] http://aidanhogan.com/docs/rdf-canonicalisation.pdf
> [4] http://aidanhogan.com/docs/bnodes.pdf
> [5] http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf
> 
> 
>
Received on Tuesday, 27 November 2018 02:49:51 UTC