- From: Aidan Hogan <aidhog@gmail.com>
- Date: Mon, 26 Nov 2018 19:44:47 -0300
- To: SW-forum <semantic-web@w3.org>
Interesting discussion! I personally think the best way forward is to
put aside dogma and do useful things with the standards we've proposed
... to lead by example, as the folks at Wikidata have been doing (for
instance), and as people have been doing in projects like Linking Open
Data. ... easier said than done of course.

In any case, I guess I am best positioned to comment on the blank nodes
issue. The tl;dr version of my stance regarding blank nodes is *no
action required*. This stance is the polar opposite of what I had at
the start of the journey a few years back. Comments below ...

On 22-11-2018 0:42, David Booth wrote:
> On 11/21/18 6:59 PM, Thomas Passin wrote:
>> the problem of blank nodes is that there seems to be no general way
>> to assign them ids in a reproducible way.
>
> Currently, for arbitrary RDF with unrestricted blank nodes, that seems
> to be the case. However, if we are willing to live with some very
> modest restrictions on blank node usage, such that there are no blank
> node cycles, then it does become possible to canonicalize RDF. And
> that, in turn, leads to the possibility of assigning IDs in a
> reproducible way. For example, a blank node that is used to connect
> the attributes of an n-ary relation could be replaced by a Skolem IRI
> that is constructed (recursively) from the entities that it relates.
> Also, conventions could require each n-ary relation to specify a
> (possibly composite) key.

It is possible (and I could even argue very "efficient") to assign ids
to blank nodes in a reproducible way without any further restrictions
on RDF. After seeing a similar discussion to this on the list about
four years ago [1], and having had a good bit of experience working
with blank nodes in various settings, I started working on precisely
this algorithm. Results over millions of RDF graphs crawled from the
Web show that "canonical labelling" of blank nodes (assigning IDs in a
reproducible way) takes milliseconds in almost all cases. The worst
real-world case took around 30 seconds, but it was "big" rather than
"hard", with 7.3 million triples / 254 thousand blank nodes [2,3].

The papers also show that one can create more difficult synthetic
cases, but these are what I would call "adversarial cases", meaning
that they would only likely occur because an attacker has designed
them. While it is important to be aware that such adversarial cases
might exist, I don't consider them to be any bigger a problem in
practice than the idea of an "adversarial SPARQL query" sent to an
endpoint, for example. Such cases can be rejected, left to time out,
etc.

This has not been the only work on the topic; as Ivan mentions, there
are a number of practical solutions to this problem, some of which were
worked on before or in parallel with what I was doing ...
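To give a rough flavour of why real-world cases are so fast, here is a
toy Python sketch of the hash-based partition-refinement idea that such
solutions build on. To be clear, this is an illustration only, not the
actual algorithm of [2,3]: it assumes a toy representation of a graph
as a set of (s, p, o) string tuples with blank nodes prefixed "_:", and
it punts on the "hard" graphs where distinct blank nodes remain
mutually indistinguishable (which need the extra branching machinery
described in the papers).

import hashlib

def h(*parts):
    # order-dependent hash of string parts
    return hashlib.sha256("\u0000".join(parts).encode()).hexdigest()

def is_bnode(term):
    return term.startswith("_:")

def refine(graph):
    # start with all blank nodes in one class, then iteratively re-hash
    # each one over its incident triples until the partition stabilises
    colour = {t: "0" for s, p, o in graph for t in (s, o) if is_bnode(t)}
    while True:
        new = {}
        for b in colour:
            sigs = []
            for s, p, o in graph:
                if s == b:
                    sigs.append(h("out", p, colour[o] if is_bnode(o) else o))
                if o == b:
                    sigs.append(h("in", p, colour[s] if is_bnode(s) else s))
            new[b] = h(colour[b], *sorted(sigs))
        if len(set(new.values())) == len(set(colour.values())):
            return new  # no class split this round, so none ever will be
        colour = new

def label(graph):
    # assumes refine() separated all blank nodes ("easy" graphs only);
    # ties would need the branching machinery of [2,3]
    colour = refine(graph)
    return {b: "_:b%d" % (i + 1)
            for i, b in enumerate(sorted(colour, key=colour.get))}

On tree-shaped and other "easy" graphs (i.e., the vast majority in
practice), the partition stabilises after a handful of rounds with each
blank node in its own class, and the sorted hashes directly yield
reproducible labels.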
On 22-11-2018 14:08, Ivan Herman wrote:
<snip>
>>> Some recent good progress on canonicalization: JSON-LD
>>> https://json-ld.github.io/normalization/spec/ . However, the
>>> current JSON-LD canonicalization draft (called "normalization")
>>> is focused only on the digital signatures use case, and
>>> needs improvement to better address the diff use case, in
>>> which small, localized graph changes should result in small,
>>> localized differences in the canonicalized graph.
>>
>> There has been some discussion around this lately. If you are
>> interested, look at:
>>
>> https://github.com/w3c/strategy/issues/116
>>
>> In particular (specific comments as well as links from those
>> comments):
>>
>> https://github.com/w3c/strategy/issues/116#issuecomment-383875628
>> https://github.com/w3c/strategy/issues/116#issuecomment-384160630
>> https://github.com/w3c/strategy/issues/116#issuecomment-395791130
>> https://github.com/w3c/strategy/issues/116#issuecomment-435920927
>>
>> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>> http://aidanhogan.com/docs/rdf-canonicalisation.pdf
>> http://json-ld.github.io/normalization/spec/index.html
>> https://github.com/iherman/canonical_rdf
>> https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html
>>
>> It is still not clear how exactly we will move forward, but I have
>> some hope that this will happen sometime in 2019. It depends on the
>> availability of the people involved; the path to get this done is
>> now relatively clear.
>>
>> All that being said: David's point is well taken on blank nodes. If
>> there were no blank nodes around, it would be obvious. Looking at
>> the details of the two available solutions (see points above), it is
>> also true that there may be a middle ground: if the usage of blank
>> nodes were somehow restricted, avoiding circular patterns. I *think*
>> (but I am not 100% sure) that if all blank nodes could be expressed
>> by [] in Turtle without any need for explicit bnode identifiers,
>> then both algorithms referred to above would become way simpler.

Yep, things would become a bit simpler without cycles in blank nodes.

It's worth noting for context that maybe six or seven years back we
informally/briefly made such an observation/proposal [4; just before
Section 6.1], and David followed up on this with the Well Behaved RDF
proposal [5]. At the time, obviously, I was on board (up until 2015,
when I worked on [2] and changed my mind).

Now I am no longer convinced of the need for such restrictions, or at
least it seems like a case of premature optimisation. The strategy I
would favour is not to restrict RDF until we are clear that this is
necessary, or indeed until various practical tools/use-cases decide to
do this independently (and then maybe pave the cow-paths). We have
tools that work fine in all but some adversarial cases without having
to introduce restrictions, so I'm not sure what introducing
(overly-simplified?) restrictions would buy us. Put another way, the
algorithms for canonical labelling that I have seen or worked on myself
would probably be about as efficient (as makes no difference) on cases
where blank nodes have no cycles as dedicated algorithms that assume
such restrictions. Plus there are now implementations of such
algorithms for the unrestricted case.

On 26-11-2018 6:18, Ivan Herman wrote:
>>> On 26 Nov 2018, at 07:26, David Booth <david@dbooth.org> wrote:
>>>
>>> I like this line of thought. I would much rather have
>>> auto-generated URIs, that are predictable and distinguishable as
>>> auto-generated, than blank nodes. And even better, those
>>> auto-generated URIs could be generated using a standard algorithm,
>>> so that all tools would generate them the same way.
>>>
>>> As Aidan Hogan et al. point out in "Everything You Always Wanted to
>>> Know About Blank Nodes": "the vast majority of blank nodes form
>>> tree structures", i.e., they do not contain blank node cycles.
>>> http://www.websemanticsjournal.org/index.php/ps/article/download/365/387
>>>
>>> If blank node cycles were prohibited in RDF, then predictable URIs
>>> could be automatically generated for those blank nodes, bottom-up
>>> recursively based on the tree structure. And prohibiting blank node
>>> cycles would not be a huge loss, because even the few cases that do
>>> use blank node cycles could be brought into conformance by
>>> replacing a few of the blank nodes with URIs, to break the cycles.
>>
>> It also means that the problem of canonicalization/signature/etc.
>> would become way easier. The algorithms that I referred to in [1]
>> are getting complicated due to those b-node cycles (I hope that
>> Aidan, if he reads this, agrees with me). They would be way simpler
>> if this restriction were in place.

It depends on what we mean by "simpler", I guess. Here's a three-line
description of an algorithm that gives a canonical labelling of blank
nodes for any (unrestricted) RDF graph G:

1) Count the unique blank nodes in G; call the result k.
2) Define a fresh set of blank nodes B := { _:b1, ..., _:bk }.
3) Find the (one-to-one) renaming of blank nodes in G to B that gives
the lowest RDF graph in some order*. This is the canonical graph of G.

* Since we are fixing the blank node labels, we could define such an
order as: serialise each graph as canonical N-Triples, apply UNIX sort,
and then compare the graphs byte-by-byte (just to give a concrete
idea).
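Just to make that concrete, a deliberately naive Python sketch of the
above (same toy graph representation as before, with terms assumed to
be already serialised as N-Triples strings) might look like this:

from itertools import permutations

def canonicalise(graph):
    # graph: set of (s, p, o) tuples, terms already serialised as
    # N-Triples strings, blank nodes prefixed "_:"
    bnodes = sorted({t for s, p, o in graph
                     for t in (s, o) if t.startswith("_:")})
    # steps 1 + 2: k blank nodes, fresh labels _:b1 ... _:bk
    fresh = ["_:b%d" % (i + 1) for i in range(len(bnodes))]
    best = None
    # step 3: try every one-to-one renaming, keep the lowest graph
    for perm in permutations(fresh):
        rename = dict(zip(bnodes, perm))
        # analogue of "serialise as N-Triples and apply UNIX sort"
        lines = sorted("%s %s %s ." % (rename.get(s, s), p,
                                       rename.get(o, o))
                       for s, p, o in graph)
        if best is None or lines < best:  # byte-by-byte comparison
            best = lines
    return "\n".join(best)

Of course, with k blank nodes this explores k! renamings, so it is
purely an existence proof of a simple, sound and complete procedure;
the hash-based ordering tweak mentioned just below is what makes such a
search efficient in practice.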
For me at least, this algorithm is pretty simple. It is also sound,
complete, and guaranteed to terminate on RDF graphs without
restriction. In fact, it is probably still the simplest algorithm to
define even in cases where you have restricted RDF graphs. And even
though it is brute force and inefficient, with some tweaks to how you
define the order (i.e., thinking about ordering in terms of the result
of a hash, rather than byte-by-byte comparison), it can be made very
efficient for most cases (and, arguably, all real-world cases).

So it is not necessarily simplicity that these restrictions would
bring, nor efficiency, as argued in [2,3]. (In terms of worst-case
complexity, not having to deal with cycles does make life easier,
because the worst cases have cycles; but many cases with cycles are
also trivial.)

I think, as a first step, we would need to be clearer on what practical
problem we would be solving by adding such restrictions on RDF/blank
nodes.
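For contrast, and at the risk of putting words in David's mouth, the
bottom-up Skolemisation he describes for the tree-only case might look
something like the following toy sketch (the names and the base IRI are
mine; the /.well-known/genid/ path is just the RDF 1.1 convention for
Skolem IRIs). Note that two structurally identical sibling blank nodes
would collide here, which is presumably where David's (possibly
composite) keys for n-ary relations would come in:

import hashlib

def skolemise_trees(graph,
                    base="http://example.org/.well-known/genid/"):
    # graph: set of (s, p, o) string tuples, blank nodes prefixed "_:";
    # assumes blank nodes form trees via their outgoing triples
    # (no cycles, or the recursion below would never bottom out)
    out = {}
    bnodes = set()
    for s, p, o in graph:
        for t in (s, o):
            if t.startswith("_:"):
                bnodes.add(t)
        if s.startswith("_:"):
            out.setdefault(s, []).append((p, o))

    iri = {}

    def skolem(b):
        # mint an IRI from the (recursively Skolemised) triples under b
        if b not in iri:
            parts = sorted((p, skolem(o) if o.startswith("_:") else o)
                           for p, o in out.get(b, []))
            digest = hashlib.sha256(repr(parts).encode()).hexdigest()
            iri[b] = base + digest
        return iri[b]

    for b in bnodes:
        skolem(b)
    # rewrite the graph with Skolem IRIs in place of blank nodes
    return {(iri.get(s, s), p, iri.get(o, o)) for s, p, o in graph}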
Cheers,
Aidan

[1] https://lists.w3.org/Archives/Public/semantic-web/2014Sep/0103.html
[2] http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
[3] http://aidanhogan.com/docs/rdf-canonicalisation.pdf
[4] http://aidanhogan.com/docs/bnodes.pdf
[5] http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf

Received on Monday, 26 November 2018 22:45:13 UTC