Re: deterministic naming of blank nodes

Just so readers of this thread are aware, the W3C Credentials Community Group has been fostering an RDF Dataset Normalization specification [1] which was previously initiated as part of JSON-LD, although it concerns itself exclusively with RDF Datasets and canonical N-Quads. The goal is to use this as a basis for a future Recommendation.

The algorithm was created by Dave Longley, and I understand that it is substantially the same as Aidan’s (it would be good to have an independent analysis). The specification needs a lot of work to motivate specific use cases and create normative language for describing the particular branches necessary to resolve isomorphism ambiguities and challenging graph topographies. There is a fair test suite already created (see [1]). Ideally, Aidan’s paper could form the basis of a formal description of the algorithm, with the existing algorithms described in an appendix, to the degree not required for interoperability.

I’ve just completed my own implementation in Ruby [2] to join other implementations in Python and JavaScript [3], [4].

We would welcome feedback and collaborators, as the expertise required for this is outside of the normal domain for the Credentials Community Group.



> On May 12, 2015, at 11:14 PM, wrote:
> Hi,
> Sorry for resurrecting old threads but this discussion sort of prompted me
> to work on the topic of "deterministic naming of blank nodes" and I
> thought that it might be time (six months or so later) to report back on
> some findings.
> The result is the paper here:
> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
> WWW, Florence, Italy, May 18–22, 2015 (to appear).
> Available from:
> I'll be presenting it next Friday if anyone happens to be around the
> Florence area.
> Roughly speaking, the paper presents an algorithm for deterministically
> labelling blank nodes of any RDF graph such that the labels of two
> isomorphic RDF graphs will "correspond". This could be used to find
> isomorphic RDF graphs in a large set of RDF graphs (without needing
> pairwise checks), for computing signatures, or for Skolemising in such a
> way that the output of two RDF graphs will be equal if and only if they
> are isomorphic.
> The algorithm again works for general RDF graphs and doesn't assume any
> well-behavedness or similar. The algorithm is exponential (based roughly
> on a Nauty-style recursion) but the experiments show that, in terms of
> isomorphism at least, you would have to have really really really *really*
> weird RDF graphs to run into any real trouble (we're talking cliques of
> blank nodes with k>55 or Miyazki graphs or similar). We skolemised all the
> individual graphs in the BTC-14 collection ... the slowest took 31–40
> seconds, mainly due to its size: 7.3 million triples with 254 thousand
> blank nodes.
> One drawback of the current approach is that at the moment you have to
> load the RDF graph into memory, which may be prohibitive for very very
> large graphs (of connected blank nodes), but I'm not sure how often that
> happens in practice. (It wasn't a problem for any of the graphs in the
> BTC-14 dataset for example.)
> There's also a software package available here:
> Of course the usual disclaimers about research code apply!
> Anyways, I thought these results might be of interest and hopefully might
> have applications in the context of Skolemisation or signing RDF graphs or
> similar.
> Best,
> Aidan
> On 21/10/2014 04:30, Laurens Rietveld wrote:
>> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
>> local time, sala 300), which might be suitable for your experiment.
>> SampLD is a scalable approach for sampling RDF graphs, solely using the
>> topology of the graph:
>>    The Linked Data cloud has grown to become the largest knowledge base
>>    ever constructed.
>>    Its size is now turning into a major bottleneck for many
>>    applications. In order to facilitate access to this structured
>>    information, this paper proposes an automatic sampling method
>>    targeted at maximizing answer coverage for applications using SPARQL
>>    querying.
>>    The approach presented in this paper is novel: no similar RDF
>>    sampling approach exist. Additionally, the concept of creating a
>>    sample aimed at maximizing SPARQL answer coverage, is unique.
>>    We empirically show that the relevance of triples for sampling (a
>>    semantic notion) is influenced by the topology of the graph (purely
>>    structural), and can be determined
>>    without prior knowledge of the queries.
>>    Experiments show a significantly higher recall of topology based
>>    sampling methods over random and naive baseline approaches (e.g. up
>>    to 90% for Open-BioMed at a sample size of 6%).
>> And, to support LOD-scale experiments, you might be interested in the
>> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well.
>> Here, we (re)publish clean, sanitized versions of Linked Data sets as
>> sorted gzipped n-triples (13 billion triples and counting), with
>> corresponding meta-data (e.g. the number of blank nodes in this
> dataset[4]).
>> Best, Laurens
>> [1]
>> [2]
>> [3]
>> [4]
>> --
>> VU University Amsterdam____
>> Faculty of Exact Sciences____
>> Department of Computer Science____
>> De Boelelaan 1081 A____
>> 1081 HV Amsterdam____
>> The Netherlands
>> <>
>> <>
>> Visiting address:
>> De Boelelaan 1081____
>> Science Building Room T312
>> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <
>> <>> wrote:
>>    On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>>> On 2014-10-07, Sandro Hawke wrote:
>>>> That may be true, but it is hard for me to see how any benefit this
>>>> could bring would outweigh the absolute pain in the ass it would be
>>>> for everyone to change their RDF stacks.
>>    It was not me who said that.   That was Alan Ruttenberg.
>>> So, why not subdivide the process? It ought to be easy and efficient
>>> enough to detect a rather expansive subset of graphs which do admit
>>> unique and efficient labeling. At the very least graphs which only
> use
>>> a blank node precisely twice (to define something and to refer to it
>>> once as in the bracket notation) are pretty simple, using a simple
>>> hash table with counters -- that perhaps being the commonest case as
>>> well.
>>> If the test succeeds, define a unique labeling based on the rest of
>>> the attributes of the triple and lexical ordering; if not, ask the
>>> user whether general graph isomorphism comparison is wanted, and if
>>> so, do that, somehow signaling that it really went that far (perhaps
>>> inband in the format of the labels? or out of band as the case may
>>> be); if not, or if you can't do graph isomorphism in your code, then
>>> slap on nonunique labels, again differentiating them somehow from the
>>> first two cases.
>>> That is certainly not an easy or clean solution, but it doesn't break
>>> the stack, and it works in most of the places where you want to do
>>> fast path processing under the assumption that in fact the labels are
>>> canonical, and can be relied upon to have 1-1 correpondence from
>>> syntax to node.
>>    I agree.   Does anyone have a good sampling of the LOD cloud we could
>>    easily use for this experiment?
>>           -- Sandro

Received on Wednesday, 20 May 2015 19:38:14 UTC