Re: deterministic naming of blank nodes

Just so readers of this thread are aware, the W3C Credentials Community Group has been fostering an RDF Dataset Normalization specification [1], which was originally initiated as part of the JSON-LD work, although it now concerns itself exclusively with RDF Datasets and canonical N-Quads. The goal is to use this as the basis for a future W3C Recommendation.

The algorithm was created by Dave Longley, and I understand that it is substantially the same as Aidan’s (an independent analysis confirming this would be valuable). The specification still needs a lot of work to motivate specific use cases and to provide normative language describing the particular branches necessary to resolve isomorphism ambiguities and challenging graph topologies. A fair-sized test suite has already been created (see [1]). Ideally, Aidan’s paper could form the basis of a formal description of the algorithm, with the existing algorithm text moved to an appendix to the degree that it is not required for interoperability.

I’ve just completed my own implementation in Ruby [2] to join other implementations in Python and JavaScript [3], [4].
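For anyone who wants to experiment, here is a minimal sketch of calling the Python implementation [3]; the input document is purely illustrative, and exact option names may vary between releases:

    # Minimal sketch using pyld [3]; the document below is illustrative
    # and option names may differ between releases.
    from pyld import jsonld

    doc = {
        "@context": {"name": "http://schema.org/name"},
        # Two blank nodes: normalization must label them
        # deterministically regardless of input order.
        "@graph": [{"name": "Alice"}, {"name": "Bob"}]
    }

    # Produces canonical N-Quads with deterministic blank node
    # labels (_:c14n0, _:c14n1, ...).
    print(jsonld.normalize(
        doc, {"algorithm": "URDNA2015", "format": "application/n-quads"}))

The Ruby [2] and JavaScript [4] implementations expose the same algorithm through their own entry points.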

We would welcome feedback and collaborators, as the expertise required for this work lies outside the normal domain of the Credentials Community Group.

Gregg

[1] http://json-ld.github.io/normalization/
[2] https://github.com/ruby-rdf/rdf-normalize
[3] https://github.com/digitalbazaar/pyld
[4] https://github.com/digitalbazaar/jsonld.js

> On May 12, 2015, at 11:14 PM, ahogan@dcc.uchile.cl wrote:
> 
> Hi,
> 
> Sorry for resurrecting old threads but this discussion sort of prompted me
> to work on the topic of "deterministic naming of blank nodes" and I
> thought that it might be time (six months or so later) to report back on
> some findings.
> 
> The result is the paper here:
> 
> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
> WWW, Florence, Italy, May 18–22, 2015 (to appear).
> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
> 
> I'll be presenting it next Friday if anyone happens to be around the
> Florence area.
> 
> Roughly speaking, the paper presents an algorithm for deterministically
> labelling blank nodes of any RDF graph such that the labels of two
> isomorphic RDF graphs will "correspond". This could be used to find
> isomorphic RDF graphs in a large set of RDF graphs (without needing
> pairwise checks), for computing signatures, or for Skolemising in such a
> way that the output of two RDF graphs will be equal if and only if they
> are isomorphic.
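To make the "no pairwise checks" use case concrete, here is a rough Python sketch; canonicalise() is a placeholder for any deterministic labelling such as the one in the paper:

    import hashlib
    from collections import defaultdict

    def canonical_hash(graph):
        # canonicalise() is a placeholder for any deterministic
        # blank-node labelling; it should return the graph's triples
        # as serialized strings with canonical labels applied.
        lines = sorted(canonicalise(graph))
        return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

    def group_isomorphic(graphs):
        # One linear pass: isomorphic graphs land in the same bucket,
        # so no pairwise isomorphism checks are needed.
        buckets = defaultdict(list)
        for g in graphs:
            buckets[canonical_hash(g)].append(g)
        return list(buckets.values())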
> 
> The algorithm again works for general RDF graphs and doesn't assume any
> well-behavedness or similar. It is exponential in the worst case (based
> roughly on a Nauty-style recursion), but the experiments show that, in
> terms of isomorphism at least, you would have to have really really really
> *really* weird RDF graphs to run into any real trouble (we're talking
> cliques of blank nodes with k>55 or Miyazaki graphs or similar). We
> skolemised all the
> individual graphs in the BTC-14 collection ... the slowest took 31–40
> seconds, mainly due to its size: 7.3 million triples with 254 thousand
> blank nodes.
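For intuition, the shared refinement core that such algorithms build on can be sketched in a few lines of Python. This is not the paper's algorithm, only the iterative hashing idea; it is precisely the symmetric cases where this loop stalls (cliques and the like) that force a Nauty-style recursion:

    import hashlib

    def refine_colours(triples):
        # Colour-refinement sketch only; the paper's algorithm adds a
        # Nauty-style recursion for blank nodes this loop cannot split.
        # triples: iterable of (s, p, o) strings; blank nodes start "_:".
        bnode = lambda t: t.startswith("_:")
        nodes = {t for s, p, o in triples for t in (s, o) if bnode(t)}
        colour = {b: "" for b in nodes}

        def partition(c):
            # Which nodes share a colour (the hash values themselves
            # change every round, so compare the induced partition).
            inv = {}
            for b, col in c.items():
                inv.setdefault(col, []).append(b)
            return sorted(sorted(v) for v in inv.values())

        for _ in range(len(nodes) + 1):
            new = {}
            for b in nodes:
                # Hash the incident triples, with the far end of each
                # edge replaced by its current colour.
                edges = sorted(
                    ("out", p, colour.get(o, o)) if s == b
                    else ("in", p, colour.get(s, s))
                    for s, p, o in triples if b in (s, o))
                new[b] = hashlib.sha256(repr(edges).encode()).hexdigest()
            if partition(new) == partition(colour):
                return new  # stable: colours give deterministic labels
            colour = new
        return colour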
> 
> One drawback of the current approach is that at the moment you have to
> load the RDF graph into memory, which may be prohibitive for very very
> large graphs (of connected blank nodes), but I'm not sure how often that
> happens in practice. (It wasn't a problem for any of the graphs in the
> BTC-14 dataset for example.)
> 
> There's also a software package available here:
> 
> https://github.com/aidhog/blabel/wiki
> 
> Of course the usual disclaimers about research code apply!
> 
> Anyways, I thought these results might be of interest and hopefully might
> have applications in the context of Skolemisation or signing RDF graphs or
> similar.
> 
> Best,
> Aidan
> 
> On 21/10/2014 04:30, Laurens Rietveld wrote:
>> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
>> local time, sala 300), which might be suitable for your experiment.
>> SampLD is a scalable approach for sampling RDF graphs, solely using the
>> topology of the graph:
>> 
>>    The Linked Data cloud has grown to become the largest knowledge
>>    base ever constructed. Its size is now turning into a major
>>    bottleneck for many applications. In order to facilitate access to
>>    this structured information, this paper proposes an automatic
>>    sampling method targeted at maximizing answer coverage for
>>    applications using SPARQL querying. The approach presented in this
>>    paper is novel: no similar RDF sampling approach exists.
>>    Additionally, the concept of creating a sample aimed at maximizing
>>    SPARQL answer coverage is unique. We empirically show that the
>>    relevance of triples for sampling (a semantic notion) is influenced
>>    by the topology of the graph (purely structural), and can be
>>    determined without prior knowledge of the queries. Experiments show
>>    a significantly higher recall for topology-based sampling methods
>>    over random and naive baseline approaches (e.g. up to 90% for
>>    Open-BioMed at a sample size of 6%).
>> 
>> 
>> And, to support LOD-scale experiments, you might be interested in the
>> LOD Laundromat [2,3], also presented this Thursday at ISWC. Here, we
>> (re)publish clean, sanitized versions of Linked Data sets as sorted,
>> gzipped N-Triples (13 billion triples and counting), with
>> corresponding metadata (e.g. the number of blank nodes in each
>> dataset [4]).
>> 
>> Best, Laurens
>> 
>> [1]
>> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
>> [2]
>> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
>> [3] http://lodlaundromat.org
>> [4]
>> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100
>> --
>> 
>> VU University Amsterdam
>> Faculty of Exact Sciences
>> Department of Computer Science
>> De Boelelaan 1081 A
>> 1081 HV Amsterdam
>> The Netherlands
>> 
>> www.laurensrietveld.nl
>> laurens.rietveld@vu.nl
>> 
>> Visiting address:
>> De Boelelaan 1081
>> Science Building Room T312
>> 
>> 
>> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org
>> <mailto:sandro@w3.org>> wrote:
>> 
>>    On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>>> On 2014-10-07, Sandro Hawke wrote:
>>> 
>>>> That may be true, but it is hard for me to see how any benefit this
>>>> could bring would outweigh the absolute pain in the ass it would be
>>>> for everyone to change their RDF stacks.
>> 
>>    It was not me who said that.   That was Alan Ruttenberg.
>> 
>>> 
>>> So, why not subdivide the process? It ought to be easy and efficient
>>> enough to detect a rather expansive subset of graphs which do admit
>>> unique and efficient labeling. At the very least graphs which only
>>> use
>>> a blank node precisely twice (to define something and to refer to it
>>> once as in the bracket notation) are pretty simple, using a simple
>>> hash table with counters -- that perhaps being the commonest case as
>>> well.
>>> 
>>> If the test succeeds, define a unique labeling based on the rest of
>>> the attributes of the triple and lexical ordering; if not, ask the
>>> user whether general graph isomorphism comparison is wanted, and if
>>> so, do that, somehow signaling that it really went that far (perhaps
>>> inband in the format of the labels? or out of band as the case may
>>> be); if not, or if you can't do graph isomorphism in your code, then
>>> slap on nonunique labels, again differentiating them somehow from the
>>> first two cases.
>>> 
>>> That is certainly not an easy or clean solution, but it doesn't break
>>> the stack, and it works in most of the places where you want to do
>>> fast path processing under the assumption that in fact the labels are
>>> canonical, and can be relied upon to have a 1-1 correspondence from
>>> syntax to node.
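A rough Python sketch of that fast path (the labelling scheme here is illustrative, and a real implementation would treat literals and ordering ties more carefully):

    from collections import Counter

    def try_fast_labels(triples):
        # Fast path: if every blank node occurs exactly twice, order the
        # nodes by the ground parts of their triples and number them.
        # Returns None when the general machinery (or non-unique labels,
        # suitably flagged) is needed instead.
        bnode = lambda t: t.startswith("_:")
        counts = Counter(
            t for s, p, o in triples for t in (s, o) if bnode(t))
        if any(c != 2 for c in counts.values()):
            return None

        def key(b):
            # Sort key built from predicates plus any ground
            # subject/object in the node's two triples.
            return sorted(
                (p, "" if bnode(o) else o) if s == b
                else (p, "" if bnode(s) else s)
                for s, p, o in triples if b in (s, o))

        ordered = sorted(counts, key=key)
        keys = [key(b) for b in ordered]
        if len(set(map(repr, keys))) != len(keys):
            return None  # ambiguous ordering: fall back as well
        return {b: "_:canon%d" % i for i, b in enumerate(ordered)}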
>> 
>>    I agree.   Does anyone have a good sampling of the LOD cloud we could
>>    easily use for this experiment?
>> 
>>           -- Sandro
> 
> 
> 

Received on Wednesday, 20 May 2015 19:38:14 UTC