Re: deterministic naming of blank nodes from Gregg Kellogg on 2015-05-26 (semantic-web@w3.org from May 2015)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Tue, 26 May 2015 14:14:40 -0700
To: Aidan Hogan <ahogan@dcc.uchile.cl>
Cc: W3C Semantic Web IG <semantic-web@w3.org>, David Longley <dlongley@digitalbazaar.com>
Message-Id: <9666C66D-B4D2-486F-9328-7AE08DDCD066@greggkellogg.net>
> On May 26, 2015, at 2:00 PM, Aidan Hogan <ahogan@dcc.uchile.cl> wrote:
> 
> Hey Gregg, David,
> 
> It's good to see that there's some ongoing work in the W3C on this topic!
> 
> I had a quick look though the proposals in question.
> 
> http://json-ld.github.io/normalization/spec/index.html
> 
> Hmm ... but to be honest, I could only get a very high level understanding of the algorithms in 4.4.2/4.8.2. It seems on a high-level like the algorithms are similar to what's in the paper but I can't tell for sure ... maybe a more formal type of pseudo-code would be better for presenting these algorithms? (Indeed I can definitely appreciate it's challenging to get this sort of stuff on paper and keep it readable.) From the gist though, both seem to go in the same direction.

Aidan, thanks for taking a look. Yes, the sections need to be written in terms of normative requirements, and there needs to be theoretical aspects to discuss the behavior at a higher level. As a practical consideration, the existing algorithm specification (or something similar) may be required to do an implementation and interoperate, but they should be in the form of a non-normative section or appendix, IMO.

Dave did a better analysis comparing the two, and he should be able to provide better illumination on how the algorithms match up against your paper.

We also need some example walk-throughs for different graph shapes to indicate the behavior of the algorithm, IMO. Much still to do.

Gregg

> Cheers,
> Aidan
> 
> On 20/05/2015 16:37, Gregg Kellogg wrote:
>> Just so readers of this thread are aware, the W3C Credentials Community Group has been fostering an RDF Dataset Normalization specification [1] which was previously initiated as part of JSON-LD, although it concerns itself exclusively with RDF Datasets and canonical N-Quads. The goal is to use this as a basis for a future Recommendation.
>> 
>> The algorithm was created by Dave Longley, and I understand that it is substantially the same as Aidan’s (it would be good to have an independent analysis). The specification needs a lot of work to motivate specific use cases and create normative language for describing the particular branches necessary to resolve isomorphism ambiguities and challenging graph topographies. There is a fair test suite already created (see [1]). Ideally, Aidan’s paper could form the basis of a formal description of the algorithm, with the existing algorithms described in an appendix, to the degree not required for interoperability.
>> 
>> I’ve just completed my own implementation in Ruby [2] to join other implementations in Python and JavaScript [3], [4].
>> 
>> We would welcome feedback and collaborators, as the expertise required for this is outside of the normal domain for the Credentials Community Group.
>> 
>> Gregg
>> 
>> [1] http://json-ld.github.io/normalization/
>> [2] https://github.com/ruby-rdf/rdf-normalize
>> [3] https://github.com/digitalbazaar/pyld
>> [4] https://github.com/digitalbazaar/jsonld.js
>> 
>>> On May 12, 2015, at 11:14 PM, ahogan@dcc.uchile.cl wrote:
>>> 
>>> Hi,
>>> 
>>> Sorry for resurrecting old threads but this discussion sort of prompted me
>>> to work on the topic of "deterministic naming of blank nodes" and I
>>> thought that it might be time (six months or so later) to report back on
>>> some findings.
>>> 
>>> The result is the paper here:
>>> 
>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>>> WWW, Florence, Italy, May 1822, 2015 (to appear).
>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>> 
>>> I'll be presenting it next Friday if anyone happens to be around the
>>> Florence area.
>>> 
>>> Roughly speaking, the paper presents an algorithm for deterministically
>>> labelling blank nodes of any RDF graph such that the labels of two
>>> isomorphic RDF graphs will "correspond". This could be used to find
>>> isomorphic RDF graphs in a large set of RDF graphs (without needing
>>> pairwise checks), for computing signatures, or for Skolemising in such a
>>> way that the output of two RDF graphs will be equal if and only if they
>>> are isomorphic.
>>> 
>>> The algorithm again works for general RDF graphs and doesn't assume any
>>> well-behavedness or similar. The algorithm is exponential (based roughly
>>> on a Nauty-style recursion) but the experiments show that, in terms of
>>> isomorphism at least, you would have to have really really really *really*
>>> weird RDF graphs to run into any real trouble (we're talking cliques of
>>> blank nodes with k>55 or Miyazki graphs or similar). We skolemised all the
>>> individual graphs in the BTC-14 collection ... the slowest took 3140
>>> seconds, mainly due to its size: 7.3 million triples with 254 thousand
>>> blank nodes.
>>> 
>>> One drawback of the current approach is that at the moment you have to
>>> load the RDF graph into memory, which may be prohibitive for very very
>>> large graphs (of connected blank nodes), but I'm not sure how often that
>>> happens in practice. (It wasn't a problem for any of the graphs in the
>>> BTC-14 dataset for example.)
>>> 
>>> There's also a software package available here:
>>> 
>>> https://github.com/aidhog/blabel/wiki
>>> 
>>> Of course the usual disclaimers about research code apply!
>>> 
>>> Anyways, I thought these results might be of interest and hopefully might
>>> have applications in the context of Skolemisation or signing RDF graphs or
>>> similar.
>>> 
>>> Best,
>>> Aidan
>>> 
>>> On 21/10/2014 04:30, Laurens Rietveld wrote:
>>>> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
>>>> local time, sala 300), which might be suitable for your experiment.
>>>> SampLD is a scalable approach for sampling RDF graphs, solely using the
>>>> topology of the graph:
>>>> 
>>>>    The Linked Data cloud has grown to become the largest knowledge base
>>>>    ever constructed.
>>>>    Its size is now turning into a major bottleneck for many
>>>>    applications. In order to facilitate access to this structured
>>>>    information, this paper proposes an automatic sampling method
>>>>    targeted at maximizing answer coverage for applications using SPARQL
>>>>    querying.
>>>>    The approach presented in this paper is novel: no similar RDF
>>>>    sampling approach exist. Additionally, the concept of creating a
>>>>    sample aimed at maximizing SPARQL answer coverage, is unique.
>>>>    We empirically show that the relevance of triples for sampling (a
>>>>    semantic notion) is influenced by the topology of the graph (purely
>>>>    structural), and can be determined
>>>>    without prior knowledge of the queries.
>>>>    Experiments show a significantly higher recall of topology based
>>>>    sampling methods over random and naive baseline approaches (e.g. up
>>>>    to 90% for Open-BioMed at a sample size of 6%).
>>>> 
>>>> 
>>>> And, to support LOD-scale experiments, you might be interested in the
>>>> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well.
>>>> Here, we (re)publish clean, sanitized versions of Linked Data sets as
>>>> sorted gzipped n-triples (13 billion triples and counting), with
>>>> corresponding meta-data (e.g. the number of blank nodes in this
>>> dataset[4]).
>>>> 
>>>> Best, Laurens
>>>> 
>>>> [1]
>>>> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
>>>> [2]
>>>> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
>>>> [3] http://lodlaundromat.org
>>>> [4]
>>>> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100
>>>> --
>>>> 
>>>> VU University Amsterdam____
>>>> 
>>>> Faculty of Exact Sciences____
>>>> 
>>>> Department of Computer Science____
>>>> 
>>>> De Boelelaan 1081 A____
>>>> 
>>>> 1081 HV Amsterdam____
>>>> 
>>>> The Netherlands
>>>> 
>>>> www.laurensrietveld.nl <http://www.laurensrietveld.nl>
>>>> 
>>>> laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl>
>>>> 
>>>> Visiting address:
>>>> 
>>>> De Boelelaan 1081____
>>>> 
>>>> Science Building Room T312
>>>> 
>>>> 
>>>> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org
>>>> <mailto:sandro@w3.org>> wrote:
>>>> 
>>>>    On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>>>>> On 2014-10-07, Sandro Hawke wrote:
>>>>> 
>>>>>> That may be true, but it is hard for me to see how any benefit this
>>>>>> could bring would outweigh the absolute pain in the ass it would be
>>>>>> for everyone to change their RDF stacks.
>>>> 
>>>>    It was not me who said that.   That was Alan Ruttenberg.
>>>> 
>>>>> 
>>>>> So, why not subdivide the process? It ought to be easy and efficient
>>>>> enough to detect a rather expansive subset of graphs which do admit
>>>>> unique and efficient labeling. At the very least graphs which only
>>> use
>>>>> a blank node precisely twice (to define something and to refer to it
>>>>> once as in the bracket notation) are pretty simple, using a simple
>>>>> hash table with counters -- that perhaps being the commonest case as
>>>>> well.
>>>>> 
>>>>> If the test succeeds, define a unique labeling based on the rest of
>>>>> the attributes of the triple and lexical ordering; if not, ask the
>>>>> user whether general graph isomorphism comparison is wanted, and if
>>>>> so, do that, somehow signaling that it really went that far (perhaps
>>>>> inband in the format of the labels? or out of band as the case may
>>>>> be); if not, or if you can't do graph isomorphism in your code, then
>>>>> slap on nonunique labels, again differentiating them somehow from the
>>>>> first two cases.
>>>>> 
>>>>> That is certainly not an easy or clean solution, but it doesn't break
>>>>> the stack, and it works in most of the places where you want to do
>>>>> fast path processing under the assumption that in fact the labels are
>>>>> canonical, and can be relied upon to have 1-1 correpondence from
>>>>> syntax to node.
>>>> 
>>>>    I agree.   Does anyone have a good sampling of the LOD cloud we could
>>>>    easily use for this experiment?
>>>> 
>>>>           -- Sandro
>>> 
>>> 
>>> 
>>
Received on Tuesday, 26 May 2015 21:15:10 UTC