Re: Fwd: deterministic naming of blank nodes from Dave Longley on 2015-05-13 (public-credentials@w3.org from May 2015)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Wed, 13 May 2015 11:20:17 -0400
To: Gregg Kellogg <gregg@greggkellogg.com>, Credentials Community Group <public-credentials@w3.org>, ahogan@dcc.uchile.cl
Message-ID: <55536BB1.7010505@digitalbazaar.com>
On 05/13/2015 02:47 AM, Gregg Kellogg wrote:
> Worth looking at for a comparison.

I think Aidan Hogan's paper actually describes a nearly identical 
algorithm to what we created. Which is a good thing. His paper does a 
much better job of laying it out in mathematical terms.

>
> Gregg Kellogg
>
> Sent from my iPad
>
> Begin forwarded message:
>
>> *Resent-From:* semantic-web@w3.org <mailto:semantic-web@w3.org>
>> *From:* ahogan@dcc.uchile.cl <mailto:ahogan@dcc.uchile.cl>
>> *Date:* May 12, 2015 at 11:14:50 PM PDT
>> *To:* semantic-web@w3.org <mailto:semantic-web@w3.org>
>> *Subject:* *Re: deterministic naming of blank nodes*
>>
>> Hi,
>>
>> Sorry for resurrecting old threads but this discussion sort of 
>> prompted me
>> to work on the topic of "deterministic naming of blank nodes" and I
>> thought that it might be time (six months or so later) to report back on
>> some findings.
>>
>> The result is the paper here:
>>
>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>
>> I'll be presenting it next Friday if anyone happens to be around the
>> Florence area.
>>
>> Roughly speaking, the paper presents an algorithm for deterministically
>> labelling blank nodes of any RDF graph such that the labels of two
>> isomorphic RDF graphs will "correspond". This could be used to find
>> isomorphic RDF graphs in a large set of RDF graphs (without needing
>> pairwise checks), for computing signatures, or for Skolemising in such a
>> way that the output of two RDF graphs will be equal if and only if they
>> are isomorphic.
>>
>> The algorithm again works for general RDF graphs and doesn't assume any
>> well-behavedness or similar. The algorithm is exponential (based roughly
>> on a Nauty-style recursion) but the experiments show that, in terms of
>> isomorphism at least, you would have to have really really really 
>> *really*
>> weird RDF graphs to run into any real trouble (we're talking cliques of
>> blank nodes with k>55 or Miyazki graphs or similar). We skolemised 
>> all the
>> individual graphs in the BTC-14 collection ... the slowest took 31–40
>> seconds, mainly due to its size: 7.3 million triples with 254 thousand
>> blank nodes.
>>
>> One drawback of the current approach is that at the moment you have to
>> load the RDF graph into memory, which may be prohibitive for very very
>> large graphs (of connected blank nodes), but I'm not sure how often that
>> happens in practice. (It wasn't a problem for any of the graphs in the
>> BTC-14 dataset for example.)
>>
>> There's also a software package available here:
>>
>> https://github.com/aidhog/blabel/wiki
>>
>> Of course the usual disclaimers about research code apply!
>>
>> Anyways, I thought these results might be of interest and hopefully might
>> have applications in the context of Skolemisation or signing RDF 
>> graphs or
>> similar.
>>
>> Best,
>> Aidan
>>
>> On 21/10/2014 04:30, Laurens Rietveld wrote:
>>> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
>>> local time, sala 300), which might be suitable for your experiment.
>>> SampLD is a scalable approach for sampling RDF graphs, solely using the
>>> topology of the graph:
>>>
>>>    The Linked Data cloud has grown to become the largest knowledge base
>>>    ever constructed.
>>>    Its size is now turning into a major bottleneck for many
>>>    applications. In order to facilitate access to this structured
>>>    information, this paper proposes an automatic sampling method
>>>    targeted at maximizing answer coverage for applications using SPARQL
>>>    querying.
>>>    The approach presented in this paper is novel: no similar RDF
>>>    sampling approach exist. Additionally, the concept of creating a
>>>    sample aimed at maximizing SPARQL answer coverage, is unique.
>>>    We empirically show that the relevance of triples for sampling (a
>>>    semantic notion) is influenced by the topology of the graph (purely
>>>    structural), and can be determined
>>>    without prior knowledge of the queries.
>>>    Experiments show a significantly higher recall of topology based
>>>    sampling methods over random and naive baseline approaches (e.g. up
>>>    to 90% for Open-BioMed at a sample size of 6%).
>>>
>>>
>>> And, to support LOD-scale experiments, you might be interested in the
>>> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well.
>>> Here, we (re)publish clean, sanitized versions of Linked Data sets as
>>> sorted gzipped n-triples (13 billion triples and counting), with
>>> corresponding meta-data (e.g. the number of blank nodes in this
>> dataset[4]).
>>>
>>> Best, Laurens
>>>
>>> [1]
>>> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
>>> [2]
>>> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
>>> [3] http://lodlaundromat.org
>>> [4]
>>> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100 
>>> <http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC%28%3FnumBnodes%29+LIMIT+100>
>>> --
>>>
>>> VU University Amsterdam____
>>>
>>> Faculty of Exact Sciences____
>>>
>>> Department of Computer Science____
>>>
>>> De Boelelaan 1081 A____
>>>
>>> 1081 HV Amsterdam____
>>>
>>> The Netherlands
>>>
>>> www.laurensrietveld.nl <http://www.laurensrietveld.nl> 
>>> <http://www.laurensrietveld.nl>
>>>
>>> laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl> 
>>> <mailto:laurens.rietveld@vu.nl>
>>>
>>> Visiting address:
>>>
>>> De Boelelaan 1081____
>>>
>>> Science Building Room T312
>>>
>>>
>>> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org 
>>> <mailto:sandro@w3.org>
>>> <mailto:sandro@w3.org>> wrote:
>>>
>>>    On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>>>> On 2014-10-07, Sandro Hawke wrote:
>>>>
>>>>> That may be true, but it is hard for me to see how any benefit this
>>>>> could bring would outweigh the absolute pain in the ass it would be
>>>>> for everyone to change their RDF stacks.
>>>
>>>    It was not me who said that.   That was Alan Ruttenberg.
>>>
>>>>
>>>> So, why not subdivide the process? It ought to be easy and efficient
>>>> enough to detect a rather expansive subset of graphs which do admit
>>>> unique and efficient labeling. At the very least graphs which only
>> use
>>>> a blank node precisely twice (to define something and to refer to it
>>>> once as in the bracket notation) are pretty simple, using a simple
>>>> hash table with counters -- that perhaps being the commonest case as
>>>> well.
>>>>
>>>> If the test succeeds, define a unique labeling based on the rest of
>>>> the attributes of the triple and lexical ordering; if not, ask the
>>>> user whether general graph isomorphism comparison is wanted, and if
>>>> so, do that, somehow signaling that it really went that far (perhaps
>>>> inband in the format of the labels? or out of band as the case may
>>>> be); if not, or if you can't do graph isomorphism in your code, then
>>>> slap on nonunique labels, again differentiating them somehow from the
>>>> first two cases.
>>>>
>>>> That is certainly not an easy or clean solution, but it doesn't break
>>>> the stack, and it works in most of the places where you want to do
>>>> fast path processing under the assumption that in fact the labels are
>>>> canonical, and can be relied upon to have 1-1 correpondence from
>>>> syntax to node.
>>>
>>>    I agree.   Does anyone have a good sampling of the LOD cloud we could
>>>    easily use for this experiment?
>>>
>>>           -- Sandro
>>
>>
>>


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Wednesday, 13 May 2015 15:20:45 UTC