- From: Dave Longley <dlongley@digitalbazaar.com>
- Date: Wed, 13 May 2015 11:20:17 -0400
- To: Gregg Kellogg <gregg@greggkellogg.com>, Credentials Community Group <public-credentials@w3.org>, ahogan@dcc.uchile.cl
- Message-ID: <55536BB1.7010505@digitalbazaar.com>
On 05/13/2015 02:47 AM, Gregg Kellogg wrote: > Worth looking at for a comparison. I think Aidan Hogan's paper actually describes a nearly identical algorithm to what we created. Which is a good thing. His paper does a much better job of laying it out in mathematical terms. > > Gregg Kellogg > > Sent from my iPad > > Begin forwarded message: > >> *Resent-From:* semantic-web@w3.org <mailto:semantic-web@w3.org> >> *From:* ahogan@dcc.uchile.cl <mailto:ahogan@dcc.uchile.cl> >> *Date:* May 12, 2015 at 11:14:50 PM PDT >> *To:* semantic-web@w3.org <mailto:semantic-web@w3.org> >> *Subject:* *Re: deterministic naming of blank nodes* >> >> Hi, >> >> Sorry for resurrecting old threads but this discussion sort of >> prompted me >> to work on the topic of "deterministic naming of blank nodes" and I >> thought that it might be time (six months or so later) to report back on >> some findings. >> >> The result is the paper here: >> >> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In >> WWW, Florence, Italy, May 18–22, 2015 (to appear). >> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf >> >> I'll be presenting it next Friday if anyone happens to be around the >> Florence area. >> >> Roughly speaking, the paper presents an algorithm for deterministically >> labelling blank nodes of any RDF graph such that the labels of two >> isomorphic RDF graphs will "correspond". This could be used to find >> isomorphic RDF graphs in a large set of RDF graphs (without needing >> pairwise checks), for computing signatures, or for Skolemising in such a >> way that the output of two RDF graphs will be equal if and only if they >> are isomorphic. >> >> The algorithm again works for general RDF graphs and doesn't assume any >> well-behavedness or similar. The algorithm is exponential (based roughly >> on a Nauty-style recursion) but the experiments show that, in terms of >> isomorphism at least, you would have to have really really really >> *really* >> weird RDF graphs to run into any real trouble (we're talking cliques of >> blank nodes with k>55 or Miyazki graphs or similar). We skolemised >> all the >> individual graphs in the BTC-14 collection ... the slowest took 31–40 >> seconds, mainly due to its size: 7.3 million triples with 254 thousand >> blank nodes. >> >> One drawback of the current approach is that at the moment you have to >> load the RDF graph into memory, which may be prohibitive for very very >> large graphs (of connected blank nodes), but I'm not sure how often that >> happens in practice. (It wasn't a problem for any of the graphs in the >> BTC-14 dataset for example.) >> >> There's also a software package available here: >> >> https://github.com/aidhog/blabel/wiki >> >> Of course the usual disclaimers about research code apply! >> >> Anyways, I thought these results might be of interest and hopefully might >> have applications in the context of Skolemisation or signing RDF >> graphs or >> similar. >> >> Best, >> Aidan >> >> On 21/10/2014 04:30, Laurens Rietveld wrote: >>> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00 >>> local time, sala 300), which might be suitable for your experiment. >>> SampLD is a scalable approach for sampling RDF graphs, solely using the >>> topology of the graph: >>> >>> The Linked Data cloud has grown to become the largest knowledge base >>> ever constructed. >>> Its size is now turning into a major bottleneck for many >>> applications. In order to facilitate access to this structured >>> information, this paper proposes an automatic sampling method >>> targeted at maximizing answer coverage for applications using SPARQL >>> querying. >>> The approach presented in this paper is novel: no similar RDF >>> sampling approach exist. Additionally, the concept of creating a >>> sample aimed at maximizing SPARQL answer coverage, is unique. >>> We empirically show that the relevance of triples for sampling (a >>> semantic notion) is influenced by the topology of the graph (purely >>> structural), and can be determined >>> without prior knowledge of the queries. >>> Experiments show a significantly higher recall of topology based >>> sampling methods over random and naive baseline approaches (e.g. up >>> to 90% for Open-BioMed at a sample size of 6%). >>> >>> >>> And, to support LOD-scale experiments, you might be interested in the >>> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well. >>> Here, we (re)publish clean, sanitized versions of Linked Data sets as >>> sorted gzipped n-triples (13 billion triples and counting), with >>> corresponding meta-data (e.g. the number of blank nodes in this >> dataset[4]). >>> >>> Best, Laurens >>> >>> [1] >>> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf >>> [2] >>> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf >>> [3] http://lodlaundromat.org >>> [4] >>> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100 >>> <http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC%28%3FnumBnodes%29+LIMIT+100> >>> -- >>> >>> VU University Amsterdam____ >>> >>> Faculty of Exact Sciences____ >>> >>> Department of Computer Science____ >>> >>> De Boelelaan 1081 A____ >>> >>> 1081 HV Amsterdam____ >>> >>> The Netherlands >>> >>> www.laurensrietveld.nl <http://www.laurensrietveld.nl> >>> <http://www.laurensrietveld.nl> >>> >>> laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl> >>> <mailto:laurens.rietveld@vu.nl> >>> >>> Visiting address: >>> >>> De Boelelaan 1081____ >>> >>> Science Building Room T312 >>> >>> >>> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org >>> <mailto:sandro@w3.org> >>> <mailto:sandro@w3.org>> wrote: >>> >>> On 10/07/2014 10:25 PM, Sampo Syreeni wrote: >>>> On 2014-10-07, Sandro Hawke wrote: >>>> >>>>> That may be true, but it is hard for me to see how any benefit this >>>>> could bring would outweigh the absolute pain in the ass it would be >>>>> for everyone to change their RDF stacks. >>> >>> It was not me who said that. That was Alan Ruttenberg. >>> >>>> >>>> So, why not subdivide the process? It ought to be easy and efficient >>>> enough to detect a rather expansive subset of graphs which do admit >>>> unique and efficient labeling. At the very least graphs which only >> use >>>> a blank node precisely twice (to define something and to refer to it >>>> once as in the bracket notation) are pretty simple, using a simple >>>> hash table with counters -- that perhaps being the commonest case as >>>> well. >>>> >>>> If the test succeeds, define a unique labeling based on the rest of >>>> the attributes of the triple and lexical ordering; if not, ask the >>>> user whether general graph isomorphism comparison is wanted, and if >>>> so, do that, somehow signaling that it really went that far (perhaps >>>> inband in the format of the labels? or out of band as the case may >>>> be); if not, or if you can't do graph isomorphism in your code, then >>>> slap on nonunique labels, again differentiating them somehow from the >>>> first two cases. >>>> >>>> That is certainly not an easy or clean solution, but it doesn't break >>>> the stack, and it works in most of the places where you want to do >>>> fast path processing under the assumption that in fact the labels are >>>> canonical, and can be relied upon to have 1-1 correpondence from >>>> syntax to node. >>> >>> I agree. Does anyone have a good sampling of the LOD cloud we could >>> easily use for this experiment? >>> >>> -- Sandro >> >> >> -- Dave Longley CTO Digital Bazaar, Inc. http://digitalbazaar.com
Received on Wednesday, 13 May 2015 15:20:45 UTC