- From: Gregg Kellogg <greggkellogg@gmail.com>
- Date: Wed, 20 May 2015 12:37:42 -0700
- To: ahogan@dcc.uchile.cl
- Cc: W3C Semantic Web IG <semantic-web@w3.org>, David Longley <dlongley@digitalbazaar.com>
Just so readers of this thread are aware, the W3C Credentials Community Group has been fostering an RDF Dataset Normalization specification [1] which was previously initiated as part of JSON-LD, although it concerns itself exclusively with RDF Datasets and canonical N-Quads. The goal is to use this as a basis for a future Recommendation. The algorithm was created by Dave Longley, and I understand that it is substantially the same as Aidan’s (it would be good to have an independent analysis). The specification needs a lot of work to motivate specific use cases and create normative language for describing the particular branches necessary to resolve isomorphism ambiguities and challenging graph topographies. There is a fair test suite already created (see [1]). Ideally, Aidan’s paper could form the basis of a formal description of the algorithm, with the existing algorithms described in an appendix, to the degree not required for interoperability. I’ve just completed my own implementation in Ruby [2] to join other implementations in Python and JavaScript [3], [4]. We would welcome feedback and collaborators, as the expertise required for this is outside of the normal domain for the Credentials Community Group. Gregg [1] http://json-ld.github.io/normalization/ [2] https://github.com/ruby-rdf/rdf-normalize [3] https://github.com/digitalbazaar/pyld [4] https://github.com/digitalbazaar/jsonld.js > On May 12, 2015, at 11:14 PM, ahogan@dcc.uchile.cl wrote: > > Hi, > > Sorry for resurrecting old threads but this discussion sort of prompted me > to work on the topic of "deterministic naming of blank nodes" and I > thought that it might be time (six months or so later) to report back on > some findings. > > The result is the paper here: > > Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In > WWW, Florence, Italy, May 1822, 2015 (to appear). > Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf > > I'll be presenting it next Friday if anyone happens to be around the > Florence area. > > Roughly speaking, the paper presents an algorithm for deterministically > labelling blank nodes of any RDF graph such that the labels of two > isomorphic RDF graphs will "correspond". This could be used to find > isomorphic RDF graphs in a large set of RDF graphs (without needing > pairwise checks), for computing signatures, or for Skolemising in such a > way that the output of two RDF graphs will be equal if and only if they > are isomorphic. > > The algorithm again works for general RDF graphs and doesn't assume any > well-behavedness or similar. The algorithm is exponential (based roughly > on a Nauty-style recursion) but the experiments show that, in terms of > isomorphism at least, you would have to have really really really *really* > weird RDF graphs to run into any real trouble (we're talking cliques of > blank nodes with k>55 or Miyazki graphs or similar). We skolemised all the > individual graphs in the BTC-14 collection ... the slowest took 3140 > seconds, mainly due to its size: 7.3 million triples with 254 thousand > blank nodes. > > One drawback of the current approach is that at the moment you have to > load the RDF graph into memory, which may be prohibitive for very very > large graphs (of connected blank nodes), but I'm not sure how often that > happens in practice. (It wasn't a problem for any of the graphs in the > BTC-14 dataset for example.) > > There's also a software package available here: > > https://github.com/aidhog/blabel/wiki > > Of course the usual disclaimers about research code apply! > > Anyways, I thought these results might be of interest and hopefully might > have applications in the context of Skolemisation or signing RDF graphs or > similar. > > Best, > Aidan > > On 21/10/2014 04:30, Laurens Rietveld wrote: >> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00 >> local time, sala 300), which might be suitable for your experiment. >> SampLD is a scalable approach for sampling RDF graphs, solely using the >> topology of the graph: >> >> The Linked Data cloud has grown to become the largest knowledge base >> ever constructed. >> Its size is now turning into a major bottleneck for many >> applications. In order to facilitate access to this structured >> information, this paper proposes an automatic sampling method >> targeted at maximizing answer coverage for applications using SPARQL >> querying. >> The approach presented in this paper is novel: no similar RDF >> sampling approach exist. Additionally, the concept of creating a >> sample aimed at maximizing SPARQL answer coverage, is unique. >> We empirically show that the relevance of triples for sampling (a >> semantic notion) is influenced by the topology of the graph (purely >> structural), and can be determined >> without prior knowledge of the queries. >> Experiments show a significantly higher recall of topology based >> sampling methods over random and naive baseline approaches (e.g. up >> to 90% for Open-BioMed at a sample size of 6%). >> >> >> And, to support LOD-scale experiments, you might be interested in the >> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well. >> Here, we (re)publish clean, sanitized versions of Linked Data sets as >> sorted gzipped n-triples (13 billion triples and counting), with >> corresponding meta-data (e.g. the number of blank nodes in this > dataset[4]). >> >> Best, Laurens >> >> [1] >> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf >> [2] >> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf >> [3] http://lodlaundromat.org >> [4] >> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100 >> -- >> >> VU University Amsterdam____ >> >> Faculty of Exact Sciences____ >> >> Department of Computer Science____ >> >> De Boelelaan 1081 A____ >> >> 1081 HV Amsterdam____ >> >> The Netherlands >> >> www.laurensrietveld.nl <http://www.laurensrietveld.nl> >> >> laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl> >> >> Visiting address: >> >> De Boelelaan 1081____ >> >> Science Building Room T312 >> >> >> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org >> <mailto:sandro@w3.org>> wrote: >> >> On 10/07/2014 10:25 PM, Sampo Syreeni wrote: >>> On 2014-10-07, Sandro Hawke wrote: >>> >>>> That may be true, but it is hard for me to see how any benefit this >>>> could bring would outweigh the absolute pain in the ass it would be >>>> for everyone to change their RDF stacks. >> >> It was not me who said that. That was Alan Ruttenberg. >> >>> >>> So, why not subdivide the process? It ought to be easy and efficient >>> enough to detect a rather expansive subset of graphs which do admit >>> unique and efficient labeling. At the very least graphs which only > use >>> a blank node precisely twice (to define something and to refer to it >>> once as in the bracket notation) are pretty simple, using a simple >>> hash table with counters -- that perhaps being the commonest case as >>> well. >>> >>> If the test succeeds, define a unique labeling based on the rest of >>> the attributes of the triple and lexical ordering; if not, ask the >>> user whether general graph isomorphism comparison is wanted, and if >>> so, do that, somehow signaling that it really went that far (perhaps >>> inband in the format of the labels? or out of band as the case may >>> be); if not, or if you can't do graph isomorphism in your code, then >>> slap on nonunique labels, again differentiating them somehow from the >>> first two cases. >>> >>> That is certainly not an easy or clean solution, but it doesn't break >>> the stack, and it works in most of the places where you want to do >>> fast path processing under the assumption that in fact the labels are >>> canonical, and can be relied upon to have 1-1 correpondence from >>> syntax to node. >> >> I agree. Does anyone have a good sampling of the LOD cloud we could >> easily use for this experiment? >> >> -- Sandro > > >
Received on Wednesday, 20 May 2015 19:38:14 UTC