- From: Laurens Rietveld <laurens.rietveld@vu.nl>
- Date: Tue, 21 Oct 2014 09:30:13 +0200
- To: Sandro Hawke <sandro@w3.org>
- CC: Sampo Syreeni <decoy@iki.fi>, Alan Ruttenberg <alanruttenberg@gmail.com>, David Booth <david@dbooth.org>, Tim Berners-Lee <timbl@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
- Message-ID: <CAKjXa4M8ZQgQpVZP1jrnjh-hAbzZRLZybsO77y6KtD7Hbd_MAw@mail.gmail.com>
Coincidentally, I will present SampLD[1] in two hours at ISWC (11.00 local time, sala 300), which might be suitable for your experiment. SampLD is a scalable approach for sampling RDF graphs, solely using the topology of the graph:

The Linked Data cloud has grown to become the largest knowledge base ever constructed. Its size is now turning into a major bottleneck for many applications. In order to facilitate access to this structured information, this paper proposes an automatic sampling method targeted at maximizing answer coverage for applications using SPARQL querying. The approach presented in this paper is novel: no similar RDF sampling approach exists. Additionally, the concept of creating a sample aimed at maximizing SPARQL answer coverage is unique. We empirically show that the relevance of triples for sampling (a semantic notion) is influenced by the topology of the graph (purely structural), and can be determined without prior knowledge of the queries. Experiments show a significantly higher recall of topology-based sampling methods over random and naive baseline approaches (e.g. up to 90% for Open-BioMed at a sample size of 6%).

And, to support LOD-scale experiments, you might also be interested in the LOD Laundromat[2,3], presented this Thursday at ISWC. Here, we (re)publish clean, sanitized versions of Linked Data sets as sorted, gzipped N-Triples (13 billion triples and counting), with corresponding metadata (e.g. the number of blank nodes in each dataset[4]).
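The link in [4] carries its SPARQL query URL-encoded in the `query` parameter, which makes it hard to read. A minimal sketch, using only the Python standard library, that decodes it back into plain SPARQL (the variable names are illustrative; the URL is copied from [4], and nothing is sent to the endpoint):

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from footnote [4]; split across lines only for readability.
FOOTNOTE_4_URL = (
    "http://lodlaundromat.org/sparql/?query="
    "PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0A"
    "PREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0A"
    "SELECT+*+WHERE+%7B%0A"
    "++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A"
    "++++%09void%3AdataDump+%3Fdownload%0A"
    "%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100"
)

# parse_qs turns '+' into spaces and resolves the %XX escapes.
query = parse_qs(urlsplit(FOOTNOTE_4_URL).query)["query"][0]
print(query)
```

Decoded, the query selects each dataset's distinct blank node count together with its `void:dataDump` download location, ordered by descending blank node count and limited to 100 rows.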
Best,
Laurens

[1] http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
[2] http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
[3] http://lodlaundromat.org
[4] http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100

-- 
VU University Amsterdam
Faculty of Exact Sciences
Department of Computer Science
De Boelelaan 1081 A
1081 HV Amsterdam
The Netherlands
www.laurensrietveld.nl
laurens.rietveld@vu.nl

Visiting address:
De Boelelaan 1081
Science Building Room T312

On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org> wrote:

> On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
> > On 2014-10-07, Sandro Hawke wrote:
> >
> >> That may be true, but it is hard for me to see how any benefit this
> >> could bring would outweigh the absolute pain in the ass it would be
> >> for everyone to change their RDF stacks.
>
> It was not me who said that. That was Alan Ruttenberg.
>
> > So, why not subdivide the process? It ought to be easy and efficient
> > enough to detect a rather expansive subset of graphs which do admit
> > unique and efficient labeling. At the very least graphs which only use
> > a blank node precisely twice (to define something and to refer to it
> > once as in the bracket notation) are pretty simple, using a simple
> > hash table with counters -- that perhaps being the commonest case as
> > well.
> >
> > If the test succeeds, define a unique labeling based on the rest of
> > the attributes of the triple and lexical ordering; if not, ask the
> > user whether general graph isomorphism comparison is wanted, and if
> > so, do that, somehow signaling that it really went that far (perhaps
> > inband in the format of the labels? or out of band as the case may
> > be); if not, or if you can't do graph isomorphism in your code, then
> > slap on nonunique labels, again differentiating them somehow from the
> > first two cases.
> >
> > That is certainly not an easy or clean solution, but it doesn't break
> > the stack, and it works in most of the places where you want to do
> > fast path processing under the assumption that in fact the labels are
> > canonical, and can be relied upon to have 1-1 correspondence from
> > syntax to node.
>
> I agree. Does anyone have a good sampling of the LOD cloud we could
> easily use for this experiment?
>
> -- Sandro
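
The fast-path test from the quoted proposal (every blank node is used precisely twice, checked with a hash table of counters) might look roughly like the sketch below. It is only an illustration under stated assumptions, not anyone's actual implementation: triples are plain `(subject, predicate, object)` string tuples, blank nodes are terms starting with `_:`, and the function names are made up for this example.

```python
from collections import Counter

def is_blank(term):
    # Assumption: blank nodes are written N-Triples style, e.g. "_:b0".
    return term.startswith("_:")

def blank_node_counts(triples):
    """Count how often each blank node occurs as subject or object."""
    counts = Counter()
    for s, _p, o in triples:
        for term in (s, o):
            if is_blank(term):
                counts[term] += 1
    return counts

def on_fast_path(triples):
    """True if every blank node occurs exactly twice (the simple case)."""
    return all(n == 2 for n in blank_node_counts(triples).values())

# One blank node, defined once and referred to once.
simple = [
    ("<http://example.org/a>", "<http://example.org/knows>", "_:b0"),
    ("_:b0", "<http://example.org/name>", '"Bob"'),
]
# The same blank node referred to a second time: no longer the simple case.
shared = simple + [
    ("<http://example.org/c>", "<http://example.org/knows>", "_:b0"),
]

print(on_fast_path(simple))  # True
print(on_fast_path(shared))  # False
```

A graph that fails this check would fall through to the later steps of the proposal (full isomorphism-based labeling, or non-unique labels), which the sketch does not attempt.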
Received on Tuesday, 21 October 2014 07:30:49 UTC