Re: deterministic naming of blank nodes from Laurens Rietveld on 2014-10-21 (semantic-web@w3.org from October 2014)

From: Laurens Rietveld <laurens.rietveld@vu.nl>
Date: Tue, 21 Oct 2014 09:30:13 +0200
To: Sandro Hawke <sandro@w3.org>
CC: Sampo Syreeni <decoy@iki.fi>, Alan Ruttenberg <alanruttenberg@gmail.com>, David Booth <david@dbooth.org>, Tim Berners-Lee <timbl@w3.org>, SW-forum Web <semantic-web@w3.org>, Pat Hayes <phayes@ihmc.us>
Message-ID: <CAKjXa4M8ZQgQpVZP1jrnjh-hAbzZRLZybsO77y6KtD7Hbd_MAw@mail.gmail.com>

Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
local time, sala 300), which might be suitable for your experiment. SampLD
is a scalable approach for sampling RDF graphs, solely using the topology
of the graph:

The Linked Data cloud has grown to become the largest knowledge base ever
constructed.
Its size is now turning into a major bottleneck for many applications. In
order to facilitate access to this structured information, this paper
proposes an automatic sampling method targeted at maximizing answer
coverage for applications using SPARQL querying.
The approach presented in this paper is novel: no similar RDF sampling
approach exist. Additionally, the concept of creating a sample aimed at
maximizing SPARQL answer coverage, is unique.
We empirically show that the relevance of triples for sampling (a semantic
notion) is influenced by the topology of the graph (purely structural), and
can be determined
without prior knowledge of the queries.
Experiments show a significantly higher recall of topology based sampling
methods over random and naive baseline approaches (e.g. up to 90% for
Open-BioMed at a sample size of 6%).

And, to support LOD-scale experiments, you might be interested in the LOD
Laundromat[2,3] as well, presented this Thursday at ISWC as well. Here, we
(re)publish clean, sanitized versions of Linked Data sets as sorted gzipped
n-triples (13 billion triples and counting), with corresponding meta-data
(e.g. the number of blank nodes in this dataset[4]).

Best, Laurens

[1]
http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
[2]
http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
[3] http://lodlaundromat.org
[4]
http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100
-- 

VU University Amsterdam

Faculty of Exact Sciences

Department of Computer Science

De Boelelaan 1081 A

1081 HV Amsterdam

The Netherlands

www.laurensrietveld.nl
laurens.rietveld@vu.nl

Visiting address:

De Boelelaan 1081

Science Building Room T312

On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org> wrote:

> On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
> > On 2014-10-07, Sandro Hawke wrote:
> >
> >> That may be true, but it is hard for me to see how any benefit this
> >> could bring would outweigh the absolute pain in the ass it would be
> >> for everyone to change their RDF stacks.
>
> It was not me who said that.   That was Alan Ruttenberg.
>
> >
> > So, why not subdivide the process? It ought to be easy and efficient
> > enough to detect a rather expansive subset of graphs which do admit
> > unique and efficient labeling. At the very least graphs which only use
> > a blank node precisely twice (to define something and to refer to it
> > once as in the bracket notation) are pretty simple, using a simple
> > hash table with counters -- that perhaps being the commonest case as
> > well.
> >
> > If the test succeeds, define a unique labeling based on the rest of
> > the attributes of the triple and lexical ordering; if not, ask the
> > user whether general graph isomorphism comparison is wanted, and if
> > so, do that, somehow signaling that it really went that far (perhaps
> > inband in the format of the labels? or out of band as the case may
> > be); if not, or if you can't do graph isomorphism in your code, then
> > slap on nonunique labels, again differentiating them somehow from the
> > first two cases.
> >
> > That is certainly not an easy or clean solution, but it doesn't break
> > the stack, and it works in most of the places where you want to do
> > fast path processing under the assumption that in fact the labels are
> > canonical, and can be relied upon to have 1-1 correpondence from
> > syntax to node.
>
> I agree.   Does anyone have a good sampling of the LOD cloud we could
> easily use for this experiment?
>
>       -- Sandro
>
>
>

-- 

VU University Amsterdam

Faculty of Exact Sciences

Department of Computer Science

De Boelelaan 1081 A

1081 HV Amsterdam

The Netherlands

www.laurensrietveld.nl
laurens.rietveld@vu.nl

Visiting address:

De Boelelaan 1081

Science Building Room T312

Received on Tuesday, 21 October 2014 07:30:49 UTC