Re: Fwd: deterministic naming of blank nodes from Eric Korb on 2015-05-14 (public-credentials@w3.org from May 2015)

From: Eric Korb <eric.korb@accreditrust.com>
Date: Thu, 14 May 2015 08:03:28 -0400
To: Dave Longley <dlongley@digitalbazaar.com>
Cc: Gregg Kellogg <gregg@greggkellogg.com>, Credentials Community Group <public-credentials@w3.org>, ahogan@dcc.uchile.cl
Message-ID: <CAMX+RnDvcnYpvBDyOgzy3QhL80eS-p4wNirXuTsUhF=ACtJZNw@mail.gmail.com>
+1


*"Trust only credentials that are TrueCred*™ *verified."*
----------------------------------
Eric Korb, President/CEO - accreditrust.com <https://www.accreditrust.com>


On Wed, May 13, 2015 at 11:20 AM, Dave Longley <dlongley@digitalbazaar.com>
wrote:

>  On 05/13/2015 02:47 AM, Gregg Kellogg wrote:
>
> Worth looking at for a comparison.
>
>
> I think Aidan Hogan's paper actually describes a nearly identical
> algorithm to what we created. Which is a good thing. His paper does a much
> better job of laying it out in mathematical terms.
>
>
> Gregg Kellogg
>
>  Sent from my iPad
>
> Begin forwarded message:
>
>  *Resent-From:* semantic-web@w3.org
> *From:* ahogan@dcc.uchile.cl
> *Date:* May 12, 2015 at 11:14:50 PM PDT
> *To:* semantic-web@w3.org
> *Subject:* *Re: deterministic naming of blank nodes*
>
>   Hi,
>
> Sorry for resurrecting old threads but this discussion sort of prompted me
> to work on the topic of "deterministic naming of blank nodes" and I
> thought that it might be time (six months or so later) to report back on
> some findings.
>
> The result is the paper here:
>
> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
> WWW, Florence, Italy, May 18–22, 2015 (to appear).
> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>
> I'll be presenting it next Friday if anyone happens to be around the
> Florence area.
>
> Roughly speaking, the paper presents an algorithm for deterministically
> labelling blank nodes of any RDF graph such that the labels of two
> isomorphic RDF graphs will "correspond". This could be used to find
> isomorphic RDF graphs in a large set of RDF graphs (without needing
> pairwise checks), for computing signatures, or for Skolemising in such a
> way that the output of two RDF graphs will be equal if and only if they
> are isomorphic.
>
> The algorithm again works for general RDF graphs and doesn't assume any
> well-behavedness or similar. The algorithm is exponential (based roughly
> on a Nauty-style recursion) but the experiments show that, in terms of
> isomorphism at least, you would have to have really really really *really*
> weird RDF graphs to run into any real trouble (we're talking cliques of
> blank nodes with k>55 or Miyazki graphs or similar). We skolemised all the
> individual graphs in the BTC-14 collection ... the slowest took 31–40
> seconds, mainly due to its size: 7.3 million triples with 254 thousand
> blank nodes.
>
> One drawback of the current approach is that at the moment you have to
> load the RDF graph into memory, which may be prohibitive for very very
> large graphs (of connected blank nodes), but I'm not sure how often that
> happens in practice. (It wasn't a problem for any of the graphs in the
> BTC-14 dataset for example.)
>
> There's also a software package available here:
>
> https://github.com/aidhog/blabel/wiki
>
> Of course the usual disclaimers about research code apply!
>
> Anyways, I thought these results might be of interest and hopefully might
> have applications in the context of Skolemisation or signing RDF graphs or
> similar.
>
> Best,
> Aidan
>
> On 21/10/2014 04:30, Laurens Rietveld wrote:
>
> Coincidentally, I will present SampLD[1] in two hours at the ISWC (11.00
>
> local time, sala 300), which might be suitable for your experiment.
>
> SampLD is a scalable approach for sampling RDF graphs, solely using the
>
> topology of the graph:
>
>
>     The Linked Data cloud has grown to become the largest knowledge base
>
>     ever constructed.
>
>     Its size is now turning into a major bottleneck for many
>
>     applications. In order to facilitate access to this structured
>
>     information, this paper proposes an automatic sampling method
>
>     targeted at maximizing answer coverage for applications using SPARQL
>
>     querying.
>
>     The approach presented in this paper is novel: no similar RDF
>
>     sampling approach exist. Additionally, the concept of creating a
>
>     sample aimed at maximizing SPARQL answer coverage, is unique.
>
>     We empirically show that the relevance of triples for sampling (a
>
>     semantic notion) is influenced by the topology of the graph (purely
>
>     structural), and can be determined
>
>     without prior knowledge of the queries.
>
>     Experiments show a significantly higher recall of topology based
>
>     sampling methods over random and naive baseline approaches (e.g. up
>
>     to 90% for Open-BioMed at a sample size of 6%).
>
>
>
>  And, to support LOD-scale experiments, you might be interested in the
>
> LOD Laundromat[2,3] as well, presented this Thursday at ISWC as well.
>
> Here, we (re)publish clean, sanitized versions of Linked Data sets as
>
> sorted gzipped n-triples (13 billion triples and counting), with
>
> corresponding meta-data (e.g. the number of blank nodes in this
>
> dataset[4]).
>
>
>  Best, Laurens
>
>
>  [1]
>
>
> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
>
> [2]
>
>
> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
>
> [3] http://lodlaundromat.org
>
> [4]
>
>
> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100
> <http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC%28%3FnumBnodes%29+LIMIT+100>
>
> --
>
>
>  VU University Amsterdam____
>
>
>  Faculty of Exact Sciences____
>
>
>  Department of Computer Science____
>
>
>  De Boelelaan 1081 A____
>
>
>  1081 HV Amsterdam____
>
>
>  The Netherlands
>
>
>  www.laurensrietveld.nl <http://www.laurensrietveld.nl>
>
>
>  laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl
> <laurens.rietveld@vu.nl>>
>
>
>  Visiting address:
>
>
>  De Boelelaan 1081____
>
>
>  Science Building Room T312
>
>
>
>  On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org
>
> <mailto:sandro@w3.org <sandro@w3.org>>> wrote:
>
>
>     On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>
>  On 2014-10-07, Sandro Hawke wrote:
>
>
>   That may be true, but it is hard for me to see how any benefit this
>
>   could bring would outweigh the absolute pain in the ass it would be
>
>   for everyone to change their RDF stacks.
>
>
>     It was not me who said that.   That was Alan Ruttenberg.
>
>
>
>   So, why not subdivide the process? It ought to be easy and efficient
>
>  enough to detect a rather expansive subset of graphs which do admit
>
>  unique and efficient labeling. At the very least graphs which only
>
>  use
>
> a blank node precisely twice (to define something and to refer to it
>
>  once as in the bracket notation) are pretty simple, using a simple
>
>  hash table with counters -- that perhaps being the commonest case as
>
>  well.
>
>
>   If the test succeeds, define a unique labeling based on the rest of
>
>  the attributes of the triple and lexical ordering; if not, ask the
>
>  user whether general graph isomorphism comparison is wanted, and if
>
>  so, do that, somehow signaling that it really went that far (perhaps
>
>  inband in the format of the labels? or out of band as the case may
>
>  be); if not, or if you can't do graph isomorphism in your code, then
>
>  slap on nonunique labels, again differentiating them somehow from the
>
>  first two cases.
>
>
>   That is certainly not an easy or clean solution, but it doesn't break
>
>  the stack, and it works in most of the places where you want to do
>
>  fast path processing under the assumption that in fact the labels are
>
>  canonical, and can be relied upon to have 1-1 correpondence from
>
>  syntax to node.
>
>
>     I agree.   Does anyone have a good sampling of the LOD cloud we could
>
>     easily use for this experiment?
>
>
>            -- Sandro
>
>
>
>
>
>
> --
> Dave Longley
> CTO
> Digital Bazaar, Inc.http://digitalbazaar.com
>
>
Received on Thursday, 14 May 2015 12:04:17 UTC