Re: deterministic naming of blank nodes from Dave Longley on 2015-05-27 (semantic-web@w3.org from May 2015)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Wed, 27 May 2015 15:48:27 -0400
To: Gregg Kellogg <gregg@greggkellogg.net>, Aidan Hogan <ahogan@dcc.uchile.cl>
CC: W3C Semantic Web IG <semantic-web@w3.org>
Message-ID: <55661F8B.7060804@digitalbazaar.com>
On 05/26/2015 05:14 PM, Gregg Kellogg wrote:
>
>> On May 26, 2015, at 2:00 PM, Aidan Hogan <ahogan@dcc.uchile.cl>
>> wrote:
>>
>> Hey Gregg, David,
>>
>> It's good to see that there's some ongoing work in the W3C on this
>> topic!
>>
>> I had a quick look though the proposals in question.
>>
>> http://json-ld.github.io/normalization/spec/index.html
>>
>> Hmm ... but to be honest, I could only get a very high level
>> understanding of the algorithms in 4.4.2/4.8.2. It seems on a
>> high-level like the algorithms are similar to what's in the paper
>> but I can't tell for sure ... maybe a more formal type of
>> pseudo-code would be better for presenting these algorithms?
>> (Indeed I can definitely appreciate it's challenging to get this
>> sort of stuff on paper and keep it readable.) From the gist
>> though, both seem to go in the same direction.

Aidan,

Here's my understanding of the similarities:

 From your paper:

"We assume a set of totally ordered colours C (in our case
hashes). Let Clr be a map from blank nodes to colours
computed, e.g., by the naive methods described previously.
If Clr maps all blank nodes in G to a unique colour, we
are done. However, as discussed, this may not always be
the case. Initial colourings may assign the same colour to
different blank nodes, forming a partition of blank nodes."

This is essentially the same concept as step 5 in our main
"Normalization Algorithm" [1]. We run bnodes through a "Hash First
Degree Quads" algorithm [2] to attempt to produce a unique hash
("color"). The information that is hashed doesn't take arbitrary
information like bnode names into consideration, only the information
that is directly connected to a bnode that can be considered unique is
included. However, as your paper notes, the result may be that some
bnodes share the same hash and need to be further refined to detect
uniqueness.

 From your paper:

"Our goal is to thus use a deterministic process – avoiding
arbitrary choices and use of blank node labels – to compute a
colouring for the RDF graph G that results in a fine partition
of blank nodes. The general idea is to manually distinguish
individual blank nodes in non-trivial partitions. Since there
is no deterministic way to choose one such blank node, we
must try all equal choices. However, we can use an ordering
of the partition to narrow the choices insofar as possible."

Again, we do the same here. Our process for "ordering the partition"
occurs in the "Hash N-Degree Quads" algorithm [3] and is roughly this:

For each bnode in a "partition" (set of bnodes with the same naive
hash), we "order" the partition by traversing away from the bnode, along
a path determined by each permutation of connected bnodes. The connected
bnodes are assigned names based on a hash of the information that
relates them to the bnode. This involves using their assigned canonical
names (if they have already been determined to be unique by the
algorithm), the current path, or a hash of the information related to
them (via "Hash First Degree Quads"). Since the connected bnodes are
assigned names, the paths can be serialized and lexicographically
compared. As we compute paths, we choose the lexicographically least
path as our ordering. If we find that a path's serialization will be
greater than the present candidate, we optimize by stopping further
generation of the path.

While generating paths, we keep track of state information, such as a
temporary bnode "identifier issuer" and the path we've taken so far.
This means that each time we traverse to another bnode in the
permutation, if that bnode hasn't been visited, it is scheduled for
recursion and we use our temporary bnode identifier issuer to assign it 
a name. Then we append its temporary name to our path. If a 
serialization of the path is lexicographically greater than our current 
candidate, we continue to the next permutation.

After processing each bnode in a permutation, we process those that were
scheduled for recursion (by recursing and effectively passing copies of
our state so far).

Once we've processed everything, we use the "identifier issuer" that
goes with the chosen ordered path to assign the remaining canonical names.

There are more details about the algorithm in the spec, but hopefully
this helps better explain the main idea and how it maps to your paper.

[1]
http://json-ld.github.io/normalization/spec/index.html#normalization-algorithm
[2]
http://json-ld.github.io/normalization/spec/index.html#hash-first-degree-quads
[3]
http://json-ld.github.io/normalization/spec/index.html#hash-n-degree-quads

>
> Aidan, thanks for taking a look. Yes, the sections need to be
> written in terms of normative requirements, and there needs to be
> theoretical aspects to discuss the behavior at a higher level. As a
> practical consideration, the existing algorithm specification (or
> something similar) may be required to do an implementation and
> interoperate, but they should be in the form of a non-normative
> section or appendix, IMO.
>
> Dave did a better analysis comparing the two, and he should be able
> to provide better illumination on how the algorithms match up
> against your paper.
>
> We also need some example walk-throughs for different graph shapes
> to indicate the behavior of the algorithm, IMO. Much still to do.
>
> Gregg
>
>> Cheers, Aidan
>>
>> On 20/05/2015 16:37, Gregg Kellogg wrote:
>>> Just so readers of this thread are aware, the W3C Credentials
>>> Community Group has been fostering an RDF Dataset Normalization
>>> specification [1] which was previously initiated as part of
>>> JSON-LD, although it concerns itself exclusively with RDF
>>> Datasets and canonical N-Quads. The goal is to use this as a
>>> basis for a future Recommendation.
>>>
>>> The algorithm was created by Dave Longley, and I understand that
>>> it is substantially the same as Aidan’s (it would be good to have
>>> an independent analysis). The specification needs a lot of work
>>> to motivate specific use cases and create normative language for
>>> describing the particular branches necessary to resolve
>>> isomorphism ambiguities and challenging graph topographies. There
>>> is a fair test suite already created (see [1]). Ideally, Aidan’s
>>> paper could form the basis of a formal description of the
>>> algorithm, with the existing algorithms described in an appendix,
>>> to the degree not required for interoperability.
>>>
>>> I’ve just completed my own implementation in Ruby [2] to join
>>> other implementations in Python and JavaScript [3], [4].
>>>
>>> We would welcome feedback and collaborators, as the expertise
>>> required for this is outside of the normal domain for the
>>> Credentials Community Group.
>>>
>>> Gregg
>>>
>>> [1] http://json-ld.github.io/normalization/ [2]
>>> https://github.com/ruby-rdf/rdf-normalize [3]
>>> https://github.com/digitalbazaar/pyld [4]
>>> https://github.com/digitalbazaar/jsonld.js
>>>
>>>> On May 12, 2015, at 11:14 PM, ahogan@dcc.uchile.cl wrote:
>>>>
>>>> Hi,
>>>>
>>>> Sorry for resurrecting old threads but this discussion sort of
>>>> prompted me to work on the topic of "deterministic naming of
>>>> blank nodes" and I thought that it might be time (six months
>>>> or so later) to report back on some findings.
>>>>
>>>> The result is the paper here:
>>>>
>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving
>>>> Isomorphism". In WWW, Florence, Italy, May 1822, 2015 (to
>>>> appear). Available from:
>>>> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>>>
>>>> I'll be presenting it next Friday if anyone happens to be
>>>> around the Florence area.
>>>>
>>>> Roughly speaking, the paper presents an algorithm for
>>>> deterministically labelling blank nodes of any RDF graph such
>>>> that the labels of two isomorphic RDF graphs will
>>>> "correspond". This could be used to find isomorphic RDF graphs
>>>> in a large set of RDF graphs (without needing pairwise checks),
>>>> for computing signatures, or for Skolemising in such a way that
>>>> the output of two RDF graphs will be equal if and only if they
>>>> are isomorphic.
>>>>
>>>> The algorithm again works for general RDF graphs and doesn't
>>>> assume any well-behavedness or similar. The algorithm is
>>>> exponential (based roughly on a Nauty-style recursion) but the
>>>> experiments show that, in terms of isomorphism at least, you
>>>> would have to have really really really *really* weird RDF
>>>> graphs to run into any real trouble (we're talking cliques of
>>>> blank nodes with k>55 or Miyazki graphs or similar). We
>>>> skolemised all the individual graphs in the BTC-14 collection
>>>> ... the slowest took 3140 seconds, mainly due to its size:
>>>> 7.3 million triples with 254 thousand blank nodes.
>>>>
>>>> One drawback of the current approach is that at the moment you
>>>> have to load the RDF graph into memory, which may be
>>>> prohibitive for very very large graphs (of connected blank
>>>> nodes), but I'm not sure how often that happens in practice.
>>>> (It wasn't a problem for any of the graphs in the BTC-14
>>>> dataset for example.)
>>>>
>>>> There's also a software package available here:
>>>>
>>>> https://github.com/aidhog/blabel/wiki
>>>>
>>>> Of course the usual disclaimers about research code apply!
>>>>
>>>> Anyways, I thought these results might be of interest and
>>>> hopefully might have applications in the context of
>>>> Skolemisation or signing RDF graphs or similar.
>>>>
>>>> Best, Aidan
>>>>
>>>> On 21/10/2014 04:30, Laurens Rietveld wrote:
>>>>> Coincidentally, I will present SampLD[1] in two hours at the
>>>>> ISWC (11.00 local time, sala 300), which might be suitable
>>>>> for your experiment. SampLD is a scalable approach for
>>>>> sampling RDF graphs, solely using the topology of the graph:
>>>>>
>>>>> The Linked Data cloud has grown to become the largest
>>>>> knowledge base ever constructed. Its size is now turning
>>>>> into a major bottleneck for many applications. In order to
>>>>> facilitate access to this structured information, this paper
>>>>> proposes an automatic sampling method targeted at maximizing
>>>>> answer coverage for applications using SPARQL querying. The
>>>>> approach presented in this paper is novel: no similar RDF
>>>>> sampling approach exist. Additionally, the concept of
>>>>> creating a sample aimed at maximizing SPARQL answer
>>>>> coverage, is unique. We empirically show that the relevance
>>>>> of triples for sampling (a semantic notion) is influenced by
>>>>> the topology of the graph (purely structural), and can be
>>>>> determined without prior knowledge of the queries.
>>>>> Experiments show a significantly higher recall of topology
>>>>> based sampling methods over random and naive baseline
>>>>> approaches (e.g. up to 90% for Open-BioMed at a sample size
>>>>> of 6%).
>>>>>
>>>>>
>>>>> And, to support LOD-scale experiments, you might be
>>>>> interested in the LOD Laundromat[2,3] as well, presented
>>>>> this Thursday at ISWC as well. Here, we (re)publish clean,
>>>>> sanitized versions of Linked Data sets as sorted gzipped
>>>>> n-triples (13 billion triples and counting), with
>>>>> corresponding meta-data (e.g. the number of blank nodes in
>>>>> this
>>>> dataset[4]).
>>>>>
>>>>> Best, Laurens
>>>>>
>>>>> [1]
>>>>> http://laurensrietveld.nl/pdf/Structural_Properties_as_Proxy_for_Semantic_Relevance_in_RDF_Graph_Sampling.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
[2]
>>>>> http://laurensrietveld.nl/pdf/LOD_Laundromat_-_A_Uniform_Way_of_Publishing_Other_Peoples_Dirty_Data.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
[3] http://lodlaundromat.org
>>>>> [4]
>>>>> http://lodlaundromat.org/sparql/?query=PREFIX+llm%3A+%3Chttp%3A%2F%2Flodlaundromat.org%2Fmetrics%2Fontology%2F%3E%0APREFIX+void%3A+%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E%0ASELECT+*+WHERE+%7B%0A++%5B%5D+llm%3Ametrics%2Fllm%3AdistinctBlankNodes+%3FnumBnodes+%3B%0A++++%09void%3AdataDump+%3Fdownload%0A%7D+ORDER+BY+DESC(%3FnumBnodes)+LIMIT+100
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
--
>>>>>
>>>>> VU University Amsterdam____
>>>>>
>>>>> Faculty of Exact Sciences____
>>>>>
>>>>> Department of Computer Science____
>>>>>
>>>>> De Boelelaan 1081 A____
>>>>>
>>>>> 1081 HV Amsterdam____
>>>>>
>>>>> The Netherlands
>>>>>
>>>>> www.laurensrietveld.nl <http://www.laurensrietveld.nl>
>>>>>
>>>>> laurens.rietveld@vu.nl <mailto:laurens.rietveld@vu.nl>
>>>>>
>>>>> Visiting address:
>>>>>
>>>>> De Boelelaan 1081____
>>>>>
>>>>> Science Building Room T312
>>>>>
>>>>>
>>>>> On Mon, Oct 20, 2014 at 5:47 PM, Sandro Hawke <sandro@w3.org
>>>>>  <mailto:sandro@w3.org>> wrote:
>>>>>
>>>>> On 10/07/2014 10:25 PM, Sampo Syreeni wrote:
>>>>>> On 2014-10-07, Sandro Hawke wrote:
>>>>>>
>>>>>>> That may be true, but it is hard for me to see how any
>>>>>>> benefit this could bring would outweigh the absolute
>>>>>>> pain in the ass it would be for everyone to change their
>>>>>>> RDF stacks.
>>>>>
>>>>> It was not me who said that.   That was Alan Ruttenberg.
>>>>>
>>>>>>
>>>>>> So, why not subdivide the process? It ought to be easy and
>>>>>> efficient enough to detect a rather expansive subset of
>>>>>> graphs which do admit unique and efficient labeling. At
>>>>>> the very least graphs which only
>>>> use
>>>>>> a blank node precisely twice (to define something and to
>>>>>> refer to it once as in the bracket notation) are pretty
>>>>>> simple, using a simple hash table with counters -- that
>>>>>> perhaps being the commonest case as well.
>>>>>>
>>>>>> If the test succeeds, define a unique labeling based on
>>>>>> the rest of the attributes of the triple and lexical
>>>>>> ordering; if not, ask the user whether general graph
>>>>>> isomorphism comparison is wanted, and if so, do that,
>>>>>> somehow signaling that it really went that far (perhaps
>>>>>> inband in the format of the labels? or out of band as the
>>>>>> case may be); if not, or if you can't do graph isomorphism
>>>>>> in your code, then slap on nonunique labels, again
>>>>>> differentiating them somehow from the first two cases.
>>>>>>
>>>>>> That is certainly not an easy or clean solution, but it
>>>>>> doesn't break the stack, and it works in most of the
>>>>>> places where you want to do fast path processing under the
>>>>>>  assumption that in fact the labels are canonical, and can
>>>>>>  be relied upon to have 1-1 correpondence from syntax to
>>>>>> node.
>>>>>
>>>>> I agree.   Does anyone have a good sampling of the LOD cloud
>>>>> we could easily use for this experiment?
>>>>>
>>>>> -- Sandro
>>>>
>>>>
>>>>
>>>
>
>


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Wednesday, 27 May 2015 19:48:52 UTC