Re: Well Behaved RDF - Taming Blank Nodes, etc. from Aidan Hogan on 2012-12-17 (semantic-web@w3.org from December 2012)

From: Aidan Hogan <aidan.hogan@deri.org>
Date: Mon, 17 Dec 2012 21:56:35 +0000
To: David Booth <david@dbooth.org>
CC: Semantic Web <semantic-web@w3.org>
Message-ID: <50CF9513.9050006@deri.org>
Hi David, all,

On behalf of the authors, I can summarise some relevant experiences we 
had with the work in the "On blank nodes" paper mentioned, as well as 
some subsequent internal discussion about the work and this thread. 
[Apologies for the long resulting email.]

<snip>
On 12/12/2012 19:09, David Booth wrote:
> Perhaps, for some RDF authors.  And those authors could use full RDF
> instead of the Well Behaved RDF profile.  But according to
> http://web.ing.puc.cl/~marenas/publications/iswc11.pdf
> the vast majority of RDF documents (over 98% of their samples) use blank
> nodes in non-problematic ways.  (I.e., they contain no blank node
> cycles, and thus do not cause the graph isomorphism complexity problem.)
> At present the many applications that process RDF have to pay for the
> sins of those (very) few RDF graphs that use blank nodes in problematic
> ways.

Internally, after said ISWC paper, we had similar discussions precisely 
along these lines of a Well Behaved RDF proposal, thinking about a 
profile of RDF that disallowed blank node cycles. It's certainly an 
interesting discussion to have and we're glad to see it being debated here.

An argument for such a proposal would be on the consumption side: 
developers of applications could claim to well-support (with, e.g., 
isomorphism for "equivalence" and homomorphism for "leanness") Well 
Behaved RDF with an authoritative reference for what they mean. This 
would seem beneficial to help buck the current trend of using 
"Skolemisation" as a carpet to sweep all blank nodes under, allowing to 
distinguish easier common cases from more difficult uncommon cases.

One possible argument against doing this is if 98% of the documents 
(indeliberately perhaps) already follow these principles, would it 
perhaps be preaching to the converted?

Another complication arises with entailment: for example, 
materialisation wrt. Well Behaved RDF could often produce Non-Well 
Behaved RDF. This would also need to be considered.

> Actually, it would be interesting to examine whether those <2% of graphs
> that did have blank node cycles really needed them.My suspicion is
> that the authors could have simply minted a few URIs to break those
> blank node cycles and turn them into non-problematic blank node trees.
> In the nearly 4 million RDF documents Mallea, Arenas, Hogan, and
> Polleres examined, the maximum blank node treewidth they found was 7,
> which I think (though a graph theory expert would have to confirm) that
> only 6 URIs would have to have been minted to turn it into a tree.

In terms of distinguishing whether or not those blank node cycles are 
needed in the cases where they are used, that would be quite subjective 
(i.e., "difficult") based on whether or not (i) blank nodes are needed; 
(ii) cycles are needed.

In terms of removing 6 vertices from the graph to remove cycles (i.e., 
replacing bnodes with URIs), this may not be sufficient in some cases 
(thanks to Marcelo for confirming; a counter-example being grid graphs). 
More minting would often be needed (depending on factors other than 
treewidth).

It's also probably worth noting that the graph in question has since 
gone offline, as have many of the very high treewidth documents from the 
rdfabout.com site that we found at the time. In fact, the only 
"high-volume" exporter producing high treewidths seems to have been that 
site. In more recent analysis (of the BTC 2011 dataset), the counts were:

TW No. of bnode graphs % of bnode graphs
1  2,590,686   99.4%
2  15,996    0.6%
3  4    ~
4  3    ~
5  1    ~


Treewidths >=3 were "once-off" documents. As a concrete example, the 
highest treewidth we found (5) was:

 http://kasei.us/e/people/index.rdf

(I hope kasei doesn't mind; it's perhaps useful to have a real-world RDF 
graph to look at since treewidth is a very abstract notion.)

<snip>
> On 13/12/2012 17:09, David Booth wrote:
> On Wed, 2012-12-12 at 22:37 -0800, Pat Hayes wrote:
>> [ . . . ]
>>> I also don't want any cycles, but that is much weaker than your
>>> proposal. Why not just say that wellbehaved means, no bnode cycles?
>>
>> That would be fine from a formal perspective, and I first considered
>> proposing it that way.  The reasons I instead phrased the restriction in
>> terms of Turtle serialization were purely practical.  First, it seemed
>> easier to explain, to say that there simply cannot be any explicit blank
>> node labels.  If the restriction were expressed in terms of blank node
>> cycles, then there would have to be an explanation of what that means,
>> how the blank node graph relates to the original RDF graph, etc.

Yes, we believe that syntaxes are a primary cause of the high prevalence 
of tree structures for blank node. Omitting blank nodes for Turtle or 
RDF/XML (the latter being more prevalent in the Wild) has the same 
effect. As mentioned, having no blank node labels in these syntaxes is 
sufficient but not necessary to ensure cycle-free blank nodes.

The data we have available to us is N-Quads so we can't look at such 
syntax usage directly. However, to distinguish the prevelance of cases 
of no cycles and no use of bnode labels, we can look at prevalence of 
occurrence of bnodes in the object position. From the ISWC paper:

"""Each blank node occurred, on average, 0.995 times in the object 
position of a [...] triple, with 3.1 m blank nodes (1.9% of all blank 
nodes) not occurring in the object position; conversely, each blank node 
occurred on average 4.239 times in the subject position of a triple, 
with 69 k (0.04%) not occurring in the subject position."""
 http://web.ing.puc.cl/~marenas/publications/iswc11.pdf

This suggests that blank nodes tend to occur only once in the object 
position of a triple, lending support to the idea that a high ratio of 
documents don't use bnode IDs ... again, multiple occurrences as an 
object would require bnode labels in RDF/XML or Turtle.

<snip>
> On 14/12/2012 20:42, David Booth wrote:> On Thu, 2012-12-13 at 15:14 -0500, Ivan Herman wrote:
>> [ . . . ]
>>> I think it would still be better to explain these things in a syntax
>>> independent way. After all, I may want to use JSON-LD or RDFa...
>>>
>>> Distilling the various mails and concentrating on bnodes only, what
>>> seems to be the pattern is
>>>
>>> - bnodes can appear in at most one triple as an object
>>> - there can be no cycle in the graphs involving bnodes
>>>
>>> Would that suffice as a more formal definition?
>>
>> As of today I think that would suffice, though I'm unsure of the details
>> of how the "no bnode cycles" constraint should be formalized.  Perhaps
>> Jeremy Carroll or one of the authors of
>> http://web.ing.puc.cl/~marenas/publications/iswc11.pdf
>> could comment.

For no-blank-node cycles, an informal draft would be:

Given an RDF graph, consider an undirected graph with (i) its blank 
nodes as vertices, (ii) an edge between two blank nodes iff they appear 
in the same triple of the original RDF graph (excluding loops). Call 
this the graph of connected blank nodes in the RDF graph.  ...  For RDF 
to be Well Behaved, this graph should not contain cycles.

This is covered more formally in Section 3.1 of the paper, with 
reference to an earlier work by Pichler et al. The concept is 
independent of subject/object directionality.

<snip>
> On 13/12/2012 10:35, Andy Seaborne wrote:
 >
> Promoting well-formed lists would be good.
>
> The restriction of "no labels" is not just about "no cycles" - it's
> things that are not tree-like:
>
> :x1 :p _:a .
> :x2 :q _:a .

Aside from the paper, in my own experience, publishers tend to work with 
the "grain" of the syntax. This means that, e.g., we have a prevalence 
of properties like foaf:knows, rdf:rest, owl:unionOf, etc., and not 
foaf:knownBy, rdf:hasRest, owl:isUnionFor, etc. Perhaps putting it 
another way, when many people model with blank nodes in RDF, they tend 
to think in terms of nesting in syntaxes like Turtle or RDF/XML, which 
are inherently tree like: put your protagonist(s) at the root and branch 
from there. The directionality of properties used (esp. involving blank 
nodes) then follows. This certainly does not cover all cases or all 
publishers, but I think this is a widely followed pattern.


On a related note, another aspect of blank nodes to potentially consider 
for Well Behaved RDF is the notion of (non)-lean RDF. But this email is 
much too long already.

Best,
Aidan (et al.)
Received on Monday, 17 December 2012 21:57:11 UTC