- From: Aidan Hogan <aidan.hogan@deri.org>
- Date: Mon, 17 Dec 2012 21:56:35 +0000
- To: David Booth <david@dbooth.org>
- CC: Semantic Web <semantic-web@w3.org>
Hi David, all, On behalf of the authors, I can summarise some relevant experiences we had with the work in the "On blank nodes" paper mentioned, as well as some subsequent internal discussion about the work and this thread. [Apologies for the long resulting email.] <snip> On 12/12/2012 19:09, David Booth wrote: > Perhaps, for some RDF authors. And those authors could use full RDF > instead of the Well Behaved RDF profile. But according to > http://web.ing.puc.cl/~marenas/publications/iswc11.pdf > the vast majority of RDF documents (over 98% of their samples) use blank > nodes in non-problematic ways. (I.e., they contain no blank node > cycles, and thus do not cause the graph isomorphism complexity problem.) > At present the many applications that process RDF have to pay for the > sins of those (very) few RDF graphs that use blank nodes in problematic > ways. Internally, after said ISWC paper, we had similar discussions precisely along these lines of a Well Behaved RDF proposal, thinking about a profile of RDF that disallowed blank node cycles. It's certainly an interesting discussion to have and we're glad to see it being debated here. An argument for such a proposal would be on the consumption side: developers of applications could claim to well-support (with, e.g., isomorphism for "equivalence" and homomorphism for "leanness") Well Behaved RDF with an authoritative reference for what they mean. This would seem beneficial to help buck the current trend of using "Skolemisation" as a carpet to sweep all blank nodes under, allowing to distinguish easier common cases from more difficult uncommon cases. One possible argument against doing this is if 98% of the documents (indeliberately perhaps) already follow these principles, would it perhaps be preaching to the converted? Another complication arises with entailment: for example, materialisation wrt. Well Behaved RDF could often produce Non-Well Behaved RDF. This would also need to be considered. > Actually, it would be interesting to examine whether those <2% of graphs > that did have blank node cycles really needed them.My suspicion is > that the authors could have simply minted a few URIs to break those > blank node cycles and turn them into non-problematic blank node trees. > In the nearly 4 million RDF documents Mallea, Arenas, Hogan, and > Polleres examined, the maximum blank node treewidth they found was 7, > which I think (though a graph theory expert would have to confirm) that > only 6 URIs would have to have been minted to turn it into a tree. In terms of distinguishing whether or not those blank node cycles are needed in the cases where they are used, that would be quite subjective (i.e., "difficult") based on whether or not (i) blank nodes are needed; (ii) cycles are needed. In terms of removing 6 vertices from the graph to remove cycles (i.e., replacing bnodes with URIs), this may not be sufficient in some cases (thanks to Marcelo for confirming; a counter-example being grid graphs). More minting would often be needed (depending on factors other than treewidth). It's also probably worth noting that the graph in question has since gone offline, as have many of the very high treewidth documents from the rdfabout.com site that we found at the time. In fact, the only "high-volume" exporter producing high treewidths seems to have been that site. In more recent analysis (of the BTC 2011 dataset), the counts were: TW No. of bnode graphs % of bnode graphs 1 2,590,686 99.4% 2 15,996 0.6% 3 4 ~ 4 3 ~ 5 1 ~ Treewidths >=3 were "once-off" documents. As a concrete example, the highest treewidth we found (5) was: http://kasei.us/e/people/index.rdf (I hope kasei doesn't mind; it's perhaps useful to have a real-world RDF graph to look at since treewidth is a very abstract notion.) <snip> > On 13/12/2012 17:09, David Booth wrote: > On Wed, 2012-12-12 at 22:37 -0800, Pat Hayes wrote: >> [ . . . ] >>> I also don't want any cycles, but that is much weaker than your >>> proposal. Why not just say that wellbehaved means, no bnode cycles? >> >> That would be fine from a formal perspective, and I first considered >> proposing it that way. The reasons I instead phrased the restriction in >> terms of Turtle serialization were purely practical. First, it seemed >> easier to explain, to say that there simply cannot be any explicit blank >> node labels. If the restriction were expressed in terms of blank node >> cycles, then there would have to be an explanation of what that means, >> how the blank node graph relates to the original RDF graph, etc. Yes, we believe that syntaxes are a primary cause of the high prevalence of tree structures for blank node. Omitting blank nodes for Turtle or RDF/XML (the latter being more prevalent in the Wild) has the same effect. As mentioned, having no blank node labels in these syntaxes is sufficient but not necessary to ensure cycle-free blank nodes. The data we have available to us is N-Quads so we can't look at such syntax usage directly. However, to distinguish the prevelance of cases of no cycles and no use of bnode labels, we can look at prevalence of occurrence of bnodes in the object position. From the ISWC paper: """Each blank node occurred, on average, 0.995 times in the object position of a [...] triple, with 3.1 m blank nodes (1.9% of all blank nodes) not occurring in the object position; conversely, each blank node occurred on average 4.239 times in the subject position of a triple, with 69 k (0.04%) not occurring in the subject position.""" http://web.ing.puc.cl/~marenas/publications/iswc11.pdf This suggests that blank nodes tend to occur only once in the object position of a triple, lending support to the idea that a high ratio of documents don't use bnode IDs ... again, multiple occurrences as an object would require bnode labels in RDF/XML or Turtle. <snip> > On 14/12/2012 20:42, David Booth wrote:> On Thu, 2012-12-13 at 15:14 -0500, Ivan Herman wrote: >> [ . . . ] >>> I think it would still be better to explain these things in a syntax >>> independent way. After all, I may want to use JSON-LD or RDFa... >>> >>> Distilling the various mails and concentrating on bnodes only, what >>> seems to be the pattern is >>> >>> - bnodes can appear in at most one triple as an object >>> - there can be no cycle in the graphs involving bnodes >>> >>> Would that suffice as a more formal definition? >> >> As of today I think that would suffice, though I'm unsure of the details >> of how the "no bnode cycles" constraint should be formalized. Perhaps >> Jeremy Carroll or one of the authors of >> http://web.ing.puc.cl/~marenas/publications/iswc11.pdf >> could comment. For no-blank-node cycles, an informal draft would be: Given an RDF graph, consider an undirected graph with (i) its blank nodes as vertices, (ii) an edge between two blank nodes iff they appear in the same triple of the original RDF graph (excluding loops). Call this the graph of connected blank nodes in the RDF graph. ... For RDF to be Well Behaved, this graph should not contain cycles. This is covered more formally in Section 3.1 of the paper, with reference to an earlier work by Pichler et al. The concept is independent of subject/object directionality. <snip> > On 13/12/2012 10:35, Andy Seaborne wrote: > > Promoting well-formed lists would be good. > > The restriction of "no labels" is not just about "no cycles" - it's > things that are not tree-like: > > :x1 :p _:a . > :x2 :q _:a . Aside from the paper, in my own experience, publishers tend to work with the "grain" of the syntax. This means that, e.g., we have a prevalence of properties like foaf:knows, rdf:rest, owl:unionOf, etc., and not foaf:knownBy, rdf:hasRest, owl:isUnionFor, etc. Perhaps putting it another way, when many people model with blank nodes in RDF, they tend to think in terms of nesting in syntaxes like Turtle or RDF/XML, which are inherently tree like: put your protagonist(s) at the root and branch from there. The directionality of properties used (esp. involving blank nodes) then follows. This certainly does not cover all cases or all publishers, but I think this is a widely followed pattern. On a related note, another aspect of blank nodes to potentially consider for Well Behaved RDF is the notion of (non)-lean RDF. But this email is much too long already. Best, Aidan (et al.)
Received on Monday, 17 December 2012 21:57:11 UTC