- From: David Booth <david@dbooth.org>
- Date: Thu, 22 Nov 2018 18:49:03 -0500
- To: semantic-web@w3.org
Hi Tim,

On 11/22/18 7:02 AM, Tim Berners-Lee wrote:
> . . .
> Every {} in JSON is
> equivalent to a blank node [] in turtle

Agreed.

> . . . When you look at Turtle as a language
> to write and to generate it is I think nice.
> In fact using turtle more for documentation and examples instead of
> Ntriples etc I think will make things easier for developers. . . .

Agreed.

>> but [blank nodes] cause insidious downstream complications.
>> They have subtle, confusing semantics.
>
> I find them very simple, thanks.

Okay, but you, Sir, are not exactly *average*.  :)  *Average*
developers -- middle 33% of ability -- certainly do *not* find blank
node restrictions and semantics simple.  They get stung by them on a
fairly regular basis.

>> (As Nathan Rixham
>> once aptly put it, a blank node is "a name that is not
>> a name".)
>
> No, it is not a name that is not a name, it is a thing which has no URI.

Uh . . . I don't think that is quite correct.  As I understand, a
blank node does *not* represent *a* thing.  Rather, it asserts that
there *exists* a thing, as explained in the RDF Semantics:
https://www.w3.org/TR/rdf11-mt/#blank-nodes
In contrast, an IRI represents *a* thing.

I'm sorry to be pedantic here, but I mention it because it underscores
my point: the semantics of blank nodes really *are* subtle -- at least
to *average* developers.

This subtle semantic distinction -- existence versus a particular
thing -- was actually debated a fair amount when RDF was created, if I
remember correctly.  The prevailing thought at the time was that there
was value in being able to make such "existence" assertions, so that
is what we got in the RDF semantics.  But after 20+ years of use, I
think it has become clear that this subtle distinction is not actually
*needed* in practice, as Skolem IRIs clearly demonstrate.
https://www.w3.org/TR/rdf11-mt/#skolemization-informative

But again, I am also convinced that we *do* need the convenience that
blank nodes currently provide.
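As a rough illustration of the Skolem IRI idea referenced above, the following Python sketch replaces every blank node label in a list of triples with a freshly minted IRI. It is only a sketch under stated assumptions: triples are plain tuples of strings, blank nodes are recognized by the `_:` prefix, and the `example.org` authority and the `skolemize` helper are hypothetical names chosen for this example. The `/.well-known/genid/` path follows the convention suggested in RDF 1.1 Concepts.

```python
import uuid

# Hypothetical authority; RDF 1.1 Concepts suggests the /.well-known/genid/ path.
GENID_BASE = "https://example.org/.well-known/genid/"


def skolemize(triples, base=GENID_BASE):
    """Return a copy of `triples` with each blank node label replaced by a Skolem IRI.

    Assumption: terms are strings and blank nodes start with '_:'.
    The same label always maps to the same IRI within one call.
    """
    mapping = {}

    def convert(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"<{base}{uuid.uuid4().hex}>"
            return mapping[term]
        return term

    return [tuple(convert(term) for term in triple) for triple in triples]


# Usage: the blank node from the diagnosis example becomes an ordinary IRI.
example = [
    (":Christine", ":has_diagnosis", "_:Diagnosis_Relation_1"),
    ("_:Diagnosis_Relation_1", ":diagnosis_probability", ":HIGH"),
]
skolemized = skolemize(example)
```

After this step the node can be referred to by name like any other IRI, at the cost of the "existence only" reading that the blank node had.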
So in forging a path forward, we should be sure to retain the
convenience, even if we dispense with blank nodes themselves.

> . . .
> [Blank nodes] are not stable identifiers because the
> people who generate the data, like the JSON above, don’t want to have to
> go to the pain of thinking up or supporting an identifier.

Exactly.  That is why I believe one key problem that we need to
address, to solve the blank node problem, is to ease the pain of
making identifiers, by both:

  - using higher-level forms of RDF that eliminate/reduce the need
    for uninteresting identifiers; and

  - making it *easier* to allocate IRIs.

Turtle and N3 already make an excellent step in the right direction,
by providing [] and () notations, as you've pointed out.

>
>> A blank node label cannot be used in
>> a follow-up SPARQL query to refer to the same node, which
>> is justifiably viewed as completely broken by RDF newbies.
>
> If the data is serialized as turtle, typically the blank nodes all
> appear as [ ] square brackets, so there is no blank node identifier
> which would cause a newbie to thing they could query it.

Agreed.  But for this approach to really work, I think tools need to
work consistently at this higher level, so that users *never* need to
look at the underlying triples or think about them, just as Python
programmers never need to look at compiled bytecode.  And we're
definitely not there yet.

>
>> Blank nodes also cause duplicate triples (non-lean) when the
>> same data is loaded more than once, which can easily happen
>> when data is merged from different sources.
>
> Just a is if you were using an SQL database or an graph database, in general
> when you load data, it is wise to query whether this is something we
> already know, and if not, don’t add it again.

Sure, that's a work-around that RDF users currently employ.  But it
requires a *lot* of work to perform all of those pre-queries for
everything before adding any data.
It would be much less burdensome if duplicate triples were eliminated
automatically.  This could be achieved if predictable identifiers
were automatically assigned, for example when n-ary relations are
encoded in RDF.  To do so, tools must be aware of a key that uniquely
identifies that n-ary relation.  And in practice, n-ary relations
usually *do* have a key -- or composite key.  The key could be used
in automatically assigning a predictable identifier.  This would make
it trivial for tools to eliminate duplicate triples.

To illustrate, consider this example from the W3C Note on N-ary
relations, https://www.w3.org/TR/swbp-n-aryRelations/#useCase1
in which a blank node _:Diagnosis_Relation_1 is used to connect the
entities in the relation:

   :Christine
        a :Person ;
        :has_diagnosis _:Diagnosis_Relation_1 .

   _:Diagnosis_Relation_1
        a :Diagnosis_Relation ;
        :diagnosis_probability :HIGH ;
        :diagnosis_value :Breast_Tumor_Christine .

Instead of assigning an arbitrary blank node (as above), a predictable
identifier could be automatically generated, based (recursively) on
the identities of the participants in this n-ary relation, which in
the above example are:

   :Christine  (who :has_diagnosis)
   :HIGH  (the :diagnosis_probability)
   :Breast_Tumor_Christine  (the :diagnosis_value)

The exact conventions for doing this still need to be worked out, but
I think a reasonable balance can be achieved, to enable this to work
without placing an onerous burden on RDF authors.  (Remember, RDF
authors already know what their keys or composite keys are!)

>
> In most systems, if you load the same data more than once,
> you get duplications.  RDF with no blank nodes is fairly unique in that
> duplicate triples are automatically removed, so long as as everyone has
> used the same URIs for the same things.

Yes!  And I think this observation could help provide a route toward a
better solution, as explained above.

>
>> And they cause difficulties with canonicalization, described next.
>
> Canonicalization works for me with real data, thanks.
> But that is another topic, not this one.
>
> But the take-away from the your note about blank nodes: use more turtle,
> and think about it as the turtle language more than the underlying triples.

I fully agree, and more: I think it may be time to create an even
higher-level form of RDF, that is even easier than Turtle or N3, and
directly supports property graphs.

David Booth
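The message leaves the exact conventions for key-based identifiers open, so the following Python sketch is only one possible reading of the idea, not the author's proposal. It assumes that an identifier is minted by hashing the relation type together with its composite key (here the three participants from the diagnosis example, in a fixed order), and it simulates a triple store as a Python set. The `http://example.org/id/` namespace and the `keyed_iri` and `diagnosis_triples` helpers are hypothetical.

```python
import hashlib

BASE = "http://example.org/id/"  # hypothetical namespace for minted IRIs


def keyed_iri(relation_type, key_parts):
    """Mint a predictable IRI from a relation type and its composite key.

    Assumption: the order of key_parts is significant and fixed by convention.
    Each part is length-prefixed so ("ab", "c") and ("a", "bc") hash differently.
    """
    payload = "".join(f"{len(p)}:{p}" for p in [relation_type, *key_parts])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"<{BASE}{digest}>"


def diagnosis_triples(node):
    """Triples of the diagnosis relation, with `node` as the connecting node."""
    return {
        (":Christine", ":has_diagnosis", node),
        (node, "a", ":Diagnosis_Relation"),
        (node, ":diagnosis_probability", ":HIGH"),
        (node, ":diagnosis_value", ":Breast_Tumor_Christine"),
    }


key = [":Christine", ":HIGH", ":Breast_Tumor_Christine"]

# Loading the same data twice with blank nodes: each load is simulated
# with a fresh label, so the store ends up with two copies of the relation.
blank_store = set()
for load in range(2):
    blank_store |= diagnosis_triples(f"_:b{load}")
blank_count = len(blank_store)

# Loading it twice with a key-derived IRI: the second load adds nothing,
# because set semantics drop the identical triples.
keyed_store = set()
for _ in range(2):
    keyed_store |= diagnosis_triples(keyed_iri(":Diagnosis_Relation", key))
keyed_count = len(keyed_store)

print(blank_count, keyed_count)
```

In this simulation the blank-node store holds two copies of the relation while the keyed store holds one, which is the automatic duplicate elimination the message asks for. A real convention would also have to settle key ordering, literal handling, and the recursive case where a participant is itself a keyed relation.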
Received on Thursday, 22 November 2018 23:49:26 UTC