Re: Blank Nodes Re: Toward easier RDF: a proposal from David Booth on 2018-11-22 (semantic-web@w3.org from November 2018)

From: David Booth <david@dbooth.org>
Date: Thu, 22 Nov 2018 18:49:03 -0500
To: semantic-web@w3.org
Message-ID: <77fd6408-f0c2-8163-cf8f-2493b9c11148@dbooth.org>
Hi Tim,

On 11/22/18 7:02 AM, Tim Berners-Lee wrote:
> . . .
> Every {} in JSON is 
> equivalent to a blank node [] in turtle

Agreed.
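
To make that correspondence concrete, here is a rough Python sketch that renders a nested JSON-like value as a Turtle object, turning every {} into a [] blank node. The function name to_turtle and the use of the default ":" prefix for keys are my own assumptions for illustration; it assumes every key is a valid Turtle local name, and it deliberately rejects JSON arrays, since whether an array means repeated values or an RDF collection is a modelling decision.

```python
import json

def to_turtle(value, indent=0):
    """Render a JSON-like value as a Turtle object; each dict becomes [ ... ]."""
    if isinstance(value, dict):
        pad = "  " * (indent + 1)
        body = " ;\n".join(
            f"{pad}:{k} {to_turtle(v, indent + 1)}" for k, v in value.items()
        )
        return "[\n" + body + "\n" + "  " * indent + "]"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return json.dumps(value)  # JSON string escapes are also valid in Turtle
    raise TypeError(f"unsupported value in this sketch: {value!r}")

print(to_turtle({"name": "Ann", "knows": {"name": "Bob"}}))
```

The nested object comes out as a nested [ ... ], with no blank node label anywhere. To use the result as a complete Turtle statement, one would still need prefix declarations and a closing " ." after the outer bracket.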

> . . . When you look at Turtle as a language
> to write and to generate it is I think nice.
> In fact using turtle more for documentation and examples instead of 
> Ntriples etc I think will make things easier for developers. . . . 

Agreed.

>> but [blank nodes] cause insidious downstream complications.
>> They have subtle, confusing semantics. 
> 
> I find them very simple, thanks.

Okay, but you, Sir, are not exactly *average*.  :)  *Average* developers 
-- middle 33% of ability -- certainly do *not* find blank node 
restrictions and semantics simple.  They get stung by them on a fairly 
regular basis.

>> (As Nathan Rixham
>> once aptly put it, a blank node is "a name that is not
>> a name".) 
> 
> No, it is not a name that is not a name, it is a thing which has no URI.

Uh . . . I don't think that is quite correct.  As I understand it, a blank 
node does *not* represent *a* thing.  Rather, it asserts that there 
*exists* a thing, as explained in the RDF Semantics:
https://www.w3.org/TR/rdf11-mt/#blank-nodes
In contrast, an IRI represents *a* thing.  I'm sorry to be pedantic 
here, but I mention it because it underscores my point: the semantics of 
blank nodes really *are* subtle -- at least to *average* developers.

This subtle semantic distinction -- existence versus a particular thing 
-- was actually debated a fair amount when RDF was created, if I 
remember correctly.  The prevailing thought at the time was that there 
was value in being able to make such "existence" assertions, so that is 
what we got in the RDF semantics.  But after 20+ years of use, I think 
it has become clear that this subtle distinction is not actually 
*needed* in practice, as Skolem IRIs clearly demonstrate.
https://www.w3.org/TR/rdf11-mt/#skolemization-informative
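
As a rough illustration of what skolemization amounts to, the following Python sketch replaces each blank node label in a list of triples with a freshly minted IRI, consistently within that one graph. Triples are modelled as plain string tuples, and the example.org authority is a placeholder of mine; only the /.well-known/genid/ path follows the convention suggested in RDF 1.1 Concepts.

```python
import uuid

# Hypothetical authority; the /.well-known/genid/ path is the convention
# suggested by RDF 1.1 Concepts for Skolem IRIs.
GENID_BASE = "https://example.org/.well-known/genid/"

def skolemize(triples):
    """Replace every blank node label with a fresh Skolem IRI.

    The same label always maps to the same IRI within one call,
    so the structure of the graph is preserved.
    """
    mapping = {}

    def convert(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = "<" + GENID_BASE + uuid.uuid4().hex + ">"
            return mapping[term]
        return term

    return [(convert(s), p, convert(o)) for s, p, o in triples]

graph = [
    (":Christine", ":has_diagnosis", "_:d1"),
    ("_:d1", ":diagnosis_probability", ":HIGH"),
]
for triple in skolemize(graph):
    print(triple)
```

After this step the graph contains only IRIs, which can be named in a follow-up query. The generated IRIs are still arbitrary, though, so this alone does not prevent duplicates when the same source is skolemized twice.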

But again, I am also convinced that we *do* need the convenience that 
blank nodes currently provide.  So in forging a path forward, we should 
be sure to retain the convenience, even if we dispense with blank nodes 
themselves.

> . . . 
> [Blank nodes] are not stable identifiers because the
> people who generate the data, like the JSON above, don’t want to have to 
> go to the pain of thinking up or supporting an identifier.

Exactly.  That is why I believe one key problem that we need to address, 
to solve the blank node problem, is to ease the pain of making 
identifiers, both by:

  - using higher-level forms of RDF that eliminate/reduce the need for 
uninteresting identifiers; and

  - making it *easier* to allocate IRIs.

Turtle and N3 already make an excellent step in the right direction, by 
providing [] and () notations, as you've pointed out.

> 
>> A blank node label cannot be used in
>> a follow-up SPARQL query to refer to the same node, which
>> is justifiably viewed as completely broken by RDF newbies.
> 
> If the data is serialized as turtle, typically the blank nodes all
> appear as [ ] square brackets, so there is no blank node identifier
> which would cause a newbie to think they could query it.

Agreed.  But for this approach to really work, I think tools need to 
work consistently at this higher level, so that users *never* need to 
look at the underlying triples or think about them, just as Python 
programmers never need to look at compiled bytecode.  And we're 
definitely not there yet.

> 
>> Blank nodes also cause duplicate triples (non-lean) when the
>> same data is loaded more than once, which can easily happen
>> when data is merged from different sources. 
> 
> Just as if you were using an SQL database or a graph database, in general
> when you load data, it is wise to query whether this is something we 
> already know, and if not, don’t add it again.

Sure, that's a work-around that RDF users currently employ.  But it 
requires a *lot* of work to perform all of those pre-queries for 
everything before adding any data.  It would be much less burdensome if 
duplicate triples were eliminated automatically.   This could be 
achieved if predictable identifiers were automatically assigned, for 
example when n-ary relations are encoded in RDF.  To do so, tools must 
be aware of a key that uniquely identifies that n-ary relation.  And in 
practice, n-ary relations usually *do* have a key -- or composite key. 
The key could be used in automatically assigning a predictable 
identifier.  This would make it trivial for tools to eliminate duplicate 
triples.
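
The duplication problem itself is easy to reproduce. In this minimal Python sketch (my own toy model, not any particular triple store) a store is just a set of string tuples, and the hypothetical load function renames each document's blank nodes apart from everything already loaded, as an RDF merge requires:

```python
import itertools

_counter = itertools.count()

def load(store, triples):
    """Add one parsed document to the store.

    Blank node labels are scoped to the document, so each load gives
    them fresh labels, distinct from any blank node already stored.
    """
    fresh = {}

    def rename(term):
        if term.startswith("_:"):
            if term not in fresh:
                fresh[term] = f"_:b{next(_counter)}"
            return fresh[term]
        return term

    for s, p, o in triples:
        store.add((rename(s), p, rename(o)))

doc = [
    (":Christine", "a", ":Person"),
    (":Christine", ":has_diagnosis", "_:d"),
    ("_:d", ":diagnosis_probability", ":HIGH"),
]

store = set()
load(store, doc)
load(store, doc)  # the same data, loaded a second time
print(len(store))  # 5
```

The triple that uses only IRIs collapses into one, but the two triples that mention the blank node are stored twice, once per load, giving five triples instead of three.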

To illustrate, consider this example from the W3C Note on n-ary 
relations,
https://www.w3.org/TR/swbp-n-aryRelations/#useCase1
in which a blank node _:Diagnosis_Relation_1 is used to connect the 
entities in the relation:

:Christine
       a       :Person ;
       :has_diagnosis _:Diagnosis_Relation_1 .

_:Diagnosis_Relation_1
       a       :Diagnosis_Relation ;
       :diagnosis_probability :HIGH ;
       :diagnosis_value :Breast_Tumor_Christine .

Instead of assigning an arbitrary blank node (as above), a predictable 
identifier could be automatically generated, based (recursively) on the 
identities of the participants in this n-ary relation, which in the 
above example are:

       :Christine (who :has_diagnosis)
       :HIGH (the :diagnosis_probability)
       :Breast_Tumor_Christine (the :diagnosis_value)

The exact conventions for doing this still need to be worked out, but I 
think a reasonable balance can be achieved, to enable this to work 
without placing an onerous burden on RDF authors.  (Remember, RDF 
authors already know what their keys or composite keys are!)
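
One possible shape for such a convention, purely as a sketch: hash the relation type together with its composite key and use the digest as the local part of an IRI. Everything named here (keyed_iri, diagnosis, the example.org namespace, the choice of SHA-256 truncated to 16 hex digits) is an assumption of mine, not an agreed convention. It handles only the flat case where every participant already has an IRI; the recursive case, and escaping of "|" or "=" inside key values, are left out.

```python
import hashlib

BASE = "https://example.org/id/"  # hypothetical namespace

def keyed_iri(relation_type, key):
    """Derive a stable IRI from a relation type and its composite key.

    `key` maps each key property to the identifier of its participant.
    Sorting makes the result independent of the order the parts are given in.
    """
    parts = [relation_type] + [f"{p}={v}" for p, v in sorted(key.items())]
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
    return f"<{BASE}{digest}>"

def diagnosis(person, probability, value):
    """Build the triples of the example, with a keyed IRI instead of a blank node."""
    node = keyed_iri(":Diagnosis_Relation", {
        ":has_diagnosis": person,
        ":diagnosis_probability": probability,
        ":diagnosis_value": value,
    })
    return {
        (person, "a", ":Person"),
        (person, ":has_diagnosis", node),
        (node, "a", ":Diagnosis_Relation"),
        (node, ":diagnosis_probability", probability),
        (node, ":diagnosis_value", value),
    }

store = set()
for _ in range(2):  # load the same data twice
    store |= diagnosis(":Christine", ":HIGH", ":Breast_Tumor_Christine")
print(len(store))  # 5
```

Because the identifier is a pure function of the key, loading the same relation twice produces identical triples, and ordinary set semantics removes the duplicates with no pre-query.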

> 
> In most systems, if you load the same data more than once,
> you get duplications.  RDF with no blank nodes is fairly unique in that 
> duplicate triples are automatically removed, so long as everyone has 
> used the same URIs for the same things.

Yes!   And I think this observation could help provide a route toward a 
better solution, as explained above.

> 
>> And they cause difficulties with canonicalization, described next.
> 
> Canonicalization works for me with real data, thanks.
> But that is another topic, not this one.
> 
> But the take-away from your note about blank nodes: use more turtle, 
> and think about it as the turtle language more than the underlying triples.

I fully agree, and more: I think it may be time to create an even 
higher-level form of RDF, that is even easier than Turtle or N3, and 
directly supports property graphs.

David Booth
Received on Thursday, 22 November 2018 23:49:26 UTC