Re: Blank Nodes Re: Toward easier RDF: a proposal from Pat Hayes on 2018-11-24 (semantic-web@w3.org from November 2018)

From: Pat Hayes <phayes@ihmc.us>
Date: Sat, 24 Nov 2018 13:08:22 -0600
To: David Booth <david@dbooth.org>, semantic-web@w3.org
Message-ID: <1c5f4b6a-cf96-f59b-dc1c-56b161d4dabf@ihmc.us>
On 11/22/18 5:49 PM, David Booth wrote:
> Hi Tim,
> 
> On 11/22/18 7:02 AM, Tim Berners-Lee wrote:
>> . . .
>> Every {} in JSON is equivalent to a blank node [] in turtle
> 
> Agreed.
> 
>> . . . When you look at Turtle as a language
>> to write and to generate it is I think nice.
>> In fact using turtle more for documentation and examples 
>> instead of Ntriples etc I think will make things easier for 
>> developers. . . . 
> 
> Agreed.
> 
>>> but [blank nodes] cause insidious downstream complications.
>>> They have subtle, confusing semantics. 
>>
>> I find them very simple, thanks.
> 
> Okay, but you, Sir, are not exactly *average*.  :)  *Average* 
> developers -- middle 33% of ability -- certainly do *not* find 
> blank node restrictions and semantics simple.  They get stung by 
> them on a fairly regular basis.
> 
>>> (As Nathan Rixham
>>> once aptly put it, a blank node is "a name that is not
>>> a name".) 
>>
>> No, it is not a name that is not a name, it is a thing which 
>> has no URI.
> 
> Uh . . . I don't think that is quite correct.  As I understand, a 
> blank node does *not* represent *a* thing.  Rather, it asserts 
> that there *exists* a thing, as explained in the RDF Semantics:
> https://www.w3.org/TR/rdf11-mt/#blank-nodes
> In contrast, an IRI represents *a* thing.  I'm sorry to be 
> pedantic here, but I mention it because it underscores my point: 
> the semantics of blank nodes really *are* subtle -- at least to 
> *average* developers.

Is this idea really hard for anyone? If URIs are names, then 
blank nodes are pronouns, like 'anyone' in the previous sentence. 
People don't seem to find pronouns hard or subtle or confusing, 
or complain that they have devious semantics.

The generic pronoun is actually 'something'. The triple

ex:PatHayes ex:owns _:x17 .

says 'Pat Hayes owns something', without saying what it is that I 
own. One can conclude things from this: I am not destitute, for 
example. If you know more about what I own:

_:x17 rdf:type dbpedia:Real_estate .

then you can infer more: that I am actually in reasonable 
financial circumstances. Now, you *could* invent a URI for this 
thing that I own, but that strongly suggests that you can 
identify it, which is most unlikely. It also suggests (even if it 
strictly should not do so) that there is only one of it, which is 
downright false. Also, it takes work to create a URI, and a quite 
unreasonable amount of work to create a 'cool' one.
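
To make the 'something' reading concrete, here is a minimal sketch in plain Python (no RDF library), meant as an illustration and nothing normative. It assumes triples are tuples of strings and that any term starting with "_:" is a blank node, and it invents a hypothetical IRI ex:house42 purely for comparison. A graph containing blank nodes follows from another graph if its blank nodes can be mapped to terms of that graph so that every triple is found there:

```python
from itertools import product

def is_bnode(term):
    # Convention for this sketch only: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def entails(graph, pattern):
    """Return True if some assignment of graph terms to the pattern's blank
    nodes turns every pattern triple into a triple of `graph`."""
    bnodes = sorted({t for triple in pattern for t in triple if is_bnode(t)})
    terms = sorted({t for triple in graph for t in triple})
    # Brute force over all assignments; fine for toy graphs only.
    for values in product(terms, repeat=len(bnodes)):
        mapping = dict(zip(bnodes, values))
        instance = {tuple(mapping.get(t, t) for t in triple) for triple in pattern}
        if instance <= graph:
            return True
    return False

# ex:house42 is a hypothetical name, invented for this example.
named = {
    ("ex:PatHayes", "ex:owns", "ex:house42"),
    ("ex:house42", "rdf:type", "dbpedia:Real_estate"),
}
anonymous = {
    ("ex:PatHayes", "ex:owns", "_:x17"),
    ("_:x17", "rdf:type", "dbpedia:Real_estate"),
}

print(entails(named, anonymous))  # True: naming the thing says at least as much
print(entails(anonymous, named))  # False: 'something' does not tell you which thing
```

The asymmetry is the whole point: the blank node version commits to less, which is exactly what a pronoun does.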

OK, bnodes do make RDF more complicated than it would be without 
them. But RDF without blank nodes is just data graphs. I could 
argue that RDF without IRIs would be even simpler, and I would be 
right, but it's a silly idea to defend. Seems to me we need to 
make RDF more expressive, not less so.

However, I agree with your point about bnode *identifiers*. This 
seems to me to be the really bad idea, since giving it an 
identifier is perilously close to using a name, and the 
object/metalevel confusion which it generates (and the lack of 
any scope boundaries for these 'local' identifiers) is I think 
largely responsible for the pain people are feeling. There is 
something inherently contradictory in having an identifier for 
something which, by definition, does not identify.
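
The scope problem can be shown with a small sketch, again in plain Python with triples as tuples of strings. It assumes (as I understand the RDF 1.1 documents to require) that blank node identifiers are local to the document they appear in, so that merging two documents means renaming them apart first. The names ex:Alice and ex:Bicycle and the helper functions are hypothetical, made up for the illustration:

```python
from itertools import count

_fresh = count(1)  # source of new, unused blank node labels

def standardize_apart(triples):
    """Rename every blank node label in one document to a fresh label,
    consistently within that document."""
    renamed = {}

    def rename(term):
        if not term.startswith("_:"):
            return term
        if term not in renamed:
            renamed[term] = f"_:b{next(_fresh)}"
        return renamed[term]

    return {tuple(rename(t) for t in triple) for triple in triples}

def merge(*documents):
    """Union of the documents after each one's blank nodes are renamed."""
    merged = set()
    for doc in documents:
        merged |= standardize_apart(doc)
    return merged

# Two unrelated documents that both happen to use the label _:x17.
doc_a = {("ex:PatHayes", "ex:owns", "_:x17"),
         ("_:x17", "rdf:type", "dbpedia:Real_estate")}
doc_b = {("ex:Alice", "ex:owns", "_:x17"),
         ("_:x17", "rdf:type", "ex:Bicycle")}

naive = doc_a | doc_b        # treats the label as if it were a global name
safe = merge(doc_a, doc_b)   # keeps the two 'somethings' distinct

print(len({o for s, p, o in naive if p == "ex:owns"}))  # 1
print(len({o for s, p, o in safe if p == "ex:owns"}))   # 2
```

The naive union quietly asserts that one thing is both real estate and a bicycle and has two owners, which is the kind of sting that comes from reading the label as a name.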

Pat Hayes

> 
> This subtle semantic distinction -- existence versus a particular 
> thing -- was actually debated a fair amount when RDF was created, 
> if I remember correctly.  The prevailing thought at the time was 
> that there was value in being able to make such "existence" 
> assertions, so that is what we got in the RDF semantics.  But 
> after 20+ years of use, I think it has become clear that this 
> subtle distinction is not actually *needed* in practice, as 
> Skolem IRIs clearly demonstrate.
> https://www.w3.org/TR/rdf11-mt/#skolemization-informative
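
For readers who have not met Skolem IRIs, a minimal sketch of the replacement step follows, in plain Python with triples as tuples of strings. The authority https://example.org and the use of a random UUID are assumptions of this sketch; the /.well-known/genid/ path is the one suggested in RDF 1.1 Concepts. Each blank node is swapped, consistently, for a freshly minted IRI:

```python
import uuid

def skolemize(triples, authority="https://example.org"):
    """Replace each blank node (a term starting with "_:") with a fresh
    Skolem IRI, using the same IRI for every occurrence of the same node."""
    mapping = {}

    def skolem(term):
        if not term.startswith("_:"):
            return term
        if term not in mapping:
            mapping[term] = f"<{authority}/.well-known/genid/{uuid.uuid4().hex}>"
        return mapping[term]

    return {tuple(skolem(t) for t in triple) for triple in triples}

owned = {
    ("ex:PatHayes", "ex:owns", "_:x17"),
    ("_:x17", "rdf:type", "dbpedia:Real_estate"),
}
result = skolemize(owned)
for triple in sorted(result):
    print(triple)
```

Note that the minted IRIs are random here, so skolemizing the same document twice yields different IRIs each time.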
> 
> But again, I am also convinced that we *do* need the convenience 
> that blank nodes currently provide.  So in forging a path 
> forward, we should be sure to retain the convenience, even if we 
> dispense with blank nodes themselves.
> 
>> . . . [Blank nodes] are not stable identifiers because the
>> people who generate the data, like the JSON above, don’t want 
>> to have to go to the pain of thinking up or supporting an 
>> identifier.
> 
> Exactly.  That is why I believe one key problem that we need to 
> address, to solve the blank node problem, is to ease the pain of 
> making identifiers, by both:
> 
>   - using higher-level forms of RDF that eliminate/reduce the 
> need for uninteresting identifiers; and
> 
>   - making it *easier* to allocate IRIs.
> 
> Turtle and N3 already make an excellent step in the right 
> direction, by providing [] and () notations, as you've pointed out.
> 
>>
>>> A blank node label cannot be used in
>>> a follow-up SPARQL query to refer to the same node, which
>>> is justifiably viewed as completely broken by RDF newbies.
>>
>> If the data is serialized as turtle, typically the blank nodes all
>> appear as [ ] square brackets, so there is no blank node identifier
>> which would cause a newbie to think they could query it.
> 
> Agreed.  But for this approach to really work, I think tools need 
> to work consistently at this higher level, so that users *never* 
> need to look at the underlying triples or think about them, just 
> as Python programmers never need to look at compiled byte code.  
> And we're definitely not there yet.
> 
>>
>>> Blank nodes also cause duplicate triples (non-lean) when the
>>> same data is loaded more than once, which can easily happen
>>> when data is merged from different sources. 
>>
>> Just as if you were using an SQL database or a graph database,
>> in general when you load data, it is wise to query whether this
>> is something we already know, and if not, don’t add it again.
> 
> Sure, that's a work-around that RDF users currently employ.  But 
> it requires a *lot* of work to perform all of those pre-queries 
> for everything before adding any data.  It would be much less 
> burdensome if duplicate triples were eliminated automatically.   
> This could be achieved if predictable identifiers were 
> automatically assigned, for example when n-ary relations are 
> encoded in RDF.  To do so, tools must be aware of a key that 
> uniquely identifies that n-ary relation.  And in practice, n-ary 
> relations usually *do* have a key -- or composite key. The key 
> could be used in automatically assigning a predictable 
> identifier.  This would make it trivial for tools to eliminate 
> duplicate triples.
> 
> To illustrate, consider this example from the W3C Note on N-ary 
> relations document,
> https://www.w3.org/TR/swbp-n-aryRelations/#useCase1
> in which a blank node _:Diagnosis_Relation_1 is used to connect 
> the entities in the relation:
> 
> :Christine
>        a       :Person ;
>        :has_diagnosis _:Diagnosis_Relation_1 .
> 
> _:Diagnosis_Relation_1
>        a       :Diagnosis_Relation ;
>        :diagnosis_probability :HIGH ;
>        :diagnosis_value :Breast_Tumor_Christine .
> 
> Instead of assigning an arbitrary blank node (as above), a 
> predictable identifier could be automatically generated, based 
> (recursively) on the identities of the participants in this n-ary 
> relation, which in the above example are:
> 
>        :Christine (who :has_diagnosis)
>        :HIGH (the :diagnosis_probability)
>        :Breast_Tumor_Christine (the :diagnosis_value)
> 
> The exact conventions for doing this still need to be worked out, 
> but I think a reasonable balance can be achieved, to enable this 
> to work without placing an onerous burden on RDF authors.  
> (Remember, RDF authors already know what their keys or composite 
> keys are!)
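
One way such a convention might look is sketched below in plain Python. This is only a guess at a possible scheme, since the message says the exact conventions are still to be worked out: the SHA-256 hashing, the truncation to 16 hex digits, the base https://example.org/id/ and the function names are all assumptions of the sketch. The node IRI is derived from the relation type and the composite key, so loading the same data twice adds nothing new:

```python
import hashlib

def keyed_iri(relation_type, key_parts, base="https://example.org/id/"):
    """Derive a deterministic IRI from the relation type and its composite key."""
    # A newline separator keeps ("ab", "c") distinct from ("a", "bc").
    canonical = "\n".join([relation_type, *key_parts])
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"<{base}{digest}>"

def diagnosis_triples(patient, probability, value):
    """Triples for one diagnosis relation, with a key-derived node."""
    node = keyed_iri(":Diagnosis_Relation", [patient, probability, value])
    return {
        (patient, ":has_diagnosis", node),
        (node, "a", ":Diagnosis_Relation"),
        (node, ":diagnosis_probability", probability),
        (node, ":diagnosis_value", value),
    }

store = set()
for _ in range(2):  # load the same data twice
    store |= diagnosis_triples(":Christine", ":HIGH", ":Breast_Tumor_Christine")

print(len(store))  # 4, not 8: the second load adds no new triples
```

The order of the key parts matters in this sketch, so any real convention would have to fix it.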
> 
>>
>> In most systems, if you load the same data more than once,
>> you get duplications.  RDF with no blank nodes is fairly unique 
>> in that duplicate triples are automatically removed, so long as 
>> everyone has used the same URIs for the same things.
> 
> Yes!   And I think this observation could help provide a route 
> toward a better solution, as explained above.
> 
>>
>>> And they cause difficulties with canonicalization, described 
>>> next.
>>
>> Canonicalization works for me with real data, thanks.
>> But that is another topic, not this one.
>>
>> But the take-away from your note about blank nodes: use 
>> more turtle, and think about it as the turtle language more 
>> than the underlying triples.
> 
> I fully agree, and more: I think it may be time to create an even 
> higher-level form of RDF, that is even easier than Turtle or N3, 
> and directly supports property graphs.
> 
> David Booth
> 

-- 
-----------------------------------
call or text to 850 291 0667
www.ihmc.us/groups/phayes/
www.facebook.com/the.pat.hayes
Received on Saturday, 24 November 2018 19:08:50 UTC