Re: Blank Nodes Re: Toward easier RDF: a proposal from Hugh Glaser on 2018-11-25 (semantic-web@w3.org from November 2018)

From: Hugh Glaser <hugh@glasers.org>
Date: Sun, 25 Nov 2018 13:28:19 +0000
To: Pat Hayes <phayes@ihmc.us>
Cc: David Booth <david@dbooth.org>, semantic-web@w3.org
Message-Id: <0EB45800-75DC-4A58-BE8F-7B91C048BA8E@glasers.org>
Thanks Pat,

So I think that brings another question.
Just how much of bnode usage is about existence, as you describe, and how much is about a single, potentially nameable object that the creating agent doesn't want to take the trouble to name?
I *think* that all the examples discussed here are the latter.
For example:
:foo
  :address [
      :number  123;
      :street  "Acacia Avenue" ] .

I understand that this is saying that :foo has something that is related to it by the :address property.
But it is a bit weird that it then goes on to be very specific about that something.

I see it is also saying there exists something that has two properties:
  :number  123;
  :street  "Acacia Avenue"

Because of the way this is then interpreted (I think), with both uses of the bnode getting the same bnode identifier,
it is saying that there exists a single something that has all those properties.
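
A minimal sketch of how I read that expansion, in plain Python rather than any real RDF library. The function name expand and the nested-dict encoding of the [ ] block are made up for illustration, and the label _:b1 is arbitrary.

```python
from itertools import count

def expand(subject, properties, labels=None):
    """Flatten a nested property list into (s, p, o) triples.

    A dict value stands in for a Turtle [ ... ] block: it gets one fresh
    blank node label, and every property inside it shares that label.
    """
    labels = labels if labels is not None else count(1)
    triples = []
    for predicate, value in properties.items():
        if isinstance(value, dict):
            bnode = f"_:b{next(labels)}"
            triples.append((subject, predicate, bnode))
            triples.extend(expand(bnode, value, labels))
        else:
            triples.append((subject, predicate, value))
    return triples

triples = expand(":foo", {":address": {":number": 123, ":street": "Acacia Avenue"}})
for triple in triples:
    print(triple)
```

The object of :address and the subject of :number and :street come out as the same label, which is what makes it one "something" with all of those properties.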

I can see that you may want
 ex:PatHayes ex:owns _:x17 .
But if I say
 ex:PatHayes ex:owns [ :hasReg "A487LUR" ] .
it has got pretty specific, now it has both properties.

So now:
> Now, you *could* invent a URI for this thing that I own, but that strongly suggests that you can identify it, which is most unlikely. It also suggests (even if it strictly should not do so) that there is only one of it, which is downright false.
doesn't really apply?
I may be pretty good at identifying :hasReg "A487LUR", if I have the right knowledge, and it may well be what the authoring agent intended me to be able to do.
And it is highly likely that :foo only lives at one of the possible 123 Acacia Avenues, and if I find that one, :foo doesn't live at any of the others.
And most importantly, I think that is what the authoring agents wanted to facilitate (:hasReg) or state (:address).

So it seems to me that there is an RDF facility (bnodes) which can make statements as you describe.
But in practice it is being used for other purposes, which quite possibly are not representing the knowledge that people want.

Best
Hugh

As always, thanks for being patient while I get things clearer in my head.

> On 24 Nov 2018, at 19:08, Pat Hayes <phayes@ihmc.us> wrote:
> 
> On 11/22/18 5:49 PM, David Booth wrote:
>> Hi Tim,
>> On 11/22/18 7:02 AM, Tim Berners-Lee wrote:
>>> . . .
>>> Every {} in JSON is equivalent to a blank node [] in turtle
>> Agreed.
>>> . . . When you look at Turtle as a language
>>> to write and to generate it is I think nice.
>>> In fact using turtle more for documentation and examples instead of Ntriples etc I think will make things easier for developers. . . . 
>> Agreed.
>>>> but [blank nodes] cause insidious downstream complications.
>>>> They have subtle, confusing semantics. 
>>> 
>>> I find them very simple, thanks.
>> Okay, but you, Sir, are not exactly *average*.  :)  *Average* developers -- middle 33% of ability -- certainly do *not* find blank node restrictions and semantics simple.  They get stung by them on a fairly regular basis.
>>>> (As Nathan Rixham
>>>> once aptly put it, a blank node is "a name that is not
>>>> a name".) 
>>> 
>>> No, it is not a name that is not a name, it is a thing which has no URI.
>> Uh . . . I don't think that is quite correct.  As I understand, a blank node does *not* represent *a* thing.  Rather, it asserts that there *exists* a thing, as explained in the RDF Semantics:
>> https://www.w3.org/TR/rdf11-mt/#blank-nodes
>> In contrast, an IRI represents *a* thing.  I'm sorry to be pedantic here, but I mention it because it underscores my point: the semantics of blank nodes really *are* subtle -- at least to *average* developers.
> 
> Is this idea really hard for anyone? If URIs are names, then blank nodes are pronouns, like 'anyone' in the previous sentence. People don't seem to find pronouns hard or subtle or confusing, or complain that they have devious semantics.
> 
> The generic pronoun is actually 'something'. The triple
> 
> ex:PatHayes ex:owns _:x17 .
> 
> says 'Pat Hayes owns something', without saying what it is that I own. One can conclude things from this: I am not destitute, for example. If you know more about what I own:
> 
> _:x17 rdf:type dbpedia:Real_estate .
> 
> then you can infer more: that I am actually in reasonable financial circumstances. Now, you *could* invent a URI for this thing that I own, but that strongly suggests that you can identify it, which is most unlikely. It also suggests (even if it strictly should not do so) that there is only one of it, which is downright false. Also, it takes work to create a URI, and a quite unreasonable amount of work to create a 'cool' one.
> 
> OK, bnodes do make RDF more complicated than it would be without them. But RDF without blank nodes is just data graphs. I could argue that RDF without IRIs would be even simpler, and I would be right, but it's a silly idea to defend. Seems to me we need to make RDF more expressive, not less so.
> 
> However, I agree with your point about bnode *identifiers*. This seems to me to be the really bad idea, since giving it an identifier is perilously close to using a name, and the object/metalevel confusion which it generates (and the lack of any scope boundaries for these 'local' identifiers) is I think largely responsible for the pain people are feeling. There is something inherently contradictory in having an identifier for something which, by definition, is something which does not identify.
> 
> Pat Hayes
> 
>> This subtle semantic distinction -- existence versus a particular thing -- was actually debated a fair amount when RDF was created, if I remember correctly.  The prevailing thought at the time was that there was value in being able to make such "existence" assertions, so that is what we got in the RDF semantics.  But after 20+ years of use, I think it has become clear that this subtle distinction is not actually *needed* in practice, as Skolem IRIs clearly demonstrate.
>> https://www.w3.org/TR/rdf11-mt/#skolemization-informative
>> But again, I am also convinced that we *do* need the convenience that blank nodes currently provide.  So in forging a path forward, we should be sure to retain the convenience, even if we dispense with blank nodes themselves.
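
For concreteness, a rough Python sketch of the Skolemization idea referred to above, as I understand it: every blank node label is swapped, consistently, for a freshly minted IRI. The /.well-known/genid/ path follows the suggestion in RDF 1.1 Concepts; the function name skolemize, the example.org authority, and the treatment of terms as plain strings are assumptions for illustration only.

```python
import uuid

def skolemize(triples, authority="http://example.org"):
    """Replace each blank node label with a fresh Skolem IRI, consistently.

    Toy model: terms are plain strings, and anything starting with "_:"
    is taken to be a blank node label.
    """
    mapping = {}

    def swap(term):
        if isinstance(term, str) and term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"<{authority}/.well-known/genid/{uuid.uuid4().hex}>"
            return mapping[term]
        return term

    return [(swap(s), p, swap(o)) for s, p, o in triples]

skolemized = skolemize([
    ("ex:PatHayes", "ex:owns", "_:x17"),
    ("_:x17", "rdf:type", "dbpedia:Real_estate"),
])
for triple in skolemized:
    print(triple)
```

Both occurrences of _:x17 end up as the same IRI, so the graph keeps its shape, but the node can now be referred to from outside.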
>>> . . . [Blank nodes] are not stable identifiers because the
>>> people who generate the data, like the JSON above, don’t want to have to go to the pain of thinking up or supporting an identifier.
>> Exactly.  That is why I believe one key problem that we need to address, to solve the blank node problem, is to ease the pain of making identifiers, by both:
>>  - using higher-level forms of RDF that eliminate/reduce the need for uninteresting identifiers; and
>>  - making it *easier* to allocate IRIs.
>> Turtle and N3 already make an excellent step in the right direction, by providing [] and () notations, as you've pointed out.
>>> 
>>>> A blank node label cannot be used in
>>>> a follow-up SPARQL query to refer to the same node, which
>>>> is justifiably viewed as completely broken by RDF newbies.
>>> 
>>> If the data is serialized as turtle, typically the blank nodes all
>>> appear as [ ] square brackets, so there is no blank node identifier
>>> which would cause a newbie to think they could query it.
>> Agreed.  But for this approach to really work, I think tools need to work consistently at this higher level, so that users *never* need to look at the underlying triples or think about them, just as Python programmers never need to look at compiled byte code.  And we're definitely not there yet.
>>> 
>>>> Blank nodes also cause duplicate triples (non-lean) when the
>>>> same data is loaded more than once, which can easily happen
>>>> when data is merged from different sources. 
>>> 
>>> Just as if you were using an SQL database or a graph database, in general
>>> when you load data, it is wise to query whether this is something we already know, and if not, don’t add it again.
>> Sure, that's a work-around that RDF users currently employ.  But it requires a *lot* of work to perform all of those pre-queries for everything before adding any data.  It would be much less burdensome if duplicate triples were eliminated automatically.   This could be achieved if predictable identifiers were automatically assigned, for example when n-ary relations are encoded in RDF.  To do so, tools must be aware of a key that uniquely identifies that n-ary relation.  And in practice, n-ary relations usually *do* have a key -- or composite key. The key could be used in automatically assigning a predictable identifier.  This would make it trivial for tools to eliminate duplicate triples.
>> To illustrate, consider this example from the W3C Note on N-ary relations document,
>> https://www.w3.org/TR/swbp-n-aryRelations/#useCase1
>> in which a blank node _:Diagnosis_Relation_1 is used to connect the entities in the relation:
>> :Christine
>>       a       :Person ;
>>       :has_diagnosis _:Diagnosis_Relation_1 .
>> _:Diagnosis_Relation_1
>>       a       :Diagnosis_Relation ;
>>       :diagnosis_probability :HIGH ;
>>       :diagnosis_value :Breast_Tumor_Christine .
>> Instead of assigning an arbitrary blank node (as above), a predictable identifier could be automatically generated, based (recursively) on the identities of the participants in this n-ary relation, which in the above example are:
>>       :Christine (who :has_diagnosis)
>>       :HIGH (the :diagnosis_probability)
>>       :Breast_Tumor_Christine (the :diagnosis_value)
>> The exact conventions for doing this still need to be worked out, but I think a reasonable balance can be achieved, to enable this to work without placing an onerous burden on RDF authors.  (Remember, RDF authors already know what their keys or composite keys are!)
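
One possible shape for such a convention, purely as a sketch: hash the relation type together with its ordered key parts and use the digest as the local part of an IRI. Nothing here is an agreed scheme; keyed_iri, the example.org base, the 16-character truncation, and the :LOW value used for contrast are all hypothetical.

```python
import hashlib
import json

def keyed_iri(relation_type, key_parts, base="http://example.org/id/"):
    """Derive a stable IRI from a relation type and its composite key."""
    # json.dumps gives an unambiguous encoding of the ordered key parts,
    # so ["ab", "c"] and ["a", "bc"] produce different payloads.
    payload = json.dumps([relation_type, *key_parts])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"<{base}{digest}>"

key = [":Christine", ":HIGH", ":Breast_Tumor_Christine"]
first = keyed_iri(":Diagnosis_Relation", key)
second = keyed_iri(":Diagnosis_Relation", key)
# :LOW is an invented value, used only to show a different key.
other = keyed_iri(":Diagnosis_Relation", [":Christine", ":LOW", ":Breast_Tumor_Christine"])

print(first == second)  # the same key gives the same identifier
print(first == other)   # a different key gives a different one, barring hash collisions
```

The order of the key parts matters in this sketch, so the author (or the tool) would have to fix one.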
>>> 
>>> In most systems, if you load the same data more than once,
>>> you get duplications.  RDF with no blank nodes is fairly unique in that duplicate triples are automatically removed, so long as everyone has used the same URIs for the same things.
>> Yes!   And I think this observation could help provide a route toward a better solution, as explained above.
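
A toy illustration of that difference, assuming a store that is just a Python set of triples and a loader that gives blank nodes fresh labels on every load (blank node labels are only meaningful within one document, so a merge has to keep them apart). The load function, the _:g labels, and the :diag1 IRI are invented for the example.

```python
from itertools import count

_fresh = count(1)

def load(store, triples):
    """Add triples to a set-based store, relabelling blank nodes per load."""
    renamed = {}

    def term(t):
        if isinstance(t, str) and t.startswith("_:"):
            if t not in renamed:
                renamed[t] = f"_:g{next(_fresh)}"
            return renamed[t]
        return t

    store.update((term(s), p, term(o)) for s, p, o in triples)

bnode_data = [(":Christine", ":has_diagnosis", "_:d"),
              ("_:d", ":diagnosis_probability", ":HIGH")]
iri_data = [(":Christine", ":has_diagnosis", ":diag1"),
            (":diag1", ":diagnosis_probability", ":HIGH")]

with_bnodes, with_iris = set(), set()
for _ in range(2):  # load the same data twice
    load(with_bnodes, bnode_data)
    load(with_iris, iri_data)

print(len(with_bnodes))  # the bnode triples are duplicated under new labels
print(len(with_iris))    # the IRI triples collapse to one copy
```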
>>> 
>>>> And they cause difficulties with canonicalization, described next.
>>> 
>>> Canonicalization works for me with real data, thanks.
>>> But that is another topic, not this one.
>>> 
>>> But the take-away from your note about blank nodes: use more turtle, and think about it as the turtle language more than the underlying triples.
>> I fully agree, and more: I think it may be time to create an even higher-level form of RDF, that is even easier than Turtle or N3, and directly supports property graphs.
>> David Booth
> 
> -- 
> -----------------------------------
> call or text to 850 291 0667
> www.ihmc.us/groups/phayes/
> www.facebook.com/the.pat.hayes

-- 
Hugh
023 8061 5652
Received on Sunday, 25 November 2018 13:28:49 UTC