- From: David Booth <david@dbooth.org>
- Date: Thu, 22 Nov 2018 18:49:03 -0500
- To: semantic-web@w3.org
Hi Tim,
On 11/22/18 7:02 AM, Tim Berners-Lee wrote:
> . . .
> Every {} in JSON is
> equivalent to a blank node [] in turtle
Agreed.
> . . . When you look at Turtle as a language
> to write and to generate it is I think nice.
> In fact using turtle more for documentation and examples instead of
> Ntriples etc I think will make things easier for developers. . . .
Agreed.
>> but [blank nodes] cause insidious downstream complications.
>> They have subtle, confusing semantics.
>
> I find them very simple, thanks.
Okay, but you, Sir, are not exactly *average*. :) *Average* developers
-- middle 33% of ability -- certainly do *not* find blank node
restrictions and semantics simple. They get stung by them on a fairly
regular basis.
>> (As Nathan Rixham
>> once aptly put it, a blank node is "a name that is not
>> a name".)
>
> No, it is not a name that is not a name, it is a thing which has no URI.
Uh . . . I don't think that is quite correct. As I understand it, a blank
node does *not* represent *a* thing. Rather, it asserts that there
*exists* a thing, as explained in the RDF Semantics:
https://www.w3.org/TR/rdf11-mt/#blank-nodes
In contrast, an IRI represents *a* thing. I'm sorry to be pedantic
here, but I mention it because it underscores my point: the semantics of
blank nodes really *are* subtle -- at least to *average* developers.
This subtle semantic distinction -- existence versus a particular thing
-- was actually debated a fair amount when RDF was created, if I
remember correctly. The prevailing thought at the time was that there
was value in being able to make such "existence" assertions, so that is
what we got in the RDF semantics. But after 20+ years of use, I think
it has become clear that this subtle distinction is not actually
*needed* in practice, as Skolem IRIs clearly demonstrate.
https://www.w3.org/TR/rdf11-mt/#skolemization-informative
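
To make the Skolem IRI idea concrete, here is a minimal sketch in Python. It assumes triples are plain tuples of strings and that any term starting with "_:" is a blank node label; the function name skolemize, the example.org authority, and the use of random UUIDs are illustrative assumptions. Only the /.well-known/genid/ path comes from the RDF 1.1 documents.

```python
import uuid

# Path that RDF 1.1 suggests for Skolem IRIs.
GENID_PATH = "/.well-known/genid/"


def skolemize(triples, authority="http://example.org",
              new_id=lambda: uuid.uuid4().hex):
    """Replace every blank node label (a term starting with "_:") with a fresh IRI."""
    mapping = {}  # blank node label -> Skolem IRI, so repeated labels stay linked

    def convert(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"<{authority}{GENID_PATH}{new_id()}>"
            return mapping[term]
        return term

    return [tuple(convert(term) for term in triple) for triple in triples]


triples = [
    (":Christine", ":has_diagnosis", "_:Diagnosis_Relation_1"),
    ("_:Diagnosis_Relation_1", ":diagnosis_probability", ":HIGH"),
]
skolemized = skolemize(triples)
```

After this step the node can be referred to by its IRI in a follow-up query, but the IRI is still arbitrary: running the function again on the same data yields a different one.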
But again, I am also convinced that we *do* need the convenience that
blank nodes currently provide. So in forging a path forward, we should
be sure to retain the convenience, even if we dispense with blank nodes
themselves.
> . . .
> [Blank nodes] are not stable identifiers because the
> people who generate the data, like the JSON above, don’t want to have to
> go to the pain of thinking up or supporting an identifier.
Exactly. That is why I believe one key problem that we need to address,
to solve the blank node problem, is to ease the pain of making
identifiers, by both:
- using higher-level forms of RDF that eliminate/reduce the need for
uninteresting identifiers; and
- making it *easier* to allocate IRIs.
Turtle and N3 already take an excellent step in the right direction, by
providing [] and () notations, as you've pointed out.
>
>> A blank node label cannot be used in
>> a follow-up SPARQL query to refer to the same node, which
>> is justifiably viewed as completely broken by RDF newbies.
>
> If the data is serialized as turtle, typically the blank nodes all
> appear as [ ] square brackets, so there is no blank node identifier
> which would cause a newbie to think they could query it.
Agreed. But for this approach to really work, I think tools need to
work consistently at this higher level, so that users *never* need to
look at the underlying triples or think about them, just as Python
programmers never need to look at compiled bytecode. And we're
definitely not there yet.
>
>> Blank nodes also cause duplicate triples (non-lean) when the
>> same data is loaded more than once, which can easily happen
>> when data is merged from different sources.
>
> Just as if you were using an SQL database or a graph database, in general
> when you load data, it is wise to query whether this is something we
> already know, and if not, don’t add it again.
Sure, that's a work-around that RDF users currently employ. But it
requires a *lot* of work to perform all of those pre-queries for
everything before adding any data. It would be much less burdensome if
duplicate triples were eliminated automatically. This could be
achieved if predictable identifiers were automatically assigned, for
example when n-ary relations are encoded in RDF. To do so, tools must
be aware of a key that uniquely identifies that n-ary relation. And in
practice, n-ary relations usually *do* have a key -- or composite key.
The key could be used in automatically assigning a predictable
identifier. This would make it trivial for tools to eliminate duplicate
triples.
To illustrate, consider this example from the W3C Note on n-ary
relations,
https://www.w3.org/TR/swbp-n-aryRelations/#useCase1
in which a blank node _:Diagnosis_Relation_1 is used to connect the
entities in the relation:
:Christine
    a :Person ;
    :has_diagnosis _:Diagnosis_Relation_1 .
_:Diagnosis_Relation_1
    a :Diagnosis_Relation ;
    :diagnosis_probability :HIGH ;
    :diagnosis_value :Breast_Tumor_Christine .
Instead of assigning an arbitrary blank node (as above), a predictable
identifier could be automatically generated, based (recursively) on the
identities of the participants in this n-ary relation, which in the
above example are:
:Christine (who :has_diagnosis)
:HIGH (the :diagnosis_probability)
:Breast_Tumor_Christine (the :diagnosis_value)
The exact conventions for doing this still need to be worked out, but I
think a reasonable balance can be achieved, to enable this to work
without placing an onerous burden on RDF authors. (Remember, RDF
authors already know what their keys or composite keys are!)
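
Since the conventions are still open, the following Python sketch shows only one possible shape for such a scheme. Everything in it is an assumption made for illustration: the function name keyed_iri, the example.org base IRI, the use of SHA-256, and the truncation to 16 hex digits. It also treats prefixed names such as :Christine as if they were already full identifiers.

```python
import hashlib


def keyed_iri(relation_type, key_parts, base="http://example.org/id/"):
    """Derive a deterministic IRI from a relation type and its composite key.

    key_parts maps each key property to the identifier of its participant.
    """
    # Sort by property so the result does not depend on authoring order.
    lines = [relation_type]
    lines += [f"{prop} {value}" for prop, value in sorted(key_parts.items())]
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()[:16]
    return f"<{base}{digest}>"


# The three participants listed above act as the composite key.
diagnosis_key = {
    ":has_diagnosis": ":Christine",
    ":diagnosis_probability": ":HIGH",
    ":diagnosis_value": ":Breast_Tumor_Christine",
}
node = keyed_iri(":Diagnosis_Relation", diagnosis_key)

# The same key written in a different order gives the same identifier.
same_node = keyed_iri(":Diagnosis_Relation", dict(reversed(list(diagnosis_key.items()))))
```

For the recursive case, a participant that is itself an n-ary relation would first be given its own keyed IRI, and that IRI would then be used as a value in key_parts.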
>
> In most systems, if you load the same data more than once,
> you get duplications. RDF with no blank nodes is fairly unique in that
> duplicate triples are automatically removed, so long as everyone has
> used the same URIs for the same things.
Yes! And I think this observation could help provide a route toward a
better solution, as explained above.
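
The contrast can be seen in a small Python sketch that models a triple store as a set of tuples. This is a toy model, not the behavior of any particular RDF library: the load function and the "_:b<n>" relabeling are assumptions that mimic the rule that blank node labels are scoped to the document they appear in.

```python
import itertools

_fresh_ids = itertools.count()


def load(document, store):
    """Add a parsed document to the store, giving its blank nodes fresh labels."""
    renamed = {}  # label in this document -> label that is unique in the store

    def rename(term):
        if term.startswith("_:"):
            if term not in renamed:
                renamed[term] = f"_:b{next(_fresh_ids)}"
            return renamed[term]
        return term

    for triple in document:
        store.add(tuple(rename(term) for term in triple))


with_blank_node = [
    (":Christine", ":has_diagnosis", "_:d"),
    ("_:d", ":diagnosis_probability", ":HIGH"),
]
with_iri = [
    (":Christine", ":has_diagnosis", ":Diagnosis_1"),
    (":Diagnosis_1", ":diagnosis_probability", ":HIGH"),
]

blank_store, iri_store = set(), set()
for _ in range(2):  # load each document twice
    load(with_blank_node, blank_store)
    load(with_iri, iri_store)

print(len(blank_store))  # 4: the second load added a second copy
print(len(iri_store))    # 2: the set absorbed the repeated triples
```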
>
>> And they cause difficulties with canonicalization, described next.
>
> Canonicalization works for me with real data, thanks.
> But that is another topic, not this one.
>
> But the take-away from your note about blank nodes: use more turtle,
> and think about it as the turtle language more than the underlying triples.
I fully agree, and more: I think it may be time to create an even
higher-level form of RDF, that is even easier than Turtle or N3, and
directly supports property graphs.
David Booth
Received on Thursday, 22 November 2018 23:49:26 UTC