Re: Blank Nodes Re: Toward easier RDF: a proposal from thomas lörtsch on 2018-11-25 (semantic-web@w3.org from November 2018)

From: thomas lörtsch <tl@rat.io>
Date: Sun, 25 Nov 2018 18:14:10 +0100
To: Tim Berners-Lee <timbl@w3.org>
Cc: SW-forum Web <semantic-web@w3.org>
Message-Id: <49B274D8-2FFF-4578-88FC-460924BB7F21@rat.io>
> On 22. Nov 2018, at 13:02, Tim Berners-Lee <timbl@w3.org> wrote:
> 
> David
> 
> I agree with your resolution to make RDF easier to use for real  developers, whatever they are.  But I do not despair at the level that you do, I am more hopeful.
> Let me pick just one of your points (with a new subject as suggested).
> 
> 
>> On 2018-11-21, at 22:40, David Booth <david@dbooth.org> wrote:
>> 
>> 3. Blank nodes.  They are an important convenience for RDF
>> authors,
> 
> Yes, here I agree.  The default data language for developers at the moment
> is JSON, and that is full of blank nodes.  Every {} in JSON is equivalent to a blank node [] in Turtle.
> 
> Where in JSON you write
> 
> { "name": "Fred Bloggs",
>   "address": {
>     "number":  123,
>     "street": "Acacia Avenue" }
> }
> 
> in turtle you write
> 
> [ :name "Fred Bloggs";
>   :address [
>       :number  123;
>       :street  "Acacia Avenue" ]
> ]
> 
> Which is just as simple as the JSON.  When you look at Turtle as a language
> to write and to generate it is I think nice.


IMO this is a good example showing that bnodes are first and foremost structure.

I used to think of them as plastic bags: you put things in them to transport them or keep them together but they carry no meaning in themselves (not counting the advertisements usually printed on them as "meaning", of course).

Bnodes allow graphs to encode nested lists (trees). That is useful because although graphs are very flexible, in real life we often prefer less flexible data structures like lists, nested lists and tables. At least I do when I write things down. Those structures add some, well, structure to what we want to express. Do they carry "meaning"? I’d say yes, but normally I don’t refer to the structure itself. On the contrary, it’s so useful precisely because I don’t have to make it explicit: it’s just there, as bullet points, indentation, columns and rows.
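
To make the nesting concrete, here is a minimal Python sketch (an illustration only, not code from any RDF library) that flattens a nested JSON-like dict into triples. It assumes triples are plain tuples and that blank-node labels such as _:b0 come from a simple counter; the function name json_to_triples is made up for this example, and JSON arrays are deliberately not handled.

```python
from itertools import count


def json_to_triples(obj, counter=None, triples=None):
    """Flatten a nested dict into (subject, predicate, object) tuples.

    Every dict becomes a fresh blank-node label (assumption: "_:bN"),
    and every key becomes a predicate. Lists are not handled.
    Returns (label of the node for obj, list of all triples).
    """
    if counter is None:
        counter = count()
    if triples is None:
        triples = []
    node = f"_:b{next(counter)}"
    for key, value in obj.items():
        if isinstance(value, dict):
            # A nested object gets its own blank node, linked from the parent.
            child, _ = json_to_triples(value, counter, triples)
            triples.append((node, key, child))
        else:
            triples.append((node, key, value))
    return node, triples


person = {
    "name": "Fred Bloggs",
    "address": {"number": 123, "street": "Acacia Avenue"},
}
root, triples = json_to_triples(person)
for triple in triples:
    print(triple)
```

Each nested dict gets its own label, and that label does nothing except hold the inner structure together and attach it to the outer one, which is the plastic-bag role described above.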

Sometimes I do want to address a specific location in that structure. Then it’s useful to be able to give that bnode an identifier (and the ability to do so is a plus for RDF). However, a triple with a bnode, separated from the other triples containing that same bnode, can only ever be of limited use. It’s like taking two cells out of a bigger table, without the headings or the full row. How far can that possibly get you? I think that some of the complaints voiced in this thread are based on unreasonable expectations and on a lack of understanding of what bnodes are and can be.

Maybe unreasonable expectations at a deeper level are the core of the problem: the usefulness of graphs as data structures is limited, maybe more limited than RDF likes to admit. They are not always the most appropriate solution. We often use much more structured approaches to information modelling, like trees and tables, and for good reasons.
RDF might be much more useful if it had a way to integrate those structures instead of trying to mimic them - and to integrate itself better into other data structures. Then maybe we would need fewer blank nodes.
Nested lists as first class citizens in RDF would be a good thing. Also tables. There were discussions about "dark triples" before the 2004 spec, but I couldn’t find much in the mailing list archives on the thinking behind them.
But putting more emphasis on linking into existing data structures - like into certain cells in an RDBMS table or subtrees in a JSON document - might be helpful as well.

My main problem with bnodes is that it’s so hard to see where one structure ends and the next one begins, and what that structure actually is: a list? nested? how deep? a table even? an n-ary relation? where does that end? which node represents its main role?
A relational table or a nested list makes that much easier. In a graph it takes extra effort to mark and characterize boundaries and substructures. RDF tries to do all that with just the bnodes, and they are overloaded. That’s why it can be much harder to figure out what’s going on in an RDF-based system than in an RDBMS-based application - despite all the self-describing properties etc.
 
Graphs are good when linking nodes that are self-contained entities. But when such entities consist of more than one node, how can I easily see where they start and end? N3/Turtle is really elegant and the example above is nice to read for humans, but it’s hard to query if I don’t know beforehand how it’s structured and where to look for stuff.

On my wish list are 
- generic structures like nested lists as first class citizens, 
- specific templates for certain types like addresses,
- more support for algorithms like Concise Bounded Descriptions. 
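
As an illustration of the last item, here is a rough Python sketch of the core idea behind a Concise Bounded Description: start at a node, take all triples with that node as subject, and follow blank-node objects recursively. It is a simplification under stated assumptions: triples are plain tuples, any string starting with _: is treated as a blank node, and the reification part of the CBD definition is left out. The names is_bnode and concise_bounded_description, and the ex: sample data, are hypothetical.

```python
def is_bnode(term):
    # Assumption: blank nodes are strings labelled "_:something".
    return isinstance(term, str) and term.startswith("_:")


def concise_bounded_description(triples, start):
    """Collect the triples describing `start`, expanding blank-node objects.

    Simplified: the reification step of the full CBD definition is omitted.
    """
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)

    result = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles between blank nodes
        seen.add(node)
        for triple in by_subject.get(node, []):
            result.append(triple)
            if is_bnode(triple[2]):
                # Only blank nodes are followed; named nodes end the description.
                stack.append(triple[2])
    return result


graph = [
    ("ex:fred", "ex:name", "Fred Bloggs"),
    ("ex:fred", "ex:address", "_:a"),
    ("_:a", "ex:street", "Acacia Avenue"),
    ("_:a", "ex:city", "ex:london"),
    ("ex:london", "ex:name", "London"),
    ("ex:bob", "ex:knows", "ex:fred"),
]
description = concise_bounded_description(graph, "ex:fred")
for triple in description:
    print(triple)
```

The blank node for the address is pulled in along with its own triples, while the named node ex:london marks a boundary and is not expanded. That boundary comes from the algorithm, not from anything the graph itself declares.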

But most of all I’d love to see a generic grouping mechanism that is more powerful than RDF’s specification of Named Graphs, supporting nesting and composition of named graphs and identification/reification of statements in named graphs (vulgo: quints). Quints are my favoured hammer and they fit many nails in the threads that David thankfully got started.

Thomas


> In fact using turtle more for documentation and examples instead of Ntriples etc I think will make things easier for developers.
> This is just a bit of nested structure in the language, which is valuable,
> understandable and no cause for alarm.
> 
>> but they cause insidious downstream complications.
>> They have subtle, confusing semantics.  
> 
> I find them very simple, thanks.
> 
>> (As Nathan Rixham
>> once aptly put it, a blank node is "a name that is not
>> a name".)  
> 
> No, it is not a name that is not a name, it is a thing which has no URI.
> A little less hysteria over blank nodes may be in order.
> 
>> Blank nodes are special second-class citizens
>> in RDF.  They cannot be used as predicates,
> 
> 
> Agreed it messes up the symmetry.  Actually in most of my code you can use a blank node as a predicate.  That said, RDF is unusual in having as much symmetry. 
> I don’t think your average JSON programmer expects to be able to use an object as a key.  So this won’t confuse them. 
> 
>> and they are not
>> stable identifiers.  
> 
> They are not stable identifiers because the
> people who generate the data, like the JSON above, don’t want to have to go to the pain of thinking up or supporting an identifier.
> 
>> A blank node label cannot be used in
>> a follow-up SPARQL query to refer to the same node, which
>> is justifiably viewed as completely broken by RDF newbies.
> 
> If the data is serialized as turtle, typically the blank nodes all
> appear as [ ] square brackets, so there is no blank node identifier 
> which would cause a newbie to think they could query it.
> 
>> Blank nodes also cause duplicate triples (non-lean) when the
>> same data is loaded more than once, which can easily happen
>> when data is merged from different sources.  
> 
> Just as if you were using an SQL database or a graph database, in general
> when you load data, it is wise to query whether this is something we already know, and if not, don’t add it again.
> 
> In most systems, if you load the same data more than once,
> you get duplications.  RDF with no blank nodes is fairly unique in that duplicate triples are automatically removed, so long as everyone has used the same URIs for the same things.
> 
>> And they cause difficulties with canonicalization, described next.
> 
> Canonicalization works for me with real data, thanks.
> But that is another topic, not this one.
> 
> But the take-away from your note about blank nodes: use more Turtle, and think about it as the Turtle language more than the underlying triples.
> 
> timbl
> 
Received on Sunday, 25 November 2018 17:14:36 UTC