Re: Blank nodes must DIE! [ was Re: Blank nodes semantics - existential variables?] from Aidan Hogan on 2020-06-30 (semantic-web@w3.org from June 2020)

From: Aidan Hogan <aidhog@gmail.com>
Date: Tue, 30 Jun 2020 18:45:43 -0400
To: semantic-web@w3.org
Message-ID: <d5edea31-71e6-3bef-1178-65672813be56@gmail.com>

Hi David,

On 2020-06-30 10:40, David Booth wrote:
> On 6/29/20 7:33 PM, Aidan Hogan wrote:
>> For what it is worth, we started working on the topic of blank nodes 
>> some time ago similarly convinced of the fact that the RDF semantics 
>> of blank nodes was unintuitive, and that a better semantics could be 
>> found. A couple of papers and several years later, I was/am more or 
>> less convinced that the semantics of blank nodes is as it should be in 
>> RDF.
> 
> While I appreciate the very thorough technical analysis that Aidan has 
> done, and I don't exactly disagree with his technical conclusion, after 
> years of consideration I've come to look at the problem differently and 
> have reached a different conclusion: we should not be dealing with blank 
> nodes AT ALL.  Blank nodes should be ELIMINATED from the user 
> experience.  We need to move to a higher-level representation that does 
> not have blank node labels, so that users never need to think about them 
> or be baffled at the semantic subtleties that have dogged these 
> discussions for so long.  Blank nodes should exist ONLY in the 
> underlying machinery that users NEVER need to touch or see.

I think that getting rid of blank nodes entirely is a reasonable 
position to discuss. Assuming we have blank nodes, then the RDF 
semantics makes sense to me: I think they should remain local and 
existential. But it is another question whether or not they are worth it 
in the first place. Note that I am a big fan of minimality. If we could 
get away without blank nodes, and if things would be simpler without 
them, then I would be all for it. My opinion is based on the suspicion 
that things would be more complex without the *option* of using blank 
nodes. But in the context of Linked Data, for example, their use is 
discouraged, and many important datasets heed that advice. I think this 
is a good balance: blank nodes are an option if you need them, but if 
you don't like them and/or don't need them, don't use them.

A third option that various people have worked on, including myself, is 
to develop methods to skolemise blank nodes, converting them into IRIs 
and assigning them consistent canonical labels. So if you don't want the 
headache of dealing with blank nodes (as is common in legacy data), there 
is always the option of eliminating the blank nodes by skolemising as 
part of a pre-processing step (though it would of course require an 
additional dependency in the project to include the skolemisation code).
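
As a rough illustration of what such a pre-processing step involves, 
here is a minimal sketch in Python. It assumes triples are plain tuples 
of N-Triples-style strings and that blank nodes are written "_:label"; 
the function name and the example.org authority are hypothetical. Note 
that it mints random IRIs, so it only skolemises: it does not compute 
the consistent canonical labels mentioned above, which require looking 
at the structure of the graph.

```python
import uuid

# Hypothetical base for Skolem IRIs. RDF 1.1 suggests the
# /.well-known/genid/ path; the authority here is made up.
SKOLEM_BASE = "https://example.org/.well-known/genid/"


def skolemise(triples, base=SKOLEM_BASE):
    """Replace each blank node label ("_:x") with a fresh IRI."""
    mapping = {}

    def convert(term):
        if not term.startswith("_:"):
            return term  # IRIs and literals pass through unchanged
        if term not in mapping:
            # Random, not canonical: two runs give different IRIs.
            mapping[term] = f"<{base}{uuid.uuid4().hex}>"
        return mapping[term]

    return [tuple(convert(term) for term in triple) for triple in triples]


triples = [
    ("<http://example.org/alice>", "<http://xmlns.com/foaf/0.1/knows>", "_:b0"),
    ("_:b0", "<http://xmlns.com/foaf/0.1/name>", '"Bob"'),
]
skolemised = skolemise(triples)
```

After this step the data contains only IRIs and literals, and both 
occurrences of _:b0 map to the same IRI.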

> In practical terms, this means adopting a new, higher level RDF-based 
> syntax that allows RDF tooling to be reused as much as possible.
> 
> A minimum contender would be Turtle/TriG without blank node labels, but 
> if we are contemplating a new syntax then I personally think it would be 
> worth making a few more changes at the same time, to make it even higher 
> level and easier to use.  A number of ideas have been collected here, 
> though somewhat haphazardly:
> https://github.com/w3c/EasierRDF/issues
> 
> But note that a new RDF-based syntax is only one part of the entire tool 
> chain.  A SPARQL successor would also be needed, to support the new 
> features and restrictions, and libraries would have to support them also.

In terms of higher level RDF-based syntaxes, my first thought is that 
this would be Turtle or JSON-LD? You mention Turtle removing blank 
nodes, but I don't immediately agree that it would make the syntax all 
that much easier to understand (I would need to be convinced). It would 
also require removing shortcuts for lists, which creates other issues. 
(Also, most of the Semantic Web standards would need to be rewritten, 
though that is more of a practical and historical concern, and so should 
perhaps initially take a back seat to the guiding principle of what is 
actually best.)
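
To make the point about lists concrete: the Turtle shortcut ( "a" "b" ) 
stands for a chain of rdf:first/rdf:rest triples whose intermediate 
nodes are blank nodes. The sketch below, in Python, spells out that 
expansion under the assumption that terms are N-Triples-style strings; 
the function name and the "_:l0"-style labels are made up for 
illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def expand_collection(items):
    """Return (head, triples) for a Turtle collection such as ( "a" "b" )."""
    if not items:
        return f"<{RDF}nil>", []  # the empty list is just rdf:nil
    # One blank node per list cell; the labels are arbitrary.
    nodes = [f"_:l{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(items) else f"<{RDF}nil>"
        triples.append((nodes[i], f"<{RDF}first>", item))
        triples.append((nodes[i], f"<{RDF}rest>", rest))
    return nodes[0], triples


head, triples = expand_collection(['"a"', '"b"'])
```

A two-element list thus needs two blank nodes and four triples; without 
blank nodes, each list cell would need an IRI of its own.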

I think though it would be interesting to look at a concrete proposal 
along the lines you mention and compare it with the existing standards.

> I REALLY wish that some PhD students would take on this challenge: to 
> design a higher-level successor to RDF, with a top-line goal of making 
> it easy enough for AVERAGE developers (middle 33% of skill), who are new 
> to it, to be consistently successful.  Note to such PhD students/researchers: 
> pay particular attention to Sean Palmer's insightful comments also:
> https://github.com/w3c/EasierRDF/issues/68
> 
> IMO blank nodes have been a significant factor in pushing RDF over the 
> cognitive complexity threshold that average developers are willing to 
> tolerate.  Given how rapidly other easier-to-use graph databases have 
> become popular and have far overtaken RDF in market share, I think it is 
> URGENT that we address the problem of making RDF easier for AVERAGE 
> developers:
> https://db-engines.com/en/ranking/graph+dbms

I don't think the comparison is all that simple. RDF is a standard 
format for data exchange (particularly on the Web). Graph databases are 
systems with query languages for querying graphs. Regarding the adoption 
(or "market share") of RDF, a better statistic might be: "[of 32 million 
websites] approximately 6.3 million of these websites use Microdata, 5.1 
million websites use JSON-LD, and 1 million websites make use of RDFa" 
[1]. Regarding SPARQL more specifically, one might also mention the 
millions of daily queries being processed on Wikidata [2].

That is not to say that we do not have something to learn from graph 
databases like Neo4j. On the contrary, their documentation, demos, 
installation, etc., are geared towards developers in a way that the RDF 
et al. standards/primers have not traditionally been and in a way that 
suggests a possible opportunity that we have been missing. But languages 
like Cypher have their own complications (including, as a personal 
example, the use of an edge-isomorphic semantics within graph patterns, 
which I find messy). Property graphs and Cypher are no more intuitive to 
understand *completely* than RDF and SPARQL, in my opinion; the former 
have their fair share of idiosyncrasies and complications too, probably 
even worse than RDF and SPARQL, and they do not even have to consider 
the needs of the Web! Plus, if you want examples of things that are 
really unintuitive, I can share some examples of queries in MongoDB that 
would put blank nodes to shame (and MongoDB is the most popular NoSQL 
system out there according to the list you reference).
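
For readers unfamiliar with the edge-isomorphic semantics mentioned 
above, the following Python sketch contrasts it with the 
homomorphism-based matching used by SPARQL. It is a toy illustration, 
not how any engine is implemented: the graph, the function name and the 
single hard-coded pattern (x)-[e1]->(y)<-[e2]-(z) are all assumptions 
made for the example.

```python
from itertools import product

# A tiny directed graph: two edges pointing at the same node.
edges = [
    ("alice", "carol"),
    ("bob", "carol"),
]


def matches(edges, distinct_edges):
    """Match (x)-[e1]->(y)<-[e2]-(z) and return the (x, y, z) bindings.

    With distinct_edges=True, e1 and e2 must be different edges
    (edge-isomorphic matching); with False, they may coincide
    (homomorphism-based matching).
    """
    results = []
    for (i, (x, y1)), (j, (z, y2)) in product(enumerate(edges), repeat=2):
        if y1 != y2:
            continue  # both edges must point at the same node y
        if distinct_edges and i == j:
            continue  # forbid reusing one edge for both e1 and e2
        results.append((x, y1, z))
    return results


homomorphic = matches(edges, distinct_edges=False)
edge_isomorphic = matches(edges, distinct_edges=True)
```

On this graph the homomorphic variant returns four bindings, including 
those where x and z are the same node, whereas the edge-isomorphic 
variant returns only the two in which e1 and e2 are different edges.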

In terms of the argument that complexity in standards drives developers 
away, I think the key counter-example is SQL, whose standard is several 
thousand pages long [3], with complex features catering to niche 
use-cases. This has not slowed developer adoption of SQL. Few, if any, 
care about that weird feature on page 1413 of the standard.

The message here is that to attract developers, we need a message and an 
aesthetic that appeal to them, we need to address a need that they have 
and to understand their processes, and we need to take steps in their 
direction rather than asking them to make the pilgrimage to us. I think 
that initiatives like JSON-LD, or 
works on trying to bridge GraphQL and RDF/SPARQL, and the work of a 
great many people in the community, including those who make their 
living from these standards, should be celebrated for bridging this gap 
(even if there is much work left to do). For me, these are examples of 
better ways to get more and more developers involved with RDF et al. I 
personally think that there are greater priorities in this direction 
than eliminating blank nodes.

For posterity's sake, I should mention that I might be wrong in all of 
this. :) It would be interesting to see an "easier RDF" proposal that 
might justify this disclaimer.

Best,
Aidan

[1] https://www.uni-mannheim.de/dws/news/442-billion-quads-microdata-embedded-json-ld-rdfa-and-microformat-data-originating-from-119-million/

[2] https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en

[3] https://www.wiscorp.com/SQLStandards.html
Received on Tuesday, 30 June 2020 22:46:00 UTC