Re: Pragmatic Problems in the RDF Ecosystem (Was: Re: Toward easier RDF: a proposal)

Dear Everyone,

I have been attempting to follow the discussion after David Booth's
email "Toward easier RDF: a proposal".  Sadly, due to life
circumstances, I could not follow it as well as I would have liked,
but I would now like to at least say something as a user of RDF.

First, a little background: my degrees are in the Humanities, but I
work as a software engineer in my day job.  I feel that I am in the
"middle 33%" that Mr. Booth is attempting to address.  I will echo
here some of the same points that Steven G. Harms made in his email
"Pragmatic Problems in the RDF Ecosystem", but I also have a few of
my own.

In the RDF world, I work mostly on a side project: a graph database
of early Irish genealogies (my background is in early Irish law and
literature): https://github.com/cyocum/irish-gen.  This is by nature
a human-curated database, with only a few tools to help us turn the
data from the manuscripts into RDF.  Some of my points will be
specific to my project, because it is the only project that I have
done in RDF, so they may not be universally applicable.  Most of this
could be headed under: Lack of Sufficient Tooling.

1. Lack of a Good Editor

This could be put under Steven G. Harms's section "Lack of Automated
Feedback".

Recently, both Atom and Visual Studio Code have come to the
mainstream.  While there are many plugins that will do automatic code
completion for you in JavaScript, Java, etc., there is none out there
that will do that for you in Turtle.  This has led to my
collaborators and me making mistakes: for instance, putting a literal
in the object position where only an IRI is allowed per OWL.  The
editor does not detect this and display a warning.
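The check I have in mind is not complicated.  Here is a minimal
sketch in Python, using plain tuples for triples and invented example
IRIs (this is an illustration of the idea, not a real Turtle parser):

```python
# Sketch of the editor check I wish existed: flag a literal in the
# object position of a property that the ontology declares to be an
# owl:ObjectProperty.  Triples are plain (subject, predicate, object)
# tuples; literals are marked by wrapping them in a Literal class.
# The example IRIs below are hypothetical, not from my project.

class Literal(str):
    """A literal value, as opposed to an IRI string."""

OWL_OBJECT_PROPERTY = "http://www.w3.org/2002/07/owl#ObjectProperty"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def lint_literal_objects(triples):
    """Return a warning for each literal used where an IRI is required."""
    # Collect every property the data declares to be an ObjectProperty.
    object_properties = {
        s for (s, p, o) in triples
        if p == RDF_TYPE and o == OWL_OBJECT_PROPERTY
    }
    # Any literal object on one of those properties is a mistake.
    return [
        f"literal {o!r} used with object property <{p}>"
        for (s, p, o) in triples
        if p in object_properties and isinstance(o, Literal)
    ]

triples = [
    ("http://example.org/childOf", RDF_TYPE, OWL_OBJECT_PROPERTY),
    ("http://example.org/Conn", "http://example.org/childOf",
     Literal("Cormac")),  # mistake: should be an IRI, not a literal
]
print(lint_literal_objects(triples))
```

A plugin running something like this on every save would have caught
the mistakes my collaborators and I made.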

2. Lack of Good Tutorials

I had the pleasure of attempting to teach RDF and Turtle at a
workshop.  I think our biggest hurdle was teaching people where
predicates come from and which ones we were using for our examples.
This ended with me having to teach OWL to people who were probably
not ready for it.

I could not send them to a good tutorial site other than the RDF
1.1 Primer or the "Linked Data: Evolving the Web into a Global Data
Space" book (http://linkeddatabook.com/editions/1.0/).  It is a very
large leap to have someone go from a few pages of primer to an entire
book.
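For what it is worth, the lesson we ended up teaching by hand — that
a predicate is nothing more than an IRI assembled from a namespace —
fits in a few lines.  A sketch (the prefix table is only an
illustration):

```python
# Where do predicates come from?  A prefixed name like foaf:name is
# only shorthand: the prefix maps to a namespace IRI, and the full
# predicate IRI is the namespace plus the local name.

PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def expand(prefixed_name):
    """Expand a prefixed name such as 'foaf:name' to a full IRI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local

# The FOAF vocabulary is just a document that defines IRIs like this one.
print(expand("foaf:name"))
```

If a tutorial walked learners through exactly this expansion before
showing them any Turtle, I suspect the "where do predicates come
from" hurdle would mostly disappear.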

But I think Steven G. Harms said it best: "Read more specs, pleb."

3. Lack of RDF Visualisation Software

My ultimate end users, researchers of early Ireland, want to be able
to see graphs, and very large graphs at that.  I could not find
anything that would visualise RDF in a sane way.  I found that the
JavaScript library d3 might be able to do this, but I would have to
write a good deal of code.  There was nothing that natively
understood RDF.  I finally chose Gephi (https://gephi.org/) with the
Semantic Web plugin
(https://seinecle.github.io/gephi-tutorials/generated-html/semantic-web-importer-en.html),
which seems to be OK, but my project has now moved to TriG and named
graphs, so I can no longer drop in a Turtle file and have it render.

Having worked during a hackathon with Neo4j, which has a nice bouncy
and friendly visualisation directly in the search interface, I can
see why people gravitate in that direction.

My last complaint here is that there are *way* too many edges in many
of the visualisers that I tried.  I tend to think of information
about an IRI as a property of that IRI, rather than as one more node
to which everything points and which clutters up the interface.  I am
thinking here of OWL DatatypeProperties in tools that produce
Graphviz files.  Datatype properties should be available in a UI when
the user hovers over a node.  Only ObjectProperties should be shown
as edges to another node, and rdf:type should be treated the same as
a DatatypeProperty for the purposes of display (maybe different
coloured nodes?).
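The split I want from a visualiser can be sketched in a few lines of
Python over toy triples (the property names and the crude
"starts with http" literal test are both only for illustration):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def split_for_display(triples):
    """Partition triples into hover attributes and drawn edges."""
    attributes = {}   # node IRI -> list of (predicate, value) pairs
    edges = []        # (source IRI, predicate, target IRI)
    for s, p, o in triples:
        # Crude stand-in for a real literal test: anything that is not
        # an IRI is treated as a literal value.
        is_literal = not o.startswith("http")
        if is_literal or p == RDF_TYPE:
            # Literal values and rdf:type become attributes shown on
            # hover, not extra nodes cluttering the display.
            attributes.setdefault(s, []).append((p, o))
        else:
            # Only IRI-to-IRI statements are drawn as edges.
            edges.append((s, p, o))
    return attributes, edges

triples = [
    ("http://example.org/Conn", "http://example.org/name", "Conn"),
    ("http://example.org/Conn", RDF_TYPE, "http://example.org/Person"),
    ("http://example.org/Conn", "http://example.org/childOf",
     "http://example.org/Cormac"),
]
attributes, edges = split_for_display(triples)
print(len(edges))   # only the childOf statement is drawn as an edge
```

Three statements about one subject collapse into a single drawn edge
plus two hover attributes, which is roughly the signal-to-clutter
ratio I want.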

4. Lack of Full OWL2 Support in Triplestores

So, let's say that I have some RDF and I want to do something with
it.  There is an amazing lack of tools.  First, I have to install a
triplestore and use SPARQL.  That's all well and good; there are a
few options out there.  However, this is a very bad fit.  Why?  Lack
of OWL2 support.  There are only two triplestores that I could find
with full OWL2 support: MarkLogic (maybe; I am still trying to
understand their documentation on the point) and Stardog (was once
Pellet).  Why is this important to me and my users?  Because, as a
human-curated database, we have very limited time and we need to get
the maximum value from our data.  This means that, when searching, we
rely heavily on inferencing.  My collaborators and I do not have the
time to hand-code that someone is the ancestor or descendant of
someone else; we need OWL to do the heavy work for us.  There are
other OWL2 implementations out there, such as HermiT and FaCT++, but
these seem completely disconnected from, say, Apache Fuseki.
Additionally, they seem only to be used in tools like Protégé and not
anywhere else.
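To make concrete what I mean by letting OWL do the heavy work: if an
ancestor property is declared an owl:TransitiveProperty, a reasoner
derives the whole chain from the parent links we actually assert.  A
pure-Python sketch of that forward-chaining step (the property IRI is
invented; the names are just an illustrative genealogical chain):

```python
def materialise_transitive(triples, prop):
    """Forward-chain a transitive property until a fixed point."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # If (a prop b) and (b prop c) are known, derive (a prop c),
        # and keep going until no new fact appears.
        for (a, p1, b) in list(facts):
            if p1 != prop:
                continue
            for (b2, p2, c) in list(facts):
                if p2 == prop and b2 == b and (a, prop, c) not in facts:
                    facts.add((a, prop, c))
                    changed = True
    return facts

ANC = "http://example.org/ancestorOf"
# We only ever hand-enter one generation at a time.
asserted = [
    ("Conn", ANC, "Art"),
    ("Art", ANC, "Cormac"),
    ("Cormac", ANC, "Cairbre"),
]
inferred = materialise_transitive(asserted, ANC)
print(("Conn", ANC, "Cairbre") in inferred)   # derived, not hand-coded
```

Three asserted links yield six facts after materialisation; on a real
genealogy with thousands of people, hand-coding those derived facts
is exactly the work we do not have time for.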

Finding this out took many, many days of my time, as I had to search
through a seemingly ever-increasing amount of academic abandonware.
It seems that most Semantic Web code is written to produce a paper
and is then abandoned, not to build an ecosystem or a maintainable
service.

5. SPARQL Triplestore and Reasoning Performance

This brings me to SPARQL and inferencing in general.  It seems very,
very easy to write a seemingly simple SPARQL query that will lock up
your machine.  For instance, I have various subproperties of
foaf:name because early Irish is an inflected language: names have
different forms depending on where they appear in a sentence.  When I
searched for foaf:name in a SPARQL query, it never seemed to return,
and the query analyser came back with a query plan that was *huge*.
This could have been just a problem with the triplestore that I was
using (Stardog), but this sort of thing seems far easier to do than
it is in SQL.
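Part of why such a query gets expensive is easy to see: under RDFS
semantics, a pattern on foaf:name must also match every subproperty,
so one triple pattern effectively becomes a union.  A sketch of that
expansion (the subproperty IRIs are invented stand-ins for my
case-form properties):

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# rdfs:subPropertyOf assertions: child property -> parent property.
# These IRIs are illustrative, not my project's actual vocabulary.
SUB_PROPERTY_OF = {
    "http://example.org/nominativeName": FOAF_NAME,
    "http://example.org/genitiveName": FOAF_NAME,
    "http://example.org/dativeName": FOAF_NAME,
}

def properties_matching(query_property):
    """All properties a pattern on query_property must also match."""
    matches = {query_property}
    changed = True
    # Walk the subproperty hierarchy down to a fixed point, so
    # subproperties of subproperties are also collected.
    while changed:
        changed = False
        for child, parent in SUB_PROPERTY_OF.items():
            if parent in matches and child not in matches:
                matches.add(child)
                changed = True
    return matches

# One triple pattern on foaf:name becomes a union over four
# properties, and the query plan grows with every subproperty added.
print(len(properties_matching(FOAF_NAME)))
```

Every subproperty I add for another grammatical case silently widens
every query on foaf:name, which matches the huge query plans I saw.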

I have been thinking of moving to MarkLogic with forward-chaining
reasoning and materialisation, because I would rather use more disk
than more CPU.  Disk is cheap; CPU is not.  Also, this dataset is
meant for an OLAP situation, which means that it will change
infrequently but be searched far more often.

6. Final Thoughts

When I told my fellow developers that I was thinking of using RDF,
and that I had found a problem for which RDF was the solution, they
were amazed.  The consensus among my developer friends is that the
Semantic Web is a solution looking for a problem.  Also, the mass of
impenetrable specifications that back it (does the normal middle-33%
SQL developer need to read the SQL specification?) gives the
impression that the Semantic Web will always be like cold fusion:
just ten more years away.

I would like to say in closing that Turtle/TriG solves my problem and
it does so very well.  I very much appreciate all the work people have
put into it over the years and I hope to see this discussion bear
positive fruit.  I would be happy to answer any questions about my
project or how I use RDF/SPARQL/etc.

All the best,
Christopher Yocum

Received on Sunday, 9 December 2018 13:49:28 UTC