Re: Toward easier RDF: a proposal from Paul Tyson on 2018-11-25 (semantic-web@w3.org from November 2018)

From: Paul Tyson <phtyson@sbcglobal.net>
Date: Sun, 25 Nov 2018 14:30:25 -0600
To: Nathan Rixham <nathan@webr3.org>
Cc: David Booth <david@dbooth.org>, W3C Semantic Web IG <semantic-web@w3.org>, Dan Brickley <danbri@google.com>, "Sean B. Palmer" <sean@miscoranda.com>, Olaf Hartig <olaf.hartig@liu.se>, Axel Polleres <axel@polleres.net>
Message-ID: <1543177825.1931.51.camel@sbcglobal.net>
On Thu, 2018-11-22 at 00:51 +0000, Nathan Rixham wrote:
> Remove everything you can from the full set of specs, until you can
> implement a working version of the full stack in roughly a week, and
> you'll have something 100% of us can use.

But that would perhaps cover only 20% of what we need. The specs are big
not because of bloat, but because they are precise and complete with
respect to their domain of application.

> 
> 
> Right now the stack of specifications is so big that not one person
> here fully understands them all, let alone uses. The concept of rdf
> and related techs are simple, the specs are frankly impossible.

This has not been my experience (but I won't dodge David's
characterization as a non-average user). I no longer actively develop in
this area, and so have some basis for comparison. In the past I
developed 2 nontrivial production applications, each of which used a
different set of components from the RDF technology stack. All told:
RDF, RDFS, OWL, SKOS, SPARQL, RDB2RDF, and RIF. I learned these, as
needed, by reading the specs. I relied heavily on the XML technology
stack, especially XSLT, to stitch everything together, as well as SQL to
get at the legacy data sources. I also developed some interesting
prototype applications with the same toolset.

These applications have 3 main beneficial features, which I won't
elaborate on, but which are the direct result of being built on these
standards:
1. They are form-fit to the domain. They were built not by modifying the
legacy enterprise data models, but simply by reincarnating them in more
useful forms.
2. They just work, and are impervious to any framework versioning
escalator. The standards don't change.
3. They are extensible at several points: the model can be extended, the
presentation can be changed, alternate processing flows can be easily
implemented.

Now my day job has me maintaining and extending a large web-based
document delivery application. It uses a popular JavaScript MVC
framework and some HTML templating languages. I learned those by reading
the tutorials and examples, and trial-and-error. But here's the big
difference between learning a framework and learning a standard: When
you learn a standard, you profit from many person-years of expert
experience, consideration, debate, analysis, and review. When you learn
a framework, you learn what seemed like a good idea at the time to a few
developers working on a particular problem set within some limited
context. To be sure, there are magnificent frameworks and libraries
built on solid principles, but none has had or will have the staying
power of the XML or RDF technology stacks.

That said, I agree that in some particulars there is room for
improvement. But there is also an irreducible complexity in this problem
domain that cannot be eliminated by simplifying the syntax and pushing
some thorny issues aside.

Regards,
--Paul

> 
> On Wed, 21 Nov 2018, 22:45 David Booth <david@dbooth.org> wrote:
> 
>         On 10/18/2018 05:09 PM, Dan Brickley wrote:
>          > There are serious frustrations that come with trying to use
>          > RDF (and RDFS/OWL/SPARQL, JSON-LD, RDFa, Turtle, N-Triples
>          > et al.), . . .  [ . . . ] If there is to be value in having
>          > continued SW/RDF groups around here, it's much more likely to
>          > be around practical collaboration to make RDF less annoying
>          > to work with, . . . .
>         
>         Perfect lead-in!  For many months I've been working up the
>         gumption to raise this topic on this list.  I guess now is
>         the time.  :)
>         
>         The value of RDF has been well proven, in many applications,
>         over the 20+ years since it was first created.  At the
>         same time, a painful reality has emerged: RDF is too hard for
>         *average* developers.  By "average developers" I mean those
>         in the middle 33 percent of ability. And by "RDF", I mean the
>         whole RDF ecosystem -- including SPARQL, OWL, tools, standards,
>         etc. -- everything that a developer touches when using RDF.
>         
>         For anyone who might be tempted to argue "But RDF is easy!",
>         please bear in mind that *you*, dear reader, are *not* average.
>         You are a member of an elite who grok RDF and can work around
>         its frustrations and bizarre subtleties.  And for anyone who is
>         tempted to argue that we just need to better educate the world
>         about RDF: Sorry, but no.  I and many others have been trying to
>         do exactly that for over 15 years, and it has not been enough.
>         
>         Using RDF is like programming in assembly language.
>         It is tedious, frustrating and error prone.  Somehow, we
>         need to move up to a higher, easier, more productive level.
>         One bright light in our favor is that RDF already provides a
>         very solid foundation to build upon, based on formal logic.
>         Another is that graph databases -- though not specifically
>         RDF -- are now getting substantial commercial attention.
>         
>         Difficulty of use has caused RDF to be categorized as a niche
>         technology. This is unfortunate because it limits uptake and
>         prevents RDF from being a viable choice for many use cases that
>         would otherwise be an excellent fit.  Use cases that depend
>         on broad uptake can *only* be achieved when RDF is usable by
>         *average* development teams.
>         
>         I've been puzzling over this problem for several years.  I spoke
>         about it at the US Semantic Technology Symposium (US2TS) early
>         this year[1], and Evan Wallace and I will lead a session at
>         the 2019 US2TS[2] in March to address it further.  See also
>         excellent observations by Sean Palmer[3], Dan Brickley[4]
>         and Axel Polleres et al[5].  I have collected a few ideas,
>         but I do not have complete answers.  I think it will take a
>         community effort -- and more new ideas -- to fix this problem.
>         
>         PROPOSAL:
>         To address RDF ease-of-use head-on, as a community effort.
>         
>         Guiding principles:
>         
>         1. The goal is to make RDF -- or some RDF-based successor --
>         easy enough for *average* developers (middle 33%), who are
>         new to RDF, to be consistently successful.
>         
>         2. Solutions may involve anything in the RDF ecosystem:
>         standards, tools, guidance, etc.  All options are on the
>         table.
>         
>         3. Backward compatibility is highly desirable, but *less*
>         important than ease of use.
>         
>         SPECIFIC PROBLEMS
>         
>         The rest of this message catalogs some of the biggest
>         difficulties that I have noticed in using RDF.  YMMV. They
>         are not necessarily in priority order, and there may be
>         others that I missed. One goal should be to prioritize them.
>         Some have obvious potential fixes; others don't.  I've also
>         included some potential solution ideas.  I am interested
>         to hear your feedback, as well as any other problems
>         or solution ideas that you think should be considered.
>         
>         Please MAKE A NEW SUBJECT LINE if you reply about one of the
>         specific problems below, to help organize the discussion.
>         
>         1. Tools are scattered.  How to find them?  Which to use?
>         Every team wastes time going through a similar research and
>         selection process.
>         
>         One idea: create a bundled release of RDF tools, analogous
>         to a standard LAMP stack, or Red Hat or Ubuntu; so that if
>         someone wants to use RDF all they have to do is install that
>         bundle and they're ready to go.
>         
>         2. IRI allocation.  IRIs must be allocated for almost everything
>         in RDF: things, concepts, properties, etc. -- both TBox
>         (ontology/schema) and ABox (instance data).  IRI allocation
>         is easy in theory but hard in practice!  "Cool IRIs" are
>         dereferenceable http(s) IRIs, but domain registration costs
>         money and is not permanent.  Dereferenceable IRIs require a
>         commitment that many RDF producers are not ready/able/willing
>         to make.  And even when the RDF producer is willing to use
>         dereferenceable http(s) IRIs, how exactly should those IRIs
>         be formed?  There are many possible solutions, but no standard
>         best practice.  Again every team has to figure out its own
>         path.
>         
>         3. Blank nodes.  They are an important convenience for RDF
>         authors, but they cause insidious downstream complications.
>         They have subtle, confusing semantics.  (As Nathan Rixham
>         once aptly put it, a blank node is "a name that is not
>         a name".)  Blank nodes are special second-class citizens
>         in RDF.  They cannot be used as predicates, and they are not
>         stable identifiers.  A blank node label cannot be used in
>         a follow-up SPARQL query to refer to the same node, which
>         is justifiably viewed as completely broken by RDF newbies.
>         Blank nodes also cause duplicate triples (non-lean) when the
>         same data is loaded more than once, which can easily happen
>         when data is merged from different sources.  And they cause
>         difficulties with canonicalization, described next.
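
The duplicate-triple effect described in item 3 can be sketched in a few
lines of Python. This is only an illustration under stated assumptions,
not any particular store's behavior: a graph is modeled as a set of
(subject, predicate, object) string tuples, the example.org IRIs are made
up, and the hypothetical load() helper gives each blank node label a fresh
identity on every load, because blank node labels are scoped to the
document they appear in.

```python
import itertools

# Toy model (an assumption of this sketch): a graph is a set of
# (subject, predicate, object) string tuples.
_fresh_ids = itertools.count()

def load(graph, triples):
    """Add triples to graph, relabeling blank nodes ("_:...") per load."""
    renamed = {}  # document-local label -> store-wide label, for this load only

    def term(t):
        if t.startswith("_:"):
            if t not in renamed:
                renamed[t] = f"_:b{next(_fresh_ids)}"
            return renamed[t]
        return t

    for s, p, o in triples:
        graph.add((term(s), p, term(o)))

# Hypothetical document: two triples involve a blank node, one does not.
doc = [
    ("http://example.org/alice", "http://example.org/knows", "_:x"),
    ("_:x", "http://example.org/name", '"Bob"'),
    ("http://example.org/alice", "http://example.org/age", '"42"'),
]

graph = set()
load(graph, doc)
size_after_first_load = len(graph)

load(graph, doc)  # the same document, loaded a second time
size_after_second_load = len(graph)

print(size_after_first_load, size_after_second_load)
```

The triple that uses only IRIs and literals collapses into the existing
one on the second load; the two triples that mention the blank node are
added again under a new label, so the graph is no longer lean.
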
>         
>         4. Lack of standard RDF canonicalization.  Canonicalization
>         is the ability to represent RDF in a consistent, predictable
>         serialization.  It is essential for diff and digital signatures.
>         Developers expect to be able to diff two files, and source
>         control systems rely on being able to do so.  It is easy with
>         most other data representations.  Why not RDF?  Answer: Blank
>         nodes.  Unrestricted blank nodes cause RDF canonicalization
>         to be a "hard problem", equivalent in complexity to the graph
>         isomorphism problem.[6]
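
A small Python sketch of why blank nodes are the sticking point. It
assumes a deliberately naive canonical form (one N-Triples-like line per
triple, sorted) and made-up example.org data; serialize_sorted() is a
hypothetical helper, not a real canonicalization algorithm.

```python
def serialize_sorted(triples):
    """Naive canonical form: one line per triple, de-duplicated and sorted."""
    return "\n".join(sorted({f"{s} {p} {o} ." for s, p, o in triples}))

# Two orderings of the same blank-node-free graph.
ground_a = [
    ("<http://example.org/a>", "<http://example.org/p>", '"1"'),
    ("<http://example.org/b>", "<http://example.org/p>", '"2"'),
]
ground_b = list(reversed(ground_a))

# Two isomorphic graphs that differ only in the blank node label chosen.
blank_a = [("_:x", "<http://example.org/p>", '"1"')]
blank_b = [("_:y", "<http://example.org/p>", '"1"')]

ground_same = serialize_sorted(ground_a) == serialize_sorted(ground_b)
blank_same = serialize_sorted(blank_a) == serialize_sorted(blank_b)

print(ground_same, blank_same)
```

Sorting is enough when every term is an IRI or a literal. Once blank
nodes appear, the serializer would first have to pick labels in a way
that depends only on the graph's shape, and that is where the hard part
of the problem lies.
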
>         
>         Some recent good progress on canonicalization: JSON-LD
>         https://json-ld.github.io/normalization/spec/ .  However, the
>         current JSON-LD canonicalization draft (called "normalization")
>         is focused only on the digital signatures use case, and
>         needs improvement to better address the diff use case, in
>         which small, localized graph changes should result in small,
>         localized differences in the canonicalized graph.
>         
>         5. SPARQL-friendly lists.  It is very hard[7] to query RDF
>         lists, using standard SPARQL, while returning item ordering.
>         This inability to conveniently handle such a basic data
>         construct seems brain-dead to developers who have grown to
>         take lists for granted.
>         
>         Apache Jena offers one potential (though non-standard)
>         way to ease this pain, by defining a list:index property:
>         https://jena.apache.org/documentation/query/rdf_lists.html
>         Another possibility would be to add lists as a fundamental
>         concept in RDF, as proposed by David Wood and James Leigh
>         prior to the RDF 1.1 work.[8]
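
For readers who have not looked at how an RDF list is actually stored,
here is a minimal Python sketch of the rdf:first / rdf:rest chain and of
recovering item positions by walking it in application code. The tuple
representation, the blank node labels, and the list_items() helper are
assumptions made for illustration; this is not the SPARQL solution from
[7], nor Jena's list:index.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"

def list_items(triples, head):
    """Walk an RDF collection from head and return (position, item) pairs."""
    lookup = {(s, p): o for s, p, o in triples}
    items, node, seen = [], head, set()
    while node != NIL:
        if node in seen:
            raise ValueError("malformed list: cycle detected")
        seen.add(node)
        items.append((len(items), lookup[(node, FIRST)]))
        node = lookup[(node, REST)]
    return items

# The list ("a" "b" "c"), with its triples deliberately out of order:
# triples in a graph have no order of their own.
triples = [
    ("_:c2", FIRST, '"b"'), ("_:c2", REST, "_:c3"),
    ("_:c1", FIRST, '"a"'), ("_:c1", REST, "_:c2"),
    ("_:c3", FIRST, '"c"'), ("_:c3", REST, NIL),
]

ordered = list_items(triples, "_:c1")
print(ordered)
```

The ordering lives only in the chain of rdf:rest links, so a position is
something you compute by traversal rather than something you can match
with a single triple pattern.
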
>         
>         6. Standardized n-ary relations (and property graphs).  Since
>         RDF natively supports only binary relations, relations between
>         more than two entities must be encoded using groups of triples.
>         A W3C Working Group Note[9] describes some common patterns,
>         but no standard has been defined for them.  As a result,
>         tools cannot reliably recognize and act on these groups of
>         triples as the atomic units that they are intended to represent.
>         
>         This deficiency has greater significance than it may appear,
>         because it is subtly related to the blank node problem:
>         a major use of blank nodes is to encode n-ary relations.
>         In other words, n-ary relations are a major contributor to
>         the blank node problem.
>         
>         Furthermore, standardized n-ary relations could also enable
>         direct support for property graphs[10], which have emerged as
>         a popular and convenient way to represent graph data, led by
>         Neo4J.[11] Property graphs add the ability to attach attributes
>         to relationships, which can be viewed as a special case of
>         n-ary relations.  Olaf Hartig and Bryan Thompson have proposed
>         conventions for adding property graph support to RDF.[12]
>         
>         7. Literals as subjects.  RDF should allow "anyone to say
>         anything about anything", but RDF does not currently allow
>         literals as subjects!  (One work-around is to use -- you guessed
>         it -- a blank node, which in turn is asserted to be owl:sameAs
>         the literal.)  This deficiency may seem unimportant relative
>         to other RDF difficulties, but it is a peculiar anomaly that
>         may have greater impact than we realize.  Imagine an *average*
>         developer, new to RDF, who unknowingly violates this rule and
>         is puzzled when it doesn't work.  Negative experiences like
>         that drive people away.  Even more insidiously, imagine this
>         developer tries to CONSTRUCT triples using a SPARQL query,
>         and some of those triples happen to have literals in the
>         subject position.  Per the SPARQL standard, those triples will
>         be silently eliminated from the results,[13] which could lead
>         to silently producing wrong answers from the application --
>         the worst of all possible bugs.
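
The silent-drop behavior can be imitated in Python to show what the
developer does not get to see. This is a sketch under assumptions: terms
are plain strings, a leading double quote marks a literal, the data is
made up, and construct() is a hypothetical stand-in for the filtering
step, not a SPARQL engine.

```python
def is_literal(term):
    # Convention of this sketch only: literals start with a double quote.
    return term.startswith('"')

def construct(candidate_triples):
    """Split candidates into legal RDF triples and the ones that get discarded.

    A real CONSTRUCT returns only the first list; the second is shown here
    to make the loss visible.
    """
    kept, dropped = [], []
    for triple in candidate_triples:
        subject, predicate, _ = triple
        if is_literal(subject) or is_literal(predicate):
            dropped.append(triple)
        else:
            kept.append(triple)
    return kept, dropped

candidates = [
    ("http://example.org/doc1", "http://example.org/title", '"RDF Primer"'),
    ('"RDF Primer"', "http://example.org/language", '"en"'),  # literal subject
]

kept, dropped = construct(candidates)
print(len(kept), len(dropped))
```

An application that counted or logged the dropped list would at least
turn a silent wrong answer into a visible one.
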
>         
>         8. Lack of a standard rules language.  This is a big one.
>         Inference is fundamental to the value proposition of RDF,
>         and almost every application needs to perform some kind
>         of application-specific inference.  ("Inference" is used
>         broadly herein to mean any rule or procedure that produces new
>         assertions from existing assertions -- not just conventional
>         inference engines or rules languages.)  But paradoxically,
>         we still do not have a *standard* RDF rules language.
>         (See also Sean Palmer's apt observations about N3 rules.[14])
>         Furthermore, applications often need to perform custom
>         "inferences" (or data transformations) that are not convenient
>         to express in available (non-standard) rules languages, such
>         as RDF data transformations that are needed when merging data
>         from independently developed sources having different data
>         models and vocabularies.  And merging independently developed
>         data is the *most* fundamental use case of the Semantic Web.
>         
>         One possibility for addressing this need might be to embed
>         RDF in a full-fledged programming language, so that complex
>         inference rules can be expressed using the full power and
>         convenience of that programming language.  Another possibility
>         might be to provide a convenient, standard way to bind custom
>         inference rules to functions defined in a programming language.
>         A third possibility might be to standardize a sufficiently
>         powerful rules language.
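
The second possibility, binding rules to ordinary functions, might look
roughly like this in Python. Everything here is an assumption made for
illustration: the example.org vocabulary, the tuple representation, and
the infer() driver, which simply re-applies each rule function until no
new triples appear.

```python
EX = "http://example.org/"  # hypothetical namespace

def parent_implies_ancestor(graph):
    """Rule: (a parentOf b) => (a ancestorOf b)."""
    for s, p, o in graph:
        if p == EX + "parentOf":
            yield (s, EX + "ancestorOf", o)

def ancestor_is_transitive(graph):
    """Rule: (a ancestorOf b), (b ancestorOf c) => (a ancestorOf c)."""
    pairs = [(s, o) for s, p, o in graph if p == EX + "ancestorOf"]
    for a, b in pairs:
        for b2, c in pairs:
            if b == b2:
                yield (a, EX + "ancestorOf", c)

def infer(triples, rules):
    """Apply every rule repeatedly until no rule yields a new triple."""
    graph = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived.update(rule(graph))
        if derived <= graph:
            return graph
        graph |= derived

facts = [
    (EX + "a", EX + "parentOf", EX + "b"),
    (EX + "b", EX + "parentOf", EX + "c"),
    (EX + "c", EX + "parentOf", EX + "d"),
]

closure = infer(facts, [parent_implies_ancestor, ancestor_is_transitive])
print(len(closure))
```

A rule here is just a function from a graph to new triples, so it can do
anything the host language can do. The price is that nothing about such
rules is portable or standard, and the naive fixpoint loop re-scans the
whole graph on every pass.
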
>         
>         However, see also some excellent cautionary comments from Jesus
>         Barras (Neo4J) and MarkLogic on inference: "No one likes rules
>         engines --> horrible to debug / performance . . . Reasoning
>         with ontology languages quickly gets intractable/undecidable"
>         and "Inference is expensive. When considering it, you should:
>         1) run it over as small a dataset as possible 2) use only the
>         rules you need 3) consider alternatives."[15]
>         
>         9. Namespace proliferation.  It's hard to manage all the
>         namespaces involved in using RDF: FOAF, SKOS, DC and all the
>         hundreds of specialized namespaces that are encountered when
>         using external RDF.  Namespaces can help organize IRIs into
>         categories (typically based on the IRI's origin), but this
>         fact is nowhere recognized in official RDF specs.  Indeed,
>         the official mantra is that IRIs are opaque, and there are
>         very important design reasons for opacity.[16]  But there is
>         a cost: RDF is stuck in a flat, global naming space analogous
>         to global variables of 1960s programming languages.  Somehow,
>         modern programming languages deal with namespaces much more
>         conveniently than RDF does.  Perhaps we can learn from them,
>         without undermining the Web's design principles.
>         
>         Related issue: the RDF model does not retain namespace info.
>         As such, namespaces are often lost when tools process RDF.
>         One partial solution might be to standardize RDF triples that
>         capture serialization-related information, such as namespaces,
>         and have tools retain them in a separate graph.
>         
>         10. IRI reuse and synonyms.  In theory, RDF authors should reuse
>         existing IRIs, rather than minting their own.  But this makes
>         for messy RDF and increases the up-front burden on developers.
>         Consider a typical RDF project that integrates data from
>         multiple sources, and needs to connect that data into its own
>         vocabulary.  The resulting data involves both the normalized
>         vocabulary and the non-normalized source vocabularies,
>         intermixed.  The developers might be happy to adopt existing
>         concepts like foaf:name (for a person's name) and dc:title (for
>         a document title) into the project's normalized vocabulary.
>         But by using those existing IRIs instead of minting their
>         own IRIs in their own namespace (such as myapp:name and
>         myapp:title), it becomes hard to distinguish IRIs of the
>         normalized vocabulary from IRIs of the non-normalized source
>         vocabularies.
>         
>         Ideally a project should be able to use its own preferred names
>         (and namespaces), like myapp:name and myapp:title, while still
>         tying those names to existing external IRIs, such as foaf:name
>         and dc:title.
>         
>         owl:sameAs is not great for this.  It is too heavyweight
>         for simple synonyms, and it is only for OWL individuals --
>         not classes.  Furthermore, it provides no way to indicate
>         which IRI is locally preferred.  It would be good to have a
>         simple standard way to rename IRIs or define IRI synonyms.
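
Absent such a standard, a project can keep its own synonym table and
rewrite data at the boundary. A minimal Python sketch follows; the myapp
namespace IRI, the PREFERRED table, and to_preferred() are hypothetical,
and only the FOAF and Dublin Core namespace IRIs are the published ones.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
MYAPP = "http://example.org/myapp/"  # hypothetical project namespace

# External IRI -> locally preferred IRI.
PREFERRED = {
    FOAF + "name": MYAPP + "name",
    DC + "title": MYAPP + "title",
}

def to_preferred(triples, synonyms=PREFERRED):
    """Rewrite every term that has a locally preferred synonym."""
    return [tuple(synonyms.get(term, term) for term in triple)
            for triple in triples]

incoming = [
    ("http://example.org/people/1", FOAF + "name", '"Alice"'),
    ("http://example.org/docs/7", DC + "title", '"Report"'),
    ("http://example.org/docs/7", "http://example.org/other/pages", '"12"'),
]

normalized = to_preferred(incoming)
print(normalized[0][1], normalized[1][1])
```

The table records which IRI is locally preferred, which is exactly the
information owl:sameAs cannot express, but it lives outside the data and
no other tool knows about it.
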
>         
>                                    - - - -
>         
>         Please USE A DIFFERENT SUBJECT LINE if you reply about a
>         specific problem/idea listed above, as opposed to replying
>         about the overall proposal of addressing RDF ease-of-use as
>         a community effort.  As always, comments/suggestions/ideas
>         are welcome.
>         
>         Thanks!
>         David Booth
>         
>         References:
>         
>         1. "Toward Easier RDF", David Booth, slides from 2018 US
>         Semantic Technology Symposium:
>         https://goo.gl/H2vBYi
>         
>         2. US Semantic Technology Symposium (US2TS):
>         http://www.us2ts.org/
>         
>         3. "What happened to the Semantic
>         Web?" (general comments), Sean Palmer:
>         https://lists.w3.org/Archives/Public/semantic-web/2017Oct/0024.html
>         
>         4. "Semantic Web Interest Group now closed",
>         "RDF(-DEV), back to the future", Dan Brickley:
>         https://lists.w3.org/Archives/Public/semantic-web/2018Oct/0086.html
>         https://lists.w3.org/Archives/Public/semantic-web/2018Oct/0052.html
>         
>         5. "A More Decentralized Vision for Linked Data", Axel Polleres,
>         Maulik R. Kamdar, Javier D. Fernandez, Tania Tudorache, and
>         Mark A. Musen: https://openreview.net/pdf?id=H1lS_g81gX
>         
>         6. "Signing RDF Graphs", Jeremy Carroll
>         http://www.hpl.hp.com/techreports/2003/HPL-2003-142.pdf
>         
>         7. "Is it possible to get the position of an element
>         in an RDF Collection in SPARQL?", see Joshua
>         Taylor's answer, "A Pure SPARQL 1.1 Solution":
>         https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql
>         
>         8. "An Ordered RDF List", David Wood and James Leigh:
>         https://www.w3.org/2009/12/rdf-ws/papers/ws14
>         
>         9. "Defining N-ary Relations on the Semantic Web", W3C Working Group:
>         https://www.w3.org/TR/swbp-n-aryRelations/
>         
>         10. Property Graph, Wikipedia:
>         https://en.wikipedia.org/wiki/Graph_database#Labeled-Property_Graph
>         
>         11. DB-Engines Ranking of Graph DBMS:
>         https://db-engines.com/en/ranking/graph+dbms
>         
>         12. "Standards for storing RDF/OWL in a property graph?", Olaf Hartig:
>         https://lists.w3.org/Archives/Public/semantic-web/2018Apr/0030.html
>         
>         13. "SPARQL 1.1 Query Language: CONSTRUCT":
>         https://www.w3.org/TR/sparql11-query/#construct
>         
>         14. "What happened to the Semantic
>         Web?" (SPARQL comments), Sean Palmer:
>         https://lists.w3.org/Archives/Public/semantic-web/2017Oct/0045.html
>         https://lists.w3.org/Archives/Public/semantic-web/2017Oct/0059.html
>         
>         15. "Debunking some 'RDF vs. Property Graph' Alternative Facts",
>         Jesus Barras, slides 34 and 35:
>         https://www.slideshare.net/neo4j/graphconnect-europe-2017-debunking-some-rdf-vs-property-graph-alternative-facts-neo4j
>         
>         16. "Universal Resource Identifiers: The Opacity Axiom", Tim Berners-Lee:
>         https://www.w3.org/DesignIssues/Axioms.html#opaque
>         
>         17. "Notation3 (N3): A readable RDF syntax", W3C Team Submission,
>         Tim Berners-Lee and Dan Connolly:
>         https://www.w3.org/TeamSubmission/n3/
>         
>
Received on Sunday, 25 November 2018 20:31:00 UTC