Comments on SHACL Editors Draft of 29 April from Thomas Baker on 2016-05-01 (public-rdf-shapes@w3.org from May 2016)

From: Thomas Baker <tom@tombaker.org>
Date: Sun, 1 May 2016 16:40:21 +0200
To: RDF Shapes <public-rdf-shapes@w3.org>
Message-ID: <20160501144021.GA58301@Cicero.SpeedportEntry209012601050045>

Comments on

Shapes Constraint Language (SHACL)
Editors Draft 29 April 2016
http://w3c.github.io/data-shapes/shacl/

Some context: I have followed this activity since participating in the workshop
on RDF validation in 2013 [1]. The activity seemed like it might achieve the
goals pursued a decade ago with the DCMI Working Draft, Description Set Profile
Constraint Language [2]. I have tried to keep up with the excellent work by
Karen Coyle, Antoine Isaac, Hugo Manguinhas, Thomas Hartmann, and others on
comparing the emerging SHACL specification to requirements that have
accumulated over the years in the Dublin Core community.

There is a lot to like in SHACL, but I must confess that each time I tried to
actually read the specification I found myself getting stuck at the same
places. I'd set it aside, assuming that the issues would shake out. Many
months later, however, I find the same sticking points, unchanged. This time I
pressed on through the introduction to Section 2.1.

These comments convey my thoughts while reading the text and end with some
suggestions. I have made no effort to catch up on discussion in the relevant
mailing lists [4,5], so please forgive me if I simply cover issues here that
are already well-understood.

Abstract

First sentence (also first sentence of Introduction):

"SHACL is a language for describing and constraining the contents of RDF
graphs"

So I ask myself: If an RDF graph is an immutable set of triples, in what
sense can it be "constrained"? If an RDF graph is a description with a
meaning determined by RDF semantics, what does it mean for that _description_
to be "described"? Surely SHACL is not meant to somehow limit the
RDF-semantic meaning of an RDF graph, which would make no sense, but then
what does "constraining" mean? Surely the specification of a
"constraint language" should start by defining "constraint".

Further on, one finds that the "constraint language" actually has nothing to
do with somehow constraining RDF graphs and everything to do with describing
an instance of the class "shape", which can be used with a process for
determining whether a given RDF graph conforms to the set of constraints
described in that shape ("validation"). In the Abstract, however, validation
is mentioned only in passing ("can be used to communicate information about
data structures... generate or validate data, or drive user interfaces").

The Abstract concludes with an unsettling reference to the "underlying
semantics" of SHACL. We already have RDF semantics. Will this document
define another?

1. Introduction

"This document defines what it means for an RDF graph... to conform to a
graph containing SHACL shapes"

An improvement over the Abstract.

1.2. SHACL example

"A shapes graph containing shape definitions and other information that can
be utilized to determine what validation is to be done"

The wording is odd. How about:

"A shapes graph, which describes a set of constraints, can be used to
determine whether a given data graph conforms to the constraints."

Up to this point, has the text actually said that SHACL shape graphs are
expressed in RDF? The Document Outline does say that examples are expressed
in Turtle syntax, which strongly implies RDF. But that SHACL shape graphs
are expressed in RDF is actually not obvious for anyone who knows that SPARQL
also expresses shape-like constructs for matching against RDF data, and that
SPARQL constructs are not themselves expressed in RDF.

(As an aside, readers of RDF 1.1 Turtle will find instances with prefixed
names in lowercase, whereas in the SHACL spec the prefixed names are in
uppercase. A sentence about the naming conventions used in this document
could make this explicit.)

Section 1.2 continues:

"ex:IssueShape... [has constraints that apply]... to a (transitive)
subclass of ex:Issue following rdfs:subClassOf triples"

Hmm - nothing in the spec has yet hinted that the process of validating a
data graph against a shape graph will _require_ additional, out-of-band
information such as schema definitions.
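
To make the concern concrete, here is a minimal sketch in Python of what
selecting the instances of ex:Issue "following rdfs:subClassOf triples" might
involve. Triples are plain tuples; the nodes ex:issue1 and ex:bug7, the class
ex:Bug, and both function names are invented for illustration and are not
taken from the spec.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def subclasses_of(graph, cls):
    """Return cls plus every class that reaches it through a chain of
    rdfs:subClassOf triples found in the given graph."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(graph, cls):
    """Return the nodes typed with cls or with any of its subclasses."""
    classes = subclasses_of(graph, cls)
    return {s for s, p, o in graph if p == RDF_TYPE and o in classes}


# Hypothetical data graph: no subclass triples of its own.
data = {
    ("ex:issue1", RDF_TYPE, "ex:Issue"),
    ("ex:bug7", RDF_TYPE, "ex:Bug"),
}
# Hypothetical schema triple that has to be supplied from somewhere.
schema = {("ex:Bug", SUBCLASS_OF, "ex:Issue")}

without_schema = instances_of(data, "ex:Issue")
with_schema = instances_of(data | schema, "ex:Issue")
```

Under these assumptions, ex:bug7 is picked up only when the schema triple is
available alongside the data, which is exactly the dependency that the text
has not yet mentioned at this point.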

1.3. Relationship between SHACL and RDF

"SHACL uses RDF and RDFS vocabulary... and concepts... [but] SHACL does not
always use this vocabulary or these concepts in exactly the way that they
are formally defined in RDF and RDFS."

Hang on, so SHACL does _not_ use RDF/S vocabulary as defined by the RDF/S
specs?? It is jarring to read this in a W3C rec-track specification. How is
this not a show-stopper?

One then learns that SHACL validation is about more than matching an
immutable data graph against an immutable shapes graph. Apparently it
involves the prior creation of an _expanded_ data graph through selective
materialization of inferred triples.

The notion of "SHACL processors" having (selectively) to support inferencing
goes far beyond just defining a vocabulary for describing a shape and a
process for evaluating that shape against a data graph. It implies a
software application with SHACL-specific features and an inferencing style
that is SHACL-specific -- both of which, to my way of thinking, should be
completely orthogonal to the language specification, which could quite
reasonably focus on just the vocabulary and validation algorithm.

If, as the spec points out, "SHACL implementations may operate on RDF graphs
that include entailments", couldn't the SHACL spec be helpfully simplified by
leaving the materialization of inferred triples out of scope entirely -- as
something done in a pre-processing phase, perhaps according to a few
well-known patterns as described in a separate specification?
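
As a rough sketch of the separation I have in mind, the Python below keeps
materialization in its own pre-processing function and leaves the validation
step free of inferencing. Everything here is an assumption made for
illustration: the tuple representation of triples, the function names, the
property ex:submittedBy, and the toy "must have a value" check, which is far
simpler than anything in the spec.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def materialize_types(graph):
    """Pre-processing step: return a NEW frozen graph with inferred
    rdf:type triples added, in the style of RDFS rule rdfs9
    (x type C, C subClassOf D  =>  x type D). The input is not modified."""
    expanded = set(graph)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(expanded):
            if p != RDF_TYPE:
                continue
            for c, q, d in list(expanded):
                if q == SUBCLASS_OF and c == o:
                    inferred = (s, RDF_TYPE, d)
                    if inferred not in expanded:
                        expanded.add(inferred)
                        changed = True
    return frozenset(expanded)


def conforms(data_graph, target_class, required_property):
    """Toy validation step with no inferencing: every node explicitly
    typed target_class must have a value for required_property."""
    focus = {s for s, p, o in data_graph if p == RDF_TYPE and o == target_class}
    described = {s for s, p, o in data_graph if p == required_property}
    return focus <= described


# Hypothetical input graph.
raw = frozenset({
    ("ex:Bug", SUBCLASS_OF, "ex:Issue"),
    ("ex:issue1", RDF_TYPE, "ex:Issue"),
    ("ex:issue1", "ex:submittedBy", "ex:alice"),
    ("ex:bug7", RDF_TYPE, "ex:Bug"),
})
expanded = materialize_types(raw)

raw_ok = conforms(raw, "ex:Issue", "ex:submittedBy")
expanded_ok = conforms(expanded, "ex:Issue", "ex:submittedBy")
```

The validation function is the same in both calls; only the graph handed to
it differs. Whether ex:bug7 counts as an ex:Issue is decided entirely before
validation starts, so the result depends on which immutable graph one chooses
to validate, not on any inferencing behavior inside the validator.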

The section ends with very puzzling definitions for "subclass", "type", and
"instance" -- "A node is an instance of a class if one of its types is the
given class"?? -- but I press on, hoping the next section will bring some
clarity...

2. Shapes

The first paragraph says:

"Shape scopes define the selection criteria"

but then Figure 1 says:

"Scope selects focus nodes"

If a shape is just a graph (or part of a shapes graph), then surely that
graph cannot actually perform an action, like "selects", as if executed like a
Java method. Figure 1 also talks about filter shapes that "refine" or
"eliminate" and constraints that "produce". Talking about graphs as agents
is deeply confusing.

"Class-based scopes define the scope as the set of all instances of a
class."

Okay, yes... classes have extensions... after all, RDF Schema 1.1 says that
"Associated with each class is a set, called the class extension of the
class, which is the set of the instances of the class" [3]. But what does
this have to do with defining the set of focus nodes for a shape? The scope
of a shape is _not_ a specific data graph but the set of all instances of a
class in the world?

I stop reading.

Summary and suggestions

The spec looks quite nice on the surface but the explanation is conceptually
muddled. Would it not be simpler and clearer to define a SHACL where, to
paraphrase the 2008 DSP specification [2], "the fundamental usage model for a
[shape] is to examine whether a [data graph] matches the [shape]"? Everything
else could be out of scope. Some suggestions:

1. Define "constraint" up-front.

2. If a shape is described in RDF, say so early on, then avoid implying that a
SHACL shape is based on any semantics other than RDF semantics.

3. Come up with better names than 'subclass', 'superclass', 'type', and
'instance' for whatever it is that is being described. Anyone familiar with
classes and instances in RDF -- or classes and instances in OOP -- will
surely be led astray by yet another completely different re-use of
terminology that only _seems_ familiar. Repurposing these well-worn terms
actually gets in the way of understanding.

4. Move anything about materializing additional triples as a pre-processing
step -- even sub-class relationships -- into a separate document specifically
for implementation advice, such as a primer. In other words, split out all
references to inferencing from the SHACL language itself. To keep the language
specification clear, an immutable data graph need only be validated against an
immutable shape graph, full stop. Anything else can be moved elsewhere.

5. Move Sections 6 through 11 into a separate document or primer. Far better
to put this into its own shorter, focused specification than to tack it onto a
specification that is already much too long -- 108 pages, had I printed it out.

Simpler, clearer specs stand a correspondingly greater chance of actually being
read -- and used.

Tom

[1] https://www.w3.org/blog/SW/2013/10/04/w3c-workshop-report-rdf-validation-practical-assurances-for-quality-rdf-data/
[2] http://dublincore.org/documents/dc-dsp/
[3] https://www.w3.org/TR/rdf-schema/#ch_classes
[4] https://lists.w3.org/Archives/Public/public-rdf-shapes/
[5] https://lists.w3.org/Archives/Public/public-data-shapes-wg/

--
Tom Baker <tom@tombaker.org>

Received on Sunday, 1 May 2016 14:40:59 UTC