More comments on SHACL Editor's Draft of 29 April from Thomas Baker on 2016-05-05 (public-rdf-shapes@w3.org from May 2016)

From: Thomas Baker <tom@tombaker.org>
Date: Thu, 5 May 2016 10:15:11 +0200
To: RDF Shapes <public-rdf-shapes@w3.org>
Message-ID: <20160505081511.GA29150@Cicero.SpeedportEntry209012601050045>
More comments on SHACL, Editor's Draft 29 April 2016
http://w3c.github.io/data-shapes/shacl/

I posted a previous batch of comments on 1 May [1] but have learned a few
things since then.  I remain unsure what the specification means in some
respects, so the following reflects what I infer it to mean, with some
suggestions on how the spec could help the reader by articulating some key
assumptions up front.

1. SHACL provides a vocabulary for describing shapes and a simple 
   algorithm for "validating" an arbitrary graph of RDF data (Data Graph)
   against an RDF description of data shapes (Shapes Graph).

2. The SHACL validation algorithm checks the conformance of triples in 
   the Data Graph to "constraints" described in the Shapes Graph.

3. Validation evaluates a target Data Graph at the level of its abstract 
   syntax.  In accordance with RDF 1.1 Concepts and Abstract Syntax [2], 
   RDF abstract syntax consists of triples, that is, subject and object nodes 
   connected by predicates, where nodes may be IRIs, blank nodes, or 
   datatyped literals. The SHACL spec's use of "focus nodes" fits with 
   the use of "node" in rdf11-concepts [2].
   
4. In accordance with the Closed-World Assumption (CWA), the validation 
   algorithm limits itself to matching constraint patterns, as described in 
   the Shapes Graph, against the abstract-syntactic components of the triples
   actually asserted in the target Data Graph, with no further interpretation of
   the Data Graph or inferencing based on its formal semantics.

5. A Shapes Graph is expressed in RDF.  Even though the primary use of 
   a Shapes Graph is for CWA-based validation, it should be noted that the
   semantics of the Shapes Graph itself, as of any other expression in RDF,
   follows the Open-World Assumption (OWA).  
   
6. The inherently open-world meaning of the Shapes Graph, however, does not
   seem to be of practical consequence for its use in CWA-based validation,
   unless, perhaps, one were to construct or augment a Shapes Graph with inferred
   triples.  One caveat: a Shapes Graph could potentially pollute "real" data
   by adding meaning that is not intended to be interpreted as real data, as
   happens, e.g., when one follows the practical hack of using a class IRI to
   name a shape (Section 2.1.2.1, "Implicit Class Scopes").

7. A Shapes Graph may specify a potential set of "focus nodes" as the "scope"
   of validation in the Data Graph.  A Shapes Graph may also specify a potential 
   set of "focus nodes" to be dropped out of the validation scope ("filtered").
   Potential focus nodes may or may not match actual nodes in the Data Graph.
   
8. Validation based on closed-world assumptions applies to the relationship
   between constraints (as described in the Shapes Graph) and triples in the
   Data Graph viewed at the level of their RDF abstract-syntactic components
   (e.g., the "focus nodes").
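
As a rough illustration of points 4, 7, and 8, here is a minimal sketch of
closed-world checking over asserted triples only.  It is not the SHACL
algorithm: the tuple representation of triples, the `ex:` names, and the
functions `focus_nodes` and `validate_min_count` are all hypothetical, and
literals are simplified to plain strings.

```python
# Hypothetical, simplified model: a graph is a set of
# (subject, predicate, object) tuples; literals are plain strings.
RDF_TYPE = "rdf:type"

data_graph = {
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
}


def focus_nodes(graph, scope_class):
    """Select nodes that have an explicitly asserted rdf:type triple.

    No inferencing is done: only triples present in the graph count.
    """
    return {s for (s, p, o) in graph if p == RDF_TYPE and o == scope_class}


def validate_min_count(graph, scope_class, predicate, min_count):
    """Return the focus nodes with fewer than min_count asserted values."""
    violations = []
    for node in sorted(focus_nodes(graph, scope_class)):
        count = sum(1 for (s, p, o) in graph if s == node and p == predicate)
        if count < min_count:
            violations.append(node)
    return violations


violations = validate_min_count(data_graph, "ex:Person", "ex:name", 1)
print(violations)
```

Under the Open-World Assumption, the absence of an `ex:name` triple for
`ex:bob` would merely mean that the name is unknown; the closed-world check
above reports it as a violation.  A scope that matches no nodes in the Data
Graph simply yields no focus nodes and therefore no violations.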

Note: An earlier iteration of these comments was posted on the DC-ARCHITECTURE
list [3].  The resulting thread drew out some additional comments and insights
that could be of interest to members of Data Shapes.
   
[1] https://lists.w3.org/Archives/Public/public-rdf-shapes/2016May/0000.html
[2] https://www.w3.org/TR/rdf11-concepts/
[3] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1605&L=dc-architecture&P=3148

----------------------------------------------------------------------
Discussion

Because SHACL is expressed in RDF, like it or not, a Shapes Graph is
interpreted according to OWA.  Since the design decision was made to express
the Shapes Graph in RDF, and not in a completely different syntax -- as in the
case of SPARQL or, for that matter, DCMI's DSP -- the native OWA interpretation
of a Shapes Graph cannot be papered over, ignored, or otherwise contradicted.

The design choice of expressing Shapes Graphs in RDF does somewhat limit SHACL,
in certain respects, compared to SPARQL or DSP.  In SPARQL, for example,
`rdfs:subClassOf*` is interpreted as referring to the transitive closure of
`rdfs:subClassOf`; the asterisk is a sort of syntactic sugar, a convenience
notation, that triggers specific inferences.  As there is no equivalent way to
express `rdfs:subClassOf*` in RDFS, there is no way to say that
`rdfs:subClassOf` actually _means_ the transitive closure without, in effect, 
arbitrarily overriding its global semantics.
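
To make the distinction concrete, here is a minimal sketch of what a path
such as `rdfs:subClassOf*` computes at query time.  The tuple representation,
the `ex:` class names, and the function `superclasses_star` are hypothetical.
Note that in SPARQL 1.1 property paths the asterisk means "zero or more"
steps, so the result also includes the starting class itself.

```python
SUBCLASS_OF = "rdfs:subClassOf"

# Only the direct subclass links are asserted in the graph.
schema_graph = {
    ("ex:Student", SUBCLASS_OF, "ex:Person"),
    ("ex:Person", SUBCLASS_OF, "ex:Agent"),
}


def superclasses_star(graph, cls):
    """Classes reachable from cls via zero or more rdfs:subClassOf steps."""
    seen = {cls}          # zero steps: the class itself
    stack = [cls]
    while stack:
        current = stack.pop()
        for (s, p, o) in graph:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


reachable = superclasses_star(schema_graph, "ex:Student")
print(sorted(reachable))
```

The traversal derives that `ex:Student` reaches `ex:Agent` without any such
triple being added to the graph: the closure belongs to the query, not to the
data or to the meaning of `rdfs:subClassOf`.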

Perhaps this is why the SHACL spec says that "SHACL does not always use this
vocabulary or these concepts in exactly the way that they are formally defined
in RDF and RDFS" (Section 1.3) -- a notion which gratuitously sets SHACL at
odds with W3C Semantic Web standards.

One could perhaps sidestep the issue by dropping _all_ consideration of
inferencing from the normative SHACL specification; saying only that there may
be a need for inferencing in a pre-processing phase; then discussing those
pre-processing options in a separate guidance document.  Putting inferencing
out of scope would make the SHACL spec simpler, clearer, and shorter.

Abstract syntax issues

Because SHACL is viewing RDF data graphs through a closed-world lens, the
meaning of the graph is beside the point -- just as the meaning of a graph is
beside the point with SPARQL.  A Data Graph is validated against a SHACL Shapes
Graph at the level of the abstract syntax of the Data Graph.  According to RDF
1.1 Concepts and Abstract Syntax, RDF graphs are sets of
subject-predicate-object triples, where the elements may be IRIs, blank nodes,
or datatyped literals [1].  

Note that at the level of their abstract syntax, RDF Graphs have no "classes"
and no "instances"!  A search in rdf11-concepts [1] for the words "instance" or
"class" will find no mention of either one, anywhere in the spec.  

Confusingly, the SHACL spec makes reference to "instances", "classes", or
"instances of classes" in the Data Graph, viewing the Data Graph through a
semantic lens.  Coining a new SHACL-specific notion of "instance" (and "class",
etc.) next to the existing notions of RDF "instance" and OO "instance" makes
SHACL particularly hard to grok.  At the end of Section 1.3, for example, the
definition for "instance" starts off by saying:

  "A node is an instance of a class..."

which I take to mean:

  "A node [in the Data Graph] is an instance of a class..."

By comparison, the SPARQL spec specifies a SPARQL-specific syntax to express
triple patterns composed of variables and RDF-abstract-syntactic things such as
IRIs and Literals.  SPARQL itself does not "understand" that something is a
class or an instance -- it simply supports the formation of triple patterns and
leaves it to Primers and other usage guides to express queries, informally, in
semantic terms (e.g., "What data is stored about instances of class X?").  This
separation of concerns makes the SPARQL specification much easier to
understand.  It is worth noting that DCMI's Description Set Profile Constraint
Language [3] also defines its own syntax.
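
A minimal sketch of this purely syntactic view of triple patterns follows.
It is not SPARQL itself: the tuple representation, the `ex:` names, and the
function `match_pattern` are hypothetical, and terms beginning with `?` are
treated as variables by convention.

```python
people_graph = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:bob"),
}


def match_pattern(graph, pattern):
    """Yield variable bindings for a single triple pattern.

    Matching compares terms position by position.  Nothing here knows
    that "rdf:type" is about classes or instances.
    """
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


persons = sorted(b["?x"] for b in match_pattern(
    people_graph, ("?x", "rdf:type", "ex:Person")))
self_known = [b["?x"] for b in match_pattern(
    people_graph, ("?x", "ex:knows", "?x"))]
print(persons, self_known)
```

Reading the first pattern as "find the instances of class ex:Person" is an
informal gloss supplied by the reader; the matcher only sees three term
positions.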

As an aside, it is unclear to me why it is even necessary for the SHACL spec to
redefine an already-loaded, overdetermined term such as "class" to refer to a
set of what one might call "type-matched focus nodes".   If the intention is to
make SHACL more understandable to people who are unfamiliar with RDF, this
should be done not in the formal spec but in a primer or tutorial, where an
explanation can be customized for a specific audience, such as programmers.

A year ago, it was proposed that an abstract syntax be developed for SHACL [4].
There was little discussion and the issue remains open but neglected.  Since
SHACL is natively expressed in RDF, its abstract syntax is in effect the
abstract syntax for RDF.  It is not clear to me whether this is actually a good
idea.  If a Shapes Graph only exists to be used in a closed-world process
validating a Data Graph, what is the specific advantage of expressing it in
RDF?  Might a proper abstract syntax for SHACL, based on its own BNF, etc.,
further focus and clarify the SHACL language?  On the other hand, I see no
specific reasons why SHACL should _not_ use RDF to express shapes graphs as it
does -- provided that the spec (or a primer) point out any potential pitfalls,
as touched on above.
  
[1] https://www.w3.org/TR/rdf11-concepts/
[2] https://www.w3.org/TR/rdf11-concepts/#data-model
[3] http://dublincore.org/documents/dc-dsp/
[4] https://www.w3.org/2014/data-shapes/track/issues/52


-- 
Tom Baker <tom@tombaker.org>
Received on Thursday, 5 May 2016 08:24:21 UTC