Comments on SHACL Editors Draft of 29 April

Comments on 

Shapes Constraint Language (SHACL)
Editors Draft 29 April 2016
http://w3c.github.io/data-shapes/shacl/

Some context: I have followed this activity since participating in the workshop
on RDF validation in 2013 [1].  The activity seemed like it might achieve the
goals pursued a decade ago with the DCMI Working Draft, Description Set Profile
Constraint Language [2].  I have tried to keep up with the excellent work by
Karen Coyle, Antoine Isaac, Hugo Manguinhas, Thomas Hartmann, and others on
comparing the emerging SHACL specification to requirements that have
accumulated over the years in the Dublin Core community.

There is a lot to like in SHACL, but I must confess that each time I tried to
actually read the specification I found myself getting stuck at the same
places.  I'd set it aside, assuming that the issues would shake out.  Many
months later, however, I find the same sticking points, unchanged.  This time I
pressed on through the introduction to Section 2.1.  

These comments convey my thoughts while reading the text and end with some
suggestions.  I have made no effort to catch up on discussion in the relevant
mailing lists [4,5], so please forgive me if I simply cover issues here that
are already well-understood.

Abstract

  First sentence (also first sentence of Introduction): 
  
    "SHACL is a language for describing and constraining the contents of RDF
    graphs" 

  So I ask myself: If an RDF graph is an immutable set of triples, in what
  sense can it be "constrained"?  If an RDF graph is a description with a
  meaning determined by RDF semantics, what does it mean for that _description_
  to be "described"?  Surely SHACL is not meant to somehow limit the
  RDF-semantic meaning of an RDF graph, which would make no sense, but then
  what does "constraining" mean?  Surely the specification of a
  "constraint language" should start by defining "constraint".

  Further on, one finds that the "constraint language" actually has nothing to
  do with somehow constraining RDF graphs and everything to do with describing
  an instance of the class "shape", which can be used with a process for
  determining whether a given RDF graph conforms to the set of constraints
  described in that shape ("validation").  In the Abstract, however, validation
  is mentioned only in passing ("can be used to communicate information about
  data structures...  generate or validate data, or drive user interfaces").

  The Abstract concludes with an unsettling reference to the "underlying
  semantics" of SHACL.  We already have RDF semantics. Will this document
  define another?

1. Introduction

    "This document defines what it means for an RDF graph... to conform to a
    graph containing SHACL shapes" 
    
  An improvement over the Abstract.

1.2. SHACL example

    "A shapes graph containing shape definitions and other information that can
    be utilized to determine what validation is to be done" 

  The wording is odd.  How about: 
  
    "A shapes graph, which describes a set of constraints, can be used to
    determine whether a given data graph conforms to the constraints."
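
  To illustrate that reading, here is a minimal sketch in Python.  It is not
  SHACL: triples are plain tuples, the shape is a hypothetical dictionary
  rather than an RDF graph, and the names conforms, scope_class, min_count,
  and ex:status are my own assumptions for the sake of the example.

```python
# A graph is a frozenset of (subject, predicate, object) tuples, so neither
# graph can be modified while it is being checked.
RDF_TYPE = "rdf:type"


def conforms(data_graph, shape):
    """Return True if every focus node satisfies the shape's one constraint."""
    # Focus nodes: subjects explicitly typed with the shape's class in THIS graph.
    focus_nodes = {s for (s, p, o) in data_graph
                   if p == RDF_TYPE and o == shape["scope_class"]}
    for node in focus_nodes:
        values = {o for (s, p, o) in data_graph
                  if s == node and p == shape["property"]}
        if len(values) < shape["min_count"]:
            return False
    return True


# Hypothetical shape: every ex:Issue needs at least one ex:status value.
issue_shape = {"scope_class": "ex:Issue", "property": "ex:status", "min_count": 1}

good = frozenset({
    ("ex:issue1", RDF_TYPE, "ex:Issue"),
    ("ex:issue1", "ex:status", "ex:Open"),
})
bad = frozenset({("ex:issue2", RDF_TYPE, "ex:Issue")})

print(conforms(good, issue_shape))  # True
print(conforms(bad, issue_shape))   # False
```

  The point of the sketch is only that conformance can be stated as a
  function of two fixed inputs, with nothing "selecting" or "producing"
  anything on its own.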

  Up to this point, has the text actually said that SHACL shape graphs are
  expressed in RDF?  The Document Outline does say that examples are expressed
  in Turtle syntax, which strongly implies RDF.  But that SHACL shape graphs
  are expressed in RDF is actually not obvious for anyone who knows that SPARQL
  also expresses shape-like constructs for matching against RDF data, and that
  SPARQL constructs are not themselves expressed in RDF.  
  
  (As an aside, readers of RDF 1.1 Turtle will find instances with prefixed
  names in lowercase, whereas in the SHACL spec the prefixed names are in
  uppercase.  A sentence about the naming conventions used in this document
  could make this explicit.)

  Section 1.2 continues:

    "ex:IssueShape... [has constraints that apply]... to a (transitive)
    subclass of ex:Issue following rdfs:subClassOf triples" 
    
  Hmm - nothing in the spec has yet hinted that the process of validating a
  data graph against a shape graph will _require_ additional, out-of-band
  information such as schema definitions.

1.3. Relationship between SHACL and RDF

    "SHACL uses RDF and RDFS vocabulary... and concepts... [but] SHACL does not
    always use this vocabulary or these concepts in exactly the way that they
    are formally defined in RDF and RDFS."

  Hang on, so SHACL does _not_ use RDF/S vocabulary as defined by the RDF/S
  specs??  It is jarring to read this in a W3C rec-track specification.  How is
  this not a show-stopper?

  One then learns that SHACL validation is about more than matching an
  immutable data graph against an immutable shapes graph.  Apparently it
  involves the prior creation of an _expanded_ data graph through selective
  materialization of inferred triples.  
  
  The notion of "SHACL processors" having (selectively) to support inferencing
  goes far beyond just defining a vocabulary for describing a shape and a
  process for evaluating that shape against a data graph.  It implies a
  software application with SHACL-specific features and an inferencing style
  that is SHACL-specific -- both of which, to my way of thinking, should be
  completely orthogonal to the language specification, which could quite
  reasonably focus on just the vocabulary and validation algorithm.

  If, as the spec points out, "SHACL implementations may operate on RDF graphs
  that include entailments", couldn't the SHACL spec be helpfully simplified by
  leaving the materialization of inferred triples out of scope entirely -- as
  something done in a pre-processing phase, perhaps according to a few
  well-known patterns as described in a separate specification?
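
  A rough sketch of what such a pre-processing phase might look like, in
  Python, under my own assumptions: triples are plain tuples, the schema is
  supplied as a separate set of rdfs:subClassOf triples, and the function
  names superclasses and materialize_types are hypothetical.  Only subclass
  inference is shown; it is not meant as the spec's algorithm.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(cls, schema):
    """All classes reachable from cls via rdfs:subClassOf, including cls."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for (s, p, o) in schema:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


def materialize_types(data_graph, schema):
    """Return a NEW graph with inferred rdf:type triples added."""
    inferred = {(s, RDF_TYPE, sup)
                for (s, p, o) in data_graph if p == RDF_TYPE
                for sup in superclasses(o, schema)}
    return frozenset(data_graph) | inferred


# Hypothetical schema and data: ex:Bug is a subclass of ex:Issue.
schema = {("ex:Bug", SUBCLASS_OF, "ex:Issue")}
data = frozenset({("ex:bug1", RDF_TYPE, "ex:Bug")})

expanded = materialize_types(data, schema)
print(("ex:bug1", RDF_TYPE, "ex:Issue") in expanded)  # True
print(len(data))                                      # 1 (original untouched)
```

  A validator handed the expanded graph would then need only plain triple
  matching, and the choice of which entailments to materialize would stay
  outside the language specification.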

  The section ends with very puzzling definitions for "subclass", "type", and
  "instance" -- "A node is an instance of a class if one of its types is the
  given class"?? -- but I press on, hoping the next section will bring some
  clarity...

2. Shapes

  The first paragraph says:

    "Shape scopes define the selection criteria"

  but then Figure 1 says:

    "Scope selects focus nodes"

  If a shape is just a graph (or part of a shapes graph), then surely that
  graph cannot actually perform an action, like "selects", as if executed like a
  Java method.  Figure 1 also talks about filter shapes that "refine" or
  "eliminate" and constraints that "produce".  Talking about graphs as agents
  is deeply confusing.

    "Class-based scopes define the scope as the set of all instances of a
    class."

  Okay, yes... classes have extensions... after all, RDF Schema 1.1 says that
  "Associated with each class is a set, called the class extension of the
  class, which is the set of the instances of the class" [3].  But what does
  this have to do with defining the set of focus nodes for a shape?  The scope
  of a shape is _not_ a specific data graph but the set of all instances of a
  class in the world?  
  
  I stop reading.

Summary and suggestions

The spec looks quite nice on the surface but the explanation is conceptually
muddled.  Would it not be simpler and clearer to define a SHACL where, to
paraphrase the 2008 DSP specification [2], "the fundamental usage model for a
[shape] is to examine whether a [data graph] matches the [shape]"?  Everything
else could be out of scope.  Some suggestions:

1. Define "constraint" up-front.

2. If a shape is described in RDF, say so early on, then avoid implying that a
   SHACL shape is based on any semantics other than RDF semantics.

3. Come up with better names than 'subclass', 'superclass', 'type', and
   'instance' for whatever it is that is being described.  Anyone familiar with
   classes and instances in RDF -- or classes and instances in OOP -- will
   surely be led astray by yet another completely different re-use of
   terminology that only _seems_ familiar.  Repurposing these well-worn terms
   actually gets in the way of understanding.

4. Move anything about materializing additional triples as a pre-processing
   step -- even sub-class relationships -- into a separate document specifically
   for implementation advice, such as a primer. In other words, split out all
   references to inferencing from the SHACL language itself.  To keep the language
   specification clear, an immutable data graph need only be validated against an
   immutable shape graph, full stop.  Anything else can be moved elsewhere.

5. Move Sections 6 through 11 into a separate document or primer.  Far better
   to put this into its own shorter, focused specification than tack it onto a
   specification that is already much too long -- 108 pages, had I printed it out.

Simpler, clearer specs stand a correspondingly greater chance of actually being
read -- and used.

Tom

[1] https://www.w3.org/blog/SW/2013/10/04/w3c-workshop-report-rdf-validation-practical-assurances-for-quality-rdf-data/
[2] http://dublincore.org/documents/dc-dsp/
[3] https://www.w3.org/TR/rdf-schema/#ch_classes
[4] https://lists.w3.org/Archives/Public/public-rdf-shapes/
[5] https://lists.w3.org/Archives/Public/public-data-shapes-wg/

-- 
Tom Baker <tom@tombaker.org>

Received on Sunday, 1 May 2016 14:40:59 UTC