- From: Thomas Baker <tom@tombaker.org>
- Date: Sun, 1 May 2016 16:40:21 +0200
- To: RDF Shapes <public-rdf-shapes@w3.org>
Comments on Shapes Constraint Language (SHACL)
Editors Draft 29 April 2016
http://w3c.github.io/data-shapes/shacl/

Some context: I have followed this activity since participating in the workshop on RDF validation in 2013 [1]. The activity seemed like it might achieve the goals pursued a decade ago with the DCMI Working Draft, Description Set Profile Constraint Language [2]. I have tried to keep up with the excellent work by Karen Coyle, Antoine Isaac, Hugo Manguinhas, Thomas Hartmann, and others on comparing the emerging SHACL specification to requirements that have accumulated over the years in the Dublin Core community.

There is a lot to like in SHACL but I must confess that each time I tried to actually read the specification I found myself getting stuck at the same places. I'd set it aside, assuming that the issues would shake out. Many months later, however, I find the same sticking points, unchanged.

This time I pressed on through the introduction to Section 2.1. These comments convey my thoughts while reading the text and end with some suggestions. I have made no effort to catch up on discussion in the relevant mailing lists [4,5], so please forgive me if I simply cover issues here that are already well understood.

Abstract

First sentence (also first sentence of Introduction):

    "SHACL is a language for describing and constraining the contents of RDF graphs"

So I ask myself: If an RDF graph is an immutable set of triples, in what sense can it be "constrained"? If an RDF graph is a description with a meaning determined by RDF semantics, what does it mean for that _description_ to be "described"? Surely SHACL is not meant to somehow limit the RDF-semantic meaning of an RDF graph, which would make no sense, but then what does "constraining" mean? Surely the specification of a "constraint language" should start by defining "constraint".

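To make the question concrete: one reading under which "constraining" makes sense is that a constraint is only a test applied to a graph, never a change to it. What follows is a minimal sketch of that reading in Python, not anything taken from the SHACL draft. The `Graph` alias, the `has_title` check, the `conforms` function, and the `ex:` names are all hypothetical, and modelling a graph as a frozen set of string tuples is a deliberate simplification.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# Simplifying assumption: a triple is three strings, a graph an immutable set of them.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
Constraint = Callable[[Graph], bool]


def has_title(graph: Graph) -> bool:
    """Hypothetical constraint: every subject in the graph has an ex:title triple."""
    subjects = {s for s, _, _ in graph}
    titled = {s for s, p, _ in graph if p == "ex:title"}
    return subjects <= titled


def conforms(graph: Graph, constraints: Iterable[Constraint]) -> bool:
    """Report whether the graph passes every constraint; the graph is only read."""
    return all(check(graph) for check in constraints)


data: Graph = frozenset({
    ("ex:issue1", "ex:title", "Broken link"),
    ("ex:issue2", "ex:status", "open"),  # no ex:title for ex:issue2
})

result = conforms(data, [has_title])
```

On this reading the graph is never "constrained" at all: `conforms` answers a yes-or-no question about it and leaves it untouched.
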
Further on, one finds that the "constraint language" actually has nothing to do with somehow constraining RDF graphs and everything to do with describing an instance of the class "shape", which can be used with a process for determining whether a given RDF graph conforms to the set of constraints described in that shape ("validation"). In the Abstract, however, validation is mentioned only in passing ("can be used to communicate information about data structures... generate or validate data, or drive user interfaces").

The Abstract concludes with an unsettling reference to the "underlying semantics" of SHACL. We already have RDF semantics. Will this document define another?

1. Introduction

    "This document defines what it means for an RDF graph... to conform to a graph containing SHACL shapes"

An improvement over the Abstract.

1.2. SHACL example

    "A shapes graph containing shape definitions and other information that can be utilized to determine what validation is to be done"

The wording is odd. How about:

    "A shapes graph, which describes a set of constraints, can be used to determine whether a given data graph conforms to the constraints."

Up to this point, has the text actually said that SHACL shape graphs are expressed in RDF? The Document Outline does say that examples are expressed in Turtle syntax, which strongly implies RDF. But that SHACL shape graphs are expressed in RDF is actually not obvious for anyone who knows that SPARQL also expresses shape-like constructs for matching against RDF data, and that SPARQL constructs are not themselves expressed in RDF.

(As an aside, readers of RDF 1.1 Turtle will find instances with prefixed names in lowercase, whereas in the SHACL spec the prefixed names are in uppercase. A sentence about the naming conventions used in this document could make this explicit.)

Section 1.2 continues: "ex:IssueShape... [has constraints that apply]...
to a (transitive) subclass of ex:Issue following rdfs:subClassOf triples"

Hmm - nothing in the spec has yet hinted that the process of validating a data graph against a shape graph will _require_ additional, out-of-band information such as schema definitions.

1.3. Relationship between SHACL and RDF

    "SHACL uses RDF and RDFS vocabulary... and concepts... [but] SHACL does not always use this vocabulary or these concepts in exactly the way that they are formally defined in RDF and RDFS."

Hang on, so SHACL does _not_ use RDF/S vocabulary as defined by the RDF/S specs?? It is jarring to read this in a W3C rec-track specification. How is this not a show-stopper?

One then learns that SHACL validation is about more than matching an immutable data graph against an immutable shapes graph. Apparently it involves the prior creation of an _expanded_ data graph through selective materialization of inferred triples.

The notion of "SHACL processors" having (selectively) to support inferencing goes far beyond just defining a vocabulary for describing a shape and a process for evaluating that shape against a data graph. It implies a software application with SHACL-specific features and an inferencing style that is SHACL-specific -- both of which, to my way of thinking, should be completely orthogonal to the language specification, which could quite reasonably focus on just the vocabulary and validation algorithm.

If, as the spec points out, "SHACL implementations may operate on RDF graphs that include entailments", couldn't the SHACL spec be helpfully simplified by leaving the materialization of inferred triples out of scope entirely -- as something done in a pre-processing phase, perhaps according to a few well-known patterns as described in a separate specification?

The section ends with very puzzling definitions for "subclass", "type", and "instance" -- "A node is an instance of a class if one of its types is the given class"??
-- but I press on, hoping the next section will bring some clarity...

2. Shapes

The first paragraph says:

    "Shape scopes define the selection criteria"

but then Figure 1 says:

    "Scope selects focus nodes"

If a shape is just a graph (or part of a shapes graph), then surely that graph cannot actually perform an action, like "selects", as if executed like a Java method. Figure 1 also talks about filter shapes that "refine" or "eliminate" and constraints that "produce". Talking about graphs as agents is deeply confusing.

    "Class-based scopes define the scope as the set of all instances of a class."

Okay, yes... classes have extensions... after all, RDF Schema 1.1 says that "Associated with each class is a set, called the class extension of the class, which is the set of the instances of the class" [3]. But what does this have to do with defining the set of focus nodes for a shape? The scope of a shape is _not_ a specific data graph but the set of all instances of a class in the world?

I stop reading.

Summary and suggestions

The spec looks quite nice on the surface but the explanation is conceptually muddled. Would it not be simpler and clearer to define a SHACL where, to paraphrase the 2008 DSP specification [2], "the fundamental usage model for a [shape] is to examine whether a [data graph] matches the [shape]"? Everything else could be out of scope.

Some suggestions:

1. Define "constraint" up-front.

2. If a shape is described in RDF, say so early on, then avoid implying that a SHACL shape is based on any semantics other than RDF semantics.

3. Come up with better names than 'subclass', 'superclass', 'type', and 'instance' for whatever it is that is being described. Anyone familiar with classes and instances in RDF -- or classes and instances in OOP -- will surely be led astray by yet another completely different re-use of terminology that only _seems_ familiar. Repurposing these well-worn terms actually gets in the way of understanding.

4.
Move anything about materializing additional triples as a pre-processing step -- even sub-class relationships -- into a separate document specifically for implementation advice, such as a primer. In other words, split out all references to inferencing from the SHACL language itself. To keep the language specification clear, an immutable data graph need only be validated against an immutable shape graph, full stop. Anything else can be moved elsewhere.

5. Move Sections 6 through 11 into a separate document or primer. Far better to put this into its own shorter, focused specification than to tack it onto a specification that is already much too long -- 108 pages, had I printed it out.

Simpler, clearer specs stand a correspondingly greater chance of actually being read -- and used.

Tom

[1] https://www.w3.org/blog/SW/2013/10/04/w3c-workshop-report-rdf-validation-practical-assurances-for-quality-rdf-data/
[2] http://dublincore.org/documents/dc-dsp/
[3] https://www.w3.org/TR/rdf-schema/#ch_classes
[4] https://lists.w3.org/Archives/Public/public-rdf-shapes/
[5] https://lists.w3.org/Archives/Public/public-data-shapes-wg/

--
Tom Baker <tom@tombaker.org>
Received on Sunday, 1 May 2016 14:40:59 UTC