- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Wed, 18 May 2016 15:15:53 -0700
- To: Arnaud Le Hors <lehors@us.ibm.com>, public-data-shapes-wg@w3.org
Here is a proposed partial reply to Tom Baker. It depends on some changes
that have not yet been made to the SHACL specification.

peter


Date: Sun, 1 May 2016 16:40:21 +0200
From: Thomas Baker <tom@tombaker.org>
To: RDF Shapes <public-rdf-shapes@w3.org>

>
> Comments on
>
> Shapes Constraint Language (SHACL)
> Editors Draft 29 April 2016
> http://w3c.github.io/data-shapes/shacl/
>
> Some context: I have followed this activity since participating in the workshop
> on RDF validation in 2013 [1]. The activity seemed like it might achieve the
> goals pursued a decade ago with the DCMI Working Draft, Description Set Profile
> Constraint Language [2]. I have tried to keep up with the excellent work by
> Karen Coyle, Antoine Isaac, Hugo Manguinhas, Thomas Hartmann, and others on
> comparing the emerging SHACL specification to requirements that have
> accumulated over the years in the Dublin Core community.
>
> There is alot to like in SHACL but I must confess that each time I tried to
> actually read the specification I found myself getting stuck at the same
> places. I'd set it aside, assuming that the issues would shake out. Many
> months later, however, I find the same sticking points, unchanged. This time I
> pressed on through the introduction to Section 2.1.
>
> These comments convey my thoughts while reading the text and end with some
> suggestions. I have made no effort to catch up on discussion in the relevant
> mailing lists [4,5], so please forgive me if I simply cover issues here that
> are already well-understood.
>
> Abstract
>
> First sentence (also first sentence of Introduction):
>
> "SHACL is a language for describing and constraining the contents of RDF
> graphs"
>
> So I ask myself: If an RDF graph is an immutable set of triples, in what
> sense can it be "constrained"? If an RDF graph is a description with a
> meaning determined by RDF semantics, what does it mean for that _description_
> to be "described"? Surely SHACL is not meant to somehow limit the
> RDF-semantic meaning of an RDF graph, which would make no sense, but then
> what does mean "constraining" mean? Surely the specification of a
> "constraint language" should start by defining "constraint".

1: Change: Replace "constraining" by "validating" wherever possible in the
document.

2: Change in abstract and introduction: SHACL is a language for validating
whether RDF graphs meet certain conditions.

> Further on, one finds that the "constraint language" actually has nothing to
> do with somehow constraining RDF graphs and everything to do with describing
> an instance of the class "shape", which can be used with a process for
> determining whether a given RDF graph conforms to the set of constraints
> described in that shape ("validation"). In the Abstract, however, validation
> is mentioned only in passing ("can be used to communicate information about
> data structures... generate or validate data, or drive user interfaces").
>
> The Abstract concludes with an unsettling reference to the "underlying
> semantics" of SHACL. We already have RDF semantics. Will this document
> define another?

3: Change: Use "SHACL" to differentiate SHACL versions of terminology from RDF
and RDFS versions throughout the document.

> 1. Introduction
>
> "This document defines what it means for an RDF graph... to conform to a
> graph containing SHACL shapes"
>
> An improvement over the Abstract.

4: Change to: This document defines the SHACL Shapes Constraint Language, a
language for validating RDF graphs against a set of conditions. These
conditions are provided as shapes and other constructs expressed in the form
of an RDF graph. RDF graphs that are used in this manner are called "shapes
graphs" in SHACL and the RDF graphs that are validated against a shapes graph
are called "data graphs". As SHACL shapes graphs are used to validate that
data graphs satisfy a set of conditions, they can also be viewed as a
description of the data graphs that do satisfy these conditions.

> 1.2. SHACL example
>
> "A shapes graph containing shape definitions and other information that can
> be utilized to determine what validation is to be done"
>
> The wording is odd. How about:
>
> "A shapes graph, which describes a set of constraints, can be used to
> determine whether a given data graph conforms to the constraints."

5: Change to: A shapes graph contains shapes and other information to
determine whether a data graph validates against the shapes graph.

> Up to this point, has the text actually said that SHACL shape graphs are
> expressed in RDF? The Document Outline does say that examples are expressed
> in Turtle syntax, which strongly implies RDF. But that SHACL shape graphs
> are expressed in RDF is actually not obvious for anyone who knows that SPARQL
> also expresses shape-like constructs for matching against RDF data, and that
> SPARQL constructs are not themselves expressed in RDF.
> (As an aside, readers of RDF 1.1 Turtle will find instances with prefixed
> names in lowercase, whereas in the SHACL spec the prefixed names are in
> uppercase. A sentence about the naming conventions used in this document
> could make this explicit.)
>
> Section 1.2 continues:
>
> "ex:IssueShape... [has constraints that apply]... to a (transitive)
> subclass of ex:Issue following rdf:subClassOf triples"
>
> Hmm - nothing in the spec has yet hinted that the process of validating a
> data graph against a shape graph will _require_ additional, out-of-band
> information such as schema definitions.

6: *NEEDS WORK*

> 1.3. Relationship between SHACL and RDF
>
> "SHACL uses RDF and RDFS vocabulary... and concepts... [but] SHACL does not
> always use this vocabulary or these concepts in exactly the way that they
> are formally defined in RDF and RDFS."
>
> Hang on, so SHACL does _not_ use RDF/S vocabulary as defined by the RDF/S
> specs?? It is jarring to read this in a W3C rec-track specification. How is
> this not a show-stopper?
>
> One then learns that SHACL validation is about more than matching an
> immutable data graph against an immutable shapes graph. Apparently it
> involves the prior creation of an _expanded_ data graph through selective
> materialization of inferred triples.

7: The only materialization-like notion in SHACL is default value types. This
notion is being revised and may be done away with.

> The notion of "SHACL processors" having (selectively) to support inferencing
> goes far beyond just defining a vocabulary for describing a shape and a
> process for evaluating that shape against a data graph. It implies a
> software application with SHACL-specific features and an inferencing style
> that is SHACL-specific -- both of which, to my way of thinking, should be
> completely orthogonal to the language specification, which could quite
> reasonably focus on just the vocabulary and validation algorithm.

8: SHACL shapes are written in RDF and some constructs in SHACL are grouped by
using rdf:type and rdfs:subClassOf triples. The document will be changed to
use SHACL-specific vocabulary showing that there is no need for inferencing
beyond SPARQL paths, in particular rdf:type/rdfs:subClassOf*

> If, as the spec points out, "SHACL implementations may operate on RDF graphs
> that include entailments", couldn't the SHACL spec be helpfully simplified by
> leaving the materialization of inferred triples out of scope entirely -- as
> something done in a pre-processing phase, perhaps according to a few
> well-known patterns as described in a separate specification?

9: This could have been done but the working group did not want to depend on
any external materialization of entailed triples. SHACL thus works on any RDF
graph.
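
To make replies 8 and 9 concrete, here is a minimal Python sketch (an
illustration only, not text from the SHACL specification) of following
rdf:type/rdfs:subClassOf* over just the triples asserted in a graph. The
function names, the encoding of triples as tuples of strings, and the ex:
example data are all assumptions made for this sketch.

```python
RDF_TYPE = "rdf:type"
RDFS_SUBCLASSOF = "rdfs:subClassOf"


def shacl_subclasses(graph, cls):
    """Return cls plus every class that reaches cls through a chain of
    rdfs:subClassOf triples that are actually present in graph."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            # Walk subClassOf edges backwards; the found set also guards
            # against cycles in the subclass chain.
            if p == RDFS_SUBCLASSOF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def shacl_instances(graph, cls):
    """Nodes matching rdf:type/rdfs:subClassOf* cls, read directly from
    the asserted triples. Nothing is added to the graph."""
    classes = shacl_subclasses(graph, cls)
    return {s for s, p, o in graph if p == RDF_TYPE and o in classes}


# Hypothetical data graph, written as (subject, predicate, object) tuples.
data_graph = {
    ("ex:Bug", RDFS_SUBCLASSOF, "ex:Issue"),
    ("ex:CriticalBug", RDFS_SUBCLASSOF, "ex:Bug"),
    ("ex:i1", RDF_TYPE, "ex:Issue"),
    ("ex:i2", RDF_TYPE, "ex:CriticalBug"),
    ("ex:p1", RDF_TYPE, "ex:Person"),
}

print(sorted(shacl_instances(data_graph, "ex:Issue")))  # ['ex:i1', 'ex:i2']
```

The data graph is only read, never extended: ex:i2 is found by walking the two
rdfs:subClassOf triples, not by first materializing an ex:i2 rdf:type ex:Issue
triple.
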

> The section ends with very puzzling definitions for "subclass", "type", and
> "instance" -- "A node is an instance of a class if one of its types is the
> given class"?? -- but I press on, hoping the next section will bring some
> clarity...

10: This section will be eliminated in favor of SHACL-specific terms.

> 2. Shapes
>
> The first paragraph says:
>
> "Shape scopes define the selection criteria"
>
> but then Figure 1 says:
>
> "Scope selects focus nodes"
>
> If a shape is just a graph (or part of a shapes graph), then surely that
> graph cannot actually perform a action, like "selects", as if executed like a
> Java method. Figure 1 also talks about filter shapes that "refine" or
> "eliminate" and constraints that "produce". Talking about graphs as agents
> is deeply confusing.
>
> "Class-based scopes define the scope as the set of all instances of a
> class."
>
> Okay, yes... classes have extensions... after all, RDF Schema 1.1 says that
> "Associated with each class is a set, called the class extension of the
> class, which is the set of the instances of the class" [3]. But what does
> this have to do with defining the set of focus nodes for a shape? The scope
> of a shape is _not_ a specific data graph but the set of all instances of a
> class in the world?

11: *NEEDS WORK*

> I stop reading.
>
> Summary and suggestions
>
> The spec looks quite nice on the surface but the explanation is conceptually
> muddled. Would it not be simpler and clearer to define a SHACL where, to
> paraphrase the 2008 DSP specification [2], "the fundamental usage model for a
> [shape] is to examine whether a [data graph] matches the [shape]"? Everything
> else could be out of scope. Some suggestions:
>
> 1. Define "constraint" up-front.

Shapes are discussed early. Constraints are introduced in the new section on
terminology.

> 2. If a shape is described in RDF, say so early on, then avoid implying that a
> SHACL shape is based on any semantics other than RDF semantics.

See change 4: above.

> 3. Come up with better names than 'subclass', 'superclass', 'type', and
> 'instance' for whatever it is that is being described. Anyone familiar with
> classes and instances in RDF -- or classes and instances in OOP -- will
> surely be led astray by yet another completely different re-use of
> terminology that only _seems_ familiar. Repurposing these well-worn terms
> actually gets in the way of understanding.

*NEEDS TO BE DONE*
Most of these have been eliminated in favor of "SHACL type".

> 4. Move anything about materializing additional triples as a pre-processing
> step -- even sub-class relationships -- into a separate document specifically
> for implementation advice, such as a primer. In other words, split out all
> references to inferencing from the SHACL language itself. To keep the language
> specification clear, an immutable data graph need only be validated against an
> immutable shape graph, full stop. Anything else can be moved elsewhere.

*NEEDS TO BE DONE*

> 5. Move Sections 6 through 11 into a separate document or primer. Far better
> to put this into its own shorter, focused specification than tack it onto
> specification that is already much too long -- 108 pages, had I printed it out.

*NEEDS TO BE DONE*

> Simpler, clearer specs stand a correspondingly greater chance of actually being
> read -- and used.
>
> Tom
>
> [1] https://www.w3.org/blog/SW/2013/10/04/w3c-workshop-report-rdf-validation-practical-assurances-for-quality-rdf-data/
> [2] http://dublincore.org/documents/dc-dsp/
> [3] https://www.w3.org/TR/rdf-schema/#ch_classes
> [4] https://lists.w3.org/Archives/Public/public-rdf-shapes/
> [5] https://lists.w3.org/Archives/Public/public-data-shapes-wg/
>
> --
> Tom Baker <tom@tombaker.org>


Date: Thu, 5 May 2016 10:15:11 +0200
From: Thomas Baker <tom@tombaker.org>
To: RDF Shapes <public-rdf-shapes@w3.org>

> More comments on SHACL [1], Editor's Draft 29 April 2016
> http://w3c.github.io/data-shapes/shacl/
>
> I posted a previous batch of comments on 1 May [1] but have learned a few
> things since then. I remain unsure what the specification really means in some
> respects, so the following reflects what I think the specification "really"
> means -- what I infer it to mean -- with some suggestions on how the spec
> could help the reader by articulating some key assumptions up-front.
>
> 1. SHACL provides a vocabulary for describing shapes and a simple
> algorithm for "validating" an arbitrary graph of RDF data (Data Graph)
> against an RDF description of data shapes (Shapes Graph).

See 4:

> 2. The SHACL validation algorithm checks the conformance of triples in
> the Data Graph to "constraints" described in the Shapes Graph.

See 4:

> 3. Validation evaluates a target Data Graph at the level of its abstract
> syntax. In accordance with RDF 1.1 Concepts and Abstract Syntax [1],
> RDF abstract syntax consists of triples, or subject and object nodes
> connected with predicates, with nodes that may be IRIs, blanks, or
> datatyped literals. The SHACL spec's use of "focus nodes" fits with
> the use of "node" in rdf11-concepts [2].

SHACL works on RDF graphs, which are the abstract syntax of RDF, but "RDF
graphs" is a better name for this. The new terminology section includes
wording that defers to [2] for the terminology defined there.

> 4. In accordance with the Closed-World Assumption (CWA), the validation
> algorithm limits itself to matching constraint patterns, as described in
> the Shapes Graph, against the abstract-syntactic components of the triples
> actually asserted in target Data Graph, with no further interpretation of
> the Data Graph or inferencing based on its formal semantics.

More care is now taken to say that SHACL works on data graphs directly.

> 5. A Shapes Graph is expressed in RDF. Even though the primary use of
> a Shapes Graph is for CWA-based validation, it should be noted that the
> semantics of the Shapes Graph itself, as of any other expression in RDF,
> follows the Open-World Assumption (OWA).

Shapes graphs in SHACL are viewed as syntactic constructs, where the OWA and
CWA assumptions are not relevant. SHACL does determine whether some syntactic
constructs are valid by using chains of rdfs:subClassOf triples, but this
again only looks at the triples that are in the RDF graph. Thus SHACL does not
depend on any open-world notions.

> 6. The inherently open-world meaning of the Shapes Graph, however, does not
> seem to be of practical consequence for its use in CWA-based validation --
> unless, perhaps, one were to construct or augment a Shapes Graph with inferred
> triples -- with the caveat that shapes graphs could potentially pollute
> "real" data by adding meaning that is not intended to be interpreted as
> real data, e.g., as when the practical hack of using a class IRI to name a
> shape were followed (Section 2.1.2.1, "Implicit Class Scopes").

The scopes of a shape are determined only by looking at the triples in the
shapes graph. SHACL does not depend on the addition of any triples to either
the shapes graph or the data graph, even for shapes that are also SHACL
instances of rdfs:Class.

*NEED TO RENAME "implicit" to something else*

> 7. A Shapes Graph may specify a potential set of "focus nodes" as the "scope"
> of validation in the Data Graph. A Shapes Graph may also specify a potential
> set of "focus nodes" to be dropped out of the validation scope ("filtered").
> Potential focus nodes may or may not match actual nodes in the Data Graph.

The discussion of shapes, scopes, and filters has been revised considerably.
*NEED TO DO THIS*

> 8. Validation based on closed-world assumptions applies to the relationship
> between constraints (as described the Shapes Graph) and triples in the data
> graph viewed at the level of their RDF abstract-syntactic components
> (e.g., the "focus nodes").

*TO DO*

> Note: An earlier iteration of these comments was posted on the DC-ARCHITECTURE
> [3]. The resulting thread drew out some additional comments and insights that
> could be of interest to members of Data Shapes.

The working group may take these extra comments into account.

> [1] https://lists.w3.org/Archives/Public/public-rdf-shapes/2016May/0000.html
> [2] https://www.w3.org/TR/rdf11-concepts/
> [3] https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1605&L=dc-architecture&P=3148
>
> ----------------------------------------------------------------------
> Discussion
>
> Because SHACL is expressed in RDF, like it or not, a Shapes Graph is
> interpreted according to OWA. Since the design decision was made to express
> the Shapes Graph in RDF, and not in a completely different syntax -- as in the
> case of SPARQL or, for that matter, DCMI's DSP -- the native OWA interpretation
> of a Shapes Graph cannot be papered over, ignored, or otherwise contradicted.

SHACL views the shapes graph as an RDF graph, i.e., a set of triples. As all
that counts is this set of triples, the OWA is not relevant. Even if this were
not the case, there is nothing in the SHACL specification that is concerned
with whether something that is not stated in the shapes graph, or in the data
graph, is false or not.

> The design choice of expressing Shapes Graphs in RDF does somewhat limit SHACL,
> in certain respects, compared to SPARQL or DSP. In SPARQL, for example,
> `rdfs:subClassOf*` is interpreted as referring to the transitive closure of
> `rdfs:subClassOf`; the asterisk is a sort of syntactic sugar, a convenience
> notation, that triggers specific inferences. As there is no equivalent way to
> express `rdfs:subClassOf*` in RDFS, there is no way to say that
> `rdfs:subClassOf` actually _means_ the transitive closure without, in effect,
> arbitrarily overriding its global semantics.

The RDFS semantics implies that rdfs:subClassOf is transitive, so any
discussion of classes in RDFS has to take this into account. As SHACL is only
concerned with RDF graphs as sets of triples, it does not depend on the RDFS
semantics when it talks about SHACL types (and SHACL subclass, superclass, and
instance), so these are defined as using the transitive closure of
rdfs:subClassOf triples. The SHACL specification takes care to use "SHACL" to
distinguish these notions from their RDF and RDFS versions.
*NEEDS TO BE DONE*

> Perhaps this is why the SHACL spec says that "SHACL does not always use this
> vocabulary or these concepts in exactly the way that they are formally defined
> in RDF and RDFS" (Section 1.3) -- a notion which gratuitously sets SHACL at
> odds with W3C Semantic Web standards.

*TO DO*

> One could perhaps sidestep the issue by dropping _all_ consideration of
> inferencing from the normative SHACL specification; saying only that there may
> be a need for inferencing in a pre-processing phase; then discussing those
> pre-processing options in a separate guidance document. Putting inferencing
> out of scope would make the SHACL spec simpler, clearer, and shorter.

Instead of depending on pre-processing, SHACL does its own determination here.

> Abstract syntax issues
>
> Because SHACL is viewing RDF data graphs through a closed-world lens, the
> meaning of the graph is beside the point -- just as the meaning of a graph is
> beside the point with SPARQL. A SHACL Shapes Graph is validated against a Data
> Graph at the level of the abstract syntax of the Data Graph. According to RDF
> 1.1 Concepts and Abstract Syntax, RDF graphs are sets of subject-predicate-
> object triples, where the elements may be IRIs, blank nodes, or datatyped
> literals [1].

This is made more clear in the current version of the specification.
*NEEDS TO BE DONE*

> Note that at the level of their abstract syntax, RDF Graphs have no "classes"
> and no "instances"! A search in rdf11-concepts [1] for the words "instance" or
> "class" will find no mention of either one, anywhere in the spec.

This is why SHACL needs to define its own terminology in the new terminology
section.

> Confusingly, the SHACL spec makes reference to "instances", "classes", or
> "instances of classes" in the Data Graph, viewing the Data Graph through a
> semantic lens. Coining a new SHACL-specific notion of "instance" (and "class",
> etc) next to the existing notions of RDF "instance" and OO "instance" make
> SHACL particularly hard to grok. At the end of Section 1.3, for example, the
> definition for "instance" starts off by saying:
>
> "A node is an instance of a class..."
>
> which I take to mean:
>
> "A node [in the Data Graph] is an instance of a class..."

*TO DO*

> By comparison, the SPARQL spec specifies a SPARQL-specific syntax to express
> triple patterns composed of variables and RDF-abstract-syntactic things such as
> IRIs and Literals. SPARQL itself does not "understand" that something is a
> class or an instance -- it simply supports the formation of triple patterns and
> leaves it to Primers and other usage guides to express queries, informally, in
> semantic terms (e.g., "What data is stored about instances of class X?") This
> separation of concerns makes the SPARQL specification much easier to
> understand. It is worth noting that DCMI's Description Set Profile Constraint
> Language [3] also defines its own syntax.

*TO DO*

> As an aside, it is unclear to me why it is even necessary for the SHACL spec to
> redefine an already-loaded, overdetermined term such as "class" to refer to a
> set of what one might call "type-matched focus nodes". If the intention is to
> make SHACL more understandable to people who are unfamiliar with RDF, this
> should be done not in the formal spec but in a primer or tutorial, where an
> explanation can be customized for a specific audience, such as programmers.

*TO DO*

> A year ago, it was proposed that an abstract syntax be developed for SHACL [4].
> There was little discussion and the issue remains open but neglected. Since
> SHACL is natively expressed in RDF, its abstract syntax is in effect the
> abstract syntax for RDF. It is not clear to me whether this is actually a good
> idea. If a Shapes Graph only exists to be used in a closed-world process
> validating a Data Graph, what is the specific advantage of expressing it in
> RDF? Might a proper abstract syntax for SHACL, based on its own BNF, etc,
> further focus and clarify the SHACL language? On the other hand, I see no
> specific reasons why SHACL should _not_ use RDF to express shapes graphs as it
> does -- provided that the spec (or a primer) point out any potential pitfalls,
> as touched on above.

*TO DO*

> [1] https://www.w3.org/TR/rdf11-concepts/
> [2] https://www.w3.org/TR/rdf11-concepts/#data-model
> [3] http://dublincore.org/documents/dc-dsp/
> [4] https://www.w3.org/2014/data-shapes/track/issues/52
>
>
> --
> Tom Baker <tom@tombaker.org>

Received on Wednesday, 18 May 2016 22:16:27 UTC