Re: Analysis of Example in ShEx paper submitted to SWJ from Eric Prud'hommeaux on 2016-01-01 (public-data-shapes-wg@w3.org from January 2016)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 1 Jan 2016 08:50:44 -0500
To: "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
Cc: Jose Emilio Labra Gayo <jelabra@gmail.com>, RDF Data Shapes Working Group <public-data-shapes-wg@w3.org>
Message-ID: <20160101135042.GA25606@w3.org>
* Peter F. Patel-Schneider <pfpschneider@gmail.com> [2016-01-01 04:24-0800]
> On 12/31/2015 11:58 PM, Jose Emilio Labra Gayo wrote:
> > On Thu, Dec 31, 2015 at 2:02 PM, Peter F. Patel-Schneider
> > <pfpschneider@gmail.com> wrote:
> > 
> >     So the paper then works something like this:
> > 
> >     Here is some sort of an E-R diagram (Figure [1]) that somehow describes an
> >     actual linked data use case (although even it is modified from the publication
> >     that describes the actual use case).  Here are some ShEx shapes (Section 3)
> >     that do something different - more disjunction, for example.  Therefore ShEx
> >     is suitable for validating and describing linked data portals.
> > 
> > 
> > Not at all. The paper introduces a real use case using an informal notation in
> > section 2, then it describes the structure using ShEx notation which can also
> > be seen as an introduction to ShEx. ShEx was indeed used when we developed the
> > linked data portal to describe its contents. Sections 4 and 5 describe Shape
> > Expressions tools and how they can be used to validate a linked data portal.
> > Section 6 is new and describes the same data model using SHACL. We thought the
> > paper would be useful for readers who wanted to learn about SHACL use in a
> > real use case. Section 7 describes a tool called "wiGen" that can generate
> > random instance data on demand based on the previously defined data model and
> > proposes its use as a performance benchmarking tool.
> 
> The paper does not introduce a real use case.  The use case was already
> described in a previous publication.  There are significant differences
> between the previous publication of the use case and the description here.
> 
> > 
> >     This doesn't sound very convincing.
> > 
> > 
> > If what doesn't convince you are the modifications done to the original data
> > model, the reasons for those modifications are:
> > 
> > 1.- To make the paper self-contained and easier to read by the target
> > audience. We simplified some parts of the original data model like the
> > definitions of the statistical computations because we wanted this paper to be
> > self-contained and easier to read by people not interested in statistical
> > computations.
> 
> Sure, you may do this, but you do have to tell readers that you have done this
> and convince them that ShEx can handle the simplifications.
> 
> > 2.- To be as general as possible. We considered that imposing a "rdf:type"
> > declaration on every node in a linked data portal was too restrictive.
> > Although those declarations can be a good practice, they are not mandatory in
> > RDF and linked data validators should not depend on those declarations to do
> > their job.
> 
> The use case as previously described had rdf:type information for every box in
> the diagram.  Changing this important feature of the use case is not
> acceptable.  Yes, rdf:type links are not required for every node in an RDF
> graph, but if your use case had them then they need to be retained.

This goes beyond whether rdf:type arcs are required for every node. In
order to use rdf:type arcs to target validation, we need to know that
some type arcs *uniquely* identify the nodes of interest (unless you
want to sort through arbitrary numbers of irrelevant validation
failures).
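
A rough sketch of the problem, using a toy in-memory graph rather than
any real ShEx or SHACL tool. All node names, the wf:Region class and
the single "exactly one wf:iso2" constraint are made up for
illustration; they are not taken from the web index data.

```python
# Toy graph as (subject, predicate, object) triples; all names are hypothetical.
RDF_TYPE = "rdf:type"
triples = {
    ("ex:es", RDF_TYPE, "wf:Country"),
    ("ex:es", "wf:iso2", "ES"),
    ("ex:emea", RDF_TYPE, "wf:Country"),  # a region that also carries the country type
    ("ex:emea", RDF_TYPE, "wf:Region"),
    ("ex:fr", "wf:iso2", "FR"),           # a country with no type arc at all
}


def nodes_of_type(graph, cls):
    """Return every subject that has an rdf:type arc to cls."""
    return {s for (s, p, o) in graph if p == RDF_TYPE and o == cls}


def has_exactly_one(graph, node, predicate):
    """Stand-in constraint: node has exactly one arc with this predicate."""
    return sum(1 for (s, p, _) in graph if s == node and p == predicate) == 1


def validate_by_type(graph, cls, predicate):
    """Check the constraint on every node selected by its type arc."""
    return {n: has_exactly_one(graph, n, predicate)
            for n in nodes_of_type(graph, cls)}


report = validate_by_type(triples, "wf:Country", "wf:iso2")
print(report)
```

In this sketch ex:emea is reported as a failure even though nobody
meant to test it as a country, and ex:fr is never tested at all
because it has no type arc.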

Note that XML Schema provides one way for nodes to be associated with
schema types, and RELAX NG provides, iirc, zero. Much of the use of
schema languages entails some external association between nodes and
candidate types. For instance, WSDL associates a schema type with a
node in a document in a protocol exchange. This is used in a few ways:

1 documentation - what do the messages have to look like

2 validation - WSDL applications root around in the SOAP envelope to
  find the nodes of interest and test them against the schema types.

3 code generation - WSDL tools generate stub code based on the schema
  types. This code is generally used within some framework which does
  the rooting around in SOAP envelopes for you.
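
As a rough sketch of what such an external association might look
like for RDF: the data carries no type arcs, and a separate table
(here the hypothetical node_to_shape) says which node to test against
which shape. The shape name, node names and the "each predicate
exactly once" rule are assumptions for illustration only.

```python
# Hypothetical toy data: triples as (subject, predicate, object) tuples.
triples = {
    ("ex:fr", "wf:iso2", "FR"),
    ("ex:fr", "wf:iso3", "FRA"),
    ("ex:xx", "wf:iso2", "XX"),  # missing wf:iso3
}

# A "shape" here is just the predicates a node must carry exactly once.
shapes = {"CountryShape": ("wf:iso2", "wf:iso3")}

# The external association: supplied by context, not by the data itself.
node_to_shape = {"ex:fr": "CountryShape", "ex:xx": "CountryShape"}


def count(graph, node, predicate):
    """Number of arcs leaving node with the given predicate."""
    return sum(1 for (s, p, _) in graph if s == node and p == predicate)


def conforms(graph, node, shape_name):
    """True if node has each predicate of the shape exactly once."""
    return all(count(graph, node, p) == 1 for p in shapes[shape_name])


results = {n: conforms(triples, n, sh) for n, sh in node_to_shape.items()}
print(results)
```

Only the nodes named in the association are tested, so there are no
failures to sort through for nodes nobody asked about.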


Given that it is important to be able to trigger validation by other
means than type arcs, how do you suggest that be introduced?

1 a little explanatory text to the effect of "The original
  representation of the web index included type arcs on every
  node. This is not the case for RDF data in general so we are
  modifying the use case to illustrate how validation occurs without
  discriminating type arcs."

2 abandon the web index use case and cook up something much less
  documented.

IMO, 1 seems much more satisfactory to readers in general.


> Otherwise you may be proposing that some feature of ShEx is useful for
> validating and describing linked data portals without any support.
> 
> > 3.- To cover some of the features of ShEx in the context of a real use case.
> > We added some features like closed shapes, disjunction, Extra modifiers etc.
> > to help a reader understand those features when they are applied in practice.
> > Our intention was that section 3 of the paper could be seen both as a
> > description of the data model and as an introduction to ShEx by example.
> 
> These changes have similar problems to the removal of the rdf:type link for
> countries.
> 
> > 
> >     PS: Many of the shapes actually do use rdf:type (as "a").  It is just :Country
> >     that has dropped the rdf:type from the previous  paper.
> > 
> > 
> > Yes, indeed the original data model contained "rdf:type" declarations for most
> > of the nodes except for some computations. In the paper we decided to drop
> > "rdf:type" in :Country for two reasons:
> > 
> > 1.- Given that :Country is defined as an open shape, we don't prohibit its
> > appearance, we just omit its definition from the shape meaning that it can
> > appear or not.
> > 2.- In a later project we noticed that "rdf:type :Country" was too restrictive
> > for those nodes because we included also regions in the range of the
> > "cex:ref-area" property.
> 
> But the box in Figure 1 is still titled :Country and it still has an iso2
> field.  This makes it look as if it is for countries.  If you had a need to
> change the use case since its publication you need to describe the change and
> motivate it.
> 
> > Jose Labra
> > 
> > 
> > 
> >     On 12/30/2015 10:49 PM, Jose Emilio Labra Gayo wrote:
> >     > On Mon, Dec 28, 2015 at 6:05 PM, Peter F. Patel-Schneider
> >     > <pfpschneider@gmail.com> wrote:
> >     >
> >     >     I took a look at "Validating and Describing Linked Data Portals using
> >     >     Shapes", as submitted to the Semantic Web Journal in early December.
> >     >     The current version of the submitted paper is currently available at
> >     >     <http://www.semantic-web-journal.net/system/files/swj1260.pdf> but this
> >     >     version has unknown differences from the version that I looked at.
> >     >
> >     >     The submission extensively uses an example about measuring the World
> >     >     Wide Web's contribution to global development and human rights.  This
> >     >     example comes from a previous paper by J. E. L. Gayo, H. Farham, J. C.
> >     >     Fernández, and J. M. Á. Rodríguez, "Representing statistical indexes as
> >     >     linked data including metadata about their computation process".  The
> >     >     ShEx provided in the submission for the example has some significant
> >     >     unexplained differences from the example in the published paper.
> >     >
> >     >
> >     > The differences were introduced to better explain some features from ShEx.
> >     > The paper uses the WebIndex data as a use case to introduce those features
> >     > to the reader. The paper is self-contained in that sense because the problem
> >     > statement is described using the figure 2 diagram and the ShEx definitions
> >     > from section 3.
> >     >
> >     >     I was unable to determine the exact details of the example as there is
> >     >     no definition of the formalism used for the bulk of information about
> >     >     the example - Figure 2 in the submission.  Here is my reconstruction of
> >     >     the data model in Figure 2 plus the suborganization relationship and a
> >     >     little bit more from the earlier paper.
> >     >
> >     >
> >     > The details are given in section 3 using ShEx.
> >     >
> >     > From this email and another private email you sent me with your review, I
> >     > guess that one misunderstanding is that you considered this paper as a
> >     > comparison between ShEx and SHACL, while the paper was not written with that
> >     > purpose in mind.
> >     >
> >     > As you can read in the conclusions: "In general we consider that the
> >     > benefits of validation using either ShEx or SHACL can help the adoption of
> >     > RDF based solutions where the quality of data is an important issue."
> >     >
> >     > The purpose of the paper is to show that both ShEx and SHACL can be used to
> >     > validate linked data portals.
> >     >
> >     > The paper introduces the problem statement in an informal way in section 2,
> >     > then, it describes the dataset using ShEx in section 3 showing that a linked
> >     > data portal can be described in ShEx. Later on, it shows how those
> >     > definitions can be defined in SHACL and proposes that dataset as a benchmark.
> >     >
> >     >
> >     >     I am using a ShEx-like syntax to capture something like the form of the
> >     >     example, but this isn't necessarily ShEx, just a syntax to show the data
> >     >     model for the example.
> >     >
> >     > [...]
> >     >
> >     >
> >     >     country {
> >     >       rdf:type ( wf:Country ) [1,1],
> >     >       wf:iso2 xsd:string [1,1],
> >     >       wf:iso3 xsd:string [1,1],
> >     >       rdf:label xsd:string [1,1] }
> >     >
> >     >
> >     > Notice that in the paper we omitted the "rdf:type" declaration. Although
> >     > that declaration was in the original data model, we thought that it was
> >     > better to omit it in the new paper. The reason is precisely to show that we
> >     > can model data models which don't depend on "rdf:type" declarations.
> >     >
> >     > The paper explains that as:
> >     >
> >     > "It should be noted that rdf:type may or may not be included in shape
> >     > definitions. In the above example, we deliberately omitted the any rdf:type
> >     > requirement declaration, meaning that, in order to satisfy the :Country
> >     > shape, a node need only have those properties."
> >     >
> >     >
> >     >     The actual task to be performed is not described in the submission. It
> >     >     appears to me that the natural task to be done is to determine whether an
> >     >     RDF graph containing information about observations conforms to this data
> >     >     model, for some definition of conforms.
> >     >
> >     >
> >     > The task to be performed can be guessed from the context of the paper.
> >     >
> >     >
> >     >     This determination could be done in a number of ways in SHACL.  The
> >     >     approach taken in the submission is to use a set of mutually recursive
> >     >     SHACL shapes.  However, it seems to me that it would be better to instead
> >     >     use non-recursive SHACL shapes with scopes as follows:
> >     >
> >     >
> >     > [...]
> >     >
> >     >     The significant difference between the treatment here and the treatment
> >     >     in the submission is to use the type information as scopes, so that the
> >     >     shape of portions of the data is not mandated from its position as a
> >     >     value for some other portion of the data but is instead mandated by its
> >     >     type.
> >     >
> >     >
> >     > Yes, that's the most significant difference and that's why we omitted the
> >     > mandatory "rdf:type" declaration in the country shape. While having
> >     > "rdf:type" declarations in linked data portals for every node is probably a
> >     > good practice, it is not mandatory and validating linked data portals should
> >     > not depend on those declarations.
> >     >
> >     > In principle, a node in an RDF graph can have zero, one or more "rdf:type"
> >     > declarations, and the validation tool should be able to handle those
> >     > situations.
> >     >
> >     >
> >     >     The point here is mostly to show that a major example of recursive shapes
> >     >     does not appear to need recursive shapes, nor even shapes referring to
> >     >     other shapes at all.
> >     >
> >     >
> >     > What you have shown is that if every node has a discriminating "rdf:type"
> >     > declaration, then the validation can be done easily and without recursive
> >     > shapes by referring to the corresponding type instead of the shape.
> >     >
> >     >
> >     >     peter
> >     >
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > --
> >     > -- Jose Labra
> >     >
> > 
> > 
> > 
> > 
> > -- 
> > -- Jose Labra
> > 
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Friday, 1 January 2016 13:50:52 UTC