Re: regression testing [was Re: summarizing proposed changes to charter] from David Booth on 2014-08-14 (public-rdf-shapes@w3.org from August 2014)

From: David Booth <david@dbooth.org>
Date: Thu, 14 Aug 2014 13:26:25 -0400
To: Arnaud Le Hors <lehors@us.ibm.com>
CC: public-rdf-shapes@w3.org
Message-ID: <53ECF141.6030900@dbooth.org>
On 08/14/2014 11:01 AM, Arnaud Le Hors wrote:
> Hi David,
>
> Maybe I'm just missing something but I have to admit not to be convinced
> by your argument that this is a necessity for validation. Rather, it
> seems to me that you're just trying to piggyback on top of this WG to
> have it do something that you think would be useful.

In a sense I am, because as I mentioned before, this is a somewhat 
different notion of validation than just looking at the shape of the 
data.  I agree that it is not a necessity for *shape* validation, but I 
do see it as important for validating, in a uniform way, that actual 
data is equivalent to expected data.  But I understand that that is 
tangential to the main use case that the group wants to focus on, so I 
won't push it further.
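
To make the distinction concrete, here is a minimal Python sketch of the comparison problem described in the quoted thread below. Everything in it is an assumption made for illustration: the example.org triples, the function names canonical_lines and same_serialization, and the use of plain line sorting as a stand-in for a real canonicalizer. It is not code from the RDF Pipeline Framework. It only shows that sorted N-Triples can be compared byte-for-byte when there are no blank nodes, while a single arbitrary blank node label is enough to make two isomorphic graphs look different.

```python
def canonical_lines(ntriples):
    """Return the non-empty lines of an N-Triples document, de-duplicated and sorted.

    This is only a stand-in for canonicalization: it fixes triple order,
    but does nothing about blank node labels.
    """
    lines = {line.strip() for line in ntriples.splitlines() if line.strip()}
    return sorted(lines)


def same_serialization(expected, actual):
    """Generic comparison step: it knows nothing about RDF, only about lines."""
    return canonical_lines(expected) == canonical_lines(actual)


# Two serializations of the same ground graph (no blank nodes), in different order.
expected_ground = """
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://example.org/p> "2" .
"""
actual_ground = """
<http://example.org/b> <http://example.org/p> "2" .
<http://example.org/a> <http://example.org/p> "1" .
"""

# Two isomorphic one-triple graphs that differ only in the blank node label.
expected_bnode = '_:x <http://example.org/p> "1" .\n'
actual_bnode = '_:genid42 <http://example.org/p> "1" .\n'

ground_match = same_serialization(expected_ground, actual_ground)
bnode_match = same_serialization(expected_bnode, actual_bnode)

print(ground_match)  # True: sorting is enough when there are no blank nodes
print(bnode_match)   # False: the labels differ although the graphs are isomorphic
```

A real canonicalizer would have to relabel blank nodes deterministically before the generic comparison step, and that relabeling is the non-trivial part.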

>
> I understand you have good intentions but I'm sure you know that every
> deliverable has a cost, even if optional, and I'd rather we don't add to
> a charter that is already going to require a lot of work.

Ok, I'll drop the request to include it.  Thanks for considering.

David

>
> Regards.
> --
> Arnaud  Le Hors - Senior Technical Staff Member, Open Web Standards -
> IBM Software Group
>
>
> David Booth <david@dbooth.org> wrote on 08/13/2014 08:14:38 PM:
>
>  > From: David Booth <david@dbooth.org>
>  > To: "Peter F. Patel-Schneider" <pfpschneider@gmail.com>, public-rdf-
>  > shapes@w3.org
>  > Date: 08/13/2014 08:15 PM
>  > Subject: Re: regression testing [was Re: summarizing proposed
>  > changes to charter]
>  >
>  > On 08/13/2014 10:04 PM, Peter F. Patel-Schneider wrote:
>  > > OK, even though regression testing doesn't need canonicalization, it is
>  > > useful to have RDF canonicalization to support a particular regression
>  > > testing system.
>  > >
>  > > But how is the lack of a W3C-blessed method for RDF canonicalization
>  > > hindering the development or deployment of this system?  How would a
>  > > W3C-blessed method for RDF canonicalization help the development or
>  > > deployment of this system?
>  > >
>  > > The system could use any canonical form whatsoever, after all, right?
>  >
>  > Yes and no.  The lack of a W3C-blessed method of RDF canonicalization
>  > makes the comparison dependent on the particular canonicalization tool
>  > that is used, which means that RDF data produced by different tools (or
>  > different versions of the same tool) could not be reliably compared.  In
>  > many scenarios this won't be an issue, but it will in some.
>  >
>  > But more importantly, the lack of a standard RDF canonicalization method
>  > discourages the development of canonicalization tools.  Canonicalization
>  > has gotten little attention in RDF tools, in my view largely *because*
>  > of the difficulty of doing it and the lack of a W3C-blessed method.  It
>  > is non-trivial to implement, and if one's implementation would just end
>  > up as one's own idiosyncratic canonicalization anyway, instead of being
>  > an implementation of a standard, then there isn't as much motivation to
>  > do it.  I think a W3C-blessed method would help a lot.
>  >
>  > Would you be okay with canonicalization being an OPTIONAL deliverable?
>  >
>  > David
>  >
>  > >
>  > > peter
>  > >
>  > >
>  > > On 08/13/2014 12:00 PM, David Booth wrote:
>  > >> Hi Peter,
>  > >>
>  > >> On 08/13/2014 01:25 PM, Peter F. Patel-Schneider wrote:
>  > >>> On 08/13/2014 08:45 AM, David Booth wrote:
>  > >>>> Hi Peter,
>  > >>>>
>  > >>>> Here is my main use case for RDF canonicalization.
>  > >>>>
>  > >>>> The RDF Pipeline Framework http://rdfpipeline.org/ allows any kind of
>  > >>>> data to be manipulated in a data production pipeline -- not just RDF.
>  > >>>> The Framework has regression tests that, when run, are used to validate
>  > >>>> the correctness of the output of each node in a pipeline.  A test
>  > >>>> passes if the actual node output exactly matches the expected node
>  > >>>> output, *after* filtering out ignorable differences.  (For example,
>  > >>>> differences in dates and times are typically treated as ignorable --
>  > >>>> they don't cause a test to fail.)  Since a generic comparison tool is
>  > >>>> used (because the pipeline is permitted to carry *any* kind of data),
>  > >>>> data serialization must be predictable and canonical.  This works
>  > >>>> great for all kinds of data *except* RDF.
>  > >>>
>  > >>> Why?  You could just use RDF graph or dataset isomorphism.  Those are
>  > >>> already defined by W3C.  Well maybe you need to modify the graphs first
>  > >>> (e.g., to fudge dates and times), but you are already doing that for
>  > >>> other data types.
>  > >>>
>  > >>>> If a canonical form of RDF were defined, then the exact same tools
>  > >>>> that are used to compare other kinds of data for differences could
>  > >>>> also be used for comparing RDF.
>  > >>>
>  > >>> What are these tools?  Why should a tool to determine whether two
>  > >>> strings are the same also work for determining whether two XML
>  > >>> documents are the same?  Oh, maybe you think that you should first
>  > >>> canonicalize everything and then do string comparison.  However, you
>  > >>> are deluding yourself that this is using the same tools for comparing
>  > >>> different kinds of data.  The tool that you are actually using to
>  > >>> compare, e.g., XML documents, is the composition of the
>  > >>> datatype-specific canonicalizer and a string comparer.  There is no
>  > >>> free lunch---you still need tools specific to each datatype.
>  > >>
>  > >> Not quite.  cmp is used for comparison of *serialized* data, and
>  > >> canonicalization is part of the data *serialization* process -- not
>  > >> the data *comparison* process.  The serialization process must
>  > >> necessarily understand what kind of data it is -- there is no way
>  > >> around that -- so that is the logical place to do the
>  > >> canonicalization.  But the comparison process does *not* know what
>  > >> kind of data is being compared -- nor should it have to.  It's the
>  > >> serializer's job to produce a predictable, repeatable serialization
>  > >> of the data.  This works great and is trivially easy for everything
>  > >> *except* RDF, because of the instability of blank node labels.  In
>  > >> RDF, comparison is embarrassingly difficult.
>  > >>
>  > >> One could argue that my application could use some workaround to solve
>  > >> this problem, but that belies the fact that the root cause of the
>  > >> problem is *not* some weird thing my application is trying to do, it is
>  > >> a weakness of RDF itself -- a gap in the RDF specs.  This gap makes RDF
>  > >> harder to use than it needs to be.  If we want RDF to be adopted by a
>  > >> wider audience -- and I certainly do -- then we need to fix obvious
>  > >> gaps like this.
>  > >>
>  > >> I hope that helps clarify why I see this as a problem.  Given the
>  > >> above, would
>  > >> you be okay with canonicalization being an OPTIONAL deliverable?
>  > >>
>  > >> Thanks,
>  > >> David
>  > >>
>  > >>>
>  > >>>> I consider this a major deficiency in RDF that really needs to be
>  > >>>> corrected.
>  > >>>> Any significant software effort uses regression tests to validate
>  > >>>> changes.
>  > >>>> But comparing two documents is currently complicated and difficult
>  > >>>> with RDF
>  > >>>> data.  RDF canonicalization would make it as easy as it is for every
>  > >>>> other
>  > >>>> data representation.
>  > >>>
>  > >>> How so?  Right now you can just use a tool that does RDF graph or
>  > >>> dataset isomorphism.  Under your proposal you would need a tool that
>  > >>> does RDF graph or dataset canonicalization, which is no easier than
>  > >>> isomorphism checking. What's the difference?
>  > >>>
>  > >>>> I realize that this is a slightly different -- and more stringent --
>  > >>>> notion of
>  > >>>> RDF validation than just looking at the general shape of the data,
>  > >>>> because it
>  > >>>> requires that the data not only has the expected shape, but also
>  > >>>> contains the
>  > >>>> expected *values*.  Canonicalization would solve this problem.
>  > >>>
>  > >>> Canonicalization is a part of a solution to a problem that is already
>  > >>> solved.
>  > >>>
>  > >>>
>  > >>>> Given this motivation, would you be okay with RDF canonicalization
>  > >>>> being
>  > >>>> included as an OPTIONAL deliverable in the charter?
>  > >>>>
>  > >>>> Thanks,
>  > >>>> David
>  > >>>
>  > >>>
>  > >>> peter
>  > >>>
>  > >>>> On 08/13/2014 01:11 AM, Peter F. Patel-Schneider wrote:
>  > >>>>> I'm still not getting this at all.
>  > >>>>>
>  > >>>>> How does canonicalization help me determine that I got the RDF data
>  > >>>>> that I expected (exact or otherwise)?  For example, how does
>  > >>>>> canonicalization help me determine that I got some RDF data that
>  > >>>>> tells me the phone numbers of my friends?
>  > >>>>>
>  > >>>>> I just can't come up with a use case at all related to RDF data
>  > >>>>> validation where canonicalization is relevant, except for signing RDF
>  > >>>>> graphs, and that can just as easily be done at the surface syntax
>  > >>>>> level, and signing is quite tangential to the WG's purpose, I think.
>  > >>>>>
>  > >>>>> peter
>  > >>>>>
>  > >>>>>
>  > >>>>> On 08/12/2014 09:17 PM, David Booth wrote:
>  > >>>>>> I think "canonicalization" would be a clearer term, as in:
>  > >>>>>>
>  > >>>>>>    "OPTIONAL - A Recommendation for canonical serialization
>  > >>>>>>     of RDF graphs and RDF datasets."
>  > >>>>>>
>  > >>>>>> The purpose of this (to me) is to be able to validate that I got the
>  > >>>>>> *exact* RDF data that I expected -- not merely the right classes and
>  > >>>>>> predicates and such.  Would you be okay with including this in the
>  > >>>>>> charter?
>  > >>>>>>
>  > >>>>>> Thanks,
>  > >>>>>> David
>  > >>>>>>
>  > >>>>>> On 08/12/2014 10:00 PM, Peter F. Patel-Schneider wrote:
>  > >>>>>>> I'm still not exactly sure just what normalization means in this
>  > >>>>>>> context
>  > >>>>>>> or what relationship it has to RDF validation.
>  > >>>>>>>
>  > >>>>>>> peter
>  > >>>>>>>
>  > >>>>>>>
>  > >>>>>>> On 08/12/2014 06:55 PM, David Booth wrote:
>  > >>>>>>>> +1 for all except one item.
>  > >>>>>>>>
>  > >>>>>>>> I'd like to make one last-ditch attempt to include graph
>  > >>>>>>>> normalization as an OPTIONAL deliverable.  I expect the WG to treat
>  > >>>>>>>> it as low priority, and would only anticipate a normalization
>  > >>>>>>>> document being produced if someone takes the personal initiative to
>  > >>>>>>>> draft it.  I do not see any significant harm in including it in the
>  > >>>>>>>> charter on that basis, but I do see a benefit, because if the WG did
>  > >>>>>>>> somehow get to it then it would be damn nice to have, so that we
>  > >>>>>>>> could finally validate RDF data by having a standard way to compare
>  > >>>>>>>> two RDF documents for equality, like we can routinely do with every
>  > >>>>>>>> other data representation.
>  > >>>>>>>>
>  > >>>>>>>> Peter, would that be okay with you, to include graph normalization
>  > >>>>>>>> as OPTIONAL that way?
>  > >>>>>>>>
>  > >>>>>>>> Thanks,
>  > >>>>>>>> David
>  > >>>>>>>>
>  > >>>>>>>> On 08/12/2014 08:55 PM, Eric Prud'hommeaux wrote:
>  > >>>>>>>>> Hi all, we can have a face-to-face at the W3C Technical Plenary in
>  > >>>>>>>>> November if we can quickly endorse a good-enough charter.  As it
>  > >>>>>>>>> stands now, it isn't clear that the group will be able to reach
>  > >>>>>>>>> consensus within the Working Group, let alone get through the
>  > >>>>>>>>> member review without objection.
>  > >>>>>>>>>
>  > >>>>>>>>> Please review the proposals that I've culled from the list.  I
>  > >>>>>>>>> encourage compromise on all our parts and we'll have to suppress
>  > >>>>>>>>> the desire to wordsmith. (Given the 3-month evaluation period,
>  > >>>>>>>>> wordsmithing won't change much anyways.)
>  > >>>>>>>>>
>  > >>>>>>>>>
>  > >>>>>>>>> separate semantics:
>  > >>>>>>>>>
>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>  > >>>>>>>>> Message-ID:
>  > >>>>>>>>> <53E2AFBD.9050102@gmail.com>
>  > >>>>>>>>>      A syntax and semantics for shapes specifying how to construct
>  > >>>>>>>>>      shape expressions and how shape expressions are evaluated
>  > >>>>>>>>>      against RDF graphs.
>  > >>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>  > >>>>>>>>> <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>  > >>>>>>>>>      defining the (direct) semantics meaning of shapes and
>  > >>>>>>>>>      defining the associated validation process.
>  > >>>>>>>>>
>  > >>>>>>>>>    opposition: Holger Knublauch
>  > >>>>>>>>>
>  > >>>>>>>>>    proposed resolution: include, noting that if SPARQL is judged
>  > >>>>>>>>>    to be useful for the semantics, there's nothing preventing us
>  > >>>>>>>>>    from using it.
>  > >>>>>>>>>
>  > >>>>>>>>>
>  > >>>>>>>>> make graph normalization optional or use-case specific:
>  > >>>>>>>>>
>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>  > >>>>>>>>> Message-ID:
>  > >>>>>>>>> <53E2AFBD.9050102@gmail.com>
>  > >>>>>>>>>      3 OPTIONAL A specification of how shape verification
>  > >>>>>>>>> interacts
>  > >>>>>>>>> with
>  > >>>>>>>>> inference.
>  > >>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>  > >>>>>>>>> <D954B744-05CD-4E5C-8FC2-C08A9A99BA9F@syapse.com>
>  > >>>>>>>>>      the WG will consider whether it is necessary, practical or
>  > >>>>>>>>>      desirable to normalize a graph...
>  > >>>>>>>>>      A graph normalization method, suitable for  the use cases
>  > >>>>>>>>> determined by
>  > >>>>>>>>> the group....
>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>  > >>>>>>>>> <53E28D07.9000804@dbooth.org>
>  > >>>>>>>>>      OPTIONAL - A Recommendation for
>  > >>>>>>>>> normalization/canonicalization
>  > >>>>>>>>> of RDF
>  > >>>>>>>>> graphs and RDF datasets that are serialized in N-Triples and
>  > >>>>>>>>> N-Quads.
>  > >>>>>>>>> opposition - don't do it at all:
>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>  > >>>>>>>>> Message-ID:
>  > >>>>>>>>> <53E3A4CB.4040200@gmail.com>
>  > >>>>>>>>>      the WG should not be working on this.
>  > >>>>>>>>>
>  > >>>>>>>>>    proposed resolution: withdrawn, to go to new light-weight,
>  > >>>>>>>>> focused
>  > >>>>>>>>> WG,
>  > >>>>>>>>> removing this text:
>  > >>>>>>>>>    [[
>  > >>>>>>>>>    The WG MAY produce a Recommendation for graph normalization.
>  > >>>>>>>>>    ]]
>  > >>>>>>>>>
>  > >>>>>>>>>
>  > >>>>>>>>> mandatory human-facing language:
>  > >>>>>>>>>
>  > >>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>  > >>>>>>>>> <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>  > >>>>>>>>>      ShExC mandatory, but potentially as a Note.
>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>  > >>>>>>>>> <53E28D07.9000804@dbooth.org>
>  > >>>>>>>>>      In Section 4 (Deliverables), change "OPTIONAL - Compact,
>  > >>>>>>>>> human-readable
>  > >>>>>>>>> syntax" to "Compact, human-readable syntax", i.e., make it
>  > >>>>>>>>> required.
>  > >>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>  > >>>>>>>>> <54AA894F-F4B4-4877-8806-EB85FB5A42E5@syapse.com>
>  > >>>>>>>>>
>  > >>>>>>>>>    opposition - make it OPTIONAL
>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>  > >>>>>>>>> Message-ID:
>  > >>>>>>>>> <53E2AFBD.9050102@gmail.com>
>  > >>>>>>>>>      OPTIONAL A compact, human-readable syntax for expressing
>  > >>>>>>>>> shapes.
>  > >>>>>>>>>
>  > >>>>>>>>>    proposed resolution: keep as OPTIONAL, not mentioning ShExC,
>  > >>>>>>>>> but
>  > >>>>>>>>> clarifying that it's different from the RDF syntax.
>  > >>>>>>>>>
>  > >>>>>>>>>
>  > >>>>>>>>> report formats:
>  > >>>>>>>>>    Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>  > >>>>>>>>>      provide flexible validation execution plans that range from:
>  > >>>>>>>>>        Success / fail
>  > >>>>>>>>>        Success / fail per constraint
>  > >>>>>>>>>        Fails with error counts
>  > >>>>>>>>>        Individual resources that fail per constraint
>  > >>>>>>>>>        And enriched failed resources with annotations
>  > >>>>>>>>>
>  > >>>>>>>>>    proposed resolution: no change, noting that no one seconded
>  > >>>>>>>>> this
>  > >>>>>>>>> proposal.
>  > >>>>>>>>>
>  > >>>>>>>>>
>  > >>>>>>>>> test suite/validator:
>  > >>>>>>>>>
>  > >>>>>>>>>    Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>  > >>>>>>>>>      Validation results are very important for the progress of
>  > >>>>>>>>> this
>  > >>>>>>>>> WG and
>  > >>>>>>>>> should be a standalone deliverable.
>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>  > >>>>>>>>> <53E28D07.9000804@dbooth.org>
>  > >>>>>>>>>      Test Suite, to help ensure interoperability and correct
>  > >>>>>>>>>      implementation. The group will choose the location of this
>  > >>>>>>>>>      deliverable, such as a git repository.
>  > >>>>>>>>>
>  > >>>>>>>>>    proposed resolution: leave out of charter, as WGs usually
>  > >>>>>>>>>    choose to do this anyways and it has no impact on IP
>  > >>>>>>>>>    commitments.
>  > >>>>>>>>>
>  > >>>>>>>>
>  > >>>>>>>
>  > >>>>>>>
>  > >>>>>>>
>  > >>>>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>>
>  > >>>>
>  > >>>
>  > >>>
>  > >>>
>  > >
>  > >
>  > >
>  >
Received on Thursday, 14 August 2014 17:26:54 UTC