
Re: regression testing [was Re: summarizing proposed changes to charter]

From: Holger Knublauch <holger@topquadrant.com>
Date: Fri, 15 Aug 2014 12:15:05 +1000
Message-ID: <53ED6D29.40203@topquadrant.com>
To: public-rdf-shapes@w3.org
Hi David,

I sympathize with your desire to piggyback on a WG for this work, but I 
also see the risk of spreading the WG too thin. There are probably other 
ways within the W3C process to allow very small working groups to 
proceed with such deliverables, outside of the larger WGs for the "big 
picture" items. If not, then the W3C process probably has an unnecessary 
limitation. As I understand it, this is something that fewer than three 
people could write up; it could then be reviewed and signed off without 
impacting any other Semantic Web standards.

On the specific topic, we have had requests for reliably sorted Turtle 
files for years, especially so that people can compare versions of the 
same file under version control. TopBraid now includes this feature, 
also in the Free Edition, whenever you save .ttl files. Jeremy knows 
much more about the algorithm than I do, and it is certainly a useful 
feature. Whether this needs to be a "standard" or just an algorithm from 
an open source library is a question that I cannot answer.
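The idea can be sketched in a few lines (a minimal illustration, not 
TopBraid's actual algorithm; it assumes N-Triples input with no blank 
nodes, since deterministic blank-node naming is the hard part):

```python
def sorted_ntriples(doc):
    """Deterministic serialization: one triple per line, duplicates
    dropped, lines sorted lexicographically. Illustration only --
    assumes no blank nodes, whose labels are not stable across tools."""
    lines = {line.strip() for line in doc.splitlines() if line.strip()}
    return "\n".join(sorted(lines)) + "\n"

# Two dumps of the same graph that differ only in triple order:
a = '<urn:s> <urn:p> "2" .\n<urn:s> <urn:p> "1" .\n'
b = '<urn:s> <urn:p> "1" .\n<urn:s> <urn:p> "2" .\n'
assert sorted_ntriples(a) == sorted_ntriples(b)
```

Two such files then compare equal byte-for-byte, which is exactly what 
diff-based version control needs.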

Regards,
Holger


On 8/15/2014 3:26, David Booth wrote:
> On 08/14/2014 11:01 AM, Arnaud Le Hors wrote:
>> Hi David,
>>
>> Maybe I'm just missing something, but I have to admit I'm not convinced
>> by your argument that this is a necessity for validation. Rather, it
>> seems to me that you're just trying to piggyback on top of this WG to
>> have it do something that you think would be useful.
>
> In a sense I am, because as I mentioned before, this is a somewhat 
> different notion of validation than just looking at the shape of the 
> data.  I agree that it is not a necessity for *shape* validation, but 
> I do see it as important for validating, in a uniform way, that actual 
> data is equivalent to expected data.  But I understand that that is 
> tangential to the main use case that the group wants to focus on, so I 
> won't push it further.
>
>>
>> I understand you have good intentions but I'm sure you know that every
>> deliverable has a cost, even if optional, and I'd rather we don't add to
>> a charter that is already going to require a lot of work.
>
> Ok, I'll drop the request to include it.  Thanks for considering.
>
> David
>
>>
>> Regards.
>> -- 
>> Arnaud  Le Hors - Senior Technical Staff Member, Open Web Standards -
>> IBM Software Group
>>
>>
>> David Booth <david@dbooth.org> wrote on 08/13/2014 08:14:38 PM:
>>
>>  > From: David Booth <david@dbooth.org>
>>  > To: "Peter F. Patel-Schneider" <pfpschneider@gmail.com>, public-rdf-
>>  > shapes@w3.org
>>  > Date: 08/13/2014 08:15 PM
>>  > Subject: Re: regression testing [was Re: summarizing proposed
>>  > changes to charter]
>>  >
>>  > On 08/13/2014 10:04 PM, Peter F. Patel-Schneider wrote:
>>  > > OK, even though regression testing doesn't need 
>> canonicalization, it is
>>  > > useful to have RDF canonicalization to support a particular 
>> regression
>>  > > testing system.
>>  > >
>>  > > But how is the lack of a W3C-blessed method for RDF 
>> canonicalization
>>  > > hindering the development or deployment of this system?  How 
>> would a
>>  > > W3C-blessed method for RDF canonicalization help the development or
>>  > > deployment of this system?
>>  > >
>>  > > The system could use any canonical form whatsoever, after all, 
>> right?
>>  >
>>  > Yes and no.  The lack of a W3C-blessed method of RDF canonicalization
>>  > makes the comparison dependent on the particular canonicalization 
>> tool
>>  > that is used, which means that RDF data produced by different 
>> tools (or
>>  > different versions of the same tool) could not be reliably 
>> compared.  In
>>  > many scenarios this won't be an issue, but it will in some.
>>  >
>>  > But more importantly, the lack of a standard RDF canonicalization 
>> method
>>  > discourages the development of canonicalization tools. 
>> Canonicalization
>>  > has gotten little attention in RDF tools, in my view largely 
>> *because*
>>  > of the difficulty of doing it and the lack of a W3C-blessed 
>> method.  It
>>  > is non-trivial to implement, and if one's implementation would 
>> just end
>>  > up as one's own idiosyncratic canonicalization anyway, instead of 
>> being
>>  > an implementation of a standard, then there isn't as much 
>> motivation to
>>  > do it.  I think a W3C-blessed method would help a lot.
>>  >
>>  > Would you be okay with canonicalization being an OPTIONAL 
>> deliverable?
>>  >
>>  > David
>>  >
>>  > >
>>  > > peter
>>  > >
>>  > >
>>  > > On 08/13/2014 12:00 PM, David Booth wrote:
>>  > >> Hi Peter,
>>  > >>
>>  > >> On 08/13/2014 01:25 PM, Peter F. Patel-Schneider wrote:
>>  > >>> On 08/13/2014 08:45 AM, David Booth wrote:
>>  > >>>> Hi Peter,
>>  > >>>>
>>  > >>>> Here is my main use case for RDF canonicalization.
>>  > >>>>
>>  > >>>> The RDF Pipeline Framework http://rdfpipeline.org/ allows any 
>> kind of
>>  > >>>> data to
>>  > >>>> be manipulated in a data production pipeline -- not just RDF. 
>> The
>>  > >>>> Framework
>>  > >>>> has regression tests that, when run, are used to validate the
>>  > >>>> correctness of
>>  > >>>> the output of each node in a pipeline.  A test passes if the 
>> actual
>>  > >>>> node
>>  > >>>> output exactly matches the expected node output, *after*
>> filtering out
>>  > >>>> ignorable differences.  (For example, differences in dates 
>> and times
>>  > >>>> are
>>  > >>>> typically treated as ignorable -- they don't cause a test to 
>> fail.)
>>  > >>>> Since a
>>  > >>>> generic comparison tool is used (because the pipeline is
>> permitted to
>>  > >>>> carry
>>  > >>>> *any* kind of data), data serialization must be predictable and
>>  > >>>> canonical.
>>  > >>>> This works great for all kinds of data *except* RDF.
>>  > >>>
>>  > >>> Why?  You could just use RDF graph or dataset isomorphism.  
>> Those are
>>  > >>> already defined by W3C.  Well maybe you need to modify the graphs
>> first
>>  > >>> (e.g., to fudge dates and times), but you are already doing 
>> that for
>>  > >>> other data types.
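Peter's suggestion can be made concrete: RDF 1.1 defines graph 
isomorphism, and for tiny graphs it can even be checked by brute force 
over blank-node bijections. A toy sketch (production tools such as 
rdflib's `rdflib.compare` use hash-based canonicalization instead; 
triples here are modelled as plain tuples, an assumption of this 
illustration):

```python
from itertools import permutations

def isomorphic(g1, g2):
    """Toy RDF graph isomorphism: g1, g2 are sets of (s, p, o) tuples;
    blank nodes are strings starting with '_:'. Tries every bijection
    of blank nodes -- exponential, acceptable only for tiny graphs."""
    def bnodes(g):
        return sorted({t for s, _, o in g for t in (s, o)
                       if isinstance(t, str) and t.startswith("_:")})
    b1, b2 = bnodes(g1), bnodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for perm in permutations(b2):
        m = dict(zip(b1, perm))
        if {(m.get(s, s), p, m.get(o, o)) for s, p, o in g1} == g2:
            return True
    return False

g1 = {("_:x", "ex:knows", "_:y"), ("_:y", "ex:knows", "_:x")}
g2 = {("_:a", "ex:knows", "_:b"), ("_:b", "ex:knows", "_:a")}
assert isomorphic(g1, g2)
```

The permutation search is what makes naive isomorphism checking (and 
canonicalization) non-trivial in the presence of many blank nodes.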
>>  > >>>
>>  > >>>> If a canonical form of RDF were defined, then the exact same 
>> tools
>>  > >>>> that are
>>  > >>>> used to compare other kinds of data for differences could also
>> be used
>>  > >>>> for
>>  > >>>> comparing RDF.
>>  > >>>
>>  > >>> What are these tools?  Why should a tool to determine whether two
>>  > >>> strings are the same also work for determining whether two XML
>> documents
>>  > >>> are the same? Oh, maybe you think that you should first 
>> canonicalize
>>  > >>> everything and then do string comparison. However, you are 
>> deluding
>>  > >>> yourself that this is using the same tools for comparing
>> different kinds
>>  > >>> of data.  The tool that you are actually using to compare, 
>> e.g., XML
>>  > >>> documents, is the composition of the datatype-specific
>> canonicalizer and
>>  > >>> a string comparer.  There is no free lunch---you still need tools
>>  > >>> specific to each datatype.
>>  > >>
>>  > >> Not quite.  cmp is used for comparison of *serialized* data, and
>>  > >> canonicalization is part of the data *serialization* process -- 
>> not
>>  > >> the data
>>  > >> *comparison* process.   The serialization process must necessarily
>>  > >> understand
>>  > >> what kind of data it is -- there is no way around that -- so that
>> is the
>>  > >> logical place to do the canonicalization.  But the comparison 
>> process
>>  > >> does
>>  > >> *not* know what kind of data is being compared -- nor should it 
>> have
>>  > >> to.  It's
>>  > >> the serializer's job to produce a predictable, repeatable
>>  > >> serialization of the
>>  > >> data.  This works great and is trivially easy for everything 
>> *except*
>>  > >> RDF,
>>  > >> because of the instability of blank node labels. In RDF,
>> comparison is
>>  > >> embarrassingly difficult.
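The instability David describes is easy to reproduce: the same 
one-triple graph serialized by two tools that happened to pick different 
blank node labels never compares equal under cmp/diff. A naive 
first-occurrence relabelling (a sketch only; it is not a correct 
canonicalization for arbitrary graphs, since it depends on the textual 
order of triples) shows where a real canonical form has to do the work:

```python
import re

# The same one-triple graph, as two different tools might emit it:
doc_a = '_:b0 <http://xmlns.com/foaf/0.1/name> "Alice" .\n'
doc_b = '_:gen42 <http://xmlns.com/foaf/0.1/name> "Alice" .\n'
assert doc_a != doc_b  # byte-level comparison (cmp/diff) fails

def naive_relabel(doc):
    """Rename blank nodes in order of first appearance. Illustration
    only: a correct canonicalization must derive names from graph
    structure (e.g. neighbourhood hashing), not from textual order."""
    seen = {}
    return re.sub(r"_:[A-Za-z0-9]+",
                  lambda m: seen.setdefault(m.group(0), f"_:c{len(seen)}"),
                  doc)

assert naive_relabel(doc_a) == naive_relabel(doc_b)
```

After relabelling, the two serializations are byte-identical, so a 
generic comparison tool suffices; designing a relabelling that is stable 
for arbitrary graphs is the open problem the thread is about.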
>>  > >>
>>  > >> One could argue that my application could use some workaround 
>> to solve
>>  > >> this
>>  > >> problem, but that belies the fact that the root cause of the 
>> problem
>>  > >> is *not*
>>  > >> some weird thing my application is trying to do, it is a weakness
>> of RDF
>>  > >> itself -- a gap in the RDF specs.  This gap makes RDF harder to 
>> use
>>  > >> than it
>>  > >> needs to be.  If we want RDF to be adopted by a wider audience --
>> and I
>>  > >> certainly do -- then we need to fix obvious gaps like this.
>>  > >>
>>  > >> I hope that helps clarify why I see this as a problem.  Given the
>>  > >> above, would
>>  > >> you be okay with canonicalization being an OPTIONAL deliverable?
>>  > >>
>>  > >> Thanks,
>>  > >> David
>>  > >>
>>  > >>>
>>  > >>>> I consider this a major deficiency in RDF that really needs 
>> to be
>>  > >>>> corrected.
>>  > >>>> Any significant software effort uses regression tests to 
>> validate
>>  > >>>> changes.
>>  > >>>> But comparing two documents is currently complicated and 
>> difficult
>>  > >>>> with RDF
>>  > >>>> data.  RDF canonicalization would make it as easy as it is 
>> for every
>>  > >>>> other
>>  > >>>> data representation.
>>  > >>>
>>  > >>> How so?  Right now you can just use a tool that does RDF graph or
>>  > >>> dataset isomorphism.  Under your proposal you would need a 
>> tool that
>>  > >>> does RDF graph or dataset canonicalization, which is no easier 
>> than
>>  > >>> isomorphism checking. What's the difference?
>>  > >>>
>>  > >>>> I realize that this is a slightly different -- and more 
>> stringent --
>>  > >>>> notion of
>>  > >>>> RDF validation than just looking at the general shape of the 
>> data,
>>  > >>>> because it
>>  > >>>> requires that the data not only has the expected shape, but also
>>  > >>>> contains the
>>  > >>>> expected *values*.  Canonicalization would solve this problem.
>>  > >>>
>>  > >>> Canonicalization is a part of a solution to a problem that is 
>> already
>>  > >>> solved.
>>  > >>>
>>  > >>>
>>  > >>>> Given this motivation, would you be okay with RDF 
>> canonicalization
>>  > >>>> being
>>  > >>>> included as an OPTIONAL deliverable in the charter?
>>  > >>>>
>>  > >>>> Thanks,
>>  > >>>> David
>>  > >>>
>>  > >>>
>>  > >>> peter
>>  > >>>
>>  > >>>> On 08/13/2014 01:11 AM, Peter F. Patel-Schneider wrote:
>>  > >>>>> I'm still not getting this at all.
>>  > >>>>>
>>  > >>>>> How does canonicalization help me determine that I got the 
>> RDF data
>>  > >>>>> that
>>  > >>>>> I expected (exact or otherwise)?  For example, how does
>>  > >>>>> canonicalization
>>  > >>>>> help me determine that I got some RDF data that tells me the 
>> phone
>>  > >>>>> numbers of my friends?
>>  > >>>>>
>>  > >>>>> I just can't come up with a use case at all related to RDF data
>>  > >>>>> validation where canonicalization is relevant, except for
>> signing RDF
>>  > >>>>> graphs, and that can just as easily be done at the surface 
>> syntax
>>  > >>>>> level,
>>  > >>>>> and signing is quite tangential to the WG's purpose, I think.
>>  > >>>>>
>>  > >>>>> peter
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> On 08/12/2014 09:17 PM, David Booth wrote:
>>  > >>>>>> I think "canonicalization" would be a clearer term, as in:
>>  > >>>>>>
>>  > >>>>>>    "OPTIONAL - A Recommendation for canonical serialization
>>  > >>>>>>     of RDF graphs and RDF datasets."
>>  > >>>>>>
>>  > >>>>>> The purpose of this (to me) is to be able to validate that I
>> got the
>>  > >>>>>> *exact*
>>  > >>>>>> RDF data that I expected -- not merely the right classes and
>>  > >>>>>> predicates and
>>  > >>>>>> such.  Would you be okay with including this in the charter?
>>  > >>>>>>
>>  > >>>>>> Thanks,
>>  > >>>>>> David
>>  > >>>>>>
>>  > >>>>>> On 08/12/2014 10:00 PM, Peter F. Patel-Schneider wrote:
>>  > >>>>>>> I'm still not exactly sure just what normalization means 
>> in this
>>  > >>>>>>> context
>>  > >>>>>>> or what relationship it has to RDF validation.
>>  > >>>>>>>
>>  > >>>>>>> peter
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>>> On 08/12/2014 06:55 PM, David Booth wrote:
>>  > >>>>>>>> +1 for all except one item.
>>  > >>>>>>>>
>>  > >>>>>>>> I'd like to make one last ditch attempt to include graph
>>  > >>>>>>>> normalization
>>  > >>>>>>>> as an
>>  > >>>>>>>> OPTIONAL deliverable.  I expect the WG to treat it as low
>> priority,
>>  > >>>>>>>> and would
>>  > >>>>>>>> only anticipate a normalization document being produced if
>> someone
>>  > >>>>>>>> takes the
>>  > >>>>>>>> personal initiative to draft it.  I do not see any 
>> significant
>>  > >>>>>>>> harm in
>>  > >>>>>>>> including it in the charter on that basis, but I do see a
>> benefit,
>>  > >>>>>>>> because if
>>  > >>>>>>>> the WG did somehow get to it then it would be damn nice to 
>> have, so
>>  > >>>>>>>> that
>>  > >>>>>>>> we could
>>  > >>>>>>>> finally validate RDF data by having a standard way to 
>> compare
>>  > >>>>>>>> two RDF
>>  > >>>>>>>> documents for equality, like we can routinely do with every
>> other
>>  > >>>>>>>> data
>>  > >>>>>>>> representation.
>>  > >>>>>>>>
>>  > >>>>>>>> Peter, would that be okay with you, to include graph
>>  > >>>>>>>> normalization as
>>  > >>>>>>>> OPTIONAL
>>  > >>>>>>>> that way?
>>  > >>>>>>>>
>>  > >>>>>>>> Thanks,
>>  > >>>>>>>> David
>>  > >>>>>>>>
>>  > >>>>>>>> On 08/12/2014 08:55 PM, Eric Prud'hommeaux wrote:
>>  > >>>>>>>>> Hi all, we can have a face-to-face at the W3C Technical
>> Plenary in
>>  > >>>>>>>>> November if we can quickly endorse a good-enough charter.
>>   As it
>>  > >>>>>>>>> stands now, it isn't clear that the group will be able 
>> to reach
>>  > >>>>>>>>> consensus within the Working Group, let alone get 
>> through the
>>  > >>>>>>>>> member
>>  > >>>>>>>>> review without objection.
>>  > >>>>>>>>>
>>  > >>>>>>>>> Please review the proposals that I've culled from the 
>> list.  I
>>  > >>>>>>>>> encourage compromise on all our parts and we'll have to
>> suppress
>>  > >>>>>>>>> the
>>  > >>>>>>>>> desire to wordsmith. (Given the 3-month evaluation period,
>>  > >>>>>>>>> wordsmithing won't change much anyways.)
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> separate semantics:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>> Message-ID:
>>  > >>>>>>>>> <53E2AFBD.9050102@gmail.com>
>>  > >>>>>>>>>      A syntax and semantics for shapes specifying how to
>> construct
>>  > >>>>>>>>> shape
>>  > >>>>>>>>> expressions and how shape expressions are evaluated 
>> against RDF
>>  > >>>>>>>>> graphs.
>>  > >>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>>  > >>>>>>>>> 
>> <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>  > >>>>>>>>>      defining the (direct) semantic meaning of 
>> shapes and
>>  > >>>>>>>>> defining the
>>  > >>>>>>>>> associated validation process.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    opposition: Holger Knublauch
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: include, noting that if SPARQL is
>> judged
>>  > >>>>>>>>> to be
>>  > >>>>>>>>> useful for the semantics, there's nothing preventing us 
>> from
>>  > >>>>>>>>> using it.
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> make graph normalization optional or use-case specific:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>> Message-ID:
>>  > >>>>>>>>> <53E2AFBD.9050102@gmail.com>
>>  > >>>>>>>>>      3 OPTIONAL A specification of how shape verification
>>  > >>>>>>>>> interacts
>>  > >>>>>>>>> with
>>  > >>>>>>>>> inference.
>>  > >>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>>  > >>>>>>>>> <D954B744-05CD-4E5C-8FC2-C08A9A99BA9F@syapse.com>
>>  > >>>>>>>>>      the WG will consider whether it is necessary, 
>> practical or
>>  > >>>>>>>>> desirable
>>  > >>>>>>>>> to normalize a graph...
>>  > >>>>>>>>>      A graph normalization method, suitable for the use 
>> cases
>>  > >>>>>>>>> determined by
>>  > >>>>>>>>> the group....
>>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>  > >>>>>>>>> <53E28D07.9000804@dbooth.org>
>>  > >>>>>>>>>      OPTIONAL - A Recommendation for
>>  > >>>>>>>>> normalization/canonicalization
>>  > >>>>>>>>> of RDF
>>  > >>>>>>>>> graphs and RDF datasets that are serialized in N-Triples 
>> and
>>  > >>>>>>>>> N-Quads.
>>  > >>>>>>>>> opposition - don't do it at all:
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>> Message-ID:
>>  > >>>>>>>>> <53E3A4CB.4040200@gmail.com>
>>  > >>>>>>>>>      the WG should not be working on this.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: withdrawn, to go to new 
>> light-weight,
>>  > >>>>>>>>> focused
>>  > >>>>>>>>> WG,
>>  > >>>>>>>>> removing this text:
>>  > >>>>>>>>>    [[
>>  > >>>>>>>>>    The WG MAY produce a Recommendation for graph 
>> normalization.
>>  > >>>>>>>>>    ]]
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> mandatory human-facing language:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>>  > >>>>>>>>> 
>> <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>  > >>>>>>>>>      ShExC mandatory, but potentially as a Note.
>>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>  > >>>>>>>>> <53E28D07.9000804@dbooth.org>
>>  > >>>>>>>>>      In Section 4 (Deliverables), change "OPTIONAL - 
>> Compact,
>>  > >>>>>>>>> human-readable
>>  > >>>>>>>>> syntax" to "Compact, human-readable syntax", i.e., make it
>>  > >>>>>>>>> required.
>>  > >>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>>  > >>>>>>>>> <54AA894F-F4B4-4877-8806-EB85FB5A42E5@syapse.com>
>>  > >>>>>>>>>
>>  > >>>>>>>>>    opposition - make it OPTIONAL
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>> Message-ID:
>>  > >>>>>>>>> <53E2AFBD.9050102@gmail.com>
>>  > >>>>>>>>>      OPTIONAL A compact, human-readable syntax for 
>> expressing
>>  > >>>>>>>>> shapes.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: keep as OPTIONAL, not mentioning 
>> ShExC,
>>  > >>>>>>>>> but
>>  > >>>>>>>>> clarifying that it's different from the RDF syntax.
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> report formats:
>>  > >>>>>>>>>    Dimitris Kontokostas 
>> <kontokostas@informatik.uni-leipzig.de>
>>  > >>>>>>>>>      provide flexible validation execution plans that range
>> from:
>>  > >>>>>>>>>        Success / fail
>>  > >>>>>>>>>        Success / fail per constraint
>>  > >>>>>>>>>        Fails with error counts
>>  > >>>>>>>>>        Individual resources that fail per constraint
>>  > >>>>>>>>>        And enriched failed resources with annotations
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: no change, noting that no one 
>> seconded
>>  > >>>>>>>>> this
>>  > >>>>>>>>> proposal.
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> test suite/validator:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    Dimitris Kontokostas 
>> <kontokostas@informatik.uni-leipzig.de>
>>  > >>>>>>>>>      Validation results are very important for the 
>> progress of
>>  > >>>>>>>>> this
>>  > >>>>>>>>> WG and
>>  > >>>>>>>>> should be a standalone deliverable.
>>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>  > >>>>>>>>> <53E28D07.9000804@dbooth.org>
>>  > >>>>>>>>>      Test Suite, to help ensure interoperability and 
>> correct
>>  > >>>>>>>>> implementation.
>>  > >>>>>>>>> The group will choose the location of this deliverable, 
>> such as
>>  > >>>>>>>>> a git
>>  > >>>>>>>>> repository.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: leave out of the charter, as WGs usually
>>  > >>>>>>>>> choose to
>>  > >>>>>>>>> do this
>>  > >>>>>>>>> anyways and it has no impact on IP commitments.
>>  > >>>>>>>>>
>>  > >>>>>>>>
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>
>>  > >>>
>>  > >>>
>>  > >>>
>>  > >
>>  > >
>>  > >
>>  >
>
Received on Friday, 15 August 2014 02:16:46 UTC