regression testing [was Re: summarizing proposed changes to charter] from Peter F. Patel-Schneider on 2014-08-14 (public-rdf-shapes@w3.org from August 2014)

From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
Date: Wed, 13 Aug 2014 19:04:39 -0700
To: David Booth <david@dbooth.org>, public-rdf-shapes@w3.org
Message-ID: <53EC1937.4010605@gmail.com>
OK, even though regression testing doesn't need canonicalization, it is useful 
to have RDF canonicalization to support a particular regression testing system.

But how is the lack of a W3C-blessed method for RDF canonicalization hindering 
the development or deployment of this system?  How would a W3C-blessed method 
for RDF canonicalization help the development or deployment of this system?

The system could use any canonical form whatsoever, after all, right?

peter
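
To make the "any canonical form whatsoever" point concrete, here is a minimal sketch of one such form. It is not a W3C-defined method and not the RDF Pipeline Framework's code. The function name canonicalize, the _:cN relabeling scheme, and the example.org IRIs are all hypothetical. It assumes N-Triples input with one triple per line, consistent spacing, alphanumeric blank node labels, and no "_:" inside IRIs or literals. It tries every relabeling of the blank nodes and keeps the lexicographically smallest sorted serialization, so it takes factorial time and is only usable on tiny graphs.

```python
import itertools
import re

# Assumption: blank node labels are purely alphanumeric and the sequence "_:"
# never occurs inside an IRI or a literal in the input.
BNODE = re.compile(r"_:[A-Za-z0-9]+")


def canonicalize(ntriples: str) -> str:
    """Return a canonical N-Triples string for a small graph (brute force)."""
    # Drop blank lines and duplicate triples, then fix an order for the lines.
    lines = sorted({ln.strip() for ln in ntriples.splitlines() if ln.strip()})
    labels = sorted({label for ln in lines for label in BNODE.findall(ln)})

    best = None
    # Try every assignment of the new labels _:c0, _:c1, ... to the old labels.
    for perm in itertools.permutations(range(len(labels))):
        mapping = {old: f"_:c{i}" for old, i in zip(labels, perm)}
        relabeled = sorted(
            BNODE.sub(lambda m: mapping[m.group(0)], ln) for ln in lines
        )
        candidate = "\n".join(relabeled)
        # Keep the lexicographically smallest serialization seen so far.
        if best is None or candidate < best:
            best = candidate
    return best or ""


# Two serializations of the same graph with different blank node labels.
g1 = '_:a <http://example.org/knows> _:b .\n_:b <http://example.org/name> "Bob" .'
g2 = '_:x <http://example.org/name> "Bob" .\n_:y <http://example.org/knows> _:x .'
# A different graph: here the same node both knows someone and is named Bob.
g3 = '_:a <http://example.org/knows> _:b .\n_:a <http://example.org/name> "Bob" .'

same = canonicalize(g1) == canonicalize(g2)
different = canonicalize(g1) != canonicalize(g3)
```

Under these assumptions, two inputs produce the same output exactly when the graphs are isomorphic, so the output can be written to a file and handed to a byte-level comparison such as cmp. That also illustrates the point made further down the thread that canonicalization is no easier than isomorphism checking.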


On 08/13/2014 12:00 PM, David Booth wrote:
> Hi Peter,
>
> On 08/13/2014 01:25 PM, Peter F. Patel-Schneider wrote:
>> On 08/13/2014 08:45 AM, David Booth wrote:
>>> Hi Peter,
>>>
>>> Here is my main use case for RDF canonicalization.
>>>
>>> The RDF Pipeline Framework http://rdfpipeline.org/ allows any kind of data to
>>> be manipulated in a data production pipeline -- not just RDF. The Framework
>>> has regression tests that, when run, are used to validate the correctness of
>>> the output of each node in a pipeline.  A test passes if the actual node
>>> output exactly matches the expected node output, *after* filtering out
>>> ignorable differences.  (For example, differences in dates and times are
>>> typically treated as ignorable -- they don't cause a test to fail.)  Since a
>>> generic comparison tool is used (because the pipeline is permitted to carry
>>> *any* kind of data), data serialization must be predictable and canonical.
>>> This works great for all kinds of data *except* RDF.
>>
>> Why?  You could just use RDF graph or dataset isomorphism.  Those are
>> already defined by W3C.  Well maybe you need to modify the graphs first
>> (e.g., to fudge dates and times), but you are already doing that for
>> other data types.
>>
>>> If a canonical form of RDF were defined, then the exact same tools that are
>>> used to compare other kinds of data for differences could also be used for
>>> comparing RDF.
>>
>> What are these tools?  Why should a tool to determine whether two
>> strings are the same also work for determining whether two XML documents
>> are the same?  Oh, maybe you think that you should first canonicalize
>> everything and then do string comparison.  However, you are deluding
>> yourself that this is using the same tools for comparing different kinds
>> of data.  The tool that you are actually using to compare, e.g., XML
>> documents, is the composition of the datatype-specific canonicalizer and
>> a string comparer.  There is no free lunch---you still need tools
>> specific to each datatype.
>
> Not quite.  cmp is used for comparison of *serialized* data, and
> canonicalization is part of the data *serialization* process -- not the data
> *comparison* process.   The serialization process must necessarily understand
> what kind of data it is -- there is no way around that -- so that is the
> logical place to do the canonicalization.  But the comparison process does
> *not* know what kind of data is being compared -- nor should it have to.  It's
> the serializer's job to produce a predictable, repeatable serialization of the
> data.  This works great and is trivially easy for everything *except* RDF,
> because of the instability of blank node labels.  In RDF, comparison is
> embarrassingly difficult.
>
> One could argue that my application could use some workaround to solve this
> problem, but that belies the fact that the root cause of the problem is *not*
> some weird thing my application is trying to do, it is a weakness of RDF
> itself -- a gap in the RDF specs.  This gap makes RDF harder to use than it
> needs to be.  If we want RDF to be adopted by a wider audience -- and I
> certainly do -- then we need to fix obvious gaps like this.
>
> I hope that helps clarify why I see this as a problem.  Given the above, would
> you be okay with canonicalization being an OPTIONAL deliverable?
>
> Thanks,
> David
>
>>
>>> I consider this a major deficiency in RDF that really needs to be corrected.
>>> Any significant software effort uses regression tests to validate changes.
>>> But comparing two documents is currently complicated and difficult with RDF
>>> data.  RDF canonicalization would make it as easy as it is for every other
>>> data representation.
>>
>> How so?  Right now you can just use a tool that does RDF graph or
>> dataset isomorphism.  Under your proposal you would need a tool that
>> does RDF graph or dataset canonicalization, which is no easier than
>> isomorphism checking. What's the difference?
>>
>>> I realize that this is a slightly different -- and more stringent -- notion of
>>> RDF validation than just looking at the general shape of the data, because it
>>> requires that the data not only has the expected shape, but also contains the
>>> expected *values*.  Canonicalization would solve this problem.
>>
>> Canonicalization is a part of a solution to a problem that is already
>> solved.
>>
>>
>>> Given this motivation, would you be okay with RDF canonicalization being
>>> included as an OPTIONAL deliverable in the charter?
>>>
>>> Thanks,
>>> David
>>
>>
>> peter
>>
>>> On 08/13/2014 01:11 AM, Peter F. Patel-Schneider wrote:
>>>> I'm still not getting this at all.
>>>>
>>>> How does canonicalization help me determine that I got the RDF data that
>>>> I expected (exact or otherwise)?  For example, how does canonicalization
>>>> help me determine that I got some RDF data that tells me the phone
>>>> numbers of my friends?
>>>>
>>>> I just can't come up with a use case at all related to RDF data
>>>> validation where canonicalization is relevant, except for signing RDF
>>>> graphs, and that can just as easily be done at the surface syntax level,
>>>> and signing is quite tangential to the WG's purpose, I think.
>>>>
>>>> peter
>>>>
>>>>
>>>> On 08/12/2014 09:17 PM, David Booth wrote:
>>>>> I think "canonicalization" would be a clearer term, as in:
>>>>>
>>>>>    "OPTIONAL - A Recommendation for canonical serialization
>>>>>     of RDF graphs and RDF datasets."
>>>>>
>>>>> The purpose of this (to me) is to be able to validate that I got the *exact*
>>>>> RDF data that I expected -- not merely the right classes and predicates and
>>>>> such.  Would you be okay with including this in the charter?
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 08/12/2014 10:00 PM, Peter F. Patel-Schneider wrote:
>>>>>> I'm still not exactly sure just what normalization means in this context
>>>>>> or what relationship it has to RDF validation.
>>>>>>
>>>>>> peter
>>>>>>
>>>>>>
>>>>>> On 08/12/2014 06:55 PM, David Booth wrote:
>>>>>>> +1 for all except one item.
>>>>>>>
>>>>>>> I'd like to make one last ditch attempt to include graph normalization as an
>>>>>>> OPTIONAL deliverable.  I expect the WG to treat it as low priority, and would
>>>>>>> only anticipate a normalization document being produced if someone takes the
>>>>>>> personal initiative to draft it.  I do not see any significant harm in
>>>>>>> including it in the charter on that basis, but I do see a benefit, because if
>>>>>>> the WG did somehow get to it then it would be damn nice to have, so that we
>>>>>>> could finally validate RDF data by having a standard way to compare two RDF
>>>>>>> documents for equality, like we can routinely do with every other data
>>>>>>> representation.
>>>>>>>
>>>>>>> Peter, would that be okay with you, to include graph normalization as OPTIONAL
>>>>>>> that way?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On 08/12/2014 08:55 PM, Eric Prud'hommeaux wrote:
>>>>>>>> Hi all, we can have a face-to-face at the W3C Technical Plenary in
>>>>>>>> November if we can quickly endorse a good-enough charter.  As it
>>>>>>>> stands now, it isn't clear that the group will be able to reach
>>>>>>>> consensus within the Working Group, let alone get through the member
>>>>>>>> review without objection.
>>>>>>>>
>>>>>>>> Please review the proposals that I've culled from the list.  I
>>>>>>>> encourage compromise on all our parts and we'll have to suppress the
>>>>>>>> desire to wordsmith. (Given the 3-month evaluation period,
>>>>>>>> wordsmithing won't change much anyways.)
>>>>>>>>
>>>>>>>>
>>>>>>>> separate semantics:
>>>>>>>>
>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID:
>>>>>>>> <53E2AFBD.9050102@gmail.com>
>>>>>>>>      A syntax and semantics for shapes specifying how to construct
>>>>>>>> shape
>>>>>>>> expressions and how shape expressions are evaluated against RDF
>>>>>>>> graphs.
>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>>>>>>>> <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>>>>>>>      defining the (direct) semantics meaning of shapes and
>>>>>>>> defining the
>>>>>>>> associated validation process.
>>>>>>>>
>>>>>>>>    opposition: Holger Knublauch
>>>>>>>>
>>>>>>>>    proposed resolution: include, noting that if SPARQL is judged
>>>>>>>> to be
>>>>>>>> useful for the semantics, there's nothing preventing us from
>>>>>>>> using it.
>>>>>>>>
>>>>>>>>
>>>>>>>> make graph normalization optional or use-case specific:
>>>>>>>>
>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID:
>>>>>>>> <53E2AFBD.9050102@gmail.com>
>>>>>>>>      3 OPTIONAL A specification of how shape verification interacts
>>>>>>>> with
>>>>>>>> inference.
>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>>>>>>>> <D954B744-05CD-4E5C-8FC2-C08A9A99BA9F@syapse.com>
>>>>>>>>      the WG will consider whether it is necessary, practical or
>>>>>>>> desirable
>>>>>>>> to normalize a graph...
>>>>>>>>      A graph normalization method, suitable for  the use cases
>>>>>>>> determined by
>>>>>>>> the group....
>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>>>>>>> <53E28D07.9000804@dbooth.org>
>>>>>>>>      OPTIONAL - A Recommendation for normalization/canonicalization
>>>>>>>> of RDF
>>>>>>>> graphs and RDF datasets that are serialized in N-Triples and
>>>>>>>> N-Quads.
>>>>>>>> opposition - don't do it at all:
>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID:
>>>>>>>> <53E3A4CB.4040200@gmail.com>
>>>>>>>>      the WG should not be working on this.
>>>>>>>>
>>>>>>>>    proposed resolution: withdrawn, to go to new light-weight,
>>>>>>>> focused
>>>>>>>> WG,
>>>>>>>> removing this text:
>>>>>>>>    [[
>>>>>>>>    The WG MAY produce a Recommendation for graph normalization.
>>>>>>>>    ]]
>>>>>>>>
>>>>>>>>
>>>>>>>> mandatory human-facing language:
>>>>>>>>
>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>>>>>>>> <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>>>>>>>      ShExC mandatory, but potentially as a Note.
>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>>>>>>> <53E28D07.9000804@dbooth.org>
>>>>>>>>      In Section 4 (Deliverables), change "OPTIONAL - Compact,
>>>>>>>> human-readable
>>>>>>>> syntax" to "Compact, human-readable syntax", i.e., make it required.
>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>>>>>>>> <54AA894F-F4B4-4877-8806-EB85FB5A42E5@syapse.com>
>>>>>>>>
>>>>>>>>    opposition - make it OPTIONAL
>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID:
>>>>>>>> <53E2AFBD.9050102@gmail.com>
>>>>>>>>      OPTIONAL A compact, human-readable syntax for expressing
>>>>>>>> shapes.
>>>>>>>>
>>>>>>>>    proposed resolution: keep as OPTIONAL, not mentioning ShExC, but
>>>>>>>> clarifying that it's different from the RDF syntax.
>>>>>>>>
>>>>>>>>
>>>>>>>> report formats:
>>>>>>>>    Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>>>>>>>>      provide flexible validation execution plans that range from:
>>>>>>>>        Success / fail
>>>>>>>>        Success / fail per constraint
>>>>>>>>        Fails with error counts
>>>>>>>>        Individual resources that fail per constraint
>>>>>>>>        And enriched failed resources with annotations
>>>>>>>>
>>>>>>>>    proposed resolution: no change, noting that no one seconded this
>>>>>>>> proposal.
>>>>>>>>
>>>>>>>>
>>>>>>>> test suite/validator:
>>>>>>>>
>>>>>>>>    Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>>>>>>>>      Validation results are very important for the progress of this
>>>>>>>> WG and
>>>>>>>> should be a standalone deliverable.
>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>>>>>>> <53E28D07.9000804@dbooth.org>
>>>>>>>>      Test Suite, to help ensure interoperability and correct
>>>>>>>> implementation.
>>>>>>>> The group will choose the location of this deliverable, such as a git
>>>>>>>> repository.
>>>>>>>>
>>>>>>>>    proposed resolution: leave out of charter as WGs usually choose to
>>>>>>>> do this
>>>>>>>> anyways and it has no impact on IP commitments.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
Received on Thursday, 14 August 2014 02:05:16 UTC