Graph normalization/canonicalization (was: regression testing [was Re: summarizing proposed changes to charter]) from Markus Lanthaler on 2014-08-15 (public-rdf-shapes@w3.org from August 2014)

From: Markus Lanthaler <markus.lanthaler@gmx.net>
Date: Fri, 15 Aug 2014 18:13:49 +0200
To: <public-rdf-shapes@w3.org>
Cc: "'Manu Sporny'" <msporny@digitalbazaar.com>, "'Dave Longley'" <dlongley@digitalbazaar.com>
Message-ID: <009d01cfb8a3$e6e3e9f0$b4abbdd0$@gmx.net>
On 15 Aug 2014 at 04:15, Holger Knublauch wrote:
> I sympathize with your desire to piggyback on a WG for this work,
> but I also see the risk of spreading too far. There are probably
> other ways in the W3C processes, to allow very small working groups
> to proceed with such deliverables, outside of the larger WGs for the
> "big picture" items? If not, then the W3C process probably has an
> unnecessary limitation. As I understand this problem is something
> that less than 3 people can write up, it gets reviewed and could be
> signed off without impacting any other semantic web standards.
>
> On the specific topic, we have had requests for reliably sorted
> Turtle files for years, especially so that people can compare
> versions of the same file with concurrent versioning systems.
> TopBraid includes this feature now also in the Free Edition,
> whenever you save .ttl files. Jeremy knows much more about this
> algorithm than I do, and it is certainly a useful feature. Whether
> this needs to be a "standard" or just an algorithm from an open
> source library is a question that I cannot answer.

The JSON-LD CG worked on graph normalization [1] but there wasn't really
enough interest (apart from Digital Bazaar) to continue the work on it.
Nevertheless I think setting up a separate CG focusing on just this might be
interesting and will help to find out if there's enough interest to produce
a spec.

I CCed Manu and Dave as they might be interested or have something more to say.


[1] http://json-ld.org/spec/latest/rdf-graph-normalization/


--
Markus Lanthaler
@markuslanthaler



On 8/15/2014 3:26, David Booth wrote:
> On 08/14/2014 11:01 AM, Arnaud Le Hors wrote:
>> Hi David,
>>
>> Maybe I'm just missing something but I have to admit not to be convinced
>> by your argument that this is a necessity for validation. Rather, it
>> seems to me that you're just trying to piggyback on top of this WG to
>> have it do something that you think would be useful.
>
> In a sense I am, because as I mentioned before, this is a somewhat 
> different notion of validation than just looking at the shape of the 
> data.  I agree that it is not a necessity for *shape* validation, but 
> I do see it as important for validating, in a uniform way, that actual 
> data is equivalent to expected data.  But I understand that that is 
> tangential to the main use case that the group wants to focus on, so I 
> won't push it further.
>
>>
>> I understand you have good intentions but I'm sure you know that every
>> deliverable has a cost, even if optional, and I'd rather we don't add to
>> a charter that is already going to require a lot of work.
>
> Ok, I'll drop the request to include it.  Thanks for considering.
>
> David
>
>>
>> Regards.
>> -- 
>> Arnaud  Le Hors - Senior Technical Staff Member, Open Web Standards -
>> IBM Software Group
>>
>>
>> David Booth <david@dbooth.org> wrote on 08/13/2014 08:14:38 PM:
>>
>>  > From: David Booth <david@dbooth.org>
>>  > To: "Peter F. Patel-Schneider" <pfpschneider@gmail.com>,
>>  > public-rdf-shapes@w3.org
>>  > Date: 08/13/2014 08:15 PM
>>  > Subject: Re: regression testing [was Re: summarizing proposed
>>  > changes to charter]
>>  >
>>  > On 08/13/2014 10:04 PM, Peter F. Patel-Schneider wrote:
>>  > > OK, even though regression testing doesn't need canonicalization,
>>  > > it is useful to have RDF canonicalization to support a particular
>>  > > regression testing system.
>>  > >
>>  > > But how is the lack of a W3C-blessed method for RDF canonicalization
>>  > > hindering the development or deployment of this system?  How would a
>>  > > W3C-blessed method for RDF canonicalization help the development or
>>  > > deployment of this system?
>>  > >
>>  > > The system could use any canonical form whatsoever, after all, right?
>>  >
>>  > Yes and no.  The lack of a W3C-blessed method of RDF canonicalization
>>  > makes the comparison dependent on the particular canonicalization tool
>>  > that is used, which means that RDF data produced by different tools (or
>>  > different versions of the same tool) could not be reliably compared.  In
>>  > many scenarios this won't be an issue, but it will in some.
>>  >
>>  > But more importantly, the lack of a standard RDF canonicalization method
>>  > discourages the development of canonicalization tools.  Canonicalization
>>  > has gotten little attention in RDF tools, in my view largely *because*
>>  > of the difficulty of doing it and the lack of a W3C-blessed method.  It
>>  > is non-trivial to implement, and if one's implementation would just end
>>  > up as one's own idiosyncratic canonicalization anyway, instead of being
>>  > an implementation of a standard, then there isn't as much motivation to
>>  > do it.  I think a W3C-blessed method would help a lot.
>>  >
>>  > Would you be okay with canonicalization being an OPTIONAL deliverable?
>>  >
>>  > David
>>  >
>>  > >
>>  > > peter
>>  > >
>>  > >
>>  > > On 08/13/2014 12:00 PM, David Booth wrote:
>>  > >> Hi Peter,
>>  > >>
>>  > >> On 08/13/2014 01:25 PM, Peter F. Patel-Schneider wrote:
>>  > >>> On 08/13/2014 08:45 AM, David Booth wrote:
>>  > >>>> Hi Peter,
>>  > >>>>
>>  > >>>> Here is my main use case for RDF canonicalization.
>>  > >>>>
>>  > >>>> The RDF Pipeline Framework http://rdfpipeline.org/ allows any kind
>>  > >>>> of data to be manipulated in a data production pipeline -- not just
>>  > >>>> RDF.  The Framework has regression tests that, when run, are used to
>>  > >>>> validate the correctness of the output of each node in a pipeline.
>>  > >>>> A test passes if the actual node output exactly matches the expected
>>  > >>>> node output, *after* filtering out ignorable differences.  (For
>>  > >>>> example, differences in dates and times are typically treated as
>>  > >>>> ignorable -- they don't cause a test to fail.)  Since a generic
>>  > >>>> comparison tool is used (because the pipeline is permitted to carry
>>  > >>>> *any* kind of data), data serialization must be predictable and
>>  > >>>> canonical.  This works great for all kinds of data *except* RDF.
>>  > >>>
>>  > >>> Why?  You could just use RDF graph or dataset isomorphism.  Those
>>  > >>> are already defined by W3C.  Well maybe you need to modify the
>>  > >>> graphs first (e.g., to fudge dates and times), but you are already
>>  > >>> doing that for other data types.
>>  > >>>
>>  > >>>> If a canonical form of RDF were defined, then the exact same tools
>>  > >>>> that are used to compare other kinds of data for differences could
>>  > >>>> also be used for comparing RDF.
>>  > >>>
>>  > >>> What are these tools?  Why should a tool to determine whether two
>>  > >>> strings are the same also work for determining whether two XML
>>  > >>> documents are the same?  Oh, maybe you think that you should first
>>  > >>> canonicalize everything and then do string comparison.  However, you
>>  > >>> are deluding yourself that this is using the same tools for
>>  > >>> comparing different kinds of data.  The tool that you are actually
>>  > >>> using to compare, e.g., XML documents, is the composition of the
>>  > >>> datatype-specific canonicalizer and a string comparer.  There is no
>>  > >>> free lunch---you still need tools specific to each datatype.
>>  > >>
>>  > >> Not quite.  cmp is used for comparison of *serialized* data, and
>>  > >> canonicalization is part of the data *serialization* process -- not
>>  > >> the data *comparison* process.  The serialization process must
>>  > >> necessarily understand what kind of data it is -- there is no way
>>  > >> around that -- so that is the logical place to do the
>>  > >> canonicalization.  But the comparison process does *not* know what
>>  > >> kind of data is being compared -- nor should it have to.  It's the
>>  > >> serializer's job to produce a predictable, repeatable serialization
>>  > >> of the data.  This works great and is trivially easy for everything
>>  > >> *except* RDF, because of the instability of blank node labels.  In
>>  > >> RDF, comparison is embarrassingly difficult.
>>  > >>
>>  > >> One could argue that my application could use some workaround to
>>  > >> solve this problem, but that belies the fact that the root cause of
>>  > >> the problem is *not* some weird thing my application is trying to
>>  > >> do, it is a weakness of RDF itself -- a gap in the RDF specs.  This
>>  > >> gap makes RDF harder to use than it needs to be.  If we want RDF to
>>  > >> be adopted by a wider audience -- and I certainly do -- then we need
>>  > >> to fix obvious gaps like this.
>>  > >>
>>  > >> I hope that helps clarify why I see this as a problem.  Given the
>>  > >> above, would you be okay with canonicalization being an OPTIONAL
>>  > >> deliverable?
>>  > >>
>>  > >> Thanks,
>>  > >> David
>>  > >>
>>  > >>>
>>  > >>>> I consider this a major deficiency in RDF that really needs to be
>>  > >>>> corrected.  Any significant software effort uses regression tests
>>  > >>>> to validate changes.  But comparing two documents is currently
>>  > >>>> complicated and difficult with RDF data.  RDF canonicalization
>>  > >>>> would make it as easy as it is for every other data
>>  > >>>> representation.
>>  > >>>
>>  > >>> How so?  Right now you can just use a tool that does RDF graph or
>>  > >>> dataset isomorphism.  Under your proposal you would need a tool
>>  > >>> that does RDF graph or dataset canonicalization, which is no easier
>>  > >>> than isomorphism checking.  What's the difference?
>>  > >>>
>>  > >>>> I realize that this is a slightly different -- and more stringent
>>  > >>>> -- notion of RDF validation than just looking at the general shape
>>  > >>>> of the data, because it requires that the data not only has the
>>  > >>>> expected shape, but also contains the expected *values*.
>>  > >>>> Canonicalization would solve this problem.
>>  > >>>
>>  > >>> Canonicalization is a part of a solution to a problem that is
>>  > >>> already solved.
>>  > >>>
>>  > >>>
>>  > >>>> Given this motivation, would you be okay with RDF canonicalization
>>  > >>>> being included as an OPTIONAL deliverable in the charter?
>>  > >>>>
>>  > >>>> Thanks,
>>  > >>>> David
>>  > >>>
>>  > >>>
>>  > >>> peter
>>  > >>>
>>  > >>>> On 08/13/2014 01:11 AM, Peter F. Patel-Schneider wrote:
>>  > >>>>> I'm still not getting this at all.
>>  > >>>>>
>>  > >>>>> How does canonicalization help me determine that I got the RDF
>>  > >>>>> data that I expected (exact or otherwise)?  For example, how does
>>  > >>>>> canonicalization help me determine that I got some RDF data that
>>  > >>>>> tells me the phone numbers of my friends?
>>  > >>>>>
>>  > >>>>> I just can't come up with a use case at all related to RDF data
>>  > >>>>> validation where canonicalization is relevant, except for signing
>>  > >>>>> RDF graphs, and that can just as easily be done at the surface
>>  > >>>>> syntax level, and signing is quite tangential to the WG's purpose,
>>  > >>>>> I think.
>>  > >>>>>
>>  > >>>>> peter
>>  > >>>>>
>>  > >>>>>
>>  > >>>>> On 08/12/2014 09:17 PM, David Booth wrote:
>>  > >>>>>> I think "canonicalization" would be a clearer term, as in:
>>  > >>>>>>
>>  > >>>>>>    "OPTIONAL - A Recommendation for canonical serialization
>>  > >>>>>>     of RDF graphs and RDF datasets."
>>  > >>>>>>
>>  > >>>>>> The purpose of this (to me) is to be able to validate that I got
>>  > >>>>>> the *exact* RDF data that I expected -- not merely the right
>>  > >>>>>> classes and predicates and such.  Would you be okay with
>>  > >>>>>> including this in the charter?
>>  > >>>>>>
>>  > >>>>>> Thanks,
>>  > >>>>>> David
>>  > >>>>>>
>>  > >>>>>> On 08/12/2014 10:00 PM, Peter F. Patel-Schneider wrote:
>>  > >>>>>>> I'm still not exactly sure just what normalization means in this
>>  > >>>>>>> context or what relationship it has to RDF validation.
>>  > >>>>>>>
>>  > >>>>>>> peter
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>>> On 08/12/2014 06:55 PM, David Booth wrote:
>>  > >>>>>>>> +1 for all except one item.
>>  > >>>>>>>>
>>  > >>>>>>>> I'd like to make one last ditch attempt to include graph
>>  > >>>>>>>> normalization as an OPTIONAL deliverable.  I expect the WG to
>>  > >>>>>>>> treat it as low priority, and would only anticipate a
>>  > >>>>>>>> normalization document being produced if someone takes the
>>  > >>>>>>>> personal initiative to draft it.  I do not see any significant
>>  > >>>>>>>> harm in including it in the charter on that basis, but I do see
>>  > >>>>>>>> a benefit, because if the WG did somehow get to it then it
>>  > >>>>>>>> would be damn nice to have, so that we could finally validate
>>  > >>>>>>>> RDF data by having a standard way to compare two RDF documents
>>  > >>>>>>>> for equality, like we can routinely do with every other data
>>  > >>>>>>>> representation.
>>  > >>>>>>>>
>>  > >>>>>>>> Peter, would that be okay with you, to include graph
>>  > >>>>>>>> normalization as OPTIONAL that way?
>>  > >>>>>>>>
>>  > >>>>>>>> Thanks,
>>  > >>>>>>>> David
>>  > >>>>>>>>
>>  > >>>>>>>> On 08/12/2014 08:55 PM, Eric Prud'hommeaux wrote:
>>  > >>>>>>>>> Hi all, we can have a face-to-face at the W3C Technical Plenary
>>  > >>>>>>>>> in November if we can quickly endorse a good-enough charter.  As
>>  > >>>>>>>>> it stands now, it isn't clear that the group will be able to
>>  > >>>>>>>>> reach consensus within the Working Group, let alone get through
>>  > >>>>>>>>> the member review without objection.
>>  > >>>>>>>>>
>>  > >>>>>>>>> Please review the proposals that I've culled from the list.  I
>>  > >>>>>>>>> encourage compromise on all our parts and we'll have to suppress
>>  > >>>>>>>>> the desire to wordsmith.  (Given the 3-month evaluation period,
>>  > >>>>>>>>> wordsmithing won't change much anyways.)
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> separate semantics:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>>    Message-ID: <53E2AFBD.9050102@gmail.com>
>>  > >>>>>>>>>      A syntax and semantics for shapes specifying how to
>>  > >>>>>>>>>      construct shape expressions and how shape expressions are
>>  > >>>>>>>>>      evaluated against RDF graphs.
>>  > >>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>>  > >>>>>>>>>    <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>  > >>>>>>>>>      defining the (direct) semantics meaning of shapes and
>>  > >>>>>>>>>      defining the associated validation process.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    opposition: Holger Knublauch
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: include, noting that if SPARQL is judged
>>  > >>>>>>>>>    to be useful for the semantics, there's nothing preventing us
>>  > >>>>>>>>>    from using it.
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> make graph normalization optional or use-case specific:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>>    Message-ID: <53E2AFBD.9050102@gmail.com>
>>  > >>>>>>>>>      3 OPTIONAL A specification of how shape verification
>>  > >>>>>>>>>      interacts with inference.
>>  > >>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>>  > >>>>>>>>>    <D954B744-05CD-4E5C-8FC2-C08A9A99BA9F@syapse.com>
>>  > >>>>>>>>>      the WG will consider whether it is necessary, practical or
>>  > >>>>>>>>>      desirable to normalize a graph...
>>  > >>>>>>>>>      A graph normalization method, suitable for the use cases
>>  > >>>>>>>>>      determined by the group....
>>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>  > >>>>>>>>>    <53E28D07.9000804@dbooth.org>
>>  > >>>>>>>>>      OPTIONAL - A Recommendation for
>>  > >>>>>>>>>      normalization/canonicalization of RDF graphs and RDF
>>  > >>>>>>>>>      datasets that are serialized in N-Triples and N-Quads.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    opposition - don't do it at all:
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>>    Message-ID: <53E3A4CB.4040200@gmail.com>
>>  > >>>>>>>>>      the WG should not be working on this.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: withdrawn, to go to new light-weight,
>>  > >>>>>>>>>    focused WG, removing this text:
>>  > >>>>>>>>>    [[
>>  > >>>>>>>>>    The WG MAY produce a Recommendation for graph normalization.
>>  > >>>>>>>>>    ]]
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> mandatory human-facing language:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID:
>>  > >>>>>>>>>    <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>  > >>>>>>>>>      ShExC mandatory, but potentially as a Note.
>>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>  > >>>>>>>>>    <53E28D07.9000804@dbooth.org>
>>  > >>>>>>>>>      In Section 4 (Deliverables), change "OPTIONAL - Compact,
>>  > >>>>>>>>>      human-readable syntax" to "Compact, human-readable syntax",
>>  > >>>>>>>>>      i.e., make it required.
>>  > >>>>>>>>>    Jeremy J Carroll <jjc@syapse.com> - Message-Id:
>>  > >>>>>>>>>    <54AA894F-F4B4-4877-8806-EB85FB5A42E5@syapse.com>
>>  > >>>>>>>>>
>>  > >>>>>>>>>    opposition - make it OPTIONAL
>>  > >>>>>>>>>    "Peter F. Patel-Schneider" <pfpschneider@gmail.com> -
>>  > >>>>>>>>>    Message-ID: <53E2AFBD.9050102@gmail.com>
>>  > >>>>>>>>>      OPTIONAL A compact, human-readable syntax for expressing
>>  > >>>>>>>>>      shapes.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: keep as OPTIONAL, not mentioning ShExC,
>>  > >>>>>>>>>    but clarifying that it's different from the RDF syntax.
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> report formats:
>>  > >>>>>>>>>    Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>>  > >>>>>>>>>      provide flexible validation execution plans that range from:
>>  > >>>>>>>>>        Success / fail
>>  > >>>>>>>>>        Success / fail per constraint
>>  > >>>>>>>>>        Fails with error counts
>>  > >>>>>>>>>        Individual resources that fail per constraint
>>  > >>>>>>>>>        And enriched failed resources with annotations
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: no change, noting that no one seconded
>>  > >>>>>>>>>    this proposal.
>>  > >>>>>>>>>
>>  > >>>>>>>>>
>>  > >>>>>>>>> test suite/validator:
>>  > >>>>>>>>>
>>  > >>>>>>>>>    Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>>  > >>>>>>>>>      Validation results are very important for the progress of
>>  > >>>>>>>>>      this WG and should be a standalone deliverable.
>>  > >>>>>>>>>    David Booth <david@dbooth.org> - Message-ID:
>>  > >>>>>>>>>    <53E28D07.9000804@dbooth.org>
>>  > >>>>>>>>>      Test Suite, to help ensure interoperability and correct
>>  > >>>>>>>>>      implementation.  The group will choose the location of this
>>  > >>>>>>>>>      deliverable, such as a git repository.
>>  > >>>>>>>>>
>>  > >>>>>>>>>    proposed resolution: leave from charter as WGs usually choose
>>  > >>>>>>>>>    to do this anyways and it has no impact on IP commitments.
>>  > >>>>>>>>>
>>  > >>>>>>>>
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>>
>>  > >>>>
>>  > >>>
>>  > >>>
>>  > >>>
>>  > >
>>  > >
>>  > >
>>  >
>
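
For readers following the thread above, here is a minimal Python sketch
of the disagreement between comparing serialized data byte-for-byte and
checking graph isomorphism. It is an illustration only, not the JSON-LD
CG normalization algorithm and not the TopBraid or RDF Pipeline code.
The function names, the tiny N-Triples subset (one statement per line,
no spaces inside subject or predicate), and the example.org data are
all assumptions made up for this sketch.

```python
from itertools import permutations


def parse_ntriples(text):
    """Parse a tiny, assumed subset of N-Triples into a set of (s, p, o)."""
    triples = set()
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Drop the trailing " ." and split into exactly three terms.
        s, p, o = line.rstrip(" .").split(None, 2)
        triples.add((s, p, o.strip()))
    return triples


def naive_canonical(triples):
    """Sort the serialized lines. Stable only when no blank nodes occur."""
    return "\n".join(sorted(f"{s} {p} {o} ." for s, p, o in triples))


def blank_nodes(triples):
    """Collect blank node labels used in subject or object position."""
    return sorted({t for s, _, o in triples for t in (s, o)
                   if t.startswith("_:")})


def isomorphic(g1, g2):
    """Brute-force isomorphism: try every mapping of g1's blank nodes
    onto g2's. Exponential in the number of blank nodes, so this is
    only suitable for very small graphs."""
    b1, b2 = blank_nodes(g1), blank_nodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for perm in permutations(b2):
        mapping = dict(zip(b1, perm))
        relabelled = {(mapping.get(s, s), p, mapping.get(o, o))
                      for s, p, o in g1}
        if relabelled == g2:
            return True
    return False


# Same graph, but the two serializers picked different blank node labels.
expected = parse_ntriples("""
_:a <http://example.org/knows> _:b .
_:a <http://example.org/name> "Alice" .
""")
actual = parse_ntriples("""
_:x1 <http://example.org/name> "Alice" .
_:x1 <http://example.org/knows> _:x2 .
""")

# Ground (blank-node-free) data: sorting lines is enough.
ground_1 = parse_ntriples("""
<http://example.org/a> <http://example.org/p> <http://example.org/b> .
<http://example.org/b> <http://example.org/p> <http://example.org/c> .
""")
ground_2 = parse_ntriples("""
<http://example.org/b> <http://example.org/p> <http://example.org/c> .
<http://example.org/a> <http://example.org/p> <http://example.org/b> .
""")

print(naive_canonical(ground_1) == naive_canonical(ground_2))
print(naive_canonical(expected) == naive_canonical(actual))
print(isomorphic(expected, actual))
```

In this sketch the sorted-line comparison succeeds for the ground data
but fails once blank node labels differ, even though the isomorphism
check accepts the two graphs as the same. A real canonicalization
algorithm would have to assign blank node labels deterministically so
that a generic tool such as cmp could do the final comparison.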
Received on Friday, 15 August 2014 16:14:32 UTC