- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Wed, 13 Aug 2014 19:04:39 -0700
- To: David Booth <david@dbooth.org>, public-rdf-shapes@w3.org
OK, even though regression testing doesn't need canonicalization, it is useful to have RDF canonicalization to support a particular regression testing system.

But how is the lack of a W3C-blessed method for RDF canonicalization hindering the development or deployment of this system? How would a W3C-blessed method for RDF canonicalization help the development or deployment of this system? The system could use any canonical form whatsoever, after all, right?

peter

On 08/13/2014 12:00 PM, David Booth wrote:
> Hi Peter,
>
> On 08/13/2014 01:25 PM, Peter F. Patel-Schneider wrote:
>> On 08/13/2014 08:45 AM, David Booth wrote:
>>> Hi Peter,
>>>
>>> Here is my main use case for RDF canonicalization.
>>>
>>> The RDF Pipeline Framework http://rdfpipeline.org/ allows any kind of data to be manipulated in a data production pipeline -- not just RDF. The Framework has regression tests that, when run, are used to validate the correctness of the output of each node in a pipeline. A test passes if the actual node output exactly matches the expected node output, *after* filtering out ignorable differences. (For example, differences in dates and times are typically treated as ignorable -- they don't cause a test to fail.) Since a generic comparison tool is used (because the pipeline is permitted to carry *any* kind of data), data serialization must be predictable and canonical. This works great for all kinds of data *except* RDF.
>>
>> Why? You could just use RDF graph or dataset isomorphism. Those are already defined by W3C. Well maybe you need to modify the graphs first (e.g., to fudge dates and times), but you are already doing that for other data types.
>>
>>> If a canonical form of RDF were defined, then the exact same tools that are used to compare other kinds of data for differences could also be used for comparing RDF.
>>
>> What are these tools?
>> Why should a tool to determine whether two strings are the same also work for determining whether two XML documents are the same? Oh, maybe you think that you should first canonicalize everything and then do string comparison. However, you are deluding yourself that this is using the same tools for comparing different kinds of data. The tool that you are actually using to compare, e.g., XML documents, is the composition of the datatype-specific canonicalizer and a string comparer. There is no free lunch---you still need tools specific to each datatype.
>
> Not quite. cmp is used for comparison of *serialized* data, and canonicalization is part of the data *serialization* process -- not the data *comparison* process. The serialization process must necessarily understand what kind of data it is -- there is no way around that -- so that is the logical place to do the canonicalization. But the comparison process does *not* know what kind of data is being compared -- nor should it have to. It's the serializer's job to produce a predictable, repeatable serialization of the data. This works great and is trivially easy for everything *except* RDF, because of the instability of blank node labels. In RDF, comparison is embarrassingly difficult.
>
> One could argue that my application could use some workaround to solve this problem, but that belies the fact that the root cause of the problem is *not* some weird thing my application is trying to do, it is a weakness of RDF itself -- a gap in the RDF specs. This gap makes RDF harder to use than it needs to be. If we want RDF to be adopted by a wider audience -- and I certainly do -- then we need to fix obvious gaps like this.
>
> I hope that helps clarify why I see this as a problem. Given the above, would you be okay with canonicalization being an OPTIONAL deliverable?
>
> Thanks,
> David
>
>>
>>> I consider this a major deficiency in RDF that really needs to be corrected. Any significant software effort uses regression tests to validate changes. But comparing two documents is currently complicated and difficult with RDF data. RDF canonicalization would make it as easy as it is for every other data representation.
>>
>> How so? Right now you can just use a tool that does RDF graph or dataset isomorphism. Under your proposal you would need a tool that does RDF graph or dataset canonicalization, which is no easier than isomorphism checking. What's the difference?
>>
>>> I realize that this is a slightly different -- and more stringent -- notion of RDF validation than just looking at the general shape of the data, because it requires that the data not only has the expected shape, but also contains the expected *values*. Canonicalization would solve this problem.
>>
>> Canonicalization is a part of a solution to a problem that is already solved.
>>
>>
>>> Given this motivation, would you be okay with RDF canonicalization being included as an OPTIONAL deliverable in the charter?
>>>
>>> Thanks,
>>> David
>>
>>
>> peter
>>
>>> On 08/13/2014 01:11 AM, Peter F. Patel-Schneider wrote:
>>>> I'm still not getting this at all.
>>>>
>>>> How does canonicalization help me determine that I got the RDF data that I expected (exact or otherwise)? For example, how does canonicalization help me determine that I got some RDF data that tells me the phone numbers of my friends?
>>>>
>>>> I just can't come up with a use case at all related to RDF data validation where canonicalization is relevant, except for signing RDF graphs, and that can just as easily be done at the surface syntax level, and signing is quite tangential to the WG's purpose, I think.
>>>>
>>>> peter
>>>>
>>>>
>>>> On 08/12/2014 09:17 PM, David Booth wrote:
>>>>> I think "canonicalization" would be a clearer term, as in:
>>>>>
>>>>>   "OPTIONAL - A Recommendation for canonical serialization of RDF graphs and RDF datasets."
>>>>>
>>>>> The purpose of this (to me) is to be able to validate that I got the *exact* RDF data that I expected -- not merely the right classes and predicates and such. Would you be okay with including this in the charter?
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 08/12/2014 10:00 PM, Peter F. Patel-Schneider wrote:
>>>>>> I'm still not exactly sure just what normalization means in this context or what relationship it has to RDF validation.
>>>>>>
>>>>>> peter
>>>>>>
>>>>>>
>>>>>> On 08/12/2014 06:55 PM, David Booth wrote:
>>>>>>> +1 for all except one item.
>>>>>>>
>>>>>>> I'd like to make one last ditch attempt to include graph normalization as an OPTIONAL deliverable. I expect the WG to treat it as low priority, and would only anticipate a normalization document being produced if someone takes the personal initiative to draft it. I do not see any significant harm in including it in the charter on that basis, but I do see a benefit, because if the WG did somehow get to it then it would be damn nice to have, so that we could finally validate RDF data by having a standard way to compare two RDF documents for equality, like we can routinely do with every other data representation.
>>>>>>>
>>>>>>> Peter, would that be okay with you, to include graph normalization as OPTIONAL that way?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On 08/12/2014 08:55 PM, Eric Prud'hommeaux wrote:
>>>>>>>> Hi all, we can have a face-to-face at the W3C Technical Plenary in November if we can quickly endorse a good-enough charter.
>>>>>>>> As it stands now, it isn't clear that the group will be able to reach consensus within the Working Group, let alone get through the member review without objection.
>>>>>>>>
>>>>>>>> Please review the proposals that I've culled from the list. I encourage compromise on all our parts and we'll have to suppress the desire to wordsmith. (Given the 3-month evaluation period, wordsmithing won't change much anyways.)
>>>>>>>>
>>>>>>>>
>>>>>>>> separate semantics:
>>>>>>>>
>>>>>>>> "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID: <53E2AFBD.9050102@gmail.com>
>>>>>>>>   A syntax and semantics for shapes specifying how to construct shape expressions and how shape expressions are evaluated against RDF graphs.
>>>>>>>> "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID: <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>>>>>>>   defining the (direct) semantics meaning of shapes and defining the associated validation process.
>>>>>>>>
>>>>>>>> opposition: Holger Knublauch
>>>>>>>>
>>>>>>>> proposed resolution: include, noting that if SPARQL is judged to be useful for the semantics, there's nothing preventing us from using it.
>>>>>>>>
>>>>>>>>
>>>>>>>> make graph normalization optional or use-case specific:
>>>>>>>>
>>>>>>>> "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID: <53E2AFBD.9050102@gmail.com>
>>>>>>>>   3 OPTIONAL A specification of how shape verification interacts with inference.
>>>>>>>> Jeremy J Carroll <jjc@syapse.com> - Message-Id: <D954B744-05CD-4E5C-8FC2-C08A9A99BA9F@syapse.com>
>>>>>>>>   the WG will consider whether it is necessary, practical or desirable to normalize a graph...
>>>>>>>>   A graph normalization method, suitable for the use cases determined by the group....
>>>>>>>> David Booth <david@dbooth.org> - Message-ID: <53E28D07.9000804@dbooth.org>
>>>>>>>>   OPTIONAL - A Recommendation for normalization/canonicalization of RDF graphs and RDF datasets that are serialized in N-Triples and N-Quads.
>>>>>>>> opposition - don't do it at all:
>>>>>>>> "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID: <53E3A4CB.4040200@gmail.com>
>>>>>>>>   the WG should not be working on this.
>>>>>>>>
>>>>>>>> proposed resolution: withdrawn, to go to new light-weight, focused WG, removing this text:
>>>>>>>> [[
>>>>>>>> The WG MAY produce a Recommendation for graph normalization.
>>>>>>>> ]]
>>>>>>>>
>>>>>>>>
>>>>>>>> mandatory human-facing language:
>>>>>>>>
>>>>>>>> "Dam, Jesse van" <jesse.vandam@wur.nl> - Message-ID: <63CF398D7F09744BA51193F17F5252AB1FD60B24@SCOMP0936.wurnet.nl>
>>>>>>>>   ShExC mandatory, but potentially as a Note.
>>>>>>>> David Booth <david@dbooth.org> - Message-ID: <53E28D07.9000804@dbooth.org>
>>>>>>>>   In Section 4 (Deliverables), change "OPTIONAL - Compact, human-readable syntax" to "Compact, human-readable syntax", i.e., make it required.
>>>>>>>> Jeremy J Carroll <jjc@syapse.com> - Message-Id: <54AA894F-F4B4-4877-8806-EB85FB5A42E5@syapse.com>
>>>>>>>>
>>>>>>>> opposition - make it OPTIONAL
>>>>>>>> "Peter F. Patel-Schneider" <pfpschneider@gmail.com> - Message-ID: <53E2AFBD.9050102@gmail.com>
>>>>>>>>   OPTIONAL A compact, human-readable syntax for expressing shapes.
>>>>>>>>
>>>>>>>> proposed resolution: keep as OPTIONAL, not mentioning ShExC, but clarifying that it's different from the RDF syntax.
>>>>>>>>
>>>>>>>>
>>>>>>>> report formats:
>>>>>>>> Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>>>>>>>>   provide flexible validation execution plans that range from:
>>>>>>>>     Success / fail
>>>>>>>>     Success / fail per constraint
>>>>>>>>     Fails with error counts
>>>>>>>>     Individual resources that fail per constraint
>>>>>>>>     And enriched failed resources with annotations
>>>>>>>>
>>>>>>>> proposed resolution: no change, noting that no one seconded this proposal.
>>>>>>>>
>>>>>>>>
>>>>>>>> test suite/validator:
>>>>>>>>
>>>>>>>> Dimitris Kontokostas <kontokostas@informatik.uni-leipzig.de>
>>>>>>>>   Validation results are very important for the progress of this WG and should be a standalone deliverable.
>>>>>>>> David Booth <david@dbooth.org> - Message-ID: <53E28D07.9000804@dbooth.org>
>>>>>>>>   Test Suite, to help ensure interoperability and correct implementation. The group will choose the location of this deliverable, such as a git repository.
>>>>>>>>
>>>>>>>> proposed resolution: leave from charter as WGs usually choose to do this anyways and it has no impact on IP commitments.
Received on Thursday, 14 August 2014 02:05:16 UTC
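
The point argued in this thread, that a byte-level comparison of serialized RDF breaks on blank node labels while a canonical form restores it, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the messages above: triples are plain tuples of strings, any term starting with `_:` is treated as a blank node, and `serialize` and `canonical` are hypothetical helper names. The brute-force relabelling is only workable for a handful of blank nodes and is not a standardized canonicalization algorithm. In this naive form it performs the same search an isomorphism check would, which is roughly the objection raised in the thread.

```python
from itertools import permutations


def is_bnode(term):
    """Treat any term that starts with '_:' as a blank node (toy convention)."""
    return term.startswith("_:")


def serialize(triples):
    """Sorted, N-Triples-like text: one 'subject predicate object .' line per triple."""
    lines = sorted({f"{s} {p} {o} ." for s, p, o in triples})
    return "\n".join(lines) + "\n"


def canonical(triples):
    """Brute-force canonical form (illustrative only, factorial in blank node count).

    Try every assignment of the labels _:c0, _:c1, ... to the graph's blank
    nodes and keep the lexicographically smallest serialization.
    """
    bnodes = sorted({t for triple in triples for t in triple if is_bnode(t)})
    labels = [f"_:c{i}" for i in range(len(bnodes))]
    best = None
    for perm in permutations(labels):
        mapping = dict(zip(bnodes, perm))
        relabelled = [tuple(mapping.get(t, t) for t in triple) for triple in triples]
        text = serialize(relabelled)
        if best is None or text < best:
            best = text
    return best


ALICE = "<http://example.org/alice>"
KNOWS = "<http://example.org/knows>"
NAME = "<http://example.org/name>"

# Two graphs that differ only in their blank node labels.
g1 = {(ALICE, KNOWS, "_:a"), (ALICE, KNOWS, "_:b"),
      ("_:a", NAME, '"Bob"'), ("_:b", NAME, '"Carol"')}
g2 = {(ALICE, KNOWS, "_:x"), (ALICE, KNOWS, "_:y"),
      ("_:y", NAME, '"Bob"'), ("_:x", NAME, '"Carol"')}

print(serialize(g1) == serialize(g2))  # a cmp-style comparison of sorted output
print(canonical(g1) == canonical(g2))  # the same comparison on the canonical form
```

With a function like `canonical` placed in the serializer, the comparison step can stay a generic byte comparison such as cmp, which is the arrangement described in the quoted reply. Whether that is preferable to calling an isomorphism checker directly is exactly what the thread disputes.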