Re: wording changes re normalization

Hi Jeremy,

On 08/07/2014 10:13 AM, Jeremy J Carroll wrote:
> It is easy to forget that in general RDF canonicalization is
> Graph-Isomorphism complete, and hence too difficult for production use
> at scale. [1]

For unrestricted RDF, yes, but for "Well Behaved RDF", canonicalization 
is very feasible:
http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf
You showed this in [1] long before the term "Well Behaved RDF" was coined.

>
> On the other hand, within any particular application domain, which is
> the scope of the users of the proposed working group, normalizing an RDF
> graph tends to be fairly straightforward.

It is, but it's also a tedious waste of effort when a general-purpose 
canonicalization tool could be defined once and used for *all* 
applications that are willing to use Well Behaved RDF.
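
To illustrate how simple the well-behaved case can be (this is a sketch, 
not a proposed standard, and the function name is made up for the 
example): for a *ground* graph -- one with no blank nodes at all -- a 
canonical form is just the deduplicated, sorted N-Triples serialization.

```python
def canonicalize_ground_ntriples(ntriples: str) -> str:
    """Canonical form of a ground RDF graph (no blank nodes):
    deduplicate and sort its N-Triples statements."""
    triples = {line.strip() for line in ntriples.splitlines() if line.strip()}
    # Naive guard: any statement mentioning a blank node label ("_:")
    # falls outside this trivial scheme and would need a deterministic
    # blank-node labelling step instead.
    if any("_:" in t for t in triples):
        raise ValueError("graph is not ground; blank nodes present")
    return "\n".join(sorted(triples))
```

Two serializations of the same ground graph, in any triple order, then 
canonicalize to byte-identical strings.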

The paper by Hogan, Arenas, Mallea and Polleres on "Everything You 
Always Wanted to Know About Blank Nodes"
http://www.websemanticsjournal.org/index.php/ps/article/download/365/387
shows that the vast majority of RDF does not need the problematic uses 
of blank nodes that cause difficulty in canonicalization.  Most uses of 
blank nodes are benign, like the implicit blank nodes generated by 
Turtle list "( ... )" syntax and square bracket "[ ... ]" syntax.
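
For instance, both Turtle shorthands introduce blank nodes implicitly, 
without the author ever minting an explicit label (the data below is 
purely illustrative):

```turtle
@prefix ex: <http://example.org/> .

# "[ ... ]" creates one implicit blank node for the address.
ex:alice ex:address [ ex:city "Boston" ; ex:zip "02134" ] .

# "( ... )" expands to a chain of implicit blank nodes (an rdf:List).
ex:alice ex:children ( ex:bob ex:carol ) .
```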

>
> Mindful of this I suggest:
>
> Section 1
> Replace:
> [[
> In addressing these issues, the WG will consider whether it is
> necessary, practical or desirable to normalize a graph prior to
> validation. That is, whether an algorithm can and should be defined that
> creates a canonical form of a given graph.
> ]]
> With
> [[
> In addressing these issues, the WG will consider whether it is
> necessary, practical or desirable to normalize a graph as part of
> validation. That is, whether an algorithm can and should be defined that
> creates a representation of a given graph, or an equivalent graph, that
> is canonical for the purpose of processing with respect to a specific
> machine-readable interface definition.
> ]]
>
> Rationale: the answer to the current question "should such an algorithm
> be defined" is simply "no, it should not".
> I weaken the question to indicate that the algorithm is part of
> validation, not prior, and that the canonicalization is not independent
> of the application but application dependent.
>
> Section 3:
> Replace:
> [[
> The WG *MAY* produce a Recommendation for *graph normalization*.
> ]]
> With
> [[
> 3. OPTIONAL - A graph normalization method, suitable for the use cases
> determined by the group. This should not be a general purpose RDF
> canonicalization algorithm, see [1].
> ]]
> Rationale: consistent styling with other deliverables; restricting scope
> to avoid the impossible.
>
> [1]
> http://www.hpl.hp.com/techreports/2003/HPL-2003-142.pdf

The fact that a canonicalization algorithm may fail to finish on some 
problematic input does *not* mean that it is useless to define or 
implement such an algorithm.  It just means that the users of that 
algorithm must be aware of its limitations.  And those limitations are 
very modest: all one has to do is avoid explicit blank nodes.  Implicit 
blank nodes are fine.

However, the reason I'm pushing for canonicalization has little to do 
with RDF "shape" validation.  It is more about the broader problem of 
the need to compare two RDF graphs for "equality" for purposes such as 
regression testing, which is essential to almost any significant 
software project.  I think it's crazy that we (the RDF community) are 
promoting an information representation that we think is so great, yet 
it has this blatant, gaping flaw: two RDF graphs cannot be easily 
compared for "equality"!  So I saw canonicalization mentioned in the 
charter and opportunistically thought "Hey, maybe we can *finally* get 
RDF canonicalization standardized!"
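
With a canonical form in hand, a regression test reduces to comparing 
digests.  Here is a minimal sketch, assuming both graphs are ground 
N-Triples (the function names are hypothetical):

```python
import hashlib

def graph_digest(ntriples: str) -> str:
    """Hash of a ground graph's canonical form: sorted, deduplicated
    N-Triples statements.  Equal graphs yield equal digests."""
    triples = {line.strip() for line in ntriples.splitlines() if line.strip()}
    canonical = "\n".join(sorted(triples))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def graphs_equal(actual: str, expected: str) -> bool:
    # Regression check: the graph produced now vs. the stored expected graph.
    return graph_digest(actual) == graph_digest(expected)
```

Triple order and duplicates no longer matter, which is exactly the 
property plain text diffing of RDF serializations lacks.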

I don't view RDF canonicalization as essential for shape validation, but 
I *do* view it as essential to the future of RDF.

David

Received on Friday, 8 August 2014 03:21:42 UTC