RE: How are correct, unambiguous results possible with implementation-defined XML pre-processing?

Hi Harry,

Thanks for your comments.  Responses below.

> From: Harry Halpin [mailto:hhalpin@ibiblio.org] 
> 
> As chair, I am going to make a note here about the 
> relationship of our WG 
> to others relating to this issue.
> On Fri, 25 May 2007, Booth, David (HP Software - Boston) wrote:
> 
>   >>
> >> Secondly, we expect that early transformations will be
> >> written using XSLT 1 & 2.
> >> So, we cannot require transformations to perform XInclude or
> >> validation.
> >
> > But the spec could provide a way for a GRDDL transformation 
> > to specify
> > what pre-processing should occur prior to invoking the XSLT script.
> 
> The problem is that the notion of preprocessing is 
> underdefined for XML 
> parsers in general. Can someone point me to a document that specifies 
> exactly what finite number of steps must be taken to preprocess an XML 
> document so one can apply XPath to get a node (and here come up 
> questions about how one gets from bytes on the wire to a data 
> model)? 

I think the point that Henry Thompson and others observed is that there
is no *single* preprocessing sequence that would be appropriate for all
XML documents.  Different documents require different preprocessing
sequences.  Since the root namespace determines the overall semantics of
the document (and thus the expected preprocessing sequence), it seems
quite reasonable for a GRDDL transformation to explicitly specify what
pre-processing needs to occur.
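
To make the ambiguity concrete, here is a minimal sketch (mine, not
anything from the GRDDL spec) of one document yielding two different
views depending on whether XInclude is expanded before the query runs.
It uses Python's standard library purely for illustration; the file
name extra.xml, the in-memory loader and the items() helper are
hypothetical.

```python
import copy
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical source document containing one XInclude reference.
DOC = """<doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <item>local</item>
  <xi:include href="extra.xml"/>
</doc>"""

# Hypothetical content of the included resource.
EXTRA = "<item>included</item>"


def loader(href, parse, encoding=None):
    """Serve the included resource from memory instead of the network."""
    if href == "extra.xml" and parse == "xml":
        return ET.fromstring(EXTRA)
    raise OSError("unknown resource: " + href)


def items(root):
    """Return the text of every <item> element in document order."""
    return [e.text for e in root.iter("item")]


# View 1: no pre-processing; the xi:include element is left alone.
raw = ET.fromstring(DOC)

# View 2: XInclude expanded before any query is applied.
expanded = copy.deepcopy(raw)
ElementInclude.include(expanded, loader=loader)

print(items(raw))       # ['local']
print(items(expanded))  # ['local', 'included']
```

A transformation applied to the first tree and one applied to the
second would see different input, which is why I think the choice of
pre-processing needs to be stated rather than left to the implementation.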

> It seems, since
> the XML Spec stack has grown, there is no normative way to 
> determine this,
> and so the question is much more complex than just 
> the interaction 
> of XIncludes (for example, what about DTD or Schema 
> validation, and their 
> interaction with XIncludes?). Therefore, our reliance on the XML 
> Processing Model WG, and we have also in the past before 
> Last Call asked the XQuery and XSL WG for advice.

The GRDDL spec mentions XProc, but does not indicate any dependency on
it.  If such a dependency is intended, it would be helpful to clarify
exactly what the dependency is and how it fits into GRDDL, as described
in issue-dbooth-3, point 2:
http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0078.html

> 
> So, while I heavily sympathize with David's concern it seems 
> this problem 
> of being able to define preprocessing for an XML parser in 
> general belongs 
> in the domain of the TAG of the XML Processing Model WG, not 
> the GRDDL WG per se.
> 
> Instead of remaining silent on the issue, Murray wrote a 
> warning bringing 
> this up and encouraging GRDDL transformations to take this 
> into account.

I appreciate the intent, but it does not solve the problem, and as this
thread has pointed out, the advice given by the spec (quoted below) is
not even possible to follow.

> 
> In other words, any alternative (and again, exact text would 
> be great) 
> would require exactly what one means when one says "The GRDDL 
> Agent should 
> not perform any preprocessing". To me this statement is also 
> underdefined, 
> as one has to get to an XPath node somehow and those steps are 
> underdefined and in practice can be varied.
> 
> >> step in an XProc transformation could be 'delete all xincludes'.
> >> So, you can be quite explicit about the policy that you want
> >> to implement in
> >> an XProc XML Pipeline transformation.
> >>
> >> However, if the expansion has already happened -- because,
> >> for example, local
> >> policy requires expansion of all xincludes as documents go
> >> through a local proxy, then you are out of luck.
> >
> > Right.  So regarding the following advice in sec 6:
> > http://www.w3.org/TR/grddl/#txforms
> > [[
> > Therefore, it is suggested that GRDDL transformations be 
> > written so that
> > they perform all expected pre-processing, including processing of
> > related DTDs, Schemas and namespaces.
> > ]]
> > it sounds like you would agree with my conclusion that this 
> > advice is
> > untenable in this case, because it is not possible to write 
> > a transform
> > that reliably prevents xi:include from being processed.
> 
> I am again going to point out this is a problem with XML not having its 
> preprocessing (or processing model in general) defined, and so this 
> problem is not unique to GRDDL. 

No, my comment here is about the above advice given in the GRDDL spec --
not about the pre-processing problem in general.  This advice is unique
to the GRDDL spec.  My intent in this thread was merely to confirm my
suspicion that this particular advice is impossible to follow, as my
example illustrated.

> However, XInclude is a special case of 
> "preprocessing, including processing of related DTDs, Schemas, and 
> namespaces." And if one forbids explicitly XInclude 
> expansion, is one also 
> forbidding DTD or Schema validation, or other forms of 
> preprocessing? 

Yes, of course.  XInclude was merely one convenient example.  My
suggestion was that, given an XML instance document (i.e., a
"representation"), a GRDDL transformation would be responsible for
specifying *all* processing required, in order to be unambiguous.

> The 
> question is tricky and the GRDDL WG has taken so far a 
> conservative route, 
> but one that is indeed coherent.
> 
> ...>

Yes, I think it is coherent, and it is obvious that significant thought
went into it -- I very much like the way the normative rules are
explicitly called out and formalized, BTW -- but I think the spec is
biased toward applications that can afford to be somewhat loose about a
document's semantics.  For a spec of this nature that may be used to
expose the semantics of *any* XML document type for any application, I
think it is essential that the spec more precisely define how the
intended GRDDL results for a given XML instance document should be
determined, as explained in issue-dbooth-3, point 1:
http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0078.html

David Booth, Ph.D.
HP Software
+1 617 629 8881 office  |  dbooth@hp.com
http://www.hp.com/go/software
 

Received on Tuesday, 29 May 2007 21:15:30 UTC