RE: How are correct, unambiguous results possible with implementation-defined XML pre-processing? from Booth, David (HP Software - Boston) on 2007-05-25 (public-grddl-wg@w3.org from May 2007)

From: Booth, David (HP Software - Boston) <dbooth@hp.com>
Date: Fri, 25 May 2007 17:00:00 -0400
To: "Murray Maloney" <murray@muzmo.com>
Cc: <public-grddl-wg@w3.org>
Message-ID: <EBBD956B8A9002479B0C9CE9FE14A6C202B16699@tayexc19.americas.cpqcorp.net>

Murray,

Thanks for your reply.  Detailed responses below.

> From: Murray Maloney [mailto:murray@muzmo.com] 
> 
> At 05:28 PM 5/23/2007 -0400, Booth, David (HP Software - Boston) wrote:
> 
> >Hi,
> >
> >This is intended as a question rather than a formal comment, and I'm
> >asking it as an individual -- not specifically representing HP.
> >
> >I have been quite puzzled about one aspect of the GRDDL spec, and I'm
> >wondering if someone could shed some light on it.  The spec says:
> >http://www.w3.org/2004/01/rdxh/spec#txforms
> >[[
> >This specification is purposely silent on the question of which XML
> >processors are employed by or for GRDDL-aware agents. Whether or not
> >processing of XInclude, XML Validity, XML Schema Validity, XML
> >Signatures or XML Decryption take place is implementation-defined. There
> >is no universal expectation that an XSLT processor will call on such
> >processing before executing a GRDDL transformation. Therefore, it is
> >suggested that GRDDL transformations be written so that they perform all
> >expected pre-processing, including processing of related DTDs, Schemas
> >and namespaces.
> >]]
> 
> I wrote that.
> 
> First of all, we cannot require that all GRDDL-aware agents must perform
> specific pre-processing, such as XInclude or DTD Validation.
> That would be too much of a burden on implementations.

Although I do not agree with this rationale, I agree with the
conclusion, because as I pointed out, if a GRDDL-aware agent were
required to perform XInclude pre-processing, that would prevent anyone
from writing a GRDDL transformation that does not want XIncludes to be
processed.

> 
> Secondly, we expect that early transformations will be written using
> XSLT 1 & 2. So, we cannot require transformations to perform XInclude
> or validation.

But the spec could provide a way for a GRDDL transformation to specify
what pre-processing should occur prior to invoking the XSLT script.

> 
> Thirdly, we expect that some GRDDL-aware agents and transformations
> will be able to perform preprocessing, such as XInclude and validation.
> So we cannot stipulate that no preprocessing is allowed or that
> transformations must not validate or use XInclude.

I'm not sure what you mean here.  Just because an agent is *able* to
perform pre-processing, that doesn't necessarily mean that it cannot be
turned off.  And if it cannot be turned off then, as my example showed,
it is not possible to write a GRDDL transformation that requires such
pre-processing to be turned off (such as not performing XIncludes).

> 
> All of this means that the infoset that a GRDDL-aware agent and
> transformation have available to them may differ between instantiations
> of the GRDDL-aware agent.

I'm not at all convinced that that is a necessary conclusion.

> 
> Now that you know that, you have to think about how you write your
> document and your transformation and the environment in which it will
> be run.

But: (a) the GRDDL transformation author has no control over the
environment in which the GRDDL transformation will be run; and (b) the
GRDDL transformation author also may have little or no control over how
the document is written.  I think the only reasonable assumptions to
make are: (a) the GRDDL transformation author *knows* the intent of the
document (i.e., its intended semantics); and (b) the document is somehow
able to indicate its desired GRDDL transformations, either directly
through grddl:transformation tags, or indirectly through namespace or
profile documents.

>  
> 
> >Specifically, if:
> >  - the GRDDL spec allows the XML pre-processing to be
> >    implementation defined; and
> >  - an XML pre-processor automatically expands xincludes
> >    (for example); and
> >  - I have a document that uses xinclude; and
> >  - I wish to write a GRDDL transformation that does NOT want the
> >    xinclude to be expanded;
> >then I do not see how it is possible for me to write such a
> >transformation, regardless of what XProc or any other spec may say.
> 
> If you want a policy that forbids expansion of XIncludes, then don't
> publish XIncludes. If you use XIncludes in your original document, then
> a GRDDL-aware agent has sufficient authority to expand them.

No.  The point is that if the semantics of a document are determined by
the root element namespace -- and I believe that is a given -- then
xi:include means "insert this document now" only if the root element
namespace document says it does.  In the example I gave, it does not,
because my root element namespace defines a <myns:quote> tag that
effectively quotes everything inside it.
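
To make the quoting semantics concrete, here is a minimal Python sketch
of what quote-aware processing could look like, using only the standard
library.  The namespace URI for myns, the stub loader, and the function
name expand_outside_quotes are assumptions made purely for illustration;
none of them comes from the GRDDL spec or from my actual namespace
document.

```python
import copy
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
QUOTE = "{http://example.org/myns}quote"  # hypothetical namespace URI

DOC = """<myns:myDoc xmlns:myns="http://example.org/myns"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="http://example.org/expand-me"/>
  <myns:quote>
    <xi:include href="http://example.org/do-not-expand"/>
  </myns:quote>
</myns:myDoc>"""


def stub_loader(href):
    # Hypothetical stand-in for fetching the referenced document.
    return ET.Element("included", {"from": href})


def expand_outside_quotes(elem, loader):
    """Replace xi:include children in place, never descending into myns:quote."""
    for i, child in enumerate(list(elem)):
        if child.tag == QUOTE:
            continue  # quoted content is data; leave it verbatim
        if child.tag == XI_INCLUDE:
            node = copy.deepcopy(loader(child.get("href")))
            node.tail = child.tail
            elem[i] = node
        else:
            expand_outside_quotes(child, loader)


root = ET.fromstring(DOC)
expand_outside_quotes(root, stub_loader)

# Only the xi:include inside myns:quote should survive.
remaining = [e.get("href") for e in root.iter(XI_INCLUDE)]
print(remaining)
```

This only works if the document reaches such code with its xi:include
elements still intact, which is exactly what an implementation-defined
pre-processing step cannot promise.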

> 
> The first step in an XProc transformation could be 'delete all xincludes'.
> So, you can be quite explicit about the policy that you want to implement
> in an XProc XML Pipeline transformation.
>
> However, if the expansion has already happened -- because, for example,
> local policy requires expansion of all xincludes as documents go through
> a local proxy -- then you are out of luck.

Right.  So regarding the following advice in sec 6:
http://www.w3.org/TR/grddl/#txforms
[[
Therefore, it is suggested that GRDDL transformations be written so that
they perform all expected pre-processing, including processing of
related DTDs, Schemas and namespaces.
]]
it sounds like you would agree with my conclusion that this advice is
untenable in this case, because it is not possible to write a transform
that reliably prevents xi:include from being processed.

> 
> 
> >If we assume that (a) there are existing XML documents that require
> >arbitrary kinds and sequences of pre-processing; and (b) we wish to
> >allow a GRDDL transformation to be written for any such XML document;
> >and (c) we wish to allow such transformation to be unambiguous (i.e.,
> >producing the same results for any implementation, given the same
> >security policy and resource access) and reliably produce correct
> >results; then I do not see how it is possible to write such a
> >transformation.
> 
> See XProc at http://www.w3.org/TR/xproc/

Yes, I have looked at XProc, and XProc may provide a good way to write
GRDDL transformations, but AFAICT it cannot solve this specific issue.
Because if the GRDDL spec permits the pre-processing to be
implementation defined, and the GRDDL transformation does not get
control until *after* that pre-processing has occurred, the damage may
already have been done.  For example, the pre-processing may have
already performed XIncludes that the GRDDL transformation did not want
performed.
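
As a rough illustration of that ordering problem, the following Python
sketch runs the same "transformation" behind two hypothetical agents:
one that passes the document through untouched, and one that expands
XIncludes first.  It uses the standard library's
xml.etree.ElementInclude as a stand-in for the agent's pre-processor.
The myns namespace URI, the stub loader, and quoted_literal are
assumptions for the sake of the example, not anything defined by GRDDL.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_NS = "http://www.w3.org/2001/XInclude"
MYNS = "http://example.org/myns"  # hypothetical namespace URI

DOC = f"""<myns:myDoc xmlns:myns="{MYNS}" xmlns:xi="{XI_NS}">
  <myns:quote>
    <xi:include href="http://example.org/do-not-expand"/>
  </myns:quote>
</myns:myDoc>"""


def stub_loader(href, parse, encoding=None):
    # Hypothetical stand-in for fetching the referenced document.
    return ET.fromstring("<included>fetched content</included>")


def quoted_literal(root):
    """The 'transformation': serialize the children of myns:quote verbatim."""
    quote = root.find(f"{{{MYNS}}}quote")
    return "".join(ET.tostring(child, encoding="unicode") for child in quote)


# Agent A hands the document to the transformation untouched.
agent_a = ET.fromstring(DOC)

# Agent B expands XIncludes before the transformation gets control.
agent_b = ET.fromstring(DOC)
ElementInclude.include(agent_b, loader=stub_loader)

literal_a = quoted_literal(agent_a)
literal_b = quoted_literal(agent_b)

print("do-not-expand" in literal_a)  # True
print("do-not-expand" in literal_b)  # False
print(literal_a == literal_b)        # False
```

Behind agent B the transformation never sees an xi:include element, so
nothing it does afterwards can recover the original content.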

> 
> 
> >For example, suppose either: (a) the XML pre-processing is left to the
> >implementation's discretion; or (b) the XProc or any other spec later
> >"clarifies" the GRDDL spec to require the XML pre-processing to be any
> >particular sequence at all other than no pre-processing.  And further
> >suppose that my schema includes blocks of XML code from other documents,
> >and I define a <myns:quote> tag to prevent the embedded chunks of XML
> >from being interpreted, and suppose that one of those embedded chunks
> >uses xinclude:
> >
> ><myns:myDoc . . . >
> >    <myns:quote>
> >       <otherNs:whatever>
> >          <xi:include href="http://example.org/do-not-expand" />
> >       </otherNs:whatever>
> >    </myns:quote>
> ></myns:myDoc>
> >
> >For the purpose of this example (and without loss of generality),
> >further suppose that one of the pre-processing steps that is either
> >permitted or later required is to expand xi:include tags, by including
> >the referenced document.  I wish to write my GRDDL transform such that
> >the entire chunk of XML inside the <myns:quote> element is supposed to
> >become the value of an RDF property *verbatim*, without expanding the
> >xi:include directive.  But if the XML parser is permitted to expand the
> >xi:include directive, before my GRDDL transformation even sees it, then
> >I do not see any way to write my transformation such that it always
> >produces the correct results.  In other words, short of superseding the
> >GRDDL spec with GRDDL 2.0, I do not see how XProc or any other spec can
> >solve this problem.
> 
> It cannot. Nobody can guarantee that XInclude will not be used before
> the GRDDL-aware agent sees it.

Okay, so again it sounds like you would agree that the advice given in
sec 6 -- that "GRDDL transformations be written so that they perform all
expected pre-processing" -- is untenable in this case.

> 
> >The only way out of this dilemma that I can see is for the GRDDL spec to
> >declare that the XML parser must do NO pre-processing, so that the GRDDL
> >transformation *can* specify whatever processing the semantics of that
> >particular document type require.
> 
> By and large, the spec supports nominal idempotence of the GRDDL-aware
> agent. However, it does not impede the existence of local policy
> preferences for user agents. This is a good thing. An organization may
> wish to ensure that xinclude is either filtered out or expanded depending
> on security clearances. Another may wish to ensure that DTD or Schema
> validation is always performed so that no data is lost. Another may
> prefer to filter out boilerplate triples from DTDs or Schemas.

This is fine, because in this case the GRDDL-aware agent is making a
conscious choice to do so, and has chosen to accept the consequences,
which include the possibility of producing output that is not what the
GRDDL transformation author intended.  I'm not concerned about this
case.  

> 
> 
> >I don't want to raise this as a formal issue if I'm simply
> >misunderstanding something, but thus far I have not been able to figure
> >this out.  And since I see GRDDL as the cornerstone to bridging the
> >worlds of XML and RDF, and since GRDDL may last a *long* time -- note
> >that both XML and RDF have been around for several years without being
> >superseded, and I don't see any plans to supersede them on the horizon
> >-- this question seems quite important and relevant to me.
> >
> >Can anyone shed some light on this?
> 
> The spec is clear about the potential for gotchas. But we are better off
> not defining any preprocessing and leaving it up to the GRDDL-aware agent
> to sort out policy.

I'm not concerned about cases where the GRDDL-aware agent *knowingly*
chooses to produce output that is not what the GRDDL author intended.
(For example, if it chooses not to perform certain transformations for
security reasons, or it chooses to do different processing than the
transformation specified.)  I'm concerned about cases of *unintended*
ambiguity or *unknowingly* producing incorrect output.

> 
> Triples might emerge under some policies that would not otherwise be
> evident. Yes, it is true that some people may write transformations that
> fail to work as expected in environments where different default
> processing obtains. "Faithful infoset" is highlighted and explained so
> that expectations can be set appropriately.
>
> Consider a more secure environment, such as a hospital or a military
> command: a GRDDL transformation might be designed to yield bogus triples
> unless the GRDDL-aware agent is capable of performing an xinclude or
> fetching a DTD that is protected by security measures.
> 
> "Faithful infoset" may seem like a bug or a glaring hole in the spec,
> but if you look at it just right, it is a feature.

I assume you meant "Faithful Renditions":
http://www.w3.org/TR/grddl/#sec_rend
I agree with the "Faithful Renditions" concept.  I do not see it as a
hole or bug at all.  But I also do not see it as any justification for
the spec to permit unintended ambiguity, especially because a key
purpose of expressing semantics in RDF is to make them unambiguous.

Thanks,
David Booth
Received on Friday, 25 May 2007 21:00:39 UTC