Re: How are correct, unambiguous results possible with implementation-defined XML pre-processing? from Murray Maloney on 2007-05-24 (public-grddl-wg@w3.org from May 2007)

From: Murray Maloney <murray@muzmo.com>
Date: Thu, 24 May 2007 14:19:17 -0400
To: "Booth, David (HP Software - Boston)" <dbooth@hp.com>
Cc: <public-grddl-wg@w3.org>
Message-Id: <5.1.1.6.2.20070524000639.05f25bc8@mail.muzmo.com>
At 05:28 PM 5/23/2007 -0400, Booth, David (HP Software - Boston) wrote:

>Hi,
>
>This is intended as a question rather than a formal comment, and I'm
>asking it as an individual -- not specifically representing HP.
>
>I have been quite puzzled about one aspect of the GRDDL spec, and I'm
>wondering if someone could shed some light on it.  The spec says:
>http://www.w3.org/2004/01/rdxh/spec#txforms
>[[
>This specification is purposely silent on the question of which XML
>processors are employed by or for GRDDL-aware agents. Whether or not
>processing of XInclude, XML Validity, XML Schema Validity, XML
>Signatures or XML Decryption take place is implementation-defined. There
>is no universal expectation that an XSLT processor will call on such
>processing before executing a GRDDL transformation. Therefore, it is
>suggested that GRDDL transformations be written so that they perform all
>expected pre-processing, including processing of related DTDs, Schemas
>and namespaces.
>]]

I wrote that.

First of all, we cannot require that all GRDDL-aware agents must perform
specific pre-processing, such as XInclude or DTD Validation. That would
be too much of a burden on implementations.

Secondly, we expect that early transformations will be written using
XSLT 1 & 2. So, we cannot require transformations to perform XInclude or
validation.

Thirdly, we expect that some GRDDL-aware agents and transformations will be
able to perform preprocessing, such as XInclude and validation. So we cannot
stipulate that no preprocessing is allowed or that transformations must not
validate or use XInclude.

All of this means that the infoset that a GRDDL-aware agent and
transformation have available to them may differ between instantiations of
the GRDDL-aware agent.
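
As a rough illustration only (not part of the spec), the Python sketch below
parses one document twice, once leaving xi:include alone and once expanding
it, to show that a transformation could be handed two different trees. The
document text, the "part.xml" href and the in-memory fake_loader are all
assumptions made up for the example; a real agent would resolve the href
according to its own policy.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical source document; "part.xml" is an invented href.
SOURCE = (
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="part.xml"/>'
    '</doc>'
)


def fake_loader(href, parse, encoding=None):
    """Stand-in resolver (an assumption): returns fixed content for any href."""
    if parse == "xml":
        return ET.fromstring("<part>included text</part>")
    return "included text"


def infoset_without_xinclude(text):
    """What the transformation sees if the agent does no pre-processing."""
    return ET.fromstring(text)


def infoset_with_xinclude(text):
    """What the transformation sees if the agent expands XInclude first."""
    root = ET.fromstring(text)
    ElementInclude.include(root, loader=fake_loader)
    return root


def child_tags(root):
    """List the tags of the root's direct children."""
    return [child.tag for child in root]


plain = child_tags(infoset_without_xinclude(SOURCE))
expanded = child_tags(infoset_with_xinclude(SOURCE))
print(plain)
print(expanded)
```

Here plain still contains the xi:include element, while expanded contains
the part element that replaced it, so the same transformation would start
from different input under the two policies.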

Now that you know that, you have to think about how you write your document
and your transformation and the environment in which it will be run.



>Specifically, if:
>  - the GRDDL spec allows the XML pre-processing to be implementation
>defined; and
>  - an XML pre-processor automatically expands xincludes (for example);
>and
>  - I have a document that uses xinclude; and
>  - I wish to write a GRDDL transformation that does NOT want the
>xinclude to be expanded;
>then I do not see how it is possible for me to write such a
>transformation, regardless of what XProc or any other spec may say.

If you want a policy that forbids expansion of XIncludes, then don't
publish XIncludes. If you use XIncludes in your original document, then a
GRDDL-aware agent has sufficient authority to expand them.

The first step in an XProc transformation could be 'delete all xincludes'.
So, you can be quite explicit about the policy that you want to implement in
an XProc XML Pipeline transformation.
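
To make the 'delete all xincludes' idea concrete, here is a minimal sketch
in Python. It is a stand-in for such a first pipeline step, not XProc
syntax, and the function name delete_xincludes is hypothetical. It removes
every xi:include element from a parsed tree and keeps the text that
followed each one.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def delete_xincludes(root):
    """Remove every xi:include element in place and return the root.

    Text that followed a removed element (its tail) is kept by attaching
    it to the previous sibling, or to the parent's text if there is none.
    """
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == XI_INCLUDE:
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return root


# Hypothetical input document, invented for the example.
doc = ET.fromstring(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<a>x<xi:include href="p.xml"/>y</a>'
    '<xi:include href="q.xml"/>'
    '</doc>'
)
delete_xincludes(doc)
print(ET.tostring(doc, encoding="unicode"))
```

A step like this only helps if it runs on the unexpanded document.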

However, if the expansion has already happened -- because, for example,
local policy requires expansion of all XIncludes as documents go through a
local proxy -- then you are out of luck.


>If we assume that (a) there are existing XML documents that require
>arbitrary kinds and sequences of pre-processing; and (b) we wish to
>allow a GRDDL transformation to be written for any such XML document;
>and (c)  we wish to allow such transformation to be unambiguous (i.e.,
>producing the same results for any implementation, given the same
>security policy and resource access) and reliably produce correct
>results; then I do not see how it is possible to write such a
>transformation.

See XProc at http://www.w3.org/TR/xproc/


>For example, suppose either: (a) the XML pre-processing is left to the
>implementation's discretion; or (b) the XProc or any other spec later
>"clarifies" the GRDDL spec to require the XML pre-processing to be any
>particular sequence at all other than no pre-processing.  And further
>suppose that my schema includes blocks of XML code from other documents,
>and I define a <myns:quote> tag to prevent the embedded chunks of XML
>from being interpreted, and suppose that one of those embedded chunks
>uses xinclude:
>
><myns:myDoc . . . >
>    <myns:quote>
>       <otherNs:whatever>
>          <xi:include href="http://example.org/do-not-expand" />
>       </otherNs:whatever>
>    </myns:quote>
></myns:myDoc>
>
>For the purpose of this example (and without loss of generality),
>further suppose that one of the pre-processing steps that is either
>permitted or later required is to expand xi:include tags, by including
>the referenced document.  I wish to write my GRDDL transform such that
>the entire chunk of XML inside the <myns:quote> element is supposed to
>become the value of an RDF property *verbatim*, without expanding the
>xi:include directive.  But if the XML parser is permitted to expand the
>xi:include directive, before my GRDDL transformation even sees it, then
>I do not see any way to write my transformation such that it always
>produces the correct results.  In other words, short of superceding the
>GRDDL spec with GRDDL 2.0, I do not see how XProc or any other spec can
>solve this problem.

It cannot. Nobody can guarantee that XInclude will not be applied before the
GRDDL-aware agent sees the document.

>The only way out of this dilemma that I can see is for the GRDDL spec to
>declare that the XML parser must do NO pre-processing, so that the GRDDL
>transformation *can* specify whatever processing the semantics of that
>particular document type require.

By and large, the spec supports nominal idempotence of the GRDDL-aware agent.
However, it does not impede the existence of local policy preferences for
user agents. This is a good thing. An organization may wish to ensure that
XInclude is either filtered out or expanded, depending on security
clearances. Another may wish to ensure that DTD or Schema validation is
always performed so that no data is lost. Another may prefer to filter out
boilerplate triples from DTDs or Schemas.


>I don't want to raise this as a formal issue if I'm simply
>misunderstanding something, but thus far I have not been able to figure
>this out.  And since I see GRDDL as the cornerstone to bridging the
>worlds of XML and RDF, and since GRDDL may last a *long* time -- note
>that both XML and RDF have been around for several years without being
>superceded, and I don't see any plans to supercede them on the horizon
>-- this question seems quite important and relevant to me.
>
>Can anyone shed some light on this?

The spec is clear about the potential for gotchas. But we are better off
not defining any preprocessing and leaving it up to the GRDDL-aware agent to
sort out policy.

Triples might emerge under some policies that would not otherwise be evident.
Yes, it is true that some people may write transformations that fail to
work as expected in environments where different default processing obtains.
"Faithful infoset" is highlighted and explained so that expectations can be
set appropriately.

Consider a more secure environment, such as a hospital or a military command:
a GRDDL transformation might be designed to yield bogus triples unless the
GRDDL-aware agent is capable of performing an XInclude or fetching a DTD
that is protected by security measures.

"Faithful infoset" may seem like a bug or a glaring hole in the spec,
but if you look at it just right, it is a feature.
Received on Thursday, 24 May 2007 18:42:11 UTC