Re: issue-dbooth-3: Ambiguity in an XML document's intended GRDDL results from Harry Halpin on 2007-05-29 (public-grddl-comments@w3.org from April to June 2007)

From: Harry Halpin <hhalpin@ibiblio.org>
Date: Tue, 29 May 2007 14:17:39 -0400 (EDT)
To: "Booth, David (HP Software - Boston)" <dbooth@hp.com>
Cc: public-grddl-comments@w3.org, Jeremy Carroll <jjc@hpl.hp.com>, "McBride, Brian" <brian.mcbride@hp.com>
Message-ID: <Pine.LNX.4.64.0705291305350.20649@tribal.metalab.unc.edu>
I sympathize with the general line of comments, but do not see how GRDDL 
can remain WebArch compliant and not specify its own XML processing model.

  On Tue, 29 May 2007, Booth, David (HP Software - Boston) wrote:

[snip]
>
>
> Definition: By "XML instance document" I am referring to a concrete
> "representation" in the TAG WebArch sense -- not an "information
> resource".

Have you seen our test case document [1]? Again, many of these issues are 
dealt with explicitly in the test case document.

In particular, see the following section

> POINT 1: For any XML instance document, to the extent possible, the
> GRDDL spec should make it clear exactly what are the intended GRDDL
> results for that XML instance document.   Two implementations faithfully
> implementing the GRDDL spec should come to the same conclusions about
> what those intended GRDDL results should be, i.e., there should be no
> ambiguity.
>
> I do not think the GRDDL specification should be considered finished
> until the spec makes this clear, given that:
> - GRDDL is the cornerstone for bridging the worlds of XML and RDF.
> - A key purpose in expressing semantics in RDF is to make them
> *unambiguous*.
> - GRDDL is on track to become a W3C Recommendation.
> - GRDDL may have quite a long life.  Both XML and RDF have been around
> for several years with little change, and show no signs of being
> replaced.  I see no reason why GRDDL should not have a similar lifespan.

I agree. But XML has remained around with preprocessing indeterminacy for 
quite a long time and has been useful, and XSLT is Turing complete and not 
deterministic, yet has also proven to be useful and have a long life.

> POINT 2: At present, it is not clear what is the view of the Working
> Group (WG) toward ambiguity in an XML document's intended GRDDL results,
> i.e., whether the WG believes:
>
>  a. it is a problem, but we do not know a solution;
>  b. it is a problem now, but we expect the problem to go away
>     when the XProc or some other spec is completed; or
>  c. the WG does not consider it a problem.
>
> I would vehemently object to position c, for the reasons above.  In the
> case of position a, I believe there *are* ways to reduce or eliminate
> such unintended ambiguity, and I will be happy to suggest ways to do so.
> In the case of position b, I think it is important that the WG make
> clear exactly *how* XProc or some other spec is intended to make the
> problem go away, and indicate that in the spec.  At present, the spec
> explicitly allows the intended results to be implementation defined,
> which IMO is unacceptable for a spec of this kind.

The spec is not ambiguous, and neither are the test cases. However, they are not
deterministic across implementations in precisely the cases you describe.
I also see you have not responded to my previous email regarding the lack 
of determinism built into XML [2].


> POINT 3: The spec needs to define a notion of "complete GRDDL results"
> for a given XML instance document.  It is good that the specification
> describes how partial GRDDL results can be determined, because partial
> results may be adequate for many applications.  But the spec also needs
> to clearly define what  constitutes the *complete* GRDDL results
> indicated by a given XML instance document, i.e., all and only the
> intended GRDDL results for all GRDDL transformations indicated by that
> XML instance document.
>
> This is particularly important in supporting applications in which GRDDL
> is used to express the *entire* semantics of an XML instance document,
> such as a messaging application as described in issue-dbooth-9a,
> http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0069.html
> i.e., where custom XML document types are created or treated as custom
> serializations of RDF, as described in
> http://dbooth.org/2007/rdf-and-soa/rdf-and-soa-paper.htm .
> One must be able to say with clarity: "For this XML instance document,
> the complete GRDDL results are intended to be precisely the following
> RDF triples -- no more and no less."

Given that GRDDL is a client-side process that may rely upon accessing
namespace or profile documents, if the author of an XML document wants to
exchange exact and complete RDF representations of the same resource,
should they not simply use content negotiation to serve a representation
as RDF to begin with?

> (Note that the spec currently defines GRDDL results in relation to
> information resources rather than XML instance documents (i.e.,
> representations), and this is needed for namespace and profile URIs, but
> it is not sufficient.  GRDDL results *also* need to be defined in terms
> of XML instance documents (i.e., representations), because as pointed
> out in issue-dbooth-9a,
> http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0069.html ,
> it *always* makes sense to talk about the GRDDL results of an
> XML instance document, but it does *not* always make sense to talk about
> the GRDDL results of an information resource.)

Again, see the test-cases [3]. It does make sense to talk about the GRDDL
results of an information resource, as it may just be the merge of GRDDL 
results done for each representation the information resource serves.
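
As a rough illustration of that merge, here is a minimal Python sketch. The
triples, the media types, and the merge_results helper are all hypothetical
and are not taken from the spec or the test cases. It assumes ground triples
only, so that a plain set union can stand in for an RDF merge (blank nodes
would need to be kept apart first).

```python
from typing import Dict, FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical per-representation GRDDL results for one information
# resource, keyed by the media type a client might negotiate for.
RESULTS_BY_REPRESENTATION: Dict[str, FrozenSet[Triple]] = {
    "application/xhtml+xml": frozenset({
        ("http://example.org/doc", "http://purl.org/dc/elements/1.1/title", "Example"),
    }),
    "application/xml": frozenset({
        ("http://example.org/doc", "http://purl.org/dc/elements/1.1/title", "Example"),
        ("http://example.org/doc", "http://purl.org/dc/elements/1.1/creator", "A. Author"),
    }),
}


def merge_results(results: Dict[str, FrozenSet[Triple]]) -> Set[Triple]:
    """Union the ground triples obtained from every representation."""
    merged: Set[Triple] = set()
    for triples in results.values():
        merged |= triples
    return merged


merged = merge_results(RESULTS_BY_REPRESENTATION)
```

A client that only ever sees one representation gets a subset of this merge,
which is the sense in which multiple results can each be acceptable.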

> Tellingly, I notice that the WG has routinely been using an implicit
> concept of the complete GRDDL results (though not using this term) when
> discussing and comparing test results, for example when two testers talk
> about whether they got "the same" results for a particular test case.

Except in the test cases for multiple representations and multiple 
infosets, which have been explicitly described and discussed by the WG.
The spec is not ambiguous about what is acceptable, and neither are the
testcases. The spec simply says _multiple results_ may be acceptable and 
are compatible with WebArch. This may be unfortunate for some usecases, in 
which case these usecases should not rely on the Web.

I honestly cannot see how we can mandate that all GRDDL transforms be
complete, given the indeterminacy of the XML core specs regarding
preprocessing and WebArch content negotiation (and, furthermore, given that
XSLT is Turing-complete, so authors could perversely include random number
generation [4], as could other programming languages used by GRDDL
transforms), without making GRDDL incompatible with WebArch by banning the
use of URIs and without GRDDL making decisions that are in the domain of
the W3C XML Activity.

> Furthermore, the algorithm given in sec 7 of the GRDDL spec
> http://www.w3.org/2004/01/rdxh/spec#sec_agt
> describes most of the process needed to determine the complete GRDDL
> results for a particular XML instance document, but:
> - it does not define a conformance term for people to use;

The WG decided to only use conformance terms as regards security. What 
precise conformance term, with what precise definition, do you want added?

> - it is defined in terms of a URI as a starting point, which introduces
> much more ambiguity than being defined in terms of an XML instance
> document as the starting point;

If we do not define a URI as a starting point, what would you have us
use? It seems to me WebArch requires us to use URIs with schemes such as
http and to cope with the possibility of conneg. There is, however,
nothing preventing a client from retrieving a particular representation
and using the "file" scheme. However, to prevent GRDDL from using http
URIs would break WebArch.

> - it is intended for describing partial GRDDL results; and
> - more needs to be nailed down to define the notion of complete GRDDL
> results.

Does the text describing "maximal" results [1] not satisfy you? If so,
can you clarify exactly how one can both use URIs and be WebArch enabled with
content negotiation and have "complete" GRDDL results? As usual, text that
you believe can be added or test-cases are appreciated.

> Namespace and profile information URIs make it much more difficult to
> define the notion of complete GRDDL results, because there is no
> guarantee that the GRDDL processor is able to retrieve the correct
> namespace or profile representation that specifies all of the intended
> grddl:namespaceTransformations or grddl:profileTransformations that the
> author intended should be applied. However, this difficulty can be
> overcome by adding something to the Faithful Renditions section to the
> effect that:
>
>  "By specifying a GRDDL namespace transformation or profile
>  transformation in a representation of a namespace or profile
>  information resource, the creator of that namespace or
>  profile states that every other representation of that same
>  information resource that also specifies a GRDDL namespace
>  transformation or profile transformation is functionally
>  equivalent."

Again, with conneg and XML indeterminacy this cannot be guaranteed.

> If desired, I can describe in more detail how this can be done.

If you can specify exactly what XML preprocessing entails, please do, and 
respond in detail to my message[2].

> This approach will work when namespace and profile documents have
> representations available that define GRDDL transformations.  But many
> XML instance documents will need to make use of namespaces or profile
> documents that will not have such representations available, and since
> the dependency for defining complete GRDDL results is recursive through
> all namespace and profile documents, it seems likely that in many cases
> this approach will be infeasible.  Therefore, the GRDDL spec should also
> define a short-cut mechanism to allow an XML instance document to
> specify, for example, a grddl:completeTransformation attribute whose
> presence would indicate that namespace and profile documents do *not*
> need to be processed in order to determine the complete GRDDL results.

Yet one can never guarantee the namespace doc or profile doc will be 
there. It seems like if certain transforms are not wanted by the author,
they should not be specified. The only way one could have complete GRDDL 
results in this manner would be to guarantee the presence of the complete 
namespace and profile docs, which cannot be done. How can you specify the 
completeProfileTransformation will be accessible?


> To cover xhtml document types that cannot contain
> grddl:completeTransformation annotations directly, this approach *could*
> also be extended by defining a grddl:completeProfileTransformation
> property whose presence would have a similar effect of saying: "there is
> no need to look at any other profile documents".  However it may be less
> important to know the complete GRDDL results for xhtml documents than it
> is for XML documents in general, so such an attribute may not be
> necessary.
>
> POINT 4: The Faithful Rendition section is excellent for making clear
> how the semantics of GRDDL results should be interpreted.  However, I
> will note that its intent is somewhat unclear, as it could mean either
> or both of:
>
> - The RDF results of a GRDDL transformation reflect real-life semantics
> of the input XML instance document, however these semantics may be a
> subset of the full semantics of that document.  (In essence, they are
> whatever subset of the full semantics the GRDDL transformation author
> has chosen to expose via GRDDL.)
>
> - GRDDL results for a given XML instance document may be ambiguous
> (implementation defined), and it is the GRDDL transformation author's
> responsibility to anticipate this ambiguity and ensure that the results
> reflect real-life semantics of the input XML instance document anyway.

I believe it means both, and I cannot see how one can exclude the
second interpretation without restricting the client, since complete GRDDL
results may violate their local policy, and without making unreasonable
assumptions about the accessibility of namespace or profile docs and
banning use of conneg, and so, many URIs.

> I like the first interpretation, and I consider that as a feature of the
> spec.  I do not like the second  -- and I view it as a bug in the spec
> -- because it merely foists the ambiguity problem off to the GRDDL
> transformation author, and as I point out below, AFAICT it is not even
> *possible* for the GRDDL transformation author to always write
> transformations that produce correct, unambiguous results.
>
> POINT 5: In discussing the Faithful Rendition assurance, Section 6
> explicitly says: "Therefore, it is suggested that GRDDL transformations
> be written so that they perform all expected pre-processing . . . .".
> But if the GRDDL transformation requires a particular sequence of
> pre-processing, or it requires there to be *no* pre-processing, then
> AFAICT it is not possible for the transformation author to control this
> if pre-processing is explicitly permitted to be arbitrarily chosen by
> the implementation before the GRDDL transformation ever sees the input.
>
> For example, suppose my schema includes blocks of XML code from other
> documents, and I define a <myns:quote> tag to prevent the embedded
> chunks of XML from being interpreted, and suppose that one of those
> embedded chunks uses xinclude:
>
> <myns:myDoc . . . >
>   <myns:quote>
>      <otherNs:whatever>
>         <xi:include href="http://example.org/do-not-expand" />
>      </otherNs:whatever>
>   </myns:quote>
> </myns:myDoc>
>
> When this document is GRDDL transformed, the entire chunk of XML inside
> the <myns:quote> element is supposed to become the value of an RDF
> property *verbatim*, without expanding the xi:include directive.  If the
> XML parser is permitted to expand or not expand the xi:include directive
> at its discretion, before the GRDDL transformation even sees it, then it
> is not possible for the GRDDL transformation author to ensure that
> correct results will be produced.

Again, then do not use XInclude in your source document if this is your 
desire, or host the RDF you desire via conneg or some other means.
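
To make the indeterminacy concrete, here is a minimal Python sketch using the
standard library's xml.etree.ElementInclude. The namespace URIs, the
stub_loader function, and the "remote content" payload are assumptions made
for illustration; the stub stands in for a network fetch so the example runs
offline. It shows that the same source document hands a transformation two
different trees, depending on whether the processor chose to expand XInclude.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# The quoted example, with hypothetical namespace URIs filled in.
DOC = """<myns:myDoc xmlns:myns="http://example.org/myns"
    xmlns:otherNs="http://example.org/other"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <myns:quote>
    <otherNs:whatever>
      <xi:include href="http://example.org/do-not-expand"/>
    </otherNs:whatever>
  </myns:quote>
</myns:myDoc>"""


def stub_loader(href, parse, encoding=None):
    """Stand-in for retrieving the included resource (no network access)."""
    return ET.fromstring("<included>remote content</included>")


def quoted_payload(root):
    """Serialize the first child of otherNs:whatever inside myns:quote."""
    quote = root.find("{http://example.org/myns}quote")
    whatever = quote.find("{http://example.org/other}whatever")
    return ET.tostring(whatever[0], encoding="unicode").strip()


# Processor A: no XInclude processing before the transformation runs.
unexpanded = ET.fromstring(DOC)

# Processor B: expands XInclude before the transformation runs.
expanded = ET.fromstring(DOC)
ElementInclude.include(expanded, loader=stub_loader)

payload_a = quoted_payload(unexpanded)
payload_b = quoted_payload(expanded)
```

Any transformation that copies the quoted content into an RDF literal would
see payload_a under one processor and payload_b under the other.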

> Again, please let me know how I can be most helpful in resolving this
> issue.

Again, by suggesting exact text and testcases. It seems to me the best way 
to address your concerns is to add a section of informative text for the
Spec to the faithful infoset section or to the test-cases that recommends 
that in order for GRDDL authors to best guarantee a faithful rendition 
within their ability:

1) Minimize XML preprocessing by not having the source document use 
XInclude or schema validation.
2) Have only one representation of the information resource given by the 
URI be available, and so not use content negotiation.
3) Restrict GRDDL transformations to deterministic finite state automata. 
4) If an author wishes to guarantee that an XML document is reflected by
some particular RDF document, that the author not use GRDDL but serve RDF
directly and specify that using rel="alternate" in XHTML to link to an RDF
document in the representation, or serve it via content negotiation in
terms of XML documents with URIs. (Are there other ways for an XML document
to directly link to an RDF document?)

Would this satisfy this comment? If not, please specify what would satisfy 
your comment, if possible without breaking WebArch by disallowing conneg 
and without forcing the GRDDL WG to develop its own XML processing model.

Again, by relying on a client-side processor some indeterminacy must be
accepted by the server side authors. By relying on the Web one also brings
indeterminacy into the equation.

I do think that if you want "XML preprocessing defined," which you imply, 
you should bring the issue up with the XML Activity, the XML Processing 
Model WG, and the TAG. Defining what "complete" XML preprocessing is lies
outside of the mandate of the GRDDL WG, and as a W3C WG we of course must
attempt to abide by WebArch and the current indeterminacy in the XML
implementations and as created by the Web itself.

Guaranteed determinism is lost as soon as you rely on accessing namespace or
profile docs on the Web, XML parsers, conneg-enabled URI schemes and
Turing complete programming languages. One can make recommendations and
make this explicit, but I cannot see how one can change this. A GRDDL 
client can at best try to apply all the available transformations it 
understands and can access, and merge those results.
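
A minimal sketch of that best-effort behaviour, in Python, might look like
the following. Everything here is hypothetical: the TransformUnavailable
exception, the two example transforms, and the tuple representation of
triples are illustrative assumptions, not anything the spec defines.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Transform = Callable[[str], Set[Triple]]


class TransformUnavailable(Exception):
    """Raised by a hypothetical transform that cannot be retrieved or run."""


def best_effort_results(source: str, transforms: Iterable[Transform]) -> Set[Triple]:
    """Apply every transform that is accessible and merge what comes back."""
    merged: Set[Triple] = set()
    for transform in transforms:
        try:
            merged |= transform(source)
        except TransformUnavailable:
            # Skip transforms the client cannot access or understand.
            continue
    return merged


def title_transform(source: str) -> Set[Triple]:
    # Hypothetical transform that always succeeds.
    return {("doc", "title", source.strip())}


def unreachable_transform(source: str) -> Set[Triple]:
    # Hypothetical transform whose namespace document cannot be fetched.
    raise TransformUnavailable("namespace document not retrievable")


results = best_effort_results(" Example ", [title_transform, unreachable_transform])
```

Which triples end up in the merge depends on which transforms happened to be
reachable at the time, so two clients can legitimately differ.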

[1] http://www.w3.org/TR/grddl-tests/
[2] http://lists.w3.org/Archives/Public/public-grddl-wg/2007May/0075.html
[3] http://www.w3.org/TR/grddl-tests/#multiple-representations
[4] http://www.biglist.com/lists/xsl-list/archives/200105/msg00167.html

> Thanks,
>
> David Booth, Ph.D.
> HP Software
> +1 617 629 8881 office  |  dbooth@hp.com
> http://www.hp.com/go/software
>
>

-- 
 				--harry

 	Harry Halpin
 	Informatics, University of Edinburgh
         http://www.ibiblio.org/hhalpin
Received on Tuesday, 29 May 2007 18:17:46 UTC