RE: issue-dbooth-3: Ambiguity in an XML document's intended GRDDL results

Chimezie,

Thanks for your comments. Detailed responses below.

> From: Chimezie Ogbuji [mailto:ogbujic@ccf.org]
>
> On Tue, 2007-05-29 at 00:33 -0400, Booth, David (HP Software -
> Boston) wrote:
> > This is a personal comment -- not on behalf of HP.
> >
> > This comment is about ambiguity in an XML instance document's
> > *intended* GRDDL results. Such ambiguity should be
> > distinguished from cases where the GRDDL-aware agent
> > *knowingly* chooses to deviate from the GRDDL transformation
> > author's expressed intent (for security or other reasons),
> > and thus accepts responsibility for any differences between
> > the computed results and the GRDDL transformation author's
> > intended results.
>
> Note that the only ambiguity in question here is in cases where
> there are multiple XML infosets / XPath DMs associated with the
> same XML concrete syntax (the bytes over the wire). As Murray
> has already mentioned
> (http://lists.w3.org/Archives/Public/public-grddl-wg/2007May/0074.html)
> the primary motivation for being silent with respect to XML
> processing models is because GRDDL simply does not have the
> authority to dictate an XML processing model that accounts for
> this initial ambiguity in the source document (which already
> puts the Faithful Rendition 'promise' in jeopardy from 
> the beginning). 

Hold it.  There *is* no ambiguity in the source document.
The ambiguity comes into the infoset because the GRDDL
spec permits the source document to be parsed in an
implementation-defined way, in *spite* of what the document
may actually require.  Remember that the semantics of
an XML document are up to the root namespace owner to
define -- nobody else.  If I own the root namespace,
then *I* get to say exactly what the semantics of the
document are -- *including* exactly what pre-processing
the document may need to produce the correct infoset.
This is what the example in point 5 illustrates:
http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0078.html

In fact, GRDDL does not have the authority to permit
parsers to *deviate* from the correct semantics of the
document (as indicated by the document's root namespace)
by permitting the generated infoset to be implementation
defined.

> Especially, when most of the faithful
> 'rendering' is a function of the transformation, which GRDDL 
> simply delegates processing to.

I'm not concerned about any ambiguity introduced by the
transformation function itself, because that is up to
the transformation author.  The document parsing is *not*
up to the transformation author.  That is why it must be
unambiguous.

>
> > Definition: By "XML instance document" I am referring to a concrete
> > "representation" in the TAG WebArch sense -- not an "information
> > resource".
> >
> > POINT 1: For any XML instance document, to the extent
> > possible, the GRDDL spec should make it clear exactly what
> > are the intended GRDDL results for that XML instance
> > document. Two implementations faithfully implementing the
> > GRDDL spec should come to the same conclusions about what
> > those intended GRDDL results should be, i.e., there should be
> > no ambiguity.
>
> Once again, the WG's decision WRT the "Faithful Infoset"
> wording was motivated by the lack of (independent) authority
> required to ensure a deterministic RDF rendition in the face of
> an ambiguous infoset / XPath DM.

But the infoset is only ambiguous because the GRDDL spec
permits the pre-processing to be implementation defined!
The GRDDL WG certainly has the authority to specify
how GRDDL results should be computed from XML instance
documents.  That is its job.  

>
> > I do not think the GRDDL specification should be considered
> > finished until the spec makes this clear, given that:
> >  - GRDDL is the cornerstone for bridging the worlds of XML and RDF.
> >  - A key purpose in expressing semantics in RDF is to make them
> > *unambiguous*.
>
> I would argue that expressing completely *unambiguous*
> semantics via RDF is not the goal of RDF. RDF is simply not
> expressive enough by itself to ensure this. RDF, like any other
> knowledge representation is nothing more than an approximation
> of reality as best expressed by the language. 

Sure, but that's irrelevant.  The point is that a key
purpose of expressing semantics in RDF -- i.e., exposing
the semantics of an XML instance document in RDF -- is
to make them unambiguous to the extent that expressing
them in RDF does make them unambiguous.  I.e., being able
to determine exactly what assertions the input document
is making.  This key purpose is defeated if the intended
RDF result set is ambiguous.

> It is for this
> reason that a GRDDL result is a 'faithful' rendition and not a
> complete one.

Whether the "complete GRDDL results" reflect the entire semantics of the
input document or a proper subset of those semantics is up to
the GRDDL author to choose, in writing the GRDDL transformations.
See my definition of "complete GRDDL results" in point 3,
and my point 4 about the two potential interpretations of the
Faithful Renditions section.

>
> >  - GRDDL is on track to become a W3C Recommendation.
> >  - GRDDL may have quite a long life.  Both XML and RDF have been
> > around for several years with little change, and show no signs of
> > being replaced. I see no reason why GRDDL should not have a
> > similar lifespan.
> >
> > POINT 2: At present, it is not clear what is the view of the
> > Working Group (WG) toward ambiguity in an XML document's
> > intended GRDDL results,
> > i.e., whether the WG believes:
> >
> >   a. it is a problem, but we do not know a solution;
> >   b. it is a problem now, but we expect the problem to go
> >   away
> >      when the XProc or some other spec is completed; or
> >   c. the WG does not consider it a problem.
>
> The wording of the "Faithful Infoset" section (and the
> conversation that led up to the resolution) clearly indicates
> that the WG stance is clearly b with the additional
> 'motivation' of not having a proper mandate to dictate or
> micromanage the XML processing that occurs before the XPath
> Data Model is handed off to the transformation.

That wasn't clear to me in reading the spec.  Other comments I
have heard from other WG members suggest that there is at least
some element of position a involved also.

>
> > I would vehemently object to position c, for the reasons
> > above. In the case of position a, I believe there *are* ways
> > to reduce or eliminate such unintended ambiguity, and I will
> > be happy to suggest ways to do so. In the case of position b,
> > I think it is important that the WG make clear exactly *how*
> > XProc or some other spec is intended to make the problem go
> > away, and indicate that in the spec.
>
> I'm not sure how the sentence below doesn't describe how XProc
> addresses the infoset / XPath data model ambiguity:
>
> [[
> Using XProc, one could apply a sequence of operations such
> XInclude, validation, and transformation to a document,
> aborting if the result of an intermediate stage is not valid,
> for example.
> ]]

It is clear how an XProc pipeline could produce a
completely correct and unambiguous infoset from an XML
instance document.  It is *not* clear how the GRDDL spec
expects XProc to be used.  The example I show in point
5 very clearly illustrates how the GRDDL spec currently
makes it impossible for a GRDDL transformation to produce
the correct results for the example shown -- regardless
of whether or not that GRDDL transformation uses XProc
or anything else -- because the GRDDL transformation does
not get control until *after* the implementation-defined
parsing has occurred.

>
>
> >   At present, the spec
> > explicitly allows the intended results to be implementation
> > defined, which IMO is unacceptable for a spec of this kind.
>
> Once again, the only ambiguity (the only place where the result
> is implementation defined) is where the uncertainty originates
> from the source document - which (as Murray has emphasized)
> already puts the Faithful Rendition promise in jeopardy.

No, the source document has no ambiguity.  The ambiguity in
the infoset comes about because the GRDDL spec permits the
parser to deviate from the root namespace semantics by
using implementation-defined parsing.

>
> > POINT 3: The spec needs to define a notion of "complete GRDDL
> > results" for a given XML instance document.
>
> GRDDL does not have the authority (either in what it might
> dictate with XML processing or with an assumption that
> completeness can be guaranteed deterministically from *every*
> incoming infoset / XPath DM and expressed in RDF) to define a
> notion of a "complete GRDDL result". Hence the term
> "Faithful Rendition" instead of a "Complete Rendition".  See the
> conversation that led up to the resolution:
> http://lists.w3.org/Archives/Public/public-grddl-wg/2007Feb/att-0017/31-grddl-wg-minutes-edited.html#item02
>
> > It is good that the specification describes how partial GRDDL
> > results can be determined, because partial results may be
> > adequate for many applications. But the spec also needs to
> > clearly define what constitutes the *complete* GRDDL results
> > indicated by a given XML instance document, i.e., all and
> > only the intended GRDDL results for all GRDDL transformations
> > indicated by that XML instance document.
> >
> > This is particularly important in supporting applications in
> > which GRDDL is used to express the *entire* semantics of an
> > XML instance document, such as a messaging application as
> > described in issue-dbooth-9a,
> >
> > http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0069.html
>
> Again, the idea that complete semantics of every XML source
> document can be computed (by GRDDL) and can be expressed in RDF
> is a non-starter.

That isn't at all what I suggested.  As I said in point 3, it is
good that GRDDL transformation authors have the discretion of
exposing only a subset of the complete semantics of the input
document.   However, *some* applications need to use GRDDL to
expose the *entire* semantics of the input document.  This
is the case when the input document represents a serialization
of RDF.

>
> > i.e., where custom XML document types are created or treated
> > as custom serializations of RDF, as described in
> > http://dbooth.org/2007/rdf-and-soa/rdf-and-soa-paper.htm .
> > One must be able to say with clarity: "For this XML instance
> > document, the complete GRDDL results are intended to be
> > precisely the following RDF triples -- no more and no less."
>
> If this is the intent of the author, then it would behoove him / her
> to *not* use XInclude directives which only add uncertainty to his / her
> intent. In which case, the 'completeness' is guaranteed by
> leveraging the deterministic nature of GRDDL with respect to
> situations where there is *no* ambiguity in the infoset / XPath
> data model.

I do not think it is reasonable to limit the domain
of GRDDL to XML instance documents that only require
certain pre-processing sequences.  Remember that the GRDDL
transformation author may have no control over the format
of the input document.

>
> > Tellingly, I notice that the WG has routinely been using an
> > implicit concept of the complete GRDDL results (though not
> > using this term) when discussing and comparing test results,
> > for example when two testers talk about whether they got "the
> > same" results for a particular test case.
>
> Comparing results guarantees compliance with respect to the
> label 'GRDDL-aware agent'.  This label does imply computation 
> of 'complete' GRDDL results. 

Not true.  See the normative text in section 7:
http://www.w3.org/TR/grddl/#agt_obl
[[
2. Selectively apply any or all discovered transformations
to obtain GRDDL results. Note selection may be guided
by the agent's capabilities, local security policies and
possibly user/client intervention.
]]

> Notice, the only tests which have multiple
> results are those where there is ambiguity in the infoset /
> XPath DM, the representations served over the network protocol,
> and where multiple GRDDL mechanisms apply:
> http://www.w3.org/TR/grddl-tests/#multiple-output

Yes, that is the point of my point 3: the WG has been
implicitly using the concept of "complete GRDDL results"
without actually defining such a term.  Such a term is
important to define.  However, it is important to define
it based on an XML instance document (i.e., a representation
-- not an information resource) to avoid ambiguity caused
by dynamic information resources and content negotiation.
There is no harm in *also* defining such a notion for
an information resource, but it is not always meaningful,
and it is only needed in the case of namespace and
profile documents, which need special treatment anyway,
as explained in my reply to Harry:
http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0083.html
[[
The starting point only needs to be a URI in the case of a namespace or
profile URI, where GRDDL results need to be determined for it.  And that
case needs to be treated specially because a GRDDL processor needs to be
able to know that if it finds a representation for that URI dereference,
and that representation specifies a GRDDL transformation, then the GRDDL
results of *that* representation can be considered complete without
having to worry about the possible existence of some other
as-yet-undiscovered representation that may specify other GRDDL results.
This is why the additional sentence for the Faithful Renditions section
is needed.
]]
And the "additional sentence for the Faithful Renditions section"
mentioned was in point 3 of issue-dbooth-3:
http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0078.html
[[
  "By specifying a GRDDL namespace transformation or profile
  transformation in a representation of a namespace or profile 
  information resource, the creator of that namespace or 
  profile states that every other representation of that same
  information resource that also specifies a GRDDL namespace 
  transformation or profile transformation is functionally 
  equivalent."
]]

>
> In such cases, the GRDDL pipeline is not deterministic and
> GRDDL would not have the authority to guarantee a functional
> mapping without dictates that would span XML processing and
> content negotiation.

I agree that the GRDDL pipeline is not deterministic and
the XML processing and content negotiation are two of the
reasons.  Regarding content negotiation, as explained above
that is one of the reasons why the notion of "complete GRDDL 
results" needs to be based on an XML instance document rather 
than an information resource.

>
> > Furthermore, the algorithm given in sec 7 of the GRDDL spec
> > http://www.w3.org/2004/01/rdxh/spec#sec_agt
> > describes most of the process needed to determine the
> > complete GRDDL results for a particular XML instance
> > document, but:
> >  - it does not define a conformance term for people to use;
>
> I was under the impression that 'GRDDL-aware agent' was such a
> term.

I meant it does not define a term for the concept of
"complete GRDDL results".

>
> >  - it is defined in terms of a URI as a starting point, which
> > introduces much more ambiguity than being defined in terms of an XML
> > instance document as the starting point;
>
> The ambiguity introduced by speaking of IR's and not XML
> 'instances' is accounted for both in the specification (formally in the rules
> and informally by calling out the appropriate dependent
> specifications with respect to dereferencing URIs) and in the
> test collection (which identifies expected behavior - albeit
> non-deterministic - with respect to this ambiguity).

Yes, the spec and the test cases have both done a very good
job of *documenting* the ambiguity.  But that does not make
it go away.  The point is that people need to be able to
talk about the complete GRDDL results of a particular
representation.  As pointed out in issue-dbooth-9a
http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0069.html
it does not always make sense to talk about the GRDDL
results of an information resource -- particularly a
dynamic information resource -- but it *always* makes
sense to talk about the GRDDL results of a representation.

>
> >  - it is intended for describing partial GRDDL results; and
> >  - more needs to be nailed down to define the notion of complete
> > GRDDL results.
>
> See above.
>
>
> > Namespace and profile information URIs make it much more
> > difficult to define the notion of complete GRDDL results,
> > because there is no guarantee that the GRDDL processor is
> > able to retrieve the correct namespace or profile
> > representation that specifies all of the intended
> > grddl:namespaceTransformations or grddl:profileTransformations that
> > the author intended should be applied. However, this difficulty
> > can be overcome by adding something to the Faithful
> > Renditions section to the effect that:
> >
> >   "By specifying a GRDDL namespace transformation or profile
> >   transformation in a representation of a namespace or
> >   profile information resource, the creator of that namespace
> >   or profile states that every other representation of that
> >   same information resource that also specifies a GRDDL
> >   namespace transformation or profile transformation is
> >   functionally equivalent."
>
> Such text (though very helpful in clarifying this equivalence)
> would only describe in human-readable words what follows from
> the 'informal' mechanical rules (especially those that clearly
> outline how you get from an IR, to bytes, to an XPath DM, and
> so forth)

No, I think the addition to the Faithful Renditions section is
also needed, to preclude the possibility that there may be an
as-yet-undiscovered representation for the namespace or profile
IR that specifies additional GRDDL results, as I explain
above when I mention point 3.

>
>
> > This approach will work when namespace and profile documents
> > have representations available that define GRDDL
> > transformations. But many XML instance documents will need to
> > make use of namespaces or profile documents that will not
> > have such representations available, and since the dependency
> > for defining complete GRDDL results is recursive through all
> > namespace and profile documents, it seems likely that in many
> > cases this approach will be infeasible. Therefore, the GRDDL
> > spec should also define a short-cut mechanism to allow an XML
> > instance document to specify, for example, a
> > grddl:completeTransformation attribute whose presence would
> > indicate that namespace and profile documents do *not* need
> > to be processed in order to determine the complete GRDDL
> > results.
>
> Again, this would follow if the original intent was to define a
> 'complete' rendition.

I do not know what you mean.

>
> > To cover xhtml document types that cannot contain
> > grddl:completeTransformation annotations directly, this approach
> > *could* also be extended by defining a
> > grddl:completeProfileTransformation property whose presence
> > would have a similar effect of saying: "there is no need to look
> > at any other profile documents". However it
> > may be less important to know the complete GRDDL results for
> > xhtml documents than it is for XML documents in general, so
> > such an attribute may not be necessary.
> >
> > POINT 4: The Faithful Rendition section is excellent for
> > making clear how the semantics of GRDDL results should be
> > interpreted. However, I will note that its intent is somewhat
> > unclear, as it could mean either or both of:
> >
> >  - The RDF results of a GRDDL transformation reflect real-life
> > semantics of the input XML instance document, however these semantics
> > may be a subset of the full semantics of that document. (In
> > essence, they are whatever subset of the full semantics the
> > GRDDL transformation author has chosen to expose via GRDDL.)
> >
> >  - GRDDL results for a given XML instance document may be ambiguous
> > (implementation defined), and it is the GRDDL transformation
> > author's responsibility to anticipate this ambiguity and
> > ensure that the results reflect real-life semantics of the
> > input XML instance document anyway.
> >
> > I like the first interpretation, and I consider that as a
> > feature of the spec. I do not like the second -- and I view
> > it as a bug in the spec -- because it merely foists the ambiguity
> > problem off to the GRDDL transformation author, and as I point out
> > below, AFAICT it is not even *possible* for the GRDDL transformation
> > author to always write
> > transformations that produce correct, unambiguous results.
>
> Right, this has more to do with the mechanisms at the infoset
> end than anything GRDDL is attempting to guarantee.
>
> > POINT 5: In discussing the Faithful Rendition assurance,
> > Section 6 explicitly says: "Therefore, it is suggested that
> > GRDDL transformations be written so that they perform all
> > expected pre-processing . . . .".
> > But if the GRDDL transformation requires a particular
> > sequence of pre-processing, or it requires there to be *no*
> > pre-processing, then AFAICT it is not possible for the
> > transformation author to control this if pre-processing is
> > explicitly permitted to be arbitrarily chosen by the
> > implementation before the GRDDL transformation ever sees the
> > input.
>
> Again, to emphasize Murray's earlier point (see link above)
> whether or not processing *should* happen depends on the
> author's intent as well as the environment in which the GRDDL
> agent exists (which might have its own set of policies about
> XML processing). Being dictatorial about the processing only
> serves the purpose of guaranteeing a 'complete' rendition which
> is not the intent of GRDDL to begin with.

Hold on.  For a given XML instance document, you need to
distinguish between four different cases:

  a. The GRDDL processor properly produces RDF results
that are the same as the RDF results that the GRDDL
transformation author expressly intended.   These are what
I call the "complete GRDDL results".  This case is good.

  b. Due to security, network access or other limitations,
the GRDDL processor chooses to produce only a subset of the
complete GRDDL results.  These are what I call "partial
GRDDL results", and they may in fact be the same as the
complete GRDDL results, but the GRDDL processor cannot
know whether or not they are complete if it has chosen
not to perform some of the transformations that have been
expressly indicated.  This case is fine too, because
the GRDDL processor is knowingly making this choice.

  c. The GRDDL processor unknowingly applies a different
pre-processing sequence than the GRDDL transformation
author intended (but had no way to indicate), and
consequently the GRDDL processor unwittingly produces a
proper subset of the complete GRDDL results when it thinks
it is producing the complete GRDDL results.  This case is
*not* okay.

  d. The GRDDL processor unknowingly applies a different
pre-processing sequence than the GRDDL transformation
author intended and consequently the GRDDL processor
unwittingly produces RDF results that are just plain wrong,
i.e., they are not even a subset of the results that
the GRDDL transformation author intended.  This case is
*not* okay.
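
The four cases can be sketched as set comparisons over RDF triples.
This is only an illustration: the names `Outcome` and `classify` are
hypothetical, and triples are modeled as plain Python tuples.  Note
that a real GRDDL processor never has the `intended` set in hand,
which is exactly why cases c and d go undetected.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "a"           # produced exactly the intended triples
    KNOWINGLY_PARTIAL = "b"  # agent chose to skip transformations
    UNWITTING_SUBSET = "c"   # proper subset, agent believes it is complete
    WRONG = "d"              # not even a subset of the intended triples


def classify(intended, produced, agent_skipped_transformations):
    """Classify a GRDDL run against the author's intended triples.

    `intended` and `produced` are iterables of (subject, predicate, object)
    tuples.  `agent_skipped_transformations` is True when the agent
    knowingly declined to apply some indicated transformation.
    """
    intended = frozenset(intended)
    produced = frozenset(produced)
    if not produced <= intended:
        return Outcome.WRONG
    if agent_skipped_transformations:
        # May coincide with the complete results, but the agent cannot know.
        return Outcome.KNOWINGLY_PARTIAL
    if produced == intended:
        return Outcome.COMPLETE
    return Outcome.UNWITTING_SUBSET
```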

>
> > For example, suppose my schema includes blocks of XML code
> > from other documents, and I define a <myns:quote> tag to
> > prevent the embedded chunks of XML from being interpreted,
> > and suppose that one of those embedded chunks uses xinclude:
> >
> > <myns:myDoc . . . >
> >    <myns:quote>
> >       <otherNs:whatever>
> >          <xi:include href="http://example.org/do-not-expand" />
> >       </otherNs:whatever>
> >    </myns:quote>
> > </myns:myDoc>
> >
> > When this document is GRDDL transformed, the entire chunk of
> > XML inside the <myns:quote> element is supposed to become the
> > value of an RDF property *verbatim*, without expanding the
> > xi:include directive. If the XML parser is permitted to
> > expand or not expand the xi:include directive
> > at its discretion, before the GRDDL transformation even sees
> > it, then it is not possible for the GRDDL transformation
> > author to ensure that correct results will be produced.
>
> Again, the problem here is with the author introducing the
> ambiguity with his/her use of the XInclude directive and not
> any failing of GRDDL. If the intent is to have the XInclude
> element be an XMLLiteral object of an assertion, that clashes
> with the semantics of the XInclude directive which has a
> specific (syntactic) meaning at the front end of the
> pipeline: to expand the infoset.

Incorrect.   As pointed out above, the semantics of an
XML document are determined by the root namespace.  If the
root namespace chooses to define a quoting mechanism that
prevents embedded xi:includes from being expanded, that
is its prerogative.
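
To make the difference concrete, here is a minimal sketch using
Python's standard xml.etree modules.  The namespace URIs, the
`stub_loader` function, and the `<included>` element are all
assumptions made up for illustration; the stub stands in for a
network fetch.  It shows the same bytes yielding two different trees,
depending on whether the parser front end expands xi:include before
the transformation sees the document.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# The example document, with hypothetical namespace URIs filled in.
SOURCE = """\
<myns:myDoc xmlns:myns="http://example.org/myns"
            xmlns:otherNs="http://example.org/otherNs"
            xmlns:xi="http://www.w3.org/2001/XInclude">
  <myns:quote>
    <otherNs:whatever>
      <xi:include href="http://example.org/do-not-expand"/>
    </otherNs:whatever>
  </myns:quote>
</myns:myDoc>
"""

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def stub_loader(href, parse, encoding=None):
    # Hypothetical stand-in for dereferencing href; no network access.
    return ET.fromstring("<included>content of %s</included>" % href)


def parse_without_xinclude(text):
    # Front end that leaves xi:include elements untouched.
    return ET.fromstring(text)


def parse_with_xinclude(text):
    # Front end that expands xi:include in place before handing off.
    root = ET.fromstring(text)
    ElementInclude.include(root, loader=stub_loader)
    return root


plain = parse_without_xinclude(SOURCE)
expanded = parse_with_xinclude(SOURCE)

# Does the transformation still see the quoted xi:include element?
has_xi_plain = plain.find(".//" + XI_INCLUDE) is not None
has_xi_expanded = expanded.find(".//" + XI_INCLUDE) is not None
print(has_xi_plain, has_xi_expanded)
```

A transformation that is supposed to copy the content of <myns:quote>
verbatim can only do so from the first tree; in the second, the
directive is already gone by the time the transformation gets control.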

>
> > Again, please let me know how I can be most helpful in
> > resolving this issue.
>
> I hope my clarifications and/or highlighting of the main points
> of contention helps with indicating the WG's stance with
> respect to the Faithful Infoset resolution as well as the
> motivation(s) that led to it.


David Booth, Ph.D. 
HP Software
+1 617 629 8881 office  |  dbooth@hp.com
http://www.hp.com/go/software

Received on Thursday, 31 May 2007 05:28:27 UTC