RE: How are correct, unambiguous results possible with implementation-defined XML pre-processing? from Chimezie Ogbuji on 2007-06-01 (public-grddl-wg@w3.org from June 2007)

From: Chimezie Ogbuji <ogbujic@ccf.org>
Date: Fri, 01 Jun 2007 11:08:36 -0400
To: "Booth, David (HP Software - Boston)" <dbooth@hp.com>
cc: "Harry Halpin" <hhalpin@ibiblio.org>, "Murray Maloney" <murray@muzmo.com>, public-grddl-wg@w3.org
Message-ID: <1180710516.6378.64.camel@otherland>
So, I hope with this response email to highlight exactly why completely
removing ambiguity in how you go from a concrete XML syntax (angle
bracket bytes) to an Infoset or XPath is impossible given the current
state of the art (even in the absence of XInclude).

On Tue, 2007-05-29 at 17:09 -0400, Booth, David (HP Software - Boston)
wrote:
> > Hi Harry,
> > The problem is that the notion of preprocessing is 
> > underdefined for XML 
> > parsers in general. Can someone point me to a document that specifies 
> > exactly what finite number steps must be taken to preprocess an XML 
> > document so one can apply XPath to get a node (and here come up 
> > questions about how one gets from bytes on the wire to a data 
> > model). 
> 
> I think the point that Henry Thompson and others observed is that there
> is no *single* preprocessing sequence that would be appropriate for all
> XML documents.  Different documents require different preprocessing
> sequences.  Since the root namespace determines the overall semantics of
> the document (and thus the expected preprocessing sequence), it seems
> quite reasonable for a GRDDL transformation to explicitly specify what
> pre-processing needs to occur.

The root namespace cannot guarantee the overall semantics if there is
ambiguity in even bare-bones parsing.  By bare-bones parsing, I mean the
explicit mapping that the XML Infoset specification provides from
(concrete) XML 1.0 document syntactic components to their corresponding
information items - by a non-validating XML parser.  In addition, the
mapping of this infoset to an XPath data model for GRDDL to use would be
the non-normative mapping defined in the XPath specification:

[[
The nodes in the XPath data model can be derived from the information
items provided by the XML Information Set
]] -- XPath 1.0 (B XML Information Set Mapping (Non-Normative))


Note that such a parsing of an XML document with XInclude directives
would result in an XML infoset which included the (unexpanded) XInclude
directives as an element information item (with appropriate namespace
components and attributes).
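
A minimal sketch of this point, assuming Python's standard
xml.etree.ElementTree stands in for the non-validating XML processor.
The sample document and the "part.xml" href are made up for
illustration and nothing is fetched:

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Hypothetical source document; "part.xml" is never dereferenced.
SOURCE = (
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="part.xml"/>'
    '</doc>'
)

# Plain parsing only: no XInclude processor runs at this stage.
root = ET.fromstring(SOURCE)

# The directive survives as an ordinary namespaced element with its attributes.
child_tags = [child.tag for child in root]
include_href = root[0].get("href")
print(child_tags, include_href)
```

Anything layered on top of this tree (an XPath query, a GRDDL
transform) sees the include element itself, not the content it points
to.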

[[
The information set of an XML document is defined to be the one obtained
by parsing it according to the rules of the specification whose version
corresponds to that of the document.
]] - XML Infoset (Introduction: XML Versions)

Both the XML 1.0 specification and the XML Infoset admit that the
infoset is underdetermined as a result of validation and external entity
references:

[[
As noted above, an XML document need not be valid to have an information
set. However, certain kinds of invalidity affect the values assigned to
some properties. Entities, notations, elements and attributes may be
undeclared. Notations and elements may be multiply declared (multiple
declarations are valid for entities and attributes). An ID may be
undefined or multiply defined. Such cases are noted where relevant in
the Information Item definitions below.
]] -- XML Infoset (Introduction: Inconsistencies Resulting from
Invalidity)

[[
The information passed from the processor to the application may vary,
depending on whether the processor reads parameter and external
entities.
]] - XML 1.0 (5.2 Using XML Processors)
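
To make the external-entity case concrete, here is a rough sketch
using Python's xml.sax with the expat driver.  The document, the
"ext.xml" system identifier and the FixedResolver class are
hypothetical; the resolver hands back fixed bytes so that nothing is
read from disk or the network.  The only thing that changes between
the two runs is whether the processor reads external general entities:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical document that refers to an external parsed entity.
DOC = '<!DOCTYPE doc [<!ENTITY ext SYSTEM "ext.xml">]><doc>&ext;</doc>'


class FixedResolver(EntityResolver):
    """Hypothetical resolver: serves fixed bytes instead of fetching ext.xml."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource()
        source.setByteStream(io.BytesIO(b"<entity/>"))
        return source


class TagCollector(ContentHandler):
    """Records the element names the processor reports to the application."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def element_names(read_external_entities):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, read_external_entities)
    parser.setEntityResolver(FixedResolver())
    handler = TagCollector()
    parser.setContentHandler(handler)
    parser.feed(DOC)
    parser.close()
    return handler.tags


with_entities = element_names(True)
without_entities = element_names(False)
print(with_entities, without_entities)
```

Both runs are legitimate non-validating parses of the same bytes, yet
the application is handed a different set of elements in each.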


Note these ambiguities have to do with 'parsing' (creating an infoset
from an XML document) and XInclude is orthogonal to parsing (it happens
on an already constructed infoset):

[[
XInclude operates on information sets and thus is orthogonal to parsing.
]] -- XInclude (1.2 Relationship to XML External Entities)

So, even a policy which did not allow 'forward-firing' of XInclude
directives (i.e., an XML processor automatically handing its infoset to
an XInclude processor before handing it off to a higher application)
would not eliminate infoset ambiguity.

The only guarantee would be to use a validating parser instead.  This
limits which XML processors can be plugged into the front part of the
GRDDL pipeline (described below) and limits the domain of GRDDL further
to valid *and* well-formed XML.  I'm pretty sure the WG consensus (given
the number of invalid but approved test cases) is against such a
restriction (language in the current specification indicates this
explicitly).

Furthermore, GRDDL (or any upstream XML application) would have a hard
time with a mandate that precluded XInclude forward-firing, as XInclude
explicitly positions itself as a mechanism that happens at a lower
level:

[[
XInclude processing occurs at a low level, often by a generic XInclude
processor which makes the resulting information set available to higher
level applications.
]] -- XInclude (1.1 Relationship to XLink)

I hope I don't have to make the argument that GRDDL is a higher-level
application.  After all, the conformance label (even if we don't call it
this formally) we choose to use is one of an Agent, not a Processor.
Even if it were a processor, GRDDL is not responsible for parsing and
delegates this responsibility to an XML processor (notice its normative
dependency on XML 1.0).

XProc (the current draft) says nothing about its XInclude component
other than:

[[
The XInclude component applies xinclude processing semantics to the
document.
]] -- XProc (1.6 XInclude)

So, XProc simply uses the infoset it gets (XProc operates on infosets)
to locate XInclude directives and expand them.  If they have already
been expanded, the component becomes a pass-through / no-op.  This is
the same risk that GRDDL currently has with XInclude directives.
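
The pass-through behaviour can be sketched with Python's standard
xml.etree.ElementInclude module, used here only as a stand-in for a
generic XInclude processor (it is not XProc).  The loader function and
the included fragment are hypothetical and nothing is fetched:

```python
import copy
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical source document with one XInclude directive.
SOURCE = (
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="part.xml"/>'
    '</doc>'
)


def loader(href, parse, encoding=None):
    # Hypothetical loader: returns a fixed fragment instead of fetching href.
    return ET.fromstring("<part>included</part>")


def expand(root):
    """Infoset in, infoset out: return an XInclude-expanded copy of root."""
    result = copy.deepcopy(root)
    ElementInclude.include(result, loader=loader)
    return result


parsed = ET.fromstring(SOURCE)   # directives still present
once = expand(parsed)            # directives expanded
twice = expand(once)             # nothing left to expand

tags_before = [child.tag for child in parsed]
tags_after = [child.tag for child in once]
first_pass = ET.tostring(once)
second_pass = ET.tostring(twice)
print(tags_before, tags_after, first_pass == second_pass)
```

The second call changes nothing, so a downstream component cannot
tell from the infoset alone whether the expansion was its own work or
was already done by something earlier in the chain.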

If the motivation for defining a specific XML pre-processing model is to
guarantee completeness and a deterministic (functional) mapping from XML
-> RDF, then the fact that such a mapping is impossible even for
non-validating XML parsers suggests that the current conservative
silence on XML processing is prudent.

Murray made a point much earlier about GRDDL that stuck with me:  A
conservative specification can always serve as the building block for
additional specifications which depend on it.  Consider a GRDDL Strict
specification which had a normative dependency on GRDDL but mandated
that the mapping from the XML document representation of an information
resource is determined by 'bare-bones' parsing or by the XPath 2.0
fn:doc function (which is used but only in the informative sections).

This would be a very minimal specification, with dependencies only on
XML Infoset, XPath, and XML 1.0.  At most it would call out the
relevant sections from these specifications.

Below is a diagram of the whole Pipeline which helps me with this
particular picture:

The portion of the pipeline between XML 1.0 and XML Infoset is where a
GRDDL strict mandate can be enforced.

Web Architecture (ambiguity introduced from web space)
----------------
* Information Resource
* Representation (determined via the URI dereference function)

   |
   V

XML 1.0 (ambiguity introduced from non-validating parsing)
-------
* XML Document (determined from representation dereferenced from web
space)

   |
   V

XML Infoset (ambiguity introduced from non-validating parsing)
-----------
* Information items (determined by mapping from XML document components)

   |
   V

XInclude (optional, low-level mechanism - introduces infoset ambiguity)
--------
* Infoset-to-Infoset transformation

   |
   V

XPath 1.0 (no ambiguity)
----------
* XML Data Model (determined by non-normative mapping from info items)

   |
   V

GRDDL (no additional ambiguity other than those inherited)
------
* Nominates (independent) transforms and applies them recursively 
* Generates GRDDL results.

   |
   V

RDF abstract graph

> The GRDDL spec mentions XProc, but does not indicate any dependency on
> it.  If such a dependency is intended, it would be helpful to clarify
> exactly what is the dependency and how it fits into GRDDL, as described
> in issue-dbooth-3, point 2:
> http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0078.html

Note that GRDDL is agnostic about the actual transformation algorithm;
it just sets up a workflow mechanism for them.  So a normative dependency
on XProc is not required.  It only advises the use of transformation
languages which have more explicit control (than say XSLT) to minimize
ambiguity with respect to the Faithful Rendition.  This ambiguity cannot
be eliminated (see above) without the use of validating parsers and even
then there are the issues of the other mechanisms GRDDL is purposely
silent about:

- XML Signatures 
- XML Decryption
- Dependencies on external entities (which introduces ambiguity to
well-formedness not to mention the infoset you produce)

> I appreciate the intent, but it does not solve the problem, and as this
> thread has pointed out, the advice given by the spec (quoted below) is
> not even possible to follow.

It is impossible due to the under-determined nature of XML parsing, not
because of anything that GRDDL could have *guaranteed* but didn't.  XML
parsing is not a functional mechanism.  Even the XPath 2.0 fn:doc
function admits this:

[[
By defining the semantics of this function in terms of a
string-to-document-node mapping in the dynamic context, the
specification is acknowledging that the results of this function are
outside the purview of the language specification itself, and depend
entirely on the run-time environment in which the expression is
evaluated. This run-time environment includes not only an unpredictable
collection of resources ("the web"), but configurable machinery for
locating resources and turning their contents into document nodes within
the XPath data model. Both the set of resources that are reachable, and
the mechanisms by which those resources are parsed and validated, are
·implementation dependent·.
]] -- XQuery 1.0 & XPath 2.0 Functions / Operators (15.5.4 fn:doc)

There is something to be said about the consistent admission of this
indeterminate process in XML Infoset, XML 1.*, and XQuery / XPath 2.0.
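
As a toy model of what the fn:doc text describes (plain Python, not an
XPath 2.0 engine): the function is only a lookup in the "available
documents" mapping supplied by the dynamic context.  The URI and both
environments below are hypothetical:

```python
import xml.etree.ElementTree as ET

URI = "http://example.org/data.xml"  # hypothetical URI


def doc(uri, available_documents):
    """Toy model of fn:doc: look the URI up in the context's mapping."""
    return available_documents[uri]


# Two hypothetical run-time environments that resolved the same URI
# with differently configured machinery.
env_unexpanded = {URI: ET.fromstring("<d><ref/></d>")}
env_expanded = {URI: ET.fromstring("<d><part/></d>")}


def child_names(env):
    # The same expression evaluated in either environment.
    return [child.tag for child in doc(URI, env)]


result_unexpanded = child_names(env_unexpanded)
result_expanded = child_names(env_expanded)
print(result_unexpanded, result_expanded)
```

The expression is identical in both cases; only the mapping handed in
by the environment differs, which is exactly the part the language
specification declines to define.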

> No, my comment here is about the above advice given in the GRDDL spec --
> not about the pre-processing problem in general.  This advice is unique
> to the GRDDL spec.  My intent in this thread was merely to confirm my
> suspicion that this particular advice is impossible to follow, as my
> example illustrated.

The advice is only impossible to follow where the author has to contend
with the 'natural' ambiguities associated with XML parsing.  The only
exception is XInclude, but in order for GRDDL to be explicit about *not*
forward-firing XInclude it would essentially need to be an XML processor
in its own right (XInclude strongly suggests that inclusion happens at
a lower level).  GRDDL is a mechanism for an agent, not a processor; it
has to negotiate with the dictates of its environment.

In addition, as mentioned above, it is relatively easy to build a
specification above GRDDL (GRDDL Strict) which enforces non-validating
(or even validating) XML 1.0 -> XML Infoset parsing (as a replacement
for GRDDL's silence).  But even such a specification would not be able
to claim victory on eliminating *all* ambiguity in XML processing.

> Yes, I think it is coherent, and it is obvious that significant thought
> went into it -- I very much like the way the normative rules are
> explicitly called out and formalized, BTW -- but I think the spec is
> biased toward applications that can afford to be somewhat loose about a
> document's semantics.  

I don't think it is fair to characterize a conservative stance in the
absence of any precedent as a 'bias', especially when you consider that
a majority of the infoset ambiguity is predetermined before GRDDL even
gets a handle on the XPath data model it uses as its source.  The
primary exception is XInclude (it is explicitly an infoset-to-infoset
transformation) and, as I've demonstrated, one can easily mandate that
XInclude doesn't happen by a bare-bones-parsing scheme described in a
very light-weight specification with a dependency on GRDDL.

-- 
Chimezie Ogbuji
Lead Systems Analyst
Thoracic and Cardiovascular Surgery
Cleveland Clinic Foundation
9500 Euclid Avenue/ W26
Cleveland, Ohio 44195
Office: (216)444-8593
ogbujic@ccf.org


Received on Friday, 1 June 2007 15:09:03 UTC