- From: Chimezie Ogbuji <ogbujic@ccf.org>
- Date: Fri, 01 Jun 2007 11:08:36 -0400
- To: "Booth, David (HP Software - Boston)" <dbooth@hp.com>
- cc: "Harry Halpin" <hhalpin@ibiblio.org>, "Murray Maloney" <murray@muzmo.com>, public-grddl-wg@w3.org
With this response I hope to show exactly why it is impossible, given the current state of the art, to completely remove ambiguity in how you go from a concrete XML syntax (angle-bracket bytes) to an Infoset or XPath data model, even in the absence of XInclude.

On Tue, 2007-05-29 at 17:09 -0400, Booth, David (HP Software - Boston) wrote:
> Hi Harry,
>
> > The problem is that the notion of preprocessing is
> > underdefined for XML
> > parsers in general. Can someone point me to a document that specifies
> > exactly what finite number steps must be taken to preprocess an XML
> > document so one can apply XPath to get a node (and here come up
> > questions about how one gets from bytes on the wire to a data
> > model).
>
> I think the point that Henry Thompson and others observed is that there
> is no *single* preprocessing sequence that would be appropriate for all
> XML documents. Different documents require different preprocessing
> sequences. Since the root namespace determines the overall semantics of
> the document (and thus the expected preprocessing sequence), it seems
> quite reasonable for a GRDDL transformation to explicitly specify what
> pre-processing needs to occur.

The root namespace cannot guarantee the overall semantics if there is ambiguity in even bare-bones parsing. By bare-bones parsing I mean the explicit mapping, as performed by a non-validating XML parser, that the XML Infoset specification provides from the syntactic components of a (concrete) XML 1.0 document to their corresponding information items.
In addition, the mapping of this infoset to an XPath data model for GRDDL to use would be the non-normative mapping defined in the XPath specification:

[[
The nodes in the XPath data model can be derived from the information items provided by the XML Information Set
]] -- XPath 1.0 (B XML Information Set Mapping (Non-Normative))

Note that such a parsing of an XML document with XInclude directives would result in an XML infoset which included the (unexpanded) XInclude directives as element information items (with appropriate namespace components and attributes).

[[
The information set of an XML document is defined to be the one obtained by parsing it according to the rules of the specification whose version corresponds to that of the document.
]] -- XML Infoset (Introduction: XML Versions)

Both the XML 1.0 specification and the XML Infoset admit that the infoset is underdetermined as a result of validation and external entity references:

[[
As noted above, an XML document need not be valid to have an information set. However, certain kinds of invalidity affect the values assigned to some properties. Entities, notations, elements and attributes may be undeclared. Notations and elements may be multiply declared (multiple declarations are valid for entities and attributes). An ID may be undefined or multiply defined. Such cases are noted where relevant in the Information Item definitions below.
]] -- XML Infoset (Introduction: Inconsistencies Resulting from Invalidity)

[[
The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities.
]] -- XML 1.0 (5.2 Using XML Processors)

Note that these ambiguities have to do with 'parsing' (creating an infoset from an XML document), and XInclude is orthogonal to parsing (it happens on an already constructed infoset):

[[
XInclude operates on information sets and thus is orthogonal to parsing.
]] -- XInclude (1.2 Relationship to XML External Entities)

So even a policy that disallowed 'forward-firing' of XInclude directives (i.e., an XML processor automatically handing its XML infoset to an XInclude processor before handing it off to a higher-level application) would not eliminate infoset ambiguity. The only guarantee would be to use a validating parser instead. This limits which XML processors can be plugged into the front part of the GRDDL pipeline (described below) and limits the domain of GRDDL further to valid *and* well-formed XML. I'm pretty sure the WG consensus (given the number of invalid but approved test cases) is against such a restriction (language in the current specification indicates this explicitly).

Furthermore, GRDDL (or any upstream XML application) would have a hard time with a mandate that precluded XInclude forward-firing, as XInclude explicitly positions itself as a mechanism that happens at a lower level:

[[
XInclude processing occurs at a low level, often by a generic XInclude processor which makes the resulting information set available to higher level applications.
]] -- XInclude (1.1 Relationship to XLink)

I hope I don't have to make the argument that GRDDL is a higher-level application. After all, the conformance label we choose to use (even if we don't call it this formally) is that of an Agent, not a Processor. Even if it were a processor, GRDDL is not responsible for parsing and delegates this responsibility to an XML processor (notice its normative dependency on XML 1.0).

XProc (the current draft) says nothing about its XInclude component other than:

[[
The XInclude component applies xinclude processing semantics to the document.
]] -- XProc (1.6 XInclude)

So XProc simply uses the infoset it gets (XProc operates on infosets) to locate XInclude directives and expand them. If they have already been expanded, the component becomes a pass-thru / no-op. This is the same risk that GRDDL currently has with XInclude directives.
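
As a rough illustration of XInclude being a separate step applied to an already parsed document, here is a minimal sketch using Python's standard-library ElementTree and ElementInclude modules. An ElementTree tree only approximates an infoset, and the document, the href "part.xml", the in-memory loader and the helper names are all hypothetical, made up for this sketch.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical documents, held in memory so the sketch is self-contained.
DOC = (
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="part.xml"/>'
    '</doc>'
)
PART = "<part>included</part>"


def loader(href, parse, encoding=None):
    """Hypothetical loader: resolves only the one made-up href used above."""
    if href == "part.xml" and parse == "xml":
        return ET.fromstring(PART)
    raise OSError("cannot load %r" % href)


def bare_parse(text):
    """Parse only: any xi:include element stays in the tree unexpanded."""
    return ET.fromstring(text)


def expand_xincludes(root):
    """Separate tree-to-tree step: expand xi:include elements in place."""
    ElementInclude.include(root, loader=loader)
    return root


def child_tags(root):
    """Return the tags of the root's direct children."""
    return [child.tag for child in root]
```

Calling child_tags(bare_parse(DOC)) shows the unexpanded include element, while child_tags(expand_xincludes(bare_parse(DOC))) shows the included content. Running expand_xincludes a second time on the expanded tree changes nothing, which is the pass-thru behaviour described above: a downstream consumer cannot tell from the tree alone whether expansion has already happened.
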
If the motivation for defining a specific XML pre-processing model is to guarantee completeness and a deterministic (functional) mapping from XML -> RDF, then the fact that such a mapping is impossible even for non-validating XML parsers suggests that the current conservative silence on XML processing is prudent.

Murray made a point much earlier about GRDDL that stuck with me: a conservative specification can always serve as the building block for additional specifications which depend on it.

Consider a GRDDL Strict specification which had a normative dependency on GRDDL but mandated that the mapping from the XML document representation of an information resource is determined by 'bare-bones' parsing or by the XPath 2.0 fn:doc function (which is used, but only in the informative sections). This would be a very minimal specification, with dependencies only on XML Infoset, XPath, and XML 1.0. At most it would call out the relevant sections from these specifications.

Below is a diagram of the whole pipeline, which helps me with this particular picture. The portion of the pipeline between XML 1.0 and XML Infoset is where a GRDDL Strict mandate can be enforced.

Web Architecture (ambiguity introduced from web space)
----------------
 * Information Resource
 * Representation (determined via the URI dereference function)
        |
        V
XML 1.0 (ambiguity introduced from non-validating parsing)
-------
 * XML Document (determined from representation dereferenced from web space)
        |
        V
XML Infoset (ambiguity introduced from non-validating parsing)
-----------
 * Information items (determined by mapping from XML document components)
        |
        V
XInclude (optional, low-level mechanism - introduces infoset ambiguity)
--------
 * Infoset-to-Infoset transformation
        |
        V
XPath 1.0 (no ambiguity)
---------
 * XML Data Model (determined by non-normative mapping from info items)
        |
        V
GRDDL (no additional ambiguity other than those inherited)
-----
 * Nominates (independent) transforms and applies them recursively
 * Generates GRDDL results.
        |
        V
RDF abstract graph

> The GRDDL spec mentions XProc, but does not indicate any dependency on
> it. If such a dependency is intended, it would be helpful to clarify
> exactly what is the dependency and how it fits into GRDDL, as described
> in issue-dbooth-3, point 2:
> http://lists.w3.org/Archives/Public/public-grddl-comments/2007AprJun/0078.html

Note that GRDDL is agnostic about the actual transformation algorithm; it just sets up a workflow mechanism for transformations. So a normative dependency on XProc is not required. The spec only advises the use of transformation languages which have more explicit control (than, say, XSLT) to minimize ambiguity with respect to the Faithful Rendition.
This ambiguity cannot be eliminated (see above) without the use of validating parsers, and even then there are the issues of the other mechanisms GRDDL is purposely silent about:

 - XML Signatures
 - XML Decryption
 - Dependencies on external entities (which introduce ambiguity into well-formedness, not to mention the infoset you produce)

> I appreciate the intent, but it does not solve the problem, and as this
> thread has pointed out, the advice given by the spec (quoted below) is
> not even possible to follow.

It is impossible because of the under-determined nature of XML parsing, not because of anything that GRDDL could have *guaranteed* but didn't. XML parsing is not a functional mechanism. Even the XPath 2.0 fn:doc function admits this:

[[
By defining the semantics of this function in terms of a string-to-document-node mapping in the dynamic context, the specification is acknowledging that the results of this function are outside the purview of the language specification itself, and depend entirely on the run-time environment in which the expression is evaluated. This run-time environment includes not only an unpredictable collection of resources ("the web"), but configurable machinery for locating resources and turning their contents into document nodes within the XPath data model. Both the set of resources that are reachable, and the mechanisms by which those resources are parsed and validated, are ·implementation dependent·.
]] -- XQuery 1.0 & XPath 2.0 Functions / Operators (15.5.4 fn:doc)

There is something to be said about the consistent admission of this indeterminate process in XML Infoset, XML 1.*, and XQuery / XPath 2.0.

> No, my comment here is about the above advice given in the GRDDL spec --
> not about the pre-processing problem in general. This advice is unique
> to the GRDDL spec. My intent in this thread was merely to confirm my
> suspicion that this particular advice is impossible to follow, as my
> example illustrated.

The advice is only impossible to follow where the author has to contend with the 'natural' ambiguities associated with XML parsing. The only exception is XInclude, but in order for GRDDL to be explicit about *not* forward-firing XInclude it would essentially need to be an XML processor in its own right (XInclude strongly suggests that inclusion happens at a lower level). GRDDL is a mechanism for an agent, not a processor: it has to negotiate with the dictates of its environment.

In addition, as mentioned above, it is relatively easy to build a specification above GRDDL (GRDDL Strict) which enforces non-validating (or even validating) XML 1.0 -> XML Infoset parsing (as a replacement for GRDDL's silence). But even such a specification would not be able to claim victory on eliminating *all* ambiguity in XML processing.

> Yes, I think it is coherent, and it is obvious that significant thought
> went into it -- I very much like the way the normative rules are
> explicitly called out and formalized, BTW -- but I think the spec is
> biased toward applications that can afford to be somewhat loose about a
> document's semantics.

I don't think it is fair to characterize a conservative stance, taken in the absence of any precedent, as a 'bias', especially when you consider that the majority of the infoset ambiguity is already determined before GRDDL even gets a handle on the XPath data model it uses as its source. The primary exception is XInclude (it is explicitly an infoset-to-infoset transformation) and, as I've demonstrated, one can easily mandate that XInclude doesn't happen by means of a bare-bones-parsing scheme described in a very light-weight specification with a dependency on GRDDL.
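
To make the 'natural' parsing ambiguity around external entities concrete, here is a minimal sketch using Python's standard-library xml.sax (a non-validating, expat-based parser). The same bytes yield different content depending on whether the processor reads external general entities. The document, the entity name, the system identifier "part.xml" and the in-memory resolver are hypothetical, made up for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical document whose content depends on an external entity.
DOC = b'<!DOCTYPE doc [<!ENTITY part SYSTEM "part.xml">]><doc>&part;</doc>'


class ElementCollector(ContentHandler):
    """Records the name of every element the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


class InMemoryResolver(EntityResolver):
    """Hypothetical resolver: serves fixed bytes instead of fetching anything."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b"<part/>"))
        return source


def elements_seen(read_external_entities):
    """Parse DOC with external general entities switched on or off."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, read_external_entities)
    handler = ElementCollector()
    parser.setContentHandler(handler)
    parser.setEntityResolver(InMemoryResolver())
    source = InputSource()
    source.setByteStream(io.BytesIO(DOC))
    parser.parse(source)
    return handler.names
```

Here elements_seen(True) reports both the doc element and the part element pulled in through the entity, while elements_seen(False) reports only doc. Both runs are conforming non-validating parses of the same document, so nothing downstream (XPath, GRDDL) can recover a single 'correct' answer from the bytes alone.
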

-- 
Chimezie Ogbuji
Lead Systems Analyst
Thoracic and Cardiovascular Surgery
Cleveland Clinic Foundation
9500 Euclid Avenue/ W26
Cleveland, Ohio 44195
Office: (216)444-8593
ogbujic@ccf.org
Received on Friday, 1 June 2007 15:09:03 UTC