W3C home > Mailing lists > Public > www-xml-infoset-comments@w3.org > July to September 2000

Parents and Grove Plans

From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
Date: Thu, 14 Sep 2000 12:43:56 +1200 (NZST)
Message-Id: <200009140043.MAA06799@atlas.otago.ac.nz>
To: mailto:www-xml-infoset-comments@w3.org
These comments are based on WD-xml-infoset-20000726.
There are two major problems in that draft.

1.  XML InfoSet is not supposed to be tied to specific programming
    languages.  It is meant to be applicable to any reasonable language.

    Just as Java is portable to any machine that sufficiently resembles
    a SPARC, so the current InfoSet is applicable to any language that
    sufficiently resembles C.

    The specific problem is that all sorts of nodes are supposed not
    merely to HAVE unique parents, but to KNOW their parents.

    This has two effects.  First, it is extremely difficult, if not
    impossible without gross inefficiency, to model this in strict
    functional programming languages (such as Standard ML or O'CAML)
    or strict logic programming languages (such as Mercury, notable for
    its efficiently and its support under Microsoft .NET).  In a lazy
    functional language (such as Haskell or Clean) or a somewhat sloppy
    logic programming language (such as Prolog, which omits the occurs
    check) it is technically possible to implement this model of nodes
    knowing their own unique parent, but it is highly unpleasant to
    deal with such data structures.

    Second, it prohibits the storage optimisation technique called
    "hashed consing".  This matters for ALL programmers in ALL programming
    languages.  For example, I have written an XML parser in C.  Not
    altogether to my surprise, it was easier to write the parser than to
    write efficient code for character set conversion on input and output.
    Thanks to XML's rule that elements with element content may contain
    white space character data, which must be preserved, large XML
    documents often contain a large number of repetitions of the same
    white space strings.  If my parser is asked to build a tree, it uses
    unique strings for element names, attribute names, attribute values,
    PCDATA, CDATA sections, comments, and processing instructions.  When
    there is little repetition, this may slightly increase storage needs,
    but more commonly it reduces them to a worthwhile extent.  As yet,
    my parser does not store nodes uniquely, but I have been thinking
    about changing that.  In the revised parser (if it ever gets built,
    an element will have an array of pointers to child nodes, not a
    linked list, and nodes will be uniquely stored.  So a document
    that defines
	<!ENTITY rfc2279 "<std rfc="2279">UTF-8</std>">
    and then uses &rfc2279; many times would have just one copy of the
    <std> subtree.

    That optimisation is flatly forbiddeb by the InfoSet proposal as
    it stands.  To be sure, Javascript programmers find it convenient
    to have rigidly doubly linked unshareable structures, but removing
    the requirement that nodes have a unique parent and know it would
    not require Javascript programmers to change their ways.

2.  InfoSets is supposed to be an abstract model for the information
    content of an XML document.  There is a lot of prior art for this.
    For SGML we have ESIS (in the SGML standard) and GROVES (in HyTime
    and DSSSL).

    XML is a simplified version of SGML in that it is simpler to PARSE
    (unless you are using XSchema, in which case it is harder to parse).
    It is rather more complicated to USE.  To demonstrate that, here is
    a sequence of abstract descriptions.

    /* XML Version 0.  This covers the information that one would
       expect an SGML parser to recover from an XML document,
       excluding information that can be found in the DTD.
    */

    :: Document = { root :: Element, dict :: String -> Element,
		    before :: {PI}, after :: {PI} }

    :: PI :== String

    :: PCData :== String

    :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode} }

    :: Name :== String

    :: TreeNode = Element Element | PCData PCData | PI PI

    :: AttrNode = { attr :: Name, value :: AttrVal }

    :: AttrVal = Implied | Single String | Multi {String}

    Here
	:: Name :== Type	defines a synonym for an existing type,
        :: Name = Type		names a new type,
        String			is built in (sequence of Char)
        {Type}			stands for sequence of Type
        { f1::t1, ..., fn::tn }	stands for record with a field f1 of type
				t1, ..., and a field fn of type tn.
	Tag1 t1 | ... | Tagn tn is a discriminated union

    For example, here is the document
	<Greeting>hello world</Greeting>

    doc       = { before = {}, root = r, after = {}}
     where r  = { name = "Greeting", children = {c1}, atts = {} }
           c1 = PCData "hello world"

    Note that an Element knows its children and their order, but a TreeNode
    does not know its parent or siblings.  It so happens that this is real
    code that can be typechecked and run, but it serves nicely as an
    ABSTRACT data type.  An implementation in Javascript might well have
    more links definable in terms of this structure, but an abstract model
    has no business including redundant links that hamper adoption in
    more concise languages.

    /*  XML Version 1.  This adds Comments and CData sections,  */
    /*  splits processing instructions into target and data, */
    /*  and allows references to unexpanded entities. */

    :: Document = { root :: Element, dict :: String -> Element,
                    before :: {PICom}, after :: {PICom} }

    :: PICom = PI PI | Comment Comment

    :: PI = { target :: String, data :: String }

    :: Comment :== String

    :: PCData :== String

    :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode} }

    :: Name :== String

    :: TreeNode = Element Element | PCData PCData | CData PCData | PICom PICom
                | Unexpanded String

    :: AttrNode = { attr :: Name, value :: AttrVal }

    :: AttrVal = Implied | Single String | Multi {String}

    For example, here is the document
	<!--hairy--><!--stuff--><Greeting
	>hello<![CDATA[ world]]></Greeting><!--hairy--><!--stuff-->

    doc       = { before = o, root = r, after = o}
     where r  = { name = "Greeting", children = {c1,c2}, atts = {} }
           c1 = PCData "hello"
           c2 = CData " world"
           o  = {Comment "hairy",Comment "stuff"}


    /*  XML Version 2.  This adds Namespaces.  */

    :: Document = { root :: Element, dict :: String -> Element,
		    before :: {PICom}, after :: {PICom} }

    :: PICom = PI PI | Comment Comment

    :: PI = { target :: String, data :: String }

    :: Comment :== String

    :: PCData :== String

    :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode} }

    :: Name = { orig_name :: String, space :: URI }

    :: URI :== String

    :: TreeNode = Element Element | PCData PCData | CData PCData | PICom PICom
                | Unexpanded String

    :: AttrNode = { attr :: NSName, value :: AttrVal }

    :: AttrVal = Implied | Single String | Multi {String}

    For example, here is the document
	<!--hairy--><!--stuff--><s:Greeting xmlns:s="file:://localhost/a/b/c"
	s:label="tom dick harry"
        >hello<![CDATA[ world]]></s:Greeting><!--hairy--><!--stuff-->

    doc     = { before = o, root = r, after = o}
     where r  = { name = n1, children = {c1,c2}, atts = {a1} }
           c1 = PCData "hello"
           c2 = CData " world"
           o  = {Comment "hairy",Comment "stuff"}
           n1 = { orig_name = "s:Greeting", space = s }
           a1 = { attr = n2, value = Single "tom dick harry" }
           n2 = { orig_name = "s:label", space = s }
           s  = "file://localhost/a/b/c"

    /*  XML Version 3.  This adds XBase.  Note of these examples is to
        be taken very seriously.  They were composed in a hurry and have
        not been reviewed.  This one is particularly half-baked.  Note
	that a PCData node, PI node, or AttrName may have many occurrences;
	each occurrence is to be interpreted in its own context.  To get
	external entity references right, we need another alternative for
	TreeNode, making it harder to count children.
    */

    :: Document = { root :: Element, dict :: String -> Element,
                    before :: {PICom}, after :: {PiCom}, base :: URI }

    :: PICom = PI PI | Comment Comment

    :: PI = { target :: String, data :: String }

    :: Comment :== String

    :: PCData :== String

    :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode},
                   base :: URI }
    :: Name = { orig_name :: String, space :: URI }

    :: URI :== String

    :: Expanded = { name :: String, children :: {TreeNode}, base :: URI }

    :: TreeNode = Element Element | PCData PCData | CData PCData | PICom PICom
                | Unexpanded String | Expanded Expanded

    :: AttrNode = { attr :: Name, value :: AttrVal }

    :: AttrVal = Implied | Single String | Multi {String}

    For example, here is the document
	<!--hairy--><!--stuff--><s:Greeting xmlns:s="file:://localhost/a/b/c"
	s:label="tom dick harry"
	xml:base="file://localhost//home/user/fred/docs/root.xml"
        >hello<![CDATA[ world]]></s:Greeting><!--hairy--><!--stuff-->

    doc       = { before = o, root = r, after = o, base = b }
     where r  = { name = n1, children = {c1,c2}, atts = {a1}, base = b }
           c1 = PCData "hello"
           c2 = CData " world"
           o  = {Comment "hairy",Comment "stuff"}
           n1 = { orig_name = "s:Greeting", space = s }
           a1 = { attr = n2, value = Single "tom dick harry" }
           n2 = { orig_name = "s:label", space = s }
           s  = "file://localhost/a/b/c"
           b  = "file://localhost//home/user/fred/docs/root.xml"


    If we elide type synonyms, XML0 needs only 5 new type declarations,
    6 SLOC, to describe it.

    If we elide type synonyms, XML3 needs 9 new type declarations,
    12 SLOC, to describe it.

    Considering that including external entity boundaries in the model
    makes it harder to locate the children, it's fair to say that the
    XML InfoSet model, even when stripped of parent and sibling links,
    is at least twice as complicated to use as basic SGML.

    For GROVEs, there is an answer, and it is an answer that XML InfoSet
    needs.  That is the notion (rather loosely interpreted) of a Grove
    Plan, whereby an application can tell an SGML processor what it is
    interested in.

    There is a risk that the XML InfoSet model will be misunderstood.
    That it will be seen as specifying the information that every
    application HAS to deal with, instead of the correct interpretation,
    which is that it is the information AVAILABLE to an application that
    wants it.  That would leave us with something as hideously clumsy
    as the Document Object Model.  A few months ago I set out to write
    a DOM implementation, but found that (a) it was so underspecified
    that for several key questions I had no idea what I was supposed to
    do, and (b) I couldn't imagine anyone wanting to use it unless
    perhaps they were writing an XML editor, which I wasn't.

    The InfoSet specification should explicitly state that an InfoSet
    processor should provide some means that an application can use to
    say which features of the model it wants to be informed of.  At a
    minimum, an application should be able to
    - choose whether or not to be told about comments
    - choose whether CDATA sections are to be split off PCData or not
    - choose whether to be informed of Xbase or not
    - choose whether unexpanded entity references are to be returned
      or to cause a parsing error.
Received on Wednesday, 13 September 2000 20:44:04 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Monday, 16 March 2009 11:12:22 GMT