These comments are based on WD-xml-infoset-20000726. There are two major problems in that draft. 1. XML InfoSet is not supposed to be tied to specific programming languages. It is meant to be applicable to any reasonable language. Just as Java is portable to any machine that sufficiently resembles a SPARC, so the current InfoSet is applicable to any language that sufficiently resembles C. The specific problem is that all sorts of nodes are supposed not merely to HAVE unique parents, but to KNOW their parents. This has two effects. First, it is extremely difficult, if not impossible without gross inefficiency, to model this in strict functional programming languages (such as Standard ML or O'CAML) or strict logic programming languages (such as Mercury, notable for its efficiently and its support under Microsoft .NET). In a lazy functional language (such as Haskell or Clean) or a somewhat sloppy logic programming language (such as Prolog, which omits the occurs check) it is technically possible to implement this model of nodes knowing their own unique parent, but it is highly unpleasant to deal with such data structures. Second, it prohibits the storage optimisation technique called "hashed consing". This matters for ALL programmers in ALL programming languages. For example, I have written an XML parser in C. Not altogether to my surprise, it was easier to write the parser than to write efficient code for character set conversion on input and output. Thanks to XML's rule that elements with element content may contain white space character data, which must be preserved, large XML documents often contain a large number of repetitions of the same white space strings. If my parser is asked to build a tree, it uses unique strings for element names, attribute names, attribute values, PCDATA, CDATA sections, comments, and processing instructions. When there is little repetition, this may slightly increase storage needs, but more commonly it reduces them to a worthwhile extent. As yet, my parser does not store nodes uniquely, but I have been thinking about changing that. In the revised parser (if it ever gets built, an element will have an array of pointers to child nodes, not a linked list, and nodes will be uniquely stored. So a document that defines <!ENTITY rfc2279 "<std rfc="2279">UTF-8</std>"> and then uses &rfc2279; many times would have just one copy of the <std> subtree. That optimisation is flatly forbiddeb by the InfoSet proposal as it stands. To be sure, Javascript programmers find it convenient to have rigidly doubly linked unshareable structures, but removing the requirement that nodes have a unique parent and know it would not require Javascript programmers to change their ways. 2. InfoSets is supposed to be an abstract model for the information content of an XML document. There is a lot of prior art for this. For SGML we have ESIS (in the SGML standard) and GROVES (in HyTime and DSSSL). XML is a simplified version of SGML in that it is simpler to PARSE (unless you are using XSchema, in which case it is harder to parse). It is rather more complicated to USE. To demonstrate that, here is a sequence of abstract descriptions. /* XML Version 0. This covers the information that one would expect an SGML parser to recover from an XML document, excluding information that can be found in the DTD. */ :: Document = { root :: Element, dict :: String -> Element, before :: {PI}, after :: {PI} } :: PI :== String :: PCData :== String :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode} } :: Name :== String :: TreeNode = Element Element | PCData PCData | PI PI :: AttrNode = { attr :: Name, value :: AttrVal } :: AttrVal = Implied | Single String | Multi {String} Here :: Name :== Type defines a synonym for an existing type, :: Name = Type names a new type, String is built in (sequence of Char) {Type} stands for sequence of Type { f1::t1, ..., fn::tn } stands for record with a field f1 of type t1, ..., and a field fn of type tn. Tag1 t1 | ... | Tagn tn is a discriminated union For example, here is the document <Greeting>hello world</Greeting> doc = { before = {}, root = r, after = {}} where r = { name = "Greeting", children = {c1}, atts = {} } c1 = PCData "hello world" Note that an Element knows its children and their order, but a TreeNode does not know its parent or siblings. It so happens that this is real code that can be typechecked and run, but it serves nicely as an ABSTRACT data type. An implementation in Javascript might well have more links definable in terms of this structure, but an abstract model has no business including redundant links that hamper adoption in more concise languages. /* XML Version 1. This adds Comments and CData sections, */ /* splits processing instructions into target and data, */ /* and allows references to unexpanded entities. */ :: Document = { root :: Element, dict :: String -> Element, before :: {PICom}, after :: {PICom} } :: PICom = PI PI | Comment Comment :: PI = { target :: String, data :: String } :: Comment :== String :: PCData :== String :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode} } :: Name :== String :: TreeNode = Element Element | PCData PCData | CData PCData | PICom PICom | Unexpanded String :: AttrNode = { attr :: Name, value :: AttrVal } :: AttrVal = Implied | Single String | Multi {String} For example, here is the document <!--hairy--><!--stuff--><Greeting >hello<![CDATA[ world]]></Greeting><!--hairy--><!--stuff--> doc = { before = o, root = r, after = o} where r = { name = "Greeting", children = {c1,c2}, atts = {} } c1 = PCData "hello" c2 = CData " world" o = {Comment "hairy",Comment "stuff"} /* XML Version 2. This adds Namespaces. */ :: Document = { root :: Element, dict :: String -> Element, before :: {PICom}, after :: {PICom} } :: PICom = PI PI | Comment Comment :: PI = { target :: String, data :: String } :: Comment :== String :: PCData :== String :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode} } :: Name = { orig_name :: String, space :: URI } :: URI :== String :: TreeNode = Element Element | PCData PCData | CData PCData | PICom PICom | Unexpanded String :: AttrNode = { attr :: NSName, value :: AttrVal } :: AttrVal = Implied | Single String | Multi {String} For example, here is the document <!--hairy--><!--stuff--><s:Greeting xmlns:s="file:://localhost/a/b/c" s:label="tom dick harry" >hello<![CDATA[ world]]></s:Greeting><!--hairy--><!--stuff--> doc = { before = o, root = r, after = o} where r = { name = n1, children = {c1,c2}, atts = {a1} } c1 = PCData "hello" c2 = CData " world" o = {Comment "hairy",Comment "stuff"} n1 = { orig_name = "s:Greeting", space = s } a1 = { attr = n2, value = Single "tom dick harry" } n2 = { orig_name = "s:label", space = s } s = "file://localhost/a/b/c" /* XML Version 3. This adds XBase. Note of these examples is to be taken very seriously. They were composed in a hurry and have not been reviewed. This one is particularly half-baked. Note that a PCData node, PI node, or AttrName may have many occurrences; each occurrence is to be interpreted in its own context. To get external entity references right, we need another alternative for TreeNode, making it harder to count children. */ :: Document = { root :: Element, dict :: String -> Element, before :: {PICom}, after :: {PiCom}, base :: URI } :: PICom = PI PI | Comment Comment :: PI = { target :: String, data :: String } :: Comment :== String :: PCData :== String :: Element = { name :: Name, children :: {TreeNode}, atts :: {AttrNode}, base :: URI } :: Name = { orig_name :: String, space :: URI } :: URI :== String :: Expanded = { name :: String, children :: {TreeNode}, base :: URI } :: TreeNode = Element Element | PCData PCData | CData PCData | PICom PICom | Unexpanded String | Expanded Expanded :: AttrNode = { attr :: Name, value :: AttrVal } :: AttrVal = Implied | Single String | Multi {String} For example, here is the document <!--hairy--><!--stuff--><s:Greeting xmlns:s="file:://localhost/a/b/c" s:label="tom dick harry" xml:base="file://localhost//home/user/fred/docs/root.xml" >hello<![CDATA[ world]]></s:Greeting><!--hairy--><!--stuff--> doc = { before = o, root = r, after = o, base = b } where r = { name = n1, children = {c1,c2}, atts = {a1}, base = b } c1 = PCData "hello" c2 = CData " world" o = {Comment "hairy",Comment "stuff"} n1 = { orig_name = "s:Greeting", space = s } a1 = { attr = n2, value = Single "tom dick harry" } n2 = { orig_name = "s:label", space = s } s = "file://localhost/a/b/c" b = "file://localhost//home/user/fred/docs/root.xml" If we elide type synonyms, XML0 needs only 5 new type declarations, 6 SLOC, to describe it. If we elide type synonyms, XML3 needs 9 new type declarations, 12 SLOC, to describe it. Considering that including external entity boundaries in the model makes it harder to locate the children, it's fair to say that the XML InfoSet model, even when stripped of parent and sibling links, is at least twice as complicated to use as basic SGML. For GROVEs, there is an answer, and it is an answer that XML InfoSet needs. That is the notion (rather loosely interpreted) of a Grove Plan, whereby an application can tell an SGML processor what it is interested in. There is a risk that the XML InfoSet model will be misunderstood. That it will be seen as specifying the information that every application HAS to deal with, instead of the correct interpretation, which is that it is the information AVAILABLE to an application that wants it. That would leave us with something as hideously clumsy as the Document Object Model. A few months ago I set out to write a DOM implementation, but found that (a) it was so underspecified that for several key questions I had no idea what I was supposed to do, and (b) I couldn't imagine anyone wanting to use it unless perhaps they were writing an XML editor, which I wasn't. The InfoSet specification should explicitly state that an InfoSet processor should provide some means that an application can use to say which features of the model it wants to be informed of. At a minimum, an application should be able to - choose whether or not to be told about comments - choose whether CDATA sections are to be split off PCData or not - choose whether to be informed of Xbase or not - choose whether unexpanded entity references are to be returned or to cause a parsing error.Received on Wednesday, 13 September 2000 20:44:04 GMT
This archive was generated by hypermail 2.2.0 + w3c-0.30 : Thursday, 19 May 2005 09:46:41 GMT