- From: <roddey@us.ibm.com>
- Date: Fri, 4 Jun 1999 11:46:57 -0600
- To: www-xml-schema-comments@w3.org
My name is Dean Roddey. I am the author of the XML4C2 parser, IBM's C++ XML parser, and I also did a good bit of work on IBM's Java XML parser. I will be prototyping Schema support in our parsers, so I've been studying the spec, and I have the following comments. These comments are purely about the structural material, not the data types; I'll comment on those later.

1 - First of all, a general-perception comment: it's far too complex. I foresee that it is doomed if it stays this complicated. Complex specs never seem to do well, and this one falls into that category. I feel, personally, that the spec writers may be addressing their own needs as programmers rather than the needs of the end users who will be using the tools based on this specification, which is why complex specs don't do well, AFAIK. If it's not significantly simplified, I don't believe it will ever achieve widespread use, anywhere near complete conformance among implementations, or anywhere near the performance required by the primary applications of XML in e-business. The comments below are mostly about the things I feel are too complex.

2 - Open element types. I have no trouble with the concept of an open element type, but the requirement that it still be validated (by skipping over any elements not listed in the allowable children of the current open-modeled element) is too much, IMHO. What's the real point? The whole point of validating the content of an element is so that you know it has a particular structure. If it doesn't, then any software that has to deal with the output gains very little from having it validated. So it just adds overhead and implementation complexity for very little gain.

3 - Most of the 'Definition' sections are, nothing personal, mostly to wholly incomprehensible. Prose definitions with many sub-clauses are a guaranteed way to ensure that implementations are incompatible. These sentences should be broken out into terse, bulleted items, each of which is explained.

4 - Aggregate data types. The spec mentions that aggregate data type validation has not been dealt with. Just to head off any moves in that area, I would warn that doing so would have massive complications for performance. Effectively, parsers would have to save up all text for an entire element and its children before validating, or validation would have to assume the presence of a tree structure (such as the DOM). This would either preclude the validation of streaming data (totally unacceptable) or place potentially heavy performance burdens on the parsers that do not exist today (a sketch of the buffering this pushes onto a streaming parser follows this comment). I would definitely warn against even attempting to deal with such issues. The best one can hope for is that each child, having its own constraints and being found valid, provides a reasonable assurance that the overall element is valid. Any requirement for inter-child constraint interactions would be equally undesirable. This must be left to application-level validation.
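To make the streaming concern in comment 4 (and comment 6 below) concrete, here is a minimal Java sketch, purely illustrative and not code from our parsers: a push-model handler built on SAX's DefaultHandler that has to buffer each open element's character content until its end tag before any data-type check can run. The toy type table and the "quantity must be an integer" rule are made up for the example.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class BufferingValidator extends DefaultHandler {

        /** Hypothetical per-type check; stands in for real datatype validation. */
        interface DatatypeValidator { boolean isValid(String value); }

        // Toy registry: pretend a 'quantity' element is typed as an integer.
        private final Map<String, DatatypeValidator> types = new HashMap<>();
        { types.put("quantity", v -> v.matches("[+-]?[0-9]+")); }

        // One buffer per open element; nothing can be checked or discarded
        // until that element's end tag arrives.
        private final Deque<StringBuilder> open = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            open.push(new StringBuilder());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // In a DTD-style parse this text could be reported and forgotten.
            // Here it must be held until the element is complete.
            if (!open.isEmpty()) open.peek().append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            String value = open.pop().toString().trim().replaceAll("\\s+", " ");
            DatatypeValidator v = types.get(qName);
            if (v != null && !v.isValid(value)) {
                throw new SAXException(qName + " has an invalid value: '" + value + "'");
            }
            // Aggregate (cross-child) constraints would be worse still: the parent
            // could not release anything about its children until its own end tag.
        }
    }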
5 - 3.4.1 kind of implies that a FIXED attribute value must be validated as matching the data type of the associated attribute even before it's used in an instance. This may be an issue depending on how the evaluation of entity references is defined for the schema world. This sort of validation would require that the value be fully expanded, but how could that happen if user-defined entities (are internal-entity-type things supported here?) were part of the expansion and therefore had not been seen until the actual instance document is parsed?

6 - Related to #4, even the requirement for the parser to save up all the text within the direct content of an element in order to validate it against some data type will impose a significant burden. The text has to be buffered up and normalized, the element's type looked up, and only then can it be validated (roughly as in the sketch after comment 4 above). I understand the need for this, but be aware that it's a significant burden for larger documents that does not exist in the current scheme. Getting rid of the ability to have comments and PIs interleaved into such elements would significantly reduce the overhead.

7 - 3.3.4 The meaning of the NOTE in this section escapes me.

8 - Is it really wise or desirable to allow element declarations within other element declarations? This seems like a nicety that will make things much more complex and allow the naive user (who is the primary user of this stuff) to really get confused. The formalism of naming elements at the global level seems a small price relative to the significant reduction in complexity it would provide. I think this one falls way outside the 80/20 rule. The same applies to archetypes as well.

9 - 3.4.6 The NOTE in this section seems very contradictory to me. Mixed content models are inherently designed to allow child elements, albeit in any order or number.

10 - 3.4.7 Ambiguous content models. I feel that the ability to catch an ambiguous content model is very important for support reasons. However, as I've mentioned before, the types of content you allow cannot reasonably be determined to be ambiguous, because of the combinations of repetition counts and the AND connector.

11 - Wouldn't this perhaps be a good time to get rid of the XML'ism of allowing totally out-of-order declarations? I fail to understand how this was ever considered a good thing, either for the user or the implementor.

12 - 3.4.9 Re #8 above. This ability to have multiple types with the same name (since they are nested within other element declarations) is just going to confuse the general user and make reading of instances more difficult. This would be avoided if you dropped this altogether and forced top-level element declarations. If you have two things in the same schema with the same name that mean different things, I think you have a problem. If they are in different schemas, then it's a non-issue since they are in different namespaces.

13 - 3.4.9 What are the references to 'global' here? Is this something left over that should be gotten rid of? Are you implying, as per the old DCD spec, that nested declarations can be global as though they were top level? If so, please drop that, because it serves no one in particular and just complicates things. If it's global, declare it at the top level. Otherwise, what does this stuff mean?
14 - 3.5 The refinement of content models is way too complicated. It will never be understood by end users, and it will make for a very significant increase in the complexity of parsing schemas. In my opinion, this should be dropped. The only 'refinement' supported should be the natural result of creating new element types that are based on a previous archetype and add new child elements to the end. I doubt seriously whether most users could even understand the rules, short as they are, in this section regarding the legalities of refinement. Most programmers probably won't, for that matter.

15 - The import/export rules are about an order of magnitude more complicated than the average user will ever understand. They are more complex than either Java or C++ include/namespace/import rules, and most programmers don't even use all of those. You should simplify this down to the fact that one schema can fully include another, to support modular construction, and leave it at that. The scheme you propose would put a very large burden on the parser just to build up the data structures to get ready to validate. I seriously urge you to drop this stuff and do it simply and straightforwardly. Having to do a transitive closure on every element and its attributes, maintain lists of included/imported schemas, keep up with how they got imported/included, and apply the rules thereof does not even fall within the 99/1 rule, IMHO.

16 - Re #15. Importation of schema bits which retain their own namespace/identity should be dropped and replaced with the much more straightforward use of multiple schemas by the target document, which is already supported in the spec.

17 - 4.6 I think the 'override' rules on elements might not work. You say that a local declaration overrides, but you also say that among multiple declarations from external sources the first one is used. If so, you definitely have to guarantee that you don't create a rule that forces me to defer decisions until the very end of the schema. Otherwise, what if I've already seen the external declaration used 200 times, and then I see a local one? Is there anything that would have made me think differently about those previous 200 references now that I've seen a new declaration? Would I really want to have to parse the entire thing only to discover that it's wrong after I see some local definition? DTDs didn't have that problem, because only the first instance was used, period. (A rough sketch of the difference appears after these comments.)

18 - 6.1 You imply that DTDs and schemas can coexist in the same document. I believe that this should not be allowed, because it invites massive confusion on the part of users and makes things much more complicated for implementors. Schemas should replace DTDs, not live alongside them.

19 - 6.1 You imply that not all elements in a document are even governed by any schema? I think that this is a big mistake, which again will horrendously confuse the average user. In the DTD world, everything must be accounted for, and you know that everything in the document matched the DTD.

20 - 6.1 The whole 'nearly well formed' concept needs to be revisited. I have serious qualms about creating a quasi-WF category of documents.

21 - 6.2 Most of the steps described in 6.2 for validation seem to assume the presence of the data in a tree format that can be iterated over repeatedly. This is not true of streaming protocols. If schema cannot be applied to streaming protocols, then its usefulness is questionable for the real world. If it effectively requires that all of the data in any top-level element and its children be saved up by the parser before it can be validated (because there is no tree structure elsewhere to put it), then the overhead will be truly huge compared to existing validation mechanisms.
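Here is a small Java sketch of the resolution issue in comment 17. ElementDecl, the 'external' flag, and both tables are hypothetical illustrations of the two policies, not anything taken from the spec or from our parsers.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DeclResolution {

        record ElementDecl(String name, boolean external) { }

        // DTD-style: the first declaration wins, so every reference can be bound
        // the moment it is seen and never needs to be revisited.
        static final Map<String, ElementDecl> firstWins = new HashMap<>();

        static void declareDtdStyle(ElementDecl d) {
            firstWins.putIfAbsent(d.name(), d);   // later duplicates are simply ignored
        }

        // Override-style: a local declaration seen later replaces an external one
        // seen earlier, so any reference already bound may turn out to be wrong.
        // The only safe options are to defer all binding to the end of the schema,
        // or to keep every reference around so it can be rebound -- either way the
        // parser pays for it.
        static final Map<String, ElementDecl> current = new HashMap<>();
        static final List<String> pendingReferences = new ArrayList<>();

        static void declareOverrideStyle(ElementDecl d) {
            ElementDecl prev = current.get(d.name());
            if (prev == null || (prev.external() && !d.external())) {
                current.put(d.name(), d);         // may invalidate earlier bindings
            }
        }

        static void reference(String name) {
            pendingReferences.add(name);          // cannot bind yet; must wait
        }

        static ElementDecl resolve(String name) {
            return current.get(name);             // only meaningful once the schema is done
        }
    }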
That's all I have for now. I know this probably seemed pretty brutal, but it all needed to be said.

I personally believe that this spec must be heavily pared down or it will never survive. We have to think about what its primary purposes and users are: e-business and relatively untechnical end users, IMHO. In both places complexity is the enemy, both for performance over the wire and for understandability. Turning XML into a programming language is counterproductive to me. If the goal is to put more control into non-programmers' hands, that's fine, but the complexities of this spec are easily as bad as most programming languages, IMHO. To the end user they are both probably equally obtuse, and relative measures don't matter. Once again I would argue for a small, simple core that can be fast, efficient, small, and comprehensible. Build value-added layers for more complex work, done as totally separate specifications. Build them on top of the DOM perhaps, since a lot of this stuff only seems to make sense if a full tree is available for re-iteration. In the end you will provide a better spec for the 80%. And the 20% wouldn't be satisfied with what you've done anyway, so the complexity would still not eliminate the need for significant user-provided validation.

Specifically, I believe that the following parts should be tossed out in order to make the spec tenable. If these were removed, at least I would feel that it has a chance:

1) The AND connector
2) Repetition counts for elements
3) The complex include/import mechanisms
4) The overly complicated 'derivation' mechanism

Of these, #1 and #2 are by far the worst. If either of these mechanisms is included, the overhead for validation will go up substantially. The content models supported by XML were chosen for a reason. They are validatable via a finite automaton, which means they are very fast and very compact and test all possibilities in parallel. Anything that forces validation to move from a DFA to a much higher overhead mechanism will have a serious impact on validation overhead, which will be bad for e-business uses of XML.

I do not believe, given the use of AND and repetition counts, that a content model can be proven to be ambiguous in any reasonable amount of effort. And I do not believe that a pattern can be proven not to match a content model, in many cases, without a brute-force search of all possibilities. A very simple example of the problems involved is:

    (A{1..2}, B, C?, D?) | (A{3..4}, B, C, D)

In this model, which would be totally unambiguous by Schema rules, whether C and D are required depends upon how many As were seen (a rough sketch of the checking this forces follows below). Extend this scenario to a situation where multiple such counted elements are nested within complex patterns, and then throw in AND, where it's not possible to know what position an input will be in, and things get much worse. A simple example is:

    (A{1..2}, (C&D&A), F?, G?) | (A{3..4}, (C&D&A), F, G)

Here, this would be ambiguous even in the Schema world, yet proving that in a generalized way would be difficult. In the current types of content models, determination of ambiguity is relatively trivial and falls out of building the DFA. But in the types of models proposed by Schema, the work could be very complex. And if you cannot prove ambiguity, how can you prove that a particular (failed) path you took through the pattern was the only one, and that another might not have matched? I'm sure that I could come up with some far worse examples given a little more thought, but I believe that it falls on the spec writers' shoulders to prove the practical viability of any proposed content model mechanisms. If the spec forces the use of a particular style of content model, then the spec should provide the proof that it is both theoretically doable and practically applicable.
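To make the first example above concrete, here is a rough brute-force sketch in Java; Particle and the matcher are mine, purely illustrative. Whether C and D are required can only be decided after counting the leading As, so a validator either expands every count into a much larger automaton up front or ends up trying branches and backtracking like this, instead of making one table lookup per child as a DTD-style DFA does.

    import java.util.List;

    public class CountedModelCheck {

        record Particle(String name, int min, int max) { }

        // (A{1..2}, B, C?, D?)  |  (A{3..4}, B, C, D)
        static final List<List<Particle>> ALTERNATIVES = List.of(
            List.of(new Particle("A", 1, 2), new Particle("B", 1, 1),
                    new Particle("C", 0, 1), new Particle("D", 0, 1)),
            List.of(new Particle("A", 3, 4), new Particle("B", 1, 1),
                    new Particle("C", 1, 1), new Particle("D", 1, 1)));

        static boolean valid(List<String> children) {
            // Nothing about the first child alone says which branch applies, so we
            // try each alternative, and within one we try every legal repetition count.
            return ALTERNATIVES.stream().anyMatch(alt -> matches(alt, 0, children, 0));
        }

        static boolean matches(List<Particle> alt, int p, List<String> kids, int k) {
            if (p == alt.size()) return k == kids.size();
            Particle part = alt.get(p);
            // Try every permitted count for this particle (min..max), backtracking
            // if the remainder of the sequence cannot be satisfied.
            for (int count = part.min(); count <= part.max(); count++) {
                if (k + count > kids.size()) break;
                boolean allMatch = true;
                for (int i = 0; i < count; i++) {
                    if (!kids.get(k + i).equals(part.name())) { allMatch = false; break; }
                }
                if (allMatch && matches(alt, p + 1, kids, k + count)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(valid(List.of("A", "A", "B")));                // true: branch 1, C and D optional
            System.out.println(valid(List.of("A", "A", "A", "B")));           // false: branch 2 requires C and D
            System.out.println(valid(List.of("A", "A", "A", "B", "C", "D"))); // true: branch 2
        }
    }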
Another major concern is that the complexity of the various namespaces and the import mechanisms will require data structures so complex, and layered lookups of high enough overhead, that in many cases parsing and creating the internal representation of a schema will begin to outweigh the overhead of parsing and processing the data being validated. Right now, in our parsers anyway, validation is a pretty small fraction of the overall work, and parsing and setting up the validation data structures is a very small fraction, even for pretty complicated DTDs. The added overhead of having Schema expressed in XML, plus the somewhat baroque set of intermediate structures required to build it and track namespaces and importation, is going to force this overhead up much further. I feel that this does not bode well for transaction-oriented XML in the e-business space.

In closing, I just want to give the usual admonition against trying to turn XML into the universal hammer. Doing so will damage its usefulness and leave it no longer the product that had its original appeal. This unfortunate progression has happened to so many products over time, yet still we fail to learn from past mistakes. If XML continues to grow such that it cannot be architected to be layered and progressive in its complexity, it will become SGML, which it was created specifically not to be. And what will have been the point?

Don't look at Schema as some high-level piece of work that can fix any and all shortcomings in XML validation. Schema will be a core piece, and hence will be in almost every implementation of XML. If it is large, complex, and slow, it will fail. Instead it should be layered, with complex structural analysis, for those folks who understand it and are willing to pay the price, provided by another XML-related product. When you throw the data types spec into the mix, the growth in the core services of XML will have far more than doubled, probably closer to quadrupled, and the complexity of use will have grown by an order of magnitude.

Those are my comments for the time being. Thanks for listening.
Received on Friday, 4 June 1999 13:47:18 UTC