Comments on schema spec

My name is Dean Roddey. I am the author of the XML4C2 parser, IBM's C++ XML
parser. I also did a good bit of work on IBM's Java XML parser. I will be
prototyping Schema support in our parsers, so I've been studying the spec. I
have the following comments. These comments are purely related to the structural
stuff, not the data types. I'll comment on those later.


1 - First of all, let me give a general perception-type comment. It's far
too complex. I foresee that it is doomed if it stays this complicated.
Complex specs never seem to do well, and this one falls into that category.
I feel, personally, that the spec writers may be addressing their own needs
as programmers rather than the needs of the end users who will be using the
tools based on this specification, which, as far as I can tell, is why
complex specs don't do well. If it is not significantly simplified, I don't
believe it will ever achieve widespread use, anywhere near complete
conformance among implementations, or anywhere near the performance required
by the primary applications of XML in e-business. The comments below are
mostly about the things I feel are too complex.

2 - Open Element Types. I have no trouble with the concept of an open
element type, but the requirement that it still be validated (by skipping
over any elements not listed in the allowable children of the current
open-modelled element) is too much, IMHO. What's the real point? The whole
point of validating the content of an element is so that you know it has a
particular structure. If it doesn't, then any software that has to deal with
the output gains very little from having it validated. So it just adds
overhead and implementation complexity for very little gain.

3 - Most of the 'Definition' sections are, nothing personal, mostly to
wholly incomprehensible. Having prose definitions with many sub-clauses is a
guaranteed way to ensure that implementations are incompatible. These
sentences should be broken out into terse, bulleted items, all of which are
explained.

4 - Aggregate data types. The spec mentions that aggregate data type
validation has not been dealt with. Just to head off any moves in that area,
I would warn that doing so would have massive implications for performance.
Effectively, parsers would have to save up all text for an entire element
and its children before validating, or validation would have to assume the
presence of a tree structure (such as DOM). This would effectively either
preclude the validation of streaming data (totally unacceptable) or place
potentially heavy performance burdens on the parsers that do not exist now.
I would definitely warn against even attempting to deal with such issues.
The best one can hope for is that each child, having its own constraints and
having been seen to be valid, provides a reasonable assurance that the
overall element is valid. Any requirement that inter-child constraint
interactions exist would be equally undesirable. This stuff must be left to
application-level validation.

5 - 3.4.1 kind of implies that a FIXED attribute value must be validated as
matching the data type of the associated attribute even before it is used in
an instance. This may be an issue, depending on how the evaluation of entity
references is defined for the schema world. This sort of validation would
require that the value be fully expanded, but how could that happen if
user-defined entities (are internal-entity-type things supported here?) were
part of the expansion and therefore have not been seen until the actual
content document is parsed?

6 - Related to #4, even the requirement for the parser to save up all the
text within the direct content of an element in order to validate it against
some data type will impose a significant burden: the text has to be buffered
up and normalized, the element's type looked up, and then the value
validated. I understand the need for this, but be aware that it is a
significant burden for larger documents that does not exist in the current
scheme. Getting rid of the ability to have comments and PIs interleaved into
such elements would significantly reduce the overhead.
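
For illustration, here is a minimal sketch of the buffering that #4 and #6
imply for a streaming, SAX-style parser. This is not code from XML4C2 or any
actual parser; the handler callbacks and the Datatype interface are purely
hypothetical stand-ins. Character data has to be accumulated across any
interleaved comments and PIs, whitespace-normalized, and only then checked
against the element's declared type when the end tag arrives.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for a schema datatype (hypothetical interface).
struct Datatype {
    virtual bool isValid(const std::string& lexical) const = 0;
    virtual ~Datatype() {}
};

// Toy datatype: an optionally signed integer.
struct IntDatatype : Datatype {
    bool isValid(const std::string& s) const {
        if (s.empty()) return false;
        std::string::size_type i = (s[0] == '-' || s[0] == '+') ? 1 : 0;
        if (i == s.size()) return false;
        for (; i < s.size(); ++i)
            if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
        return true;
    }
};

// SAX-like handler that must buffer text until the end tag is seen.
class TypedElementHandler {
public:
    void startElement(const Datatype* type) {
        buffers_.push_back(std::string());   // new buffer for this element
        types_.push_back(type);
    }
    void characters(const std::string& chunk) {
        buffers_.back() += chunk;            // cannot be streamed away
    }
    void comment(const std::string&) {}      // comments and PIs interleave
    void processingInstruction() {}          //   with the text being saved
    bool endElement() {
        std::string text = normalize(buffers_.back());
        const Datatype* type = types_.back();
        buffers_.pop_back();
        types_.pop_back();
        return type == 0 || type->isValid(text);   // validate only now
    }
private:
    // Collapse whitespace runs and trim, roughly what value checks need.
    static std::string normalize(const std::string& raw) {
        std::string out;
        bool inWs = true;
        for (std::string::size_type i = 0; i < raw.size(); ++i) {
            if (std::isspace(static_cast<unsigned char>(raw[i]))) {
                inWs = true;
                continue;
            }
            if (inWs && !out.empty()) out += ' ';
            inWs = false;
            out += raw[i];
        }
        return out;
    }
    std::vector<std::string> buffers_;
    std::vector<const Datatype*> types_;
};

int main() {
    IntDatatype intType;
    TypedElementHandler handler;
    handler.startElement(&intType);
    handler.characters("  42");
    handler.comment("a comment splitting the character data");
    handler.characters("7  ");
    // Only here, at the end tag, can "  42" + "7  " be normalized to "427"
    // and checked against the declared type.
    std::cout << (handler.endElement() ? "valid" : "invalid") << "\n";
    return 0;
}

Even in this toy form, the parser is holding an arbitrary amount of text per
open element purely so that a check can be made at end-tag time. None of
that buffering exists in DTD-style validation today.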

7 - 3.3.4 The meaning of the NOTE in this section escapes me.

8 - Is it really wise or desirable to allow element declarations within
other element declarations? This seems like a nicety that will make things
much more complex and allow the naive user (who is the primary user of this
stuff) to really get confused. It would seem that the formalism of naming
elements at the global level is not so big a deal relative to the
significant reduction in complexity it would provide. I think that this one
falls way outside the 80/20 rule. The same applies to archetypes as well.

9 - 3.4.6 The NOTE in this section seems very contradictory to me. Mixed content
models inherently are designed to allow child elements, albeit in any order or
number.

10 - 3.4.7 Ambiguous content models. I feel that the ability to catch an
ambiguous content model is very important for support reasons. However, as
I've mentioned before, the types of content models you now allow cannot
reasonably be checked for ambiguity, because of the combinations of
repetition counts and the AND connector.

11 - Wouldn't this perhaps be a good time to get rid of the XML'ism of allowing
totally out of order declarations? I fail to understand how this ever was
considered a good thing, either for the user or the implementor.

12 - 3.4.9 Re #8 above. This ability to have multiple types with the same
name (since they are nested within other element declarations) is just going
to confuse the general user and make reading of instances more difficult.
This would be avoided if you dropped the ability altogether and forced
top-level element declarations. If you have two things in the same schema
with the same name that mean different things, I think you have a problem
there. If they are in different schemas, then it's a non-issue since they
are in different namespaces.

13 - 3.4.9 What are the references to 'global' here? Is this something left
over that should be gotten rid of? Are you implying, as per the old DCD
spec, that nested declarations can be global as though they were top level?
If so, please drop that, because it serves no one in particular and just
complicates things. If it's global, declare it at the top level. Otherwise,
what does this stuff mean?

14 - 3.5 The refinement of content models is way too complicated. This will
never be understood by end users, and it will make for very significant
increases in the complexity of parsing schemas. In my opinion, this should
be dropped. The only 'refinement' supported should be the natural result of
creating new element types which are based on a previous archetype and which
add new child elements to the end. I seriously doubt whether most users
could even understand the rules, short as they are, in this section
regarding the legalities of refinement. Most programmers probably won't, for
that matter.

15 - The import/export rules are about an order of magnitude more
complicated than the average user will ever understand. They are more
complex than either Java or C++ include/namespace/import rules, and most
programmers don't even use all of those. You should simplify this down to
the fact that one schema can fully include another, to support modular
construction, and leave it at that. The scheme you propose would put a very
large burden on the parser just to build up the data structures to get ready
to validate. I would seriously urge you to drop this stuff and just do it
simply and straightforwardly. Having to do a transitive closure over every
element and its attributes, maintain lists of included/imported schemas,
keep up with how they got imported/included, and apply the rules thereof is
not even, IMHO, within the 99/1 rule.
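
To make that concrete, here is a minimal sketch of the bookkeeping this
implies. The schema graph and the mode tracking below are hypothetical,
purely for illustration, and the real rules are considerably more involved;
the point is only that, before any instance data can be validated, the
processor has to chase every include/import edge, compute the transitive
closure of the reachable schemas, and remember how each one was reached,
because the visibility rules differ by path.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

enum Mode { Include, Import };

struct Edge { std::string target; Mode mode; };

// Schema name -> the include/import edges it declares (hypothetical model).
typedef std::map<std::string, std::vector<Edge> > SchemaGraph;

// Walk the graph from 'root', recording each reachable schema along with
// the mode of the edge that first reached it.
std::map<std::string, Mode> closure(const SchemaGraph& g,
                                    const std::string& root) {
    std::map<std::string, Mode> reached;
    std::vector<std::string> stack;
    std::set<std::string> visited;
    stack.push_back(root);
    visited.insert(root);
    while (!stack.empty()) {
        std::string cur = stack.back();
        stack.pop_back();
        SchemaGraph::const_iterator it = g.find(cur);
        if (it == g.end()) continue;
        for (std::vector<Edge>::const_iterator e = it->second.begin();
             e != it->second.end(); ++e) {
            if (visited.insert(e->target).second) {
                reached[e->target] = e->mode;  // first path wins in this sketch
                stack.push_back(e->target);
            }
        }
    }
    return reached;
}

int main() {
    SchemaGraph g;
    Edge address = { "address", Include };
    Edge parts   = { "parts",   Import  };
    Edge units   = { "units",   Import  };
    g["invoice"].push_back(address);
    g["invoice"].push_back(parts);
    g["parts"].push_back(units);

    std::map<std::string, Mode> reached = closure(g, "invoice");
    for (std::map<std::string, Mode>::const_iterator it = reached.begin();
         it != reached.end(); ++it)
        std::cout << it->first
                  << (it->second == Import ? " (imported)" : " (included)")
                  << "\n";
    return 0;
}

Even this simplest form is a graph traversal with per-schema provenance, run
before a single element of instance data has been looked at; the actual
rules then layer per-component visibility and export decisions on top of it.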

16 - Re #15. Importation of schema bits which retain their own
namespace/identity should be dropped and replaced with the much more
straightforward use of multiple schemas by the target document, which is
already supported in the spec.

17 - 4.6 I think the 'override' rules on elements might not work. You claim
that any local declaration overrides, but you also say that when there are
multiples in external sources, the first one is used. If so, you definitely
have to guarantee that you don't create a rule that forces me to defer any
decisions until the very end of the schema. Otherwise, what if I've seen the
external one used 200 times already, but then I see a local one? Is there
anything that would have made me think differently about the use of that
declaration in the previous 200 references that would be changed now that
I've seen a new one? Would I really want to have to parse the entire thing
only to discover that it's wrong after I see some local definition? DTDs
didn't have that problem, because only the first instance was used, period.

18 - 6.1 You imply that DTDs and schemas can coexist in the same document. I
believe that this should not be allowed because it raises the issue of massive
confusion on the part of users, and it makes things much more complicated for
implementors. Schemas should replace DTDs, not live with them.

19 - 6.1 You imply that not all elements in a document are even governed by
any schema? I think that this is a big mistake, which, again, will
horrendously confuse the average user. In the DTD world, everything must be
accounted for, and you know that everything in the document matched the DTD.

20 - 6.1 The whole 'nearly' well formed concept needs to be revisited. I
have serious qualms about creating a quasi-WF category of documents.

21 - 6.2 Most of the steps described in 6.2 for validation seem very much
to assume the presence of the data in a tree format that can be re-iterated.
This is not true of streaming protocols. If Schema cannot be applied to
streaming protocols, then its usefulness for the real world is questionable.
If it requires that effectively all of the data in any top-level element and
its children be saved by the parser before it can be validated (because
there is no tree structure elsewhere to put it), then the overhead will be
truly huge compared to existing validation mechanisms.



That's all I have for now. I know this probably seemed pretty brutal, but it
all needed to be said. I personally believe that this spec must be heavily
pared down or it will never survive. We have to think about what its primary
purposes and users are: e-business and relatively untechnical end users,
IMHO. In both cases, complexity is the enemy, both for reasons of
performance over the wire and of understandability. Turning XML into a
programming language is counterproductive, in my view. If the goal is to put
more control into non-programmers' hands, that's fine, but the complexities
of this spec are easily as bad as those of most programming languages, IMHO.
To the end user, they are both probably equally obtuse, and relative
measures don't matter.

Once again I would argue for a small, simple core that can be fast,
efficient, small, and comprehensible. Build value-added layers for more
complex work, done as totally separate specifications. Build them on top of
DOM perhaps, since a lot of this stuff only seems to make sense if a full
tree is available for re-iteration. In the end, you will provide a better
spec for the 80%. And the 20% wouldn't be satisfied with what you've done
anyway, so the added complexity still would not remove the need for
significant user-provided validation.

Specifically, I believe that the following parts should be tossed out in
order to make the spec tenable. If these were removed, at least I would feel
that it has a chance.

1) AND connector
2) Repetition counts for elements
3) The complex include/import mechanisms
4) The overly complicated 'derivation' mechanism

Of these, #1 and #2 are by far the worst. If either of these mechanisms is
included, the overhead for validation will go up substantially. The content
models supported by XML were chosen for a reason: they are validatable via a
finite automaton, which means they are very fast and very compact and test
all possibilities in parallel. Anything that forces validation to move from
a DFA to a much higher overhead mechanism will have a serious impact on
validation overhead, which will be bad for e-business uses of XML. I do not
believe, given the use of AND and repetition counts, that a content model
can be proven to be ambiguous in any reasonable amount of effort. And I do
not believe that a pattern can be proven not to match a content model, in
many cases, without a brute-force search of all possibilities. A very simple
example of the problems involved is:

           (A{1..2}, B, C?, D?) | (A{3..4}, B, C, D)

In this model, which would be totally unambiguous by Schema rules, whether C
and D are required depends upon how many As were seen. Extend this scenario
to a situation where multiple such counted elements are nested within
complex patterns, and then throw in AND, where it's not possible to know
what position an input will be in, and things get much worse. A simple
example is:

          (A{1..2}, (C&D&A), F?, G?) | (A{3..4}, (C&D&A), F, G)

Here, this would be ambiguous even in the Schema world; however, proving
this in a generalized way would be difficult. In the current types of
content models, determination of ambiguity is relatively trivial and falls
out of the building of the DFA. But in the types of models proposed by
Schema, the work could be very complex. And if you cannot prove ambiguity,
how can you prove that a particular (failed) path you took through the
pattern was the only one and that another might not match?
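
To make the DFA point concrete, here is a minimal sketch of how a classic
content model like (A, B, C?, D?) validates. The hand-written transition
table below is purely illustrative (a real parser builds it from the model),
but the shape of the work is accurate: one table lookup per child element,
constant memory, and no backtracking.

#include <iostream>
#include <map>
#include <string>
#include <utility>

int main() {
    // DFA for the content model (A, B, C?, D?).
    // States: 0 = expect A, 1 = expect B, 2 = after B (C?/D? allowed),
    //         3 = after C (D? allowed), 4 = after D. States 2..4 accept.
    std::map<std::pair<int, char>, int> dfa;
    dfa[std::make_pair(0, 'A')] = 1;
    dfa[std::make_pair(1, 'B')] = 2;
    dfa[std::make_pair(2, 'C')] = 3;
    dfa[std::make_pair(2, 'D')] = 4;
    dfa[std::make_pair(3, 'D')] = 4;
    bool accepting[] = { false, false, true, true, true };

    // Child sequences to check, one character per child element name.
    const char* sequences[] = { "ABCD", "AB", "ABD", "AC", "ABCC" };
    for (int i = 0; i < 5; ++i) {
        int state = 0;
        bool ok = true;
        for (const char* p = sequences[i]; *p && ok; ++p) {
            // One lookup per child; no transition means invalid, stop here.
            std::map<std::pair<int, char>, int>::const_iterator it =
                dfa.find(std::make_pair(state, *p));
            if (it == dfa.end()) ok = false;
            else state = it->second;
        }
        ok = ok && accepting[state];
        std::cout << sequences[i] << (ok ? " valid" : " invalid") << "\n";
    }
    return 0;
}

By contrast, a particle like A{3..4} needs either a counter or a run of
unrolled states per bound (which explodes for large bounds), an AND group
needs a record of which members have been seen so far, and combining those
under | alternatives is exactly where the backtracking risk in the examples
above comes from.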

I'm sure that I could come up with some far worse ones given a little more
thought, but I believe that it falls on the spec writers' shoulders to prove
the practical viability of any proposed content model mechanisms. If the
spec forces the use of a particular style of content model, then the spec
should provide the proof that it is both theoretically doable and
practically applicable.

Another major concern is that the complexity of the various namespaces and
the import mechanisms will require data structures complex enough, and
layered lookups of high enough overhead, that in many cases parsing and
creating the internal representation of a schema will begin to outweigh the
overhead of parsing and processing the data being validated. Right now, in
our parsers anyway, validation is a pretty small fraction of the overall
work, and parsing and setting up the validation data structures is a very
small fraction, even for pretty complicated DTDs. The overhead of having
Schema expressed in XML, plus the somewhat baroque set of intermediate
structures required to build it and track namespaces and importation, is
going to push this cost up much further. I feel that this does not bode well
for transaction-oriented XML in the e-business space.

In closing, I just want to give the usual admonition against trying to turn
XML into the universal hammer. Doing so will damage its usefulness and make
it no longer the product that originally gave it its appeal. This
unfortunate progression has happened to so many products over time, but
still we fail to learn from past mistakes. If XML continues to grow such
that it cannot be architected to be layered and progressive in its
complexity, it will become SGML, which it was created specifically not to
be. And what will have been the point? Don't look at Schema as some
high-level piece of work that can fix any and all lackings in XML
validation. Schema will be a core piece, and hence will be in almost every
implementation of XML. If it is large, complex, and slow, it will fail.
Instead it should be layered, where complex structural analysis, for those
folks who understand it and are willing to pay the price, is provided by
another XML-related product. When you throw the data types spec into the
mix, the growth in the core services of XML will have far more than doubled,
probably closer to quadrupled. And the complexity of use will have grown by
an order of magnitude.

Those are my comments for the time being. Thanks for listening.
