Mixed content considered harmful... from Paul Prescod on 1999-05-10 (www-xml-schema-comments@w3.org from April to June 1999)

From: Paul Prescod <paul@prescod.net>
Date: Mon, 10 May 1999 17:32:01 -0500
To: www-xml-schema-comments@w3.org, xml-dev <xml-dev@ic.ac.uk>
Message-ID: <37375E61.42B0CA2F@prescod.net>

XML Schema Part 1 seems to import a mistake from SGML and XML. This is the
idea that content models must be either text-containing ("mixed") or
element-containing, and that the former sort of model must not constrain
the ordering of elements and text nodes.

"A content model for mixed content provides for mixing elements with
character data in document instances. The allowed element types are named,
but neither their order or their number of occurrences are constrained."

SGML had a separation between mixed and element content, but it did not
forbid constraining the order and occurrence of text nodes and element
nodes. #PCDATA was just a token and you could use it where you wanted.

What it did have was a massive bug in its parsing algorithm that made
these "constrained" mixed content models impossible to use. The bug had
nothing to do with validation -- it was a parser problem.

There sprang up a superstition that these mixed content models were evil,
when the truth is that the particular bug in SGML was the real problem.
Before it was clear that we could change SGML, XML adopted a ridiculously
confusing rule about the use of mixed content. It didn't occur to me (or
probably anyone else) that it would have been better to just fix the bug.
We probably didn't know at that point that we had that option.

Now this rule has been imported into XML Schema. The rule is even more out
of place in XML Schema than it was in XML itself. Back then we had the
opportunity to fix the bug. Today the bug is not even relevant -- XML
Schema works on the result of the parse; it does not influence the
parse.

#PCDATA is just a data type that is unconstrained. You should be able to
mix data type refs, #PCDATA and element type refs in content models with
impunity (barring real parsing ambiguity). Using old syntax:

<!ELEMENT SECTION (#PCDATA, P+)>
<!ELEMENT FIG (#PCDATA|IMG)>
<!ELEMENT HTML (TITLE,(#PCDATA|P)+)>

You can handle any of these with wrappers but I claim that the instinct to
wrap these things arises more from exposure to the superstition than from
fundamental design considerations. We can make XSchema more uniform by
removing the concept of "mixed content" and by introducing a PCDATA
content token type.
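
To illustrate why validating such models after the parse is not hard,
here is a minimal sketch in Python. It is not taken from any schema
processor; the helper names (child_tokens, compile_model, matches) are
hypothetical. It assumes that whitespace-only text is ignored, that a
#PCDATA token may match empty text, and that the model is unambiguous. It
treats #PCDATA as one more token in the child sequence and checks that
sequence against the content model with an ordinary regular expression.

```python
import re
import xml.etree.ElementTree as ET

PCDATA = "#PCDATA"


def child_tokens(elem):
    """Flatten parsed content into tokens: '#PCDATA' for each
    non-whitespace text run, the tag name for each child element."""
    tokens = []
    if elem.text and elem.text.strip():
        tokens.append(PCDATA)
    for child in elem:
        tokens.append(child.tag)
        if child.tail and child.tail.strip():
            tokens.append(PCDATA)
    return tokens


def compile_model(model):
    """Translate a DTD-style content model into a regex over a stream of
    space-terminated tokens."""
    parts = []
    for tok in re.findall(r"#?[A-Za-z][\w.-]*|[(),|+*?]", model):
        if tok == "(":
            parts.append("(?:")
        elif tok == ",":
            continue  # a sequence is plain concatenation in a regex
        elif tok == PCDATA:
            # Assumption: a #PCDATA token may match empty text.
            parts.append("(?:(?:#PCDATA )?)")
        elif tok in ")|+*?":
            parts.append(tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def matches(model, xml_text):
    """Return True if the root element's content satisfies the model."""
    elem = ET.fromstring(xml_text)
    stream = "".join(tok + " " for tok in child_tokens(elem))
    return compile_model(model).fullmatch(stream) is not None


if __name__ == "__main__":
    section = "(#PCDATA, P+)"
    print(matches(section, "<SECTION>Intro<P>a</P><P>b</P></SECTION>"))
    print(matches(section, "<SECTION><P>a</P>trailing</SECTION>"))
```

Under these assumptions the first call accepts text followed by
paragraphs and the second rejects text that trails the paragraphs, so the
order of text and elements is constrained without any wrapper element.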
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

And so, in one of history's little ironies, the global triumph of bad
software in the age of the PC was reversed by a surprising combination
of forces: the social transformation initiated by the network, a
long-discarded European theory of political economy, and a small band
of programmers throughout the world mobilized by a single simple idea. 
 - http://old.law.columbia.edu/my_pubs/anarchism.html

Received on Monday, 10 May 1999 19:33:00 UTC