Trying to sum up a bit
On this whitespace bit, we're going to have to sum up, make some sort of
compromise, and move forward. If I've been bad, when I die they're going to
put me me in a room with the SGML WG for million-year arguments about RS/RE,
with breaks every other millenium to discuss URNs and FPIs.
1. It is highly desirable that the many useful things you can do without
DTD's (display, indexing) be done without DTD's.
2. Not having a DTD makes it impossible in principle to detect certain
classes of cannot-possibly-be-data white space, i.e. in element content
3. Not being able to distinguish element from mixed content means that
certain addressing constructs (e.g. those that depend on HyTime
pseudo-elements) won't work in DTD-free mode if they take the DTD's view of
what is mixed & what is element content.
4. This element-content white space, along with certain other classes of white
space that are not data will inevitably be introduced as a result of normal
usage (e.g. editors' line-breaks) and users will be irritated if these are
Thus, it seems like there are a very small number of alternatives:
1. Require the DTD at all times.
Pro: the problems go away.
Con: the DTD is required at all times.
2. Work significant WS into the definition of well-formedness (Grosso)
Adopt a tight set of rules for well-formed documents such that all white
space, except RE's after tags, may safely assumed by a DTD-less processor
to be significant.
Pro: the problem goes away
Con: 1. it's a lot harder to author XML without an XML-savvy tool
2. you can't test well-formedness without looking at a DTD
3. All non-markup bytes are signicant, whitespace or not (Durand)
Pro: Everyone can understand the rules, it's easy to implement
Con: You lose certain Hytime addressing facilities, and the application
gets no help from the XML processor in ignoring WS that to the user
is "obviously" irrelevant.
4. Use an mechanism *in the instance* to signal a DTD-less application
what's going on.
4.1 The PI-based DTD summary (Sperberg-McQueen)
4.2 Explicit quoting of significant character data (Goldfarb)
4.3.1 -XML-SPACE with one value, PRESERVE (Paoli)
4.3.2 -XML-SPACE with two values, PRESERVE/COLLAPSE (Current ERB)
4.3.3 -XML-SPACE with three values, PRESERVE/COLLAPSE/SUPPRESS (Bray)
[which we might want to rename, if what we're really doing is
signalling element content, mixed content, and verbatim content]
4.4 Escaping of non-significant WS (Prescod)
Pro: Solves the problems
Con: Requires extra work from authors, possibly duplicates DTD info with
potential for loss of sync, and tends to look ugly & unnatural
Under (4.3), there are some behavioral options:
4.3a1. the XML processor does what -XML-SPACE says, eats the whitespace
or not, and consumes the attribute, not passing it to the application
4.3a2. the XML processor does what -XML-SPACE says *and* passes it on to
4.3a3. the XML processor merely passes -XML-SPACE along to the application,
along with all the bytes of data; it is the application's job to
do the right thing
4.3b1. A validating XML processor reading a DTD generates synthetic
-XML-SPACE attributes for the application's consumption, based on
its knowledge of mixed/element content
4.3b2. -XML-SPACE is only ever inserted by humans.
Face facts, folk. There is just not a solution that is going to solve
this problem and be free of some cost. And please remember the cost
of explanation and education is very real. For myself, my preference
would be (in descending order) #3, #4.3.3 (a3/b2), 4.3.2, 4.3.1.
Cheers, Tim Bray
firstname.lastname@example.org http://www.textuality.com/ +1-604-488-1167