RS/RE, again (sorry)
On September 11th, the ERB spent considerable time revisiting the
RS/RE question. We're just a fun-loving bunch, and it became apparent
that the WG is going to have to put some more thought into this.
At the moment, we have the -XML-SPACE mechanism, which toggles between two
behaviors: collapsing of leading and multiple spaces (essentially the HTML
semantic), and passing through all the bytes to the application.
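
To make the two behaviors concrete, here is a minimal Python sketch. The
function name and the mode name PRESERVE are made up for illustration (only
COLLAPSE is named in this discussion), and the exact collapsing rule is an
assumption: leading white space is dropped and every run of white space
becomes a single space.

```python
import re

def apply_xml_space(text, mode):
    """Hypothetical model of the two -XML-SPACE behaviors (not spec text)."""
    if mode == "COLLAPSE":
        # Assumed rule: strip leading white space, then squeeze each
        # remaining run of white space down to one space.
        return re.sub(r"[ \t\r\n]+", " ", text.lstrip(" \t\r\n"))
    if mode == "PRESERVE":
        # Pass everything through untouched; the application decides.
        return text
    raise ValueError(f"unknown mode: {mode!r}")

sample = "  a \n  b  "
collapsed = apply_xml_space(sample, "COLLAPSE")
preserved = apply_xml_space(sample, "PRESERVE")
```

Under these assumptions the collapsed form keeps one trailing space, since
only leading white space is removed outright.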
Some feel the mechanism is unaesthetic and prone to misinterpretation
as it stands, and should simply be discarded, with all non-markup
bytes being passed to the application to do with as it will. This
can be made 8879-compliant in the short term via a mechanism proposed by
Charles Goldfarb, and in the medium/long term with a TC via WG8.
There are some problems with both the current and revised approaches:
o -XML-SPACE, although this is not documented, really only deals with
mixed content; many feel it's important to ignore white space in
element content, e.g.
<list>
<item>..</item>
<item>..</item>
</list>
but XML, when there's no DTD, doesn't know where element content is and
*cannot* be made to do this.
o SGML's world-view tripartitions the set of characters:
those that are text, those that are markup, and those that are
insignificant white space. Can XML really afford to discard this?
o Many real-world editors, largely to deal with the fact that
text (whether or not we like it) is stored in files in what amounts
to a series of records, freely insert line breaks and other white
space because they know SGML processors will ignore it. Can we
afford to make that white space significant?
o Some applications, e.g. full-text indexers, really need to know where
everything is by byte offset, whether or not the bytes are significant;
thus the -XML-SPACE="COLLAPSE" behavior means they can't read the text
with an XML processor (unless they can turn off -XML-SPACE processing
through the API)
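
The byte-offset problem in the last point can be seen in a small Python
sketch. The sample document and the collapsing rule (each run of white space
becomes one space) are assumptions made for illustration only.

```python
import re

# What is actually stored in the file.
raw = b"<p>  full   text\n   indexing</p>"

# What an application would see after an assumed COLLAPSE-style pass.
collapsed = re.sub(rb"[ \t\r\n]+", b" ", raw)

# An indexer records where a word starts.
raw_offset = raw.index(b"indexing")
collapsed_offset = collapsed.index(b"indexing")

# The offset computed from the collapsed text no longer points at the
# word in the original file.
points_at_word = raw[collapsed_offset:collapsed_offset + 8] == b"indexing"
```

Here the word starts at byte 20 in the file but at byte 14 in the collapsed
text, so an index built from the processor's output cannot be used to seek
into the stored document.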
So there are a few things we could do, which are not entirely
mutually exclusive:
1. Go to the RE Delenda Est model. This has the advantage that it's
trivially easy to explain, document, and implement. It has some of the
disadvantages listed above; there is some very strong sentiment on the
ERB against this - look for a follow-up from other ERB-folk.
2. Expand the -XML-SPACE attribute from two values to three. The third would
be named REMOVE or DISCARD or something, and would be designed to signal
element content, i.e. all this white space can be safely ignored.
3. Add language to the spec allowing the application to force the
processor to pass through all the bytes regardless of the -XML-SPACE
setting.
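
Options 2 and 3 could combine roughly as in the following Python sketch.
Everything here is hypothetical: the function name, the force_passthrough
flag, and the value names PRESERVE and DISCARD (the third value's name is not
settled above). The handling of non-blank text under DISCARD is also an
assumption.

```python
import re

WS = " \t\r\n"

def process_text(text, mode, force_passthrough=False):
    """Hypothetical three-valued -XML-SPACE with an application override."""
    if force_passthrough or mode == "PRESERVE":
        # Option 3: the application can insist on seeing every byte.
        return text
    if mode == "COLLAPSE":
        # Assumed rule: drop leading white space, squeeze runs to one space.
        return re.sub(r"[ \t\r\n]+", " ", text.lstrip(WS))
    if mode == "DISCARD":
        # Option 2: the author signals element content, so text made up
        # only of white space is dropped. Non-blank text is left alone
        # here; a real processor might treat it as an error.
        return "" if text.strip(WS) == "" else text
    raise ValueError(f"unknown mode: {mode!r}")

between_items = "\n  "
dropped = process_text(between_items, "DISCARD")
kept = process_text(between_items, "DISCARD", force_passthrough=True)
```

With the override in place, a full-text indexer could ask for the raw bytes
while an ordinary application sees the white space between items removed.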
Your input is requested.
Cheers, Tim Bray
email@example.com http://www.textuality.com/ +1-604-488-1167