RS/RE, again (sorry)

On December 11th, the ERB spent considerable time revisiting the 
RS/RE question.  We're just a fun-loving bunch, and it became apparent
that the WG is going to have to put some more thought into this.

At the moment, we have the -XML-SPACE mechanism, which toggles between two
behaviors: collapsing of leading and multiple spaces (essentially the HTML
semantic), and passing through all the bytes to the application.
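
For concreteness, here is a rough sketch of the two behaviors in Python.
It is illustrative only, not spec language: the value name PRESERVE and
the function name are made up for the example, and the exact collapsing
rule (drop leading white space, squeeze each run of white space to one
space) is an assumption about what "the HTML semantic" means here.

```python
import re

def apply_xml_space(text, mode="COLLAPSE"):
    """Return the character data as the application would see it."""
    if mode == "PRESERVE":
        # Hypothetical value name: pass every character through untouched.
        return text
    if mode == "COLLAPSE":
        # Assumed rule: squeeze each run of white space to a single space,
        # then drop any leading space.
        return re.sub(r"[ \t\r\n]+", " ", text).lstrip(" ")
    raise ValueError("unknown -XML-SPACE value: " + mode)

sample = "  Hello \n   world"
collapsed = apply_xml_space(sample, "COLLAPSE")
preserved = apply_xml_space(sample, "PRESERVE")
```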

Some feel the mechanism is unaesthetic and prone to misinterpretation
as it stands, and should simply be discarded, with all non-markup 
bytes being passed to the application to do with as it will.  This
can be made 8879-compliant in the short term via a mechanism proposed by 
Charles Goldfarb, and in the medium/long term with a TC via WG8.

There are some problems with both the current and revised approaches:

o -XML-SPACE, although this is not documented, really only deals with
  mixed content; many feel it's important to ignore white space in
  element content; <list> <item>..</item>  <item>..</item> </list>
                                         ^^
                                     e.g. the above
  but XML, when there's no DTD, doesn't know where element content is and 
  *cannot* be made to do this.

o SGML's world-view tripartitions the set of characters:
  those that are text, those that are markup, and those that are 
  insignificant white space.  Can XML really afford to discard this
  distinction?

o Many real-world editors, largely to deal with the fact that 
  text (whether or not we like it) is stored in files in what amounts
  to a series of records, freely insert line breaks and other white
  space because they know SGML processors will ignore it.  Can we
  afford to make that white space significant?

o Some applications, e.g. full-text indexers, really need to know where
  everything is by byte offset, whether or not the bytes are significant;
  thus the -XML-SPACE="COLLAPSE" behavior means they can't read the text
  with an XML processor (unless they can turn off -XML-SPACE processing
  through the API).
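
To make the last point concrete, here is a small Python sketch of how
collapsing shifts byte offsets. The collapse() helper is a stand-in for
whatever a processor would actually do under -XML-SPACE="COLLAPSE"; its
exact rule is an assumption, and the sample text is invented.

```python
import re

def collapse(data):
    # Assumed COLLAPSE rule: squeeze white-space runs, drop leading space.
    return re.sub(rb"[ \t\r\n]+", b" ", data).lstrip(b" ")

# Character data as it sits in the file, with editor-inserted white space.
raw = b"  full-text\n   indexing"
word = b"indexing"

offset_in_file = raw.find(word)                 # where the word really is
offset_after_collapse = collapse(raw).find(word)  # what the indexer is told

# The two offsets differ, so an indexer fed collapsed text cannot map a
# hit back to its position in the original file.
offsets_agree = offset_in_file == offset_after_collapse
```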

So there are a few things we could do, which are not entirely
mutually exclusive.

1. Go to the RE Delenda Est model.  This has the advantage that it's
   trivially easy to explain, document, and implement.  It has some of the 
   disadvantages listed above; there is some very strong sentiment on the 
   ERB against this - look for a follow-up from other ERB-folk.
2. Expand the -XML-SPACE attribute from two values to three.  The third would
   be named REMOVE or DISCARD or something, and would be designed to signal
   element content, i.e. all this white space can be safely ignored.
3. Add language to the spec allowing the application to force the 
   processor to pass through all the bytes regardless of the -XML-SPACE
   setting.
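
Options 2 and 3 can be combined, and might look roughly like the Python
sketch below. Nothing here is agreed: REMOVE is just one of the candidate
names mentioned above, PRESERVE and the force_passthrough flag are
invented for the example, and the rule that REMOVE only drops
white-space-only text is an assumption about what "safely ignored" would
mean in element content.

```python
import re

WHITE_SPACE = " \t\r\n"

def process_text(text, xml_space="COLLAPSE", force_passthrough=False):
    """Return the text the processor would hand to the application."""
    # Option 3: the application can insist on seeing every byte,
    # whatever the attribute says.
    if force_passthrough or xml_space == "PRESERVE":
        return text
    if xml_space == "COLLAPSE":
        return re.sub(r"[ \t\r\n]+", " ", text).lstrip(" ")
    # Option 2: a third value signalling element content, where text
    # consisting only of white space is discarded.
    if xml_space == "REMOVE":
        return "" if text.strip(WHITE_SPACE) == "" else text
    raise ValueError("unknown -XML-SPACE value: " + xml_space)

between_items = "\n   "
dropped = process_text(between_items, "REMOVE")
kept = process_text(between_items, "REMOVE", force_passthrough=True)
```

An indexer would call this with force_passthrough=True and get the same
bytes that are in the file; everyone else gets the behavior the attribute
asks for.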

Your input is requested.

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167

Received on Wednesday, 11 December 1996 15:01:58 UTC