- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Mon, 30 Sep 96 17:45:46 CDT
- To: "W. Eliot Kimber" <kimber@passage.com>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
On Thu, 26 Sep 1996 13:46:22 -0400 Eliot Kimber said:
>The rules we came up with are:
>
>An XML parser shall interpret white space and record ends in XML
>documents as follows:
>
>1. All white space, including RS and RE, immediately following start
>   tags and immediately preceding end tags is not significant.
>
>2. All other RS/REs are collapsed to a single space.
>
>This approach has the effect that the white space and RS/RE
>collapsing can be done before or after SGML RE rules are applied
>without affecting the result.  The only place this is not true is
>record ends followed by one or more PIs followed by data.  In SGML,
>the RE will be considered to have occurred *after* the PIs, whereas
>in XML it will be considered to have occurred *before* the PIs (there
>are many who consider this behavior of SGML to be a bug that should
>be fixed, or at least made optional, in the SGML revision).
>
>This approach also requires that truly significant record ends in
>data must be escaped in some way.

Sorry to be so slow responding to this post, but I've been out of
network contact.

This is slightly different from the proposal I thought the ERB had
discussed, with consequences for some of the discussion.  My memory
may deceive me, but I thought the ERB's discussion led to a proposal
that in XML, RE is handled this way:

  1 element content is treated as in SGML
  2 white space at the beginning and ending of element ESIS be removed
    (this differs from Eliot's formulation primarily in making clear
    that

      <p>
      A paragraph.
      </p>

    and

      <p>
      <!-- i'd like some significant REs before the text -->
      A paragraph.
      </p>

    have the same ESIS under the rules: comments don't affect this
    phase of white-space stripping.  If you use a vacuous PI instead
    of the comment, the second example works, since PIs are part of
    ESIS.)
  3 sequences of RS and RE are merged into a single RE (or RE/RS
    sequence, if RS isn't being ignored)

This means that verbatim elements CAN be handled using this scheme,
since most record ends ARE preserved, though getting blank lines may
involve ensuring somehow that there is at least a blank character on
the line, to make the REs and RSs non-contiguous.  (A record
consisting of a blank followed by <!> will surely do the trick.)

In XML systems, this is all handled by the parser; in SGML systems
handling XML documents, it's an application convention required by
XML.  As James Clark has pointed out, requiring this white-space
handling helps ensure that SGML and XML tools can interoperate
reliably on the same data -- if systems sometimes eat an RE and
sometimes don't, based on whether under the hood they use native XML
parsers or unmodified SGML parsers, then XML will seem flaky and
unreliable.

Note that within elements, this scheme, like 8879, exposes the user to
apparently capricious changes in behavior, depending on whether the
editor last used (a) strips all trailing white space, (b) strips all
trailing white space and then adds a single blank at the end of each
line, or (c) preserves what you type, so that lines may or may not
have trailing white space, without the difference being visible under
normal operation.  This exposure to caprice is neither an advantage
nor a disadvantage of this scheme, since it's shared by 8879.  It does
mean the proposed white-space stripping at the beginning and ending of
elements is probably more robust and predictable than 8879's rule,
which as far as I can tell distinguishes between <p>@ (insignificant)
and <p> @ (significant).

Since the difference between Eliot's point 2 and my point 3 seems
germane to subsequent discussion, I'm uneasy that no other ERB members
have made this correction.  The fact that they haven't suggests to me
that my recollection may be deceiving me, and Eliot's summary may be
more accurate than mine, as regards the ERB discussion.

But as regards RE handling in XML, I think it's better to reduce
(RE | RS)+ to RE,RS than to " ", because it makes verbatim elements so
much simpler.

(N.B. the purist view that if REs are really significant there should
be markup there does appeal to me a lot.  But it fails the Stooopid
test.  Who will believe that failing to handle PRE in a natural way is
the sign of a system more intelligent than HTML?  Can anyone name a
widely used production text processor that has no equivalent to the
\obeylines of TeX or the XMP of the GML starter set?)

-C. M. Sperberg-McQueen
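
For illustration only, the difference between the two formulations discussed above can be sketched in Python. This is not part of the original message, and it rests on several assumptions: the function names are hypothetical, RS and RE are taken to be characters 10 and 13 as in the SGML reference concrete syntax, "white space" is taken to mean space, tab, RS and RE, each run of adjacent RS/RE characters is treated as one unit, and the input is the character data of a single element with no nested markup, comments or PIs.

```python
import re

RS = "\n"  # record start: character 10 in the reference concrete syntax (assumed)
RE = "\r"  # record end: character 13 in the reference concrete syntax (assumed)


def strip_element_edges(data: str) -> str:
    """Drop white space right after the start tag and right before the end tag."""
    return data.strip(" \t" + RS + RE)


def collapse_to_space(data: str) -> str:
    """Quoted rule 2: each remaining run of RS/RE becomes a single space."""
    return re.sub("[\r\n]+", " ", strip_element_edges(data))


def merge_to_record_end(data: str) -> str:
    """Recalled rule 3: each remaining run of RS/RE becomes a single RE."""
    return re.sub("[\r\n]+", RE, strip_element_edges(data))


if __name__ == "__main__":
    verbatim = RE + RS + "first line" + RE + RS + RE + RS + "second line" + RE + RS
    print(repr(collapse_to_space(verbatim)))
    print(repr(merge_to_record_end(verbatim)))
```

Under these assumptions the first function flattens a verbatim element onto one line, while the second keeps one record end between the two lines. A truly blank line survives the second function only if something non-contiguous, such as a single blank, separates the record ends, which is the workaround mentioned above.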
Received on Monday, 30 September 1996 19:15:01 UTC