Re: Current Status of Discussion on RE/RS Handling

On Thu, 26 Sep 1996 13:46:22 -0400 Eliot Kimber said:
>The rules we came up with are:
>
>An XML parser shall interpret white space and record ends in XML
>documents as follows:
>
>1. All white space, including RS and RE, immediately following start
>   tags and immediately preceding end tags is not significant.
>
>2. All other RS/REs are collapsed to a single space.
>
>This approach has the effect that the white space and RS/RE
>collapsing can be done before or after SGML RE rules are applied
>without affecting the result.  The only place this is not true is
>record ends followed by one or more PIs followed by data. In SGML,
>the RE will be considered to have occurred *after* the PIs, whereas
>in XML it will be considered to have occurred *before* the PIs (there
>are many who consider this behavior of SGML to be a bug that should
>be fixed, or at least made optional, in the SGML revision).
>
>This approach also requires that truly significant record ends in
>data must be escaped in some way.

Sorry to be so slow responding to this post, but I've been out of
network contact.

This is slightly different from the proposal I thought the ERB had
discussed, with consequences for some of the discussion.  My memory may
deceive me, but I thought the ERB's discussion led to a proposal that
in XML, RE is handled this way:

1 element content is treated as in SGML

2 white space at the beginning and end of element ESIS is removed.
This differs from Eliot's formulation primarily in making clear that


  <p>

  A paragraph.

  </p>

and

  <p>

  <!-- i'd like some significant REs before the text -->


  A paragraph.

  </p>

have the same ESIS under the rules:  comments don't affect this phase of
white-space stripping.  If you use a vacuous PI instead of the comment,
the REs between it and the text are preserved, since PIs are part of ESIS.

3 sequences of RS and RE are merged into a single RE (or RE/RS sequence,
if RS isn't being ignored)
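
For concreteness, rule 2 might be sketched as below.  This is only an
illustration, not anything the ERB wrote down: the function name
strip_element_edges is hypothetical, the comment pattern is a
simplification of SGML comment declarations, and I am assuming that
comments are discarded before the stripping happens because they are
not part of ESIS.

```python
import re

# Assumption: comments are dropped before stripping, since they are not in ESIS.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
WHITE_SPACE = " \t\r\n"


def strip_element_edges(content: str) -> str:
    """Remove comments, then white space at the start and end of the content."""
    return COMMENT.sub("", content).strip(WHITE_SPACE)


# Content of the <p> elements in the two examples, plus a PI variant.
plain = "\n\n  A paragraph.\n\n  "
with_comment = (
    "\n\n  <!-- i'd like some significant REs before the text -->"
    "\n\n\n  A paragraph.\n\n  "
)
with_pi = "\n\n  <?keep>\n\n\n  A paragraph.\n\n  "

# The comment version and the plain version come out identical.
print(strip_element_edges(plain) == strip_element_edges(with_comment))
# The PI stays in the content, so the record ends after it survive.
print(repr(strip_element_edges(with_pi)))
```

The PI name "keep" is made up for the example; any vacuous PI would
behave the same way in this sketch.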

This means that verbatim elements CAN be handled using this scheme,
since most record ends ARE preserved, though getting blank lines may
involve ensuring somehow that there is at least a blank character on the
line, to make the REs and RSs non-contiguous.  (A record consisting of
a blank followed by <!> will surely do the trick.)
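
A small sketch of rule 3 and the blank-line consequence, under stated
assumptions: RS is character 10 and RE is character 13 (as in the
reference concrete syntax), and the function name
merge_record_boundaries and its keep_rs flag are hypothetical.  It does
not model the <!> trick, only the effect of a blank on the otherwise
empty line.

```python
import re

RS = "\n"  # record start, assumed to be character 10
RE = "\r"  # record end, assumed to be character 13


def merge_record_boundaries(text: str, keep_rs: bool = False) -> str:
    """Merge each run of RS/RE characters into one RE (or RE followed by RS)."""
    replacement = RE + RS if keep_rs else RE
    return re.sub(r"[\r\n]+", replacement, text)


# An empty record between two lines: its RE and RS are contiguous with
# the neighbours, so the blank line disappears.
empty_blank_line = "line one" + RE + RS + RE + RS + "line two"

# The same, but with one blank on the middle line: the runs are no
# longer contiguous, so both record ends survive.
padded_blank_line = "line one" + RE + RS + " " + RE + RS + "line two"

print(repr(merge_record_boundaries(empty_blank_line)))
print(repr(merge_record_boundaries(padded_blank_line)))
```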

In XML systems, this is all handled by the parser; in SGML systems
handling XML documents, it's an application convention required by XML.
As James Clark has pointed out, requiring this white-space handling
helps ensure that SGML and XML tools can interoperate reliably on the
same data -- if systems sometimes eat an RE and sometimes don't, based
on whether under the hood they use native XML parsers or unmodified SGML
parsers, then XML will seem flaky and unreliable.


Note that within elements, this scheme, like 8879, exposes the
user to apparently capricious changes in behavior, depending on whether
the editor last used (a) strips all trailing white space, (b) strips all
trailing white space and then adds a single blank at the end of each
line, or (c) preserves what you type, so that lines may or may not have
trailing white space, without the difference being visible under normal
operation.  This exposure to caprice is neither an advantage nor a
disadvantage of this scheme, since it's shared by 8879.  It does mean
the proposed white-space stripping at the beginning and ending of
elements is probably more robust and predictable than 8879's rule, which
as far as I can tell distinguishes between <p>@ (insignificant) and
<p> @ (significant).

Since the difference between Eliot's point 2 and my point 3 seems
germane to subsequent discussion, I'm uneasy that no other ERB members
have made this correction.  The fact that they haven't suggests to me
that my recollection may be deceiving me, and Eliot's summary may be
more accurate than mine, as regards the ERB discussion.  But as regards
RE handling in XML, I think it's better to reduce (RE | RS)+ to RE,RS
than to " ", because it makes verbatim elements so much simpler.
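
To make the contrast concrete, here is a rough sketch of the two
reductions applied to a two-line verbatim passage.  The function names
are hypothetical, and RE and RS are again assumed to be characters 13
and 10.

```python
import re

RUN = re.compile(r"[\r\n]+")  # a run of RE and RS characters


def collapse_to_space(text: str) -> str:
    """Eliot's point 2: every run of RS/RE becomes a single space."""
    return RUN.sub(" ", text)


def collapse_to_re_rs(text: str) -> str:
    """My point 3: every run of RS/RE becomes the pair RE,RS."""
    return RUN.sub("\r\n", text)


verbatim = "Roses are red,\r\nViolets are blue,"

# With a space, the line structure is gone and cannot be recovered.
print(repr(collapse_to_space(verbatim)))
# With RE,RS, splitting on the pair gives the original lines back.
print(collapse_to_re_rs(verbatim).split("\r\n"))
```

Under the first reduction a verbatim element needs explicit markup or
escaping for every line break; under the second it needs nothing extra
unless blank lines matter.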

(N.B. the purist view that if REs are really significant there should be
markup there does appeal to me a lot.  But it fails the Stooopid test.
Who will believe that failing to handle PRE in a natural way is the sign
of a system more intelligent than HTML?  Can anyone name a widely used
production text processor that has no equivalent to the \obeylines of
TeX or the XMP of the GML starter set?)

-C. M. Sperberg-McQueen

Received on Monday, 30 September 1996 19:15:01 UTC