Re: RS/RE considered confusing?

The Durand/Nicol proposal on RS/RE is, if I've understood it
correctly, that when an XML document is parsed by an SGML parser, then
it will use an SGML declaration that includes something like:

  RE 65531  -- not a legal ISO 10646 character --
  RS 65532  -- not a legal ISO 10646 character --
  SPACE 32       

I have a number of concerns about this proposal.

1. Most SGML parsers won't handle such an SGML declaration.  The
practical benefit for XML of SGML compatibility is to enable SGML
tools to be used on XML documents.  If XML uses features of SGML that
aren't implemented in most SGML parsers, this practical benefit is

2. It's not clear to me that this in general will work with a
conforming SGML system.  The entity manager is supposed to transform
whatever mechanism the OS uses for representing lines into RS/RE.  So
for example, on a Unix system it is supposed to replace the newline
that the terminates each line with an RE and insert an RS at the
beginning of the line.  So with the SGML declaration above, the entity
manager should put character number 65532 at the beginning of each
line and 65531 at the end.  (SP doesn't actually do this: the entity
manager always uses 10 and 13 as RS and RE even if they have been
changed for the parser by the SGML declaration.  But I don't think I
could argue that SP is correct to do so.)

3. This proposal isn't solving the RE problem, it is (at most) working
around it.  I would like to see XML be something that could be
embraced by WG8 and incorporated into the SGML revision.  I can't see
this sort of hack getting into the SGML revision.

4. It is a fact of life that OSs delimit lines using different methods.
One of the ideas underlying the RS/RE concept in SGML is that these
different methods should canonicalized into a single form, so that
applications are isolated from these differences.  This is a sound
idea which is used in other places (eg the C language), and I don't
think we should throw it out.  I certainly don't want a style sheet
language to have to care about how lines are delimited.

5. In mixed content there are some newlines that must get ignored by
some part of the system (whether the parser or the application).  For
example, if I have

This is a paragraph
of text.

I don't want the newline after the <p> to result in a space at the
beginning of the paragraph.  Who is going to specify these rules?  Are
TEI, DocBook etc each going to specify their own rules about RE

6. Consider a paragraph like this:

<p>A line
<!-- a comment -->
Another line

The SGML rules specify that the newline folling the comment is
ignored.  Now an XML application won't see the comment, so an XML
application will see the same information for this document as for:

<p>A line

Another line

But the SGML rules say that the newline that terminates the blank line
is *not* ignored.  This means that an XML application wouldn't be able
to implement rules that process existing SGML documents (which don't
redeclare RS/RE) in the same way as SGML systems.

7. Moving the responsibility of ignoring newlines to a part of the
system other than the XML parser is going to mean that XML documents
and SGML documents will need different processing.  For example, when
I write a style sheet I would have to know whether I've got an SGML
document or an XML document so that I can specify that newlines are
ignored in the XML document in the way I want.

There may be other problems, but that's all I can think of for now.

There are a few of other approaches that occur to me for dealing with
the RS/RE problem:

a. An approch similar to HTML.  In HTML (except in PRE), line
terminators are just white-space, adjacent white-space is collapsed
and leading/trailing whitespace in an element is stripped.  The
interesting thing about this is that if you perform this process it
doesn't matter whether or not you ignored the REs that SGML said you
should: you'll end up with the same result.  Well, almost: you'll
still have the problems with REs getting moved past inclusions and
PIs.  The main problem with this is how to handle verbatim type
elements.  Can we live without these?

b. Another approach is to restrict what's allowed in mixed content.
For example, you could say that PIs were only allowed in element
content; this avoid the problem of the interaction of REs and PIs in
mixed content.

c. Another possible approach is not to try to be 100% compatible in
this area with SGML *as it now is*.  Instead aim to be compatible with
SGML as it will be revised.  Devise some simplification of the SGML
rules that can be incorporated in the SGML revision (maybe enabled by
some flag in the SGML declaration).

These aren't mutually exclusive.  I suspect that some combination may
be the best way forward.