Re: RS/RE, again (sorry)

<Paul sn="Prescod">
> As I mentioned before, it would be better off leaving out whitespace between
> elements. We can't have *both* all whitespace significant *and* reliable
> whitespace removal for pretty printing.

Without a DTD, there is no such thing as element content, since #PCDATA
is allowed and recognised everywhere.

With a DTD, the distinction is possible.

With a DTD, you can also distinguish, as SGML does, between whitespace
that does not match anything in the content model, and #PCDATA that happens
to contain only whitespace characters.

Multiple successive whitespace characters can be collapsed into either
zero or one space, depending on the context, or passed through as-is to
the receiver (an SGML system or application).
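As a rough illustration of those three options, here is a small Python sketch. The function names are mine, invented for this message, and nothing here is defined by SGML or XML; it only assumes that "whitespace" means space, tab, CR and LF.

```python
import re

# Assumption: whitespace means space, tab, carriage return and line feed.
WS_RUN = re.compile("[ \t\r\n]+")

def collapse_to_one(text):
    """Replace each run of whitespace with a single space."""
    return WS_RUN.sub(" ", text)

def collapse_to_zero(text):
    """Remove each run of whitespace entirely."""
    return WS_RUN.sub("", text)

def pass_through(text):
    """Hand the characters to the receiver exactly as read."""
    return text

sample = "a \t\n b"
one = collapse_to_one(sample)
zero = collapse_to_zero(sample)
same = pass_through(sample)
```

Which of the three is right depends on where the whitespace occurs, which is exactly the part that is hard to specify simply.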

In SGML, this is done within a document depending on the element's content
model, and within a DTD based on relatively complex but fixed rules.
(e.g. multiple spaces between a GI and an attribute name in <A   X="y"> are
collapsed into a single space (or SEPCHAR), but outside the angle brackets
spaces are retained or collapsed into zero spaces depending on the
content model; there is no explicit control over this behaviour in SGML)

Unfortunately, this is not sufficient in practice.  HTML, for example,
uses additional application-defined rules, and many people on this list
wanted the ability to have an element in which RS/RE processing was not
done, all whitespace matched PCDATA and space was not collapsed.
Other people wanted to be able to have multiple successive spaces collapsed
in certain places but not others, either for "pretty printing" or in
at least one case because they were using editors that put trailing spaces
on the input lines.

I agree that for those not using SGML-aware tools, pretty-printing is
very useful.

        Unfortunately, although you can do this if you like, SGML doesn't
        actually say that the spaces and/or tabs at the start of this line
        are not significant, and has no way of doing so.
        You can arrange to surround each input line with a tag automatically
        with the "B" feature of shortref, but then the space hasn't been
        ignored or collapsed, and you have extra markup.
         Although SGML gurus can say that they only want to indent the tags,
         and not the content, and that it is intuitively obvious where spaces
         are ignored and where they are not, the length of this debate has
         shown clearly that it is not obvious.
         If we are going to support this kind of pretty-printing in XML,
         we must support it properly.  That is, we must support it in a way
         that is simple to understand and explain.

It is clear to me that the SGML rules for RS/RE handling are too complex for
XML.  Charles' proposal would be a good short-term way of disabling RS/RE
processing, and adding "RSRE NONE" to the SGML declaration (or some other
syntax) to disable special treatment of RS and RE would be very helpful.
We may need to add ASCII NL and CR to SEPCHAR as a result of that, depending
on how it's worded, but obviously that's no problem.

Entirely separate from that are the questions
(1) can a validating XML parser ignore whitespace in element content?
    I think we all agree that the answer should be Yes -- in other words,
    a space or tab or newline or carriage return between two elements where
    #PCDATA is not allowed should not be an error.
    (an XML validator might have an option to produce a warning, though,
    because if the file is processed by something that does not look at a
    DTD, that space will obviously and necessarily be treated as #PCDATA)

(2) in a well-formed XML document, whether or not there is a DTD, should
    multiple successive spaces that are matched as PCDATA be collapsed into
    a single space?
    I think this should depend on the application.
    For database input & output, I might need spaces retained, for example.
    Therefore, an XML reader should pass all whitespace through to its client.

In other words, white space should be retained by the XML reader, but should
be treated as whitespace and not PCDATA by a validating parser
checking a content model.
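
To make the split concrete, here is a minimal Python sketch. It uses the standard xml.etree.ElementTree module as a stand-in for an "XML reader", since it hands back character data without collapsing it. The ELEMENT_ONLY set is a hypothetical substitute for what a DTD would declare, and text_chunks and classify are names I made up for illustration; this is not a proposal for how a real validator should be built.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"

# Hypothetical stand-in for a DTD: elements whose content model
# does not allow #PCDATA.
ELEMENT_ONLY = {"list"}

def text_chunks(elem):
    """Reader side: yield (parent_tag, text) for every run of
    character data, exactly as it appeared in the input."""
    if elem.text:
        yield elem.tag, elem.text
    for child in elem:
        yield from text_chunks(child)
        if child.tail:
            yield elem.tag, child.tail

def classify(parent_tag, text):
    """Validator side: decide how a run of character data is treated
    when checking the (assumed) content model."""
    if parent_tag in ELEMENT_ONLY:
        # Whitespace between elements is not an error; anything else is.
        return "whitespace" if text.strip(XML_WS) == "" else "error"
    return "pcdata"

doc = "<list>\n  <item>a  b</item>\n  <item> </item>\n</list>"
root = ET.fromstring(doc)
chunks = list(text_chunks(root))
report = [(tag, text, classify(tag, text)) for tag, text in chunks]
```

The reader passes every space and newline through, including the two spaces inside the first item. Only the validating layer, which knows the content model, treats the whitespace directly inside list as whitespace rather than PCDATA. Without ELEMENT_ONLY (that is, without a DTD) every chunk would be classified as pcdata.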


Long Note: I have used the term "XML reader" to mean the code that reads the
sequence of input characters and turns it into a stream of tokens or
potential tree nodes for a program that will build a tree or do whatever
else it wants with the information; in computer science this is called a
parser, but an SGML parser does something quite different.  An SGML
parser such as SP would, when reading an XML document, include (conceptually)
both an XML reader and also content validation and other processing code.
The term "XML reader" is not proposed here as a term for the XML spec, but
only to try and clarify this message.