- From: <lee@sq.com>
- Date: Thu, 12 Dec 96 17:47:58 EST
- To: w3c-sgml-wg@w3.org
<Paul sn="Prescod"> > As I mentioned before, it would be better off leaving out whitespace between > elements. We can't have *both* all whitespace significant *and* reliable > whitespace removal for pretty printing. </Paul> Without a DTD, there is no such thing as element content, since #PCDATA is allowed and recognised everywhere. With a DTD, the distinction is possible. With a DTD, you can also distinguish, as SGML does, between whitespace that does not match anything in the content model, and #PCDATA that happens to contain only whitespace characters. Multiple successive whitespace characters can be collapsed into either zero or one space, depending on the context, or passed through as-is to the receiver (an SGML system or application). In SGML, this is done within a document depending on the element's content model, and within a DTD based on relatively complex but fixed rules. (e.g. multiiple spaces between a GI and an attribute name in <A X="y"> are collapsed into a single space (or SEPCHAR), but outside the angle brackets spaces are retained or collapsed into zero spaces depending on the content model; there is no explicit control over this behaviour in SGML) Unfortunately, this is not sufficient in practice. HTML, for example, uses additional application-defined rules, and many people on this list wanted the ability to have an element in which RS/RE processing was not done, all whitespace matched PCDATA and space was not collapsed. Other people wanted to be able to have multiple successive spaces collapsed in certain places but not others, either for "pretty printing" or in at least one case because they were using editors that put trailing spaces on the input lines. I agree that for those not using SGML-aware tools, pretty-printing is very useful. <list> <item> Unfortunately, although you can do this if you like, SGML doesn't actually say that the spaces and/or tabs at the start of this line are not significant, and has no way of doing so. You can arrange to surround each input line with a tag automatically with the "B" feature of shortref, but then the space hasn't been ignored or collapsed,and you have extra markup. </item> <item> Although SGML gurus can say that they only want to indent the tags, and not the content, and that it is intuitively obvious where spaces are ignored and where they are not, the length of this debate has showed clearly that it is not obvious. </item> <item> If we are going to support this kind of pretty-printing in XML, we must support it properly. That is, we must support it in a way that is simple to understand and explain. </item> </list> It is clear to me that the SGML rules for RS/RE handling are too complex for XML. Charles' proposal would be a good short-term way of disabling RS/RE processing, and adding "RSRE NONE" to the SGML declaration (or some other syntax) to disable special treatment of RS and RE would be very helpful. We may need to add ASCII NL and CR to SEPCHAR as a result of that, depending on how it's worded, but obviously that's no problem. Entirely separate from that are the questions (1) can a validating XML parser ignore whitespace in element content? I think we all agree that the answer should be Yes -- in other words, a space or tab or newline or carriage return between two elements where #PCDATA is not allowed should not be an error. (an XML validator might have an option to produce a warning, though, as if the file is processed by something not looking at a DTD, that space will obviously and necessarily be treated as #PCDATA) (2) in a well-formed XML document, whether or not there is a DTD, should multiple successive spaces that are matched as PCDATA be collapsed into a single space? I think this should depend on the application. For database input & output, I might need spaces retained, for example. Therefore, an XML reader should pass all whitespace through to its client. In other words, white space should be retained by the XML reader, but should be treated as whitespace and not PCDATA by a validating parser checking a content model. Lee Long Note: I have used the term "XML reader" to mean the code that reads the sequence of input characters and turns it into a stream of tokens or potential tree nodes for a program that will build a tree or do whatever else it wants with the information; in computer science this is called a parser, but an SGML parser does something quite different. An SGML parser such as SP would, when reading an XML document, include (conceptually) both an XML reader and also content validation and other processing code. The term "XML reader" is not proposed here as a term for the XML spec, but only to try and clarify this message.
Received on Thursday, 12 December 1996 17:49:02 UTC