- From: James Clark <jjc@jclark.com>
- Date: Tue, 17 Sep 1996 11:41:31 +0000
- To: w3c-sgml-wg@w3.org
The Durand/Nicol proposal on RS/RE is, if I've understood it correctly, that when an XML document is parsed by an SGML parser, it will use an SGML declaration that includes something like:

    FUNCTION RE 65531 -- not a legal ISO 10646 character --
             RS 65532 -- not a legal ISO 10646 character --
             SPACE 32
             TAB SEPCHAR 9
             CR SEPCHAR 10
             LF SEPCHAR 13

I have a number of concerns about this proposal.

1. Most SGML parsers won't handle such an SGML declaration. The practical benefit for XML of SGML compatibility is to enable SGML tools to be used on XML documents. If XML uses features of SGML that aren't implemented in most SGML parsers, this practical benefit is lost.

2. It's not clear to me that this will work in general with a conforming SGML system. The entity manager is supposed to transform whatever mechanism the OS uses for representing lines into RS/RE. So, for example, on a Unix system it is supposed to replace the newline that terminates each line with an RE and insert an RS at the beginning of the line. So with the SGML declaration above, the entity manager should put character number 65532 at the beginning of each line and 65531 at the end. (SP doesn't actually do this: the entity manager always uses 10 and 13 as RS and RE even if they have been changed for the parser by the SGML declaration. But I don't think I could argue that SP is correct to do so.)

3. This proposal isn't solving the RE problem; it is (at most) working around it. I would like to see XML be something that could be embraced by WG8 and incorporated into the SGML revision. I can't see this sort of hack getting into the SGML revision.

4. It is a fact of life that OSs delimit lines using different methods. One of the ideas underlying the RS/RE concept in SGML is that these different methods should be canonicalized into a single form, so that applications are isolated from these differences.
This is a sound idea which is used in other places (e.g. the C language), and I don't think we should throw it out. I certainly don't want a style sheet language to have to care about how lines are delimited.

5. In mixed content there are some newlines that must get ignored by some part of the system (whether the parser or the application). For example, if I have

    <p>
    This is a paragraph of text.

I don't want the newline after the <p> to result in a space at the beginning of the paragraph. Who is going to specify these rules? Are TEI, DocBook etc. each going to specify their own rules about RE significance?

6. Consider a paragraph like this:

    <p>A line
    <!-- a comment -->
    Another line

The SGML rules specify that the newline following the comment is ignored. Now an XML application won't see the comment, so an XML application will see the same information for this document as for:

    <p>A line

    Another line

But the SGML rules say that the newline that terminates the blank line is *not* ignored. This means that an XML application wouldn't be able to implement rules that process existing SGML documents (which don't redeclare RS/RE) in the same way as SGML systems.

7. Moving the responsibility for ignoring newlines to a part of the system other than the XML parser is going to mean that XML documents and SGML documents will need different processing. For example, when I write a style sheet I would have to know whether I've got an SGML document or an XML document so that I can specify that newlines are ignored in the XML document in the way I want.

There may be other problems, but that's all I can think of for now.

There are a few other approaches that occur to me for dealing with the RS/RE problem:

a. An approach similar to HTML. In HTML (except in PRE), line terminators are just white-space, adjacent white-space is collapsed and leading/trailing white-space in an element is stripped.
The interesting thing about this is that if you perform this process it doesn't matter whether or not you ignored the REs that SGML said you should: you'll end up with the same result. Well, almost: you'll still have the problems with REs getting moved past inclusions and PIs. The main problem with this is how to handle verbatim-type elements. Can we live without these?

b. Another approach is to restrict what's allowed in mixed content. For example, you could say that PIs were only allowed in element content; this avoids the problem of the interaction of REs and PIs in mixed content.

c. Another possible approach is not to try to be 100% compatible in this area with SGML *as it now is*. Instead, aim to be compatible with SGML as it will be revised. Devise some simplification of the SGML rules that can be incorporated in the SGML revision (maybe enabled by some flag in the SGML declaration).

These aren't mutually exclusive. I suspect that some combination may be the best way forward.

James
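
A rough sketch of approach (a), combined with the canonicalization of line terminators described in concern 4, might look like the following. It is illustrative only: the function names are hypothetical and come from no SGML or XML tool, and it assumes plain character data with no PRE-style verbatim elements, inclusions or PIs, which are exactly the cases flagged above as still problematic.

```python
import re

# Hypothetical helper names, chosen for illustration only.
_LINE_END = re.compile(r"\r\n|\r|\n")   # CR LF, bare CR, bare LF
_WS_RUN = re.compile(r"[ \t\n]+")       # any run of space, tab, newline


def canonicalize_lines(text: str) -> str:
    """Map each OS-specific line terminator to a single newline."""
    return _LINE_END.sub("\n", text)


def normalize_content(text: str) -> str:
    """HTML-style handling: line terminators count as white-space,
    adjacent white-space collapses to one space, and leading/trailing
    white-space in the element content is stripped."""
    collapsed = _WS_RUN.sub(" ", canonicalize_lines(text))
    return collapsed.strip(" ")


# Content of <p> with the newlines SGML would ignore left in place ...
with_re = "\r\nThis is a paragraph\r\nof text.\r\n"
# ... and the same content with those newlines already removed.
without_re = "This is a paragraph\nof text."

print(normalize_content(with_re) == normalize_content(without_re))
```

Under these assumptions both inputs normalize to the same string, which is the point made in (a): once white-space is collapsed and stripped, it no longer matters whether the leading and trailing REs were ignored first.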
Received on Tuesday, 17 September 1996 06:46:57 UTC