- From: Christopher R. Maden <crm@ebt.com>
- Date: Wed, 2 Oct 1996 22:23:18 GMT
- To: w3c-sgml-wg@w3.org
Here's a proposal, constructed with help from Gavin Nicol and David Durand:

1) All XML entities are single records or part of records, with no record boundary characters.

This is the hardest part of the proposal. I do not believe that it is incompatible with 8879, though I am open to arguments. Here's just about everything 8879 has about records:

   4.252 record: A division of an SGML entity, bounded by a record start and a record end character, normally corresponding to an input line on a text entry device.

   NOTES

   1 It is called a "record" rather than a "line" to distinguish it from the output lines created by a text formatter.

   2 An SGML entity could consist of many records, a single record, or text with no record boundary characters at all (which can be thought of as being part of a record or without records, depending on whether record boundary characters occur elsewhere in the document).

   4.253 record boundary (character): The record start (RS) or record end (RE) character.

   4.254 record end: A _function character_ ([54]), assigned by the concrete syntax, that represents the end of a record.

   4.255 record start: A _function character_ ([54]), assigned by the concrete syntax, that represents the start of a record.

It is absolutely clear that record boundaries are something that *can* occur in entities. It is clear from note 2 to 4.252 that entities need not have record boundaries. It is *not* clear that carriage return or newline sequences need be turned into record boundary characters by an entity manager.

If it is acceptable to the working group and ERB, this makes XML easy to implement, for both SGML-based and non-SGML based solutions; easy to generate, either by hand, from an editor, or from a script of some sort; and easy to understand.

2) A mechanism is defined for identifying this record non-creation to SGML systems.

Currently, SGML has no rules (as far as I can see) for identifying records within system objects.
If that's correct, then a mechanism is needed to determine what record-creation rules should be used for XML. I propose using the APPINFO declaration in the SGML declaration. APPINFO XML should suffice.

It's true that no SGML parser currently treats entities as single records. However, I (and others) don't think it would be too difficult to change that. We could be wrong.

3) The RE and RS characters are defined to non-occurring code points.

James Clark interpreted the definitions of these characters to be the signals that the entity manager should use when communicating with the parser. Given (1), the record boundaries will not be occurring, but the parser must not mistake code points 10 and 13 in data for record boundaries.

4) Code points 10 and 13 are defined as SEPCHAR.

This allows them to occur in element content without error.

5) Whitespace handling is defined as an application convention.

5.1) In verbatim-styled content elements, all whitespace is preserved.

5.2) In non-verbatim content elements, leading and trailing whitespace is stripped. Intermediate whitespace is normalized to a single space before formatting.

5.3) In certain special elements, whitespace is eliminated.

For 5.3, it doesn't need to involve special element names or parsing rules, just a formatting convention. The example I'm thinking of is a CR between table rows. Clearly, when formatting <tbody> as a set of table row objects, data in between is not meaningful. If it's non-whitespace, issue an error, but if it's whitespace, ignore it. (See below on why there might be whitespace there after parsing.)

-=-=-=-

I envision three types of parsing of XML documents.

a) SGML parsing. This will require a DTD, even if a FRED-generated or ANY one, per 8879.

b) XML parsing with DTD. For the purposes of this discussion, the only relevant part of the DTD is whether a given content model is mixed or element.

c) XML parsing without DTD.

In (a):

  o Whitespace is ignored in element content.
  o All whitespace (including CR and LF) is preserved in mixed content.
  o After application conventions are applied, data should be identical to (b) and (c).

In (b), parsing should be the same as (a); the data should be identical to (a) *before* application conventions are applied to either one.

In (c):

  o Assume mixed content for all elements.
  o All whitespace is preserved.
  o After application conventions are applied, data should be identical to (a) and (b).
      - element-content space is all leading or trailing in that element, or
      - is between elements and is normalized away.
      - Most occurrences of element content involve line-breaks for the children when styled, anyway, but that should not make a difference.
      - Potentially damaging whitespace (like the table-row problem in Navigator) is eliminated when formatting; whitespace that doesn't make sense in the context of a certain flow object is ignored.

Comments, please? If you disagree with (1), please concentrate on that; there's no point in arguing with the other points if that doesn't fly. If you agree with (1), does the rest of this make sense? Is it workable?

-Chris
--
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
  "<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
   <USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>
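
For illustration only, the application conventions in (5.1) through (5.3) might be sketched in Python roughly as follows. None of these function names come from the proposal; they are hypothetical, and the sketch assumes "whitespace" means space, tab, and code points 10 and 13. Parsed content is modelled very loosely as a list in which strings are data and anything else is a child element.

```python
import re

# Assumption: whitespace is space, tab, LF (10) and CR (13).
WS_CHARS = " \t\r\n"
WS_RUN = re.compile(r"[ \t\r\n]+")


def verbatim(text):
    """5.1: verbatim-styled content keeps all whitespace untouched."""
    return text


def normalize(text):
    """5.2: strip leading/trailing whitespace, collapse inner runs to one space."""
    return WS_RUN.sub(" ", text.strip(WS_CHARS))


def structural(text):
    """5.3: data between structural children is dropped if it is only
    whitespace; anything else is reported as an error."""
    if text.strip(WS_CHARS):
        raise ValueError("non-whitespace data where only elements are meaningful")
    return ""


def format_container(children):
    """Apply 5.3 to the parsed content of something like <tbody>.

    Strings are treated as data, everything else as a child element
    (a hypothetical representation chosen only for this sketch).
    """
    kept = []
    for child in children:
        if isinstance(child, str):
            structural(child)  # raises on non-whitespace data
        else:
            kept.append(child)
    return kept
```

Under these assumptions, a DTD-less parse (c) that preserved the line breaks between table rows would hand `format_container` something like `["\n", row1, "\r\n  ", row2, "\n"]` and get back just the two rows, which is what parse (a) would have delivered after ignoring whitespace in element content.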
Received on Wednesday, 2 October 1996 18:32:49 UTC