RS/RE: Yet Another Proposal

Here's a proposal, constructed with help from Gavin Nicol and David
Durand:

1) All XML entities are single records or part of records, with no
   record boundary characters.

This is the hardest part of the proposal.  I do not believe that it is
incompatible with 8879, though I am open to arguments.  Here's just
about everything 8879 has about records:

   4.252 record: A division of an SGML entity, bounded by a record
   start and a record end character, normally corresponding to an
   input line on a text entry device.
   
   NOTES
   
   1 It is called a "record" rather than a "line" to distinguish it
   from the output lines created by a text formatter.
   
   2 An SGML entity could consist of many records, a single record, or
   text with no record boundary characters at all (which can be
   thought of as being part of a record or without records, depending
   on whether record boundary characters occur elsewhere in the
   document).
   
   4.253 record boundary (character): The record start (RS) or record
   end (RE) character.
   
   4.254 record end: A _function character_ ([54]), assigned by the
   concrete syntax, that represents the end of a record.
   
   4.255 record start: A _function character_ ([54]), assigned by the
   concrete syntax, that represents the start of a record.

It is absolutely clear that record boundaries are something that *can*
occur in entities.  It is clear from note 2 to 4.252 that entities
need not have record boundaries.

It is *not* clear that carriage return or newline sequences need be
turned into record boundary characters by an entity manager.  If the
working group and ERB accept that they need not be, XML becomes easy
to implement, for both SGML-based and non-SGML-based solutions; easy
to generate, either by hand, from an editor, or from a script of some
sort; and easy to understand.

2) A mechanism is defined for identifying this record non-creation to
   SGML systems.

Currently, SGML has no rules (as far as I can see) for identifying
records within system objects.  If that's correct, then a mechanism is
needed to determine what record-creation rules should be used for XML.
I propose using the APPINFO declaration in the SGML declaration.

   APPINFO XML

should suffice.

It's true that no SGML parser currently treats entities as single
records.  However, I (and others) don't think it would be too
difficult to change that.  We could be wrong.

3) The RE and RS characters are assigned to non-occurring code points.

James Clark interpreted the definitions of these characters to be the
signals that the entity manager should use when communicating with the
parser.  Given (1), record boundaries will never occur, but the parser
must not mistake code points 10 and 13 in data for record boundaries.

4) Code points 10 and 13 are defined as SEPCHAR.

This allows them to occur in element content without error.
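
To make (3) and (4) concrete, here is a rough sketch in Python of how
the function-character assignments would change.  The REFERENCE table
follows the reference concrete syntax (RS = 10, RE = 13, SPACE = 32,
TAB = 9 as SEPCHAR).  The PROPOSED table is my reading of this
proposal; modelling "non-occurring code points" as empty sets is an
assumption, as are the table layout and the classify() helper.

```python
# Function-character assignments, as sets of code points per role.
REFERENCE = {"RE": {13}, "RS": {10}, "SPACE": {32}, "SEPCHAR": {9}}

# Assumption: RE and RS get code points that never occur in data,
# modelled here as empty sets; 10 and 13 become SEPCHAR instead.
PROPOSED = {"RE": set(), "RS": set(), "SPACE": {32}, "SEPCHAR": {9, 10, 13}}


def classify(code_point, syntax):
    """Return the function-character role of a code point, or "DATA"."""
    for role, points in syntax.items():
        if code_point in points:
            return role
    return "DATA"


if __name__ == "__main__":
    for cp in (9, 10, 13, 32, 65):
        print(cp, classify(cp, REFERENCE), classify(cp, PROPOSED))
```

Under the proposed table, neither 10 nor 13 can ever be reported as a
record boundary, which is the point of (3).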

5) Whitespace handling is defined as an application convention.
   5.1) In verbatim-styled content elements, all whitespace is
        preserved.
   5.2) In non-verbatim content elements, leading and trailing
        whitespace is stripped.  Intermediate whitespace is normalized
        to a single space before formatting.
   5.3) In certain special elements, whitespace is eliminated.

5.3 need not involve special element names or parsing rules, just a
formatting convention.  The example I'm thinking of is a
CR between table rows.  Clearly, when formatting <tbody> as a set of
table row objects, data in between is not meaningful.  If it's non-
whitespace, issue an error, but if it's whitespace, ignore it.  (See
below on why there might be whitespace there after parsing.)
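
The three conventions in (5) could be sketched roughly as follows.
This is only an illustration: the function name and the style labels
"verbatim", "normal", and "structural" are made up for the example,
and it treats one run of character data at a time, which is a
simplification (in real mixed content, "leading" and "trailing" would
be judged against the whole element, not each text run).

```python
import re

WHITESPACE = " \t\r\n"


def format_text(text, style):
    """Apply one whitespace convention to a run of character data.

    style is a hypothetical label: "verbatim" (5.1), "normal" (5.2),
    or "structural" (5.3).
    """
    if style == "verbatim":
        # 5.1: all whitespace is preserved.
        return text
    if style == "normal":
        # 5.2: strip both ends, collapse inner runs to a single space.
        return re.sub(r"[ \t\r\n]+", " ", text.strip(WHITESPACE))
    if style == "structural":
        # 5.3: whitespace is ignored; any other data is an error.
        if text.strip(WHITESPACE):
            raise ValueError(f"unexpected data: {text!r}")
        return ""
    raise ValueError(f"unknown style: {style!r}")
```

For the table case, the CR between rows would reach the formatter as
a text run inside <tbody>, be handled as "structural", and vanish.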

-=-=-=-

I envision three types of parsing of XML documents.

a) SGML parsing.  This will require a DTD, even if a FRED-generated or
   ANY one, per 8879.

b) XML parsing with DTD.  For the purposes of this discussion, the
   only relevant part of the DTD is whether a given content model is
   mixed or element.

c) XML parsing without DTD.

In (a):

o Whitespace is ignored in element content.
o All whitespace (including CR and LF) is preserved in mixed content.
o After application conventions are applied, data should be identical
  to (b) and (c).

In (b), parsing should be the same as (a); the data should be
identical to (a) *before* application conventions are applied to
either one.

In (c):

o Assume mixed content for all elements.
o All whitespace is preserved.
o After application conventions are applied, data should be identical
  to (a) and (b).
  - element-content space is all leading or trailing in that element,
    or
  - is between elements and is normalized away.
  - Most occurrences of element content involve line-breaks for the
    children when styled, anyway, but that should not make a
    difference.
  - Potentially damaging whitespace (like the table-row problem in
    Navigator) is eliminated when formatting; whitespace that doesn't
    make sense in the context of a certain flow object is ignored.
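
The claim that (a)/(b) and (c) converge once application conventions
are applied can be illustrated with a toy tree model.  Everything in
this sketch is hypothetical: a node is a (name, children) tuple,
children are strings or nodes, and the two functions stand in for a
DTD-aware parse and for the formatter's convention 5.3.  It leaves
out the 5.2 normalization to stay short.

```python
def is_ws(text):
    """True if text consists only of space, tab, CR, or LF."""
    return text.strip(" \t\r\n") == ""


def parse_with_dtd(node, element_content):
    """Model (a)/(b): drop whitespace-only data in element content."""
    name, children = node
    out = []
    for child in children:
        if isinstance(child, str):
            if name in element_content and is_ws(child):
                continue
            out.append(child)
        else:
            out.append(parse_with_dtd(child, element_content))
    return (name, out)


def apply_conventions(node, structural):
    """Model 5.3: in 'structural' elements, ignore whitespace and
    reject any other character data."""
    name, children = node
    out = []
    for child in children:
        if isinstance(child, str):
            if name in structural:
                if not is_ws(child):
                    raise ValueError(f"unexpected data in <{name}>")
                continue
            out.append(child)
        else:
            out.append(apply_conventions(child, structural))
    return (name, out)


# What a DTD-less parse (c) would deliver: all whitespace preserved.
raw = ("tbody", ["\n  ", ("tr", ["row 1"]), "\n  ", ("tr", ["row 2"]), "\n"])

with_dtd = parse_with_dtd(raw, {"tbody"})
no_dtd = raw
```

Here with_dtd and no_dtd differ straight out of the parser, but
apply_conventions(with_dtd, {"tbody"}) and
apply_conventions(no_dtd, {"tbody"}) give the same tree.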

Comments, please?  If you disagree with (1), please concentrate on
that; there's no point in arguing with the other points if that
doesn't fly.  If you agree with (1), does the rest of this make sense?
Is it workable?

-Chris
-- 
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>

Received on Wednesday, 2 October 1996 18:32:49 UTC