
RE/RS Options: Trying to Focus



I'm trying to make some sense of the RE/RS issue and not having much luck.
What follows is an attempt to state as tersely and precisely as possible
what the issues are and what the alternatives are.

As James put it so well, the problem is mixed content: if you allow mixed
content, then you must provide some mechanism for distinguishing data
record ends from non-data record ends. If you don't allow mixed content,
then there is no problem, but then you have the problem of explicitly
delimiting character data content, which currently neither SGML nor HTML
requires.  Note too that it's not just an RE/RS problem, but an SSEP problem.

The proposals, as I understand them, are:

A. Disallow mixed content. Furthermore, disallow PIs and markup
   declarations in character data (with the possible exception of CDATA
   and RCDATA marked sections).  This requires an element type whose only
   semantic is to contain character data.  Short references can be used to
   enable using a single character to quote character data.

   This solves the problem by making it clear when record ends are in 
   character data context.  Record end handling rules are not changed
   in any way.

   Assumes that inclusions are not allowed (at least within the character
   data containing element), thus avoiding the "record ends following
   included subelements are not taken as data" rule.

B. Treat XML documents as a single record by mapping RS and RE to character
   codes that cannot occur in documents.  There are *no* record ends. This
   has the disadvantage that some other mechanism must be used to indicate
   data record ends, one that must be understood and processed by
   rendering systems, which makes it likely that different tools will
   produce different results for the same input data.  It also
   has the problem that many SGML tools do not support this kind of 
   remapping, making it difficult or impossible to process XML documents
   as SGML.  It would also require transformation of SGML documents
   before they could be processed accurately as XML documents.

   This also doesn't solve the SSEP problem generally.

C. Treat all record ends as data. This requires authors to do things
   like put record ends before tag closes in order to format their
   markup on multiple lines.  It also means that SGML documents can't be
   made into XML documents simply by quoting character data; all SSEP
   must also be moved inside of markup.  Talk about making 5-line Perl
   hacking harder.
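
To make option C concrete, here is a minimal Python sketch. It is not a
real SGML or XML parser: the regular expression, the data_chunks helper,
and the sample element names are all assumptions made up for
illustration. It shows why line breaks used for layout would have to
move inside the tags once every record end counts as data.

```python
import re

# Hypothetical, deliberately naive tag pattern: anything from "<" to the
# next ">". Newlines inside a tag are part of the tag, not data.
TAG = re.compile(r"<[^>]*>")

def data_chunks(doc):
    """Return every non-empty run of text between tags.

    Under option C nothing is discarded, so a newline between two tags
    is reported as data just like any other character.
    """
    return [chunk for chunk in TAG.split(doc) if chunk]

# Layout newlines placed between tags: each one becomes data.
pretty = "<list>\n<item>one</item>\n<item>two</item>\n</list>"

# Layout newlines placed before the tag close: none of them is data.
in_tags = "<list\n><item\n>one</item\n><item\n>two</item\n></list>"

print(data_chunks(pretty))
print(data_chunks(in_tags))
```

The first call reports three newline-only chunks alongside the two real
data strings; the second reports only the data strings.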

Note that if we want to allow DTD-less parsing, we can't use the SGML rules
as-is and keep mixed content because without the DTD you have no way of
knowing when you're in element content and when you're in mixed content
(for a dramatic example of this problem, create an SGML document with lots
of SSEP in element context then view it with Panorama with and without the
DTD).
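
The with-and-without-DTD difference can be sketched in a few lines of
Python. This is a rough approximation, not the actual SGML record-end
rules: the visible_data function, the element_content parameter (a
stand-in for what a DTD would declare), and the sample element names are
all hypothetical, and well-formed input is assumed.

```python
import re

# Hypothetical tokenizer: a start or end tag, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>|([^<]+)")

def visible_data(doc, element_content=frozenset()):
    """Collect character data from doc.

    Whitespace-only runs are dropped when the enclosing element is named
    in element_content. With no DTD information (the default empty set)
    nothing can be dropped, because any element might be mixed content.
    """
    stack, out = [], []
    for closing, name, text in TOKEN.findall(doc):
        if text:
            parent = stack[-1] if stack else None
            if text.strip() or parent not in element_content:
                out.append(text)
        elif closing:
            stack.pop()          # end tag: leave the current element
        else:
            stack.append(name)   # start tag: enter a new element

    return out

doc = "<list>\n<item>one</item>\n<item>two</item>\n</list>"

with_dtd = visible_data(doc, element_content={"list"})
without_dtd = visible_data(doc)

print(with_dtd)
print(without_dtd)
```

The same input yields two data strings when "list" is known to have
element content and five when it is not, which is the ambiguity a
DTD-less parser cannot resolve on its own.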

This also means that there's no way to define a "simpler" set of RE/RS
rules and keep mixed content, because you'll have the same problem.

My conclusion is that eliminating mixed content by quoting character data
is the simplest solution overall and retains the most compatibility with
SGML as is.  While quoting may seem unnatural to those of us who grew up
typing SGML markup (it was to me when I put together some examples), I
don't think it will be hard for newcomers to learn and it should be easy
for SGML editors to add the quotes as an export option.

Cheers,

E.


--
W. Eliot Kimber (kimber@passage.com) 
Senior SGML Consultant and HyTime Specialist
Passage Systems, Inc., (512)339-1400
10596 N. Tantau Ave., Cupertino, CA 95014-3535
(408) 366-0300, (408) 366-0320 (fax)
2608 Pinewood Terrace, Austin, TX 78757 (512) 339-1400 (fone/fax)
http://www.passage.com (work) http://www.drmacro.com (home)
"If I never had existed, would you still remember me?..."
                                   --Austin Lounge Lizards, "1984 Blues"