Re: RS/RE: Yet Another Proposal

[Paul Prescod]
> Okay, an SGML application knows that the newlines are insignificant
> because of the DTD. Now what about a formatter? It will, by default,
> leave the newlines in (because the central tenet of this proposal is
> that newlines are significant). So how do you turn them off?

The newlines are really there, as "true data".  They are simply not
*displayed*, because an application convention dictates whitespace
normalization after parsing (e.g., on the DSSSL grove).

> Okay, maybe with a stylesheet. Now your document looks okay in an
> SGML editor and in Netscape (presuming that SGML editors handle the
> RS/RE remapping hack). Now you want to convert it to RTF. Okay,
> maybe your conversion program has a stylesheet language that allows
> you to strip out newlines.

RTF is a presentation format.  To map the XML into RTF, a stylesheet
must be invoked; the stylesheet engine for XML normalizes the
whitespace.

> Now you want to put it in an "XML database". Each element will be
> stored individually in the database, for later retrieval alone. How
> does the database determine which newlines go in the database, and
> which are "formatting"? I guess you need some other "style sheet
> thing" (or a DTD).

The newlines go into the database.  All of them.  They are part of the
data.

> The problem is not immediate/practical. It is
> long-term/theoretical. I think that long-term degradation of your
> data will occur if you depend on "application conventions" (like
> "table smarts") to determine what is the real information and what
> is formatting. Therefore, the only safe way to encode this
> bibliography in the proposed markup language is with no
> insignificant newlines.

I probably should have phrased the normalization rules better.

o All newlines are data, except those known to be in element content
  (by virtue of SEPCHAR).
o When formatting XML (for display, transformation into RTF, or
  printing):
  - In verbatim-styled blocks, preserve all whitespace.
  - In non-verbatim content blocks, eliminate leading and trailing
    whitespace, and normalize internal whitespace.
  - Whitespace between blocks is ignored.  ("Block" meaning a
    paragraph flow object, table flow object, table part (row, entry,
    etc.), figure flow object, or other object for which intermediate
    spacing is a stylesheet function, not a content function.)
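
To make the formatting rules above concrete, here is a rough sketch in
Python.  Everything in it is an assumption for illustration: the names
normalize_block and format_blocks are made up, "normalize internal
whitespace" is read as "collapse each run to a single space", and the
input is assumed to be already split into blocks and the whitespace
between them.

```python
import re

def normalize_block(text: str, verbatim: bool = False) -> str:
    """Apply the formatting-time whitespace rule to one block's data."""
    if verbatim:
        # Verbatim-styled block: preserve all whitespace.
        return text
    # Non-verbatim block: drop leading and trailing whitespace, then
    # collapse each internal run of whitespace to one space (assumed
    # meaning of "normalize").
    return re.sub(r"\s+", " ", text.strip())

def format_blocks(items):
    """items: list of (kind, text); kind is 'block', 'verbatim', or
    'between' (whitespace found between two blocks)."""
    out = []
    for kind, text in items:
        if kind == "between":
            # Whitespace between blocks is ignored when formatting.
            continue
        out.append(normalize_block(text, verbatim=(kind == "verbatim")))
    return out
```

Note that this only describes what a formatter does on output; the
parsed data still contains every newline.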

> In this case, the author can either have convenient editing or
> unambiguous true content. If we go with this proposal, we should be
> clear on that and encourage users to sacrifice convenience in favour
> of rigour.

The problem is that true content, without one hack or another, is
different between an SGML parse and an XML parse.  Quoting is going to
make XML unusable, IMO.  By making *all* newlines data, handling is
unambiguous.  An SGML (or XML with DTD) parse will not be ESIS-
identical to an XML parse without DTD, but after application
conventions are applied, the result will be identical.  Isn't that
what matters?
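
As a toy illustration of that last point (not the output of any real
parser): the two event lists below are hand-written guesses at what a
DTD-less parse and a DTD-aware parse might report for the same small
document, and apply_conventions is a hypothetical helper that treats
all data as non-verbatim.

```python
import re

# Hand-written, simplified event streams for:
#   <list>\n<item>\nred\napple\n</item>\n</list>
# Without a DTD, every newline is reported as data.
without_dtd = [
    ("start", "list"), ("data", "\n"),
    ("start", "item"), ("data", "\nred\napple\n"), ("end", "item"),
    ("data", "\n"), ("end", "list"),
]
# With a DTD, newlines in element content (and, in this sketch, those
# adjacent to the tags) are not reported.
with_dtd = [
    ("start", "list"),
    ("start", "item"), ("data", "red\napple"), ("end", "item"),
    ("end", "list"),
]

def apply_conventions(events):
    """Normalize data whitespace and drop data that becomes empty."""
    out = []
    for kind, value in events:
        if kind == "data":
            value = re.sub(r"\s+", " ", value.strip())
            if not value:
                continue  # whitespace-only data between blocks
        out.append((kind, value))
    return out

# The raw streams differ, but the normalized results agree.
assert without_dtd != with_dtd
assert apply_conventions(without_dtd) == apply_conventions(with_dtd)
```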

> I could accept this, but would rather go the opposite way, like most
> markup languages (I think) and make newlines and tabs insignificant
> unless you declare them to be significant (in some form of verbatim
> section). In most SGML documents, this is what authors intend most
> of the time.

True.  But I see the alternatives this way:

1) Implement a shortref-based hack that won't work in most current
   SGML systems and complicates the markup, for a reason that won't be
   explainable to most users or implementers.
2) Define a simple application convention that won't work in most
   current SGML systems, simplifies markup, and is easy to explain.

> Unlike most markup languages, however, I would proclaim that space
> characters are significant outside of markup (as Liam Quin said to
> me: "I kinda need the spaces between words." =) ). Certain kinds of
> formatting would have to be done with tabs and comments instead of
> spaces, and authors would have to be careful to put a space at the
> end of each line if they don't want their words concatenated. Most
> editors do this for you automatically. On both Windows and the Mac,
> the standard text editor widgets Do the Right Thing.

They do?  I've never had Notepad, Write, WordPad, SimpleText, or
ClarisWorks insert spaces at the end of lines for me.  I think I
would be upset if they did when I wasn't editing XML.  And I don't see
them implementing an XML mode any time soon.

-Chris
-- 
<!NOTATION SGML.Geek PUBLIC "-//GCA//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//EBT//NONSGML Christopher R. Maden//EN" SYSTEM
"<URL>http://www.ebt.com <TEL>+1.401.421.9550 <FAX>+1.401.521.2030
<USMAIL>One Richmond Square, Providence, RI 02906 USA" NDATA SGML.Geek>

Received on Thursday, 3 October 1996 10:16:01 UTC