Re: [oh no, Mr. Bill! not] B.3 Record-end handling? from Tim Bray on 1996-10-16 (w3c-sgml-wg@w3.org from October 1996)

From: Tim Bray <tbray@textuality.com>
Date: Tue, 15 Oct 1996 21:16:50 -0700
To: w3c-sgml-wg@w3.org
Message-Id: <3.0b33.32.19961015211643.00b3c11c@pop.intergate.bc.ca>

I can't believe I'm addressing this again.  But in a lengthy discussion
at the Vancouver SGML ERB/WG caucus this afternoon, Peter Sharpe dreamed
up the following and won't have time to post it, so I agreed to post it
for him.  So it's all his fault.  It smells to me like it might work.

1. If you have a DTD and you know where element content is, you lose
   all white space in element content.
2. DTD or none, if a line contains some markup, and aside from that 
   only white space, you lose said white space and the trailing record
   boundary character(s).  "Markup" meaning tags, comments, and PIs.
3. Every byte in mixed or PCDATA content that is not lost in this fashion
   and is not markup is passed to the application.

"Lose" means "don't pass to the application".  This has the virtues that

 - it can be explained *very* briefly
 - the behavior is an awful lot like what ordinary people think
   that 8879 is trying to do
 - it's easy to build
 - it allows users to put all sorts of gratuitous white space in their
   data without getting in the way
 - it doesn't use the [inaccurate and counter-intuitive to most 
   programmers] terms "RS" and "RE"
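
A minimal sketch of rules 2 and 3 in Python, to show roughly how little
code the DTD-less case might take.  It is an illustration only, not part
of the proposal: the name lose_white_space and the MARKUP pattern are
made up for this example, the pattern is a much-simplified notion of
"tags, comments, and PIs", and markup that spans more than one line is
assumed not to occur.  Rule 1 needs a DTD and is not covered.

```python
import re

# Assumed, simplified notion of "markup": a comment, a PI, or a tag
# that starts and ends on the same line.
MARKUP = re.compile(r"<!--.*?-->|<\?.*?\?>|</?[A-Za-z][^<>]*>")

def lose_white_space(text):
    """Apply rule 2 line by line; rule 3 passes on everything else."""
    out = []
    # keepends=True keeps each line's record boundary character(s).
    for line in text.splitlines(keepends=True):
        markup = MARKUP.findall(line)
        rest = MARKUP.sub("", line)
        if markup and rest.strip() == "":
            # Rule 2: the line holds markup and otherwise only white
            # space, so keep the markup and lose the white space and
            # the trailing record boundary.
            out.append("".join(markup))
        else:
            # Rule 3: anything not lost above is passed on unchanged.
            out.append(line)
    return "".join(out)

sample = "<doc>\n  <p>Hello,\n  world</p>\n  <!-- note -->\n</doc>\n"
print(repr(lose_white_space(sample)))
# prints '<doc>  <p>Hello,\n  world</p>\n<!-- note --></doc>'
```

Note that the line break inside the paragraph survives, since those lines
contain data as well as markup, while the lines holding only a tag or a
comment vanish apart from the markup itself.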

On the downside, I suspect that this will eat some white space between
tags and REs, and around comments and PIs, and maybe a few REs after
comments and PIs, that a real SGML parser would pass on.  But (a) few will
ever notice, and (b) those that do will be surprised at the SGML behavior.

It may not be perfect.  But it does provide an example of the maximum level 
of complexity in this area that I for one am willing to tolerate in XML.


Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167

Received on Wednesday, 16 October 1996 00:17:14 UTC