Re: RS/RE, again (sorry)

From: David G. Durand <dgd@cs.bu.edu>
Date: Thu, 12 Dec 1996 12:51:20 -0800
Message-Id: <v02130503aed5a14be812@[]>
To: w3c-sgml-wg@w3.org

At 12:01 PM 12/11/96, Tim Bray wrote:
I'll try to be brief, as I've said enough on this previously, but since my
position is maybe slightly different this time around, I'll comment. I will
also refer to some other responses to this note.

>Some feel the mechanism is unaesthetic and prone to misinterpretation
>as it stands, and should simply be discarded, with all non-markup
>bytes being passed to the application to do with as it will.  This
>can be made 8879-compliant in the short term via a mechanism proposed by
>Charles Goldfarb, and in the medium/long term with a TC via WG8.

As noted, documents parsed w/ and w/out DTDs will be different in this
approach. Paul P. and Eve Maler have claimed that ignoring whitespace in
element content will be a great hardship -- but I've not seen the proof of
this yet. Validating parsers could ignore those spaces for applications.
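
A minimal sketch of what a DTD-free parse hands to the application,
assuming Python's xml.etree.ElementTree as a stand-in for a non-validating
processor (the helper name character_data is made up for illustration).
With no DTD, the spaces between the items arrive as ordinary character
data, because nothing tells the parser that <list> has element content:

```python
import xml.etree.ElementTree as ET

# Sample document: spaces appear between the tags inside <list>.
DOC = "<list> <item>..</item>  <item>..</item> </list>"

def character_data(elem):
    """Yield each chunk of character data under elem, in document order."""
    if elem.text is not None:
        yield elem.text
    for child in elem:
        yield from character_data(child)
        # Text following a child's end tag belongs to the parent's content.
        if child.tail is not None:
            yield child.tail

chunks = list(character_data(ET.fromstring(DOC)))
print(chunks)  # [' ', '..', '  ', '..', ' ']
```

Three of the five chunks are whitespace only; a parser that had read a DTD
could recognize them as sitting in element content and withhold them.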

>There are some problems with both the current and revised approaches:
>o -XML-SPACE, although this is not documented, really only deals with
>  mixed content; many feel it's important to ignore white space in
>  element content; <list> <item>..</item>  <item>..</item> </list>
>                                         ^^
>                                     e.g. the above
>  but XML, when there's no DTD, doesn't know where element content is and
>  *cannot* be made to do this.

If this is the case, the only good justification for this hack has died, I
>o SGML's world-view tripartitions the set of characters:
>  those that are text, those that are markup, and those that are
>  insignificant white space.  Can XML really afford to discard this
>  distinction?

I think it must. See below for one reason why. The other is even simpler.
This tripartite division confuses almost everybody. If we have trouble
keeping it straight without dedicating significant effort, how could we
propagate it further to a more naive, less-motivated public?

>o Many real-world editors, largely to deal with the fact that
>  text (whether or not we like it) is stored in files in what amounts
>  to a series of records, freely insert line breaks and other white
>  space because they know SGML processors will ignore it.  Can we
>  afford to make that white space significant?

Since those editors will already require a minor facelift to work with XML
anyway, removing a few "\n"s in the code is likely to be easy. I think this
issue is less important than some make it. Yes, we might have to change
software, but no, it is not a hard change, even added to the few others we
have already admitted.

>o Some applications, e.g. full-text indexers, really need to know where
>  everything is by byte offset, whether or not the bytes are significant;
>  thus the -XML-SPACE="COLLAPSE" behavior means they can't read the text
>  with an XML processor (unless they can turn off -XML-SPACE processing
>  through the API)

This explanation of byte-offset requirements is a bit confusing: we could
always add an interface to parsers to give the current byte-offset within
the underlying entity. The real problem is that in many linking
applications we would like to address characters within an element, and
here the distinction becomes more problematic, as the user's view of the
element differs from the system's view by some number of "insignificant"
characters. That's why "insignificance" should die.
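
To make the addressing mismatch concrete, here is a small sketch under
stated assumptions: the collapsing rule (any run of space, tab, CR or LF
becomes one space) and the name collapse_with_map are illustrative only,
not the actual -XML-SPACE rules. It shows the same word landing at
different character offsets in the two views of one element's content:

```python
def collapse_with_map(raw):
    """Collapse whitespace runs to one space (assumed rule) and record,
    for each character kept, its index in the raw text."""
    out = []
    offsets = []  # offsets[i] = index in raw of collapsed character i
    in_space = False
    for i, ch in enumerate(raw):
        if ch in " \t\r\n":
            if not in_space:
                out.append(" ")
                offsets.append(i)
            in_space = True
        else:
            out.append(ch)
            offsets.append(i)
            in_space = False
    return "".join(out), offsets

# System's view: every character stored in the element, line break included.
raw = "Call me\n    Ishmael."
# User's view: what remains once "insignificant" characters are removed.
collapsed, offsets = collapse_with_map(raw)

user_offset = collapsed.index("Ishmael")
system_offset = raw.index("Ishmael")
print(user_offset, system_offset)  # 8 12
```

A link that says "character 8 of this element" points at different text
depending on which view the processor uses, unless it also carries a
mapping like offsets around.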

>So there are a few things we could do, which are not entirely
>mutually exclusive.
>1. Go to the RE Delenda Est model.  This has the advantage that it's
>   trivially easy to explain, document, and implement.  It has some of the
>   disadvantages listed above; there is some very strong sentiment on the
>   ERB against this - look for a follow-up from other ERB-folk.

I doubt the pretty-printing contingent will accept this (although CRLF
before the end ">" of tags actually allows almost-pretty printing). I still
like this, but given its overall unpopularity, don't expect it to live.
(Just too simple to survive, I guess).
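
The almost-pretty-printing trick can be sketched as follows, again
assuming Python's xml.etree.ElementTree as the processor and a made-up
helper name. Each line break sits inside a tag, just before its closing
">", so it is markup rather than character data and nothing needs to be
ignored:

```python
import xml.etree.ElementTree as ET

# Every line break falls inside a tag (before its ">"), never between tags.
PRETTY = "<list\n><item\n>..</item\n><item\n>..</item\n></list>"

def character_data(elem):
    """Yield each chunk of character data under elem, in document order."""
    if elem.text is not None:
        yield elem.text
    for child in elem:
        yield from character_data(child)
        if child.tail is not None:
            yield child.tail

chunks = list(character_data(ET.fromstring(PRETTY)))
print(chunks)  # ['..', '..']
```

The file still breaks into short lines on disk, yet the application sees
only the two item texts.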

>3. Add language to the spec allowing the application to force the
>   processor to pass through all the bytes regardless of the -XML-SPACE
>   setting.

4. Pass all whitespace through in DTD-free parsing mode. For the
pretty-printing contingent, we make validation parsing mode different: it
will strip all whitespace in element content. I guess this is "RE delenda
est" with a failover.

In addition, optionally allow 3: validating parsers may be made to pass all
space (and comments) to applications (like indexers) that require it. We
should add a note that processors that depend on such space for any form of
human-presentation or formatting violate the intent of the standard.
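
Options 4 and 3 together might look roughly like this. It is a sketch
only: the function deliver and its parameters are hypothetical, the
element_content set stands in for what a validating parser would learn
from a DTD (no DTD is actually read), and Python's
xml.etree.ElementTree does the underlying parse.

```python
import xml.etree.ElementTree as ET

DOC = "<list> <item>..</item>  <item>..</item> </list>"

def deliver(doc, element_content=None, pass_all=False):
    """Return the character-data chunks an application would receive.

    element_content: names of elements declared with element content,
        or None for DTD-free parsing (option 4: pass everything).
    pass_all: the option 3 override, forcing all space through.
    """
    strip = element_content is not None and not pass_all
    out = []

    def walk(elem):
        drop = strip and elem.tag in element_content

        def emit(chunk):
            if chunk is None:
                return
            # Only whitespace-only chunks in element content are withheld.
            if drop and not chunk.strip():
                return
            out.append(chunk)

        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)  # tail text belongs to elem, not to child

    walk(ET.fromstring(doc))
    return out

dtd_free = deliver(DOC)                            # option 4, no DTD
validating = deliver(DOC, {"list"})                # option 4, with DTD
indexer = deliver(DOC, {"list"}, pass_all=True)    # option 3 override
```

The DTD-free and override calls return the same chunks, so an indexer
sees identical text either way; only the plain validating call differs.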

   -- David

I am not a number. I am an undefined character.
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Thursday, 12 December 1996 12:51:54 EST