Attempt at listing requirements, constraints and proposal fragments

From: David G. Durand <dgd@cs.bu.edu>
Date: Sun, 15 Dec 1996 14:11:39 -0500
Message-Id: <v02130502aed9f7031d46@[]>
To: w3c-sgml-wg@w3.org
Summary: I think we are arguing at cross purposes, because we are not being
explicit enough about requirements and strategies. This message attempts to
summarize the key points, and does not advocate (or intend to slant towards)
any conclusion.

One of the reasons that Jean's posting has created confusion is that he is
introducing a new requirement: ease of whitespace insertion for editors to
break long "lines". I'm just going to try listing the requirements as I
think they are: if we can get a list that we can agree on, maybe we can
grease ourselves up and get detached from this tar-baby.

   1. Almost everyone desires that the parse tree be identical (or as
similar as possible) when parsing with and without a DTD.

   2. Some people believe very strongly that ignoring whitespace in
element content is important because:
      a) current SGML editors indent in element content mode.
      b) human readers of documents find whitespace around tags useful for
      c) stylesheets and applications should not be made responsible for
removing whitespace, as they will then not handle things consistently.

   3. Jean Paoli has identified a need for future editing software to
insert linebreaks as desired (This might, in some cases, even occur in mixed
content).

   4. Most people seem to agree that mixed content should preserve almost
all whitespace. There is now even an SGML-compatible way to preserve initial
RE in mixed content.

Some factors constrain solutions:

   A. SGML, as defined, ignores whitespace in element content.

   B. XML, when parsing without a DTD, cannot detect element content reliably.

   C. SGML does not normally preserve initial RE in mixed content.
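
Constraints A and B are easier to see with a concrete case. What follows is
only a minimal sketch: it assumes Python's xml.etree.ElementTree as a
stand-in for a parser that reads no DTD, and the sample document is made up
for illustration. The indentation inside <list> (element content) reaches
the application as ordinary character data, indistinguishable from the text
inside <item> (mixed content).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <list> has element content, <item> has mixed content.
doc = "<list>\n  <item>one <em>two</em></item>\n  <item>three</item>\n</list>"

# ElementTree reads no DTD here, so it cannot know <list> is element content.
root = ET.fromstring(doc)

# The indentation inside <list> is delivered as plain character data.
print(repr(root.text))     # whitespace before the first <item>
print(repr(root[0].tail))  # whitespace between the two <item> elements
print(repr(root[1].tail))  # whitespace before </list>

# Text in mixed content arrives through exactly the same channel.
print(repr(root[0].text))
```

A parser that had read a DTD could drop the first three strings under
constraint A; without one, every string looks the same to the application.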

Some possible techniques may address these factors and help reach some of
the goals.

   1. Charles' shortref RE hack can disable C. without affecting anything
else. This would allow SGML-compatible application processing of all
whitespace in _mixed_ content.

   2. Explicit flags of some sort on elements could signal to an
application that it should apply a particular whitespace strategy. Thus, we
could pass whitespace to the application, and make the author responsible
for marking whether or not it is significant.

   3. The application could set these flags automatically, in some cases,
when parsing with a DTD (element content automatically -XML-SPACE=IGNORE,
unless the _author_ requests otherwise, for instance).

   4. We could take the Perl approach and have the parser guess and set any
missing WS flags. For instance, it would set the -XML-SPACE=IGNORE flag
for elements containing only WS and other subelements. It could set the
-XML-SPACE=COLLAPSE flag for elements containing non-space characters and
(optionally) other markup. The drawback would be complicated rules for the
default processing, though these can at least be overridden by explicitly
tagging elements. This also leads to more complex implementations.

   5. We could take a strict RE delenda est approach: needing no flags, but
also only satisfying requirements 1 and 4.

   6. We can put whitespace before TAGC in any of these proposals, which
may satisfy requirement 3. Many feel that this technique utterly fails
to satisfy requirements 2a and 2b, because editors don't work that way
now, and the syntax is ugly, respectively.

[ Ed: Personally I think this technique completely solves requirement 3
without any work by us at all. It is trivial to implement, and also works
in mixed content, providing a better solution to the algorithmic problem.]
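
Techniques 4 and 6 can both be made concrete in a few lines. This is a
rough sketch under stated assumptions, not a proposal: Python's
xml.etree.ElementTree stands in for a DTD-less parser, guess_space_flag is a
hypothetical helper name, and the sample documents are invented. Only the
flag values IGNORE and COLLAPSE come from the list above.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the whitespace characters considered here

def guess_space_flag(elem):
    """Guess a whitespace flag for one element, following technique 4."""
    # Character data directly inside elem: its own text plus each child's tail.
    chunks = [elem.text or ""] + [child.tail or "" for child in elem]
    if any(chunk.strip(XML_WS) for chunk in chunks):
        return "COLLAPSE"  # non-space characters present
    return "IGNORE"        # only whitespace and subelements

# Conventional indentation between tags.
indented = ET.fromstring(
    "<list>\n  <item>one <em>two</em></item>\n  <item>three</item>\n</list>"
)

# Technique 6: the same document with line breaks placed before TAGC (">").
tagc = ET.fromstring(
    "<list\n><item\n>one <em\n>two</em\n></item\n><item\n>three</item\n></list\n>"
)

for name, root in (("indented", indented), ("tagc", tagc)):
    print(name, guess_space_flag(root), guess_space_flag(root[0]))
    # What the application sees before the first <item>.
    print(name, repr(root.text))
```

Both documents get the same guessed flags, but only the indented one hands
whitespace to the application in the first place. Line breaks inside the
tags never become character data, so they need no flag at all.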

   There are probably missing points and misrepresentations, but can we try
to get a list like this finalized before we continue arguing specific
proposals? If you like this idea, and want to make a correction, edit the
list and re-post it, so that we can converge on a single list without too
much pain.

  -- David

I am not a number. I am an undefined character.
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Sunday, 15 December 1996 14:05:07 UTC