Re: Element content the real issue?... from Paul Prescod on 1996-09-30 (w3c-sgml-wg@w3.org from September 1996)

From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Date: Mon, 30 Sep 1996 16:00:51 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <1.5.4.32.19960930200051.00d6d164@csclub.uwaterloo.ca>

At 11:57 AM 9/30/96 -0700, Joe English wrote:
>> According to the proposal:
>What about (where '@' denotes an RE and '.' denotes a space):
>
>    <aaa>@
>    ...<bbb>XXX</bbb>@
>    ...<ccc>XXX</ccc>@
>    </aaa>@
>
>The RE and space characters preceding the CCC element
>are not deemed insignificant by rule 1, whereas an
>SGML parser would treat them as such if AAA had
>element content.

Good point! I was reading the first rule wrong.

What about a third rule:

1. All white space, including RS and RE, immediately following start tags and
   immediately preceding end tags is not significant.

2. All other RS/REs are collapsed to a single space.

3. All quasi-elements containing only whitespace characters are not significant.

I know that the wording is very shaky, but the point is to get rid of the
space in this:

<P>Some <EM>emph text</EM>  </P>

but not in this:

<P>Some <EM>emph</EM> text</P>

The latter has a "quasi-element" containing only whitespace characters. Just
as with the first two rules, you can either apply it a) directly in an XML
parser, b) in a preprocessor for SGML (a quick Perl hack) or c) after
receiving the results of an SGML parser. In a sense, I've reintroduced the
concept of mixed content, but my new mixed content means: "content where
whitespace characters are mixed with other data characters". You don't have
to look at the DTD to figure that out.

If you wanted a quasi-element containing only whitespace characters, you
could use the escaping mechanisms, or the verbatim delimiter that I
suggested a few messages back.

 Paul Prescod

Received on Monday, 30 September 1996 16:06:02 UTC