Re: Fun with ignorable whitespace definition. (fwd)

jose.kahan@w3.org
Mon, 15 Dec 1997 15:08:30 +0100 (MET)


Message-Id: <199712151408.PAA06175@tuvalu.inrialpes.fr>
To: www-html@w3.org
Date: Mon, 15 Dec 1997 15:08:30 +0100 (MET)
From: jose.kahan@w3.org
Subject: Re: Fun with ignorable whitespace definition. (fwd)

In our previous episode, Peter Flynn said:
From www-html-request@w3.org Mon Dec 15 14:57 MET 1997
Date: Mon, 15 Dec 1997 08:56:22 -0500 (EST)
X-Envelope-From: www-html-request@www10.w3.org  Mon Dec 15 08:55:54 1997
Old-Date: 15 Dec 1997 13:56:19 +0000 (GMT)
From: Peter Flynn <pflynn@imbolc.ucc.ie>
In-reply-to: <Pine.SOL.3.95.971207192903.26097C-100000@sally> (message from
 Alexandre Rafalovitch on 07 Dec 1997 19:40:32 +1100 (EST))
To: arafalov@socs.uts.EDU.AU
Cc: www-html@w3.org
Message-id: <199712151356.NAA12179@imbolc.ucc.ie>

   Consider how the following html will be parsed.

   <html>
   <title>Title text</title>

   <meta foo=bar>
   Now we have the body.
   </html>

   In here, we have some whitespace after </title>. It is not quite
   ignorable, so it should be treated as text. Therefore, it should close the
   head element and start the body element. As a result, 'meta' would go into
   the body and not the head, and will be ignored.

Not quite. The spec is very clear on this: you are missing <HEAD>, so
assuming your DTD allows this, the presence of <TITLE> means you are
inside the <HEAD> at that stage. Now <HEAD> is allowed to contain ONLY
element content, never any character data (the only allowable
character data is inside <TITLE>, <SCRIPT> or <STYLE>). The
white-space between </TITLE> and <META> is therefore still within the
<HEAD> and MUST therefore be discarded by the parser as insignificant.
This is not an option, it is compulsory.

The character data after <META> implies the end of <HEAD> and the
start of <BODY>, which has mixed content in most HTML DTDs, so any
white space between there and </HTML> is _significant_ and must be
retained.

The result of parsing your example would therefore be (normalized,
setting aside the invalidity of foo=bar, and making some rather large
assumptions about the DTD):

   <html><head><title>Title text</title><meta foo=bar></head><body>
   Now we have the body.
   </body></html>
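
For anyone who wants to experiment, here is a minimal Python sketch of
the inference described above. It is not a real SGML parser: the
HEAD_ELEMENTS and TEXT_HOLDERS sets and the function name
infer_head_and_body are my own assumptions standing in for what a DTD
would supply. It only handles documents where <HEAD> and <BODY> are
omitted, and it does not validate attributes, so foo=bar passes through
untouched.

```python
import re

# Assumed subset of elements that belong in HEAD; a real parser would take
# this from the DTD rather than from a hard-coded set.
HEAD_ELEMENTS = {"title", "meta", "link", "base", "style", "script", "isindex"}
# HEAD children whose character data is kept.
TEXT_HOLDERS = {"title", "script", "style"}
# A token is either one tag or one run of character data.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def infer_head_and_body(source):
    """Insert the omitted HEAD and BODY tags, dropping white-space in HEAD."""
    out = []
    state = "prolog"      # prolog -> head -> body
    text_holder = None    # name of the open TITLE/SCRIPT/STYLE, if any
    for tok in TOKEN.findall(source):
        is_tag = tok.startswith("<")
        is_end = tok.startswith("</")
        name = tok.strip("</>").split()[0].lower() if is_tag else None
        if text_holder is not None:
            # Inside TITLE/SCRIPT/STYLE everything is kept as-is.
            out.append(tok)
            if is_end and name == text_holder:
                text_holder = None
            continue
        if name == "html":
            # Close whatever is still open before </html>.
            if is_end and state == "head":
                out.append("</head>")
            if is_end and state == "body":
                out.append("</body>")
            out.append(tok)
            continue
        if state != "body":
            if not is_tag and tok.strip() == "":
                continue  # white-space in element content: discarded
            if name in HEAD_ELEMENTS:
                if state == "prolog":
                    out.append("<head>")  # implied start of HEAD
                    state = "head"
                out.append(tok)
                if name in TEXT_HOLDERS and not is_end:
                    text_holder = name
                continue
            # Anything else implies the end of HEAD and the start of BODY.
            if state == "head":
                out.append("</head>")
            out.append("<body>")
            state = "body"
        out.append(tok)  # BODY is mixed content: keep everything
    return "".join(out)


example = (
    "<html>\n"
    "<title>Title text</title>\n"
    "\n"
    "<meta foo=bar>\n"
    "Now we have the body.\n"
    "</html>"
)
print(infer_head_and_body(example))
```

Run on the example from the original message, this should print the
same normalized form as shown above, with the line-breaks around the
body text retained.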

I strongly advise you to use real HTML and not your imagination,
otherwise when XML becomes usable you will be left with a load of
untranslatable pseudo-HTML.

   That presents IMHO a problem, since the meaning was clearly to ignore all
   whitespace in the head, but having optional end of head and optional start
   for body messes it up.

No, the HTML spec is real SGML and parses correctly unless you mess it
up.

   On the other hand, a parser cannot just ignore that whitespace, as it does
   not know (in a generic html parsing world) whether the content of html (or
   head) can be displayed, and CSS might declare it to have non-collapsible
   whitespace (like in PRE).

   It looks to me that either this requires a heavy special case or the html4
   draft is missing a section on how whitespace is treated in non-displayable
   elements with optional start/end tags... :=}

   I hope I am missing something, because it sure got me thinking. :-}

The rules on white-space in SGML are tricky, but basically:

   1. in element content (ie places where only more markup is allowed,
      never any character data), all white-space must be removed.

   2. in mixed content (ie places where intermingled markup and
      character data are allowed), white-space is preserved because it
      is a part of the character data.

   3. line-breaks are also character data in mixed content.
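
The three rules above can be sketched in a few lines of Python. This is
only an illustration of the simplified rules as I have stated them, not
of the full SGML record-end handling. The tree representation (a tuple
of element name and child list) and the ELEMENT_CONTENT set are
assumptions of mine; in practice the content model comes from the DTD.

```python
# Assumed subset of elements declared with element content (markup only).
# Every other element is treated here as having mixed content.
ELEMENT_CONTENT = {"html", "head", "ul", "ol", "table", "tr"}


def strip_insignificant(node):
    """Return a copy of node with insignificant white-space removed.

    A node is (name, children); each child is a str or another node.
    """
    name, children = node
    kept = []
    for child in children:
        if isinstance(child, str):
            # Rule 1: white-space-only data in element content is removed.
            if name in ELEMENT_CONTENT and child.strip() == "":
                continue
            # Rules 2 and 3: in mixed content all character data is kept,
            # line-breaks included.
            kept.append(child)
        else:
            kept.append(strip_insignificant(child))
    return (name, kept)


doc = ("html", [
    "\n",
    ("head", ["\n", ("title", ["Title text"]), "\n\n", ("meta", []), "\n"]),
    ("body", ["\nNow we have the body.\n"]),
    "\n",
])
print(strip_insignificant(doc))
```

Non-white-space character data found in element content is left alone
by this sketch; a real parser would treat it either as an error or as
the trigger for an implied start-tag.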

I know it's moot while HTML-only browsers continue to ignore SGML, but
XML has much simpler rules and from the look of last week's
SGML/XML'97 Conference, it's going to arrive quickly (I already have a
couple of very neat beta XML editors and there are at least a dozen
more on the way, and even some stylesheet editors for XSL).

If you currently create HTML, I do recommend that you start to shift
NOW to creating only valid, parsable HTML, so that if/when you want to
move into XML, you can translate your files automatically. Otherwise
you are going to have an appalling manual job to do (you may already
be facing one if your existing HTML is currently invalid).

///Peter