Re: Help on "whitespace"

From: P. T. Rourke <ptrourke@mediaone.net>
Date: Fri, 24 Mar 2000 11:45:35 -0600
To: "Martin Brunecky" <mbrunecky@OneRealm.com>, <html-tidy@w3.org>
According to the List-Owner's own specification of HTML 3.2
( http://www.w3.org/TR/REC-html32.html ; sometimes a little easier to read
than the current 4.01 recommendation, especially for students):

"Except within literal text (e.g. the PRE element), HTML treats contiguous
sequences of white space characters as being equivalent to a single space
character (ASCII decimal 32). These rules allow authors considerable
flexibility when editing the marked-up text directly. "

Another way to put it is that any sequence of 1 or more blank spaces and 0
or more hard returns is treated as a single white space unless it occurs
within an element classed as literal text (like PRE); however, the &nbsp;
entity (and its numeric equivalent, &#160;), though displayed as a space, is
treated for the purposes of text flow as a non-space character, and so spaces
before and after an &nbsp; count as a space additional to the &nbsp;.
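
The collapsing rule can be sketched in a few lines of Python. This is only an illustration, not anything taken from the Recommendation: the function name is made up, and the set of white space characters (space, tab, form feed, carriage return, line feed) is my assumption. The non-breaking space (U+00A0, what &nbsp; decodes to) is deliberately left out of that set, so it survives as an ordinary character.

```python
import re

# Assumed set of HTML white space characters: space, tab, form feed,
# carriage return, line feed. U+00A0 (&nbsp;) is deliberately excluded.
# Note: Python's \s would also match U+00A0, so it is not used here.
HTML_WHITE_SPACE = re.compile(r"[ \t\f\r\n]+")


def collapse_white_space(text):
    """Replace each run of white space with one space (ASCII decimal 32).

    Hypothetical helper; models text outside literal elements such as PRE.
    """
    return HTML_WHITE_SPACE.sub(" ", text)


# Several spaces and hard returns collapse to single spaces.
print(collapse_white_space("one   two\n\nthree"))

# Spaces on either side of a non-breaking space are collapsed separately,
# so one space remains before it and one after it.
print(repr(collapse_white_space("a  \u00a0  b")))
```

Text inside PRE would simply not be passed through such a function.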

0 blank spaces and 1 or more hard returns usually behave like a white space,
except between most element tags (e.g., between a </td> and a <td>, or a
</p> and a <p>), where they're ignored. A special situation: when images
are to occur side by side, most browsers require that the close of the first
image tag be immediately followed by the open of the second image tag, like so

<img src="one.png" /><img src="two.png" />

or so

<img src="one.png" /><img
src="two.png" />

not so

<img src="one.png" />
<img src="two.png" />

because some browsers will add a white space between the images.  (The
quotation marks around the attribute value and the slash at the end of the
empty elements are XML-valid conventions that as far as I know all browsers
in use accept; see the XHTML 1.0 proposed rec.). I don't know whether that
behavior (the hard return between images = white space behavior) is
accounted for in the Recommendation or not; it is analogous to the way a hard
return is treated between text strings; for instance,


is treated as two words, not one. Thus I imagine this behavior has
to do with the way image elements are described in the DTD.
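
For the side-by-side image case, here is a rough Python sketch of how one might strip the white space between adjacent image tags before publishing a page. The function name is hypothetical, and a regular expression is not a real HTML parser (an attribute value containing ">" would confuse it), so take it as an illustration of the idea only.

```python
import re

# An <img ...> tag, then a run of white space, then the start of another
# <img tag. The lookahead keeps the second tag available for the next match.
GAP_BETWEEN_IMAGES = re.compile(
    r"(<img\b[^>]*>)[ \t\f\r\n]+(?=<img\b)", re.IGNORECASE
)


def join_adjacent_images(html):
    """Remove white space between consecutive image tags (hypothetical helper)."""
    return GAP_BETWEEN_IMAGES.sub(r"\1", html)


source = '<img src="one.png" />\n<img src="two.png" />'
print(join_adjacent_images(source))
```

White space between an image and ordinary text is left alone, since there it is usually wanted.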

P. T. Rourke

can someone point me to a document (document portion) or an article
which clearly defines the rules for treatment of white-space and
new-line characters within the HTML source ? Including the
rules for using entities such as &nbsp;

All my HTML books seem to be really vague about this subject, and
I am probably stupid 'cause I don't see how to get this info from the
