White Space Bugs in Popular Browsers

The HTML 4.0 draft specification contains the following passage at
     http://www.w3.org/TR/WD-html40/struct/text.html#h-7.3.1

> A line break occurring immediately following a
> start tag should be discarded, as should a line
> break occurring immediately before an end tag. 
> This applies to all HTML elements without exceptions.

(HTML 3.2 and earlier versions had a similar statement,
if not as strongly worded.)  Of course this rule is not 
always followed by popular browsers such as Navigator and
IE.  Even Lynx (2.4 and 2.7) renders the following with an
extra space:

<A HREF="http://www.w3.org/">The W3C
</A>

I have a recently-updated page that demonstrates how this
browser behavior can affect the presentation of tables and
images.  The document is available at:
    http://www.emf.net/~estephen/htmlner/whitespacebugs.html

The above page establishes the position, and documents some
inconsistencies.  I'd like to take the argument further and
explore some esoteric parts of white space rules.

I have two questions about ambiguities in what it means to
discard line breaks (carriage returns).

1. Should multiple carriage returns also be discarded?  The next
   sentence in the specification is:

> In addition, for all elements except PRE, a sequence
> of contiguous white space characters such as spaces,
> horizontal tabs, form feeds and line breaks, should be
> replaced by a single word space. 

(I agree with Arnoud's earlier comment that the specification
should mention non-breaking spaces here, or else specifically
disqualify non-breaking spaces if that is the intention.)

"In addition" is ambiguous here.  Consider the following:

-----8<-----begin_code-----8<-----
<A HREF="http://www.w3.org/">The W3C


</A>
-----8<-----end_code-----8<-----

The HTML 4.0 specification tells a user agent to collapse
multiple carriage returns into a single space, which turns this
into:

<A HREF="http://www.w3.org/">The W3C </A>

(These rules of white space apply to all HTML elements, so don't
object that I'm using the anchor as an example here.  It applies
equally well to </TD>, say.)
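
To make the literal reading concrete, here is a rough sketch in
Python.  The function name and the regular expressions are my own
illustration, not anything taken from the specification or from a
browser; it assumes the "discard one line break" rule is applied
first and the "collapse white space" rule second.

```python
import re

# Hypothetical helper: a literal, regex-based reading of the two
# sentences quoted from the HTML 4.0 draft. Not real parser code.
LINE_BREAK = r"(?:\r\n|\r|\n)"

def literal_reading(html: str) -> str:
    # Rule 1a: discard ONE line break immediately after a start tag.
    html = re.sub(r"(<[A-Za-z][^>]*>)" + LINE_BREAK, r"\1", html)
    # Rule 1b: discard ONE line break immediately before an end tag.
    html = re.sub(LINE_BREAK + r"(</[A-Za-z][^>]*>)", r"\1", html)
    # Rule 2: collapse any remaining run of white space to one space.
    return re.sub(r"[ \t\f\r\n]+", " ", html)

one_break = '<A HREF="http://www.w3.org/">The W3C\n</A>'
three_breaks = '<A HREF="http://www.w3.org/">The W3C\n\n\n</A>'

print(literal_reading(one_break))     # no space before </A>
print(literal_reading(three_breaks))  # one space before </A>
```

With one line break the anchor text ends flush against </A>; with
three, two line breaks survive the first rule and are collapsed
into a trailing space.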

This means that the number of carriage returns before an end
tag is significant: one is discarded, but two or more become a
space.  That seems counter to HTML's white space philosophy to
me, but I'm not sure of the intention here, and I'm not familiar
enough with the underlying SGML end token rules.  So perhaps
it's intentional that two carriage returns are treated
differently from one.

If we agree that there shouldn't be a difference between one
carriage return and two carriage returns, perhaps the specification
could demand that "Any number of line breaks" after a start tag or
before an end tag be discarded.
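
A sketch of what that wording would mean, again in Python and
again purely illustrative (the function name and patterns are
hypothetical, and real tag parsing is more involved than a
regular expression):

```python
import re

def proposed_reading(html: str) -> str:
    # Hypothetical rule: discard ANY number of line breaks
    # immediately after a start tag ...
    html = re.sub(r"(<[A-Za-z][^>]*>)[\r\n]+", r"\1", html)
    # ... and ANY number of line breaks immediately before an end tag.
    html = re.sub(r"[\r\n]+(</[A-Za-z][^>]*>)", r"\1", html)
    # Remaining runs of white space still collapse to a single space.
    return re.sub(r"[ \t\f\r\n]+", " ", html)

one_break = '<A HREF="http://www.w3.org/">The W3C\n</A>'
three_breaks = '<A HREF="http://www.w3.org/">The W3C\n\n\n</A>'

# Under this rule both spellings give the same result.
print(proposed_reading(one_break) == proposed_reading(three_breaks))
```

Under this reading the count of line breaks before </A> no longer
affects the result.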


2. What about end tags that are implied by the presence of a start
   tag?

Consider a table example.

<TABLE border="1">
<TR>
<TD>Data Cell One
<TD>Data Cell Two
<TR>
<TD>Data Cell Three
<TD>Data Cell Four
</TABLE>

Popular browsers actually treat the following arrangement of the
table elements differently:

<TABLE border="1">
<TR><TD>Data Cell One</TD>
<TD>Data Cell Two</TD></TR>
<TR><TD>Data Cell Three</TD>
<TD>Data Cell Four</TD></TR></TABLE>

The removal of white space before the end tags causes the
table to be rendered without any trailing spaces in the data
cells.

However, if the end tags are omitted, or if there is white space
before them, then trailing spaces *are* rendered in the data
cells.

I believe the specification should state explicitly that

<TD>Data Cell One
<TD>Data Cell Two

is equivalent to:

<TD>Data Cell One
</TD><TD>Data Cell Two

and therefore equivalent to:

<TD>Data Cell One</TD><TD>Data Cell Two

The current statement in the 4.0 draft specification doesn't
cover line breaks *before* start tags, even when such a start
tag implies an end tag.
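
The equivalence I am asking for can be sketched as follows.  This
is a toy Python illustration, not a real HTML parser: the function
name is made up, it only knows about TD, and it does not close the
final cell.

```python
import re

def explicit_cells(fragment: str) -> str:
    """Make implied </TD> end tags explicit, discarding line breaks
    before every </TD>, whether written out or implied."""
    out = []
    cell_open = False
    # Split into tags and text; the capture group keeps the tags.
    for token in re.split(r"(<[^>]+>)", fragment):
        if not token:
            continue
        upper = token.upper()
        if upper.startswith("<TD"):
            if cell_open:
                # A new <TD> implies </TD> for the previous cell.
                if out:
                    out[-1] = out[-1].rstrip("\r\n")
                out.append("</TD>")
            cell_open = True
            out.append(token)
        elif upper == "</TD>":
            # Explicit end tag: drop line breaks just before it.
            if out:
                out[-1] = out[-1].rstrip("\r\n")
            out.append(token)
            cell_open = False
        else:
            out.append(token)
    return "".join(out)

implied = "<TD>Data Cell One\n<TD>Data Cell Two"
explicit_with_break = "<TD>Data Cell One\n</TD><TD>Data Cell Two"
explicit_no_break = "<TD>Data Cell One</TD><TD>Data Cell Two"

print(explicit_cells(implied))
print(explicit_cells(explicit_with_break))
print(explicit_cells(explicit_no_break))
```

All three spellings come out as the third one, with no line break
left at the end of the first cell.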

                              * * *

The presence or absence of a single trailing space in the
rendering may not seem particularly important, since the
structure is still presented, but sometimes the presentation of
the extra white space can be confusing to viewers.

Consider the examples of using images in tables.  A trailing space
causes the images to not be rendered next to each other, when the
images should be considered continuous.  This has structural
implications.

Furthermore, an image being used as an anchor for a link
such as:

<A HREF="http://www.foo.com/"><IMG SRC="foo.gif">
</A>

is rendered incorrectly by browsers, with a trailing underlined
space that is confusing to viewers.

To avoid these ambiguities and behaviors, I applaud the strong
wording of the current HTML 4.0 draft.  I just want to make sure
that the demand for line break removal is as robust as possible,
so that user agents don't have any excuse for non-compliance.
-- 
E. Stephen Mack <estephen@emf.net>
http://www.emf.net/~estephen/

Received on Saturday, 12 July 1997 17:46:02 UTC