Erroneously-generated backslashes (was: Re: TrimEmptyElement()...)

From: Advocate <wotw@inetnebr.com>
Date: Sat, 28 Aug 1999 17:35:34 -0500
Message-Id: <v04205500b3ee0d8195f3@[]>
To: html-tidy@w3.org
Terry Teague <teague@mailandnews.com> wrote:
>At 6:58 PM -0700 8/21/99, Terry Teague wrote:
>>At 12:16 AM -0700 8/16/99, Terry Teague wrote:
>>>At 6:13 PM -0500 8/15/99, Advocate wrote:

>>>> <table border="1" summary="Demonstration of HTML Tidy's erroneous \
>>>> elimination of empty rows.">
>>>> (Hmm, I seem to have found a bug in MacTidy 1.0b2, specifically that
>>>> trailing backslash in the summary.  That can't be right, can it?  No, it
>>>> isn't being inserted by my mailer, I've double-checked that already.

>>> Would like to see what happens on other platforms.

>> I have now confirmed this bug occurs also on other platforms

> Just to followup - this is NOT a bug. Tidy's pretty print splits up long
> strings (e.g. "Demonstration of HTML Tidy's erroneous elimination of empty
> rows.") across lines, using a "\". This appears to be valid HTML - i.e. if
> you run the output described above through Tidy again, it finds no
> problems; also I tested in Netscape, and it was happy.

Only because backslash is a legal character in an attribute value of 
type CDATA.  Tidy is still adding information that wasn't there 
before, and would likely add it elsewhere, where it would get in the 
way in visual browsers (in a long ALT attribute, for instance).

Unless someone can point me to where in the HTML specification it 
says that newlines in attribute values are to be escaped with "\" 
(which is the only purpose I could see for HTML Tidy in generating 
it), I will still consider it to be a bug needing repair.

In <http://www.w3.org/TR/html40/types.html#type-cdata>,
it is written:

| o CDATA is a sequence of characters from the document character
|   set and may include character entities. User agents should
|   interpret attribute values as follows:
|       o Replace character entities with characters,
|       o Ignore line feeds,
|       o Replace each carriage return or tab with a single space.
|   User agents may ignore leading and trailing white space in
|   CDATA attribute values (e.g., "   myval   " may be interpreted as
|   "myval"). Authors should not declare attribute values with leading
|   or trailing white space.

Note the second sub-item: "Ignore line feeds".  It does not say, 
"Ignore '\'-escaped line feeds" or assign any special meaning to '\' 
in CDATA attributes.  Also note the third: "Replace each carriage 
return... with a single space", again lacking reference to a special 
meaning for backslash.  Thus, HTML Tidy is in error generating a 
backslash where there was not one before.
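
To make the point concrete, here is a rough sketch of the three quoted 
rules applied to the attribute value Tidy produced.  The function name 
normalize_cdata is my own invention for illustration, the rules are 
applied naively in the order the spec lists them, and Python's 
html.unescape stands in for entity replacement (it knows more entity 
names than HTML 4.0 defines).  It is not how any particular browser 
works.

```python
import html


def normalize_cdata(value: str) -> str:
    """Apply the three quoted CDATA rules, in the order listed (hypothetical helper)."""
    value = html.unescape(value)       # replace character entities with characters
    value = value.replace("\n", "")    # ignore line feeds
    # replace each carriage return or tab with a single space
    return value.replace("\r", " ").replace("\t", " ")


# The summary attribute as Tidy wrote it: a backslash, then a line feed.
tidy_output = "Demonstration of HTML Tidy's erroneous \\\nelimination of empty rows."
result = normalize_cdata(tidy_output)
print(result)
```

Under these rules the line feed disappears but the backslash stays in 
the value, since nothing in the list above gives it a special meaning.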

It _is_ a bug.

(HTML observation: those two sub-items pointed out above seem to 
convey a prejudice over end-of-line markers.  Under a document with 
UNIX line-ends, the linefeed would be ignored; under a document with 
Mac line-ends, the carriage return would be translated to a space; 
and under a document with PC line-ends, the CR would be turned to a 
space and the LF ignored.  And you can't mix-and-match line-ends in a 
single document.  One has to know what kind of line-ends they are 
authoring with and provide or not provide a whitespace character 
accordingly.  One could argue that one can err on the side of 
tolerance and always provide the whitespace, either at the end of the 
previous line or the start of the next, except if whitespace isn't 
being collapsed in that part of the document, for example when PRE 
or the style rule "white-space: pre;" is being used.)
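
The line-end observation can be checked with a small sketch.  The 
helper name strip_line_ends and the sample words are made up for 
illustration; only the two whitespace rules quoted from the spec are 
applied, and real user agents may well behave differently.

```python
def strip_line_ends(value: str) -> str:
    """Ignore line feeds, then turn each carriage return into a space (hypothetical helper)."""
    return value.replace("\n", "").replace("\r", " ")


# The same two words split across a line under each line-end convention.
line_ends = {"UNIX": "\n", "Mac": "\r", "PC": "\r\n"}
results = {name: strip_line_ends("empty" + eol + "rows")
           for name, eol in line_ends.items()}

for name, value in results.items():
    print(name, repr(value))
```

With UNIX line-ends the two words run together; with Mac or PC 
line-ends a single space separates them.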
          ,=<#)-=#  <http://www.war-of-the-worlds.org/>
  _-~_-(####)-_~-_  "Did you see that Parkins boy's body in the tunnels?" "Just
(#>_--'~--~`--_<#)  the photos.  Worst thing I've ever seen; kid had no face."
Received on Saturday, 28 August 1999 18:36:11 UTC
