- From: Lee Passey <lee@novonyx.com>
- Date: Fri, 31 Aug 2001 10:53:28 -0600
- To: html-tidy@w3.org
The default behavior of Tidy is to replace <p></p> with <br />.
I have some html files which contain the phrase <p>&nbsp;</p>, and I
expected that the same behavior would hold for them. This is not the case.
I got into the source code and discovered there are two reasons for
this:
(1) in TrimTrailingSpace(), a check is made for a character having the
value of 160 (0xa0), which is the code for a non-breaking space.
However, the character is stored in the buffer as utf-8, which makes it
a two-byte sequence (194 160, i.e. 0xc2 0xa0). Interestingly, the second
byte of the sequence is 160! Thus, the routine thinks it has found a
non-breaking space, which it has, but it only removes the second byte,
leaving a rogue 194 in the TextNode.
(2) in TrimSpaces(), no check is made for text nodes which have been
trimmed into oblivion.
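To make the two points above concrete, here is a minimal standalone
sketch in C. It is not Tidy's code: the names trim_trailing_space_utf8()
and text_is_empty() are hypothetical, and it works on a plain byte
buffer with start/end offsets rather than on Tidy's lexer and node
structures. Unlike the routine described above, it loops until no
trailing space is left, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Tidy: returns the new end offset of
   the text in buf[start..end) after dropping trailing ASCII spaces and
   utf-8 encoded non-breaking spaces (U+00A0 = bytes 0xC2 0xA0). */
size_t trim_trailing_space_utf8(const char *buf, size_t start, size_t end)
{
    assert(buf != NULL && start <= end);

    while (end > start) {
        unsigned char c = (unsigned char)buf[end - 1];

        if (c == ' ') {
            end -= 1;                 /* plain space: one byte */
        } else if (c == 0xA0 && end - start >= 2
                   && (unsigned char)buf[end - 2] == 0xC2) {
            end -= 2;                 /* nbsp: remove BOTH bytes */
        } else {
            break;                    /* anything else stays */
        }
    }
    return end;
}

/* Hypothetical helper: a text span trimmed "into oblivion" has
   start == end and can then be treated as an empty element. */
int text_is_empty(size_t start, size_t end)
{
    return start == end;
}
```

With a buffer holding only the two bytes 0xC2 0xA0, the function returns
the start offset, so text_is_empty() reports true and the caller could
treat the enclosing <p> as empty. A lone 0xA0 that is not preceded by
0xC2 is left untouched.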
I presume newer versions of tidy should include these fixes, so I am
including here the diffs from the 8-2000 version that I used to
accomplish this.
288a289
> /*! NOTE: &nbsp; is utf-8 encoded as two bytes */
293a295,299
> if ( (unsigned char)lexer->lexbuf[last->end - 1] == 0xc2
> && c == 0xa0)
> {
> last->end -= 1;
> }
297a304,308
> if ( (unsigned char) (lexer->lexbuf[ last->end - 1]) == 0xc2
> && c == 0xa0)
> {
> last->end -= 1;
> }
378a390
> {
379a392,394
> if (text->start == text->end)
> TrimEmptyElement( lexer, text );
> }
383a399
> {
384a401,403
> if (text->start == text->end)
> TrimEmptyElement( lexer, text );
> }
Received on Friday, 31 August 2001 12:51:20 UTC