Re: Bug + fix for illegal ampersands and character entities

Thanks, Randy, for the explanation and bug fix. I will incorporate
it in the next release, which I expect to make in March.


On Sat, 17 Feb 2001, Randy Waki wrote:

> Hi Dave,
> 
> (I hope your new job is going well.)
> 
> 4-Aug-2000 Tidy's handling of illegal ampersands such as "id=1&lang=en"
> is inconsistent with browsers.  This is especially important when the
> ampersand occurs in a URL, where a mistake results in a broken link.
> There are two reasonable interpretations:  1) the ampersand should have
> been escaped; 2) the entity name "lang" should have been terminated with
> a semicolon.
> 
> Tidy's current rule is: assume #2 if it would result in a valid HTML
> entity; otherwise assume #1.  So Tidy interprets the above as
> "id=1&lang;=en".
> 
> However, based on the example document below, IE 5.5 and Netscape 4.7
> appear to use a slightly different rule:  Assume #2 if it would result
> in a valid HTML entity WHOSE CHARACTER CODE IS < 256; otherwise assume
> #1.  (I suspect this is a side-effect of their implementation.  They
> probably have a table somewhere with 256 entries.  Grumble.)  Since the
> character code for the entity "lang" is 9001 decimal, IE and Netscape
> interpret the above as "id=1&amp;lang=en".
> 
> Tidy's rule can be fixed by changing the following if statement in
> lexer.c (search for "ch <= 0"):
> 
>     /* deal with unrecognized entities */
>     if (ch <= 0)
>     {
> 
> to:
> 
>     /* deal with unrecognized entities */
>     if (ch <= 0 || (ch >= 256 && c != ';'))
>     {
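
For anyone following along, here is a minimal standalone sketch of the
rule Randy describes. It is not the lexer.c code: the tiny entity table
and the names entity_code and resolve_entity are hypothetical, made up
for illustration. Only the condition itself mirrors the proposed fix.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature entity table; Tidy's real table is much larger. */
struct entity { const char *name; unsigned code; };

static const struct entity entities[] = {
    { "amp",  38   },
    { "copy", 169  },
    { "lang", 9001 },
};

/* Return the character code for a named entity, or 0 if it is unknown. */
unsigned entity_code(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof entities / sizeof entities[0]; ++i)
        if (strcmp(entities[i].name, name) == 0)
            return entities[i].code;
    return 0;
}

/*
 * Apply the proposed rule. `terminator` is the character that followed
 * the entity name in the input. Returns the entity's character code if
 * the text should be read as an entity, or 0 if the ampersand should be
 * treated as a literal one (that is, escaped as "&amp;").
 */
unsigned resolve_entity(const char *name, int terminator)
{
    unsigned ch = entity_code(name);

    /* Unknown entity, or a code >= 256 without a closing semicolon. */
    if (ch == 0 || (ch >= 256 && terminator != ';'))
        return 0;
    return ch;
}
```

With this sketch, resolve_entity("lang", '=') yields 0, so "id=1&lang=en"
keeps a literal ampersand, while resolve_entity("lang", ';') yields 9001
and resolve_entity("copy", '=') still yields 169.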

Regards,

-- Dave Raggett <dsr@w3.org> or <dave.raggett@openwave.com>
W3C Visiting Fellow, see http://www.w3.org/People/Raggett 
tel/fax: +44 122 578 3011 (or 2521) +44 771 213 7629 (mobile)

Received on Monday, 19 February 2001 09:54:20 UTC