- From: Dave Raggett <dsr@w3.org>
- Date: Mon, 19 Feb 2001 14:54:11 +0000 (GMT Standard Time)
- To: Randy Waki <rwaki@flipdog.com>
- cc: html-tidy@w3.org
Thanks Randy for the explanation and bug fix, I will incorporate
this in the next release, which I expect to make in March.
On Sat, 17 Feb 2001, Randy Waki wrote:
> Hi Dave,
>
> (I hope your new job is going well.)
>
> 4-Aug-2000 Tidy's handling of illegal ampersands such as "id=1&lang=en"
> is inconsistent with browsers. This is especially important when the
> ampersand occurs in a URL, where a mistake results in a broken link.
> There are two reasonable interpretations: 1) the ampersand should have
> been escaped; 2) the entity name "lang" should have been terminated with
> a semicolon.
>
> Tidy's current rule is: assume #2 if it would result in a valid HTML
> entity; otherwise assume #1. So Tidy interprets the above as
> "id=1⟨=en".
>
> However, based on the example document below, IE 5.5 and Netscape 4.7
> appear to use a slightly different rule: Assume #2 if it would result
> in a valid HTML entity WHOSE CHARACTER CODE IS < 256; otherwise assume
> #1. (I suspect this is a side-effect of their implementation. They
> probably have a table somewhere with 256 entries. Grumble.) Since the
> character code for the entity "lang" is 9001 decimal, IE and Netscape
> interpret the above as "id=1&lang=en".
>
> Tidy's rule can be fixed by changing the following if statement in
> lexer.c (search for "ch <= 0"):
>
> /* deal with unrecognized entities */
> if (ch <= 0)
> {
>
> to:
>
> /* deal with unrecognized entities */
> if (ch <= 0 || (ch >= 256 && c != ';'))
> {
Regards,
-- Dave Raggett <dsr@w3.org> or <dave.raggett@openwave.com>
W3C Visiting Fellow, see http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 771 213 7629 (mobile)
Received on Monday, 19 February 2001 09:54:20 UTC