- From: Dave Raggett <dsr@w3.org>
- Date: Mon, 19 Feb 2001 14:54:11 +0000 (GMT Standard Time)
- To: Randy Waki <rwaki@flipdog.com>
- cc: html-tidy@w3.org
Thanks Randy for the explanation and bug fix, I will incorporate
this in the next release, which I expect to make in March.

On Sat, 17 Feb 2001, Randy Waki wrote:

> Hi Dave,
>
> (I hope your new job is going well.)
>
> 4-Aug-2000 Tidy's handling of illegal ampersands such as "id=1&lang=en"
> is inconsistent with browsers.  This is especially important when the
> ampersand occurs in a URL, where a mistake results in a broken link.
> There are two reasonable interpretations: 1) the ampersand should have
> been escaped; 2) the entity name "lang" should have been terminated with
> a semicolon.
>
> Tidy's current rule is: assume #2 if it would result in a valid HTML
> entity; otherwise assume #1.  So Tidy interprets the above as
> "id=1&lang;=en".
>
> However, based on the example document below, IE 5.5 and Netscape 4.7
> appear to use a slightly different rule: Assume #2 if it would result
> in a valid HTML entity WHOSE CHARACTER CODE IS < 256; otherwise assume
> #1.  (I suspect this is a side-effect of their implementation.  They
> probably have a table somewhere with 256 entries.  Grumble.)  Since the
> character code for the entity "lang" is 9001 decimal, IE and Netscape
> interpret the above as "id=1&lang=en".
>
> Tidy's rule can be fixed by changing the following if statement in
> lexer.c (search for "ch <= 0"):
>
>     /* deal with unrecognized entities */
>     if (ch <= 0)
>     {
>
> to:
>
>     /* deal with unrecognized entities */
>     if (ch <= 0 || (ch >= 256 && c != ';'))
>     {

Regards,

-- Dave Raggett <dsr@w3.org> or <dave.raggett@openwave.com>
W3C Visiting Fellow, see http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 771 213 7629 (mobile)
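
For anyone who wants to try the revised rule outside of Tidy, here is a minimal standalone sketch of the condition from Randy's patch. The four-entry table and the function names entity_code() and treat_as_entity() are made up for illustration and are not Tidy's actual lexer code; only the if condition is taken from the patch.

```c
#include <string.h>

/* Hypothetical miniature entity table; a real table is much larger. */
struct entity { const char *name; int code; };

static const struct entity entities[] = {
    { "amp",  38   },
    { "lt",   60   },
    { "copy", 169  },
    { "lang", 9001 },
};

/* Return the character code for a named entity, or 0 if unknown. */
int entity_code(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof entities / sizeof entities[0]; ++i)
        if (strcmp(entities[i].name, name) == 0)
            return entities[i].code;
    return 0;
}

/*
 * Decide how to read "&name" when it is followed by character c.
 * Returns 1 to decode it as an entity, 0 to treat the ampersand
 * as a literal that should have been escaped.
 */
int treat_as_entity(const char *name, int c)
{
    int ch = entity_code(name);

    /* Same condition as the patched if statement: unknown names,
       and known names with code >= 256 that lack the terminating
       semicolon, are not decoded. */
    if (ch <= 0 || (ch >= 256 && c != ';'))
        return 0;
    return 1;
}
```

With this sketch, treat_as_entity("lang", '=') returns 0, so "id=1&lang=en" keeps its ampersand as the browsers do, while treat_as_entity("lang", ';') and treat_as_entity("copy", '=') both return 1.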
Received on Monday, 19 February 2001 09:54:20 UTC