- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Sun, 03 Feb 2008 13:28:35 +0000
- To: Sam Ruby <rubys@us.ibm.com>
- CC: HTML WG <public-html@w3.org>
On 03/02/08 12:50, Sam Ruby wrote:
> Philip Taylor wrote:
>> http://philip.html5.org/data/spaced-uris.txt shows some offending URIs.
>>
>> [...]
>
> I would be curious to find out what parser you used to produce these
> results.

The Validator.nu HTML Parser.

> Noting that not a single tag shown contains a so much as a
> title or a class attribute

Sorry, that comes from a confusing presentation of the data - it's just showing the element name, attribute name (for the attribute containing the URI) and attribute value, written in a format that happens to look like an XML element. It's not a copy or reserialisation of the input element, so it never has any other attributes.

> http://www.allmovie.com/cg/avg.dll?p=avg&amp;amp;sql=1:162971
> <a href="/cg/avg.dll?p=avg&sql=34: title="/>
>
> Fetching either that page (or a page with the &amp;amp; replaced by
> a single &

(Also sorry. The original URI from dmoz.org had "&" in it, so that part is not my fault. But I treated that as "&amp;" when downloading pages, since I didn't unescape the XML-encoded list of URIs, and after noticing the problem I didn't want to re-download everything since it only affects 1% of the pages and the data is already noisy enough that it wouldn't make a significant difference. And then I didn't bother unescaping the XML-encoded output data when converting it to the .txt file, hence the "&amp;amp;".)

> I find the following:
> <a href="/cg/avg.dll?p=avg&sql=34:" title="New Releases" class="left">

On http://www.allmovie.com/cg/avg.dll?p=avg&amp;amp;sql=1:162971 I also find:

<div class="bottom_tab"><a href="/cg/avg.dll?p=avg&sql=34: title="click for full list"><img src="/img/nr_tab.gif" alt="full article" width="74" height="20px" /></a></div>

which has the missing quote.

-- 
Philip Taylor
pjt47@cam.ac.uk
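
For readers following the "&amp;amp;" explanation above, here is a minimal sketch of how two layers of XML encoding without a matching unescape produce that string. It is not the script used to gather the data. The single-"&" form of the URI is inferred from the description in the message, and the variable names are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# Assumed real URI: the single-"&" form described in the message.
real_uri = "http://www.allmovie.com/cg/avg.dll?p=avg&sql=1:162971"

# Layer 1: an XML-encoded list of URIs stores "&" as "&amp;".
# Using the entry without unescaping means "&amp;" is treated as literal text.
listed = escape(real_uri)

# Layer 2: the results are XML-encoded again on output and never unescaped.
reported = escape(listed)

print(listed)    # contains "p=avg&amp;sql=1:162971"
print(reported)  # contains "p=avg&amp;amp;sql=1:162971"

# Unescaping once per encoding layer recovers the real URI.
assert unescape(unescape(reported)) == real_uri
```

Unescaping once when reading the list, and once when writing the .txt file, would have avoided both extra layers.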
Received on Sunday, 3 February 2008 13:28:52 UTC