Re: Spaces in IRIs

On 03/02/08 12:50, Sam Ruby wrote:
> Philip Taylor wrote:
>> http://philip.html5.org/data/spaced-uris.txt shows some offending URIs.
>>
>> [...]
> 
> I would be curious to find out what parser you used to produce these 
> results.

The Validator.nu HTML Parser.

> Noting that not a single tag shown contains so much as a 
> title or a class attribute

Sorry, that comes from a confusing presentation of the data - it's just 
showing the element name, attribute name (for the attribute containing 
the URI) and attribute value, written in a format that happens to look 
like an XML element. It's not a copy or reserialisation of the input 
element, so it never has any other attributes.
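As a rough illustration of that presentation (a sketch only, not the 
script actually used; the function name format_row is made up here), 
each row is built from just three values, which is why no other 
attributes can ever appear:

```python
from xml.sax.saxutils import escape


def format_row(element, attribute, value):
    """Render one (element, attribute, value) triple as an XML-like string.

    Hypothetical helper: it only illustrates why the output resembles an
    element without being a reserialisation of the input element.
    """
    # Escape &, < and > plus the double quote, as an XML writer would.
    escaped = escape(value, {'"': "&quot;"})
    return '<%s %s="%s"/>' % (element, attribute, escaped)


row = format_row("a", "href", "/cg/avg.dll?p=avg&sql=34: title=")
print(row)
```

The value passed in is whatever the parser reported for that one 
attribute, including any text it swallowed because of broken quoting.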

> http://www.allmovie.com/cg/avg.dll?p=avg&sql=1:162971
>   <a href="/cg/avg.dll?p=avg&amp;sql=34: title="/>
> 
> Fetching either that page (or a page with the &amp;amp;amp; replaced by 
> a single &

(Sorry about that too. The original URI from dmoz.org had "&amp;" in 
it, so that part is not my fault. But I didn't unescape the XML-encoded 
list of URIs, so I treated it as "&amp;amp;" when downloading pages. 
After noticing the problem I didn't want to re-download everything: it 
only affects 1% of the pages, and the data is already noisy enough that 
it wouldn't make a significant difference. I then didn't bother 
unescaping the XML-encoded output data when converting it to the .txt 
file, hence the "&amp;amp;amp;".)
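The layering can be reproduced in a few lines. This is only a sketch 
using Python's xml.sax.saxutils; the starting string is an assumed 
"intended" form of the path, and unescape_fully is a made-up helper 
name, not part of any tool mentioned in this thread:

```python
from xml.sax.saxutils import escape, unescape

# Assumed intended form of the path, with a bare ampersand.
intended = "/cg/avg.dll?p=avg&sql=1:162971"

# Each XML encoding step that is never undone adds one more "amp;".
layers = [intended]
for _ in range(3):
    layers.append(escape(layers[-1]))

for depth, text in enumerate(layers):
    print(depth, text)


def unescape_fully(text, limit=10):
    """Undo XML escaping repeatedly until the text stops changing.

    Blunt on purpose: it would also decode an "&amp;" that was meant
    literally, so it suits cleaning up data known to be over-escaped.
    """
    for _ in range(limit):
        decoded = unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


print(unescape_fully(layers[3]))
```

Unescaping exactly once per encoding layer is the precise fix; the 
loop is just a convenient way to clean up after the fact.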

> I find the following:
> <a href="/cg/avg.dll?p=avg&amp;sql=34:" title="New Releases" class="left">

On http://www.allmovie.com/cg/avg.dll?p=avg&amp;amp;amp;sql=1:162971 I 
also find:

<div class="bottom_tab"><a href="/cg/avg.dll?p=avg&amp;sql=34: 
title="click for full list"><img src="/img/nr_tab.gif" alt="full 
article" width="74" height="20px" /></a></div>

which is missing the closing quote after the href value.
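The effect of that unclosed quote is easy to reproduce. The sketch 
below uses Python's html.parser purely as a stand-in (it is not the 
Validator.nu HTML Parser, and other parsers may recover differently); 
the class name AttributeCollector is made up for the example:

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record (tag, attribute, value) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.seen.append((tag, name, value))


# The markup quoted above, with the e-mail line wrapping removed.
markup = (
    '<div class="bottom_tab"><a href="/cg/avg.dll?p=avg&amp;sql=34: '
    'title="click for full list"><img src="/img/nr_tab.gif" '
    'alt="full article" width="74" height="20px" /></a></div>'
)

collector = AttributeCollector()
collector.feed(markup)
collector.close()

# The href value runs on until the next double quote, which is the one
# that was meant to open the title value.
hrefs = [v for (t, n, v) in collector.seen if t == "a" and n == "href"]
print(hrefs)
```

Because the value only ends at the quote intended to open the title 
attribute, the href picks up a space and the text "title=", which is 
how a URI with a space in it ends up in the data.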

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Sunday, 3 February 2008 13:28:52 UTC