Re: Spaces in IRIs

Philip Taylor wrote:
> On 02/02/08 02:15, Sam Ruby wrote:
>>
>> Anne van Kesteren wrote:
>>> On Fri, 01 Feb 2008 00:52:14 +0100, Sam Ruby <rubys@us.ibm.com> wrote:
>>>> I believe that advice applies here.  Spaces in IRIs should be an error.
>>>
>>> You might want to have a look at the work on revising RFC 3987:
>>>
>>>   https://datatracker.ietf.org/drafts/draft-duerst-iri-bis/
>>>
>>> It introduces a "Legacy Extended IRI" (LEIRI) syntax that allows 
>>> spaces and various other characters. This syntax is primarily 
>>> designed for markup languages.
>>
>> IMHO, that would be unfortunate.  As I pointed out, a common error I 
>> see in feeds, when trying to detect whether a URI is a relative 
>> reference (a common error in RSS feeds, where such usage is 
>> ambiguous), is that the URI can't be parsed as a URI at all.  Digging 
>> deeper, the problem often is a missing close quote (a missing open 
>> quote is another common error).  I would be interested to see 
>> whether this is also the case in Henri's data, if he were to dig 
>> deeper into the specific errors he sees.
> 
> http://philip.html5.org/data/spaced-uris.txt shows some offending URIs.
> 
> The 6149 values can be grouped in various ways (where each one might 
> overlap several categories):
> 
> 936 only have spaces at the very beginning or end of the string.
> 291 only have spaces after a '#'.
> 2860 only have spaces after a '?'.
> 576 start with "mailto:".
> 73 contain a '<'.
> 57 contain a '>'.
> 23 contain a '"'.
> 78 match / [A-Za-z]+=/.
> 
> So it looks like maybe 2-3% are accidentally missing quotes, and the 
> rest are intentionally using spaces in filenames or query strings or 
> fragment identifiers.
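
A grouping like the one quoted above could be reproduced roughly as follows. This is only a sketch: the function name categorize and the category labels are made up for illustration, and the reading of "only have spaces after a '#'" (no space before the first '#') is an assumption, not necessarily the rule Philip's script used.

```python
import re

def categorize(uri):
    """Return the (possibly overlapping) category labels for a URI with spaces.

    Labels and matching rules are illustrative assumptions, not the
    original script's logic.
    """
    cats = set()
    # Spaces only at the very beginning or end of the string.
    if " " in uri and " " not in uri.strip(" "):
        cats.add("edge-only")
    # Spaces only after the first '#': nothing before it contains a space.
    if "#" in uri and " " in uri and " " not in uri.split("#", 1)[0]:
        cats.add("fragment-only")
    # Spaces only after the first '?'.
    if "?" in uri and " " in uri and " " not in uri.split("?", 1)[0]:
        cats.add("query-only")
    if uri.startswith("mailto:"):
        cats.add("mailto")
    if "<" in uri:
        cats.add("contains-lt")
    if ">" in uri:
        cats.add("contains-gt")
    if '"' in uri:
        cats.add("contains-quote")
    # Same pattern as in the list: a space, letters, then '=', which
    # suggests a following attribute swallowed by a missing quote.
    if re.search(r" [A-Za-z]+=", uri):
        cats.add("attribute-like")
    return cats
```

Because one value can match several rules, the per-category counts are not expected to add up to the total.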

I would be curious to find out what parser you used to produce these 
results.  Noting that not a single tag shown contains so much as a 
title or a class attribute, I picked one and dug in a bit further:

http://www.allmovie.com/cg/avg.dll?p=avg&amp;amp;amp;sql=1:162971
   <a href="/cg/avg.dll?p=avg&amp;sql=34: title="/>

Fetching either that page or a page with the &amp;amp;amp; replaced by 
a single &, I find the following:

<a href="/cg/avg.dll?p=avg&amp;sql=34:" title="New Releases" class="left">

At the present time that tag has matched quotes and no spaces in the URI.
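
For anyone wanting to repeat this kind of check, here is a minimal sketch using Python's standard html module. It is not the tool used for either set of results; the names fully_unescape, HrefCollector and hrefs_in are hypothetical helpers introduced only for this example.

```python
from html import unescape
from html.parser import HTMLParser

def fully_unescape(text):
    """Decode HTML entities repeatedly until the text stops changing."""
    previous = None
    while previous != text:
        previous, text = text, unescape(text)
    return text

class HrefCollector(HTMLParser):
    """Collect the href values of <a> tags as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def hrefs_in(markup):
    """Return every <a href> value found in a fragment of markup."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs

# The tag as currently served; attribute values come back entity-decoded.
tag = '<a href="/cg/avg.dll?p=avg&amp;sql=34:" title="New Releases" class="left">'
hrefs = hrefs_in(tag)
has_space = any(" " in href for href in hrefs)
```

Feeding the tag from the data file through the same function is one way to see how a given parser treats the missing quote, though different parsers may well disagree on such input.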

- Sam Ruby

Received on Sunday, 3 February 2008 12:51:34 UTC