Re: Spaces in IRIs from Philip Taylor on 2008-02-02 (public-html@w3.org from February 2008)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Sat, 02 Feb 2008 14:08:26 +0000
To: Sam Ruby <rubys@us.ibm.com>
CC: Anne van Kesteren <annevk@opera.com>, Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Message-ID: <47A4795A.6000707@cam.ac.uk>

On 02/02/08 02:15, Sam Ruby wrote:
> 
> Anne van Kesteren wrote:
>> On Fri, 01 Feb 2008 00:52:14 +0100, Sam Ruby <rubys@us.ibm.com> wrote:
>>> I believe that advice applies here.  Spaces in IRI should be an error.
>>
>> You might want to have a look at the work on revising RFC 3987:
>>
>>   https://datatracker.ietf.org/drafts/draft-duerst-iri-bis/
>>
>> It introduces a "Legacy Extended IRI" (LEIRI) syntax that allows 
>> spaces and various other characters. This syntax is primarily designed 
>> for markup languages.
> 
> IMHO, that would be unfortunate.  As I pointed out, a common error I see
> in feeds, when trying to detect whether a URI is a relative reference (a
> common error in RSS feeds, where such usage is ambiguous), is that the URI
> can't be parsed as a URI at all.  Digging deeper, the problem often is a
> missing close quote (a missing open quote is another common error).  I
> would be interested to see, if Henri were to dig deeper into the specific
> errors he sees, whether this is also the case in his data.

http://philip.html5.org/data/spaced-uris.txt shows some offending URIs.

The 6149 values can be grouped in various ways (the categories overlap, 
so a single value may be counted in several of them):

936 only have spaces at the very beginning or end of the string.
291 only have spaces after a '#'.
2860 only have spaces after a '?'.
576 start with "mailto:".
73 contain a '<'.
57 contain a '>'.
23 contain a '"'.
78 match / [A-Za-z]+=/.

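So, judging by the last four categories (values containing '<', '>' or 
'"', or text that looks like a following attribute), it looks like maybe 
2-3% are accidentally missing quotes, and the rest are intentionally 
using spaces in filenames or query strings or fragment identifiers.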
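
For anyone wanting to run a similar grouping over their own data, here is 
a minimal sketch in Python. It is not the script that produced the counts 
above; the category names, the helper functions and the exact reading of 
"only have spaces after X" (no space occurs before the first X) are all 
assumptions made for illustration.

```python
import re
from collections import Counter

# Attribute-like text such as ' alt=' hints that a missing close quote
# let the attribute value swallow the next attribute (assumed heuristic).
ATTR_LIKE = re.compile(r" [A-Za-z]+=")


def spaces_only_after(value, delim):
    """True if delim occurs and every space comes after its first occurrence."""
    head, sep, tail = value.partition(delim)
    return bool(sep) and " " not in head and " " in tail


def categorize(value):
    """Return the set of (hypothetical) category names a value falls into."""
    cats = set()
    # Spaces exist, but none remain once leading/trailing spaces are removed.
    if " " in value and " " not in value.strip(" "):
        cats.add("edge-only")
    if spaces_only_after(value, "#"):
        cats.add("after-hash")
    if spaces_only_after(value, "?"):
        cats.add("after-query")
    if value.startswith("mailto:"):
        cats.add("mailto")
    for char, name in (("<", "lt"), (">", "gt"), ('"', "quote")):
        if char in value:
            cats.add(name)
    if ATTR_LIKE.search(value):
        cats.add("attr-like")
    return cats


def tally(values):
    """Count how many values fall into each category (categories may overlap)."""
    counts = Counter()
    for value in values:
        counts.update(categorize(value))
    return counts
```

Calling tally() on the lines of a file of attribute values gives one count 
per category. Because a value can land in several categories, the counts 
are not expected to add up to the total number of values.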

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Saturday, 2 February 2008 14:12:03 UTC