Re: Spaces in IRIs

From: Sam Ruby <rubys@us.ibm.com>
Date: Fri, 01 Feb 2008 21:15:27 -0500
Message-ID: <47A3D23F.5010206@us.ibm.com>
To: Anne van Kesteren <annevk@opera.com>
CC: Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>

Anne van Kesteren wrote:
> On Fri, 01 Feb 2008 00:52:14 +0100, Sam Ruby <rubys@us.ibm.com> wrote:
>> I believe that advice applies here.  Spaces in IRI should be an error.
> 
> You might want to have a look at the work on revising RFC 3987:
> 
>   https://datatracker.ietf.org/drafts/draft-duerst-iri-bis/
> 
> It introduces a "Legacy Extended IRI" (LEIRI) syntax that allows spaces 
> and various other characters. This syntax is primarily designed for 
> markup languages.

IMHO, that would be unfortunate.  As I pointed out, when I try to detect 
whether a URI in a feed is a relative reference (a common error in RSS 
feeds, where such usage is ambiguous), I often find that the value can't 
be parsed as a URI at all.  Digging deeper, the problem is often a 
missing close quote (a missing open quote is another common error).  I 
would be interested to see whether this is also the case in Henri's 
data, if he were to dig deeper into the specific errors he sees.

There are also languages like Ruby where URI.parse throws an exception 
when an attempt is made to parse a purported URI that contains a space.
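
A minimal sketch of that behavior, assuming Ruby's standard uri library. 
The helper name parse_or_nil and the example.com URLs are made up for 
illustration; the third sample imitates a missing close quote that lets 
the next attribute leak into the value:

```ruby
require "uri"

# Hypothetical helper (not part of Ruby's standard library): returns the
# parsed URI, or nil when URI.parse rejects the string.
def parse_or_nil(candidate)
  URI.parse(candidate)
rescue URI::InvalidURIError
  nil
end

samples = [
  "http://example.com/a%20b",                # space percent-encoded
  "http://example.com/a b",                  # unencoded space
  "http://example.com/a.html type=text/html" # swallowed attribute
]

samples.each do |candidate|
  verdict = parse_or_nil(candidate) ? "parsed" : "rejected"
  puts "#{verdict}: #{candidate}"
end
```

Only the first sample should parse; the other two contain an unencoded 
space and are rejected, which is what makes the quoting mistake easy to 
notice.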

I'll also point out that as URIs don't (currently) allow unencoded spaces 
or quote characters, one generally doesn't need to worry about quoting 
such values in the HTML5 serialization.

Finally, I will point out that the exclusion of space characters wasn't 
an error or omission; it was a very explicit and conscious decision. 
From http://www.ietf.org/rfc/rfc1630.txt (1994):

       The use of white space characters is risky in URIs to be printed
       or sent by electronic mail, and the use of multiple white space
       characters is very risky.  This is because of the frequent
       introduction of extraneous white space when lines are wrapped by
       systems such as mail, or sheer necessity of narrow column width,
       and because of the inter-conversion of various forms of white
       space which occurs during character code conversion and the
       transfer of text between applications.  This is why the canonical
       form for URIs has all white spaces encoded.

- Sam Ruby
Received on Saturday, 2 February 2008 02:15:51 GMT
