- From: Sam Ruby <rubys@us.ibm.com>
- Date: Fri, 01 Feb 2008 21:15:27 -0500
- To: Anne van Kesteren <annevk@opera.com>
- CC: Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Anne van Kesteren wrote:
> On Fri, 01 Feb 2008 00:52:14 +0100, Sam Ruby <rubys@us.ibm.com> wrote:
>> I believe that advice applies here. Spaces in IRI should be an error.
>
> You might want to have a look at the work on revising RFC 3987:
>
> https://datatracker.ietf.org/drafts/draft-duerst-iri-bis/
>
> It introduces a "Legacy Extended IRI" (LEIRI) syntax that allows spaces
> and various other characters. This syntax is primarily designed for
> markup languages.

IMHO, that would be unfortunate.

As I pointed out, a common problem I see in feeds, when trying to detect whether a URI is a relative reference (a common error in RSS feeds, where such usage is ambiguous), is that the URI can't be parsed as a URI at all. Digging deeper, the problem often is a missing close quote (a missing open quote is another common error). I would be interested to see, if Henri were to dig deeper into the specific errors he sees, whether this is also the case in his data.

There are also languages like Ruby where URI.parse throws an exception when an attempt is made to parse a purported URI that contains a space.

I'll also point out that as URIs don't (currently) allow unencoded spaces or quote characters, one generally doesn't need to worry about quoting such values in the HTML5 serialization.

Finally, I will point out that the exclusion of space characters wasn't an error or omission; it was a very explicit and conscious decision. From http://www.ietf.org/rfc/rfc1630.txt (1994):

   The use of white space characters is risky in URIs to be printed
   or sent by electronic mail, and the use of multiple white space
   characters is very risky. This is because of the frequent
   introduction of extraneous white space when lines are wrapped by
   systems such as mail, or sheer necessity of narrow column width,
   and because of the inter-conversion of various forms of white
   space which occurs during character code conversion and the
   transfer of text between applications. This is why the canonical
   form for URIs has all white spaces encoded.

- Sam Ruby
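As an illustration of the URI.parse behaviour mentioned above, here is a minimal sketch using Ruby's standard uri library. The helper name parseable_uri? and the example.com addresses are hypothetical, chosen only for this example; the exact exception message may differ between Ruby versions, so it is not shown.

```ruby
require "uri"

# Hypothetical helper: returns true when Ruby's standard URI parser
# accepts the string, false when it raises URI::InvalidURIError.
def parseable_uri?(candidate)
  URI.parse(candidate)
  true
rescue URI::InvalidURIError
  false
end

# Made-up example values, not taken from any real feed.
with_space = "http://example.com/a b"    # raw space in the path
encoded    = "http://example.com/a%20b"  # space percent-encoded

puts parseable_uri?(with_space)
puts parseable_uri?(encoded)
```

A check along these lines is one way a feed consumer might flag values that cannot be parsed as URIs at all, such as those left behind by a missing close quote, before trying to decide whether they are relative references.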
Received on Saturday, 2 February 2008 02:15:51 UTC