- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Sat, 02 Feb 2008 14:08:26 +0000
- To: Sam Ruby <rubys@us.ibm.com>
- CC: Anne van Kesteren <annevk@opera.com>, Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
On 02/02/08 02:15, Sam Ruby wrote:
>
> Anne van Kesteren wrote:
>> On Fri, 01 Feb 2008 00:52:14 +0100, Sam Ruby <rubys@us.ibm.com> wrote:
>>> I believe that advice applies here. Spaces in IRI should be an error.
>>
>> You might want to have a look at the work on revising RFC 3987:
>>
>> https://datatracker.ietf.org/drafts/draft-duerst-iri-bis/
>>
>> It introduces a "Legacy Extended IRI" (LEIRI) syntax that allows
>> spaces and various other characters. This syntax is primarily designed
>> for markup languages.
>
> IMHO, that would be unfortunate. As I pointed out, a common error I see
> in feeds is when trying to detect a URI is relative reference (a common
> error in RSS feeds where such usage is ambiguous) is that URI can't be
> parsed as a URI at all. Digging deeper, the problem often is a missing
> close quote (a missing open quote is another common error). I would be
> interested to see if Henri were to dig deeper into the specific errors
> he sees if this is also the case in his data.

http://philip.html5.org/data/spaced-uris.txt shows some offending URIs.

The 6149 values can be grouped in various ways (where each one might
overlap several categories):

936 only have spaces at the very beginning or end of the string.
291 only have spaces after a '#'.
2860 only have spaces after a '?'.
576 start with "mailto:".
73 contain a '<'.
57 contain a '>'.
23 contain a '"'.
78 match / [A-Za-z]+=/.

So it looks like maybe 2-3% are accidentally missing quotes, and the rest
are intentionally using spaces in filenames or query strings or fragment
identifiers.

-- 
Philip Taylor
pjt47@cam.ac.uk
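
For anyone wanting to reproduce a grouping like the one above, here is a rough Python sketch. It is not the script used to produce the counts: the category names, the `categorize` and `tally` helpers, and the exact reading of "only have spaces after" (every space falls after the first occurrence of the marker) are all assumptions made for illustration. Categories may overlap, as in the message.

```python
import re
from collections import Counter

# The pattern quoted in the message: a space, some letters, then '='.
# This tends to catch attribute text swallowed after a missing quote.
ATTR_LIKE = re.compile(r" [A-Za-z]+=")


def spaces_only_after(uri, marker):
    """True if the URI has a space and every space follows the first marker."""
    head, sep, tail = uri.partition(marker)
    return bool(sep) and " " not in head and " " in tail


def categorize(uri):
    """Return the set of (hypothetical) category names that apply to one URI."""
    cats = set()
    stripped = uri.strip(" ")
    # Spaces only at the very beginning or end of the string.
    if stripped != uri and " " not in stripped:
        cats.add("edge-only")
    if spaces_only_after(uri, "#"):
        cats.add("after-#")
    if spaces_only_after(uri, "?"):
        cats.add("after-?")
    if uri.startswith("mailto:"):
        cats.add("mailto")
    for ch in '<>"':
        if ch in uri:
            cats.add("contains-" + ch)
    if ATTR_LIKE.search(uri):
        cats.add("attr-like")
    return cats


def tally(uris):
    """Count how many URIs fall into each category."""
    counts = Counter()
    for uri in uris:
        counts.update(categorize(uri))
    return counts
```

Feeding it the lines of a file such as spaced-uris.txt (one value per line, with only the trailing newline removed so that leading and trailing spaces survive) would give per-category counts. Whether those match the figures above depends on how closely these assumed definitions follow the ones actually used.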
Received on Saturday, 2 February 2008 14:12:03 UTC