- From: Sam Ruby <rubys@us.ibm.com>
- Date: Sun, 03 Feb 2008 07:50:04 -0500
- To: Philip Taylor <pjt47@cam.ac.uk>
- CC: Anne van Kesteren <annevk@opera.com>, Henri Sivonen <hsivonen@iki.fi>, HTML WG <public-html@w3.org>
Philip Taylor wrote:
> On 02/02/08 02:15, Sam Ruby wrote:
>>
>> Anne van Kesteren wrote:
>>> On Fri, 01 Feb 2008 00:52:14 +0100, Sam Ruby <rubys@us.ibm.com> wrote:
>>>> I believe that advice applies here. Spaces in IRI should be an error.
>>>
>>> You might want to have a look at the work on revising RFC 3987:
>>>
>>> https://datatracker.ietf.org/drafts/draft-duerst-iri-bis/
>>>
>>> It introduces a "Legacy Extended IRI" (LEIRI) syntax that allows
>>> spaces and various other characters. This syntax is primarily
>>> designed for markup languages.
>>
>> IMHO, that would be unfortunate. As I pointed out, a common error I
>> see in feeds is when trying to detect a URI is relative reference (a
>> common error in RSS feeds where such usage is ambiguous) is that URI
>> can't be parsed as a URI at all. Digging deeper, the problem often is
>> a missing close quote (a missing open quote is another common error).
>> I would be interested to see if Henri were to dig deeper into the
>> specific errors he sees if this is also the case in his data.
>
> http://philip.html5.org/data/spaced-uris.txt shows some offending URIs.
>
> The 6149 values can be grouped in various ways (where each one might
> overlap several categories):
>
> 936 only have spaces at the very beginning or end of the string.
> 291 only have spaces after a '#'.
> 2860 only have spaces after a '?'.
> 576 start with "mailto:".
> 73 contain a '<'.
> 57 contain a '>'.
> 23 contain a '"'.
> 78 match / [A-Za-z]+=/.
>
> So it looks like maybe 2-3% are accidentally missing quotes, and the
> rest are intentionally using spaces in filenames or query strings or
> fragment identifiers.

I would be curious to find out what parser you used to produce these results.
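
For illustration only, the overlapping grouping quoted above could be approximated along the following lines. This is a sketch, not the script that produced the quoted counts: the function name `classify`, the label strings, and the reading of "only have spaces after a '#'" as "the first space comes after the first '#'" are all assumptions.

```python
import re

# Hypothetical pattern for the "/ [A-Za-z]+=/" category quoted above:
# a space followed by what looks like an attribute name and '='.
ATTR_LIKE = re.compile(r" [A-Za-z]+=")


def _first_space_after(uri, marker):
    """True if `marker` occurs and every space comes after its first occurrence."""
    marker_pos = uri.find(marker)
    space_pos = uri.find(" ")
    return marker_pos != -1 and space_pos > marker_pos


def classify(uri):
    """Return the set of (assumed) category labels that apply to one URI value."""
    labels = set()
    # Spaces only at the very beginning or end of the string.
    if " " in uri and " " not in uri.strip(" "):
        labels.add("edge-only")
    if _first_space_after(uri, "#"):
        labels.add("after-#")
    if _first_space_after(uri, "?"):
        labels.add("after-?")
    if uri.startswith("mailto:"):
        labels.add("mailto")
    # Characters that hint at markup leaking into the attribute value.
    for ch in '<>"':
        if ch in uri:
            labels.add("contains-" + ch)
    if ATTR_LIKE.search(uri):
        labels.add("attr-like")
    return labels


if __name__ == "__main__":
    for value in [" http://example.org/", "/cg/avg.dll?p=avg&sql=34: title="]:
        print(repr(value), sorted(classify(value)))
```

Because a value may fall into several categories at once, the function returns a set of labels rather than a single one, which matches the "might overlap several categories" caveat in the quoted message.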

Noting that not a single tag shown contains so much as a title or a class attribute, I picked one and dug in a bit further:

http://www.allmovie.com/cg/avg.dll?p=avg&amp;sql=1:162971
<a href="/cg/avg.dll?p=avg&sql=34: title="/>

Fetching either that page or a page with the &amp; replaced by a single &, I find the following:

<a href="/cg/avg.dll?p=avg&sql=34:" title="New Releases" class="left">

At the present time that tag has matched quotes and no spaces in the URI.

- Sam Ruby
Received on Sunday, 3 February 2008 12:51:34 UTC