URIs parsing in checklink from Dominique Hazael-Massieux on 2008-03-07 (public-qa-dev@w3.org from March 2008)

From: Dominique Hazael-Massieux <dom@w3.org>
Date: Fri, 07 Mar 2008 15:16:22 +0100
To: public-qa-dev <public-qa-dev@w3.org>
Message-Id: <1204899382.26655.116.camel@localhost>

Hi,

The mobileOK checker currently has a pretty crude algorithm for
resolving the URIs it finds when parsing HTML pages and CSS style
sheets: if the URI matches the syntax in the RFC, it proceeds;
otherwise, it reports an error.

This algorithm is pretty crude because many Web pages use URIs with
characters that ought to be escaped according to the RFC but aren't, and
most Web browsers handle these cases just fine.
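
To make the difference concrete, here is a minimal Python sketch of the
two behaviours. It assumes "the RFC" means RFC 3986, and it only checks
characters, not the full grammar; is_strict and repair are hypothetical
names, not functions from the mobileOK checker or from checklink.

```python
import re
from urllib.parse import quote

# Characters RFC 3986 permits in a URI reference: unreserved, reserved
# (gen-delims and sub-delims), plus well-formed percent-escapes.
_ALLOWED = re.compile(r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")

def is_strict(uri: str) -> bool:
    """Crude check: True only if every character is allowed unescaped."""
    return _ALLOWED.fullmatch(uri) is not None

def repair(uri: str) -> str:
    """Lenient path: percent-encode disallowed characters (UTF-8 based).

    Assumption: this approximates what browsers do; a stray "%" that does
    not start a valid escape is left untouched.
    """
    return quote(uri, safe="-._~:/?#[]@!$&'()*+,;=%")
```

With this, "http://example.org/a b.html" fails is_strict but passes once
run through repair, which is roughly the case the crude algorithm
currently rejects outright.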

So, I have a question and a suggestion:
 * the question is: how does the link checker parse URIs? I assume it
needs to do so when making relative URIs absolute, as well as when doing
HEAD/GET requests? How lenient is it with regard to what the RFC allows?
Where does it draw the line between a broken link and a non-broken one
for URIs that don't match the RFC requirements?

 * the suggestion is: maybe the link checker should warn its users about
links that don't match what the RFC requires?
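
As a rough illustration of the suggestion, the Python sketch below
resolves a link against its base and attaches a warning, rather than an
error, when the link contains characters that would need escaping. It
assumes RFC 3986 as the reference and a character-level check only;
resolve_with_warning is a hypothetical helper, not how checklink
actually works.

```python
import re
from urllib.parse import urljoin

# Characters RFC 3986 allows unescaped, plus well-formed percent-escapes.
_RFC3986_CHARS = re.compile(r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")

def resolve_with_warning(base: str, link: str):
    """Return (absolute_uri, warning_or_None) for a link found in a page."""
    warning = None
    if _RFC3986_CHARS.fullmatch(link) is None:
        # Non-conformant, but still resolvable: warn instead of failing.
        warning = f"link {link!r} contains characters that should be escaped"
    # urljoin does not validate characters, so lenient resolution still works.
    return urljoin(base, link), warning
```

Whether such a link then counts as broken would still depend on the
result of the HEAD/GET request, which this sketch leaves out.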

(of course, this probably opens up some dreaded cans of worms about
URIs, IRIs and canonicalization)

Dom

Received on Friday, 7 March 2008 14:16:54 UTC